{"title": "Parsimonious Bayesian deep networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3190, "page_last": 3200, "abstract": "Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.", "full_text": "Parsimonious Bayesian deep networks\n\nMingyuan Zhou\n\nDepartment of IROM, McCombs School of Business\nThe University of Texas at Austin, Austin, TX 78712\n\nmingyuan.zhou@mccombs.utexas.edu\n\nAbstract\n\nCombining Bayesian nonparametrics and a forward model selection strategy, we\nconstruct parsimonious Bayesian deep networks (PBDNs) that infer capacity-\nregularized network architectures from the data and require neither cross-validation\nnor \ufb01ne-tuning when training the model. One of the two essential components of\na PBDN is the development of a special in\ufb01nite-wide single-hidden-layer neural\nnetwork, whose number of active hidden units can be inferred from the data. 
The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.\n\n1 Introduction\n\nTo separate two linearly separable classes, a simple linear classifier such as logistic regression will often suffice, in which scenario adding the capability to model nonlinearity not only complicates the model and increases computation, but also often harms rather than improves the performance by increasing the risk of overfitting. On the other hand, for two classes not well separated by a single hyperplane, a linear classifier is often inadequate, and hence it is common to use either kernel support vector machines [1, 2] or deep neural networks [3–5] to nonlinearly transform the covariates, making the two classes more linearly separable in the transformed covariate space. While both are able to achieve high classification accuracy, they have clear limitations. For a kernel based classifier, the number of support vectors often increases linearly with the size of the training data [6], making it not only computationally expensive and memory inefficient to train on big data, but also slow in out-of-sample predictions. 
A deep neural network could be scalable with an appropriate network structure, but it is often cumbersome to tune the network depth (number of layers) and the width (number of hidden units) of each hidden layer [5], and a deep neural network, which is often equipped with a larger-than-necessary modeling capacity, runs the risk of overfitting the training data if it is not carefully regularized.\n\nRather than making an uneasy choice in the first place between a linear classifier, which has fast computation and resists overfitting but may not provide sufficient class separation, and an over-capacitized model, which often wastes computation and requires careful regularization to prevent overfitting, we propose a parsimonious Bayesian deep network (PBDN) that builds its capacity regularization into the greedy layer-wise construction and training of the deep network. More specifically, we transform the covariates in a layer-wise manner, with each layer of transformation designed to facilitate class separation via the use of the noisy-OR interactions of multiple weighted linear hyperplanes. Related to kernel support vector machines, the hyperplanes play a similar role to support vectors in transforming the covariate space, but they are inferred from the data and their number increases at a much slower rate with the training data size. Related to deep neural networks, the proposed multi-layer structure gradually increases its modeling capability by increasing its number of layers, but allows inferring from the data both the width of each hidden layer and the depth of the network, to prevent building a model whose capacity is larger than necessary.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nTo obtain a parsimonious deep neural network, one may also consider using a two-step approach that first trains an over-capacitized model and then compresses its size [7–10]. 
The design of PBDN represents a distinct philosophy. Moreover, PBDN does not conflict with model compression, as these post-processing compression techniques [7–10] may be used to further compress a PBDN.\n\nFor capacity regularization of the proposed PBDN, we choose to shrink both its width and depth. To shrink the width of a hidden layer, we propose the use of a gamma process [11], a draw from which consists of countably infinite atoms, each of which is used to represent a hyperplane in the covariate space. The gamma process has an inherent shrinkage mechanism: the number of its atoms whose random weights are larger than a positive constant ε > 0 follows a Poisson distribution, whose mean is finite almost surely (a.s.) and reduces towards zero as ε increases. To shrink the depth of the network, we propose a layer-wise greedy-learning strategy that increases the depth by adding one hidden layer at a time, and uses an appropriate model selection criterion to decide when to stop adding another one. Note that, in work related to ours, Zhou et al. [12, 13] combine the gamma process and greedy layer-wise training to build a Bayesian deep network in an unsupervised manner. Our experiments show the proposed capacity regularization strategy helps successfully build a PBDN, providing state-of-the-art classification accuracy while maintaining low computational complexity for out-of-sample prediction. We have also tried applying a highly optimized off-the-shelf deep neural network based classifier, whose network architecture for a given dataset is set to be the same as that inferred by the PBDN. However, we have found no performance gains, suggesting the efficacy of the PBDN’s greedy training procedure that requires neither cross-validation nor fine-tuning. 
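As a quick numerical illustration of this shrinkage mechanism (our own sketch, not part of the paper), one can simulate the finite truncation r_k ∼ Gamma(γ_0/K, 1/c_0) that the paper later uses to approximate the gamma process, and count how many atom weights exceed a threshold ε; the count stays far below the truncation level K and shrinks as ε grows. The values K = 1000 and γ_0 = c_0 = 1 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncated gamma-process approximation: K atoms with weights
# r_k ~ Gamma(gamma0 / K, 1 / c0); gamma0 and c0 are illustrative values.
K, gamma0, c0 = 1000, 1.0, 1.0
r = rng.gamma(gamma0 / K, 1.0 / c0, size=K)

# The number of atoms whose weight exceeds eps is small relative to K
# and decreases as eps increases, reflecting the shrinkage mechanism.
for eps in [1e-6, 1e-3, 1e-1]:
    print(eps, int(np.sum(r > eps)))
```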
Note PBDN, like a conventional deep neural network, could also be further improved by introducing convolutional layers if the covariates have spatial, temporal, or other types of structures.\n\n2 Layer-width learning via infinite support hyperplane machines\n\nThe first essential component of the proposed capacity regularization strategy is to learn the width of a hidden layer. To fulfill this goal, we define the infinite support hyperplane machine (iSHM), a label-asymmetric classifier, that places in the covariate space countably infinite hyperplanes {β_k}_{1,∞}, where each β_k ∈ R^{V+1} is associated with a weight r_k > 0. We use a gamma process [11] to generate {r_k, β_k}_{1,∞}, making the infinite sum Σ_{k=1}^∞ r_k finite almost surely (a.s.). We measure the proximity of a covariate vector x_i ∈ R^{V+1} to β_k using the softplus function of their inner product, ln(1 + e^{β_k′x_i}), which is a smoothed version of ReLU(β_k′x_i) = max(0, β_k′x_i) that is widely used in deep neural networks [14–17]. We consider that x_i is far from hyperplane k if ln(1 + e^{β_k′x_i}) is close to zero. Thus, as x_i moves away from hyperplane k, that proximity measure monotonically increases on one side of the hyperplane while decreasing on the other. We pass λ_i = Σ_{k=1}^∞ r_k ln(1 + e^{β_k′x_i}), a non-negative weighted combination of these proximity measures, through the Bernoulli–Poisson link [18] f(λ_i) = 1 − e^{−λ_i} to define the conditional class probability as\n\nP(y_i = 1 | {r_k, β_k}_k, x_i) = 1 − ∏_{k=1}^∞ (1 − p_ik), p_ik = 1 − e^{−r_k ln(1+e^{β_k′x_i})}. (1)\n\nNote the model treats the data labeled as “0” and “1” differently, and (1) suggests that in general P(y_i = 1 | {r_k, β_k}_k, x_i) ≠ 1 − P(y_i = 0 | {r_k, −β_k}_k, x_i). We will show that\n\nΣ_i p_ik x_i / Σ_i p_ik (2)\n\ncan be used to represent the kth data subtype discovered by the algorithm.\n\n2.1 Nonparametric Bayesian hierarchical model\n\nOne may readily notice from (1) that the noisy-OR construction, widely used in probabilistic reasoning [19–21], is generalized by iSHM to attribute a binary outcome of y_i = 1 to countably infinite hidden causes p_ik. Denoting ∨ as the logical OR operator, as P(y = 0) = ∏_k P(b_k = 0) if y = ∨_k b_k and E[e^{−θ}] = e^{−r ln(1+e^x)} if θ ∼ Gamma(r, e^x), we have an augmented form of (1) as\n\ny_i = ∨_{k=1}^∞ b_ik, b_ik ∼ Bernoulli(p_ik), p_ik = 1 − e^{−θ_ik}, θ_ik ∼ Gamma(r_k, e^{β_k′x_i}), (3)\n\nwhere b_ik ∼ Bernoulli(p_ik) can be further augmented as b_ik = δ(m_ik ≥ 1), m_ik ∼ Pois(θ_ik), where m_ik ∈ Z, Z := {0, 1, . . .}, and δ(x) equals 1 if the condition x is satisfied and 0 otherwise.\n\nWe now marginalize b_ik out to formally define iSHM. Let G ∼ ΓP(G_0, 1/c) denote a gamma process defined on the product space R_+ × Ω, where R_+ = {x : x > 0}, c ∈ R_+, and G_0 is a finite and continuous base measure over a complete separable metric space Ω. As illustrated in Fig. 2 (b) in the Appendix, given a draw from G, expressed as G = Σ_{k=1}^∞ r_k δ_{β_k}, where β_k is an atom and r_k is its weight, the iSHM generates the label under the Bernoulli–Poisson link [18] as\n\ny_i | G, x_i ∼ Bernoulli(1 − e^{−Σ_{k=1}^∞ r_k ln(1+e^{x_i′β_k})}), (4)\n\nwhich can be represented as a noisy-OR model as in (3) or, as shown in Fig. 2 (a), constructed as\n\ny_i = δ(m_i ≥ 1), m_i = Σ_{k=1}^∞ m_ik, m_ik ∼ Pois(θ_ik), θ_ik ∼ Gamma(r_k, e^{β_k′x_i}). (5)\n\nFrom (3) and (5), it is clear that one may declare hyperplane k as inactive if Σ_i b_ik = Σ_i m_ik = 0.\n\n2.2 Inductive bias and distinction from multilayer perceptron\n\nBelow we reveal the inductive bias of iSHM in prioritizing the fit of the data labeled as “1,” due to the use of the Bernoulli–Poisson link that has previously been applied for network analysis [18, 22, 23] and multi-label learning [24]. As the negative log-likelihood (NLL) for x_i can be expressed as\n\nNLL(x_i) = −y_i ln(1 − e^{−λ_i}) + (1 − y_i)λ_i, λ_i = Σ_{k=1}^∞ r_k ln(1 + e^{x_i′β_k}),\n\nwe have NLL(x_i) = λ_i − ln(e^{λ_i} − 1) if y_i = 1 and NLL(x_i) = λ_i if y_i = 0. As −ln(e^{λ_i} − 1) quickly explodes towards +∞ as λ_i → 0, when y_i = 1, iSHM would adjust r_k and β_k to avoid at all cost overly suppressing x_i (i.e., making λ_i too small). By contrast, it has a high tolerance of failing to sufficiently suppress x_i with y_i = 0. Thus each x_i with y_i = 1 would be made sufficiently close to at least one active support hyperplane. By contrast, while each x_i with y_i = 0 is desired to be far away from any support hyperplanes, violating that is typically not strongly penalized. 
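This asymmetry is easy to see numerically. The sketch below (ours, not the authors’ code) evaluates the two branches of the Bernoulli–Poisson NLL: for y_i = 1 the loss blows up as λ_i → 0, while for y_i = 0 it grows only linearly in λ_i.

```python
import numpy as np

def bp_nll(lam, y):
    """Bernoulli-Poisson negative log-likelihood:
    -y * ln(1 - exp(-lam)) + (1 - y) * lam."""
    return -y * np.log1p(-np.exp(-lam)) + (1 - y) * lam

# As lam -> 0, the y=1 branch explodes while the y=0 branch vanishes.
for lam in [0.01, 0.1, 1.0, 5.0]:
    print(lam, bp_nll(lam, 1), bp_nll(lam, 0))
```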
Therefore, by training a pair of iSHMs under two opposite labeling settings, two sets of support hyperplanes could be inferred to sufficiently cover the covariate space occupied by the training data from both classes.\n\nNote as in (4), iSHM may be viewed as an infinitely wide single-hidden-layer neural network that connects the input layer to the kth hidden unit via the connection weights β_k and the softplus nonlinear activation function ln(1 + e^{β_k′x_i}), and further passes a non-negative weighted combination of these hidden units through the Bernoulli–Poisson link to obtain the conditional class probability. From this point of view, it can be related to a single-hidden-layer multilayer perceptron (MLP) [5, 25] that uses a softplus activation function and the cross-entropy loss, with the output activation expressed as σ[w′ ln(1 + e^{Bx_i})], where σ(x) = 1/(1 + e^{−x}), K is the number of hidden units, B = (β_1, . . . , β_K)′, and w = (w_1, . . . , w_K)′ ∈ R^K. Note minimizing the cross-entropy loss is equivalent to maximizing the likelihood of y_i | w, B, x_i ∼ Bernoulli[(1 + e^{−Σ_{k=1}^K w_k ln(1+e^{x_i′β_k})})^{−1}], which is biased towards fitting neither the data with y_i = 1 nor those with y_i = 0, since\n\nNLL(x_i) = ln(e^{−y_i w′ ln(1+e^{Bx_i})} + e^{(1−y_i) w′ ln(1+e^{Bx_i})}).\n\nTherefore, while iSHM is structurally similar to an MLP, it is distinct in its unbounded layer width, its positivity constraint on the weights r_k connecting the hidden and output layers, its ability to rigorously define whether a hyperplane is active or inactive, and its inductive bias towards fitting the data labeled as “1.” As in practice labeling which class as “1” may be arbitrary, we predict the class label with (1 − e^{−Σ_{k=1}^∞ r_k ln(1+e^{x_i′β_k})} + e^{−Σ_{k=1}^∞ r_k* ln(1+e^{x_i′β_k*})})/2, where {r_k, β_k}_{1,∞} and {r_k*, β_k*}_{1,∞} are from a pair of iSHMs trained by labeling the data belonging to this class as “1” and “0,” respectively.\n\n2.3 Convex polytope geometric constraint\n\nIt is straightforward to show that iSHM with a single unit-weighted hyperplane reduces to logistic regression y_i ∼ Bernoulli[1/(1 + e^{−x_i′β})]. To interpret the role of each individual support hyperplane when multiple non-negligibly weighted ones are inferred by iSHM, we analogize each β_k to an expert on a committee that collectively makes binary decisions. For expert (hyperplane) k, the weight r_k indicates how strongly its opinion is weighted by the committee, b_ik = 0 represents that it votes “No,” and b_ik = 1 represents that it votes “Yes.” Since y_i = ∨_{k=1}^∞ b_ik, the committee would vote “No” if and only if all its experts vote “No” (i.e., all b_ik are zeros); in other words, the committee would vote “Yes” even if only a single expert votes “Yes.” Let us now examine the confined covariate space that satisfies the inequality P(y_i = 1 | x_i) ≤ p_0, where a data point is labeled as “1” with a probability no greater than p_0. The following theorem shows that this inequality defines a confined space bounded by a convex polytope, namely the intersection of the countably infinite half-spaces defined by p_ik < p_0.\n\nTheorem 1 (Convex polytope). For iSHM, the confined space specified by the inequality\n\nP(y_i = 1 | {r_k, β_k}_k, x_i) ≤ p_0 (6)\n\nis bounded by a convex polytope defined by the set of solutions to countably infinite inequalities as\n\nx_i′β_k ≤ ln[(1 − p_0)^{−1/r_k} − 1], k ∈ {1, 2, . . .}. (7)\n\nThe convex polytope defined in (7) is enclosed by the intersection of countably infinite half-spaces. If we set p_0 = 0.5 as the probability threshold to make binary decisions, then the convex polytope assigns a label of y_i = 0 to an x_i inside the convex polytope (i.e., an x_i that satisfies all the inequalities in Eq. 7) with a relatively high probability, and assigns a label of y_i = 1 to an x_i outside the convex polytope (i.e., an x_i that violates at least one of the inequalities in Eq. 7) with a probability of at least 50%. Note that hyperplane k with r_k → 0 has a negligible impact on the conditional class probability. 
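Theorem 1 can be sanity-checked with a small simulation (a sketch under toy, randomly drawn parameters, not the paper’s code): every point whose predicted class-1 probability is at most p_0 must satisfy all of the half-space inequalities in (7), since P(y_i = 1 | x_i) ≥ p_ik for every k.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, p0 = 3, 2, 0.5
r = rng.gamma(1.0, 1.0, size=K)   # toy hyperplane weights r_k > 0
B = rng.normal(size=(K, V))       # toy hyperplanes beta_k (bias term omitted)

def p_y1(x):
    # P(y = 1 | x) = 1 - exp(-sum_k r_k * softplus(beta_k' x))
    return 1.0 - np.exp(-np.sum(r * np.logaddexp(0.0, B @ x)))

def in_polytope(x, tol=1e-9):
    # Inequalities (7): beta_k' x <= ln((1 - p0)^(-1/r_k) - 1) for all k.
    bounds = np.log((1.0 - p0) ** (-1.0 / r) - 1.0)
    return bool(np.all(B @ x <= bounds + tol))

# Count points with P(y=1|x) <= p0 that fall outside the polytope.
violations = 0
for _ in range(1000):
    x = rng.normal(scale=3.0, size=V)
    if p_y1(x) <= p0 and not in_polytope(x):
        violations += 1
print(violations)  # 0: the low-probability region lies inside the polytope
```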
Choosing the gamma process as the nonparametric Bayesian prior sidesteps the need to tune the number of experts. It shrinks the weights of all unnecessary experts, allowing a finite number of non-negligibly weighted ones (support hyperplanes) to be automatically inferred from the data. We provide in Appendix B the connections to previously proposed multi-hyperplane models [26–30].\n\n2.4 Gibbs sampling and MAP inference via SGD\n\nFor the convenience of implementation, we truncate the gamma process with a finite and discrete base measure as G_0 = Σ_{k=1}^K (γ_0/K) δ_{β_k}, where K will be set sufficiently large to approximate the truly countably infinite model. We express iSHM using (5) together with\n\nr_k ∼ Gamma(γ_0/K, 1/c_0), γ_0 ∼ Gamma(a_0, 1/b_0), c_0 ∼ Gamma(e_0, 1/f_0),\nβ_k ∼ ∏_{v=0}^V ∫ N(β_vk; 0, α_vk^{−1}) Gamma(α_vk; a_β, 1/b_βk) dα_vk, b_βk ∼ Gamma(e_0, 1/f_0),\n\nwhere the normal-gamma construction promotes sparsity on the connection weights β_k [31]. We describe both Gibbs sampling, desirable for uncertainty quantification, and maximum a posteriori (MAP) inference, suitable for large-scale training, in Algorithm 1. We use data augmentation and marginalization to derive Gibbs sampling, with the details deferred to Appendix B. For MAP inference, we use Adam [32] in Tensorflow to minimize a stochastic objective function f({β_k, ln r_k}_1^K, {y_i, x_i}_{i_1}^{i_M}) + f({β_k*, ln r_k*}_1^{K*}, {y_i*, x_i}_{i_1}^{i_M}), which embeds the hierarchical Bayesian model’s inductive bias and inherent shrinkage mechanism into optimization, where M is the size of a randomly selected mini-batch, y_i* := 1 − y_i, λ_i := Σ_{k=1}^K e^{ln r_k} ln(1 + e^{x_i′β_k}), and\n\nf({β_k, ln r_k}_1^K, {y_i, x_i}_{i_1}^{i_M}) = Σ_{k=1}^K (−(γ_0/K) ln r_k + c_0 e^{ln r_k}) + (a_β + 1/2) Σ_{k=1}^K Σ_{v=0}^V ln(1 + β_vk²/(2b_βk)) + (N/M) Σ_{i=i_1}^{i_M} [−y_i ln(1 − e^{−λ_i}) + (1 − y_i)λ_i]. (8)\n\n3 Network-depth learning via forward model selection\n\nThe second essential component of the proposed capacity regularization strategy is to find a way to increase the network depth and determine how deep is deep enough. Our solution is to sequentially stack a pair of iSHMs on top of the previously trained one, and to develop a forward model selection criterion that decides when to stop stacking another pair. We refer to the resulting model as the parsimonious Bayesian deep network (PBDN), as described below in detail.\n\nThe noisy-OR hyperplane interactions allow iSHM to go beyond simple linear separation, but with limited capacity due to the convex-polytope constraint imposed on the decision boundary. 
On the other hand, it is the convex-polytope constraint that provides an implicit regularization, determining how many non-negligibly weighted support hyperplanes are necessary in the covariate space to sufficiently activate all data of class “1,” while somewhat suppressing the data of class “0.” In this paper, we find that the model capacity can be quickly enhanced by sequentially stacking such convex-polytope constraints under a feedforward deep structure, while preserving the virtue of being able to learn the number of support hyperplanes in the (transformed) covariate space.\n\nMore specifically, as shown in Fig. 2 (c) of the Appendix, we first train a pair of iSHMs that regress the current labels y_i ∈ {0, 1} and the flipped ones y_i* = 1 − y_i, respectively, on the original covariates x_i ∈ R^V. After obtaining K_2 support hyperplanes {β_k^(1→2)}_{1,K_2}, constituted by the active support hyperplanes inferred by both the iSHM trained with y_i and the one trained with y_i*, we use ln(1 + e^{x_i′β_k^(1→2)}) as the hidden units of the second layer (first hidden layer). More precisely, with t ∈ {1, 2, . . .}, K_0 := 0, K_1 := V, x̃_i^(0) := ∅, and x̃_i^(1) := x_i, denoting x_i^(t) := [1, (x̃_i^(t−1))′, (x̃_i^(t))′]′ ∈ R^{K_{t−1}+K_t+1} as the input data vector to layer t + 1, the tth added pair of iSHMs transforms x_i^(t) into the hidden units of layer t + 1, expressed as\n\nx̃_i^(t+1) = [ln(1 + e^{(x_i^(t))′β_1^(t→t+1)}), . . . , ln(1 + e^{(x_i^(t))′β_{K_{t+1}}^(t→t+1)})]′.\n\nHence the input vectors used to train the next layer would be x_i^(t+1) = [1, (x̃_i^(t))′, (x̃_i^(t+1))′]′ ∈ R^{K_t+K_{t+1}+1}. Therefore, if the computational cost of a single inner product x_i′β_k (e.g., logistic regression) is one, then that for T hidden layers would be about Σ_{t=1}^T (K_{t−1} + K_t + 1)K_{t+1}/(V + 1). Note one may also use x_i^(t+1) = x̃_i^(t+1), or x_i^(t+1) = [(x_i^(t))′, (x̃_i^(t+1))′]′, or other related concatenation methods to construct the covariates to train the next layer.\n\nOur intuition for why PBDN, constructed in this greedy layer-wise manner, works well is that for two iSHMs trained on the same covariate space under two opposite labeling settings, one iSHM places enough hyperplanes to define the complement of a convex polytope to sufficiently activate all data labeled as “1,” while the other does so for all data labeled as “0.” Thus, for any x_i, at least one p_ik would be sufficiently activated; in other words, x_i would be sufficiently close to at least one of the active hyperplanes of the iSHM pair. This mechanism prevents any x_i from being completely suppressed after transformation. Consequently, these transformed covariates ln(1 + e^{β_k′x_i}), which can also be concatenated with x_i, will be further used to train another iSHM pair. 
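A minimal sketch of this covariate propagation (with a placeholder layer width and random matrices standing in for trained iSHM hyperplanes) is:

```python
import numpy as np

def softplus(z):
    # ln(1 + e^z), computed stably
    return np.logaddexp(0.0, z)

rng = np.random.default_rng(0)
V = 2
x = rng.normal(size=V)            # a single covariate vector x_i

x_prev = np.empty(0)              # x~^(0) := empty
x_curr = x                        # x~^(1) := x_i

for t in range(1, 4):             # stack three iSHM pairs
    K_next = 4                    # placeholder for the inferred width K_{t+1}
    x_in = np.concatenate(([1.0], x_prev, x_curr))  # x^(t) = [1, x~^(t-1), x~^(t)]
    B = rng.normal(size=(K_next, x_in.size))        # stands in for beta^(t->t+1)
    x_prev, x_curr = x_curr, softplus(B @ x_in)     # x~^(t+1)

print(x_curr.shape)  # (4,)
```

Note that softplus outputs are strictly positive, so no covariate vector is ever completely suppressed after transformation, which is the property the construction above relies on.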
Thus even though a single iSHM pair may not be powerful enough, by keeping all covariate vectors sufficiently activated after transformation, iSHM pairs can simply be stacked sequentially to gradually enhance the model capacity, with a strong resistance to overfitting and hence without the necessity of cross-validation.\n\nWhile stacking an additional iSHM pair on PBDN can enhance the model capacity, once the network with T hidden layers is sufficiently deep that the two classes become well separated, there is no more need to add an extra iSHM pair. To detect when it is appropriate to stop adding another iSHM pair, as shown in Algorithm 2 of the Appendix, we consider a forward model selection strategy that sequentially stacks one iSHM pair after another, until the following criterion starts to rise:\n\nAIC(T) = Σ_{t=1}^T [2(K_t + 1)K_{t+1}] + 2K_{T+1} − 2 Σ_i [ln P(y_i | x_i^(T)) + ln P(y_i* | x_i^(T))], (9)\n\nwhere 2(K_t + 1)K_{t+1} represents the cost of adding the tth hidden layer and 2K_{T+1} represents the cost of using K_{T+1} nonnegative weights {r_k^(T+1)}_k to connect the Tth hidden layer and the output layer. With (9), we choose the PBDN with T hidden layers if AIC(t + 1) ≤ AIC(t) for t = 1, . . . , T − 1 and AIC(T + 1) > AIC(T). 
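The stopping rule can be sketched as follows, with `param_cost` standing in for the complexity terms of (9) and the per-depth log-likelihoods assumed to come from already-trained iSHM pairs (the numbers below are toy values, not results from the paper):

```python
def aic(param_cost, loglik):
    """Generic AIC as in (9): 2 * (parameter cost) - 2 * log-likelihood."""
    return 2.0 * param_cost - 2.0 * loglik

def stop_depth(costs, logliks):
    """Forward model selection: keep adding hidden layers until
    AIC(T + 1) > AIC(T), then return the selected depth T."""
    for T in range(len(costs) - 1):
        if aic(costs[T + 1], logliks[T + 1]) > aic(costs[T], logliks[T]):
            return T + 1          # depths are 1-indexed
    return len(costs)             # no rise detected within the budget

# Toy example: the third pair's likelihood gain no longer pays its cost.
print(stop_depth([10, 30, 60], [-100.0, -70.0, -68.0]))  # 2
```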
We also consider another model selection criterion that accounts for the sparsity of β_k^(t→t+1), the connection weights between adjacent layers, using\n\nAIC_ε(T) = Σ_{t=1}^T 2(‖|B_t| > εβ_{t max}‖_0 + ‖|B_t*| > εβ*_{t max}‖_0) + 2K_{T+1} − 2 Σ_i [ln P(y_i | x_i^(T)) + ln P(y_i* | x_i^(T))], (10)\n\nwhere ‖B‖_0 is the number of nonzero elements in matrix B, ε > 0 is a small constant, and B_t and B_t* consist of the β_k^(t→t+1) trained by the first and second iSHMs of the tth iSHM pair, respectively, with β_{t max} and β*_{t max} as their respective maximum absolute values.\n\nNote if the covariates have any spatial or temporal structures to exploit, one may replace each x_i′β_k in iSHM with [CNN_φ(x_i)]′β_k, where CNN_φ(·) represents a convolutional neural network parameterized by φ, to construct a CNN-iSHM, which can then be further greedily grown into a CNN-PBDN. Customizing PBDNs for structured covariates, such as image pixels and audio measurements, is a promising research topic but is beyond the scope of this paper.\n\n4 Illustrations and experimental results\n\nCode for reproducible research is available at https://github.com/mingyuanzhou/PBDN. To illustrate the imposed geometric constraint and inductive bias of a single iSHM, we first consider a challenging 2-D “two spirals” dataset, as shown in Fig. 1, whose two classes are not fully separable by a convex polytope. We train 10 pairs of iSHMs one pair after another, which are organized into a ten-hidden-layer PBDN, whose numbers of hidden units from the 1st to 10th hidden layers (i.e., numbers of support hyperplanes of the 1st to 10th iSHM pairs) are inferred to be 8, 14, 15, 11, 19, 22, 23, 18, 19, and 29, respectively. Both AIC and AIC_{ε=0.01} infer the depth as T = 4.\n\nFigure 1: Visualization of PBDN, each layer of which is a pair of iSHMs trained on a “two spirals” dataset under two opposite labeling settings. Subfigures: 1) a pair of infinite support hyperplane machines (iSHMs); 2) PBDN with 2 pairs of iSHMs; 3) PBDN with 4 pairs of iSHMs; 4) PBDN with 10 pairs of iSHMs. For Subfigure 1), (a) shows the weights r_k of the inferred active support hyperplanes, ordered by their values, and (d) shows the trace plot of the average per-data-point log-likelihood. For the iSHM trained by labeling the red (blue) points as ones (zeros), (b) shows a contour map, the value of each point of which represents how many of the inequalities specified in (7) are violated, and whose region with zero values corresponds to the convex polytope enclosed by the intersection of the hyperplanes defined in (7), and (c) shows the contour map of predicted class probabilities. (e-f) are analogous plots to (b-c) for the iSHM trained with the blue points labeled as ones. The inferred data subtypes as in (2) are represented as red circles and blue pentagrams in subplots (b) and (d). Subfigures 2)-4) are analogous to 1), with two main differences: (i) the transformed covariates to train the newly added iSHM pair are obtained by propagating the original 2-D covariates through the previously trained iSHM pairs, and (ii) the contour maps in subplots (b) and (e) visualize the iSHM linear hyperplanes in the transformed space by projecting them back to the original 2-D covariate space.\n\nFor Fig. 1, we first train an iSHM by labeling the red points as “1” and the blue as “0,” whose results are shown in subplots 1) (b-c), and another iSHM under the opposite labeling setting, whose results are shown in subplots 1) (e-f). 
It is evident that when labeling the red points as “1,” as shown in 1) (a-c), iSHM infers a convex polytope, intersected by three active support hyperplanes, to enclose the blue points, but at the same time allows the red points within that convex polytope to pass through with appreciable activations. When labeling the blue points as “1,” iSHM infers five active support hyperplanes, two of which are visible in the covariate space shown in 1) (e), to enclose the red points, but at the same time allows the blue points within that convex polytope to pass through with appreciable activations, as shown in 1) (f). Being capable only of using a convex polytope to enclose the data labeled as “0” is a restriction of iSHM, but it is also why the iSHM pair can appropriately place two parsimonious sets of active support hyperplanes in the covariate space, ensuring that the maximum distance of any data point to these support hyperplanes is sufficiently small.\n\nSecond, we concatenate the 8-D transformed and 2-D original covariates into 10-D covariates, which are further augmented with a constant bias term of one, to train the second pair of iSHMs. As in subplot 2) (a), five active support hyperplanes are inferred in the 10-D covariate space when labeling the blue points as “1,” which can be matched to five nonzero smooth segments of the original 2-D space shown in 2) (e), which align well with the highly activated regions in 2) (f). Again, with the inductive bias, as in 2) (f), all positively labeled data points are sufficiently activated at the expense of allowing some negative ones to be only moderately suppressed. 
Nine support hyperplanes are inferred when labeling the red points as “1,” and similar activation behaviors can also be observed in 2) (b-c).\n\nTable 1: Visualization of the subtypes inferred by PBDN in a random trial and comparison of classification error rates over five random trials between PBDN and a two-hidden-layer DNN (128-64) on four different MNIST binary classification tasks. Subfigures (a)-(h) show the inferred subtypes of 3 in 3 vs 5, 3 in 3 vs 8, 4 in 4 vs 7, 4 in 4 vs 9, 5 in 3 vs 5, 8 in 3 vs 8, 7 in 4 vs 7, and 9 in 4 vs 9, respectively.\n\nTask: 3 vs 5 | 3 vs 8 | 4 vs 7 | 4 vs 9\nPBDN: 2.53% ± 0.22% | 2.66% ± 0.27% | 1.37% ± 0.18% | 2.95% ± 0.47%\nDNN: 2.78% ± 0.36% | 2.93% ± 0.40% | 1.21% ± 0.12% | 2.98% ± 0.17%\n\nWe continue the same greedy layer-wise training strategy to add another eight iSHM pairs. From Fig. 1, Subfigures 3)-4), it becomes more and more clear that a support hyperplane, inferred in the transformed covariate space of a deep hidden layer, can be used to represent the boundary of a complex but smooth segment of the original covariate space that well encloses all or a subset of the data labeled as “1.”\n\nWe apply PBDN to four different MNIST binary classification tasks and compare its performance with DNN (128-64), a two-hidden-layer deep neural network that will be described in detail below. As in Tab. 1, both AIC and AIC_{ε=0.01} infer the depth as T = 1 for PBDN, and infer for each class only a few active hyperplanes, each of which represents a distinct data subtype, as calculated with (2). In a random trial, the inferred networks of PBDN for all four tasks have only a single hidden layer with at most 6 active hidden units. 
Thus its testing computation is much lower than that of DNN (128-64), while providing an overall lower testing error rate (both trained with 4000 mini-batches of size 100).
Below we provide a comprehensive comparison on eight widely used benchmark datasets between the proposed PBDNs and a variety of algorithms, including logistic regression, Gaussian radial basis function (RBF) kernel support vector machine (SVM), relevance vector machine (RVM) [31], adaptive multi-hyperplane machine (AMM) [27], convex polytope machine (CPM) [30], and the deep neural network (DNN) classifier (DNNClassifier) provided in TensorFlow [33]. Except for logistic regression, which is a linear classifier, both kernel SVM and RVM are widely used nonlinear classifiers relying on the kernel trick, both AMM and CPM intersect multiple hyperplanes to construct their decision boundaries, and DNN handles nonlinearity with a multilayer feedforward network, whose structure often needs to be tuned to achieve a good balance between data fitting and model complexity. We consider DNN (8-4), a two-hidden-layer DNN that uses 8 and 4 hidden units for its first and second hidden layers, respectively, DNN (32-16), and DNN (128-64). In the Appendix, we summarize in Tab. 4 the information of the eight benchmark datasets: banana, breast cancer, titanic, waveform, german, image, ijcnn1, and a9a. For a fair comparison, to ensure the same training/testing partitions for all algorithms across all datasets, we report the results obtained with either widely used open-source software packages or the code made publicly available by the original authors. We describe in the Appendix the settings of all competing algorithms.
For all datasets, we follow Algorithm 1 to first train a single-hidden-layer PBDN (PBDN1), i.e., a pair of iSHMs fitted under two opposite labeling settings.
We then follow Algorithm 2 to train another pair of iSHMs to construct a two-hidden-layer PBDN (PBDN2), and repeat the same procedure to train PBDN3 and PBDN4. Note that we observe PBDN's log-likelihood to increase rapidly during the first few hundred MCMC/SGD iterations, and then keep increasing at a slower pace before eventually fluctuating. However, it often takes more iterations to shrink the weights of unneeded hyperplanes towards deactivation. Thus, although insufficient iterations may not necessarily degrade the final out-of-sample prediction accuracy, they may lead to a less compact network and hence a higher computational cost for out-of-sample prediction. For each iSHM, we set the upper bound on the number of support hyperplanes as Kmax = 20. For Gibbs sampling, we run 5000 iterations and record the {r_k, β_k} with the highest likelihood during the last 2500 iterations; for MAP, we process 4000 mini-batches of size M = 100, with 0.05/(4 + T) as the Adam learning rate for the Tth added iSHM pair. We use the inferred {r_k, β_k} to either produce out-of-sample predictions or generate transformed covariates for the next layer. We set a0 = b0 = 0.01, e0 = f0 = 1, and a_β = 10^-6 for Gibbs sampling. We fix γ0 = c0 = 1 and a_β = b_βk = 10^-6 for MAP inference. As in Algorithm 1, we prune inactive support hyperplanes once every 200 MCMC or 500 SGD iterations to facilitate computation.

Table 2: Comparison of classification error rates between a variety of algorithms and the proposed PBDNs with 1, 2, or 4 hidden layers, and PBDN-AIC and PBDN-AICε=0.01 trained with Gibbs sampling or SGD.
Displayed in the last two rows of each column are the average of the error rates and the average of the computational complexities of an algorithm, both normalized by those of kernel SVM.

[Table 2 body: for each of banana, breast cancer, titanic, waveform, german, image, ijcnn1, and a9a, the mean ± standard deviation of the classification error rate (%) of LR, SVM, RVM, AMM, CPM, DNN (8-4), DNN (32-16), DNN (128-64), PBDN1, PBDN2, PBDN4, AIC-Gibbs, AICε-Gibbs, AIC-SGD, and AICε-SGD, followed by the mean of SVM-normalized errors and the mean of SVM-normalized numbers of support hyperplanes/vectors K.]

Table 3: The inferred depth of PBDN that increases its depth
until a model selection criterion starts to rise.

[Table 3 data (mean ± standard deviation of the inferred depth): e.g., waveform: AIC-Gibbs 1.90 ± 0.74, AICε=0.01-Gibbs 2.00 ± 0.67, AIC-SGD 2.40 ± 0.52, AICε=0.01-SGD 1.50 ± 0.53; breast cancer: 1.00 ± 0.00, 1.00 ± 0.00, 1.90 ± 0.99, and 1.00 ± 0.00; the remaining columns cover banana, titanic, german, image, ijcnn1, and a9a.]

We record the out-of-sample-prediction errors and computational complexities of the various algorithms over these eight benchmark datasets in Tab. 2 and Tab. 5 of the Appendix, respectively, and summarize in Tab. 2 the means of SVM-normalized errors and numbers of support hyperplanes/vectors. Overall, PBDN using AICε in (10) with ε = 0.01 to determine the depth, referred to as PBDN-AICε=0.01, has the highest out-of-sample prediction accuracy, followed by PBDN4, the RBF kernel SVM, PBDN using AIC in (9) to determine the depth, referred to as PBDN-AIC, PBDN2, PBDN-AICε=0.01 solved with SGD, DNN (128-64), PBDN-AIC solved with SGD, and DNN (32-16).
Overall, logistic regression does not perform well, which is not surprising, as it is a linear classifier that uses a single hyperplane to partition the covariate space into two halves to separate one class from the other. As shown in Tab. 2, for breast cancer, titanic, german, and a9a, all classifiers have comparable classification errors, suggesting minor or no advantages of using a nonlinear classifier on them.
By contrast, for banana, waveform, image, and ijcnn1, all nonlinear classifiers clearly outperform logistic regression. Note that PBDN1, which clearly reduces the classification errors of logistic regression, performs similarly to both AMM and CPM. These results are not surprising, as CPM, closely related to AMM, uses a convex polytope, defined as the intersection of multiple hyperplanes, to enclose one class, whereas the classification decision boundaries of PBDN1 can be bounded within a convex polytope that encloses the negative examples. Note that the number of hyperplanes is automatically inferred from the data by PBDN1, thanks to the inherent shrinkage mechanism of the gamma process, whereas those of AMM and CPM are both selected via cross-validation. While PBDN1 can partially remedy their sensitivity to how the data are labeled by combining the results obtained under two opposite labeling settings, the decision boundaries of the two iSHMs and those of both AMM and CPM are still restricted to a confined space related to a single convex polytope, which may explain why on banana, image, and ijcnn1 they all clearly underperform a PBDN with more than one hidden layer.
As shown in Tab. 2, DNN (8-4) clearly underperforms DNN (32-16) in terms of classification accuracy on both image and ijcnn1, indicating that having 8 and 4 hidden units for the first and second hidden layers, respectively, is far from enough for DNN to provide a sufficiently high nonlinear modeling capacity for these two datasets. Note that the equivalent number of hyperplanes for DNN (K1, K2), a two-hidden-layer DNN with K1 and K2 hidden units in the first and second hidden layers, respectively, is computed as [(V + 1)K1 + K1K2]/(V + 1), where V is the covariate dimension. Thus the computational complexity quickly increases as the network size increases.
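As a quick check of this formula, the snippet below evaluates [(V + 1)K1 + K1K2]/(V + 1) for the three DNN configurations; the covariate dimension V = 20 is a hypothetical placeholder, not one of the benchmark values:

```python
def dnn_equiv_hyperplanes(V, K1, K2):
    # Equivalent number of hyperplanes of a two-hidden-layer DNN (K1, K2)
    # on V-dimensional covariates (bias-augmented to V + 1 dimensions):
    # [(V + 1) * K1 + K1 * K2] / (V + 1)
    return ((V + 1) * K1 + K1 * K2) / (V + 1)

# Hypothetical covariate dimension V = 20:
for k1, k2 in [(8, 4), (32, 16), (128, 64)]:
    print((k1, k2), round(dnn_equiv_hyperplanes(20, k1, k2), 1))
```

At V = 20 this gives roughly 9.5, 56.4, and 518.1 equivalent hyperplanes, illustrating how quickly the out-of-sample-prediction cost grows with the layer widths.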
For example, DNN (8-4) is comparable to PBDN1 and PBDN-AIC in terms of out-of-sample-prediction computational complexity, as shown in Tabs. 2 and 5, but it clearly underperforms all of them in terms of classification accuracy, as shown in Tab. 2. While DNN (128-64) performs well in terms of classification accuracy, as shown in Tab. 2, its out-of-sample-prediction computational complexity becomes clearly higher than that of the other algorithms with comparable or better accuracy, such as RVM and PBDN, as shown in Tab. 5. In practice, however, the search space for a DNN with two or more hidden layers is enormous, making it difficult to determine a network that is neither too large nor too small and that achieves a good compromise between fitting the data well and having low complexity for both training and out-of-sample prediction. E.g., while DNN (128-64) could further improve on the performance of DNN (32-16) on these two datasets, it uses a much larger network and has a clearly higher computational complexity for out-of-sample prediction.
We show the number of active support hyperplanes inferred by PBDN in a single random trial in Figs. 3-6. For PBDN, the computation in both training and out-of-sample prediction also increases in T, the network depth. It is clear from Tab. 2 that increasing T from 1 to 2 generally leads to the most significant improvement when there is a clear advantage to increasing T, and that once T is sufficiently large, further increasing T leads to small performance fluctuations but does not appear to cause clear overfitting. As shown in Tab. 3, the use of the AIC based greedy model selection criterion eliminates the need to tune the depth T, allowing it to be inferred from the data. Note that we have tried stacking CPMs in the same way that we stack iSHMs, but found that the accuracy often quickly deteriorates rather than improving.
E.g., for CPMs with 2, 3, or 4 layers, the error rates become (0.131, 0.177, 0.223) on waveform and (0.046, 0.080, 0.216) on image. The reason could be that CPM infers redundant unweighted hyperplanes that lead to strong multicollinearity among the covariates of the deep layers.
Note that on each dataset, we have also tried training a DNN with the same network architecture as that inferred by a PBDN. While such a DNN jointly trains all its hidden layers, it provides no performance gain over the corresponding PBDN. More specifically, the DNNs using the network architectures inferred by PBDNs with AIC-Gibbs, AICε=0.01-Gibbs, AIC-SGD, and AICε=0.01-SGD have means of SVM-normalized errors of 1.047, 1.011, 1.076, and 1.144, respectively. These observations suggest the efficacy of the greedy layer-wise training strategy of the PBDN, which requires no cross-validation.
For out-of-sample prediction, the computation of a classification algorithm generally increases linearly in the number of support hyperplanes/vectors. Using logistic regression with a single hyperplane for reference, we summarize the computational complexity in Tab. 2, which indicates that, in comparison to SVM, which consistently requires the largest number of support vectors, PBDN often requires significantly less time for predicting the class label of a new data sample. For example, for out-of-sample prediction on the image dataset, as shown in Tab.
5, on average SVM uses about 212 support vectors, whereas on average PBDNs with one to five hidden layers use about 13, 16, 29, 50, and 64 hyperplanes, respectively, and PBDN-AIC uses about 22 hyperplanes, showing that in comparison to kernel SVM, PBDN can be much more computationally efficient in making out-of-sample predictions.

5 Conclusions

The infinite support hyperplane machine (iSHM), which combines countably infinite non-negatively weighted hyperplanes via a noisy-OR mechanism, is employed as the building unit to greedily construct a capacity-regularized parsimonious Bayesian deep network (PBDN). iSHM has an inductive bias towards fitting the positively labeled data, and employs the gamma process to infer a parsimonious set of active hyperplanes that enclose the negatively labeled data within a convex-polytope bounded space. Due to the inductive bias and label asymmetry, iSHMs are trained in pairs to ensure a sufficient coverage of the covariate space occupied by the data from both classes. The sequentially trained iSHM pairs can be stacked into a PBDN, a feedforward deep network that gradually enhances its modeling capacity as the network depth increases, achieving high accuracy while maintaining low computational complexity for out-of-sample prediction. PBDN can be trained using either Gibbs sampling, which is suitable for quantifying posterior uncertainty, or SGD based MAP inference, which is scalable to big data. One may potentially construct PBDNs for regression analysis of count, categorical, and continuous response variables by following the same three-step strategy: constructing a nonparametric Bayesian model that infers the number of components for the task of interest, greedily adding layers one at a time, and using a forward model selection criterion to decide how deep is deep enough.
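The third step, the forward model selection criterion, amounts to a simple stopping rule: grow the network one layer at a time and stop as soon as the criterion rises. Below is a minimal runnable sketch, with hypothetical per-depth parameter counts and log-likelihoods standing in for trained iSHM pairs and AIC = 2k − 2 ln L:

```python
def aic(num_params, loglik):
    # Akaike information criterion: 2k - 2 ln L
    return 2.0 * num_params - 2.0 * loglik

def select_depth(candidates):
    # Greedy forward model selection: candidates[t] holds the
    # (num_params, loglik) of the network with t + 1 hidden layers;
    # stop adding layers as soon as the criterion starts to rise.
    best_depth, best_aic = 0, float("inf")
    for depth, (k, loglik) in enumerate(candidates, start=1):
        score = aic(k, loglik)
        if score >= best_aic:
            break
        best_depth, best_aic = depth, score
    return best_depth

# Hypothetical values: the second layer clearly helps, the third does not
# pay for its extra parameters, so the inferred depth is T = 2.
depth = select_depth([(30, -520.0), (55, -470.0), (80, -465.0)])
print(depth)  # -> 2
```

The experiments also use a variant of this criterion, given in (10); the plain AIC in (9) is its simplest instance.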
For the first step, the recently proposed Lomax distribution based racing framework [34] could be a promising candidate for both categorical and non-negative response variables, and Dirichlet process mixtures of generalized linear models [35] could be promising candidates for continuous response variables and many other types of variables via appropriate link functions.

Acknowledgments

M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation, the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research, and the computational support of the Texas Advanced Computing Center.

References
[1] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[2] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[3] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[6] I. Steinwart, "Sparseness of support vector machines," J. Mach. Learn. Res., vol. 4, pp. 1071-1105, 2003.
[7] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in KDD, pp. 535-541, ACM, 2006.
[8] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
[9] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[10] S. Han, H. Mao, and W. J.
Dally, \u201cDeep compression: Compressing deep neural networks with pruning,\n\ntrained quantization and huffman coding,\u201d arXiv preprint arXiv:1510.00149, 2015.\n\n[11] T. S. Ferguson, \u201cA Bayesian analysis of some nonparametric problems,\u201d Ann. Statist., vol. 1, no. 2,\n\npp. 209\u2013230, 1973.\n\n[12] M. Zhou, Y. Cong, and B. Chen, \u201cThe Poisson gamma belief network,\u201d in NIPS, 2015.\n[13] M. Zhou, Y. Cong, and B. Chen, \u201cAugmentable gamma belief networks,\u201d J. Mach. Learn. Res., vol. 17,\n\nno. 163, pp. 1\u201344, 2016.\n\n[14] V. Nair and G. E. Hinton, \u201cRecti\ufb01ed linear units improve restricted Boltzmann machines,\u201d in ICML,\n\npp. 807\u2013814, 2010.\n\n[15] X. Glorot, A. Bordes, and Y. Bengio, \u201cDeep sparse recti\ufb01er neural networks,\u201d in AISTATS, pp. 315\u2013323,\n\n2011.\n\n[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classi\ufb01cation with deep convolutional neural\n\nnetworks,\u201d in NIPS, pp. 1097\u20131105, 2012.\n\n[17] W. Shang, K. Sohn, D. Almeida, and H. Lee, \u201cUnderstanding and improving convolutional neural networks\n\nvia concatenated recti\ufb01ed linear units,\u201d in ICML, 2016.\n\n[18] M. Zhou, \u201cIn\ufb01nite edge partition models for overlapping community detection and link prediction,\u201d in\n\nAISTATS, pp. 1135\u20131143, 2015.\n\n[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan\n\nKaufmann, 1988.\n\n[20] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, \u201cAn introduction to variational methods for\n\ngraphical models,\u201d Machine learning, vol. 37, no. 2, pp. 183\u2013233, 1999.\n\n[21] S. Arora, R. Ge, T. Ma, and A. Risteski, \u201cProvable learning of noisy-or networks,\u201d arXiv preprint\n\narXiv:1612.08795, 2016.\n\n[22] F. Caron and E. B. Fox, \u201cSparse graphs using exchangeable random measures,\u201d J. R. Stat. Soc.: Series B,\n\nvol. 79, no. 5, pp. 
1295-1366, 2017.
[23] M. Zhou, "Discussion on 'Sparse graphs using exchangeable random measures' by François Caron and Emily B. Fox," arXiv preprint arXiv:1802.07721, 2018.
[24] P. Rai, C. Hu, R. Henao, and L. Carin, "Large-scale Bayesian multi-label learning via topic-based label embeddings," in NIPS, 2015.
[25] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[26] F. Aiolli and A. Sperduti, "Multiclass classification with multi-prototype support vector machines," J. Mach. Learn. Res., vol. 6, pp. 817-850, 2005.
[27] Z. Wang, N. Djuric, K. Crammer, and S. Vucetic, "Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification," in KDD, pp. 24-32, 2011.
[28] N. Manwani and P. S. Sastry, "Learning polyhedral classifiers using logistic function," in ACML, pp. 17-30, 2010.
[29] N. Manwani and P. S. Sastry, "Polyceptron: A polyhedral learning algorithm," arXiv:1107.1564, 2011.
[30] A. Kantchelian, M. C. Tschantz, L. Huang, P. L. Bartlett, A. D. Joseph, and J. D. Tygar, "Large-margin convex polytope machine," in NIPS, pp. 3248-3256, 2014.
[31] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211-244, June 2001.
[32] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M.
Wattenberg, M. Wicke, Y. Yu, and X. Zheng, \u201cTensorFlow: Large-scale machine learning on\nheterogeneous systems,\u201d 2015. Software available from tensor\ufb02ow.org.\n\n[34] Q. Zhang and M. Zhou, \u201cNonparametric Bayesian Lomax delegate racing for survival analysis with\n\ncompeting risks,\u201d in NeurIPS, 2018.\n\n[35] L. A. Hannah, D. M. Blei, and W. B. Powell, \u201cDirichlet process mixtures of generalized linear models,\u201d J.\n\nMach. Learn. Res., vol. 12, no. Jun, pp. 1923\u20131953, 2011.\n\n[36] K. Crammer and Y. Singer, \u201cOn the algorithmic implementation of multiclass kernel-based vector machines,\u201d\n\nJ. Mach. Learn. Res., vol. 2, pp. 265\u2013292, 2002.\n\n[37] D. B. Dunson and A. H. Herring, \u201cBayesian latent variable models for mixed discrete outcomes,\u201d Biostatis-\n\ntics, vol. 6, no. 1, pp. 11\u201325, 2005.\n\n[38] M. Zhou, L. Hannah, D. Dunson, and L. Carin, \u201cBeta-negative binomial process and Poisson factor\n\nanalysis,\u201d in AISTATS, pp. 1462\u20131471, 2012.\n\n[39] M. Zhou and L. Carin, \u201cNegative binomial process count and mixture modeling,\u201d IEEE Trans. Pattern\n\nAnal. Mach. Intell., vol. 37, no. 2, pp. 307\u2013320, 2015.\n\n[40] N. G. Polson and J. G. Scott, \u201cDefault Bayesian analysis for multi-way tables: a data-augmentation\n\napproach,\u201d arXiv:1109.4180v1, 2011.\n\n[41] M. Zhou, L. Li, D. Dunson, and L. Carin, \u201cLognormal and gamma mixed negative binomial regression,\u201d in\n\nICML, pp. 1343\u20131350, 2012.\n\n[42] N. G. Polson, J. G. Scott, and J. Windle, \u201cBayesian inference for logistic models using P\u00f3lya\u2013Gamma\n\nlatent variables,\u201d J. Amer. Statist. Assoc., vol. 108, no. 504, pp. 1339\u20131349, 2013.\n[43] M. Zhou, \u201cSoftplus regressions and convex polytopes,\u201d arXiv:1608.06383, 2016.\n[44] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller, \u201cSoft margins for AdaBoost,\u201d Machine Learning, vol. 42, no. 3,\n\npp. 
287-320, 2001.
[45] T. Diethe, "13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories." https://github.com/tdiethe/gunnar_raetsch_benchmark_datasets/, 2015.
[46] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM," J. Mach. Learn. Res., vol. 11, pp. 1471-1490, 2010.
[47] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., pp. 1871-1874, 2008.
[48] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[49] N. Djuric, L. Lan, S. Vucetic, and Z. Wang, "BudgetedSVM: A toolbox for scalable SVM approximations," J. Mach. Learn. Res., vol. 14, pp. 3813-3817, 2013.