{"title": "A Comparison of Projection Pursuit and Neural Network Regression Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 1159, "page_last": 1166, "abstract": null, "full_text": "A Comparison of Projection Pursuit and Neural \n\nNetwork Regression Modeling \n\nJellq-Nellg Hwang, Hang Li, \nInformation Processing Laboratory \n\nDept. of Elect. Engr., FT-lO \n\nUniversity of Washington \n\nSeattle WA 98195 \n\nMartin Maechler, R. Douglas Martin, Jim Schimert \n\nDepartment of Statistics \n\nMail Stop: GN-22 \n\nUniversity of Washington \n\nSeattle, WA 98195 \n\nAbstract \n\nTwo projection based feedforward network learning methods for model(cid:173)\nfree regression problems are studied and compared in this paper: one is \nthe popular back-propagation learning (BPL); the other is the projection \npursuit learning (PPL). Unlike the totally parametric BPL method, the \nPPL non-parametrically estimates unknown nonlinear functions sequen(cid:173)\ntially (neuron-by-neuron and layer-by-Iayer) at each iteration while jointly \nestimating the interconnection weights. In terms of learning efficiency, \nboth methods have comparable training speed when based on a Gauss(cid:173)\nNewton optimization algorithm while the PPL is more parsimonious. In \nterms of learning robustness toward noise outliers, the BPL is more sensi(cid:173)\ntive to the outliers. \n\n1 \n\nINTRODUCTION \n\nThe back-propagation learning (BPL) networks have been used extensively for es(cid:173)\nsentially two distinct problem types, namely model-free regression and classification, \n1159 \n\n\f1160 \n\nHwang, Li, Maechler, Martin, and Schimert \n\nwhich have no a priori assumption about the unknown functions to be identified \nother than imposes a certain degree of smoothness. 
The projection pursuit learning (PPL) networks have also been proposed for both types of problems (Friedman85 [3]), but to date there appears to have been much less actual use of PPLs than of BPLs for both regression and classification. In this paper, we shall concentrate on regression modeling applications of BPLs and PPLs, since the regression setting is one in which some fairly deep theory is available for PPLs in the case of low-dimensional regression (Donoho89 [2], Jones87 [6]). \n\nA multivariate model-free regression problem can be stated as follows: given n pairs of vector observations, (y_l, x_l) = (y_{l1}, ..., y_{lq}; x_{l1}, ..., x_{lp}), which have been generated from the unknown models \n\n  y_{li} = g_i(x_l) + \epsilon_{li},   l = 1, 2, ..., n;  i = 1, 2, ..., q,   (1) \n\nwhere {y_l} are called the multivariable \"response\" vectors and {x_l} are called the \"independent variables\" or the \"carriers\". The {g_i} are unknown smooth non-parametric (model-free) functions from p-dimensional Euclidean space to the real line, i.e., g_i: R^p -> R, for all i. The {\epsilon_{li}} are random variables with zero mean, E[\epsilon_{li}] = 0, and independent of {x_l}. Often the {\epsilon_{li}} are assumed to be independent and identically distributed (iid) as well. \n\nThe goal of regression is to generate the estimators, \hat{g}_1, \hat{g}_2, ..., \hat{g}_q, that best approximate the unknown functions, g_1, g_2, ..., g_q, so that they can be used for prediction of a new y given a new x: \hat{y}_i = \hat{g}_i(x), for all i. \n\n2 A TWO-LAYER PERCEPTRON AND BACK-PROPAGATION LEARNING \n\nSeveral recent results have shown that a two-layer (one hidden layer) perceptron with sigmoidal nodes can in principle represent any Borel-measurable function to any desired accuracy, assuming \"enough\" hidden neurons are used. 
This, along with the fact that theoretical results are known for the PPL in the analogous two-layer case, justifies focusing on the two-layer perceptron for our studies here. \n\n2.1 MATHEMATICAL FORMULATION \n\nA two-layer perceptron can be mathematically formulated as follows: \n\n  u_k = \sum_{j=1}^{p} w_{kj} x_j - \theta_k = w_k^T x - \theta_k,   k = 1, 2, ..., m, \n\n  \hat{y}_i = \sum_{k=1}^{m} \beta_{ik} f_k(u_k) = \sum_{k=1}^{m} \beta_{ik} f_k(w_k^T x - \theta_k),   i = 1, 2, ..., q,   (2) \n\nwhere u_k denotes the weighted sum input of the kth neuron in the hidden layer; \theta_k denotes the bias of the kth neuron in the hidden layer; w_{kj} denotes the input-layer weight linking the kth hidden neuron and the jth neuron of the input layer (or jth element of the input vector x); \beta_{ik} denotes the output-layer weight linking the ith output neuron and the kth hidden neuron; and f_k is the nonlinear activation function, which is usually assumed to be a fixed monotonically increasing (logistic) sigmoidal function, \sigma(u) = 1/(1 + e^{-u}). \n\nThe above formulation defines quite explicitly the parametric representation of the functions which are being used to approximate {g_i(x), i = 1, 2, ..., q}. A simple reparametrization allows us to write \hat{g}_i(x) in the form: \n\n  \hat{g}_i(x) = \sum_{k=1}^{m} \beta_{ik} \sigma( (a_k^T x - \mu_k) / s_k ),   (3) \n\nwhere a_k is a unit-length version of the weight vector w_k. This formulation reveals how the {\hat{g}_i} are built up as a linear combination of sigmoids evaluated at translated (by \mu_k) and scaled (by s_k) projections of x onto the unit-length vectors a_k. \n\n2.2 BACK-PROPAGATION LEARNING AND ITS VARIATIONS \n\nHistorically, the training of a multilayer perceptron uses back-propagation learning (BPL). There are two common types of BPL: the batch one and the sequential one. The batch BPL updates the weights after the presentation of the complete set of training data. 
Hence, a training iteration incorporates one sweep through all the training patterns. On the other hand, the sequential BPL adjusts the network parameters as training patterns are presented, rather than after a complete pass through the training set. The sequential approach is a form of Robbins-Monro stochastic approximation. \n\nWhile the two-layer perceptron provides a very powerful nonparametric modeling capability, BPL training can be slow and inefficient, since only the first derivative (or gradient) information about the training error is utilized. To speed up the training process, several second-order optimization algorithms, which take advantage of second derivative (or Hessian matrix) information, have been proposed for training perceptrons (Hwang90 [4]). For example, the Gauss-Newton method is also used in the PPL (Friedman85 [3]). \n\nThe fixed nonlinear nodal (sigmoidal) function is a monotone nondecreasing differentiable function with a very simple first-derivative form, and possesses nice properties for numerical computation. However, it does not interpolate/extrapolate efficiently in a wide variety of regression applications. Several attempts have been made to improve the choice of nonlinear nodal functions, e.g., the Gaussian or bell-shaped function, the locally tuned radial basis functions, and the semi-parametric (non-fixed nodal function) nonlinear functions used in PPLs and hidden Markov models. \n\n2.3 RELATIONSHIP TO KERNEL APPROXIMATION AND DATA SMOOTHING \n\nIt is instructive to compare the two-layer perceptron approximation in Eq. (3) with the well-known kernel method for regression. A kernel K(.) is a non-negative symmetric function which integrates to unity. Most kernels are also unimodal, with mode at the origin, K(t_1) >= K(t_2) for 0 < t_1 < t_2. 
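As a concrete (and deliberately minimal) illustration, the sigmoid-expansion form of Eq. (3) can be evaluated in a few lines of Python for a single output (q = 1); the parameter values below are arbitrary placeholders, not trained weights:

```python
import numpy as np

def sigmoid(u):
    # logistic sigmoid: sigma(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def g_hat(x, a, mu, s, beta):
    # Eq. (3): sum_k beta_k * sigma((a_k^T x - mu_k) / s_k), single output (q = 1)
    # a: (m, p) unit-length directions; mu, s, beta: (m,) translations, scales, weights
    z = (a @ x - mu) / s
    return float(beta @ sigmoid(z))

# arbitrary placeholder parameters for p = 2 inputs and m = 3 hidden neurons
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 2))
a /= np.linalg.norm(a, axis=1, keepdims=True)   # normalize each a_k to unit length
mu = rng.normal(size=3)
s = np.ones(3)
beta = rng.normal(size=3)

y = g_hat(np.array([0.5, 0.5]), a, mu, s, beta)
```

Since each sigmoid lies in (0, 1), the output is bounded by the sum of the |beta_k|, which makes the bounded linear-combination-of-ridge-functions structure of Eq. (3) explicit.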
A kernel estimate of g_i(x) has the form \n\n  \hat{g}_{K,i}(x) = \sum_{l=1}^{n} y_{li} (1/h^q) K( ||x - x_l|| / h ),   (4) \n\nwhere h is a bandwidth parameter and q is the dimension of the y_l vector. Typically a good value of h will be chosen by a data-based cross-validation method. Consider for a moment the special case of the kernel approximator and the two-layer perceptron in Eq. (3), respectively, with scalar y_l and x_l, i.e., with p = q = 1 (hence unit-length interconnection weight \alpha = 1 by definition): \n\n  \hat{g}_K(x) = \sum_{l=1}^{n} y_l (1/h) K( ||x - x_l|| / h ) = \sum_{l=1}^{n} (y_l / h) K( (x - x_l) / h ),   (5) \n\n  \hat{g}(x) = \sum_{k=1}^{m} \beta_k \sigma( (x - \mu_k) / s_k ).   (6) \n\nThis reveals some important connections between the two approaches. Suppose that for \hat{g}(x) we set \sigma = K, i.e., \sigma is a kernel and in fact identical to the kernel K, and that \beta_k, \mu_k, and s_k = s have been chosen (trained), say by BPL; that is, all {s_k} are constrained to a single unknown parameter value s. In general, m < n, or even m is a modest fraction of n, when the unknown function g(x) is reasonably smooth. Furthermore, suppose that h has been chosen by cross-validation. Then one can expect \hat{g}_K(x) \approx \hat{g}(x), particularly in the event that the {\mu_k} are close to the observed values {x_l} and x is close to a specific \mu_k value (relative to h). However, in this case where we force s_k = s, one might expect \hat{g}_K(x) to be a somewhat better estimate overall than \hat{g}(x), since the former is more local in character. \n\nOn the other hand, when one removes the restriction s_k = s, then BPL leads to a local bandwidth selection, and in this case one may expect \hat{g}(x) to provide a better approximation than \hat{g}_K(x) when the function g(x) has considerably varying curvature, g''(x), and/or considerably varying error variance for the noise \epsilon_{li} in Eq. (1). 
The reason is that a fixed-bandwidth kernel estimate cannot cope as well with changing curvature and/or noise variance as can a good smoothing method which uses a good local bandwidth selection method. A small caveat is in order: if m is fairly large, the estimation of a separate bandwidth for each kernel location, \mu_k, may cause some increased variability in \hat{g}(x), by virtue of using many more parameters than are needed to adequately represent a nearly optimal local bandwidth selection method. Typically a nearly optimal local bandwidth function will have some degree of smoothness, which reflects smoothly varying curvature and/or noise variance, and a good local bandwidth selection method should reflect the smoothness constraints. This is the case in the high-quality \"supersmoother\", designed for applications like the PPL (to be discussed), which uses cross-validation to select bandwidth locally (Friedman85 [3]), and which combines this feature with considerable speed. \n\nThe above arguments are probably equally valid without the restriction \sigma = K, because two sigmoids of opposite signs (via the choice of two {\beta_k}) that are appropriately shifted will approximate a kernel, up to a scaling to enforce unity area. However, there is a novel aspect: one can have a separate local bandwidth for each half of the kernel, thereby using an asymmetric kernel, which might improve the approximation capabilities relative to symmetric kernels with a single local bandwidth in some situations. \n\nIn the multivariate case, the curse of dimensionality will often render useless the kernel approximator \hat{g}_{K,i}(x) given by Eq. (4). 
Instead one might consider using a projection pursuit kernel (PPK) approximator: \n\n  \hat{g}_{PPK,i}(x) = \sum_{l=1}^{n} \sum_{k=1}^{m} y_{li} (1/h_k) K( (\alpha_k^T x - \alpha_k^T x_l) / h_k ),   (7) \n\nwhere a different bandwidth h_k is used for each direction \alpha_k. In this case, the similarities and differences between the PPK estimate and the BPL estimate \hat{g}_i(x) become evident. \n\nThe main difference between the two methods is that the PPK performs explicit smoothing in each direction \alpha_k using a kernel smoother, whereas the BPL does implicit smoothing, with both \beta_k (replacing y_{li}/h_k) and \mu_k (replacing \alpha_k^T x_l) being determined by nonlinear least squares optimization. In both PPK and BPL, the \alpha_k and h_k are determined by nonlinear optimization (cross-validation choices of bandwidth parameters are inherently nonlinear optimization problems) (Friedman85 [3]). \n\n3 PROJECTION PURSUIT LEARNING NETWORKS \n\nProjection pursuit learning (PPL) is a statistical procedure proposed for multivariate data analysis using a two-layer network given in Eq. (2). The procedure derives its name from the fact that it interprets high-dimensional data through well-chosen lower-dimensional projections. The \"pursuit\" part of the name refers to optimization with respect to the projection directions. \n\n3.1 COMPARATIVE STRUCTURES OF PPL AND BPL \n\nSimilar to a BPL perceptron, a PPL network forms projections of the data in directions determined from the interconnection weights. However, unlike a BPL perceptron, which employs a fixed set of nonlinear (sigmoidal) functions, a PPL non-parametrically estimates the nonlinear nodal functions based on a nonlinear optimization approach which involves the use of a one-dimensional data smoother (e.g., a least squares estimator followed by a variable-window-span data averaging mechanism) (Friedman85 [3]). 
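A minimal sketch of a PPK estimate in the spirit of Eq. (7) follows; the fixed directions, bandwidths, and Gaussian kernel here are illustrative assumptions of ours, not choices made in the paper:

```python
import numpy as np

def gaussian_kernel(t):
    # a non-negative, symmetric kernel integrating to unity
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def g_ppk(x, X, y, alphas, h):
    # Eq. (7) for a scalar response (q = 1):
    #   sum over data l and directions k of (y_l / h_k) * K((alpha_k^T x - alpha_k^T x_l) / h_k)
    # X: (n, p) carriers; y: (n,); alphas: (m, p) unit directions; h: (m,) per-direction bandwidths
    t = (alphas @ x - X @ alphas.T) / h      # (n, m) scaled projection differences
    return float(np.sum((y[:, None] / h) * gaussian_kernel(t)))

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))
y = np.sin(X[:, 0]) + X[:, 1]
alphas = np.array([[1.0, 0.0], [0.0, 1.0]])  # two fixed unit-length directions (a placeholder)
h = np.array([0.1, 0.1])
est = g_ppk(np.array([0.5, 0.5]), X, y, alphas, h)
```

In a full PPK procedure the directions alpha_k and bandwidths h_k would themselves be chosen by nonlinear optimization, as the text notes; here they are simply fixed for illustration.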
Therefore, it is important to note that a PPL network is a semi-parametric learning network, which consists of both parametrically and non-parametrically estimated elements. This is in contrast to a BPL perceptron, which is a completely parametric model. \n\n3.2 LEARNING STRATEGIES OF PPL \n\nIn comparison with a batch BPL, which employs either 1st-order gradient descent or 2nd-order Newton-like methods to estimate the weights of all layers simultaneously after all the training patterns are presented, a PPL learns neuron-by-neuron and layer-by-layer cyclically after all the training patterns are presented. Specifically, it applies linear least squares to estimate the output-layer weights, a one-dimensional data smoother to estimate the nonlinear nodal functions of each hidden neuron, and the Gauss-Newton nonlinear least squares method to estimate the input-layer weights. \n\nThe PPL procedure uses the batch learning technique to iteratively minimize the mean squared error, E, over all the training data. All the parameters to be estimated are hierarchically divided into m groups (each associated with one hidden neuron), and each group, say the kth group, is further divided into three subgroups: the output-layer weights, {\beta_{ik}, i = 1, ..., q}, connected to the kth hidden neuron; the nonlinear function, f_k(u), of the kth hidden neuron; and the input-layer weights, {w_{kj}, j = 1, ..., p}, connected to the kth hidden neuron. The PPL starts by updating the parameters associated with the first hidden neuron (group), updating each subgroup, {\beta_{i1}}, f_1(u), and {w_{1j}}, consecutively (layer-by-layer) to minimize the mean squared error E. It then updates the parameters associated with the second hidden neuron by consecutively updating {\beta_{i2}}, f_2(u), and {w_{2j}}. 
A complete updating pass ends with the updating of the parameters associated with the mth (the last) hidden neuron, by consecutively updating {\beta_{im}}, f_m(u), and {w_{mj}}. Repeated updating passes are made over all the groups until convergence; i.e., in our studies of Section 4, we use the stopping criterion that \n\n  |E(new) - E(old)| / E(old) \n\nbe smaller than a prespecified small constant, \epsilon = 0.005. \n\n4 LEARNING EFFICIENCY IN BPL AND PPL \n\nHaving discussed the \"parametric\" BPL and the \"semi-parametric\" PPL from structural, computational, and theoretical viewpoints, we have also made a more practical comparison of learning efficiency via a simulation study. For simplicity of comparison, we confine the simulations to the two-dimensional univariate case, i.e., p = 2, q = 1. This is an important situation in practice, because the models can be visualized graphically as functions y = g(x_1, x_2). \n\n4.1 PROTOCOLS OF THE SIMULATIONS \n\nNonlinear Functions: Five nonlinear functions g^(j): [0,1]^2 -> R were investigated (Maechler90 [7]), each scaled such that its standard deviation is 1 (over a large regular grid of 2500 points on [0,1]^2) and translated to make its range nonnegative. \n\nTraining and Test Data: The two independent variables (carriers) (x_{l1}, x_{l2}) were generated from the uniform distribution U([0,1]^2), i.e., the abscissa values {(x_{l1}, x_{l2})} were generated as uniform random variates on [0,1], independent of each other. We generated 225 pairs {(x_{l1}, x_{l2})} of abscissa values, and used this same set for the experiments on all five functions, thus eliminating an unnecessary extra random component of the simulation. In addition to one set of noiseless training data, another set of noisy training data was generated by adding iid Gaussian noise. 
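The data-generation protocol above can be sketched as follows; the placeholder function g stands in for the paper's five test functions g^(j) (defined in Maechler90 [7]), and the noise standard deviation is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

# 225 carrier pairs (x_l1, x_l2) drawn uniformly on [0, 1]^2, shared across all functions
n = 225
X = rng.uniform(0.0, 1.0, size=(n, 2))

def g(x1, x2):
    # placeholder smooth function on [0, 1]^2 (stands in for the paper's g^(j))
    return np.sin(np.pi * x1) * np.cos(np.pi * x2)

y_clean = g(X[:, 0], X[:, 1])                 # noiseless training responses
y_noisy = y_clean + rng.normal(0.0, 0.25, n)  # second set: iid Gaussian noise added

# independent test set: regularly spaced grid of N = 10000 points on [0, 1]^2
grid = np.linspace(0.0, 1.0, 100)
xx1, xx2 = np.meshgrid(grid, grid)
y_test = g(xx1.ravel(), xx2.ravel())
```

Reusing the same 225 abscissa pairs for every test function, as in the protocol, removes one source of random variation from the comparison.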
\n\n\fA Comparison of Projection Pursuit and Neural Network Regression Modeling \n\n1165 \n\nAlgorithm Used: The PPL simulations were conducted using the S-Plus pack(cid:173)\nage (S-Plus90 [1]) implementation of PPL, where 3 and 5 hidden neurons were tried \n(with 5 and 7 maximum working hidden neurons used separately to avoid the overfit(cid:173)\nting). The S-Plus implementation is based on the Friedman code (Friedman85 [3]), \nwhich uses a Gauss-Newton method for updating the lower layer weights. To obtain \na fair comparison, the BPL was implemented using a batch Gauss-Newton method \n(rather than the usual gradient descent, which is slower) on two-layer perceptrons \nwith linear output neurons and nonlinear sigmoidal hidden neurons (Hwang90 [4], \nHwang9I [5]), where 5 and 10 hidden neurons were tried. \n\nIndependent Test Data Set: The assessment of performance was done by com(cid:173)\nparing the fitted models with the \"true\" function counterparts on a large indepen(cid:173)\ndent test set. Throughout all the simulations, we used the same set of test data for \nperformance assessment, i.e., {g(j)( Xll, X/2)}, of size N = 10000, namely a regularly \nspaced grid on [0,1]2, defined by its marginals. \n\n4.2 SIMULATION RESULTS IN LEARNING EFFICIENCY \n\nTo summarize the simulation results in learning efficiency, we focused on the chosen \nthree aspects: accuracy, parsimony, and speed. \n\nLearning Accuracy: The accuracy determined by the absolute L2 error measure \nof the independent test data in both learning methods are quite comparable either \ntrained by noiseless or noisy data (Hwang9I [5]). Note that our comparisons are \nbased on 5 & 10 hidden neurons of BPLs and 3 & 5 hidden neurons of PPLs. \nThe reason of choosing different number of hidden neurons will be explained in the \nlearning parsimony section. 
\n\nLearning Parsimony: In comparison with BPL, the PPL is more parsimonious in training all types of nonlinear functions; i.e., in order to achieve accuracy comparable to the BPLs for a two-layer perceptron, the PPLs require fewer hidden neurons (more parsimonious) to approximate the desired true function (Hwang91 [5]). Several factors may contribute to this favorable performance. First and foremost, the data-smoothing technique creates more pertinent nonlinear nodal functions, so the network adapts more efficiently to the observation data without using too many terms (hidden neurons) of interpolative projections. Secondly, the batch Gauss-Newton BPL updates all the weights in the network simultaneously, while the PPL updates cyclically (neuron-by-neuron and layer-by-layer), which allows the most recent updating information to be used in the subsequent updating. That is, the more important projection directions can be determined first, so that the less important projections can have an easier search (the same argument used in favoring the Gauss-Seidel method over the Jacobi method in an iterative linear equation solver). \n\nLearning Speed: As we reported earlier (Maechler90 [7]), the PPL took much less time (a 1-2 order of magnitude speedup) to achieve accuracy comparable with that of the sequential gradient descent BPL. Interestingly, when compared with the batch Gauss-Newton BPL, the PPL took a quite similar amount of time over all the simulations (under the same number of hidden neurons and the same convergence threshold \epsilon = 0.005). In all simulations, both the BPLs and PPLs converge in under 100 iterations most of the time. \n\n5 SENSITIVITY TO OUTLIERS \n\nBoth BPLs and PPLs are types of nonlinear least squares estimators. Hence, like all least squares procedures, they are sensitive to outliers. 
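As a toy numerical illustration of this least-squares sensitivity (our own construction, not one of the paper's simulations), a single gross outlier is enough to pull an ordinary least-squares line fit noticeably:

```python
import numpy as np

def ls_slope(x, y):
    # least-squares slope of y on x (intercept included)
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(coef[0])

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.05, x.size)  # true slope 2 plus small Gaussian noise

clean_slope = ls_slope(x, y)

y_out = y.copy()
y_out[-1] += 10.0                            # one gross (heavy-tailed) outlier at the right edge
outlier_slope = ls_slope(x, y_out)
# the outlier shifts the slope by roughly 10 * (x_n - mean(x)) / sum((x - mean(x))**2),
# about 1.2 here, far exceeding the noise-induced variation of the clean fit
```

The shift grows with the outlier's leverage, which is why the paper's corner-located outliers distort the fitted surfaces most visibly near that corner.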
The outliers may come from large errors in measurements, generated by heavy-tailed deviations from a Gaussian distribution for the noise \epsilon_{li} in Eq. (1). \n\nIn the presence of additive Gaussian noise without outliers, most functions can be well approximated with 5-10 hidden neurons using BPL or with 3-5 hidden neurons using PPL. When the Gaussian noise is altered by adding one outlier, the BPL with 5-10 hidden neurons can still approximate the desired function reasonably well in general, at the sacrifice of magnified error in the vicinity of the outlier. If the number of outliers increases to 3 in the same corner, the BPL can only get a \"distorted\" approximation of the desired function. On the other hand, the PPL with 5 hidden neurons can successfully approximate the desired function and remove the single outlier. In the case of three outliers, the PPL using simple data smoothing techniques can no longer keep its robustness in accuracy of approximation. \n\nAcknowledgements \n\nThis research was partially supported through grants from the National Science Foundation under Grant No. ECS-9014243. \n\nReferences \n\n[1] S-Plus Users Manual (Version 3.0). Statistical Science Inc., Seattle, WA, 1990. \n[2] D.L. Donoho and I.M. Johnstone. Projection-based approximation and a duality with kernel methods. The Annals of Statistics, Vol. 17, No. 1, pp. 58-106, 1989. \n[3] J.H. Friedman. Classification and multiple regression through projection pursuit. Technical Report No. 12, Department of Statistics, Stanford University, January 1985. \n[4] J.N. Hwang and P.S. Lewis. From nonlinear optimization to neural network learning. In Proc. 24th Asilomar Conf. on Signals, Systems, & Computers, pp. 985-989, Pacific Grove, CA, November 1990. \n[5] J.N. Hwang, H. Li, D. Martin, and J. Schimert. The learning parsimony of projection pursuit and back-propagation networks. In Proc. 25th Asilomar Conf. 
on Signals, Systems, & Computers, Pacific Grove, CA, November 1991. \n[6] L.K. Jones. On a conjecture of Huber concerning the convergence of projection pursuit regression. The Annals of Statistics, Vol. 15, No. 2, pp. 880-882, 1987. \n[7] M. Maechler, D. Martin, J. Schimert, M. Csoppenszky, and J.N. Hwang. Projection pursuit learning networks for regression. In Proc. 2nd Int'l Conf. Tools for AI, pp. 350-358, Washington D.C., November 1990. \n", "award": [], "sourceid": 578, "authors": [{"given_name": "Jenq-Neng", "family_name": "Hwang", "institution": null}, {"given_name": "Hang", "family_name": "Li", "institution": null}, {"given_name": "Martin", "family_name": "Maechler", "institution": null}, {"given_name": "R.", "family_name": "Martin", "institution": null}, {"given_name": "Jim", "family_name": "Schimert", "institution": null}]}