{"title": "A Comparison between Neural Networks and other Statistical Techniques for Modeling the Relationship between Tobacco and Alcohol and Cancer", "book": "Advances in Neural Information Processing Systems", "page_first": 967, "page_last": 973, "abstract": null, "full_text": "A  comparison between neural  networks \n\nand other statistical techniques for \nmodeling the relationship between \n\ntobacco and  alcohol and  cancer \n\nTony  Plate \n\nBC  Cancer Agency \n\n601  West  10th Ave,  Epidemiology \nVancouver BC Canada V5Z  1L3 \n\ntap@comp.vuw.ac.nz \n\nPierre Band \n\nBC Cancer Agency \n\n601  West  10th Ave,  Epidemiology \nVancouver BC  Canada V5Z  1L3 \n\nJoel Bert \n\nDept of Chemical Engineering \nUniversity of British Columbia \n\n2216  Main  Mall \n\nVancouver BC  Canada V6T  1Z4 \n\nJohn Grace \n\nDept of Chemical Engineering \nUniversity of British Columbia \n\n2216 Main  Mall \n\nVancouver BC  Canada V6T  1Z4 \n\nAbstract \n\nEpidemiological  data  is  traditionally  analyzed  with  very  simple \ntechniques.  Flexible  models,  such  as  neural  networks,  have  the \npotential to discover unanticipated features in the data.  However, \nto be useful,  flexible  models  must have effective control on overfit(cid:173)\nting.  This paper reports on a  comparative study of the predictive \nquality of neural networks and other flexible models applied to real \nand artificial epidemiological data.  The results  suggest that there \nare no major unanticipated complex features  in the real data, and \nalso  demonstrate  that  MacKay's  [1995]  Bayesian  neural  network \nmethodology  provides effective  control on overfitting while  retain(cid:173)\ning the ability  to discover complex features  in the artificial data. \n\n1 \n\nIntroduction \n\nTraditionally, very simple statistical techniques  are used in the analysis of epidemi(cid:173)\nological  studies.  The  predominant  technique  is  logistic  regression,  in  which  the \neffects  of predictors  are linear  (or  categorical)  and  additive  on the  log-odds  scale. \nAn important virtue of logistic regression is  that the relationships identified in the \n\n\f968 \n\nT.  Plate, P.  Band, J.  Bert and J.  Grace \n\ndata can be  interpreted and explained in simple terms,  such as  \"the odds of devel(cid:173)\noping  lung  cancer for  males  who  smoke  between  20  and  29  cigarettes per day  are \nincreased by  a  factor of 11.5 over males  who do  not smoke\".  However,  because of \ntheir simplicity, it is difficult to use these models to discover unanticipated complex \nrelationships, i.e.,  non-linearities in the effect of a  predictor or interactions between \npredictors.  Interactions and non-linearities can of course be introduced into logistic \nregressions,  but must  be  pre-specified,  which  tends  to  be  impractical  unless  there \nare only  a  few  variables or there are a  priori reasons to test for  particular effects. \nNeural networks have the potential to automatically discover complex relationships. \nThere has been  much interest in  using neural networks in  biomedical  applications; \nwitness  the  recent  series  of articles  in  The  Lancet,  e.g.,  Wyatt  [1995]  and  Baxt \n[1995].  However,  there are not yet sufficient  comparisons or theory to come to firm \nconclusions  about  the  utility  of neural  networks  in  biomedical  data  analysis.  
Neural networks have the potential to automatically discover complex relationships. There has been much interest in using neural networks in biomedical applications; witness the recent series of articles in The Lancet, e.g., Wyatt [1995] and Baxt [1995]. However, there are not yet sufficient comparisons or theory to come to firm conclusions about the utility of neural networks in biomedical data analysis. To date, comparison studies, e.g., those by Michie, Spiegelhalter, and Taylor [1994], Burke, Rosen, and Goodman [1995], and Lippmann, Lee, and Shahian [1995], have had mixed results, and Jefferson et al.'s [1995] complaint that many "successful" applications of neural networks are not compared against standard techniques appears to be justified. The intent of this paper is to contribute to the body of useful comparisons by reporting a study of various neural-network and statistical modeling techniques applied to an epidemiological data analysis problem.

2 The data

The original data set consisted of information on 15,463 subjects from a study conducted by the Division of Epidemiology and Cancer Prevention at the BC Cancer Agency. In this study, detailed questionnaires reported personal information, lifetime tobacco and alcohol use, and lifetime employment history for each subject. The subjects were cancer patients in BC with diagnosis dates between 1983 and 1989, as ascertained by the population-based registry at the BC Cancer Agency. Six different tobacco and alcohol habits were included: cigarette (C), cigar (G), and pipe (P) smoking, and beer (B), wine (W), and spirit (S) drinking. The models reported in this paper used up to 27 predictor variables: age at first diagnosis (AGE), and 26 variables related to alcohol and tobacco consumption. These included four variables for each habit: total years of consumption (CYR etc.), consumption per day or week (CDAY, BWK etc.), years since quitting (CYQUIT etc.), and a binary variable indicating any indulgence (CSMOKE, BDRINK etc.). The remaining two binary variables indicated whether the subject ever smoked tobacco or drank alcohol. All the binary variables were non-linear (threshold) transforms of the other variables. Variables not applicable to a particular subject were zero, e.g., number of years of smoking for a non-smoker, or years since quitting for a smoker who did not quit.

Of the 15,463 records, 5,901 had missing information in some of the fields related to tobacco or alcohol use. These were not used, as there are no simple methods for dealing with missing data in neural networks. Of the 9,562 complete records, a randomly selected 3,195 were set aside for testing, leaving 6,367 complete records to be used in the modeling experiments.

There were 28 binary outcomes: the 28 sites at which a subject could have cancer (subjects had cancers at up to 3 different sites). The number of cases for each site varied, e.g., for LUNGSQ (Lung Squamous) there were 694 cases among the complete records, for ORAL (Oral Cavity and Pharynx) 306, and for MEL (Melanoma) 464.

All sites were modeled individually using carefully selected subjects as controls. This is common practice in cancer epidemiology studies, due to the difficulty of collecting an unbiased sample of non-cancer subjects for controls. Subjects with cancers at a site suspected of being related to tobacco usage were not used as controls.
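The selection rule itself is simple; the following sketch (hypothetical data layout; only the rule and the list of control-eligible sites are from the study) illustrates it:

```python
# Sites considered unrelated to tobacco use, hence eligible for controls.
CONTROL_SITES = {"COLON", "RECTUM", "MEL", "NMSK", "PROS", "NHL", "MMY"}

def split_cases_controls(subjects, target_site):
    """subjects: dicts with a 'sites' set (each subject has up to 3 sites)."""
    cases = [s for s in subjects if target_site in s["sites"]]
    controls = [s for s in subjects
                if target_site not in s["sites"]
                and s["sites"] <= CONTROL_SITES]   # every site tobacco-unrelated
    return cases, controls

subjects = [{"sites": {"LUNGSQ"}},          # a case for the LUNGSQ model
            {"sites": {"COLON", "MEL"}},    # an eligible control
            {"sites": {"ORAL"}}]            # excluded: ORAL is tobacco-related
cases, controls = split_cases_controls(subjects, "LUNGSQ")
print(len(cases), len(controls))            # 1 1
```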
This eliminated subjects with any sites other than COLON, RECTUM, MEL (Melanoma), NMSK (Non-melanoma skin), PROS (Prostate), NHL (Non-Hodgkin's lymphoma), and MMY (Multiple Myeloma), and resulted in between 2959 and 3694 controls for each site. For example, the model for LUNGSQ (lung squamous cell) cancer was fitted using subjects with LUNGSQ as the positive outcomes (694 cases), and subjects all of whose sites were among COLON, RECTUM, MEL, NMSK, PROS, NHL, or MMY as negative outcomes (3694 controls).

3 Statistical methods

A number of different types of statistical methods were used to model the data. These ranged from the non-flexible (logistic regression) through partially flexible (Generalized Additive Models or GAMs) to completely flexible (classification trees and neural networks). Each site was modeled independently, using the log likelihood of the data under the binomial distribution as the fitting criterion. All of the modeling, except for the neural networks and ridge regression, was done using the S-plus statistical software package [StatSci 1995].

For several methods, we used Breiman's [1996] bagging technique to control overfitting. To "bag" a model, one fits a set of models independently on bootstrap samples. The bagged prediction is then the average of the predictions of the models in the set. Breiman suggests that bagging will give superior predictions for unstable models (such as stepwise selection, pruned trees, and neural networks).

Preliminary analysis revealed that the predictive power of non-flexible models could be improved by including non-linear transforms of some variables, namely AGESQ and the binary indicator variables SMOKE, DRINK, CSMOKE, etc. Flexible models should be able to discover useful non-linear transforms for themselves, and so these derived variables were not included in the flexible models. In order to allow comparisons to test this, one of the non-flexible models (ONLYLIN-STEP) also did not use any of these derived variables.

Null model: (NULL) The predictions of the null model are just the frequency of the outcome in the training set.

Logistic regression: The FULL model used the full set of predictor variables, including a quadratic term for age: AGESQ.

Stepwise logistic regression: A number of stepwise regressions were fitted, differing in the set of variables considered. Outcome-balanced 10-fold cross-validation was used to select the model size giving best generalization. The models were as follows: AGE-STEP (AGE and AGESQ); CYR-AGE-STEP (CYR, AGE and AGESQ); ALC-CYR-AGE-STEP (all alcohol variables, CYR, AGE and AGESQ); FULL-STEP (all variables including AGESQ); and ONLYLIN-STEP (all variables except for the derived binary indicator variables SMOKE, CSMOKE, etc., and only a linear AGE term).

Ridge regression: (RIDGE) Ridge regression penalizes a logistic regression model by the sum of the squared parameter values in order to control overfitting. The evidence framework [MacKay 1995] was used to select seven shrinkage parameters: one for each of the six habits, and one for SMOKE, DRINK, AGE and AGESQ.
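The penalized fit can be sketched as follows (illustrative Python; the column grouping and shrinkage values are placeholders, and the fixed lambdas stand in for the evidence-framework re-estimation used in the study):

```python
import numpy as np
from scipy.optimize import minimize

def fit_ridge_logistic(X, y, groups, lam):
    """Penalized logistic regression with one shrinkage parameter per group
    of coefficients (the study used seven groups: six habits plus
    SMOKE/DRINK/AGE/AGESQ). Here the lambdas are simply fixed by hand."""
    pen = lam[groups]                                    # per-coefficient penalty

    def neg_penalized_loglik(beta):
        eta = X @ beta
        loglik = np.sum(y * eta - np.logaddexp(0, eta))  # binomial log-likelihood
        return -loglik + np.sum(pen * beta ** 2)         # ridge penalty

    return minimize(neg_penalized_loglik, np.zeros(X.shape[1]), method="BFGS").x

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
groups = np.array([0, 0, 1, 1])                          # two hypothetical habits
print(fit_ridge_logistic(X, y, groups, lam=np.array([0.1, 1.0])))
```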
Generalized Additive Models: GAMs [Hastie and Tibshirani 1990] fit a smoothing spline to each predictor. GAMs can model non-linearities, but not interactions. A stepwise procedure was used to select the degree (0, 1, 2, or 4) of the smoothing spline for each predictor. The procedure started with a model having a smoothing spline of degree 2 for each predictor, and stopped when the AIC statistic could not be reduced any further. Two stepwise GAM models were fitted: GAM-FULL used the full set of variables, while GAM-CIG used the cigarette variables and AGE.

Classification trees: [Breiman et al. 1984] The same cross-validation procedure as used with stepwise regression was used to select the best size for TREE, using the implementation in S-plus, and the function shrink.tree() for pruning. A bagged version with 50 replications, TREE-BAGGED, was also used. After constructing a tree for the data in a replication, it was pruned to perform optimally on the training data not included in that replication.

Ordinary neural networks: The neural network models had a single hidden layer of tanh functions and a small weight penalty (0.01) to prevent parameters going to infinity. A conjugate-gradient procedure was used to optimize weights. For the NN-ORD-H2 model, which had no control on complexity, a network with two hidden units was trained three times from different small random starting weights. Of these three, the one with the best performance on the training data was selected as "the model". The NN-ORD-HCV model used a common method for controlling overfitting in neural networks: 10-fold cross-validation for selecting the optimal number of hidden units. Three random starting points for each partition were used to calculate the average generalization error for networks with one, two and three hidden units. Three networks with the best number of hidden units were then trained on the entire set of training data, and the network having the lowest training error was chosen.

Bagged neural networks with early stopping: Bagging and early stopping (terminating training before reaching a minimum on training set error in order to prevent overfitting) work naturally together. The training examples omitted from each bootstrap replication provide a validation set to decide when to stop, and with early stopping, training is fast enough to make bagging practical. 100 networks with two hidden units were trained on separate bootstrap replications, and the best 50 (by their performance on the omitted examples) were included in the final bagged model, NN-ESTOP-BAGGED. For comparison purposes, the mean individual performance of these early-stopped networks is reported as NN-ESTOP-AVG.
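A simplified sketch of this scheme follows (illustrative Python; full-batch gradient descent replaces the conjugate-gradient optimizer, the small weight penalty is omitted, and all data are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_early_stop(Xtr, ytr, Xval, yval, hidden=2, lr=0.05, epochs=500):
    """Tiny tanh network trained by full-batch gradient descent; the weights
    giving the best validation (out-of-bootstrap) deviance are kept."""
    n, d = Xtr.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, hidden); b2 = 0.0
    best = (np.inf, None)
    for _ in range(epochs):
        H = np.tanh(Xtr @ W1 + b1)
        p = 1 / (1 + np.exp(-(H @ W2 + b2)))
        g = (p - ytr) / n                        # grad of mean deviance wrt logit
        gH = np.outer(g, W2) * (1 - H ** 2)      # backprop through tanh layer
        W2 = W2 - lr * (H.T @ g); b2 = b2 - lr * g.sum()
        W1 = W1 - lr * (Xtr.T @ gH); b1 = b1 - lr * gH.sum(axis=0)
        pv = 1 / (1 + np.exp(-(np.tanh(Xval @ W1 + b1) @ W2 + b2)))
        pv = np.clip(pv, 1e-12, 1 - 1e-12)
        dev = -2 * np.sum(yval * np.log(pv) + (1 - yval) * np.log(1 - pv))
        if dev < best[0]:                        # early stopping: keep the best
            best = (dev, (W1.copy(), b1.copy(), W2.copy(), b2))
    return best

def bagged_estop(X, y, Xnew, n_boot=100, keep=50):
    """One early-stopped net per bootstrap replication; keep the best `keep`
    by out-of-bootstrap deviance and average their predictions."""
    nets = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))        # bootstrap sample
        oob = np.setdiff1d(np.arange(len(y)), idx)   # omitted examples
        nets.append(train_early_stop(X[idx], y[idx], X[oob], y[oob]))
    nets.sort(key=lambda net: net[0])
    preds = [1 / (1 + np.exp(-(np.tanh(Xnew @ W1 + b1) @ W2 + b2)))
             for _, (W1, b1, W2, b2) in nets[:keep]]
    return np.mean(preds, axis=0)                    # the bagged prediction

X = rng.normal(size=(400, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] * X[:, 1])))
print(bagged_estop(X, y, X[:3], n_boot=20, keep=10))
```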
Neural networks with Bayesian regularization: MacKay's [1995] Bayesian evidence framework was used to control overfitting in neural networks. Three random starts for networks with 1, 2, 3 or 4 hidden units and three different sets of regularization (penalty) parameters were used, giving a total of 36 networks for each site. The three possibilities for regularization parameters were: (a) three penalty parameters, one for each of the input-to-hidden, bias-to-hidden, and hidden-to-output weights; (b) partial Automatic Relevance Determination (ARD) [MacKay 1995] with seven penalty parameters controlling the input-to-hidden weights, one for each habit and one for AGE; and (c) full ARD, with one penalty parameter for each of the 19 inputs. The "evidence" for each network was evaluated and the best 18 networks were selected for the equally-weighted committee model NN-BAYES-CMTT. NN-BAYES-BEST was the single network with the maximum evidence.

4 Results and Discussion

Models were compared based on their performance on the held-out test data, so as to avoid overfitting bias in evaluation. While there are several ways to measure performance, e.g., 0-1 classification error, or area under the ROC curve (as in Burke, Rosen and Goodman [1995]), we used the test-set deviance, as it seems appropriate to compare models using the same criterion as was used for fitting. Reporting performance is complicated by the fact that there were 28 different modeling tasks (i.e., sites), and some models did better on some sites and worse on others. We report some overall performance figures and some pairwise comparisons of models.

[Figure 1 appears here: a dot plot with one row per model (NULL, AGE-STEP, CYR-AGE-STEP, ALC-CYR-AGE-STEP, ONLYLIN-STEP, FULL-STEP, FULL, RIDGE, GAM-CIG, GAM-FULL, TREE, TREE-BAGGED, NN-ORD-H2, NN-ORD-HCV, NN-ESTOP-AVG, NN-ESTOP-BAGGED, NN-BAYES-BEST, NN-BAYES-CMTT) and panels including ALL (6618/3299) and MEL (322/142); the horizontal axes run from -10 to 20 percent.]

Figure 1: Percent improvement in deviance on test data over the null model.

Figure 1 shows aggregate deviances across sites (i.e., the sum of the test deviance for one model over the 28 sites) and deviances for selected sites. The horizontal scale in each column indicates the percentage reduction in deviance over the null model. Zero percent (the dotted line) is the same performance as the null model, and 100% would be perfect predictions. Numbers below the column labels are the number of positive outcomes in the training and test sets, respectively. The best predictions for LUNGSQ can reduce the null deviance by just over 25%. It is interesting to note that much of the information is contained in AGE and CYR: the CYR-AGE-STEP model achieved a 7.1% reduction in overall deviance, while the maximum reduction (achieved by NN-BAYES-CMTT) was only 8.3%.
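The comparison statistic is straightforward to compute (a small sketch; in the study, the predicted probabilities come from each fitted model and the null probabilities from the training-set outcome frequency):

```python
import numpy as np

def deviance(y, p):
    """Binomial deviance, the criterion used for both fitting and testing."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def pct_improvement(y_test, p_model, p_null):
    """Percent reduction in test-set deviance relative to the null model:
    0% matches the null model, 100% would be perfect prediction."""
    return 100 * (1 - deviance(y_test, p_model) / deviance(y_test, p_null))

y_test = np.array([0, 1, 0, 1, 1])
p_null = np.full(5, 0.5)                          # training-set outcome frequency
p_model = np.array([0.2, 0.8, 0.3, 0.7, 0.9])
print(pct_improvement(y_test, p_model, p_null))   # about 63% here
```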
There is no single threshold at which differences in test-set deviance are "significant", because of strong correlations between the predictions of different models. However, the general patterns of superiority apparent in Figure 1 were repeated across the other sites, and various other tests indicate they are reliable indicators of general performance. For example, the best five models, both in terms of aggregate deviance across all sites and median rank of performance on individual sites, were, in order, NN-BAYES-CMTT, RIDGE, NN-ESTOP-BAGGED, GAM-CIG, and FULL-STEP. The ONLYLIN-STEP model ranked sixth in median rank, and tenth in aggregate deviance.

Although the differences between the best flexible models and the logistic models were slight, they were consistent. For example, NN-BAYES-CMTT did better than FULL-STEP on 21 sites, and better than ONLYLIN-STEP on 23 sites, while FULL-STEP drew with ONLYLIN-STEP on 14 sites and did better on 9. If the models had no effective difference, there was only a 1.25% chance of one model doing better than the other 21 or more times out of 28. Individual measures of performance were also consistent with these findings. For example, for LUNGSQ a bootstrap test of test-set deviance revealed that the predictions of NN-BAYES-CMTT were on average better than those of ONLYLIN-STEP in 99.82% of resampled test sets (out of 10,000), while the predictions of NN-BAYES-CMTT beat FULL-STEP in 93.75% of replications and FULL-STEP beat ONLYLIN-STEP in 98.48% of replications.

These results demonstrate that good control on overfitting is essential for this task. Ordinary neural networks with no control on overfitting do worse than guessing (i.e., the null model). Even when the number of hidden units is chosen by cross-validation, the performance is still worse than a simple two-variable stepwise logistic regression (CYR-AGE-STEP). The inadequacy of the simple AIC-based stepwise procedure for choosing the complexity of GAMs is illustrated by the poor performance of the GAM-FULL model (the more restricted GAM-CIG model does quite well).

The effective methods for controlling overfitting were bagging and Bayesian regularization. Bagging improved the performance of trees and early-stopped neural networks to good levels. Bayesian regularization worked very well with neural networks and with ridge regression. Furthermore, examination of the performance of individual networks indicates that networks with fine-grained ARD were frequently superior to those with coarser control on regularization.

5 Artificial sites with complex relationships

The very minor improvement achieved by neural networks and trees over logistic models provokes the following question: are complex relationships really relatively unimportant in this data, or is the strong control on overfitting preventing identification of complex relationships? In order to answer this question, we created six artificial "sites" for the subjects. These were designed to have very similar properties to the real sites, while possessing non-linear effects and interactions.

The risk models for the artificial sites possessed an underlying trend equal to half that of a good logistic model for LUNGSQ, plus one of three more complex effects: FREQ, a frequent non-linear (threshold) effect (BWK > 1) affecting 4,334 of the 9,562 subjects; RARE, a rare threshold effect (BWK > 10), affecting 1,550 subjects; and INTER, an interaction (BYR x GYR) affecting 482 subjects. For three of the artificial sites the complex effect was weak (LO), and for the other three it was strong (HI). For each subject and each artificial site, a random choice as to whether that subject was a positive case for that site was made, based on the probability given by the model for the artificial site. Models were fitted to these sites in the same way as to the real sites, and only subjects without cancer at a smoking-related site were used as controls.
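A sketch of this construction (illustrative Python; the trend coefficients and the distributions of the habit variables are placeholders, not the fitted LUNGSQ values, so the case counts will not match those reported below):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 9562
# Hypothetical standardized trend variables and a skewed beers-per-week count,
# so that BWK > 1 is common and BWK > 10 is rarer.
age, cyr, byr, gyr = rng.normal(size=(4, n))
bwk = rng.gamma(1.5, 4.0, n)

def simulate_site(effect, strength):
    """Bernoulli outcomes from a logistic risk model: a half-strength
    LUNGSQ-like trend plus one complex (threshold or interaction) effect."""
    trend = -5.0 + 0.4 * age + 0.6 * cyr     # placeholder trend coefficients
    logit = 0.5 * trend + strength * effect
    return rng.binomial(1, 1 / (1 + np.exp(-logit)))

freq = simulate_site((bwk > 1).astype(float), strength=0.7)    # frequent threshold
rare = simulate_site((bwk > 10).astype(float), strength=1.5)   # rare threshold
inter = simulate_site(byr * gyr, strength=0.8)                 # interaction BYR x GYR
print(freq.sum(), rare.sum(), inter.sum())
```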
[Figure 2 appears here: a dot plot with one row per model (NULL, FREQ-TRUE, RARE-TRUE, INTER-TRUE, ONLYLIN-STEP, FULL-STEP, TREE-BAGGED, NN-ESTOP-BAGGED, NN-BAYES-CMTT) and panels for FREQ-LO (263/128), FREQ-HI (482/253), RARE-LO (440/210), RARE-HI (402/218), INTER-LO (245/115), and INTER-HI (564/274); the horizontal axes run from 0 to 60 percent.]

Figure 2: Percent improvement in deviance on test data for the artificial sites.

For comparison purposes, logistic models containing the true set of variables, including non-linearities and interactions, were fitted to the artificial data. For example, the model RARE-TRUE contained the continuous variables AGE, AGESQ, CDAY, CYR, and CYQUIT, and the binary variables SMOKE and BWK>10.

Figure 2 shows performance on the artificial data. The neural networks and bagged trees were very effective at detecting non-linearities and interactions. Their performance was at the same level as the appropriate true models, while the performance of simple models lacking the ability to fit the complexities (e.g., FULL-STEP) was considerably worse.

6 Conclusions

For predicting the risk of cancer in our data, neural networks with Bayesian estimation of regularization parameters to control overfitting performed consistently but only slightly better than logistic regression models. This appeared to be due to the lack of complex relationships in the data: on artificial data with complex relationships they performed markedly better than logistic models. Good control of overfitting is essential for this task, as shown by the poor performance of neural networks with the number of hidden units chosen by cross-validation.
Given their ability to not overfit while still identifying complex relationships, we expect that neural networks could prove useful in epidemiological data analysis by providing a method for checking that a simple statistical model is not missing important complex relationships.

Acknowledgments

This research was funded by grants from the Workers' Compensation Board of British Columbia, NSERC, and IRIS, and conducted at the BC Cancer Agency.

References

Baxt, W. G. 1995. Application of artificial neural networks to clinical medicine. The Lancet, 346:1135-1138.

Breiman, L. 1996. Bagging predictors. Machine Learning, 24(2):123-140.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.

Burke, H., Rosen, D., and Goodman, P. 1995. Comparing the prediction accuracy of artificial neural networks and other statistical methods for breast cancer survival. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 1063-1067, Cambridge, MA. MIT Press.

Hastie, T. J. and Tibshirani, R. J. 1990. Generalized Additive Models. Chapman and Hall, London.

Jefferson, M. F., Pendleton, N., Lucas, S., and Horan, M. A. 1995. Neural networks (letter). The Lancet, 346:1712.

Lippmann, R., Lee, Y., and Shahian, D. 1995. Predicting the risk of complications in coronary artery bypass operations using neural networks. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 1055-1062, Cambridge, MA. MIT Press.

MacKay, D. J. C. 1995. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6:469-505.

Michie, D., Spiegelhalter, D., and Taylor, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Hertfordshire, UK.

StatSci 1995. S-Plus Guide to Statistical and Mathematical Analyses, Version 3.3. StatSci, a division of MathSoft, Inc., Seattle.

Wyatt, J. 1995. Nervous about artificial neural networks? (commentary). The Lancet, 346:1175-1177.
", "award": [], "sourceid": 1252, "authors": [{"given_name": "Tony", "family_name": "Plate", "institution": null}, {"given_name": "Pierre", "family_name": "Band", "institution": null}, {"given_name": "Joel", "family_name": "Bert", "institution": null}, {"given_name": "John", "family_name": "Grace", "institution": null}]}