{"title": "Competitive On-line Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 364, "page_last": 370, "abstract": "", "full_text": "Competitive On-Line Linear Regression \n\nV. Vovk \n\npepartment of Computer Science \n\nRoyal Holloway, University of London \n\nEgham, Surrey TW20 OEX,  UK \n\nvovkGdcs.rhbnc.ac.uk \n\nAbstract \n\nWe apply a general algorithm for merging prediction strategies (the \nAggregating Algorithm) to the problem of linear regression with the \nsquare loss;  our  main assumption  is  that the response variable is \nbounded.  It turns out that for  this particular problem the Aggre(cid:173)\ngating Algorithm resembles, but is slightly different from,  the well(cid:173)\nknown ridge estimation procedure.  From general results about the \nAggregating Algorithm we  deduce a  guaranteed bound on the dif(cid:173)\nference between our algorithm's performance and the best, in some \nsense,  linear regression function's  performance.  We  show that the \nAA  attains the optimal constant  in  our  bound,  whereas  the con(cid:173)\nstant attained by the ridge regression procedure in general can be \n4 times worse. \n\n1 \n\nINTRODUCTION \n\nThe  usual  approach  to  regression  problems  is  to assume  that  the  data  are  gen(cid:173)\nerated  by  some  stochastic  mechanism  and  make  some,  typically  very  restrictive, \nassumptions about that stochastic mechanism.  In recent years, however, a different \napproach to this kind of problems was developed  (see,  e.g., DeSantis et al.  [2],  Lit(cid:173)\ntlestone and Warmuth  [7]):  in our context,  that approach sets the goal  of finding \nan on-line algorithm that performs not much  worse than the best regression func(cid:173)\ntion found  off-line;  in other words, it replaces the usual statistical analyses by the \ncompetitive analysis of on-line algorithms. \n\nDeSantis et al.  [2]  performed a competitive analysis of the Bayesian merging scheme \nfor  the log-loss prediction game;  later Littlestone and  Warmuth  [7]  and Vovk  [10] \nintroduced  an on-line algorithm  (called  the  Weighted  Majority  Algorithm  by  the \n\n\fCompetitive On-line Linear Regression \n\n365 \n\nformer authors)  for  the simple binary prediction game.  These two algorithms (the \nBayesian merging scheme and the Weighted  Majority Algorithm) are special cases \nof the Aggregating Algorithm  (AA)  proposed  in  [9,  11].  The AA  is  a  member  of \na wide family  of algorithms called  \"multiplicative weight\"  or  \"exponential weight\" \nalgorithms. \n\nCloser  to the topic of this  paper,  Cesa-Bianchi et al.  [1)  performed  a  competitive \nanalysis,  under  the square loss,  of the standard  Gradient  Descent  Algorithm  and \nKivinen  and  Warmuth  [6]  complemented  it  by  a  competitive analysis  of a  modi(cid:173)\nfication  of the  Gradient Descent,  which  they call  the Exponentiated  Gradient Al(cid:173)\ngorithm.  The bounds  obtained  in  [1,  6]  are of the  following  type:  at every  trial \nT, \n\n(1) \nwhere  LT  is  the loss  (over  the  first  T  trials)  of the on-line  algorithm,  LT  is  the \nloss  of the  best  (by  trial T)  linear  regression  function,  and  c  is  a  constant,  c  > \n1;  specifically,  c = 2 for  the  Gradient  Descent  and  c = 3 for  the  Exponentiated \nGradient.  
These bounds hold under the following assumptions: for the Gradient Descent, it is assumed that the L_2 norms of the weights and of all data items are bounded by the constant 1; for the Exponentiated Gradient, that the L_1 norm of the weights and the L_∞ norm of all data items are bounded by 1.

In many interesting cases bound (1) is weak. For example, suppose that our comparison class contains a "true" regression function, but its values are corrupted by i.i.d. noise. Then, under reasonable assumptions about the noise, L_T^* will grow linearly in T, and inequality (1) will only bound the difference L_T - L_T^* by a linear function of T; a small numerical illustration of this point is given at the end of this section. (Though in other situations bound (1) can be better than our bound (2), see below. For example, in the case of the Exponentiated Gradient, the O(1) in (1) depends on the number of parameters n logarithmically, whereas our bound depends on n linearly.)

In this paper we will apply the AA to the problem of linear regression. The AA has been proven to be optimal in some simple cases [5, 11], so we can also expect good performance in the problem of linear regression. The following is a typical result that can be obtained using the AA: Learner has a strategy which ensures that always

    L_T \le L_T^* + n \ln(T+1) + 1                                          (2)

(n is the number of predictor variables). It is interesting that the assumptions for the last inequality are weaker than those for both the Gradient Descent and the Exponentiated Gradient: we only assume that the L_2 norm of the weights and the L_∞ norm of all data items are bounded by the constant 1 (these assumptions will be further relaxed later on). The norms L_2 and L_∞ are not dual, which casts doubt on the accepted intuition that the weights and data items should be measured by dual norms (such as L_1-L_∞ or L_2-L_2).

Notice that the logarithmic term n ln(T+1) of (2) is similar to the term (n/2) ln T occurring in the analysis of the log-loss game and its generalizations, in particular in Wallace's theory of minimum message length, Rissanen's theory of stochastic complexity, and minimax regret analysis. In the case n = 1 and x_t = 1, ∀t, inequality (2) differs from Freund's [4] Theorem 4 only in the additive constant. In this paper we will see another manifestation of a phenomenon noticed by Freund [4]: for some important problems, the adversarial bounds of on-line competitive learning theory are only a tiny amount worse than the average-case bounds for some stochastic strategies for Nature.

A weaker variant of inequality (2) can be deduced from Foster's [3] Theorem 1 (if we additionally assume that the response variable takes only two values, -1 or 1): Foster's result implies

    L_T \le L_T^* + 8n \ln(2n(T+1)) + 8

(a multiple of 4 arises from replacing Foster's set {0, 1} of possible values of the response variable by our {-1, 1}; we also replaced Foster's d by 2n: to span our set of possible weights we need 2n of Foster's predictors).

Inequality (2) is also similar to Yamanishi's [12] result; in that paper, he considers a more general framework than ours but does not attempt to find optimal constants.
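To conclude this section, here is the small numerical illustration promised above. The sketch and its numbers are ours and purely illustrative (they are not taken from the analyses in [1, 6]); it simply compares the regret permitted by a multiplicative bound of type (1) with the additive logarithmic term of (2) when the best expert's loss grows linearly in T:

    import math

    # Illustrative numbers (assumptions, not from the paper): n predictor
    # variables, i.i.d. noise of variance sigma2, horizon T, and c = 2 as in
    # the Gradient Descent bound of type (1).
    n, sigma2, c, T = 10, 0.25, 2, 100_000

    best_loss = sigma2 * T               # the best expert's loss L*_T grows linearly in T
    slack_mult = (c - 1) * best_loss     # regret permitted by L_T <= c L*_T + O(1)
    slack_add = n * math.log(T + 1) + 1  # regret guaranteed by bound (2)

    print(slack_mult, slack_add)         # 25000.0 versus roughly 116.1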
2  ALGORITHM

We consider the following protocol of interaction between Learner and Nature:

FOR t = 1, 2, ...
  Nature chooses x_t ∈ R^n
  Learner chooses prediction p_t ∈ R
  Nature chooses y_t ∈ [-Y, Y]
END FOR.

This is a "perfect-information" protocol: either player can see the other player's moves. The parameters of our protocol are: a fixed positive integer n (the dimensionality of our regression problem) and an upper bound Y > 0 on the values y_t returned by Nature. It is important, however, that our algorithm for playing this game (on the part of Learner) does not need to know Y.

We will only give a description of our regression algorithm; its derivation from the general AA will be given in the future full version of this paper. (It is usually a nontrivial task to represent the AA in a computationally efficient form, and the case of on-line linear regression is not an exception.) Fix n and a > 0. The algorithm is as follows:

A := aI; b := 0
FOR TRIAL t = 1, 2, ...:
  read new x_t ∈ R^n
  A := A + x_t x_t'
  output prediction p_t := b' A^{-1} x_t
  read new y_t ∈ R
  b := b + y_t x_t
END FOR.

In this description, A is an n x n matrix (which is always symmetric and positive definite), b ∈ R^n, I is the unit n x n matrix, and 0 is the all-0 vector.

The naive implementation of this algorithm would require O(n^3) arithmetic operations at every trial, but the standard recursive technique allows us to spend only O(n^2) arithmetic operations per trial. This is still not as good as for the Gradient Descent Algorithm and the Exponentiated Gradient Algorithm (they require only O(n) operations per trial); we seem to have a trade-off between the quality of bounds on predictive performance and computational efficiency. In the rest of the paper "AA" will mean the algorithm described in the previous paragraph (which is the Aggregating Algorithm applied to a particular uncountable pool of experts with a particular Gaussian prior).
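As an illustration of the algorithm and of the O(n^2) recursive technique just mentioned, here is a short Python sketch. It is not part of the original paper: the variable names are ours, and the Sherman-Morrison rank-one update is only one standard way to maintain A^{-1} incrementally.

    import numpy as np

    def aa_linear_regression(xs, ys, a=1.0):
        # Sketch of the loop above for a (T, n) array of inputs xs and a
        # length-T array of responses ys.  A^{-1} is maintained directly via
        # the Sherman-Morrison rank-one update, so each trial costs O(n^2)
        # operations instead of the O(n^3) needed to invert A from scratch.
        xs = np.asarray(xs, dtype=float)
        ys = np.asarray(ys, dtype=float)
        n = xs.shape[1]
        A_inv = np.eye(n) / a            # A = a I initially
        b = np.zeros(n)
        predictions = []
        for x, y in zip(xs, ys):
            v = A_inv @ x                # A := A + x x'  via Sherman-Morrison on A^{-1}
            A_inv -= np.outer(v, v) / (1.0 + x @ v)
            predictions.append(b @ A_inv @ x)   # p_t := b' A^{-1} x_t (uses the updated A)
            b += y * x                   # b := b + y_t x_t
        return np.array(predictions)

Note that the prediction on each trial is computed from A after the update A := A + x_t x_t'; predicting with the previous A would essentially give the ridge regression forecast discussed in Section 4.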
3  BOUNDS

In this section we state, without proof, results describing the predictive performance of our algorithm. Our comparison class consists of the linear functions y_t = w · x_t, where w ∈ R^n. We will call the possible weights w "experts" (imagine that we have continuously many experts indexed by w ∈ R^n; Expert w always recommends prediction w · x_t to Learner). At every trial t Expert w and Learner suffer loss (y_t - w · x_t)^2 and (y_t - p_t)^2, respectively. Our notation for the total loss suffered by Expert w and Learner over the first T trials will be

    L_T(w) := \sum_{t=1}^T (y_t - w \cdot x_t)^2   and   L_T(Learner) := \sum_{t=1}^T (y_t - p_t)^2,

respectively.

For compact pools of experts (which, in our setting, corresponds to the set of possible weights w being bounded and closed) it is usually possible to derive bounds (such as (2)) where the learner's loss is compared to the best expert's loss. In our case of a non-compact pool, however, we need to give the learner a start on remote experts. Specifically, instead of comparing Learner's performance to inf_w L_T(w), we compare it to inf_w (L_T(w) + a||w||^2) (thus giving Learner a start of a||w||^2 on Expert w), where a > 0 is a constant reflecting our prior expectations about the "complexity" ||w|| := (\sum_{i=1}^n w_i^2)^{1/2} of successful experts.

This idea of giving a start to experts allows us to prove stronger results; e.g., the following elaboration of (2) holds:

    L_T(Learner) \le \inf_w (L_T(w) + \|w\|^2) + n \ln(T+1)                 (3)

(this inequality still assumes that ||x_t||_∞ ≤ 1 for all t, but w is unbounded).

Our notation for the transpose of a matrix A will be A'; as usual, vectors are identified with one-column matrices.

Theorem 1  For any fixed n, Learner has a strategy which ensures that always

    L_T(Learner) \le \inf_w (L_T(w) + a\|w\|^2) + Y^2 \ln\det\Bigl(I + \frac{1}{a} \sum_{t=1}^T x_t x_t'\Bigr).

If, in addition, ||x_t||_∞ ≤ X, ∀t,

    L_T(Learner) \le \inf_w (L_T(w) + a\|w\|^2) + n Y^2 \ln\Bigl(1 + \frac{T X^2}{a}\Bigr).     (4)

The last inequality of this theorem implies inequality (3): it suffices to put X = Y = a = 1.

The term

    \ln\det\Bigl(I + \frac{1}{a} \sum_{t=1}^T x_t x_t'\Bigr)

in Theorem 1 might be difficult to interpret. Notice that it can be rewritten as

    n \ln T + \ln\det\Bigl(\frac{1}{T} I + \frac{1}{a} \mathrm{cov}(X_1, ..., X_n)\Bigr),

where cov(X_1, ..., X_n) is the empirical covariance matrix of the predictor variables (in other words, cov(X_1, ..., X_n) is the covariance matrix of the random vector which takes the values x_1, ..., x_T with equal probability 1/T). We can see that this term is typically close to n ln T.

Using standard transformations, it is easy to deduce from Theorem 1, e.g., the following results (for simplicity we assume n = 1 and x_t, y_t ∈ [-1, 1], ∀t):

•  if the pool of experts consists of all polynomials of degree d, Learner has a strategy guaranteeing

    L_T(Learner) \le \inf_w (L_T(w) + \|w\|^2) + (d + 1) \ln(T+1);

•  if the pool of experts consists of all splines of degree d with k nodes (chosen a priori), Learner has a strategy guaranteeing

    L_T(Learner) \le \inf_w (L_T(w) + \|w\|^2) + (d + k + 1) \ln(T+1).

The following theorem shows that the constant n in inequality (4) cannot be improved.

Theorem 2  Fix n (the number of attributes) and Y (the upper bound on |y_t|). For any ε > 0 there exist a constant C and a stochastic strategy for Nature such that ||x_t||_∞ = 1 and |y_t| = Y, for all t, and, for any stochastic strategy for Learner,

    E\Bigl(L_T(Learner) - \inf_{w: \|w\| \le Y} L_T(w)\Bigr) \ge (n - ε) Y^2 \ln T - C,   ∀T.

4  COMPARISONS

It is easy to see that the ridge regression procedure sometimes gives results that are not sensible in our framework, where y_t ∈ [-Y, Y] and the goal is to compete against the best linear regression function. For example, suppose n = 1, Y = 1, and Nature generates outcomes (x_t, y_t), t = 1, 2, ..., where

    a ≪ x_1 ≪ x_2 ≪ ...,   y_t = 1 if t is odd, -1 if t is even.

At trial t = 2, 3, ... the ridge regression procedure (more accurately, its natural modification which truncates its predictions to [-1, 1]) will give a prediction p_t = y_{t-1} equal to the previous response, and so will suffer a loss of about 4T over T trials. On the other hand, the AA's prediction will be close to 0, and so the cumulative loss of the AA over the first T trials will be about T, which is close to the best expert's loss. We can see that the ridge regression procedure in this situation is forced to suffer a loss 4 times as big as the AA's loss.
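The following short Python simulation (ours, not from the paper; the concrete choices x_t = 100^t and a = 1 are merely one way to realize a ≪ x_1 ≪ x_2 ≪ ...) reproduces this behaviour numerically:

    import numpy as np

    # One-dimensional adversarial example from above: rapidly growing inputs
    # and alternating responses +1, -1.  The concrete values of x_t and a are
    # illustrative assumptions.
    T, a = 30, 1.0
    xs = 100.0 ** np.arange(1, T + 1)                  # a << x_1 << x_2 << ...
    ys = np.array([1.0 if t % 2 else -1.0 for t in range(1, T + 1)])

    ridge_loss = aa_loss = 0.0
    S = a        # running value of a + sum of x_s^2 over past trials
    b = 0.0      # running value of sum of y_s x_s over past trials
    for x, y in zip(xs, ys):
        p_ridge = np.clip(b * x / S, -1.0, 1.0)  # ridge: A without the current x_t, truncated to [-1, 1]
        p_aa = b * x / (S + x * x)               # AA: A already includes x_t x_t'
        ridge_loss += (y - p_ridge) ** 2
        aa_loss += (y - p_aa) ** 2
        S += x * x
        b += y * x

    print(ridge_loss / T, aa_loss / T)   # per-trial losses: about 3.9 and 1.02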
The lower bound stated in Theorem 2 does not imply that our regression algorithm is better than the ridge regression procedure in our adversarial framework. (Moreover, the idea of our proof of Theorem 2 is to lower bound the performance of the ridge regression procedure in the situation where the expected loss of the ridge regression procedure is optimal.) Theorem 1 asserts that

    L_T(Learner) \le \inf_w (L_T(w) + a\|w\|^2) + Y^2 \sum_{i=1}^n \ln\Bigl(1 + \frac{1}{a} \sum_{t=1}^T x_{t,i}^2\Bigr)     (5)

when Learner follows the AA. The next theorem shows that the ridge regression procedure sometimes violates this inequality.

Theorem 3  Let n = 1 (the number of attributes) and Y = 1 (the upper bound on |y_t|); fix a > 0. Nature has a strategy such that, when Learner plays the ridge regression strategy,

    L_T(Learner) = 4T + O(1),                                               (6)
    \inf_w (L_T(w) + a\|w\|^2) = T + O(1),                                  (7)
    \ln\Bigl(1 + \frac{1}{a} \sum_{t=1}^T x_t^2\Bigr) = T \ln 2 + O(1)      (8)

as T → ∞ (and, therefore, (5) is violated).

5  CONCLUSION

A distinctive feature of our approach to linear regression is that our only assumption about the data is that |y_t| ≤ Y, ∀t; we do not make any assumptions about stochastic properties of the data-generating mechanism. In some situations (if the data were generated by a partially known stochastic mechanism) this feature is a disadvantage, but often it will be an advantage.

This paper was greatly influenced by Vapnik's [8] idea of transductive inference. The algorithm analyzed in this paper is "transductive", in the sense that it outputs some prediction p_t for y_t after being given x_t, rather than outputting a general rule for mapping x_t into p_t; in particular, p_t may depend non-linearly on x_t. (It is easy, however, to extract such a rule from the description of the algorithm once it is found.)

Acknowledgments

Kostas Skouras and Philip Dawid noticed that our regression algorithm is different from the ridge regression and that in some situations it behaves very differently. Manfred Warmuth's advice about relevant literature is also gratefully acknowledged.

References

[1] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth (1996), Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent, IEEE Trans. Neural Networks 7:604-619.

[2] A. DeSantis, G. Markowsky, and M. N. Wegman (1988), Learning probabilistic prediction functions, in "Proceedings, 29th Annual IEEE Symposium on Foundations of Computer Science," pp. 110-119, Los Alamitos, CA: IEEE Comput. Soc.

[3] D. P. Foster (1991), Prediction in the worst case, Ann. Statist. 19:1084-1090.

[4] Y. Freund (1996), Predicting a binary sequence almost as well as the optimal biased coin, in "Proceedings, 9th Annual ACM Conference on Computational Learning Theory," pp. 89-98, New York: Assoc. Comput. Mach.

[5] D. Haussler, J. Kivinen, and M. K. Warmuth (1994), Tight worst-case loss bounds for predicting with expert advice, University of California at Santa Cruz, Technical Report UCSC-CRL-94-36, revised December. Short version in "Computational Learning Theory" (P. Vitanyi, Ed.), Lecture Notes in Computer Science, Vol. 904, pp. 69-83, Berlin: Springer, 1995.

[6] J. Kivinen and M. K. Warmuth (1997), Exponentiated Gradient versus Gradient Descent for linear predictors, Inform. Computation 132:1-63.

[7] N. Littlestone and M. K. Warmuth (1994), The Weighted Majority Algorithm, Inform. Computation 108:212-261.
[8] V. N. Vapnik (1995), The Nature of Statistical Learning Theory, New York: Springer.

[9] V. Vovk (1990), Aggregating strategies, in "Proceedings, 3rd Annual Workshop on Computational Learning Theory" (M. Fulk and J. Case, Eds.), pp. 371-383, San Mateo, CA: Morgan Kaufmann.

[10] V. Vovk (1992), Universal forecasting algorithms, Inform. Computation 96:245-277.

[11] V. Vovk (1997), A game of prediction with expert advice, to appear in J. Comput. Inform. Syst. Short version in "Proceedings, 8th Annual ACM Conference on Computational Learning Theory," pp. 51-60, New York: Assoc. Comput. Mach., 1995.

[12] K. Yamanishi (1997), A decision-theoretic extension of stochastic complexity and its applications to learning, submitted to IEEE Trans. Inform. Theory.
", "award": [], "sourceid": 1419, "authors": [{"given_name": "Volodya", "family_name": "Vovk", "institution": null}]}