{"title": "Is Learning The n-th Thing Any Easier Than Learning The First?", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 646, "abstract": null, "full_text": "Is Learning The n-th Thing Any Easier Than \n\nLearning The First? \n\nSebastian Thrun I \n\nComputer Science Department \n\nCarnegie Mellon University \nPittsburgh, PA  15213-3891 \n\nWorld Wide Web:  http://www.cs.cmu.edul'''thrun \n\nAbstract \n\nThis paper investigates learning in  a lifelong context.  Lifelong learning \naddresses  situations  in  which  a  learner  faces  a  whole  stream  of learn(cid:173)\ning tasks.  Such scenarios provide the opportunity to transfer knowledge \nacross multiple learning tasks, in order to generalize more accurately from \nless  training data.  In  this paper,  several  different approaches  to lifelong \nlearning  are  described,  and  applied in  an  object recognition domain.  It \nis  shown  that  across  the  board,  lifelong learning  approaches  generalize \nconsistently  more  accurately  from  less  training data,  by  their  ability  to \ntransfer knowledge across learning tasks. \n\n1  Introduction \n\nSupervised learning is concerned with approximating an unknown function based on exam(cid:173)\nples.  Virtually all current approaches to supervised learning assume that one is given  a set \nof input-output examples, denoted by X, which characterize an unknown function, denoted \nby f. The target function f  is drawn from a class of functions, F, and the learner is given a \nspace of hypotheses, denoted by H, and an order (preference/prior) with which it considers \nthem during learning.  For example,  H  might be the space of functions represented by  an \nartificial neural network with different weight vectors. \nWhile  this formulation establishes a rigid framework  for research  in  machine learning,  it \ndismisses  important aspects  that are  essential  for  human  learning.  
Psychological studies have shown that humans often employ more than just the training data for generalization. They are often able to generalize correctly even from a single training example [2, 10]. One of the key aspects of the learning problem faced by humans, which differs from the vast majority of problems studied in the field of neural network learning, is the fact that humans encounter a whole stream of learning problems over their entire lifetime. When faced with a new thing to learn, humans can usually exploit an enormous amount of training data and experiences that stem from other, related learning tasks. For example, when learning to drive a car, years of learning experience with basic motor skills, typical traffic patterns, logical reasoning, language and much more precede and influence this learning task. The transfer of knowledge across learning tasks seems to play an essential role for generalizing accurately, particularly when training data is scarce. \n\n¹Also affiliated with: Institut für Informatik III, Universität Bonn, Römerstr. 164, Germany \n\nA framework for the study of the transfer of knowledge is the lifelong learning framework. In this framework, it is assumed that a learner faces a whole collection of learning problems over its entire lifetime. Such a scenario opens the opportunity for synergy: when facing its n-th learning task, a learner can re-use knowledge gathered in its previous n − 1 learning tasks to boost the generalization accuracy. \nIn this paper we are interested in the simplest version of the lifelong learning problem, in which the learner faces a family of concept learning tasks. More specifically, the functions to be learned over the lifetime of the learner, denoted by f_1, f_2, f_3, ... ∈ F, are all of the type f : I → {0, 1} and sampled from F. 
Each function f ∈ {f_1, f_2, f_3, ...} is an indicator function that defines a particular concept: a pattern x ∈ I is a member of this concept if and only if f(x) = 1. When learning the n-th indicator function, f_n, the training set X contains examples of the type (x, f_n(x)) (which may be distorted by noise). In addition to the training set, the learner is also given n − 1 sets of examples of other concept functions, denoted by X_k (k = 1, ..., n − 1). Each X_k contains training examples that characterize f_k. Since this additional data is desired to support learning f_n, X_k is called a support set for the training set X. \nAn example of the above is the recognition of faces [5, 7]. When learning to recognize the n-th person, say f_Bob, the learner is given a set of positive and negative examples of face images of this person. In lifelong learning, it may also exploit training information stemming from other persons, such as f ∈ {f_Rich, f_Mike, f_Dave, ...}. The support sets usually cannot be used directly as training patterns when learning a new concept, since they describe different concepts (hence have different class labels). However, certain features (like the shape of the eyes) are more important than others (like the facial expression, or the location of the face within the image). Once the invariances of the domain are learned, they can be transferred to new learning tasks (new people) and hence improve generalization. \nTo illustrate the potential importance of related learning tasks in lifelong learning, this paper does not present just one particular approach to the transfer of knowledge. Instead, it describes several, all of which extend conventional memory-based or neural network algorithms. These approaches are compared with more traditional learning algorithms, i.e., those that do not transfer knowledge. 
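Before turning to concrete algorithms, the data layout of this setting can be sketched in a few lines. This is purely illustrative; the container structure, function name, and toy patterns are my own, not the paper's:

```python
def make_lifelong_data(concept_examples, current_concept):
    """Split labeled data into the training set X for the current
    (n-th) concept and support sets X_1, ..., X_{n-1} for the others.

    concept_examples: dict mapping a concept name to a list of
    (pattern, label) pairs, with label 1 for members and 0 otherwise.
    """
    X = concept_examples[current_concept]
    support_sets = [examples
                    for name, examples in concept_examples.items()
                    if name != current_concept]
    return X, support_sets

# Toy face-recognition example in the spirit of Section 1
# (patterns are stand-in feature tuples, not real images):
concepts = {
    "Bob":  [((0.9, 0.1), 1), ((0.2, 0.8), 0)],
    "Rich": [((0.8, 0.2), 1), ((0.1, 0.9), 0)],
    "Mike": [((0.7, 0.3), 1), ((0.3, 0.7), 0)],
}
X, support_sets = make_lifelong_data(concepts, "Bob")
```

Note that the support sets carry labels for other concepts, which is why they cannot simply be appended to X.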
The goal of this research is to demonstrate that, independent of a particular learning approach, more complex functions can be learned from less training data if learning is embedded into a lifelong context. \n\n2 Memory-Based Learning Approaches \n\nMemory-based algorithms memorize all training examples explicitly and interpolate them at query-time. We will first sketch two simple, well-known approaches to memory-based learning, then propose extensions that take the support sets into account. \n\n2.1 Nearest Neighbor and Shepard's Method \n\nProbably the most widely used memory-based learning algorithm is K-nearest neighbor (KNN) [15]. Suppose x is a query pattern for which we would like to know the output y. KNN searches the set of training examples X for those K examples (x_i, y_i) ∈ X whose input patterns x_i are nearest to x (according to some distance metric, e.g., the Euclidean distance). It then returns the mean output value (1/K) Σ y_i of these nearest neighbors. \nAnother commonly used method, which is due to Shepard [13], averages the output values of all training examples but weights each example according to the inverse distance to the query point x: \n\n    s(x) := ( Σ_{(x_i, y_i) ∈ X} y_i / (||x − x_i|| + ε) ) · ( Σ_{(x_i, y_i) ∈ X} 1 / (||x − x_i|| + ε) )^(−1)    (1) \n\nHere ε > 0 is a small constant that prevents division by zero. Plain memory-based learning uses exclusively the training set X for learning. There is no obvious way to incorporate the support sets, since they carry the wrong class labels. \n\n2.2 Learning A New Representation \n\nThe first modification of memory-based learning proposed in this paper employs the support sets to learn a new representation of the data. More specifically, the support sets are employed to learn a function, denoted by g : I → I', which maps input patterns in I to a new space, I'. 
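Stepping back to Section 2.1 for a moment, both plain memory-based learners can be written in a few lines. The sketch below is a minimal illustration under the Euclidean metric; the function names, the choices of K and ε, and the toy data are my own, not the paper's:

```python
import math

def knn_predict(X, x, K=3):
    """K-nearest neighbor: mean label of the K training points closest to x."""
    nearest = sorted(X, key=lambda ex: math.dist(ex[0], x))[:K]
    return sum(y for _, y in nearest) / K

def shepard_predict(X, x, eps=1e-6):
    """Shepard's method (Eq. 1): inverse-distance-weighted mean of all labels."""
    weights = [1.0 / (math.dist(xi, x) + eps) for xi, _ in X]
    return sum(w * y for w, (_, y) in zip(weights, X)) / sum(weights)

# Four labeled 2-D patterns; positives cluster on the right.
X = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
```

A query near the positive cluster, e.g. `shepard_predict(X, (1.0, 0.9))`, yields a value above 0.5; thresholding at 0.5 gives the predicted class. Returning to the learned representation g of Section 2.2: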
This new space I' forms the input space for a memory-based algorithm. \nObviously, the key property of a good data representation is that multiple examples of a single concept should have a similar representation, whereas the representation of an example and a counterexample of a concept should be more different. This property can directly be transformed into an energy function for g: \n\n    E := Σ_{k=1}^{n−1} Σ_{(x, y=1) ∈ X_k} ( Σ_{(x', y'=1) ∈ X_k} ||g(x) − g(x')|| − Σ_{(x', y'=0) ∈ X_k} ||g(x) − g(x')|| )    (2) \n\nAdjusting g to minimize E forces the distance between pairs of examples of the same concept to be small, and the distance between an example and a counterexample of a concept to be large. In our implementation, g is realized by a neural network and trained using the Back-Propagation algorithm [12]. \nNotice that the new representation, g, is obtained through the support sets. Assuming that the learned representation is appropriate for new learning tasks, standard memory-based learning can be applied using this new representation when learning the n-th concept. \n\n2.3 Learning A Distance Function \n\nAn alternative way for exploiting support sets to improve memory-based learning is to learn a distance function [3, 9]. This approach learns a function d : I × I → [0, 1] which accepts two input patterns, say x and x', and outputs whether x and x' are members of the same concept, regardless of what the concept is. Training examples for d are \n\n    ((x, x'), 1)  if y = y' = 1 \n    ((x, x'), 0)  if (y = 1 ∧ y' = 0) or (y = 0 ∧ y' = 1). \n\nThey are derived from pairs of examples (x, y), (x', y') ∈ X_k taken from a single support set X_k (k = 1, ..., n − 1). In our implementation, d is an artificial neural network trained with Back-Propagation. Notice that the training examples for d lack information concerning the concept for which they were originally derived. Hence, all support sets can be used to train d. 
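The pair-construction rule above can be sketched directly; training the network d itself is omitted. The function name and data layout are my own, for illustration only:

```python
from itertools import combinations

def distance_training_pairs(support_sets):
    """Turn support sets into ((x, x'), target) examples for d:
    target 1 for two positive examples of the same concept,
    target 0 for a positive/negative pair from the same support set.
    Pairs drawn from two different support sets are never formed."""
    pairs = []
    for X_k in support_sets:
        for (x, y), (x2, y2) in combinations(X_k, 2):
            if y == 1 and y2 == 1:
                pairs.append(((x, x2), 1))
            elif y != y2:
                pairs.append(((x, x2), 0))
    return pairs
```

Note that the concept index k never appears in the output, which is exactly why all n − 1 support sets can be pooled into one training set for d.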
After training, d can be interpreted as the probability that two patterns x, x' ∈ I are examples of the same concept. \nOnce trained, d can be used as a generalized distance function for a memory-based approach. Suppose one is given a training set X and a query point x ∈ I. Then, for each positive example (x', y' = 1) ∈ X, d(x, x') can be interpreted as the probability that x is a member of the target concept. Votes from multiple positive examples (x_1, 1), (x_2, 1), ... ∈ X are combined using Bayes' rule, yielding \n\n    Prob(f_n(x) = 1) := 1 − ( 1 + Π_{(x', y'=1) ∈ X} d(x, x') / (1 − d(x, x')) )^(−1)    (3) \n\nNotice that d is not a distance metric. It generalizes the notion of a distance metric, because the triangle inequality need not hold, and because an example of the target concept x' can provide evidence that x is not a member of that concept (if d(x, x') < 0.5). \n\n3 Neural Network Approaches \n\nTo make our comparison more complete, we will now briefly describe approaches that rely exclusively on artificial neural networks for learning f_n. \n\n3.1 Back-Propagation \n\nStandard Back-Propagation can be used to learn the indicator function f_n, using X as the training set. This approach does not employ the support sets, hence it is unable to transfer knowledge across learning tasks. \n\n3.2 Learning With Hints \n\nLearning with hints [1, 4, 6, 16] constructs a neural network with n output units, one for each function f_k (k = 1, 2, ..., n). This network is then trained to simultaneously minimize the error on both the support sets {X_k} and the training set X. By doing so, the internal representation of this network is not only determined by X but also shaped through the support sets {X_k}. If similar internal representations are required for all functions f_k (k = 1, 2, ..., n), the support sets provide additional training examples for the internal representation. \n\n3.3 Explanation-Based Neural Network Learning \n\nThe last method described here uses the explanation-based neural network learning algorithm (EBNN), which was originally proposed in the context of reinforcement learning [8, 17]. EBNN trains an artificial neural network, denoted by h : I → [0, 1], just like Back-Propagation. However, in addition to the target values given by the training set X, EBNN estimates the slopes (tangents) of the target function f_n for each example in X. More specifically, training examples in EBNN are of the sort (x, f_n(x), ∇_x f_n(x)), which are fit using the Tangent-Prop algorithm [14]. The input x and target value f_n(x) are taken from the training set X. The third term, the slope ∇_x f_n(x), is estimated using the learned distance function d described above. Suppose (x', y' = 1) ∈ X is a (positive) training example. Then the function d_x' : I → [0, 1] with d_x'(z) := d(z, x') maps a single input pattern to [0, 1], and is an approximation to f_n. Since d(z, x') is represented by a neural network and neural networks are differentiable, the gradient ∂d_x'(z)/∂z is an estimate of the slope of f_n at z. Setting z := x yields the desired estimate of ∇_x f_n(x). As stated above, both the target value f_n(x) and the slope vector ∇_x f_n(x) are fit using the Tangent-Prop algorithm for each training example x ∈ X. \nThe slope ∇_x f_n provides additional information about the target function f_n. Since d is learned using the support sets, the EBNN approach transfers knowledge from the support sets to the new learning task. EBNN relies on the assumption that d is accurate enough to yield helpful sensitivity information. 
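The slope-extraction step can be illustrated numerically. EBNN differentiates the learned network d analytically; as a stand-in, the sketch below uses a smooth toy distance function and central finite differences. All names and the toy d are assumptions for illustration, not the paper's implementation:

```python
import math

def slope_estimate(d, x_pos, z, h=1e-5):
    """Numerical gradient of d_xpos(z) = d(z, x_pos) with respect to z.
    Since d_xpos approximates f_n near a positive example x_pos, this
    gradient serves as the slope target handed to Tangent-Prop."""
    grad = []
    for i in range(len(z)):
        zp = list(z); zp[i] += h
        zm = list(z); zm[i] -= h
        grad.append((d(zp, x_pos) - d(zm, x_pos)) / (2.0 * h))
    return grad

def toy_d(z, x):
    """Stand-in for the learned distance function: a smooth bump that is
    1 when z == x and decays with squared Euclidean distance."""
    return math.exp(-sum((zi - xi) ** 2 for zi, xi in zip(z, x)))
```

Evaluating `slope_estimate(toy_d, [0.0, 0.0], [0.5, 0.0])` returns a gradient pointing back toward the positive example, which is precisely the sensitivity information EBNN feeds to Tangent-Prop alongside the target value.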
However, since EBNN fits both training patterns (values) and slopes, misleading slopes can be overridden by training examples. See [17] for a more detailed description of EBNN and further references. \n\n4 Experimental Results \n\nAll approaches were tested using a database of color camera images of different objects (see Fig. 1). Each of the objects in the database has a distinct color or size. \n\nFigure 1: The support sets were compiled out of a hundred images of a bottle, a hat, a hammer, a coke can, and a book. The n-th learning task involves distinguishing the shoe from the sunglasses. Images were subsampled to a 100x100 pixel matrix (each pixel has a color, saturation, and a brightness value), shown on the right side. \n\nThe n-th learning task was the recognition of one of these objects, namely the shoe. The previous n − 1 learning tasks correspond to the recognition of five other objects, namely the bottle, the hat, the hammer, the coke can, and the book. To ensure that the latter images could not be used simply as additional training data for f_n, the only counterexample of the shoe was the seventh object, the sunglasses. 
Hence, the training set for f_n contained images of the shoe and the sunglasses, and the support sets contained images of the other five objects. The object recognition domain is a good testbed for the transfer of knowledge in lifelong learning. This is because finding a good approximation to f_n involves recognizing the target object invariant of rotation, translation, scaling in size, change of lighting, and so on. Since these invariances are common to all object recognition tasks, images showing other objects can provide additional information and boost the generalization accuracy. \nTransfer of knowledge is most important when training data is scarce. Hence, in an initial experiment we tested all methods using a single image of the shoe and the sunglasses only. Those methods that are able to transfer knowledge were also provided 100 images of each of the other five objects. The results are intriguing. The generalization accuracies \n\n    KNN:            60.4% ± 8.3% \n    Shepard:        60.4% ± 8.3% \n    Back-Prop:      59.7% ± 9.0% \n    hints:          62.1% ± 11.1% \n    repr. g+Shep.:  75.2% ± 18.9% \n    distance d:     74.4% ± 18.5% \n    EBNN:           74.8% ± 10.2% \n\nillustrate that the approaches that transfer knowledge (the last four above) generalize better than those that do not. With the exception of the hint learning technique, the approaches can be grouped into two categories: those which classify approximately 60% of the testing set correctly, and those which achieve approximately 75% generalization accuracy. The former group contains the standard supervised learning algorithms, and the latter contains the \"new\" algorithms proposed here, which are capable of transferring knowledge. The differences within each group are statistically not significant, while the differences between the groups are (at the 95% level). 
Notice that random guessing classifies 50% of the testing examples correctly. \nThese results suggest that the generalization accuracy depends only weakly on the particular choice of the learning algorithm (memory-based vs. neural networks). Instead, the main factor determining the generalization accuracy is whether or not knowledge is transferred from past learning tasks. \n\nFigure 2: Generalization accuracy as a function of the number of training examples, measured on an independent test set and averaged over 100 experiments. 95%-confidence bars are also displayed. (Curves shown: Back-Propagation, Shepard's method, Shepard's method with representation g, and the distance function d.) \n\nWhat happens as more training data arrives? Fig. 2 shows generalization curves with increasing numbers of training examples for some of these methods. As the number of training examples increases, prior knowledge becomes less important. After presenting 20 training examples, the results \n\n    KNN:            81.0% ± 3.4% \n    Shepard:        70.5% ± 4.9% \n    Back-Prop:      88.4% ± 2.5% \n    hints:          not available \n    repr. g+Shep.:  81.7% ± 2.7% \n    distance d:     87.3% ± 0.9% \n    EBNN:           90.8% ± 2.7% \n\nillustrate that some of the standard methods (especially Back-Propagation) generalize about as accurately as those methods that exploit support sets. Here the differences in the underlying learning mechanisms become more dominant. 
However, when comparing lifelong learning methods with their corresponding standard approaches, the latter are still inferior: Back-Propagation (88.4%) is outperformed by EBNN (90.8%), and Shepard's method (70.5%) generalizes less accurately than its counterparts that learn the representation (81.7%) or the distance function (87.3%). All these differences are significant at the 95% confidence level. \n\n5 Discussion \n\nThe experimental results reported in this paper provide evidence that learning becomes easier when embedded in a lifelong learning context. By transferring knowledge across related learning tasks, a learner can become \"more experienced\" and generalize better. To test this conjecture in a more systematic way, a variety of learning approaches were evaluated and compared with methods that are unable to transfer knowledge. It is consistently found that lifelong learning algorithms generalize significantly more accurately, particularly when training data is scarce. \nNotice that these results are well in tune with other results obtained by the author. One of the approaches here, EBNN, has extensively been studied in the context of robot perception [11], reinforcement learning for robot control, and chess [17]. In all these domains, it has consistently been found to generalize better from less training data by transferring knowledge from previous learning tasks. The results are also consistent with observations made about human learning [2, 10], namely that previously learned knowledge plays an important role in generalization, particularly when training data is scarce. [18] extends these techniques to situations where most support sets are not related. \nHowever, lifelong learning rests on the assumption that more than a single task is to be learned, and that learning tasks are appropriately related. 
Lifelong learning algorithms are particularly well-suited in domains where the cost of collecting training data is the dominating factor in learning, since these costs can be amortized over several learning tasks. Such domains include, for example, autonomous service robots which are to learn and improve over their entire lifetime. They include personal software assistants which have to perform various tasks for various users. Pattern recognition, speech recognition, time series prediction, and database mining might be other potential application domains for the techniques presented here. \n\nReferences \n\n[1] Y. S. Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6:192-198, 1990. \n[2] W.-K. Ahn and W. F. Brewer. Psychological studies of explanation-based learning. In G. DeJong, editor, Investigating Explanation-Based Learning. Kluwer Academic Publishers, Boston/Dordrecht/London, 1993. \n[3] C. A. Atkeson. Using locally weighted regression for robot learning. In Proceedings of the 1991 IEEE International Conference on Robotics and Automation, pages 958-962, Sacramento, CA, April 1991. \n[4] J. Baxter. Learning internal representations. In Proceedings of the Conference on Computational Learning Theory, 1995. \n[5] D. Beymer and T. Poggio. Face recognition from one model view. In Proceedings of the International Conference on Computer Vision, 1995. \n[6] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In P. E. Utgoff, editor, Proceedings of the Tenth International Conference on Machine Learning, pages 41-48, San Mateo, CA, 1993. Morgan Kaufmann. \n[7] M. Lando and S. Edelman. Generalizing from a single view in face recognition. 
Technical Report CS-TR 95-02, Department of Applied Mathematics and Computer Science, The Weizmann Institute of Science, Rehovot 76100, Israel, January 1995. \n[8] T. M. Mitchell and S. Thrun. Explanation-based neural network learning for robot control. In S. J. Hanson, J. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 287-294, San Mateo, CA, 1993. Morgan Kaufmann. \n[9] A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force to choose features, smoothers and function approximators. In S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, Volume 3. MIT Press, 1992. \n[10] Y. Moses, S. Ullman, and S. Edelman. Generalization across changes in illumination and viewing position in upright and inverted faces. Technical Report CS-TR 93-14, Department of Applied Mathematics and Computer Science, The Weizmann Institute of Science, Rehovot 76100, Israel, 1993. \n[11] J. O'Sullivan, T. M. Mitchell, and S. Thrun. Explanation-based neural network learning from mobile robot perception. In K. Ikeuchi and M. Veloso, editors, Symbolic Visual Learning. Oxford University Press, 1995. \n[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, Vol. I + II. MIT Press, 1986. \n[13] D. Shepard. A two-dimensional interpolation function for irregularly spaced data. In 23rd National Conference ACM, pages 517-523, 1968. \n[14] P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. P. 
Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 895-903, San Mateo, CA, 1992. Morgan Kaufmann. \n[15] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29(12):1213-1228, December 1986. \n[16] S. C. Suddarth and A. Holden. Symbolic neural systems and the use of hints for developing complex systems. International Journal of Man-Machine Studies, 35, 1991. \n[17] S. Thrun. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. Kluwer Academic Publishers, Boston, MA, 1996. To appear. \n[18] S. Thrun and J. O'Sullivan. Clustering learning tasks and the selective cross-task transfer of knowledge. Technical Report CMU-CS-95-209, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213, November 1995. \n", "award": [], "sourceid": 1034, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}