{"title": "A Smoothed Approximate Linear Program", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 467, "abstract": "We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program -- the 'smoothed approximate linear program' -- relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude.", "full_text": 
"A Smoothed Approximate Linear Program\n\nVijay V. Desai, IEOR, Columbia University, vvd2101@columbia.edu\nVivek F. Farias, MIT Sloan, vivekf@mit.edu\nCiamac C. Moallemi, GSB, Columbia University, ciamac@gsb.columbia.edu\n\nAbstract\n\nWe present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude.\n\n1 Introduction\n\nMany dynamic optimization problems can be cast as Markov decision problems (MDPs) and solved, in principle, via dynamic programming. Unfortunately, this approach is frequently untenable due to the ‘curse of dimensionality’. Approximate dynamic programming (ADP) is an approach which attempts to address this difficulty. ADP algorithms seek to compute good approximations to the dynamic programming optimal cost-to-go function within the span of some pre-specified set of basis functions. The approximate linear programming (ALP) approach to ADP [1, 2] is one such well-recognized approach.\n\nThe program employed in the ALP approach is identical to the LP used for exact computation of the optimal cost-to-go function, with further constraints limiting solutions to the low-dimensional subspace spanned by the basis functions used. The resulting low-dimensional LP implicitly restricts attention to approximations that are lower bounds on the optimal cost-to-go function. While the structure of this program appears crucial in establishing approximation guarantees for the approach, the restriction to lower bounds leads one to ask whether the ALP is the ‘right’ LP
. In particular, could an appropriate relaxation of the feasible region of the ALP allow for better approximations to the cost-to-go function while remaining computationally tractable?\n\nMotivated by this question, the present paper presents a new linear program for ADP we call the ‘smoothed’ ALP (or SALP). The SALP may be viewed as a relaxation of the ALP wherein one is allowed to violate the ALP constraints for any given state. A user-defined ‘violation budget’ parameter controls the ‘expected’ violation across states; a budget of 0 thus yields the original ALP. We specify a choice of this violation budget that yields a relaxation with attractive properties. In particular, we are able to establish strong approximation guarantees for the SALP; these guarantees are substantially stronger than the corresponding guarantees for the ALP.\n\nThe number of constraints and variables in the SALP scale with the size of the MDP state space. We nonetheless establish sample complexity bounds that demonstrate that an appropriate ‘sampled’ SALP provides a good approximation to the SALP solution with a tractable number of sampled MDP states. This sampled program is no more complex than the ‘sampled’ ALP and, as such, we demonstrate that the SALP is essentially no harder to solve than the ALP.\n\nWe present a computational study demonstrating the efficacy of our approach on the game of Tetris. The ALP has been demonstrated to be competitive with several ADP approaches for Tetris (see [3]). In detailed comparisons with the ALP, we estimate that the SALP provides an order of magnitude improvement over controllers designed via that approach for the game of Tetris.\n\n2 Problem Formulation\n\nOur setting is that of a discrete-time, discounted infinite-horizon, cost-minimizing MDP with a finite state space X and finite action space A. Given the state and action at time t, x_t and a_t, a per-stage cost g(x_t, a_t) is incurred. The subsequent state x_{t+1} is determined according to the transition probability kernel P_{a_t}(x_t, ·). A stationary policy μ : X → A is a mapping that determines the action at each time as a function of the state. Given each initial state x_0 = x, the expected discounted cost (cost-to-go function) of the policy μ is given by\n\nJ_μ(x) ≜ E_μ[ Σ_{t=0}^{∞} α^t g(x_t, μ(x_t)) | x_0 = x ],\n\nwhere α ∈ (0, 1) is the discount factor. Denote by P_μ ∈ R^{X×X} the transition probability matrix for the policy μ, whose (x, x′)th entry is P_{μ(x)}(x, x′). Denote by g_μ ∈ R^X the vector whose xth entry is g(x, μ(x)). Then, the cost-to-go function J_μ is the unique solution to the equation T_μ J = J, where the operator T_μ is defined by T_μ J = g_μ + α P_μ J.\n\nThe Bellman operator T can be defined according to TJ = min_μ T_μ J. Bellman’s equation is then the fixed point equation, TJ = J. It is readily shown that the optimal cost-to-go function J* is the unique solution to Bellman’s equation and that a corresponding optimal policy μ* is greedy with respect to J*; i.e., μ* satisfies TJ* = T_{μ*} J*.\n\nBellman’s equation may be solved exactly via the following linear program:\n\n(1) maximize_J ν^⊤ J subject to J ≤ TJ.\n\nHere, ν ∈ R^X is a vector with positive components that are known as the state-relevance weights. The above program is indeed an LP since the constraint J(x) ≤ (TJ)(x) is equivalent to the set of linear constraints J(x) ≤ g(x, a) + α Σ_{x′∈X} P_a(x, x′) J(x′), ∀ a ∈ A. We refer to (1) as the exact LP. Note that if a vector J satisfies J ≤ TJ, then J ≤ T^k J (by monotonicity of the Bellman operator), and thus J ≤ J* (since the Bellman operator is a contraction with unique fixed point J*). Then, every feasible point for (1) is a component-wise lower bound to J*, and J* is the unique optimal solution to the exact LP (1).\n\nFor problems where X is prohibitively large, an ADP algorithm seeks to find a good approximation to J*. Specifically, one considers a collection of basis functions {φ_1, ..., φ_K} where each φ_i : X → R. Defining Φ ≜ [φ_1 φ_2 ... φ_K] to be a matrix with columns consisting of basis functions, one seeks an approximation of the form J_r = Φr, with the hope that J_r ∼ J*. The ALP for this task is then simply\n\n(2) maximize_r ν^⊤ Φr subject to Φr ≤ TΦr.\n\nThe geometric intuition behind the ALP is illustrated in Figure 1(a). Suppose that r_ALP is a vector that is optimal for the ALP. Then the approximate value function Φr_ALP will lie on the subspace spanned by the columns of Φ, as illustrated by the orange line. Φr_ALP will also satisfy the constraints of the exact LP, illustrated by the dark gray region; this implies that Φr_ALP ≤ J*. In other words, the approximate cost-to-go function is necessarily a point-wise lower bound to the true cost-to-go function in the span of Φ.\n\n[Figure 1, panels (a) ALP case and (b) SALP case; axes J(1) and J(2).]\nFigure 1: A cartoon illustrating the feasible set and optimal solution for the ALP and SALP, in the case of a two-state MDP. The axes correspond to the components of the value function. A careful relaxation from the feasible set of the ALP to that of the SALP can yield an improved approximation.\n\n3 The Smoothed ALP\n\nThe J ≤ TJ constraints in the exact LP, which carry over to the ALP, impose a strong restriction on the cost-to-go function approximation: in particular they restrict us to approximations that are lower bounds to J* at every point in the state space. In the case where the state space is very large, and the number of basis functions is (relatively) small, it may be the case that constraints arising from rarely visited or pathological states are binding and influence the optimal solution.\n\nIn many cases, our ultimate goal is not to find a lower bound on the optimal cost-to-go function, but rather a good approximation to J*. In these instances, it may be the case that relaxing the constraints in the ALP so as not to require a uniform lower bound may allow for better overall approximations to the optimal cost-to-go function. This is also illustrated in Figure 1. Relaxing the feasible region of the ALP in Figure 1(a) to the light gray region in Figure 1(b) would yield the point Φr_SALP as an optimal solution. The relaxation in this case is clearly beneficial; it allows us to compute a better approximation to J* than the point Φr_ALP.\n\nCan we construct a fruitful relaxation of this sort in general? The smoothed approximate linear program (SALP) is given by:\n\n(3) maximize_{r,s} ν^⊤ Φr subject to Φr ≤ TΦr + s, π^⊤ s ≤ θ, s ≥ 0.\n\nHere, a vector s ∈ R^X of additional decision variables has been introduced. For each state x, s(x) is a non-negative decision variable (a slack) that allows for violation of the corresponding ALP constraint. The parameter θ ≥ 0 is a non-negative scalar. The parameter π ∈ R^X is a probability distribution known as the constraint violation distribution. The parameter θ is thus a violation budget: the expected violation of the Φr ≤ TΦr constraint, under the distribution π, must be at most θ.\n\nThe balance of the paper is concerned with establishing that the SALP forms the basis of a useful ADP algorithm in large scale problems:\n\n• We identify a concrete choice of violation budget θ and an idealized constraint violation distribution π for which the SALP provides a useful relaxation in that the optimal solution can be a better approximation to the optimal cost-to-go function. This brings the cartoon improvement in Figure 1 to fruition for general problems.\n• We show that the SALP is tractable (i.e., it is well approximated by an appropriate ‘sampled’ version) and present computational experiments for a hard problem (Tetris) illustrating an order of magnitude improvement over the ALP.\n\n4 Analysis\n\nThis section is dedicated to a theoretical analysis of the SALP. The overarching objective of this analysis is to provide some assurance of the soundness of the proposed approach. In addition, our analysis will serve as a crucial guide to practical implementation of the SALP. Our analysis will present two types of results: First, we prove approximation guarantees (Sections 4.1 and 4.2) that will indicate that the SALP computes approximations that are of comparable quality to the projection of J* on the linear span of Φ. Second, we show (Section 4.3) that an implementable ‘sampled’ version of the SALP may be used to approximate the SALP with a tractable number of samples. All proofs can be found in the technical appendix.\n\nIdealized Assumptions: Given the broad scope of problems addressed by ADP algorithms, analyses of such algorithms typically rely on an ‘idealized’ assumption of some sort. In the case of the ALP, one either assumes the ability to solve a linear program with as many constraints as there are states, or absent that, knowledge of a certain idealized sampling distribution, so that one can then proceed with solving a ‘sampled’ version of the ALP. Our analysis of the SALP in this section is predicated on the knowledge of an idealized constraint violation distribution, which is this same idealized sampling distribution. In particular, we will require access to samples drawn according to the distribution π_{μ*,ν} given by π_{μ*,ν}^⊤ ≜ (1 − α) ν^⊤ (I − α P_{μ*})^{−1}. Here ν is an arbitrary initial distribution over states. The distribution π_{μ*,ν} may be interpreted as yielding the discounted expected frequency of visits to a given state when the initial state is distributed according to ν and the system runs under the optimal policy μ*. We note that the ‘sampled’ ALP introduced by de Farias and Van Roy [2] requires access to states sampled according to precisely this distribution.\n\n4.1 A Simple Approximation Guarantee\n\nWe present a first, simple approximation guarantee for the following specialization of the SALP in (3):\n\n(4) maximize_{r,s} ν^⊤ Φr subject to Φr ≤ TΦr + s, π_{μ*,ν}^⊤ s ≤ θ, s ≥ 0.\n\nBefore we proceed to state our result, we define a useful function:\n\n(5) ℓ(r, θ) ≜ minimize_{s,γ} γ subject to Φr − TΦr ≤ s + γ1, π_{μ*,ν}^⊤ s ≤ θ, s ≥ 0.\n\nℓ(r, θ) is the minimum translation (in the direction of the vector 1) of an arbitrary weight vector r so as to result in a feasible vector for (4). We will denote by s(r, θ) the s component of the solution to (5). The following lemma characterizes ℓ(r, θ):\n\nLemma 1. For any r ∈ R^K and θ ≥ 0: (i) ℓ(r, θ) is a bounded, decreasing, piecewise linear, convex function of θ. (ii) ℓ(r, θ) ≤ (1 + α) ‖J* − Φr‖_∞. (iii) ∂/∂θ ℓ(r, 0) = −1 / Σ_{x∈Ω(r)} π_{μ*,ν}(x), where Ω(r) = argmax_{x∈X} Φr(x) − TΦr(x).\n\nArmed with this definition, we are now in a position to state our first, crude approximation guarantee:\n\nTheorem 1. Let 1 be in the span of Φ and ν be a probability distribution. Let r̄ be an optimal solution to the SALP
 (4). Moreover, let r* satisfy r* ∈ argmin_r ‖J* − Φr‖_∞. Then,\n\n‖J* − Φr̄‖_{1,ν} ≤ ‖J* − Φr*‖_∞ + (ℓ(r*, θ) + 2θ) / (1 − α).\n\nThe above theorem allows us to interpret (ℓ(r*, θ) + 2θ) / (1 − α) as the approximation error associated with the SALP solution r̄. Consider setting θ = 0, in which case (4) is identical to the ALP. In this case, we have from Lemma 1 that ℓ(r*, 0) ≤ (1 + α) ‖J* − Φr*‖_∞, so that the right hand side of our bound is at most (2 / (1 − α)) ‖J* − Φr*‖_∞. This is precisely Theorem 2 in de Farias and Van Roy [1]; we recover their approximation guarantee for the ALP. Next observe that, from (iii), if the set Ω(r*) is of small probability according to the distribution π_{μ*,ν}, we expect that ℓ(r*, θ) will decrease dramatically as θ is increased from 0. In the event that Φr*(x) − TΦr*(x) is large for only a small number of states (that is, the Bellman error of the approximation produced by r* is large for only a small number of states), we thus expect to have a choice of θ for which ℓ(r*, θ) + 2θ ≪ ℓ(r*, 0). Thus, Theorem 1 reinforces the intuition (shown via Figure 1) that the SALP will permit closer approximations to J* than the ALP.\n\nThe bound in Theorem 1 leaves room for improvement:\n\n1. The right hand side of our bound measures projection error, ‖J* − Φr*‖_∞, in the L_∞-norm. Since it is unlikely that the basis functions Φ will provide a uniformly good approximation over the entire state space, the right hand side of our bound could be quite large.\n2. The choice of state relevance weights can significantly influence the solution. While we do not show this here, this choice allows us to choose regions of the state space where we would like a better approximation of J*. The right hand side of our bound, however, is independent of ν.\n3. Our guarantee does not suggest a concrete choice of the violation budget, θ.\n\nThe next section will present a substantially refined approximation bound.\n\n4.2 A Better Approximation Guarantee\n\nWith the intent of deriving stronger approximation guarantees, we begin this section by introducing a ‘nicer’ measure of the quality of approximation afforded by Φ. In particular, instead of measuring ‖J* − Φr*‖ in the L_∞ norm as we did for our previous bounds, we will use a weighted max norm defined according to: ‖J‖_{∞,1/ψ} ≜ max_{x∈X} |J(x)| / ψ(x), where ψ : X → [1, ∞) is a given weighting function. The weighting function ψ allows us to weight approximation error in a non-uniform fashion across the state space and in this manner potentially ignore approximation quality in regions of the state space that ‘don’t matter’. In addition to specifying the constraint violation distribution π as we did for our previous bound, we will specify (implicitly) a particular choice of the violation budget θ. In particular, we will consider solving the following SALP:\n\n(6) maximize_{r,s} ν^⊤ Φr − (2 / (1 − α)) π_{μ*,ν}^⊤ s subject to Φr ≤ TΦr + s, s ≥ 0.\n\nIt is clear that (6) is equivalent to (4) for a specific choice of θ. We then have:\n\nTheorem 2. Let Ψ ≜ {y ∈ R^{|X|} : y ≥ 1}. For every ψ ∈ Ψ, let β(ψ) = max_μ ‖P_μ ψ / ψ‖_∞. Then, for an optimal solution (r_SALP, s̄) to (6), we have:\n\n‖J* − Φr_SALP‖_{1,ν} ≤ inf_{r, ψ∈Ψ} ‖J* − Φr‖_{∞,1/ψ} ( ν^⊤ψ + 2 (π_{μ*,ν}^⊤ψ + 1)(αβ(ψ) + 1) / (1 − α) ).\n\nIt is worth placing the result in context to understand its implications. For this, we recall a closely related result shown by de Farias and Van Roy [1] for the ALP. In particular, de Farias and Van Roy [1] showed that given an appropriate weighting (or in their context, ‘Lyapunov’) function ψ, one may solve an ALP, with ψ in the span of the basis functions Φ; the solution to such an ALP then satisfies:\n\n‖J* − Φr̄‖_{1,ν} ≤ inf_r ‖J* − Φr‖_{∞,1/ψ} · 2ν^⊤ψ / (1 − αβ(ψ)),\n\nprovided β(ψ) ≤ 1/α. Selecting an appropriate ψ in their context is viewed to be an important task for practical performance and often requires a good deal of problem specific analysis; de Farias and Van Roy [1] identify appropriate ψ for several queueing models (note that this is equivalent to identifying a desirable basis function). In contrast, the guarantee we present optimizes over all possible ψ.¹ Thus, the approximation guarantee of Theorem 2 allows us to view the SALP as automating the critical procedure of identifying a good Lyapunov function for a given problem.\n\n¹ This includes those ψ that do not satisfy the Lyapunov condition β(ψ) ≤ 1/α.\n\n4.3 Sample Complexity\n\nOur analysis thus far has assumed we have the ability to solve the SALP, a program with a potentially intractable number of constraints and variables. As it turns out, a solution to the SALP is well approximated by the solution to a certain ‘sampled’ program which we now describe: Let X̂ = {x_1, x_2, ..., x_S} be an ordered collection of S states drawn independently from X according to the distribution π_{μ*,ν}. Let us consider solving the following program which we call the sampled SALP:\n\n(7) maximize_{r,s} ν^⊤ Φr − (2 / ((1 − α) S)) Σ_{x∈X̂} s(x) subject to (Φr)(x) ≤ (TΦr)(x) + s(x), ∀ x ∈ X̂, r ∈ N, s ≥ 0.\n\nHere N ⊂ R^K is a parameter set chosen to contain the optimal solution to the SALP (6), r_SALP. Notice that (7) is a linear program with K + S variables and S|A| constraints. For a moderate number of samples S, this is easily solved. We will provide a sample complexity bound that indicates that for a number of samples S that scales linearly with the dimension of Φ, K, and that need not depend on the size of the state space, the solution to the sampled SALP satisfies, with high probability, the approximation guarantee presented for the SALP solution in Theorem 2.\n\nLet us define the constant B ≜ sup_{r∈N} ‖(Φr − TΦr)^+‖_∞. This quantity is closely related to the diameter of the region N. We then have:\n\nTheorem 3. Under the conditions of Theorem 2, let r_SALP be an optimal solution to the SALP (6), and let r̂_SALP be an optimal solution to the sampled SALP (7). Assume that r_SALP ∈ N. Further, given ε ∈ (0, B] and δ ∈ (0, 1/2], suppose that the number of sampled states S satisfies\n\nS ≥ (64B² / ε²) ( 2(K + 2) log(16eB / ε) + log(8 / δ) ).\n\nThen, with probability at least 1 − δ − 2^{−383} δ^{128},\n\n‖J* − Φr̂_SALP‖_{1,ν} ≤ inf_{r∈N, ψ∈Ψ} ‖J* − Φr‖_{∞,1/ψ} ( ν^⊤ψ + 2 (π_{μ*,ν}^⊤ψ + 1)(αβ(ψ) + 1) / (1 − α) ) + 4ε / (1 − α).\n\nTheorem 3 establishes that the sampled SALP provides a close approximation to the solution of the SALP, in the sense that the approximation guarantees we established for the SALP are approximately valid for the solution to the sampled version with high probability. The number of samples we require to accomplish this task is specified precisely via the theorem. This number depends linearly on the number of basis functions and the diameter of the feasible region, but is otherwise independent of the size of the state space for the MDP under consideration.\n\nIt is worth juxtaposing our sample complexity result with that available for the ALP. In particular, we recall that the ALP has a large number of constraints but a small number of variables; the SALP is thus, at least superficially, a significantly more complex program. Exploiting the fact that the ALP has a small number of variables, de Farias and Van Roy [2] establish a sample complexity bound for a sampled version of the ALP analogous to (7). The number of samples required for this sampled ALP to produce a good approximation to the ALP can be shown to depend on the same problem parameters we have identified here, viz. B and the number of basis functions K. The sample complexity in that case is identical to the sample complexity bound established here up to constants and an additional multiplicative factor of B/ε (for the sampled SALP). Thus, the two sample complexity bounds are within polynomial terms of each other and we have established that the SALP is essentially no harder to solve than the ALP.\n\nThis section places the SALP on solid theoretical ground by establishing strong approximation guarantees for the SALP that represent a substantial improvement over those available for the ALP and sample complexity results that indicate that the SALP is implementable via sampling. We next present a computational study that tests the SALP relative to other ADP methods (including the ALP) on a hard problem (the game of Tetris).\n\n5 Case Study: Tetris\n\nOur interest in Tetris as a case study for the SALP algorithm is motivated by several facts. Theoretical results suggest that design of an optimal Tetris player is a difficult problem [4–6]. Tetris represents precisely the kind of large and unstructured MDP for which it is difficult to design heuristic controllers, and hence policies designed by ADP algorithms are particularly relevant. Moreover, Tetris has been employed by a number of researchers as a testbed problem [3, 7–9].\n\nWe follow the formulation of Tetris as an MDP presented by Farias and Van Roy [3]. The SALP methodology was applied as follows:\n\nBasis functions. We employed the 22 basis functions originally introduced in [7].\n\nState sampling. Given a sample size S, a collection X̂ ⊂ X of S states was sampled. These samples were generated in an IID fashion from the stationary distribution of a (rather poor) baseline policy.²\n\nOptimization. Given the collection X̂ of sampled states, an increasing sequence of choices of the violation budget θ ≥ 0 is considered. For each choice of θ, the optimization program\n\n(8) maximize_{r,s} (1/S) Σ_{x∈X̂} (Φr)(x) subject to Φr(x) ≤ TΦr(x) + s(x), ∀ x ∈ X̂, (1/S) Σ_{x∈X̂} s(x) ≤ θ, s(x) ≥ 0, ∀ x ∈ X̂,\n\nwas solved. This program is a version of the original SALP (3), but with sampled empirical distributions in place of the state-relevance weights ν and the constraint violation distribution π. Note that (8) has K + S decision variables and S|A| linear constraints. Because of the sparsity structure of the constraints, however, it is amenable to efficient solution via barrier methods, even for large values of S.\n\nEvaluation. Given a vector of weights obtained by solving (8), the performance of the corresponding policy is evaluated via Monte Carlo simulation over 3,000 games of Tetris. Performance is measured in terms of the average number of lines cleared in a single game.\n\nFor each pair (S, θ), the resulting average performance (averaged over 10 different sets of sampled states) is shown in Figure 2. It provides experimental evidence for the intuition expressed in Section 3 and the analytic result of Theorem 1: Relaxing the constraints of the ALP by allowing for a violation budget allows for better policy performance. As the violation budget θ is increased from 0, performance dramatically improves. At θ = 0.16384, the performance peaks, and we get policies that are an order of magnitude better than the ALP; beyond that, the performance deteriorates.\n\n² Our baseline policy had an average performance of 113 points.\n\n[Figure 2 plot: Average Performance against Sample Size S, with one curve for each of θ = 0.65536, 0.16384, 0.02048, 0.01024, 0.00256, and 0 (ALP).]\nFigure 2: Average performance of SALP for different values of the number of sampled states S and the violation budget θ.\n\nTable 1 summarizes the performance of best policies obtained by various ADP algorithms. Note that all of these algorithms employ the same basis function architecture. The ALP and SALP results are from our experiments, while the other results are from the literature. The best performance result of SALP is better by a factor of 2 in comparison to the competitors.\n\nAlgorithm | Best Performance | CPU Time\nALP | 897 | hours\nTD-Learning [7] | 3,183 | minutes\nALP with bootstrapping [3] | 4,274 | hours\nTD-Learning [8] | 4,471 | minutes\nPolicy gradient [9] | 5,500 | days\nSALP | 10,775 | hours\nTable 1: Comparison of the performance of the best policy found with various ADP methods.\n\nNote that significantly better policies are possible with this basis function architecture than any of the ADP algorithms in Table 1 discover. Using a heuristic optimization method, Szita and Lőrincz
 [10] report policies with a remarkable average performance of 350,000. Their method is computationally intensive, however, requiring one month of CPU time. In addition, the approach employs a number of rather arbitrary Tetris-specific ‘modifications’ that are ultimately seen to be critical to performance; in the absence of these modifications, the method is unable to find a policy for Tetris that scores above a few hundred points.\n\n6 Future Directions\n\nThere are a number of interesting directions that remain to be explored. First, note that the bounds derived in Sections 4.1 and 4.2 are approximation guarantees, which provide bounds on the approximation error given by the SALP approach versus the best approximation possible with the particular set of basis functions. In preliminary work, we have also developed performance guarantees. These provide bounds on the performance of the resulting SALP policies, as a function of the basis architecture.\n\nSecond, note that sample path variations of the SALP are possible. Rather than solving a large linear program, such an algorithm would optimize a policy in an online fashion along a single system trajectory. This would be in a manner reminiscent of stochastic approximation algorithms like TD-learning. However, a sample path SALP variation would inherit all of the theoretical bounds developed here. The design and analysis of such an algorithm is an exciting future direction.\n\nReferences\n\n[1] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.\n[2] D. P. de Farias and B. Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.\n[3] V. F. Farias and B. Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty. Springer-Verlag, 2006.\n[4] J. Brzustowski. Can you win at Tetris? Master’s thesis, University of British Columbia, 1992.\n[5] H. Burgiel. How to lose at Tetris. Mathematical Gazette, page 194, 1997.\n[6] E. D. Demaine, S. Hohenberger, and D. Liben-Nowell. Tetris is hard, even to approximate. In Proceedings of the 9th International Computing and Combinatorics Conference, 2003.\n[7] D. P. Bertsekas and S. Ioffe. Temporal differences–based policy iteration and applications in neuro–dynamic programming. Technical Report LIDS–P–2349, MIT Laboratory for Information and Decision Systems, 1996.\n[8] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.\n[9] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.\n[10] I. Szita and A. Lőrincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18:2936–2941, 2006.\n[11] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.", "award": [], "sourceid": 1007, "authors": [{"given_name": "Vijay", "family_name": "Desai", "institution": null}, {"given_name": "Vivek", "family_name": "Farias", "institution": null}, {"given_name": "Ciamac", "family_name": "Moallemi", "institution": null}]}