{"title": "Infinite Latent SVM for Classification and Multi-task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1620, "page_last": 1628, "abstract": "Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes' theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. 
Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.", "full_text": "Infinite Latent SVM for Classification and Multi-task Learning

Jun Zhu†, Ning Chen†, and Eric P. Xing‡
† Dept. of Computer Science & Tech., TNList Lab, Tsinghua University, Beijing 100084, China
‡ Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
dcszj@tsinghua.edu.cn; chenn07@mails.thu.edu.cn; epxing@cs.cmu.edu

Abstract

Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes' theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

1 Introduction

Nonparametric Bayesian latent variable models have recently gained remarkable popularity in statistics and machine learning, partly owing to their desirable "nonparametric" nature, which allows practitioners to "sidestep" the difficult model selection problem, e.g., figuring out the unknown number of components (or classes) in a mixture model [2] or determining the unknown dimensionality of latent features [12], by using an appropriate prior distribution with a large support. Among the most commonly used priors are the Gaussian process (GP) [24], the Dirichlet process (DP) [2], and the Indian buffet process (IBP) [12]. However, standard nonparametric Bayesian models are limited in that they usually make very strict and unrealistic assumptions on data, such as observations being homogeneous or exchangeable.

A number of recent developments in Bayesian nonparametrics have attempted to alleviate such limitations. For example, to handle heterogeneous observations, predictor-dependent processes [20] have been proposed; and to relax the exchangeability assumption, various correlation structures, such as hierarchical structures [26], temporal or spatial dependencies [5], and stochastic ordering dependencies [13, 10], have been introduced. However, all these methods rely solely on crafting a nonparametric Bayesian prior encoding some special structure, which can only indirectly influence the posterior distribution of interest via trading off with the likelihood model. Since it is the posterior distributions, which capture the latent structures to be learned, that are of our ultimate interest, an arguably more direct way to learn a desirable latent-variable model is to impose posterior regularization (i.e., regularization on posterior distributions), as we will explore in this paper. Another reason for using posterior regularization is that in some cases it is more natural and easier to incorporate domain knowledge, such as large-margin [15, 31] or manifold constraints [14], directly on posterior distributions rather than through priors, as shown in this paper.

Posterior regularization, usually through imposing constraints on the posterior distributions of latent variables or via some information projection, has been widely studied in learning a finite log-linear model from partially observed data, including generalized expectation [21], posterior regularization [11], and alternating projection [6], all of which perform maximum likelihood estimation (MLE) to learn a single set of model parameters by optimizing an objective. Recent attempts toward learning a posterior distribution of model parameters include "learning from measurements" [19], maximum entropy discrimination [15], and MedLDA [31]. But again, all these methods are limited to finite parametric models. To our knowledge, very few attempts have been made to impose posterior regularization on nonparametric Bayesian latent variable models. One exception is our recent work on infinite SVM (iSVM) [32], a DP mixture of large-margin classifiers. iSVM is a latent class model that assigns each data example to a single mixture component for classification, and the unknown number of mixture components is automatically resolved from data.

In this paper, we present a general formulation of performing nonparametric Bayesian inference subject to appropriate posterior constraints. In particular, we concentrate on developing the infinite latent support vector machine (iLSVM) and the multi-task infinite latent support vector machine (MT-iLSVM), which explore the discriminative large-margin idea to learn infinite latent feature models for classification and multi-task learning [3, 4], respectively. As such, our methods, as well as [32], represent an attempt to push forward the interface between Bayesian nonparametrics and large-margin learning, which have complementary advantages but have been largely treated as two separate subfields in the machine learning community. Technically, although it is intuitively natural for MLE-based methods to include a regularization term on the posterior distributions of latent variables, this is not straightforward for Bayesian inference, because we do not have an optimization objective to be regularized. We base our work on Zellner's interpretation of Bayes' theorem [29], namely, that Bayes' theorem can be reformulated as a minimization problem. Under this optimization framework, we incorporate posterior constraints to do regularized Bayesian inference, with a penalty term that measures the violation of the constraints. Both iLSVM and MT-iLSVM are special cases that explore the large-margin principle to incorporate supervising information for learning predictive latent features, which are good for classification or multi-task learning. We use the nonparametric IBP prior to allow the models to have an unbounded number of latent features. The regularized inference problem can be efficiently solved with an iterative procedure, which leverages existing high-performance convex optimization techniques.

Related Work: As stated above, both iLSVM and MT-iLSVM generalize the ideas of iSVM to infinite latent feature models. For multi-task learning, nonparametric Bayesian models have been developed in [28, 23] for learning features shared by multiple tasks. But these methods are based on standard Bayesian inference, without the ability to consider posterior regularization, such as large-margin constraints or manifold constraints [14]. Finally, MT-iLSVM is a nonparametric Bayesian generalization of the popular multi-task learning methods [1, 16], as explained shortly.

2 Regularized Bayesian Inference with Posterior Constraints

In this section, we present the general framework of regularized Bayesian inference with posterior constraints. We begin with a brief review of the basic results due to Zellner [29].

2.1 Bayesian Inference as a Learning Model

Let M be a model space, containing any variables whose posterior distributions we are trying to infer. Bayesian inference starts with a prior distribution \pi(M) and a likelihood function p(x|M) indexed by the model M \in M. Then, by Bayes' theorem, the posterior distribution is

p(M | x_1, ..., x_N) = \pi(M) \prod_{n=1}^N p(x_n|M) / p(x_1, ..., x_N),   (1)

where p(x_1, ..., x_N) is the marginal likelihood or evidence of the observed data. Zellner [29] first showed that the posterior distribution due to Bayes' theorem is the solution of the problem

min_{p(M)} KL(p(M) || \pi(M)) - \sum_{n=1}^N \int \log p(x_n|M) p(M) dM   (2)
s.t.: p(M) \in P_prob,

where KL(p(M) || \pi(M)) is the Kullback-Leibler (KL) divergence, and P_prob is the space of valid probability distributions with an appropriate dimension.

2.2 Regularized Bayesian Inference with Posterior Constraints

As commented by E. T. Jaynes [29], "this fresh interpretation of Bayes' theorem could make the use of Bayesian methods more attractive and widespread, and stimulate new developments in the general theory of inference". Below, we study how to extend the basic results to incorporate posterior constraints in Bayesian inference. In standard Bayesian inference, the constraints (i.e., p(M) \in P_prob) do not have auxiliary free parameters. In general, regularized Bayesian inference solves the constrained optimization problem

min_{p(M), \xi} KL(p(M) || \pi(M)) - \sum_{n=1}^N \int \log p(x_n|M) p(M) dM + U(\xi)   (3)
s.t.: p(M) \in P_post(\xi),

where P_post(\xi) is a subspace of distributions that satisfy a set of constraints. The auxiliary parameters \xi are usually nonnegative and interpreted as slack variables. U(\xi) is a convex function, which usually corresponds to a surrogate loss (e.g., the hinge loss) of a prediction rule, as we shall see.

We can use an iterative procedure to do the regularized Bayesian inference based on convex optimization techniques. The general recipe is to use the Lagrangian method, introducing Lagrange multipliers \omega. Then, we iteratively solve for p(M) with \omega and \xi fixed, and solve for \omega and \xi with p(M) given. For the first step, we can use sampling or variational methods [9] to do approximate inference; and under certain conditions, such as using constraints based on posterior expectations [21], the second step can be done efficiently using high-performance convex optimization techniques, as we shall see.

3 Infinite Latent Support Vector Machines

In this section, we concretize the ideas of regularized Bayesian inference by particularly focusing on developing large-margin classifiers with an unbounded dimension of latent features, which can be used as a representation of examples for single-task classification, or as a common representation that captures relationships among multiple tasks for multi-task learning.

We first present the single-task classification model. The basic setup is that we project each data example x \in X \subset R^D to a latent feature vector z. Here, we consider binary features (see footnote 1). Given a set of N data examples, let Z be the matrix of which each row is a binary vector z_n associated with data sample n. Instead of pre-specifying a fixed dimension of z, we resort to nonparametric Bayesian methods and let z have an infinite number of dimensions. To make the expected number of active latent features finite, we put the well-studied IBP prior on the binary feature matrix Z.

3.1 Indian Buffet Process

The Indian buffet process (IBP) was proposed in [12] and has been successfully applied in various fields, such as link prediction [22] and multi-task learning [23]. We focus on its stick-breaking construction [25], which is convenient for developing efficient inference methods. Let \pi_k \in (0, 1) be a parameter associated with
column k of the binary matrix Z. Given \pi_k, each z_nk in column k is sampled independently from Bernoulli(\pi_k). The parameters \pi are generated by a stick-breaking process

\pi_1 = \nu_1, and \pi_k = \nu_k \pi_{k-1} = \prod_{i=1}^k \nu_i,   (4)

where \nu_i ~ Beta(\alpha, 1). This process results in a decreasing sequence of probabilities \pi_k. Specifically, given a finite dataset, the probability of seeing feature k decreases exponentially with k.

3.2 Infinite Latent Support Vector Machines

We consider multi-way classification, where each training example is provided with a categorical label y, where y \in Y := {1, ..., L}. For binary classification and regression, a similar procedure can be applied to impose large-margin constraints on posterior distributions. Suppose the latent features z are given; then we can define the latent discriminant function as

f(y, x, z; \eta) := \eta^T g(y, x, z),   (5)

where g(y, x, z) is a vector stacking L subvectors, of which the y-th is z^T and all the others are zero (see footnote 2). Since we are doing Bayesian inference, we need to maintain the entire distribution profile of the latent features Z.

[Footnote 1: Real-valued features can be easily considered as in [12].]
[Footnote 2: We can consider the input features x, or certain statistics thereof, in combination with the latent features z to define a classifier boundary, by simply concatenating them in the subvectors.]

However, in order to make a prediction on the observed data x, we need to get rid of the uncertainty in Z. Here, we define the effective discriminant function as an expectation (i.e., a weighted average considering all possible values of Z) of the latent discriminant function (see footnote 3). To make the model fully Bayesian, we also treat \eta as random and aim to infer the posterior distribution p(Z, \eta) from the given data. More formally, the effective discriminant function f: X x Y -> R is

f(y, x; p(Z, \eta)) := E_{p(Z,\eta)}[f(y, x, z; \eta)] = E_{p(Z,\eta)}[\eta^T g(y, x, z)].   (6)

Note that although the number of latent features is allowed to be infinite, with probability one the number of non-zero features is finite when only a finite number of data are observed, under the IBP prior. Moreover, to make the model computationally feasible, we usually set a finite upper bound K on the number of possible features, where K is sufficiently large and known as the truncation level (see Sec. 3.4 and Appendix A.2 for details). As shown in [9], the \ell_1-distance truncation error of the marginal distributions decreases exponentially as K increases.

With the above definitions, we define the P_post(\xi) in problem (3) using large-margin constraints as

P^c_post(\xi) := { p(Z, \eta) : \forall n \in I_tr, \forall y: f(y_n, x_n; p(Z, \eta)) - f(y, x_n; p(Z, \eta)) \geq \ell(y, y_n) - \xi_n; \xi_n \geq 0 }   (7)

and define the penalty function as U^c(\xi) := C \sum_{n \in I_tr} \xi_n^p, where p \geq 1. If p is 1, minimizing U^c(\xi) is equivalent to minimizing the hinge loss (or \ell_1-loss) R^c_h of the prediction rule (9), where R^c_h = C \sum_{n \in I_tr} max_y ( f(y, x_n; p(Z, \eta)) + \ell(y, y_n) - f(y_n, x_n; p(Z, \eta)) ); if p is 2, the surrogate loss is the \ell_2-loss. For clarity, we consider the hinge loss. The non-negative cost function \ell(y, y_n) (e.g., the 0/1-cost) measures the cost of predicting x_n to be y when its true label is y_n. I_tr is the index set of the training data.

In order to robustly estimate the latent matrix Z, we need a reasonable amount of data. Therefore, we also relate Z to the observed data x by defining a likelihood model, to exploit as much data as possible. Here, we define the linear-Gaussian likelihood model for real-valued data

p(x_n | z_n, W, \sigma^2_n0) = N(x_n | W z_n^T, \sigma^2_n0 I),   (8)

where W is a random loading matrix and I is an identity matrix with appropriate dimensions. We assume W follows an independent Gaussian prior, i.e., \pi(W) = \prod_d N(w_d | 0, \sigma^2_0 I). Fig. 1(a) shows the graphical structure of iLSVM. The hyperparameters \sigma^2_0 and \sigma^2_n0 can be set a priori or estimated from observed data (see Appendix A.2 for details).

Testing: to make predictions on test examples, we put both training and test data together to do the regularized Bayesian inference. For the training data, we impose the above large-margin constraints because we are aware of their true labels, while for the test data we do the inference without the large-margin constraints, since we do not know their true labels. After inference, we make predictions via the rule

y* := argmax_y f(y, x; p(Z, \eta)).   (9)

The ability to generalize to test data relies on the fact that all the data examples share \eta and the IBP prior. We could also cast the problem as a transductive inference problem by imposing additional constraints on test data [17]; however, the resulting problem would generally be harder to solve.

3.3 Multi-Task Infinite Latent Support Vector Machines

Different from classification, which is typically formulated as a single learning task, multi-task learning aims to improve a set of related tasks through sharing statistical strength between these tasks, which are performed jointly. Many different approaches have been developed for multi-task learning (see [16] for a review). In particular, learning a common latent representation shared by all the related tasks has proven to be an effective way to capture task relationships [1, 3, 23]. Below, we present the multi-task infinite latent SVM (MT-iLSVM) for learning a common binary projection matrix Z to capture the relationships among multiple tasks. As in iLSVM, we put the IBP prior on Z to allow it to have an unbounded number of columns.

[Footnote 3: Although other choices, such as taking the mode, are possible, our choice could lead to a computationally easy problem because expectation is a linear functional of the distribution under which the expectation is taken. Moreover, expectation can be more robust than taking the mode [18], and it has been used in [31, 32].]

Figure 1: Graphical structures of (a) the infinite latent SVM (iLSVM); and (b) the multi-task infinite latent SVM (MT-iLSVM). For MT-iLSVM, the dashed nodes (i.e., \varsigma_m) are included to illustrate the task relatedness. We have omitted the priors on W and \eta for notational brevity.

Suppose we have M related tasks. Let D_m = {(x_mn, y_mn)}_{n \in I^m_tr} be the training data for task m. We consider binary classification tasks, where Y_m = {+1, -1}; extension to multi-way classification or regression tasks can be easily done. If the latent matrix Z is given, we define the latent discriminant function for task m as

f_m(x, Z; \eta_m) := (Z \eta_m)^T x = \eta_m^T (Z^T x).   (10)

This definition provides two views of how the M tasks get related. If we let \varsigma_m = Z \eta_m, then \varsigma_m are the actual parameters of task m, and the \varsigma_m of the different tasks are coupled by sharing the same latent matrix Z. The other view is that each task m has its own parameters \eta_m, but all the tasks share the same latent features Z^T x, which is a projection of the input features x, with Z being the latent projection matrix. As such, our method can be viewed as a nonparametric Bayesian treatment of alternating structure optimization (ASO) [1], which learns a single projection matrix with a pre-specified latent dimension. Moreover, different from [16], which learns a binary vector with known dimensionality to select features or kernels on x, we learn an unbounded projection matrix Z using nonparametric Bayesian techniques. As in iLSVM, we take the fully Bayesian treatment (i.e., the \eta_m are also random variables) and define the effective discriminant function for task m as the expectation

f_m(x; p(Z, \eta)) := E_{p(Z,\eta)}[f_m(x, Z; \eta_m)] = E_{p(Z,\eta)}[Z \eta_m]^T x.   (11)

Then, the prediction rule for task m is naturally y*_m := sign f_m(x). Similarly, we do regularized Bayesian inference by defining U^MT(\xi) := C \sum_{m, n \in I^m_tr} \xi_mn and imposing the following constraints:

P^MT_post(\xi) := { p(Z, \eta) : \forall m, \forall n \in I^m_tr: y_mn E_{p(Z,\eta)}[Z \eta_m]^T x_mn \geq 1 - \xi_mn; \xi_mn \geq 0 }.   (12)

As in iLSVM, minimizing U^MT(\xi) is equivalent to minimizing the hinge loss R^MT_h of the multiple binary prediction rules, where R^MT_h = C \sum_{m, n \in I^m_tr} max(0, 1 - y_mn E_{p(Z,\eta)}[Z \eta_m]^T x_mn). Finally, to obtain more data to estimate the latent Z, we also relate it to the observed data by defining the likelihood model

p(x_mn | w_mn, Z, \lambda^2_mn) = N(x_mn | Z w_mn, \lambda^2_mn I),   (13)

where w_mn is a vector. We assume W has an independent prior \pi(W) = \prod_mn N(w_mn | 0, \sigma^2_m0 I). Fig. 1(b) illustrates the graphical structure of MT-iLSVM. For testing, we use the same strategy as in iLSVM and do Bayesian inference on both training and test data; the difference is that the training data are subject to large-margin constraints, while the test data are not. Similarly, the hyperparameters \sigma^2_m0 and \lambda^2_mn can be set a priori or estimated from data (see Appendix A.1 for details).

3.4 Inference with Truncated Mean-Field Constraints

We briefly discuss how to do regularized Bayesian inference (3) with the large-margin constraints for MT-iLSVM; for iLSVM, a similar procedure applies. To make the problem easier to solve, we use the stick-breaking representation of the IBP, which includes the auxiliary variables \nu, and infer the posterior p(\nu, W, Z, \eta). Furthermore, we impose the truncated mean-field constraint that

p(\nu, W, Z, \eta) = p(\eta) \prod_{k=1}^K ( p(\nu_k | \gamma_k) \prod_{d=1}^D p(z_dk | \psi_dk) ) \prod_mn p(w_mn | \Phi_mn, \sigma^2_mn I),   (14)

where K is the truncation level; p(w_mn | \Phi_mn, \sigma^2_mn I) = N(w_mn | \Phi_mn, \sigma^2_mn I); p(z_dk | \psi_dk) = Bernoulli(\psi_dk); and p(\nu_k | \gamma_k) = Beta(\gamma_k1, \gamma_k2). We first turn the constrained problem into the problem of finding a stationary point, using Lagrangian methods and introducing Lagrange multipliers \omega, one for each large-margin constraint as defined in Eq. (12), and u for the nonnegativity constraints on \xi. Let L(p, \xi, \omega, u) be the Lagrangian functional. The inference procedure iteratively solves the following two steps (we defer the details to Appendix A.1):

Infer p(\nu), p(W), and p(Z): for p(W), since the prior is also normal, we can easily derive the update rules for \Phi_mn and \sigma^2_mn. For p(\nu), we have the same update rules as in [9]. We defer the details to Appendix A.1. Here, we focus on p(Z) and provide insights on how the large-margin constraints regularize the procedure of inferring the latent matrix Z. Since the large-margin constraints are linear in p(Z), we can get the mean-field update equation \psi_dk = 1 / (1 + e^{-\vartheta_dk}), where

\vartheta_dk = \sum_{j=1}^k E_p[log v_j] - L^\nu_k - \sum_mn 1/(2\lambda^2_mn) ( (K\sigma^2_mn + (\phi^k_mn)^2) - 2 x^d_mn \phi^k_mn + 2 \sum_{j \neq k} \phi^j_mn \phi^k_mn \psi_dj ) + \sum_{m, n \in I^m_tr} y_mn E_p[\eta_mk] x^d_mn,   (15)

where L^\nu_k is a lower bound of E_p[log(1 - \prod_{j=1}^k v_j)] (see Appendix A.1 for details). The last term of \vartheta_dk is due to the large-margin posterior constraints as defined in Eq. (12).

Infer p(\eta) and solve for \omega and \xi: we optimize L over p(\eta) and get p(\eta) = \prod_m p(\eta_m), where p(\eta_m) \propto \pi(\eta_m) exp{\eta_m^T \mu_m}, and \mu_m = \sum_{n \in I^m_tr} y_mn \omega_mn (\psi^T x_mn). Here, we assume \pi(\eta_m) is standard normal. Then, we have p(\eta_m) = N(\eta_m | \mu_m, I). Substituting the solution of p(\eta) into L, we get M independent dual problems

max_{\omega_m} -1/2 \mu_m^T \mu_m + \sum_{n \in I^m_tr} \omega_mn   s.t.: 0 \leq \omega_mn \leq 1, \forall n \in I^m_tr,   (16)

each of which (or its primal form) can be efficiently solved with a binary SVM solver, such as SVM-light.

4 Experiments

We present empirical results for both classification and multi-task learning. Our results demonstrate the merits inherited from both Bayesian nonparametrics and large-margin learning.

4.1 Multi-way Classification

We evaluate the infinite latent SVM (iLSVM) for classification on the real TRECVID 2003 and Flickr image datasets, which have been extensively evaluated in the context of learning finite latent feature models [8]. TRECVID 2003 consists of 1078 video key-frames, and each example has two types of features: a 1894-dimensional binary vector of text features and a 165-dimensional HSV color histogram. The Flickr image dataset consists of 3411 natural scene images of 13 types of animals (e.g., tiger, cat, etc.) downloaded from the Flickr website. Again, each example has two types of features: 500-dimensional SIFT bag-of-words and 634-dimensional real-valued features (e.g., color histogram, edge direction histogram, and block-wise color moments). Here, we consider the real-valued features only, using normal distributions for x. We compare iLSVM with the large-margin Harmonium (MMH) [8], which was shown to outperform many other latent feature models [8], and with two decoupled approaches, EFH+SVM and IBP+SVM. EFH+SVM uses the exponential family Harmonium (EFH) [27] to discover latent features and then learns a multi-way SVM classifier. IBP+SVM is similar, but uses an IBP factor analysis model [12] to discover latent features. As finite models, both MMH and EFH+SVM need to pre-specify the dimensionality of the latent features. We report their classification accuracy and F1 score (i.e., the average F1 score over all possible classes) [32] achieved with the best dimensionality in Table 1. For iLSVM and IBP+SVM, we use the mean-field inference method and report the average performance over 5 randomly initialized runs (see Appendix A.2 for the algorithm and initialization details). We perform 5-fold cross-validation on the training data to select hyperparameters, e.g., \alpha and C (we use the same procedure for MT-iLSVM). We can see that iLSVM achieves performance comparable to the nearly optimal MMH, without needing to pre-specify the latent feature dimension (we set the truncation level to 300, which is large enough), and is much better than the decoupled approaches (i.e., IBP+SVM and EFH+SVM).

Table 1: Classification accuracy and F1 scores on the TRECVID 2003 and Flickr image datasets.
                 TRECVID 2003                        Flickr
Model        Accuracy        F1 score        Accuracy        F1 score
EFH+SVM      0.565±0.0       0.427±0.0       0.476±0.0       0.461±0.0
MMH          0.566±0.0       0.430±0.0       0.538±0.0       0.512±0.0
IBP+SVM      0.553±0.013     0.397±0.030     0.500±0.004     0.477±0.009
iLSVM        0.563±0.010     0.448±0.011     0.533±0.005     0.510±0.010

4.2 Multi-task Learning

4.2.1 Description of the Data

Scene and Yeast Data: These datasets are from the UCI repository, and each data example has multiple labels. As in [23], we treat multi-label classification as a multi-task learning problem, where each label assignment is treated as a binary classification task. The Yeast dataset consists of 1500 training and 917 test examples, each having 103 features, and the number of labels (or tasks) per example is 14. The Scene dataset consists of 1211 training and 1196 test examples, each having 294 features, and the number of labels (or tasks) per example is 6.

School Data: This dataset comes from the Inner London Education Authority and has been used to study the effectiveness of schools. It consists of examination records from 139 secondary schools in the years 1985, 1986 and 1987. It is a random 50% sample with 15362 students. The dataset is publicly available and has been extensively evaluated with various multi-task learning methods [4, 7, 30], where each task is defined as predicting the exam scores of the students belonging to a specific school, based on four student-dependent features (year of the exam, gender, VR band, and ethnic group) and four school-dependent features (percentage of students eligible for free school meals, percentage of students in VR band 1, school gender, and school denomination). In order to compare with the above methods, we follow the same setup described in [3, 4]; similarly, we create dummy variables for the categorical features, forming a total of 19 student-dependent features and 8 school-dependent features. We use the same 10 random splits of the data (see footnote 5), so that 75% of the examples from each school (task) belong to the training set and 25% to the test set. On average, the training set includes about 80 students per school and the test set about 30 students per school.

4.2.2 Results

Scene and Yeast Data: We compare with the closely related nonparametric Bayesian methods [23, 28], which were shown to outperform independent Bayesian logistic regression and a single-task pooling approach [23], and with a decoupled method, MT-IBP+SVM (see footnote 6), that uses an IBP factor analysis model to find shared latent features among multiple tasks and then builds separate SVM classifiers for the different tasks. For MT-iLSVM and MT-IBP+SVM, we use the mean-field inference method of Sec. 3.4 and report the average performance over 5 randomly initialized runs (see Appendix A.1 for initialization details). For comparison with [23, 28], we use the overall classification accuracy, F1-Macro and F1-Micro as performance measures. Table 2 shows the results. We can see that the large-margin MT-iLSVM performs much better than the other nonparametric Bayesian methods and than MT-IBP+SVM, which separates the inference of latent features from the learning of the classifiers.

Table 2: Multi-label classification performance on the Scene and Yeast datasets.
                       Yeast                                        Scene
Model            Acc           F1-Micro      F1-Macro      Acc           F1-Micro      F1-Macro
yaxue [23]       0.5106        0.3897        0.4022        0.7765        0.2669        0.2816
piyushrai-1 [23] 0.5212        0.3631        0.3901        0.7756        0.3153        0.3242
piyushrai-2 [23] 0.5424        0.3946        0.4112        0.7911        0.3214        0.3226
MT-IBP+SVM       0.5475±0.005  0.3910±0.006  0.4345±0.007  0.8590±0.002  0.4880±0.012  0.5147±0.018
MT-iLSVM         0.5792±0.003  0.4258±0.005  0.4742±0.008  0.8752±0.004  0.5834±0.026  0.6148±0.020

School Data: We use the percentage of explained variance [4] as the measure of regression performance, which is defined as the total variance of the data minus the sum-squared error on the test set, as a percentage of the total variance. Since we use the same settings, we can
compare with the state-of-the-art results of Bayesian multi-task learning (BMTL) [4], multi-task Gaussian processes (MTGP) [7], convex multi-task relationship learning (MTRL) [30], and single-task learning (STL), as reported in [7, 30]. For MT-iLSVM and MT-IBP+SVM, we also report the results achieved by using both the latent features (i.e., Z^T x) and the original input features x through vector concatenation; we denote the corresponding methods by MT-iLSVMf and MT-IBP+SVMf, respectively.

[Footnote 5: Available at: http://ttic.uchicago.edu/~argyriou/code/index.html]
[Footnote 6: This decoupled approach is in fact a one-iteration MT-iLSVM, where we first infer the shared latent matrix Z and then learn an SVM classifier for each task.]

From the results in Table 3, we can see that the multi-task latent SVM (i.e., MT-iLSVM) achieves better results than the existing methods that have been tested in previous studies. Again, the joint MT-iLSVM performs much better than the decoupled method MT-IBP+SVM, which separates the latent feature inference from the training of the large-margin classifiers. Finally, using both the latent features and the original input features boosts the performance slightly for MT-iLSVM, and much more significantly for the decoupled MT-IBP+SVM.

Table 3: Percentage of explained variance by various models on the School dataset.
STL        BMTL       MTGP       MTRL       MT-IBP+SVM  MT-iLSVM   MT-IBP+SVMf  MT-iLSVMf
23.5±1.9   29.5±0.4   29.2±1.6   29.9±1.8   20.0±2.9    30.9±1.2   28.5±1.6     31.7±1.1

Table 4: Percentage of explained variance and running time of MT-iLSVM with various training sizes.
                           50%         60%         70%         80%         90%         100%
explained variance (%)     25.8±0.4    27.3±0.7    29.6±0.4    30.0±0.5    30.8±0.4    30.9±1.2
running time (s)           370.3±32.5  455.9±18.6  492.6±33.2  600.1±50.2  777.6±73.4  918.9±96.5

Figure 2: Sensitivity study of MT-iLSVM: (a) classification accuracy with different \alpha (Yeast); (b) classification accuracy with different C (Yeast); and (c) percentage of explained variance with different C (School).

4.3 Sensitivity Analysis

Figure 2 shows how the performance of MT-iLSVM changes against the hyperparameter \alpha and the regularization constant C on the Yeast and School datasets. We can see that on the Yeast dataset, MT-iLSVM is insensitive to \alpha and C. For the School dataset, MT-iLSVM is stable when C is set between 0.3 and 1. MT-iLSVM is insensitive to \alpha on the School data too, which is omitted to save space. Table 4 shows how the training size affects the performance and running time of MT-iLSVM on the School dataset. We use the first b% (b = 50, 60, 70, 80, 90, 100) of the training data in each of the 10 random splits as the training set and use the corresponding test data as the test set. We can see that as the training size increases, the performance and running time generally increase; and MT-iLSVM achieves the state-of-the-art performance when using about 70% of the training data. From the running time, we can also see that MT-iLSVM is generally quite efficient when using mean-field inference. Finally, we investigate how the performance of MT-iLSVM changes against the hyperparameters \sigma^2_m0 and \lambda^2_mn. We initially set \sigma^2_m0 = 1 and compute \lambda^2_mn from the observed data. If we further estimate them by maximizing the objective function, the performance does not change much (±0.3% for average explained variance on the School dataset). We have similar observations for iLSVM.

5 Conclusions and Future Work

We first present a general framework for doing regularized Bayesian inference subject to appropriate constraints, which are imposed directly on the posterior distributions. Then, we concentrate on developing two nonparametric Bayesian models to learn predictive latent features for classification and multi-task learning, respectively, by exploring the large-margin principle to define posterior constraints. Both models allow the latent dimension to be automatically resolved from the data. The empirical results on several real datasets appear to demonstrate that our methods inherit the merits from both Bayesian nonparametrics and large-margin learning.

Regularized Bayesian inference offers a general framework for considering posterior regularization in performing nonparametric Bayesian inference. For future work, we plan to study other forms of posterior regularization beyond the large-margin constraints, such as posterior constraints defined on manifold structures [14], and to investigate how posterior regularization can be used in other interesting nonparametric Bayesian models [5, 26].

Acknowledgments

This work was done when JZ was a post-doc fellow at CMU. JZ is supported by the National Key Project for Basic Research of China (No. 2012CB316300) and the National Natural Science Foundation of China (No. 60805023). EX is supported by AFOSR FA95501010247, ONR N000140910758, NSF Career DBI-0546594, and an Alfred P. Sloan Research Fellowship.

References

[1] R. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, (6):1817-1853, 2005.
[2] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Stats, (273):1152-1174, 1974.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. In NIPS, 2007.
[4] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. JMLR, (4):83-99, 2003.
[5] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.
[6] K. Bellare, G. Druck, and A. McCallum. Alternating projections for learning with expectation constraints. In UAI, 2009.
[7] E. Bonilla, K. M. A. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS, 2008.
[8] N. Chen, J. Zhu, and E. P. Xing. Predictive subspace learning for multiview data: a large margin approach. In NIPS, 2010.
[9] F. Doshi-Velez, K. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian buffet process. In AISTATS, 2009.
[10] D. Dunson and S. Peddada. Bayesian nonparametric inferences on stochastic ordering. ISDS Discussion Paper, 2, 2007.
[11] K. Ganchev, J. Graca, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. JMLR, (11):2001-2094, 2010.
[12] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2006.
[13] D. Hoff. Bayesian methods for partial stochastic orderings. Biometrika, 90:303-317, 2003.
[14] S. Huh and S. Fienberg. Discriminative topic modeling based on manifold learning. In KDD, 2010.
[15] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In NIPS, 1999.
[16] T. Jebara. Multitask sparsity via maximum entropy discrimination. JMLR, (12):75-110, 2011.
[17] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[18] M. E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In NIPS, 2010.
[19] P. Liang, M. Jordan, and D. Klein. Learning from measurements in exponential families. In ICML, 2009.
[20] S. N. MacEachern. Dependent nonparametric processes. In the Section on Bayesian Statistical Science of ASA, 1999.
[21] G. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, (11):955-984, 2010.
[22] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In NIPS, 2009.
[23] P. Rai and H. Daume III. Infinite predictor subspace models for multitask learning. In AISTATS, 2010.
[24] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In NIPS, 2002.
[25] Y. W. Teh, D. Gorur, and Z. Ghahramani. Stick-breaking construction of the Indian buffet process. In AISTATS, 2007.
[26] Y. W. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566-1581, 2006.
[27] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2004.
[28] Y. Xue, D. Dunson, and L. Carin. The matrix stick-breaking process for flexible multi-task learning. In ICML, 2007.
[29] A. Zellner. Optimal information processing and Bayes' theorem. American Statistician, 42:278-280, 1988.
[30] Y. Zhang and D. Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.
[31] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. In ICML, 2009.
[32] J. Zhu, N. Chen, and E. P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In ICML, 2011.", "award": [], "sourceid": 
931, "authors": [{"given_name": "Jun", "family_name": "Zhu", "institution": null}, {"given_name": "Ning", "family_name": "Chen", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}