{"title": "An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism", "book": "Advances in Neural Information Processing Systems", "page_first": 405, "page_last": 413, "abstract": "The Indian Buffet Process is a Bayesian nonparametric approach that models objects as arising from an infinite number of latent factors. Here we extend the latent factor model framework to two or more unbounded layers of latent factors. From a generative perspective, each layer defines a conditional \\emph{factorial} prior distribution over the binary latent variables of the layer below via a noisy-or mechanism. We explore the properties of the model with two empirical studies, one digit recognition task and one music tag data experiment.", "full_text": "AnIn\ufb01niteFactorModelHierarchyViaaNoisy-OrMechanismAaronC.Courville,DouglasEckandYoshuaBengioDepartmentofComputerScienceandOperationsResearchUniversityofMontr\u00b4ealMontr\u00b4eal,Qu\u00b4ebec,Canada{courvila,eckdoug,bengioy}@iro.umontreal.caAbstractTheIndianBuffetProcessisaBayesiannonparametricapproachthatmodelsob-jectsasarisingfromanin\ufb01nitenumberoflatentfactors.Hereweextendthelatentfactormodelframeworktotwoormoreunboundedlayersoflatentfactors.Fromagenerativeperspective,eachlayerde\ufb01nesaconditionalfactorialpriordistributionoverthebinarylatentvariablesofthelayerbelowviaanoisy-ormechanism.Weexplorethepropertiesofthemodelwithtwoempiricalstudies,onedigitrecogni-tiontaskandonemusictagdataexperiment.1IntroductionTheIndianBuffetProcess(IBP)[5]isaBayesiannonparametricapproachthatmodelsobjectsasarisingfromanunboundednumberoflatentfeatures.OneofthemainmotivationsfortheIBPisthedesireforafactorialrepresentationofdata,witheachelementofthedatavectormodelledindependently,i.e.asacollectionoffactorsratherthanasmonolithicwholesasassumedbyothermodelingparadigmssuchasmixturemodels.ConsidermusictagdatacollectedthroughtheinternetserviceproviderLast.fm.Usersoftheservicelabelsongsandartistswithdescriptivetagsthatcollectivelyformarepresentation
ofanartistorsong.Thesetagscanthenbeusedtoorganizeplaylistsaroundcertainthemes,suchasmusicfromthe80\u2019s.Thetop8tagsforthepopularbandRADIOHEADare:alternative,rock,alternativerock,indie,electronic,britpop,british,andindierock.Thetagspointtovariousfacetsoftheband,forexamplethattheyarebasedinBritain,thattheymakeuseofelectronicmusicandthattheirstyleofmusicisalternativeand/orrock.Thesefacetsorfeaturesarenotmutuallyexclusivepropertiesbutrepresentsomesetofdistinctaspectsoftheband.ModelingsuchdatawithanIBPallowsustocapturethelatentfactorsthatgiverisetothetags,includinginferringthenumberoffactorscharacterizingthedata.HowevertheIBPassumestheselatentfeaturesareindependentacrossobjectinstances.Yetinmanysituations,amorecompactand/oraccuratedescriptionofthedatacouldbeobtainedifwewerepreparedtoconsiderdependen-ciesbetweenlatentfactors.Despitetherebeingawealthofdistinctfactorsthatcollectivelydescribeanartist,itisclearthattheco-occurrenceofsomefeaturesismorelikelythanothers.Forexample,factorsassociatedwiththetagalternativearemorelikelytoco-occurwiththoseassociatedwiththetagindiethanthoseassociatedwithtagclassical.Themaincontributionofthisworkistopresentamethodforextendingin\ufb01nitelatentfactormod-elstotwoormoreunboundedlayersoffactors,withupper-layerfactorsde\ufb01ningafactorialpriordistributionoverthebinaryfactorsofthelayerbelow.Inthisframework,theupper-layerfactorsexpresscorrelationsbetweenlower-layerfactorsviaanoisy-ormechanism.ThusourmodelmaybeinterpretedasaBayesiannonparametricversionofthenoisy-ornetwork[6,8].Inspecifyingthemodelandinferencescheme,wemakeuseoftherecentstick-breakingconstructionoftheIBP[10].1\fForsimplicityofpresentation,wefocusonatwo-layerhierarchy,thoughthemethodextendsreadilytohigher-ordercases.Weshowhowthecompletemodelisamenabletoef\ufb01cientinferenceviaaGibbssamplingprocedureandcompareperformanceofourhierarchicalmethodwiththestandardIBPconstructiononbothadigitmodelingtask,andamusicgenre-taggingtask.2LatentFactorModelingConsiderasetofNobjectsorexemplars:x
1:N=[x1,x2,...,xN].Wemodelthenthobjectwiththedistributionxn|zn,1:K,\u03b8\u223cF(zn,1:K,\u03b81:K),withmodelparameters\u03b81:K=[\u03b8k]Kk=1(where\u03b8k\u223cHindep.\u2200k)andfeaturevariableszn,1:K=[znk]Kk=1whichwetaketobebinary:znk\u2208{0,1}.Wedenotethepresenceoffeaturekinexamplenasznk=1anditsabsenceasznk=0.Featurespresentinanobjectaresaidtobeactivewhileabsentfeaturesareinactive.Collectively,thefeaturesformatypicallysparsebinaryN\u00d7Kfeaturematrix,whichwedenoteasz1:N,1:K,orsimplyZ.Foreachfeatureklet\u00b5kbethepriorprobabilitythatthefeatureisactive.ThecollectionofKprobabilities:\u00b51:K,areassumedtobemutuallyindependent,anddistributedaccordingtoaBeta(\u03b1/K,1)prior.Summarizingthefullmodel,wehave(indep.\u2200n,k):xn|zn,1:K,\u03b8\u223cF(zn,1:K,\u03b8)znk|\u00b5k\u223cBernoulli(\u00b5k)\u00b5k|\u03b1\u223cBeta!\u03b1K,1\"AccordingtothestandarddevelopmentoftheIBP,wecanmarginalizeovervariables\u00b51:KandtakethelimitK\u2192\u221etorecoveradistributionoveranunboundedbinaryfeaturematrixZ.Inthedevelopmentoftheinferenceschemeforourhierarchicalmodel,wemakeuseofanalternativecharacterizationoftheIBP:theIBPstick-breakingconstruction[10].Aswiththestick-breakingconstructionoftheDirichletprocess(DP),theIBPstick-breakingconstructionprovidesadirectcharacterizationoftherandomlatentfeatureprobabilitiesviaanunboundedsequence.Consideronceagainthe\ufb01nitelatentfactormodeldescribedabove.LettingK\u2192\u221e,Znowpossessesanunboundednumberofcolumnswithacorrespondingunboundedsetofrandomprobabilities[\u00b51,\u00b52,...].Re-arrangedindecreasingorder:\u00b5(1)>\u00b5(2)>...,thesefactorprobabilitiescanbeexpressedrecursivelyas:\u00b5(k)=U(k)\u00b5(k\u22121)=#(l)U(l),whereU(k)i.i.d\u223cBeta(\u03b1,1).3AHierarchyofLatentFeaturesViaaNoisy-ORMechanismInthissectionweextendthein\ufb01nitelatentfeaturesframeworktoincorporateinteractionsbetweenmultiplelayersofunboundedfeatures.Webeginbyde\ufb01ninga\ufb01niteversionofthemodelbeforeconsideringthelimitingprocess.Weconsiderherethesimplesthiera
rchicallatentfactormodelconsistingoftwolayersofbinarylatentfeatures:anupper-layerbinarylatentfeaturematrixYwithelementsynj,andalower-layerbinarylatentfeaturematrixZwithelementsznk.Theprobabilitydistributionovertheelementsynjisde\ufb01nedaspreviouslyinthelimitconstructionoftheIBP:ynj|\u00b5j\u223cBernoulli(\u00b5j),with\u00b5j|\u03b1\u00b5\u223cBeta(\u03b1\u00b5/J,1).Thelowerbinaryvariablesznkarealsode\ufb01nedasBernoullidistributedrandomquantities:znk|yn,:,V:,k\u223cBernoulli(1\u2212$j(1\u2212ynjVjk))indep.\u2200n,k.(1)However,heretheprobabilitythatznk=1isafunctionoftheupperbinaryvariablesyn,:andthekthcolumnoftheweightmatrixV,withprobabilitiesVjk\u2208[0,1]connectingynjtoznk.Thecruxofthemodelishowynjinteractswithznkviaanoisy-ormechanismde\ufb01nedinEq.(1).ThebinaryynjmodulatestheinvolvementoftheVjktermsintheproduct,whichinturnmodulatesP(znk=1|yn,:,V:,k).Thenoisy-ormechanisminteractspositivelyinthesensethatchanginganelementynjfrominactivetoactivecanonlyincreaseP(znk=1|yn:,V:k),orleaveitunchangedinthecasewhereVjk=0.Weinterprettheactiveyn,:tobepossiblecausesoftheactivationoftheindividualznk,\u2200k.ThroughtheweightmatrixV,everyelementofYn,1:JisconnectedtoeveryelementofZn,1:K,thusVisarandommatrixofsizeJ\u00d7K.Inthecaseof\ufb01niteJandK,anobviouschoiceofpriorforVis:Vjki.i.d\u223cBeta(a,b),\u2200j,k.However,lookingaheadtothecasewhereJ\u2192\u221eandK\u2192\u221e,theprioroverVwillrequiresomeadditionalstructure.Recently,[11]introducedtheHierarchicalBetaProcess(HBP)andelucidatedtherelationshipbe-tweenthisandtheIndianBuffetProcess.WeuseavariantoftheHBPtode\ufb01neaprioroverV:\u03bdk\u223cBeta(\u03b1\u03bd/K,1)Vjk|\u03bdk\u223cBeta(c\u03bdk,c(1\u2212\u03bdk)+1)indep.\u2200k,j,(2)2\fxnznkynjVjk!k(cid:78)k(cid:77)jK(cid:109)(cid:100)J(cid:109)(cid:100)N(cid:65)(cid:77)(cid:65)(cid:78)HQli.i.d\u223cBeta(\u03b1\u00b5,1),\u00b5j=j$lQlRli.i.d\u223cBeta(\u03b1\u03bd,1),\u03bdk=k$lRlVjk\u223cBeta(c\u03bdk,c(1\u2212\u03bdk)+1)ynj\u223cBern(\u00b5j)znk\u223cBern(1\u2212$j(1\u2212ynjVjk
)).Figure1:Left:Agraphicalrepresentationofthe2-layerhierarchyofin\ufb01nitebinaryfactormodels.Right:Summaryofthehierarchicalin\ufb01nitenoisy-orfactormodelinthestick-breakingparametrization.whereeachcolumnofV(indexedbyk)isconstrainedtoshareacommonprior.StructuringthepriorthiswayallowsustomaintainawellbehavedpriorovertheZmatrixasweletK\u2192\u221e,groupingthevaluesofVjkacrossjwhileE[\u03bdk]\u21920.Howeverbeyondtheregionofverysmall\u03bdk(0<\u03bdk<<1),wewouldliketheweightsVjktovarymoreindependently.Thuswemodifythemodelof[11]toincludethe+1termtotheprioroverVjk(inEq.(2))andwelimitc\u22641.Fig.1showsagraphicalrepresentationofthecomplete2-layerhierarchicalnoisy-orfactormodel,asJ\u2192\u221eandK\u2192\u221e.Finally,weaugmentthemodelwithanadditionalrandommatrixAwithmultinomialelementsAnk,assigningeachinstanceofznk=1toanindexjcorrespondingtotheactiveupper-layerunitynjresponsibleforcausingtheevent.TheprobabilitythatAnk=jisde\ufb01nedviaafamil-iarstick-breakingscheme.Byenforcingan(arbitrary)orderingovertheindicesj=[1,J],wecanviewthenoisy-ormechanismde\ufb01nedinEq.(1)asspecifying,foreachznk,anorderedseriesofbinarytrials(i.e.coin\ufb02ips).Foreachznk,weproceedthroughtheorderedsetofelements,{Vjk,ynj}j=1,2,...,performingrandomtrials.Withprobabilityyn,j\u2217Vj\u2217,k,trialj\u2217isdeemeda\u201csuc-cess\u201dandwesetznk=1,Ank=j\u2217,andnofurthertrialsareconductedfor{n,k,j>j\u2217}.Conversely,withprobability(1\u2212ynj\u2217Vj\u2217k)thetrialisdeemeda\u201cfailure\u201dandwemoveontotrialj\u2217+1.Sincealltrialsjassociatedwithinactiveupper-layerfeaturesarefailureswithprobabil-ityone(becauseynj=0),weneedonlyconsiderthetrialsforwhichynj=1.If,foragivenznk,alltrialsjforwhichynj=1(active)arefailures,thenwesetznk=0withprobabilityone.Theprobabilityassociatedwiththeeventznk=0isthereforegivenbytheproductofthefailureprobabilitiesforeachoftheJtrials:P(znk=0|yn,:,V:,k)=#Jj=1(1\u2212ynjVjk),andwithP(znk=1|yn,:,V:,k)=1\u2212P(znk=0|yn,:,V:,k),wearriveatthenoisy-ormechanismgiveninEq.(1).Thispr
ocessissimilartothesamplingprocessassociatedwiththeDirichletprocessstick-breakingconstruction[7].Indeed,theprocessdescribedabovespeci\ufb01esastick-breakingcon-structionofageneralizedDirichletdistribution[1]overthemultinomialprobabilitiescorrespondingtotheAnk.ThegeneralizedDirichletdistributionde\ufb01nedinthiswayhastheimportantpropertythatitisconjugatetomultinomialsampling.Withthegenerativeprocessspeci\ufb01edasabove,wecande\ufb01netheposteriordistributionovertheweightsVgiventheassignmentmatrixAandthelatentfeaturematrixY.LetMjk=%Nn=1I(Ank=j)bethenumberoftimesthatthejthtrialwasasuccessforz:,k(i.e.thenumberoftimesynjcausedtheactivationofznk)andletNjk=%Nn=1ynjI(Ank>j),thatisthenumberoftimesthatthej-thtrialwasafailureforznkdespiteynjbeingactive.Finally,letusalsodenotethenumberoftimesy:,jisactive:Nj=%Nn=1ynj.Giventhesequantities,theposteriordistributionsforthemodelparameters\u00b5jandVjkaregivenby:\u00b5j|Y\u223cBeta(\u03b1\u00b5/J+Nj,1+N\u2212Nj)(3)Vjk|Y,A\u223cBeta(c\u03bdk+Mjk,c(1\u2212\u03bdk)+Njk+1)(4)TheseconjugaterelationshipsareexploitedintheGibbssamplingproceduredescribedinSect.4.ByintegratingoutVjk,wecanrecover(uptoaconstant)theposteriordistributionover\u03bdk:3\fp(\u03bdk|A:,k)\u221d\u03bd\u03b1\u03bd/K\u22121kJ$j=1\u0393(c\u03bdk+Mjk)\u0393(c\u03bdk)\u0393(c(1\u2212\u03bdk)+Njk+1)\u0393(c(1\u2212\u03bdk)+1)(5)OnepropertyofthemarginallikelihoodisthatwhollyinactiveelementsofY,whichwedenoteasy:,j\"=0,donotimpactthelikelihoodasNj\",k=0,Mj\",k=0.ThisbecomesparticularlyimportantasweletJ\u2192\u221e.Havingde\ufb01nedthe\ufb01nitemodel,itremainstotakethelimitasbothK\u2192\u221eandJ\u2192\u221e.TakingthelimitofJ\u2192\u221eisrelativelystraightforwardastheupper-layerfactormodelnaturallytendstoanIBP:Y\u223cIBP,anditsinvolvementintheremainderofthemodelislimitedtothesetofactiveelementsofY,whichremains\ufb01nitefor\ufb01nitedatasets.IntakingK\u2192\u221e,thedistributionovertheunbounded\u03bdkconvergestothatoftheIBP,whiletheconditionaldistributionoverthenoisy-orweightsVjkr
emainsimplebetadistributionsgiventhecorresponding\u03bdk(asinEq.(4)).4InferenceInthissection,wedescribeaninferencestrategytodrawsamplesfromthemodelposterior.ThealgorithmisbasedjointlyontheblockedGibbssamplingstrategyfortruncatedDirichletdistribu-tions[7]andontheIBPsemi-orderedslicesampler[10],whichweemployateachlayerofthehierarchy.Becausebothalgorithmsarebasedonthestrategyofdirectlysamplinganinstantiationofthemodelparameters,theirusetogetherpermitsustode\ufb01neanef\ufb01cientextendedblockedGibbssamplerovertheentiremodelwithoutapproximation.Tofacilitateourdescriptionofthesemi-orderedslicesampler,weseparate\u00b51:\u221eintotwosubsets:\u00b5+1:J+and\u00b5o1:\u221e,where\u00b5+1:J+aretheprobabilitiesassociatedwiththesetofJ+activeupper-layerfactorsY+(thosethatappearatleastonceinthedataset,i.e.\u2203i:y+ij\"=1,1\u2264j$\u2264J+)and\u00b5o1:\u221eareassociatedwiththeunboundedsetofinactivefeaturesYo(thosenotappearinginthedataset).Similarly,weseparate\u03bd1:\u221einto\u03bd+1:K+and\u03bdo1:\u221e,andZintocorrespondingactiveZ+andinactiveZowhereK+isthenumberofactivelower-layerfactors.4.1Semi-orderedslicesamplingoftheupper-layerIBPTheIBPsemi-orderedslicesamplermaintainsanunorderedsetofactivey+1:N,1:J+withcorrespond-ing\u00b5+1:J+andV1:J+,1:K,whileexploitingtheIBPstick-breakingconstructiontosamplefromthedistributionoforderedinactivefeatures,uptoanadaptivelychosentruncationlevelcontrolledbyanauxiliaryslicevariablesy.Samplesy.Theuniformlydistributedauxiliaryslicevariables,sycontrolsthetruncationleveloftheupper-layerIBP,where\u00b5\u2217isde\ufb01nedasthesmallestprobability\u00b5correspondingtoanactivefeature:sy|Y,\u00b51:\u221e\u223cUniform(0,\u00b5\u2217),\u00b5\u2217=min&1,min1\u2264j\"\u2264J+\u00b5+j\"\u2019.(6)Asdiscussedin[10],thejointdistributionisgivenbyp(sy,\u00b51:\u221e,Y)=p(Y,\u00b51:\u221e)\u00d7p(sy|Y,\u00b51:\u221e),wheremarginalizingoversypreservestheoriginaldistributionoverYand\u00b51:\u221e.How-ever,givensy,theconditionaldistributionp(ynj\"=1|Z,sy,\u00b51:\u22
1e)=0foralln,j$suchthat\u00b5j\"<sy.Thisisthecruxoftheslicesamplingapproach:Eachsamplesyadaptivelytruncatesthemodel,with\u00b51:J>sy.Yetbymarginalizingoversy,wecanrecoversamplesfromtheoriginalnon-truncateddistributionp(Y,\u00b51:\u221e)withoutapproximation.Sample\u00b5o1:Jo.Fortheinactivefeatures,weuseadaptiverejectionsampling(ARS)[4]tosequen-tiallydrawanorderedsetofJoposteriorfeatureprobabilitiesfromthedistribution:p(\u00b5oj|\u00b5oj\u22121,yo:,\u2265j=0)\u221dexp(\u03b1\u00b5N)n=11n(1\u2212\u00b5oj)n*\u00b7(\u00b5oj)\u03b1\u00b5\u22121(1\u2212\u00b5oj)NI(0\u2264\u00b5oj\u2264\u00b5oj\u22121),until\u00b5oJo+1<sy.TheaboveexpressionarisesfromusingtheIBPstick-breakingconstructiontomarginalizeovertheinactiveelementsof\u00b5:[10].ForeachoftheJoinactivefeaturesdrawn,the4\fcorrespondingfeaturesyo1:N,1:JoareinitializedtozeroandthecorrespondingweightVo1:Jo,1:KaresampledfromtheirpriorinEq.(2).Withtheprobabilitiesforboththeactiveandatruncatedsetofinactivefeaturessampled,thesetoffeaturesarere-integratedintoasetofJ=J++JofeaturesY=[y+1:N,1:J+,yo1:N,1:Jo]withprobabilities\u00b51:J=[\u00b5+1:J+,\u00b5o1:Jo],andcorrespondingweightsVT=[(V+1:J+,1:K)T,(Vo1:Jo,1:K)T].SampleY.Giventheupper-layerfeatureprobabilities\u00b51:J,weightmatrixV,andthelower-layerbinaryfeaturevaluesznk,weupdateeachynjasfollows:p(ynj=1|\u00b5j,zn,:,\u00b5\u2217)\u221d\u00b5j\u00b5\u2217K$k=1p(znk|ynj=1,yn,\u00acj,V:,k)(7)Thedenominator\u00b5\u2217issubjecttochangeifchangingynjinducesachangein\u00b5\u2217(asde\ufb01nedinEq.(6));yn,\u00acjrepresentsallelementsyn,1:JexceptynjTheconditionalprobabilityofthelower-layerbinaryvariablesisgivenby:p(znk|yn,:,V:,k)=(1\u2212#j(1\u2212ynjVjk)).Sample\u00b5+1:J+.OnceagainweseparateYand\u00b51:\u221eintoasetofactivefeatures:Y+withprob-abilities\u00b5+1:J+;andasetofinactivefeaturesYowith\u00b5o1:\u221e.Theinactivesetisdiscardedwhiletheactivesetof\u00b5+1:J+areresampledfromtheposteriordistribution:\u00b5+j|y+:,j\u223cBeta(Nj,1+N\u2212Nj).Atthispointwealsoseparatethelower-layerfac
torsintoanactivesetofK+factorsZ+withcor-responding\u03bd+1:K+,V+1:J+,1:K+anddatalikelihoodparameters\u03b8+;andadiscardedinactiveset.4.2Semi-orderedslicesamplingofthelower-layerfactormodelSamplingthevariablesofthelower-layerIFMmodelproceedsanalogouslytotheupper-layerIBP.Howeverthepresenceofthehierarchicalrelationshipbetweenthe\u03bdkandtheV:,k(asde\ufb01nedinEqs.(3)and(4))doesrequiresomeadditionalattention.Weproceedbymakinguseofthemarginaldistributionovertheassignmentprobabilitiestode\ufb01neasecondauxiliaryslicevariable,sz.Samplesz.Theauxiliaryslicevariableissampledaccordingtothefollowing,where\u03bd\u2217isde\ufb01nedasthesmallestprobabilitycorrespondingtoanactivefeature:sz|Z,\u03bd1:\u221e\u223cUniform(0,\u03bd\u2217),\u03bd\u2217=min&1,min1\u2264k\"\u2264K+\u03bd+k\"\u2019.Sample\u03bdo1:Ko.GivenszandY,therandomprobabilitiesovertheinactivelower-layerbinaryfeatures,\u03bdo1:\u221e,aresampledsequentiallytodrawasetofKofeatureprobabilities,until\u03bdKo+1<sz.Thesamplesaredrawnaccordingtothedistribution:p(\u03bdok|\u03bdok\u22121,Y+,zo:,\u2265k=0)\u221dI(0\u2264\u03bdok\u2264\u03bdok\u22121)(\u03bdok)\u03b1\u03bd\u22121 JYj=1\u0393(c(1\u2212\u03bdok)+Nj)\u0393(c(1\u2212\u03bdok))!\u00d7exp 
\u03b1\u03bdJYj=1\u0393(c)\u0393(c+Nj)N1+\u00b7\u00b7\u00b7+NJXi=0wiciiXl=11l(1\u2212\u03bdok)l!\u00b7(8)Eq.(8)arisesfromthestick-breakingconstructionoftheIBPandfromtheexpressionforP(zo:,>k=0|\u03bdok,Y+)derivedinthesupplementarymaterial[2].HerewesimplynotethatthewiareweightsderivedfromtheexpansionofaproductoftermsinvolvingunsignedStirlingnum-bersofthe\ufb01rstkind.Thedistributionovertheorderedinactivefeaturesislog-concaveinlog\u03bdk,andisthereforeamenabletoef\ufb01cientsampleviaadaptiverejectionsampling(aswasdoneinsampling\u00b5o1:Jo).EachoftheKoinactivefeaturesareinitializedtozeroforeverydataobject,Zo=0,whilethecorrespondingVoandlikelihoodparameters\u03b8oaredrawnfromtheirpriors.Oncethe\u03bd1:Koaredrawn,boththeactiveandinactivefeaturesofthelower-layerarere-integratedintothesetofK=K++KofeaturesZ=[Z+,Zo]withprobabilities\u03bd1:K=[\u03bd+1:K+,\u03bdo1:Ko]andcorrespondingweightmatrixV=[V+1:J+,1:K+,Vo1:J+,1:Ko]andparameters\u03b8=[\u03b8+,\u03b8o].5\fSampleZ.GivenY+andVweuseEq.(1)tospecifytheprioroverz1:N,1:K\u2217.Then,conditionalonthisprior,thedataXandparameters\u03b8,wesamplesequentiallyforeachznk:p(znk|y+n,:,V:,k,zn,\u00ack,\u03b8,\u03bd\u2217)=1\u03bd\u22170@1\u2212J+Yj=1(1\u2212y+njVjk)1Af(xn|zn,:,\u03b8),wheref(xn|zn,:,\u03b8)isthelikelihoodfunctionforthenthdataobject.SampleA.Givenznk,y+n,:andV:,k,wedrawthemultinomialvariableAnktoassignresponsibil-ity,intheeventzik=1,tooneoftheupper-layerfeaturesy+nj,p(Ank=j|znk=1,y+n,:,V:,k)=Vjk\"j\u22121Yi=1(1\u2212y+niVik)#,(9)andify+n,j\"=0,\u2200j$>j\u2020,thenp(Ank=j\u2020|znk=1,y+n,:,V:,k)=#j\u2020\u22121i=1(1\u2212y+niVik)toensurenormalizationofthedistribution.Ifznk=0,thenP(Ank=\u221e)=1.SampleVand\u03bd+1:K+.ConditionalonY+,ZandA,theweightsVareresampledfromEq.(4),followingtheblockedGibbssamplingprocedureof[7].GiventheassignmentsA,theposteriorof\u03bd+kisgiven(uptoaconstant)byEq.(5).Thisdistributionislogconcavein\u03bd+k,thereforewecanonceagainuseARStodrawsamplesoftheposteriorof\u03bd+k,1\u2264k\u2264K+.5ExperimentsIn
thissection,wepresenttwoexperimentstohighlightthepropertiesandcapabilitiesofourhier-archicalin\ufb01nitefactormodel.Ourgoalistoassess,inthesetwocases,theimpactofincludinganadditionalmodelinglayer.Tothisend,andineachexperiment,wecompareourhierarchicalmodeltotheequivalentIBPmodel.Ineachcase,hyperparametersarespeci\ufb01edwithrespecttotheIBP(us-ingcross-validationbyevaluatingthelikelihoodofaholdoutset)andheld\ufb01xedforthehierarchicalfactormodel.Finallyallhyperparametersofthehierarchicalmodelthatwerenotmarginalizedoutwereheldconstantoverallexperiments,inparticularc=1and\u03b1\u03bd=1.5.1ExperimentI:DigitsInthisexperimentwetookexamplesofimagesofhand-writtendigitsfromtheMNISTdataset.Following[10],thedatasetconsistedof1000examplesofimagesofthedigit3wherethehandwrit-tendigitimagesare\ufb01rstpreprocessedbyprojectingontothe\ufb01rst64PCAcomponents.TomodelMNISTdigits,weaugmentboththeIBPandthehierarchicalmodelwithamatrixGofthesamesizeasZandwithi.i.d.zeromeanandunitvarianceelements.Eachdataobject,xnismodeledas:xn|Z,G,\u03b8,\u03c32x\u223cN((zn,:+gn,:)\u03b8,\u03c32XI)where+istheHadamard(element-wise)product.TheinclusionofGintroducesanadditionalsteptoourGibbssamplingprocedure,howevertherestofthehierarchicalin\ufb01nityfactormodelisasdescribedinSect.3.InordertoassessthesuccessofourhierarchicalIFMincapturinghigher-orderfactorspresentintheMNISTdata,weconsiderade-noisingtask.Randomnoise(std=0.5)wasaddedtoapost-processedtestsetandthemodelswereevaluatedinitsabilitytorecoverthenoise-freeversionofasetof500examplesnotusedintraining.Fig.2(a)presentsacomparisonoftheloglikelihoodofthe(noise-free)test-setforboththehierarchicalmodelandtheIBPmodel.The\ufb01gureshowsthatthe2-layernoisy-ormodelgivessig-ni\ufb01cantlymorelikelihoodtothepre-corrupteddatathantheIBP,indicatingthatthenoisy-ormodelwasabletolearnusefulhigher-orderstructurefromMNISTdata.Oneofthepotentialbene\ufb01tsofthestyleofmodelweproposehereisthatthereistheopportunityforlatentfactorsatonelayertosharefeaturesatalowerlayer.Fig.2illu
stratestheconditionalmodeoftherandomweightmatrixV(conditionalonasampleoftheothervariables)andshowsthatthereissigni\ufb01cantsharingoflow-levelfeaturesbythehigher-layerfactors.Fig.2(d)-(e)comparethefeatures(sampledrowsofthe\u03b8matrix)learnedbyboththeIBPandbythehierarchicalnoisy-orfactormodel.Interestingly,thesampledfeatureslearnedinthehierarchicalmodelappeartobeslightlymorespatiallylocalizedandsparse.Fig.2(f)-(i)illustratessomeofthemarginalsthatarisefromtheGibbssamplinginferenceprocess.Interestingly,theIBPmodelinfersagreaternumberoflatentfactorsthatdidthe2-layer6\f00.511.522.533.544.5x 104\u22124.5\u22124\u22123.5\u22123\u22122.5\u22122\u22121.5x 104log likelihoodMCMC iterations  IBP2\u2212layer Noisy\u2212Or model  10203040506070809010051015202500.10.20.30.40.50.60.70.80.9120140160180200010002000300040005000num. active featuresnum. MCMC iterations  010203040050100150200250300num. of objectsnum. active features  123450100200300400500600num. active featuresnum. of objects202530354002000400060008000num. MCMC iterationsnum. 
active featuresIBPHierarchicalIBPHierarchical(cid:8)(cid:70)(cid:9)(cid:8)(cid:71)(cid:9)(cid:8)(cid:72)(cid:9)(cid:8)(cid:73)(cid:9)(cid:8)(cid:65)(cid:9)(cid:8)(cid:66)(cid:9)(cid:8)(cid:67)(cid:9)(cid:8)(cid:68)(cid:9)(cid:8)(cid:69)(cid:9)Figure2:(a)Theloglikelihoodofade-noisedtestset.Corrupted(with0.5-stdGaussiannoise)versionsoftestexampleswereprovidedtothefactormodelsandthelikelihoodofthenoise-freetestsetwasevaluatedforbothanIBP-basedmodelaswellasforthe2-layernoisy-ormodel.Thetwolayermodelshownsubstantialimprovementinloglikelihood.(b)Reconstructionofnoisyexamples.Thetoprowshowstheoriginalvaluesforacollectionofdigits.Thesecondrowshowstheircorruptedversions;whilethethirdandfourthrowshowthereconstructionsfortheIBP-basedmodelandthe2layernoisy-orrespectively.(c)AsubsetoftheVmatrix.TherowsofVareindexedbyjwhilethecolumnsofVareindexedbyk.Theverticalstripingpatternisevidenceofsigni\ufb01cantsharingoflower-layerfeaturesamongtheupper-layerfactors.(d)-(e)Themostfrequent64features(rowsofthe\u03b8matrix)for(d)theIBPandfor(e)the2-layerin\ufb01nitenoisy-orfactormodel.(f)AcomparisonofthedistributionsofthenumberofactiveelementsbetweentheIBPandthenoisy-ormodel.(g)Acomparisonofthenumberofactive(lower-layer)factorspossessedbyanobjectbetweentheIBPandthehierarchicalmodel.(h)thedistributionofupper-layeractivefactorsand(i)thenumberofactivefactorsfoundinanobject.noisy-ormodel(atthe\ufb01rstlayer).However,thedistributionoverfactorsactiveforeachdataobjectisnearlyidentical.ThissuggeststhepossibilitythattheIBPismaintainingspecializedfactorsthatpossiblyrepresentasuperpositionoffrequentlyco-occurringfactorsthatthenoisy-ormodelhascapturedmorecompactly.5.2ExperimentII:MusicTagsReturningtoourmotivatingexamplefromtheintroduction,weextractedtagsandtagfrequenciesfromthesocialmusicwebsiteLast.fmusingtheAudioscrobblerwebservice.Thedataisintheformofcounts1oftagassignmentforeachartist.Ourgoalinmodelingthisdataistoreducethisoftennoisycollectionoftagstoasparserepresentationforeachartist.Wewilladoptadiff
erentapproachtothestandardLatentDirichletAllocation(LDA)documentprocessingstrategyofmodelingthedocument\u2013orinthiscasetagcollection\u2013ashavingbeengeneratedfromamixtureoftagmultino-mials.Wewishtodistinguishbetweenanartistthateveryoneagreesisbothcountryandrockversusanartistthatpeoplearedividedwhethertheyarerockorcountry.Tothisend,wecanagainmakeuseoftheconjugatenoisy-ormodeltomodelthecountdataintheformofbinomialprobabilities,i.e.tothemodelde\ufb01nedinSect.3,weaddtherandomweightsWkti.i.d\u223cBeta(a,b),\u2200k.tconnectingZtothedataXviathedistribution:Xnt\u223cBinomial(1\u2212#k(1\u2212znkW),C)whereCisthelimitonthenumberofpossiblecountsachievable.Thiswouldcorrespondtothenumberofpeoplewhoevercontributedatagtothatartist.InthecaseoftheLast.fmdataC=100.MaintainingconjugacyoverWwillrequireustoaddanassignmentparameter1Thepubliclyavailabledataisnormalizedtomaximumvalue100.7\f801001201401600200400600800num. active featuresMCMC iterations02468050100150200250300num. active featuresnum. objects012340100200300400500600num. active featuresnum. objects2030405060700500100015002000num. 
active featuresMCMC iterations(cid:8)(cid:65)(cid:9)(cid:8)(cid:66)(cid:9)(cid:8)(cid:67)(cid:9)(cid:8)(cid:68)(cid:9)Figure3:Thedistributionofactivefeaturesforthenoisy-ormodelatthe(a)lower-layerand(c)theupper-layer.Thedistributionoveractivefeaturesperdataobjectforthe(b)upper-layerand(d)lower-layer.BntwhoseroleisanalogoustoAnk.Withthemodelthusspeci\ufb01ed,wepresentadatasetof1000artistswithavocabularysizeof100tagsrepresentingatotalof312134counts.Fig.3showstheresultrunningtheGibbssamplerfor10000iterations.Asthe\ufb01gureshows,bothlayersarequitesparse.Generally,mostofthefeatureslearnedinthe\ufb01rstlayeraredominatedbyonetothreetags.Mostfeaturesatthesecondlayercoverabroaderrangeoftags.Thetwomostprobablefactorstoemergeattheupperlayerareassociatedwiththetags(inorderofprobability):1.electronic,electronica,chillout,ambient,experimental2.pop,rock,80s,dance,90sTheabilityofthe2-layernoisy-ormodeltocapturehigher-orderstructureinthetagdatawasagainassessedthoughacomparisontothestandardIBPusingthenoisy-orobservationmodelabove.Themodelwasalsocomparedagainstamorestandardlatentfactormodelwiththelatentrepresentation\u03b7nkmodelingthedatathroughageneralizedlinearmodel:Xnt\u223cBinomial(Logistic(\u03b7n,:O:,t),C),wherethefunctionLogistic(.)isthelogisticsigmoidlinkfunctionandthelatentrepresentation\u03b7nk\u223cN(0,\u03a3\u03b7)arenormallydistributed.Inthiscase,inferenceisperformedviaaMetropolis-HastingsMCMCmethodthatmixesreadily.Thetestdatawasmissing90%ofthetagsandthemod-elswereevaluatedbytheirsuccessinimputingthemissingdatafromthe10%thatremained.Hereagain,the2-LayerNoisy-Ormodelachievedsuperiorperformance,asmeasuredbythemarginalloglikelihoodonaholdoutsetof600artist-tagcollections.Interestinglybothsparsemodels\u2013theIBPandthenoisy-ormodel\u2013dramaticallyoutperformedthegeneralizedlatentlinearmodel.MethodNLLGen.latentlinearmodel(BestDim=30)8.7781e05\u00b10.02e05IBP5.638e05\u00b10.001e052-LayerNoisy-OrIFM5.542e05\u00b10.001e056DiscussionWehavede\ufb01nedanoisy-ormechanismthatallowson
ein\ufb01nitefactormodeltoactasapriorforanotherin\ufb01nitefactormodel.Themodelpermitshigh-orderstructuretobecapturedinafactormodelframeworkwhilemaintaininganef\ufb01cientsamplingalgorithm.ThemodelpresentedhereissimilarinspirittothehierarchicalBetaprocess,[11]inthesensethatbothmodelsde\ufb01neahierarchyofunboundedlatentfactormodels.However,whilethehierarchicalBetaprocesscanbeseenasawaytogroupobjectsinthedata-setwithsimilarfeatures,ourmodelprovidesawaytogroupfeaturesthatfrequentlyco-occurinthedata-set.Itisperhapsmoresimilarinspirittotheworkof[9]whoalsosoughtameansofassociatinglatentfactorsinanIBP,howevertheirworkdoesnotactdirectlyontheunboundedbinaryfactorsasoursdoes.Recentlythequestionofhowtode\ufb01neahierarchicalfactormodeltoinducecorrelationsbetweenlower-layerfactorswasaddressedby[3]withtheirIBP-IBPmodel.However,unlikeourmodel,wherethedependenciesinducedbytheupper-layerfactorsviaannoisy-ormechanism,theIBP-IBPmodelmodelscorrelationsviaanANDconstructthroughtheinteractionofbinaryfactors.AcknowledgmentsTheauthorsacknowledgethesupportofNSERCandtheCanadaResearchChairsprogram.WealsothankLast.fmformakingthetagdatapubliclyavailableandPaulLamereforhishelpinprocessingthetagdata.8\fReferences[1]RobertJ.ConnorandJamesE.Mosimann.ConceptsofindependenceforproportionswithageneralizationoftheDirichletdistribution.JournaloftheAmericanStatisticalAssociation,64(325):194\u2013206,1969.[2]AaronC.Courvile,DouglasEck,andYoshuaBengio.Anin\ufb01nitefactormodelhierarchyviaanoisy-ormechanism:Supplementalmaterial.SupplementtotheNIPSpaper.[3]FinaleDoshi-VelezandZoubinGhahramni.Correlatednonparametriclatentfeaturemodels.InProceedingsofthe25thConferenceonUncertaintyinArti\ufb01cialIntelligence,2009.[4]W.R.GilksandP.Wild.AdaptiverejectionsamplingforGibbssampling.AppliedStatistics,41(2):337\u2013348,1992.[5]TomGrif\ufb01thsandZoubinGhahramani.In\ufb01nitelatentfeaturemodelsandtheindianbuffetprocess.InAdvancesinNeuralInformationProcessingSystems18,Cambridge,MA,2006.MITPress.[6]MaxHenrion.Practicali
ssuesinconstructingabayes\u2019beliefnetwork.InProceedingsoftheProceedingsoftheThirdConferenceAnnualConferenceonUncertaintyinArti\ufb01cialIntelli-gence(UAI-87),page132?139,NewYork,NY,1987.ElsevierScience.[7]HemantIshwaranandLancelotF.James.Gibbssamplingmethodsforstick-breakingpriors.AmericanStatisticalAssociation,96(453):161\u2013173,2001.[8]MichaelKearnsandYishayMansour.Exactinferenceofhiddenstructurefromsampledatainnoisy-ornetworks.InProceedingsofthe14thConferenceonUncertaintyinArti\ufb01cialIntelligence,pages304\u2013310,1998.[9]PiyushRaiandHalDaum\u00b4eIII.Thein\ufb01nitehierarchicalfactorregressionmodel.InDaphneKoller,DaleSchuurmans,YoshuaBengio,andL\u00b4eonBottou,editors,AdvancesinNeuralIn-formationProcessingSystems21,2009.[10]YeeWhyeTeh,DilanG\u00a8or\u00a8ur,andZoubinGhahramani.Stick-breakingconstructionfortheindianbuffetprocess.InProceedingsoftheEleventhInternationalConferenceonArti\ufb01calIntelligenceandStatistics(AISTAT2007).,2007.[11]RomainThibauxandMichaelI.Jordan.Hierarchicalbetaprocessandtheindianbuffetpro-cess.InProceedingsoftheEleventhInternationalConferenceonArti\ufb01calIntelligenceandStatistics(AISTAT2007).,2007.9\f", "award": [], "sourceid": 1100, "authors": [{"given_name": "Douglas", "family_name": "Eck", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Aaron", "family_name": "Courville", "institution": null}]}
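As an editorial aid: the paper's two-layer generative process (the stick-breaking summary in Fig. 1, right, together with the noisy-or link of Eq. (1)) can be sketched as a forward sampler. This is a minimal sketch under the assumption of finite truncation levels J and K (the infinite model would draw features adaptively, as in the slice sampler of Sect. 4); all function and variable names below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_probs(alpha, size):
    """IBP stick-breaking: mu_(k) = prod_{l<=k} U_(l), with U_(l) ~ Beta(alpha, 1) i.i.d.
    The cumulative product yields a strictly decreasing sequence of probabilities."""
    U = rng.beta(alpha, 1.0, size=size)
    return np.cumprod(U)

def sample_two_layer_noisy_or(N, J=15, K=30, alpha_mu=3.0, alpha_nu=3.0, c=1.0):
    """Forward-sample Y and Z from a truncated two-layer noisy-or factor model."""
    mu = stick_breaking_probs(alpha_mu, J)   # upper-layer feature probabilities mu_j
    nu = stick_breaking_probs(alpha_nu, K)   # per-column weight means nu_k
    # V_jk | nu_k ~ Beta(c * nu_k, c * (1 - nu_k) + 1): columns share a common prior
    V = rng.beta(c * nu, c * (1.0 - nu) + 1.0, size=(J, K))
    # y_nj ~ Bernoulli(mu_j)
    Y = rng.random((N, J)) < mu
    # noisy-or link: P(z_nk = 1 | y_n, V) = 1 - prod_j (1 - y_nj * V_jk)
    p_z = 1.0 - np.prod(1.0 - Y[:, :, None] * V[None, :, :], axis=1)
    Z = rng.random((N, K)) < p_z
    return Y, Z, p_z

Y, Z, p_z = sample_two_layer_noisy_or(N=100)
```

Note how the noisy-or line makes the positive-interaction property from Sect. 3 concrete: activating another y_nj can only remove a (1 − V_jk) factor from the product, so P(z_nk = 1) can only increase or stay the same.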