Viet Huynh, He Zhao, Dinh Phung
We present an optimal transport framework for learning topics from textual data. While the celebrated Latent Dirichlet allocation (LDA) topic model and its variants have been applied to many disciplines, they mainly focus on word-occurrences and neglect to incorporate semantic regularities in language. Even though recent works have tried to exploit the semantic relationship between words to bridge this gap, however, these models which are usually extensions of LDA or Dirichlet Multinomial mixture (DMM) are tailored to deal effectively with either regular or short documents. The optimal transport distance provides an appealing tool to incorporate the geometry of word semantics into it. Moreover, recent developments on efficient computation of optimal transport distance also promote its application in topic modeling. In this paper we ground on optimal transport theory to naturally exploit the geometric structures of semantically related words in embedding spaces which leads to more interpretable learned topics. Comprehensive experiments illustrate that the proposed framework outperforms competitive approaches in terms of topic coherence on assorted text corpora which include both long and short documents. The representation of learned topic also leads to better accuracy on classification downstream tasks, which is considered as an extrinsic evaluation.