{"title": "Generative Image Modeling Using Spatial LSTMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1927, "page_last": 1935, "abstract": "Modeling the distribution of natural images is challenging, partly because of strong statistical dependencies which can extend over hundreds of pixels. Recurrent neural networks have been successful in capturing long-range dependencies in a number of problems but only recently have found their way into generative image models. We here introduce a recurrent image model based on multi-dimensional long short-term memory units which are particularly suited for image modeling due to their spatial structure. Our model scales to images of arbitrary size and its likelihood is computationally tractable. We find that it outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.", "full_text": "Generative Image Modeling Using Spatial LSTMs\n\nLucas Theis\n\nUniversity of T\u00a8ubingen\n\n72076 T\u00a8ubingen, Germany\nlucas@bethgelab.org\n\nMatthias Bethge\n\nUniversity of T\u00a8ubingen\n\n72076 T\u00a8ubingen, Germany\n\nmatthias@bethgelab.org\n\nAbstract\n\nModeling the distribution of natural images is challenging, partly because of\nstrong statistical dependencies which can extend over hundreds of pixels. Re-\ncurrent neural networks have been successful in capturing long-range dependen-\ncies in a number of problems but only recently have found their way into gener-\native image models. We here introduce a recurrent image model based on multi-\ndimensional long short-term memory units which are particularly suited for image\nmodeling due to their spatial structure. Our model scales to images of arbitrary\nsize and its likelihood is computationally tractable. 
We find that it outperforms the\nstate of the art in quantitative comparisons on several image datasets and produces\npromising results when used for texture synthesis and inpainting.\n\n1 Introduction\n\nThe last few years have seen tremendous progress in learning useful image representations [6].\nWhile early successes were often achieved through the use of generative models [e.g., 13, 23, 30],\nrecent breakthroughs were mainly driven by improvements in supervised techniques [e.g., 20, 34].\nYet unsupervised learning has the potential to tap into the much larger source of unlabeled data,\nwhich may be important for training bigger systems capable of a more general scene understanding. For example, multimodal data is abundant but often unlabeled, yet can still greatly benefit\nunsupervised approaches [36].\nGenerative models provide a principled approach to unsupervised learning. A perfect model of\nnatural images would be able to optimally predict parts of an image given other parts of an image and\nthereby clearly demonstrate a form of scene understanding. When extended by labels, the Bayesian\nframework can be used to perform semi-supervised learning in the generative model [19, 28] while it\nis less clear how to combine other unsupervised approaches with discriminative learning. Generative\nimage models are also useful in more traditional applications such as image reconstruction [33, 35,\n49] or compression [47].\nRecently there has been a renewed strong interest in the development of generative image models\n[e.g., 4, 8, 10, 11, 18, 24, 31, 35, 45, 47]. Most of this work has tried to bring to bear the flexibility of\ndeep neural networks on the problem of modeling the distribution of natural images. One challenge\nin this endeavor is to find the right balance between tractability and flexibility. 
The present article\ncontributes to this line of research by introducing a fully tractable yet highly flexible image model.\nOur model combines multi-dimensional recurrent neural networks [9] with mixtures of experts.\nMore specifically, the backbone of our model is formed by a spatial variant of long short-term\nmemory (LSTM) [14]. One-dimensional LSTMs have been particularly successful in modeling text\nand speech [e.g., 38, 39], but have also been used to model the progression of frames in video [36]\nand very recently to model single images [11]. In contrast to earlier work on modeling images,\nhere we use multi-dimensional LSTMs [9] which naturally lend themselves to the task of generative\nimage modeling due to their spatial structure and ability to capture long-range correlations.\n\n[Figure: panels A and B illustrating the MCGSM]