Paper ID: 759
Title: Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper proposes a general method for solving image-based computer vision tasks using a generative probabilistic model that uses a graphics program to generate images. The method takes the standard Bayesian approach to frame the inference of the target variables, and uses Metropolis-Hastings to perform the inference. This framework is implemented for a CAPTCHA and a road-finding application, with favorable results reported for each one.
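The render-and-compare inference loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scene prior, "renderer", and Gaussian pixel likelihood are placeholder stand-ins for the paper's probabilistic graphics program.

```python
import random
import math

def sample_scene_prior():
    # Hypothetical latent scene: two scalar scene variables in [0, 1].
    return [random.uniform(0.0, 1.0) for _ in range(2)]

def log_prior(scene):
    # Uniform prior on [0, 1] for each latent variable.
    return 0.0 if all(0.0 <= s <= 1.0 for s in scene) else float("-inf")

def render(scene):
    # Placeholder "graphics program": deterministically map latents to an image vector.
    return [s * 10.0 for s in scene]

def log_likelihood(rendered, observed, noise=1.0):
    # Gaussian pixel-wise comparison between the rendering and the data.
    return -sum((r - o) ** 2 for r, o in zip(rendered, observed)) / (2 * noise ** 2)

def metropolis_hastings(observed, n_steps=5000, step=0.05):
    scene = sample_scene_prior()
    logp = log_prior(scene) + log_likelihood(render(scene), observed)
    for _ in range(n_steps):
        proposal = [s + random.gauss(0.0, step) for s in scene]
        logp_new = log_prior(proposal) + log_likelihood(render(proposal), observed)
        if math.log(random.random()) < logp_new - logp:  # MH accept/reject
            scene, logp = proposal, logp_new
    return scene

random.seed(0)
true_scene = [0.3, 0.7]
recovered = metropolis_hastings(render(true_scene))
```

Running the chain long enough, `recovered` settles near `true_scene`; the chain's samples also give the posterior uncertainty over scene variables that the paper exploits.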

The primary contribution of this paper is the proposal to use graphics programs as the core of a generative model for image-based tasks. While the claim that no previous real-world image interpretation framework combines computer graphics with the other elements they list (last paragraph of Section 1) seems accurate, the proposed system qualifies as such a framework only under a restricted interpretation. Specifically, the CAPTCHA problem is "real-world" in one sense, but CAPTCHAs are generated by a synthetic process and are therefore qualitatively very different from such real-world problems as object recognition or detection in natural images. This distinction matters because one of the main reasons discriminative approaches have gained favor is their ability to cope with the complex, noisy features found in natural images. The authors would presumably agree that their case is severely weakened if their framework is limited to synthetic images such as CAPTCHAs.

As for the road-finding problem, it certainly qualifies as a real-world problem and deals with natural images, but it is a stretch to call what they use a graphics program. The model generates only the properties (height, width, and lane size) of the lines that form the lanes of the road, and the "renderer" merely assigns pixels to the regions delineated by these properties rather than producing a realistic rendering of the scene. This is perfectly reasonable, but it is no more a computer graphics-based approach than the many previous approaches that align geometric elements to a scene (e.g., [3], [8], Hedau et al. ICCV '09, Lee et al. NIPS '10, Gupta et al. ECCV '10, and many similar projects by some of those papers' authors). As such, it should be seen as another work in this line of research, not the unique and novel direction it would be if the generative model produced something closer to an actual image than geometric lines.

The authors point out that many other works require a good deal of application-specific engineering, which their method does not. While this is true, they also concede that such engineering would be a useful addition. In fact, in the Introduction they point to [17, 20] as examples of methods that rely on custom inference strategies -- specifically, MCMC proposals that incorporate bottom-up cues -- in contrast to their more general approach, and then in the Discussion they say that discriminatively trained proposals would likely be important for extending their approach to more complex scenes.

Novelty aside, the proposed framework has attractive properties, the paper is well written, and the recent line of generative models for computer vision problems seems worth pursuing and has yielded promising results. While the probabilistic aspect is relatively straightforward, it is simple and easily adjusted to the needs of the application (such as incorporating rotation), and the use of Metropolis-Hastings is elegant and general. The authors report favorable results on two interesting applications. It would have been better to compare the system's CAPTCHA performance against additional state-of-the-art approaches. Furthermore, while they compare directly to [1] on the road-finding application, it would be helpful to provide some means of comparison to the other results reported in the dataset's original paper ([4]).

As a technical note, it seems that the first equation should contain an integral over X, which is latent and does not appear on the left-hand side.
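For reference, the marginalization the reviewer has in mind would take roughly this form (the notation here is illustrative, since the paper's exact symbols are not reproduced in the review): with scene S, latent rendering/noise variables X, and observed image I_D,

```latex
P(S \mid I_D) \;\propto\; P(S) \int P(X)\, P\big(I_D \mid I_R(S, X)\big)\, dX
```

where I_R(S, X) denotes the output of the approximate renderer. In practice the MCMC sampler sidesteps the explicit integral by sampling X jointly with S.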
Q2: Please summarize your review in 1-2 sentences
While the proposed generative probabilistic approach is attractive, its novelty is far overstated. The authors report positive results in two applications, which are enough to validate the approach in general but not enough to warrant acceptance as an application paper.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
The manuscript reports a system for image interpretation. In brief, an image is characterized by a latent representation which is a graphics program which, when executed, will generate the image (at least, approximately). To infer the parameters of the graphics program, the system performs Metropolis-Hastings steps.

Overall, I like this manuscript a lot. In some sense, the ideas are not new --- analysis by synthesis has been around for a long time. But the implementation of these ideas here is elegant. I like the notion of characterizing images as latent graphics programs. In some ways, I also like the idea of searching for good parameters of the graphics programs via Metropolis-Hastings steps (although, in many cases, this is likely to be very computationally expensive). In addition, I like the two applications in the manuscript (reading CAPTCHAs and finding roads in images). And the manuscript is generally well written (though I wish it contained more details).

My biggest concern is that the system might not scale well. It seems likely that the system will require an enormous number of Metropolis-Hastings steps when applied to large-scale real-world image understanding problems. I wish that the authors commented on this in the manuscript.
Q2: Please summarize your review in 1-2 sentences
In summary, I really liked this manuscript a lot. Elegant and thoughtful implementation. Excellent applications of system to interesting data sets.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper builds a generative framework for analyzing images: the content of a scene has some distribution; the content is then mapped to an image by an approximate renderer; and a probabilistic model compares this image to the actual data, under the control of a set of latent variables that effectively control the smoothing.

The paper is well written and easy to read. My expertise is more in Bayesian modeling than in computer vision, but the method presented in this paper appears to be a potentially significant step forward for computer vision. I believe the approach is novel, and the results presented here are promising. The authors also outline some very interesting future directions for this work, and I look forward to seeing more.

In particular, the two applications here seem to be captured by a relatively simple feature space under this model. I wonder how much feature selection will affect performance in more complex scenes.

One complaint is that the figure legends & axis labels were ridiculously small and hard to read.
Q2: Please summarize your review in 1-2 sentences
This is a solid paper, and the approach to the problem appears to be novel and promising.
Author Feedback

Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their thoughtful comments, constructive criticism, and encouragement.

1. Reviewer 3 expresses concerns about novelty, especially considering geometric scene models from the recent vision literature, and the potential role of discriminative proposals.

We see our main contribution as an approach to generative modeling in vision, where solving a new task in a new domain requires specifying only domain-specific aspects of scene and image representation, but no domain-specific (or even domain-general) aspects of inference. That is, we do have to specify the hypothesis space of scenes (expressed as a probabilistic program) and an appropriate image representation (which can be viewed as a noisy observation model, or family of similarity metrics for approximate Bayesian computation). But we do not have to specify or implement an inference algorithm. Instead, generic, automatic inference for probabilistic programs is used to interpret images, and in fact can report posterior uncertainty about scene composition.

Regarding bottom-up proposals: our suggestion for scaling further was not to rely on custom engineering, but instead leverage probabilistic programming and ABC. For example, we could automatically build detectors from the generative model by training on sampled scene+image pairs, and integrate this into the underlying probabilistic programming system.
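The "detectors trained on sampled scene+image pairs" idea can be sketched as follows. Everything here is a hypothetical stand-in: a one-dimensional scene latent, a toy stochastic renderer, and a linear regression in place of a real detector; the point is only the pipeline of training a bottom-up proposal on data drawn from the generative model itself.

```python
import random

def sample_scene():
    # Hypothetical one-dimensional scene latent.
    return random.uniform(0.0, 1.0)

def render_noisy(scene):
    # Stand-in stochastic renderer: scene value plus observation noise.
    return scene + random.gauss(0.0, 0.1)

# 1. Draw training pairs (scene, image) from the generative model itself.
random.seed(0)
pairs = []
for _ in range(2000):
    s = sample_scene()
    pairs.append((s, render_noisy(s)))

# 2. Fit a trivial "detector": least-squares regression from image back to scene.
n = len(pairs)
mean_s = sum(s for s, _ in pairs) / n
mean_i = sum(i for _, i in pairs) / n
cov = sum((s - mean_s) * (i - mean_i) for s, i in pairs) / n
var_i = sum((i - mean_i) ** 2 for _, i in pairs) / n
slope = cov / var_i
intercept = mean_s - slope * mean_i

def bottom_up_proposal(image):
    # Data-driven MH proposal centred on the detector's prediction.
    return intercept + slope * image + random.gauss(0.0, 0.05)
```

In an MCMC loop, `bottom_up_proposal` would be mixed with generic random-walk moves, giving proposals that jump directly toward plausible scenes.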

2. Reviewer 3 also expresses doubts about the presence of a meaningful renderer in our road scene application, and about the relevance of our CAPTCHA results for more typical vision problems, where the input image data is not generated by a computer program.

Our road scene model does include a full generative model for images, based on extremely rough, per-region color histograms. We chose to implement the likelihood calculation for this renderer using adaptive k-means quantization of colors for speed and simplicity (rather than e.g. an explicit Gaussian mixture model, per region, with a per-pixel mixture component, integrated out to yield a likelihood). If our paper is accepted we will include full road scene renderings (instead of just the surface geometry with arbitrary colors that was included in our submitted version) so readers can get a sense of how cartoonish an approximate, stochastic renderer can be and still support posterior inference.
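A minimal sketch of a per-region quantized-color likelihood of this flavor is shown below. The uniform quantizer here is a crude stand-in for the adaptive k-means quantization described above, and the histograms and pixel values are made up for illustration.

```python
from collections import Counter
import math

def quantize(pixel, n_bins=4):
    # Crude stand-in for adaptive k-means: uniform quantization of a grey value in [0, 1).
    return min(int(pixel * n_bins), n_bins - 1)

def region_log_likelihood(pixels, histogram, n_bins=4, alpha=1.0):
    # Multinomial log-likelihood of a region's quantized pixels under a
    # per-region color histogram, with add-alpha smoothing to avoid log(0).
    counts = Counter(quantize(p, n_bins) for p in pixels)
    total = sum(histogram) + alpha * n_bins
    return sum(c * math.log((histogram[b] + alpha) / total)
               for b, c in counts.items())

# Example: a region whose pixels are mostly dark, scored against two
# hypothesised per-region histograms (bin counts over quantized colors).
road_hist = [80, 15, 4, 1]   # "road" regions are mostly dark
sky_hist = [1, 4, 15, 80]    # "sky" regions are mostly bright
dark_pixels = [0.05, 0.1, 0.2, 0.15, 0.08]

ll_road = region_log_likelihood(dark_pixels, road_hist)
ll_sky = region_log_likelihood(dark_pixels, sky_hist)
```

The dark region scores higher under the road histogram than the sky histogram, which is all the renderer needs to steer inference, however cartoonish its output.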

We agree that naturalistic image settings are interesting, and have begun preliminary work on applying GPGP to reading degraded text from camera images, with distortions due to motion blur, out-of-plane rotation, warping of the underlying surface, etc. We think CAPTCHAs highlight interesting, hard aspects of scene parsing that are de-emphasized in typical object recognition tasks, such as the handling of extreme occlusion/object overlap, and complement our road scene application, which emphasizes variability in appearance against a comparatively simple hypothesis space of scenes.

3. Reviewer 4 points out that significantly more complex scenes could potentially result in very slow inference convergence. Understanding this aspect of scaling is a very important area for future work. For example, we have found that increasing the dimensionality in the CAPTCHA problem by increasing the number of letters by ~5x does not affect the number of MH steps needed for reliable convergence; instead, the difficulty seems to be determined by the degree of overlap of the letters.

4. Reviewer 7 observes that the image representations we used in our applications are simple --- raw pixels for CAPTCHAs, and highly quantized pixels for road scenes --- and that more complex scenes might benefit from (or perhaps even require) more sophisticated or carefully designed representations, analogously to the design of distance functions in more traditional ABC applications. This is an important issue to be explored in future work, e.g. by comparing typical generic image representations from vision (extracted from both a rendered image and the input data) to learned representations as well as simpler pixel-based ones.