NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 2062
Title: Learnable Visual Markers

Reviewer 1

Summary

This paper presents an approach to training a network that generates decodable visual markers. The authors also employ a texture generation network to produce visually human-friendly markers. The framework jointly trains a synthesizer network (input: binary bit vector; output: m-by-m image), a rendering network (input: image; output: warped image), and a recognizer network (input: image; output: probabilistic bit vector) in an end-to-end fashion with a bitwise classification loss.
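For concreteness, a minimal sketch of that pipeline as I understand it; the layer sizes, the affine-only renderer stand-in, and all names here are my own illustrative assumptions, not the paper's exact architecture:

    # Minimal sketch of the joint synthesizer -> renderer -> recognizer pipeline.
    # Layer sizes and the affine-only renderer are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_bits, m = 64, 32  # payload length and marker side (hypothetical values)

    synthesizer = nn.Sequential(            # bit vector -> m x m marker image
        nn.Linear(n_bits, m * m), nn.Sigmoid())

    recognizer = nn.Sequential(             # image -> per-bit logits
        nn.Flatten(), nn.Linear(m * m, 256), nn.ReLU(),
        nn.Linear(256, n_bits))             # sigmoid is folded into the loss

    def render(img):
        # Differentiable stand-in for the rendering network: a random affine
        # warp plus noise, so gradients flow from recognizer to synthesizer.
        theta = torch.eye(2, 3).unsqueeze(0) + 0.1 * torch.randn(1, 2, 3)
        grid = F.affine_grid(theta, img.shape, align_corners=False)
        warped = F.grid_sample(img, grid, align_corners=False)
        return warped + 0.05 * torch.randn_like(img)

    bits = torch.randint(0, 2, (1, n_bits)).float()
    marker = synthesizer(bits).view(1, 1, m, m)
    logits = recognizer(render(marker))
    loss = F.binary_cross_entropy_with_logits(logits, bits)  # bitwise loss
    loss.backward()  # one end-to-end training step (optimizer omitted)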

Qualitative Assessment

Interesting paper with a very novel application. Although the experiments lack some implementation details and experimental conditions, I find the application exciting and novel (I haven't seen generative neural networks applied to coding with visual markers before). I wasn't able to find any reference to Figure 5 in the main text. Where do the black bounding boxes in Figure 5 come from? Does the system run a separate detector as a preprocessing step? This can be addressed in the camera-ready.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The basic idea this paper is built upon is to ditch the usual feature-driven design of visual markers and instead jointly learn the marker appearance and the associated recognizer. This joint learning is performed with the goal of reaching the desired recognition accuracy while also accounting for the desired information payload of the marker itself. Optionally, the marker can embed a base picture, which can be used during learning to control (to some degree) the final appearance.

Qualitative Assessment

I appreciated this paper a lot. The main idea is novel and fresh, and it makes very good use of learning. This is basically an application paper, but I do not consider that a showstopper, since the application is sound and the results interesting. Some reviewers could argue that the contribution is minimal with respect to the "learning" state of the art; however, I think the impact is more than fine with respect to the overall community, and the potential impact on the marker community would help spur interest in learning (as an additional bonus). Moreover, the approach is promising, since it could be extended (maybe in a journal version) to also handle pose estimation, which is not the focus of this study but is still important for many applications.

With respect to the literature review, I think the authors should cite marker designs that exploit projective invariants to enable recognition and pose estimation directly on the image plane, such as: Bergamasco, F., Albarelli, A., Torsello, A.: Image-space marker detection and recognition using projective invariants. In: Proc. 3DIMPVT 2011, pp. 381-388 (2011); and Bergamasco, F., Albarelli, A., Torsello, A.: Pi-Tag: a fast image-space marker design based on projective invariants. Machine Vision and Applications 24(6), 1295-1310 (2013).

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

The paper proposes a new approach to designing visual markers, which are obtained from input bit strings by a synthesizer network. A recognizer network is trained to recover the bit strings from photos of these markers. Both are deep networks, trained simultaneously in a joint backpropagation process. A classification network can additionally be inserted into the learning in order to shift the marker appearance towards some texture prototype.
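If that texture-prototype term is a Gram-matrix texture loss in the style of Gatys et al., a minimal sketch could look as follows (the feature extractor, the layer cut-off, and the weighting are my assumptions, not necessarily the paper's):

    # Sketch of a Gram-matrix texture term that could pull the marker's
    # appearance toward a prototype. In practice pretrained VGG weights
    # would be loaded; weights=None just keeps the sketch self-contained.
    import torch
    import torchvision.models as models

    features = models.vgg16(weights=None).features[:9].eval()  # up to relu2_2

    def gram(x):
        b, c, h, w = x.shape
        f = x.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def texture_loss(marker_rgb, prototype_rgb):
        # Both inputs are (N, 3, H, W); a grayscale marker would be
        # replicated across the three channels first.
        return torch.mean((gram(features(marker_rgb)) -
                           gram(features(prototype_rgb))) ** 2)

    # total objective: bitwise classification loss + lambda * texture term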

Qualitative Assessment

When describing the implementation of the renderer, the authors could include figures showing the architectures of the synthesizer, the renderer, and the recognizer. In the experiments, the authors could compare the learnable visual markers with other existing visual markers to show that the learned markers are more practical. They could also propose evaluation measures for comparing these markers (see the sketch below).
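For instance, one simple measure could be the bit accuracy of the recovered payload, reported per distortion type and payload size; a minimal sketch (the names are mine, not the paper's):

    # Hypothetical evaluation measure: fraction of payload bits recovered
    # correctly by the recognizer.
    import torch

    def bit_accuracy(logits: torch.Tensor, bits: torch.Tensor) -> float:
        # logits: recognizer outputs; bits: ground-truth 0/1 payload.
        return ((logits > 0).float() == bits).float().mean().item()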

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

In this paper, the authors propose a way to generate markers (images which encode information, like a QR code) by training an end-to-end network to maximize recognizability in marker-aided scenarios for robots. The network consists of a generator, which encodes information and generates a marker; a renderer, which simulates how the markers are placed in a real environment; and a recognizer, responsible for information extraction. Qualitative experiments are detailed, and results under various scenarios are compared.

Qualitative Assessment

Briefly, the overall idea of serializing the whole encoding / environment-simulation / recognition chain into an end-to-end network is interesting and worth exploring. The presentation is clear and comprehensive. To me, however, there is still large room to develop the idea and its applications; specifically, more experiments should have been done to support and demonstrate it.

About the idea. The end-to-end learning is intuitive and shown effective, at least qualitatively, but the model structure binds the synthesizer and the recognizer together, assuming that we already know the recognizer in the first place. In real situations, the two parts are mostly fully decoupled, and this structure limits the applications. For example, different robots may vary in computational capability, yet they would all need to keep the same recognizer network in memory. In other words, unlike a barcode or a QR code, whose protocol includes only the encoding and decoding rules for the sequence/image, this method also takes the recognizer into its "protocol".

Another concern is the rendering network. Its purpose is to simulate the real environment as faithfully as possible, so it directly determines the performance of the method. To ensure differentiability of the end-to-end network, the rendering network cannot perform some transformations, such as the bending mentioned by the authors and other non-linear transformations, although this can be worked around by limiting the marker transformation types allowed in the real world. Also, rendering in the real world is far more complicated than 2D rendering (for example, shadows).

One concern on the texture loss. Its purpose is to incorporate aesthetics into the markers. However, the loss only enforces similarity to the original image; no patterns are learned, as shown in Figure 4. What one would expect are patterns that resemble a QR code: differing in content but following the same style. In this respect, the texture loss may harm the recognition accuracy, which limits its use in the real world.

About the experiments. Since there seems to be no universal criterion for the performance of a protocol, it is acceptable to have only qualitative results and no quantitative comparisons. However, results of existing protocols should be reported and compared, such as those of a QR code or a barcode. For example, the encoder could be replaced with a QR encoder, and the recognizer with a QR decoder (an off-the-shelf one, or a trained network with exactly the same structure as used in the experiments); a sketch of such a baseline follows below. Question: in Figure 5, what are the black boxes? For the renderer, it would also be interesting to investigate how to encode information so that it resists certain transformations: affine transformations only, lighting variations only, motion blur only, optical blur only, or even no transformations at all (assuming a robot can adjust its pose to locate the marker and take an image almost identical to the originally generated one). This could provide perspective on how to design new protocols that reduce the effects of specific transformations.
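A minimal sketch of such a QR baseline, assuming off-the-shelf qrcode/pyzbar libraries and a hypothetical apply_simulated_distortions helper standing in for the paper's renderer:

    # Hypothetical QR baseline: encode the same payload with an off-the-shelf
    # QR encoder, distort it as the renderer would, decode with a standard
    # reader. The library choice (qrcode, pyzbar) is illustrative only.
    import qrcode
    from pyzbar.pyzbar import decode

    payload = "101100111010"                 # same bit string fed to the synthesizer
    img = qrcode.make(payload).convert("L")  # PIL image of the QR marker
    # img = apply_simulated_distortions(img) # hypothetical: reuse the renderer
    results = decode(img)
    recovered = results[0].data.decode() if results else None
    print(recovered == payload)              # baseline recognition success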

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)