Paper ID: 246
Title: Dropout Training as Adaptive Regularization
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Update: I have read the rebuttal and my opinion remains unchanged; I still think it's a very nice paper.

Summary of the paper:
In this paper, dropout training is analyzed in terms of its effect as a regularizer. First, the authors discuss Gaussian additive noise and dropout noise in the context of regularizing generalized linear models. They decompose the expected loss into two terms: a loss function over the labels, and a label-independent regularizer based on an expectation over the noise distribution. Using this, they approximate the noise-based regularizer via a second-order Taylor expansion in order to yield closed-form regularizers. In the case of linear regression with normalized features, they recover the standard L2 penalty, while for logistic regression they recover novel penalty functions. They reason and demonstrate that the effect of dropout on logistic regression is to penalize large weights for commonly activated features, while allowing large weights for rare but highly discriminative features. They then cast dropout as a form of adaptive SGD in the same vein as AdaGrad. Finally, the authors exploit the fact that the effective dropout penalty function doesn't depend on the labels, and use this to perform semi-supervised learning.
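
For concreteness, the decomposition and second-order approximation described here can be sketched roughly as follows for a GLM with log-partition function A (my paraphrase; constants and notation should be checked against the paper's own equations):

```latex
% Sketch of the decomposition and its second-order approximation (a paraphrase;
% constants and notation should be checked against the paper's equations).
% GLM loss: \ell(\beta; x, y) = -y\,(x \cdot \beta) + A(x \cdot \beta),
% with unbiased noise: \mathbb{E}_\xi[\tilde{x}] = x.
\[
\mathbb{E}_{\xi}\big[\ell(\beta; \tilde{x}_i, y_i)\big]
  = \ell(\beta; x_i, y_i) + R_i(\beta),
\qquad
R_i(\beta) = \mathbb{E}_{\xi}\big[A(\tilde{x}_i \cdot \beta)\big] - A(x_i \cdot \beta)
  \approx \tfrac{1}{2}\, A''(x_i \cdot \beta)\,
    \mathrm{Var}_{\xi}\big[\tilde{x}_i \cdot \beta\big].
\]
% For dropout with drop probability \delta and features rescaled by 1/(1-\delta):
\[
\mathrm{Var}_{\xi}\big[\tilde{x}_i \cdot \beta\big]
  = \frac{\delta}{1-\delta} \sum_{j} x_{ij}^{2}\, \beta_j^{2}.
\]
```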

Quality:
I believe this paper is of excellent quality. It has a clear motivation and a careful analysis that is backed up with empirical evidence. I particularly like the non-obvious connection to AdaGrad, as well as the semi-supervised extension. Perhaps the only empirical evaluation that is missing is a speed comparison between the quadratic, deterministic approximation and the stochastic version (similar to Wang and Manning).

Clarity:
This paper is quite clear, and it progresses through the various ideas in a cohesive fashion. It would have been nice if some of the more involved derivations omitted from the paper had been included in the supplementary material. It's also not clear to me how they recover the exact penalty for the comparison in Section 2.1. Did they use a Monte Carlo estimate for the exact penalty?

Originality:
This paper makes several original contributions, namely the deterministic penalty for dropout training of GLMs, an analysis of the effect of dropout on feature weighting, the connection with AdaGrad, and the semi-supervised extension.

Significance:
This paper makes a significant contribution to our understanding of the effect of noise-based regularization in supervised learning. I have no doubt that the results in this paper will be built upon in the future, especially the semi-supervised version.
Q2: Please summarize your review in 1-2 sentences
This paper was a pleasure to read. It had interesting analysis, novel connections to previous work, interesting extensions, and a relatively thorough empirical evaluation.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes how the dropout technique (and additive Gaussian
noise) can be seen as a form of adaptive regularization. Dropout has
been shown recently to provide significant performance improvements in
deep learning architectures. However, theoretical explanations
provided so far of why it works so well have not been
conclusive. While this paper is restricted to some forms of
generalized linear models (which are simpler architectures than
typical deep learning architectures), dropout is shown to be
equivalent to a very intuitive adaptive regularization, especially
useful when rare but discriminative features are prevalent. To my
knowledge, this is the first time that such a clear explanation of
dropout is provided. Moreover, a connection is also made with
AdaGrad. A semi-supervised learning algorithm is derived from this
connection, and empirical results show improvement over logistic
regression and the supervised version of the dropout regularizer.

Very interesting and well written paper. I was impressed by the
clarity of the derivations (culminating with Eq. 12), and I'm eager to
see if a similar analysis holds for deep architectures.

One suggestion about the content of this version of the paper (or
eventual subsequent journal versions): Section 4.1 in itself is not
very useful. Although it is interesting, it takes too much space in
the paper for what it's worth. It is always possible to generate
synthetic datasets that favor any given model. Instead, more space could
be devoted to describing more clearly and intuitively the relationship
between the regularizer (Eq. 11) and the Fisher information. A
graphical depiction of the spherical properties of the Fisher
normalization described in lines 243-246 could be useful in that
sense.

Minor comments:

Line 263: x_{ij} should read x^2_{ij}

Provide a reference for Eq. 15.

Line 355: The penalty term is different from Eq. 15.

I have taken into account the authors rebuttal in my final review.
Q2: Please summarize your review in 1-2 sentences
Very interesting connection between dropout and regularization in generalized linear models. Nice application to semi-supervised learning.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
*Summary of the paper:

This paper studies "dropout training" in the framework of feature noising as regularization.

It is known that adding additive Gaussian noise to the features is equivalent to l_2 regularization in a least-squares problem (Bishop).

This paper studies multiplicative Bernoulli feature noising, in a shallow learning architecture, with a general loss function, and shows that it has the effect of adapting the geometry through an "l_2 regularizer" that rescales the features (beta^{\top} D(beta, X) beta).

The matrix D(beta, X) is an estimate of the diagonal of the Fisher information.
It is worth noting that D does not depend on the labels. The equivalent regularizer of dropout is non-convex in general.

A connection is made to AdaGrad, the online learning method introduced by Duchi et al., as both approaches reduce to adapting the geometry towards rare but predictive features.

Since the matrix D(beta, X) can be estimated using unlabeled data, the authors devise a semi-supervised variant of the equivalent dropout regularizer.
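
As a rough illustration of this structure, here is a minimal sketch for logistic regression (my own sketch; the function names and the weighting of unlabeled rows are assumptions, not the paper's code):

```python
import numpy as np

def diag_fisher_estimate(beta, X):
    # Diagonal of the estimated Fisher information for logistic regression:
    # D_jj = sum_i A''(x_i . beta) * x_ij^2, with A''(z) = sigma(z) * (1 - sigma(z)).
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return (p * (1.0 - p)) @ (X ** 2)          # vector of length d

def dropout_penalty(beta, X_labeled, X_unlabeled=None, delta=0.5):
    # Quadratic dropout penalty of the form beta^T D beta described above.
    # Because D does not depend on the labels, unlabeled rows can be pooled in;
    # rescaling to the labeled-data size is my own choice, not necessarily the
    # paper's exact weighting.
    X = X_labeled if X_unlabeled is None else np.vstack([X_labeled, X_unlabeled])
    D = (X_labeled.shape[0] / X.shape[0]) * diag_fisher_estimate(beta, X)
    return 0.5 * delta / (1.0 - delta) * float(D @ (beta ** 2))
```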


Comments:

- Dropout as originally introduced by Hinton et al. considers multiplicative Bernoulli noising for deep learning architectures (neural nets).
An averaged neural net is obtained by training many neural nets in which some hidden units are dropped out at random.
Hinton's dropout seems to be close to ensemble methods (boosting, bagging, etc.), and a different analysis is required to understand this paradigm even with one layer.

Referring to Hinton's dropout in the introduction is a bit misleading, as there is no model averaging in the current paper.
It would be more precise to mention that this work considers another form of dropout, where we learn one model that is robust to multiplicative Bernoulli perturbations.


- The regularizer is not an l_2 regularizer, as D also depends on beta in a nonlinear way through the Hessian of the likelihood; one should be careful with this appellation.

- It is interesting to derive the approximate equivalent regularizer in order to understand the effect of dropout. The non-convexity of the regularizer poses computational issues. The authors mention in a footnote that they use L-BFGS; more explanation and discussion are needed to understand how local minima are avoided.
The non-convexity of the equivalent regularizer is not discussed in Section 6.

- The parameter delta/(1 - delta) corresponds to a regularization parameter. Any insight on how to choose delta? According to Hinton et al., delta = 1/2 seems to give good performance.

- The log-partition function is sometimes hard to obtain in closed form; any insight on how to get around that?


The authors promptly answered the questions raised by the reviewer.
Q2: Please summarize your review in 1-2 sentences
The paper is well written, and it is a step towards understanding the effect of dropout.
It is, though, analyzing a different problem than the original dropout introduced by Hinton et al. as an ensemble method (averaging many randomly sampled models); this should be made clear.
This paper shows how to learn a single model that is robust on average against multiplicative Bernoulli noise, while Hinton's dropout shows how to learn an average of randomly sampled models.




Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thank you for your thoughtful feedback and helpful suggestions. We look forward to revising our paper in light of your comments. Many of the referee comments, especially regarding convexity, highlight interesting directions for new research.

Below are some responses to individual reviewer comments.

Assigned_Reviewer_4:

> Perhaps the only empirical evaluation that is missing is a speed comparison between the quadratic, deterministic approximation and the stochastic version (similar to Wang and Manning).

We did run these experiments. In the case of two-class logistic regression, our quadratic approximation behaves very similarly to the Gaussian approximation of Wang and Manning, both in terms of speed and accuracy. The advantage of our second-order expansion is that it generalizes naturally to a semi-supervised setup (as emphasized in this paper). In follow-up research, we have also applied our second-order method to more complicated forms of structured prediction such as conditional random fields.

> It would have been nice if some of the more involved derivations not included in the paper had been included in the supplementary material.

We can add some more detailed derivations to our next draft.

> It's also not clear to me how they recover the exact penalty for the comparison in section 2.1. Did they use a Monte Carlo estimate for the exact penalty?

Yes, we were only able to evaluate the exact penalty by Monte Carlo.
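
For concreteness, such a Monte Carlo estimate can be sketched as follows for logistic regression (an illustrative sketch; the sample count and other details here are placeholders, not the exact configuration used for the figure in the paper):

```python
import numpy as np

def mc_exact_dropout_penalty(beta, X, delta=0.5, n_samples=2000, seed=0):
    # Monte Carlo estimate of the exact dropout penalty for logistic regression,
    #   R(beta) = sum_i ( E_xi[A(x_tilde_i . beta)] - A(x_i . beta) ),
    # with log-partition A(z) = log(1 + exp(z)). Illustrative sketch only.
    rng = np.random.default_rng(seed)
    A = lambda z: np.logaddexp(0.0, z)           # numerically stable log(1 + e^z)
    noised = 0.0
    for _ in range(n_samples):
        keep = rng.random(X.shape) > delta       # drop each entry with probability delta
        X_tilde = X * keep / (1.0 - delta)       # rescale so that E[X_tilde] = X
        noised += np.sum(A(X_tilde @ beta))
    return noised / n_samples - np.sum(A(X @ beta))
```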

Assigned_Reviewer_5:

> A graphical depiction of the spherical properties of the Fisher normalization described in lines 243-246 could be useful.

Thank you for this suggestion. We will add such an illustration to our next draft.

Assigned_Reviewer_6:

> Referring to Hinton's dropout in the introduction is a bit misleading, as there is no model averaging in the current paper. It would be more precise to mention that this work considers another form of dropout, where we learn one model that is robust to multiplicative Bernoulli perturbations.

We will clarify our language to avoid any confusion.

> The non-convexity of the regularizer poses computational issues. The authors mention in a footnote that they use L-BFGS; more explanation and discussion are needed to understand how local minima are avoided.

Questions surrounding the convexity of our method appear to be particularly interesting. Although our objective is not formally convex, we have not encountered any major difficulties in fitting it for datasets where n is reasonably large (say on the order of hundreds). When working with L-BFGS, multiple restarts with random parameter values give almost identical results. The fact that we have never really had to struggle with local minima suggests that there is something interesting going on here in terms of convexity. We are actively studying this topic, and we hope to gain more clarity about it.
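
For reference, the restart procedure can be sketched along these lines (an illustration using scipy with placeholder settings, not our actual experimental code):

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_restarts(objective, dim, n_restarts=10, seed=0):
    # Run L-BFGS from several random initializations and keep the best solution.
    # `objective` maps a parameter vector to a scalar loss; gradients are
    # approximated numerically here for simplicity. Illustrative sketch only.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        beta0 = rng.normal(scale=0.1, size=dim)   # random starting point
        res = minimize(objective, beta0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun
```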

> The parameter delta/(1 - delta) corresponds to a regularization parameter. Any insight on how to choose delta? According to Hinton et al., delta = 1/2 seems to give good performance.

The tuning parameter delta/(1 - delta) behaves just like the lambda parameter in L2 regularization. We can set this parameter by cross-validation. In practice, we got good results by just using delta = 0.5 (i.e., delta/(1 - delta) = 1). We used delta = 0.5 in our experiments, but tuned delta for the simulation study.

> The log-partition function is sometimes hard to obtain in closed form; any insight on how to get around that?

Good point. The log-partition function is always tractable for the examples discussed in our paper, and this is part of what makes our method much faster than actual dropout. However, in some applications, the log-partition function can be more difficult to work with. Thankfully, we can often use special-case tricks to do efficient computations with the log-partition function even when it does not admit a closed-form representation. We have not tried to apply our method to fully generic exponential families in which there is no way of efficiently computing the partition function.
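
For reference, the standard closed-form log-partition functions and curvatures for common GLMs are listed below (textbook results given only for convenience; not all of these necessarily appear in the paper):

```latex
% Standard closed-form log-partition functions A(z) and curvatures A''(z) for
% common GLMs, where \sigma(z) = 1/(1 + e^{-z}). Listed for reference only.
\[
\text{Gaussian: } A(z) = \tfrac{1}{2} z^{2},\; A''(z) = 1; \qquad
\text{logistic: } A(z) = \log\!\big(1 + e^{z}\big),\; A''(z) = \sigma(z)\big(1 - \sigma(z)\big); \qquad
\text{Poisson: } A(z) = e^{z},\; A''(z) = e^{z}.
\]
```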