Paper ID: | 4222 |
---|---|
Title: | How degenerate is the parametrization of neural networks with the ReLU activation function? |

The paper builds on a previous distinction between the weights of a neural network and the function that it realizes, and investigates under what conditions inverse stability can be guaranteed: that is, whether networks with similar realizations are also close in the parameter space (or, more accurately, whether there exists an equivalent parametrization of one network, i.e. one with the same realization, that is close in the parameter space to the other). First, the authors show that inverse stability fails w.r.t. the uniform norm, which motivates the investigation of inverse stability w.r.t. the Sobolev norm. Thereafter, the authors identify 4 cases where inverse stability fails even w.r.t. the Sobolev norm. Lastly, they show that under suitable assumptions that eliminate these 4 pathologies, inverse stability w.r.t. the Sobolev norm is achieved for one-hidden-layer networks, and they argue (without proof) that their technique also extends to deeper architectures. The authors motivate this line of research by establishing a correspondence between minima in the optimization space and minima in the realization space, and by arguing that this connection will allow known tools from, e.g., functional analysis to be adapted to the investigation of neural networks.

While I find the paper well-written and clear, my main concern is with its novelty, as specified in the improvements part.

Minor comments: Line 22: year -> years. Line 72: "The, to the best..." is unclear. Possible typo? Line 218: an restriction -> a restriction
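To make the distinction between weights and realization concrete, here is a minimal sketch in plain Python. It assumes a one-hidden-layer ReLU network with scalar input and output, stored as a list of `(a, w, b)` triples; the helper names (`realize`, `rescale`, `param_distance`, `sup_distance`) and the numbers are illustrative assumptions, not taken from the paper. It relies only on the positive homogeneity of ReLU, relu(λt) = λ·relu(t) for λ > 0, to produce two parametrizations that are far apart yet realize the same function.

```python
def relu(t):
    """ReLU activation for a scalar."""
    return max(t, 0.0)


def realize(params, x):
    """Realization f(x) = sum_i a_i * relu(w_i * x + b_i).

    params is a list of (a, w, b) triples, one per hidden neuron
    (an illustrative encoding, not the paper's notation).
    """
    return sum(a * relu(w * x + b) for a, w, b in params)


def rescale(params, lam):
    """Map (a, w, b) -> (a / lam, lam * w, lam * b) for lam > 0.

    Because relu(lam * t) = lam * relu(t) for lam > 0, the realization
    is unchanged while the parameters move.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return [(a / lam, lam * w, lam * b) for a, w, b in params]


def param_distance(p, q):
    """Max-norm distance between two parameter lists of equal length."""
    return max(abs(u - v) for n1, n2 in zip(p, q) for u, v in zip(n1, n2))


def sup_distance(p, q, grid):
    """Largest difference between the two realizations on the grid."""
    return max(abs(realize(p, x) - realize(q, x)) for x in grid)


# Evaluation grid on [-1, 1] with step 1/1024.
grid = [i / 1024.0 - 1.0 for i in range(2049)]

theta = [(1.0, 2.0, -0.5), (-3.0, 0.5, 0.25)]  # arbitrary example weights
theta_scaled = rescale(theta, 4.0)             # same function, different weights

print("parameter distance:  ", param_distance(theta, theta_scaled))
print("realization distance:", sup_distance(theta, theta_scaled, grid))
```

This is why closeness has to be measured against the best equivalent parametrization rather than against a fixed one: the two parameter lists above are far apart in the max norm even though their realizations coincide on the grid.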

I read the author response and the other reviews. The author response provides a nice additional demonstration of the implications of connecting the two problems via inverse stability. This is an interesting and potentially important paper for future research on this topic. The paper explains the definition of inverse stability, proves its implications for neural network optimization, presents failure modes of inverse stability, and proves inverse stability for a simple one-hidden-layer network with a single output.

Originality: The paper definitely provides a very interesting and unique research direction. I enjoyed reading the paper. It is quite refreshing to have a paper that acknowledges the limitations of the current theories (very strong assumptions) and aims to go beyond the regime of these strong assumptions. The present paper is related to a previous paper that concludes a global minimum of some space from a local minimum of the parameter space. Although it is different, the goal is closely related, and the high-level approach of relating the original problem to another one is the same. In the present paper, a local minimum in the parameter space only implies a local minimum in the realization space, whereas the related paper concludes a global minimum. However, the realization space is richer, and the present paper is good and interesting.

Quality: The quality of the paper is good.

Clarity: The paper is clearly written.

Significance: I found the paper interesting and enjoyed reading it. One concern is that the results are currently limited to a very simple case. I think this is understandable given the challenging nature of the topic. It may still be a concern for the following reason: even if we had a complete picture of inverse stability, that would only be the beginning of studying neural networks in the realization form.
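As a rough illustration of one failure mode, the sketch below shows how a realization can be small in the uniform norm while still requiring large weights. It is a hypothetical one-dimensional example chosen for this note, not necessarily the counterexample used in the paper: the two-neuron network g_k(x) = relu(kx) - relu(kx - 1/k), with the helper names `steep_ramp`, `sup_norm`, `max_slope`, and `slope_bound` all being assumptions. The finite-difference slope is used as a stand-in for the first-order part of a Sobolev-type norm.

```python
def relu(t):
    """ReLU activation for a scalar."""
    return max(t, 0.0)


def realize(params, x):
    """f(x) = sum_i a_i * relu(w_i * x + b_i), params = [(a, w, b), ...]."""
    return sum(a * relu(w * x + b) for a, w, b in params)


def steep_ramp(k):
    """Hypothetical example: g_k(x) = relu(k*x) - relu(k*x - 1/k).

    g_k is 0 for x <= 0, rises with slope k on [0, 1/k^2],
    and stays at 1/k afterwards.
    """
    return [(1.0, k, 0.0), (-1.0, k, -1.0 / k)]


def sup_norm(params, grid):
    """Largest absolute value of the realization on the grid."""
    return max(abs(realize(params, x)) for x in grid)


def max_slope(params, grid):
    """Largest finite-difference slope between neighbouring grid points."""
    pts = [(x, realize(params, x)) for x in grid]
    return max(abs(y1 - y0) / (x1 - x0)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def slope_bound(params):
    """Upper bound on the slope of the realization: sum_i |a_i| * |w_i|."""
    return sum(abs(a) * abs(w) for a, w, _ in params)


# Grid on [-1, 1] with step 1/1024, fine enough to resolve [0, 1/k^2] for k <= 16.
grid = [i / 1024.0 - 1.0 for i in range(2049)]

for k in (2.0, 4.0, 8.0, 16.0):
    g = steep_ramp(k)
    print(k, sup_norm(g, grid), max_slope(g, grid), slope_bound(g))
```

As k grows, the uniform norm of g_k shrinks like 1/k while its slope grows like k. Since the slope of any such network is at most the sum of |a_i||w_i|, a parametrization with m hidden neurons and all weights bounded by c can only reach slope m·c², so no parametrization of g_k of fixed width stays close to the all-zero parametrization of the zero function. A norm that also controls the derivative does not regard g_k as close to zero in the first place.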
The statement about the possibility of extending the results to deep neural networks is unnecessary and tautological. Of course it is possible, but it is highly non-trivial, and at what cost? For example, it may end up requiring the strong assumptions that this paper wanted to avoid.

The conditions of the main result, Theorem 3.1, are quite technical and hard to interpret. It is not clear to me what this result actually implies in practice (including for the original problem) and how it can help in analyzing or better understanding the training of neural networks. Although this is motivated earlier in Corollary 1.3, the conditions there are again quite technical, and the results do not seem to say much about the properties of local minima of the original problem.