8/29/2019

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

Yuanzhi Li, Tengyu Ma and Hongyang Zhang, Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations, COLT 2018, Best Paper Award.
We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations.  
Concretely, we show that given Õ(dr²) random linear measurements of a rank-r positive semidefinite matrix X⋆, we can recover X⋆ by parameterizing it as UU⊤ with U ∈ ℝ^{d×d} and minimizing the squared loss, even if r ≪ d. We prove that starting from a small initialization, gradient descent approximately recovers X⋆ in Õ(√r) iterations. The result resolves the conjecture of Gunasekar et al. [16] under the restricted isometry property.
The technique can be applied to analyzing neural networks with one-hidden-layer quadratic activations with some technical modifications.
The data x (Lemma 5.1) satisfy the restricted isometry property (Definition 2.1) and the labels y satisfy (1.3); running the gradient descent algorithm then guarantees that the prediction error stays within a bounded range, which avoids overfitting.
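A minimal numerical sketch of this setup (the dimensions, sample count, step size, and iteration budget below are illustrative choices, not the paper's): Gaussian measurement matrices satisfy the restricted isometry property with high probability, the labels are y_k = ⟨A_k, X⋆⟩, and gradient descent on the over-parameterized factor U ∈ ℝ^{d×d} from a small initialization recovers the rank-r matrix X⋆.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 20, 2, 600             # ambient dimension, true rank, measurements

# Ground-truth rank-r PSD matrix X* (scaled so its eigenvalues are O(1))
B = rng.normal(size=(d, r)) / np.sqrt(d)
X_star = B @ B.T

# Random Gaussian measurement matrices satisfy RIP with high probability
A = rng.normal(size=(m, d, d)) / np.sqrt(m)
y = np.einsum('kij,ij->k', A, X_star)      # labels y_k = <A_k, X*>

# Over-parameterized factor U in R^{d x d}, small initialization
U = 1e-3 * rng.normal(size=(d, d))
eta = 0.1
for _ in range(1000):
    residual = np.einsum('kij,ij->k', A, U @ U.T) - y
    # gradient of (1/2) * sum_k (<A_k, UU^T> - y_k)^2 with respect to U
    M = np.einsum('k,kij->ij', residual, A)
    U -= eta * (M + M.T) @ U

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
print(rel_err)   # small relative error: UU^T approximately recovers X*
```

Even though U has d² parameters and X⋆ only has rank r, the small initialization steers gradient descent toward the low-rank solution, which is exactly the implicit regularization effect the paper analyzes.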

Consider the quadratic activation function (1.1); the discrete gradient descent update then becomes the linear equation (1.2). The keys to convergence, as described in the paper, are:

  1. The trajectory of the iterates of GD on the population risk stays in the set of approximately low-rank matrices.
  2. The trajectory of the empirical risk behaves similarly to that of the population risk in the set of approximately low-rank matrices.
  3. The iterates converge to the true solution X⋆ fast enough, before their effective rank increases.
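The reduction from quadratic-activation networks to matrix sensing can be checked directly: a one-hidden-layer network with quadratic activation (1.1) computes f(x) = Σ_j (u_j⊤x)², which equals the linear measurement ⟨xx⊤, UU⊤⟩ of the matrix UU⊤. A quick numerical sanity check (the dimensions below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 8                      # input dimension, number of hidden units
U = rng.normal(size=(d, k))      # column u_j holds the weights of neuron j
x = rng.normal(size=d)

# Quadratic-activation network: f(x) = sum_j (u_j^T x)^2
f_net = np.sum((x @ U) ** 2)

# The same value, written as a linear measurement of UU^T: <x x^T, U U^T>
f_sensing = x @ (U @ U.T) @ x

print(abs(f_net - f_sensing))    # agrees up to floating-point error
```

This identity is why the matrix sensing analysis transfers to such networks: each input x plays the role of a (rank-one) measurement matrix xx⊤, which is the technical modification the paper handles.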
As the authors note, generalizing this theory is an interesting and challenging problem:

It’s a very interesting open question to establish similar results for deeper neural networks with other activations (e.g., ReLU) and loss functions (e.g., logistic loss). We remark that such a result will likely require not only a better understanding of statistics, but also a deep grasp of the behavior of the optimization algorithms for non-linear models, which in turn is another fascinating open question.
