On ADMM in Deep Learning: Convergence and Saturation-Avoidance

Jinshan Zeng (Jiangxi Normal University), Shao-Bo Lin (Xi'an Jiaotong University), Yuan Yao (Hong Kong University of Science and Technology), Ding-Xuan Zhou (City University of Hong Kong)

Machine Learning mathscidoc:2110.41001

Journal of Machine Learning Research, 22, (199), 1-67, 2021.9
In this paper, we develop an alternating direction method of multipliers (ADMM) for training deep neural networks with sigmoid-type activation functions (called the sigmoid-ADMM pair), mainly motivated by the gradient-free nature of ADMM in avoiding the saturation of sigmoid-type activations and by the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of deep sigmoid net training, from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point, at a rate of order O(1/k). Besides the sigmoid activation, this convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for training deep ReLU nets (called the ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters, including the learning rate, the initialization scheme and the pre-processing of the input data. Moreover, we find that, to approximate and learn simple but important functions, the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.
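The saturation issue mentioned in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: the function names are hypothetical, and the chain-rule product below deliberately ignores the weight factors that a real backpropagated gradient would also contain. It only shows that the derivative of the logistic sigmoid peaks at 1/4 and decays quickly for large pre-activations, so a gradient passed through several saturated layers becomes tiny.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 1/4."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_grad(pre_activations):
    """Product of sigmoid derivatives along one path through the layers.

    Assumption for illustration: weight factors are omitted, so this is
    only the activation part of a backpropagated gradient.
    """
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g


if __name__ == "__main__":
    # The derivative shrinks as the pre-activation moves away from zero.
    for z in (0.0, 2.0, 5.0, 10.0):
        print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.3e}")

    # Five layers: the best case (z = 0) versus a saturated case (z = 5).
    print("5 layers, z = 0:", chained_grad([0.0] * 5))
    print("5 layers, z = 5:", chained_grad([5.0] * 5))
```

Gradient-based training relies on exactly this product, which is one way to read the abstract's motivation for a gradient-free method such as ADMM.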
Deep learning, ADMM, sigmoid, global convergence, saturation avoidance
Uploaded 2021-10-21 15:27:03 by JinshanZeng
  • The major contributions of this paper are twofold. First, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. Second, we propose an efficient ADMM algorithm for training deep sigmoid nets with convergence guarantees, and show that the suggested algorithm can overcome the saturation issue suffered by the training of deep sigmoid nets.
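For orientation, a nonlinearly constrained formulation of network training of the kind the abstract refers to can be sketched as follows. This is a generic form under stated assumptions, not a transcription of the paper: the squared loss, the Frobenius-norm regularizer with weight \lambda, the linear output layer, and the symbols W_i (weights), V_i (layer outputs), \Lambda_i (multipliers) and \beta_i (penalty parameters) are all illustrative choices, and the paper's exact formulation may differ.

```latex
% Training an N-layer net with activation \sigma, input X and target Y,
% written with explicit layer-output variables V_i (assumed notation).
\begin{aligned}
\min_{\{W_i\},\{V_i\}} \quad
  & \tfrac{1}{2}\,\|V_N - Y\|_F^2
    + \tfrac{\lambda}{2} \sum_{i=1}^{N} \|W_i\|_F^2 \\
\text{s.t.} \quad
  & V_i = \sigma(W_i V_{i-1}), \quad i = 1, \dots, N-1, \\
  & V_N = W_N V_{N-1}, \qquad V_0 = X .
\end{aligned}

% Corresponding augmented Lagrangian, with one multiplier \Lambda_i and one
% penalty parameter \beta_i > 0 per constraint.
\begin{aligned}
\mathcal{L}_\beta
  = {} & \tfrac{1}{2}\,\|V_N - Y\|_F^2
    + \tfrac{\lambda}{2} \sum_{i=1}^{N} \|W_i\|_F^2 \\
  & + \sum_{i=1}^{N-1} \Big(
      \tfrac{\beta_i}{2}\,\|\sigma(W_i V_{i-1}) - V_i\|_F^2
      + \langle \Lambda_i,\ \sigma(W_i V_{i-1}) - V_i \rangle \Big) \\
  & + \tfrac{\beta_N}{2}\,\|W_N V_{N-1} - V_N\|_F^2
      + \langle \Lambda_N,\ W_N V_{N-1} - V_N \rangle .
\end{aligned}
```

An ADMM scheme for such a problem alternates between minimizing \mathcal{L}_\beta over one block of variables (a W_i or a V_i) with the others fixed, and updating each multiplier \Lambda_i by the residual of its constraint. No gradient has to be propagated through the whole stack of layers at once.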
@article{jinshan2021on,
  title={On ADMM in Deep Learning: Convergence and Saturation-Avoidance},
  author={Jinshan Zeng and Shao-Bo Lin and Yuan Yao and Ding-Xuan Zhou},
  url={http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20211021152703789191880},
  journal={Journal of Machine Learning Research},
  volume={22},
  number={199},
  pages={1--67},
  year={2021},
}
Jinshan Zeng, Shao-Bo Lin, Yuan Yao, and Ding-Xuan Zhou. On ADMM in Deep Learning: Convergence and Saturation-Avoidance. Journal of Machine Learning Research, 22(199), pp. 1-67, 2021. http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20211021152703789191880.