Nonconvex reformulations via low-rank factorization for stochastic convex semidefinite optimization problems have attracted increasing attention due to their empirical efficiency and scalability. Compared with the original convex formulations, the nonconvex ones typically involve far fewer variables, allowing them to scale to scenarios with millions of variables. However, despite their empirical success in applications, this raises a new question: under what conditions can nonconvex stochastic algorithms find the population minimizer within the optimal statistical precision? In this paper, we show that the stochastic gradient descent (SGD) method can be adapted to solve the nonconvex reformulation of the original convex problem with global linear convergence when a fixed step size is used, i.e., it converges exponentially fast to the population minimizer within the optimal statistical precision in the restricted strongly convex case. If a diminishing step size is adopted, the adverse effect of the gradient variance on the optimization error can be eliminated, but the rate drops to sublinear. The core of our treatment relies on a novel second-order descent lemma, which is more general than the best existing result in the literature and improves the analysis of both online and batch algorithms. The theoretical results and the effectiveness of the proposed SGD are also verified by a series of experiments.
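As an illustration of the factored approach, the following is a minimal sketch of fixed-step SGD on the reformulation X = U Uᵀ. It is not the algorithm analyzed in the paper: the squared-loss matrix sensing model, the symmetric Gaussian measurements, the initialization near the true factor, the step size, and all function names are assumptions chosen only to keep the example small.

```python
import numpy as np


def sample_measurement(rng, X_star, noise_std):
    """Draw one online sample (A, y) with y = <A, X_star> + noise (assumed model)."""
    d = X_star.shape[0]
    G = rng.standard_normal((d, d))
    A = (G + G.T) / 2.0  # symmetric measurement matrix
    y = np.sum(A * X_star) + noise_std * rng.standard_normal()
    return A, y


def factored_sgd(U0, X_star, step_size, num_steps, noise_std, rng):
    """Fixed-step SGD on the factor U, where the PSD variable is X = U @ U.T."""
    U = U0.copy()
    for _ in range(num_steps):
        A, y = sample_measurement(rng, X_star, noise_std)
        residual = np.sum(A * (U @ U.T)) - y
        # Gradient of 0.5 * residual**2 with respect to U, for symmetric A.
        grad = 2.0 * residual * (A @ U)
        U -= step_size * grad
    return U


rng = np.random.default_rng(0)
d, r = 10, 2  # the factor has d * r entries instead of d * d
U_star = rng.standard_normal((d, r)) / np.sqrt(d)
X_star = U_star @ U_star.T

# Assumption: start from a small perturbation of the true factor.
U0 = U_star + 0.05 * rng.standard_normal((d, r))
U_hat = factored_sgd(U0, X_star, step_size=0.005, num_steps=5000,
                     noise_std=0.01, rng=rng)

initial_error = np.linalg.norm(U0 @ U0.T - X_star)
final_error = np.linalg.norm(U_hat @ U_hat.T - X_star)
print(f"initial error: {initial_error:.4f}, final error: {final_error:.4f}")
```

With a fixed step size the iterates in this toy setting are expected to settle in a small neighborhood of the population minimizer whose radius depends on the noise level and the step size; replacing `step_size` with a decaying schedule would trade that residual error for a slower rate.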
In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Unlike traditional gradient-based approaches that optimize over all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. This optimization scheme also allows us to develop efficient cache techniques. The feature generating paradigm is guaranteed to converge globally under mild conditions, and can achieve lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets with tens of millions of data points and O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.
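The iterate-and-activate idea can be illustrated with a much simplified sketch. It is not the proposed method: the MKL subproblem and the accelerated proximal gradient solver are replaced here by an ordinary least-squares refit on the active features, the activation rule (largest absolute correlation with the current residual) is an assumption, and the function name and synthetic data are hypothetical.

```python
import numpy as np


def feature_generating_sketch(X, y, group_size, num_rounds):
    """Greedy active-set loop: activate a group of features, then refit."""
    n_samples, n_features = X.shape
    is_active = np.zeros(n_features, dtype=bool)
    w = np.zeros(n_features)
    residual = y.copy()
    for _ in range(num_rounds):
        # Score every inactive feature by its correlation with the residual.
        scores = np.abs(X.T @ residual)
        scores[is_active] = -np.inf
        newly_active = np.argsort(scores)[-group_size:]
        is_active[newly_active] = True
        # Stand-in for the MKL subproblem: least squares on active features only.
        w_active, *_ = np.linalg.lstsq(X[:, is_active], y, rcond=None)
        w = np.zeros(n_features)
        w[is_active] = w_active
        residual = y - X @ w
    return np.flatnonzero(is_active), w


rng = np.random.default_rng(0)
n_samples, n_features = 500, 500
true_support = [3, 50, 120, 200, 310, 499]
w_true = np.zeros(n_features)
w_true[true_support] = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0]

X = rng.standard_normal((n_samples, n_features))
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

selected, w_hat = feature_generating_sketch(X, y, group_size=3, num_rounds=3)
selected = selected.tolist()
print("selected features:", selected)
```

Each round touches all features only once, to compute the scores, while the refit works on the small active set; this is the part of the design that keeps the per-round cost low when the total number of features is very large.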