We study the problem of globally recovering a dictionary from a set of signals via ℓ1-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the K atoms of a complete reference dictionary D∗∈RK×K, where the linear combination coefficients come from either a Bernoulli-type model or an exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary D∗ to be a sharp local minimum of the expected ℓ1 objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficients to be non-negative. Second, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Third, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of ℓ1-minimization dictionary learning. Motivated by these theoretical results, we introduce a perturbation-based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to decrease the objective function monotonically. Simulation studies show that DL-BCD has competitive recovery rates compared with other state-of-the-art dictionary learning algorithms when the reference dictionary is generated from random Gaussian matrices.
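To make the objective concrete, here is a minimal sketch, not the authors' DL-BCD implementation: the empirical ℓ1 objective for a complete (square, invertible) dictionary, plus a toy block update of a single atom whose candidate-search rule is ours and purely illustrative. Accepting only improvements preserves the monotone-decrease property the abstract states.

```python
import numpy as np

def l1_objective(D, X):
    # Empirical objective f(D) = (1/n) * sum_i ||D^{-1} x_i||_1 for a
    # complete (square, invertible) dictionary D and signal matrix X (K x n).
    A = np.linalg.solve(D, X)            # coefficients A satisfying D @ A = X
    return np.abs(A).sum() / X.shape[1]

def bcd_atom_update(D, X, k, candidates):
    # Naive block update of atom k: try candidate atoms, keep the best one.
    # Accepting only strict improvements keeps the objective nonincreasing.
    best_D, best_val = D, l1_objective(D, X)
    for v in candidates:
        D_new = D.copy()
        D_new[:, k] = v / np.linalg.norm(v)   # unit-norm atoms fix the scaling
        try:
            val = l1_objective(D_new, X)
        except np.linalg.LinAlgError:          # skip singular candidates
            continue
        if val < best_val:
            best_D, best_val = D_new, val
    return best_D, best_val

# Toy usage: signals are sparse combinations of atoms of a random Gaussian D*.
rng = np.random.default_rng(0)
K, n = 8, 2000
D_star = rng.standard_normal((K, K))
A_true = rng.standard_normal((K, n)) * (rng.random((K, n)) < 0.3)
X = D_star @ A_true
print(l1_objective(D_star, X))
```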
In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Unlike traditional gradient-based approaches that optimize over all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. This optimization scheme also enables efficient caching techniques. The feature generating paradigm is guaranteed to converge globally under mild conditions and can achieve lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets with tens of millions of data points and O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.
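As a hedged illustration of the feature generating idea, an active-set loop that repeatedly activates the most promising features and refits on them, here is a toy sketch. The correlation-based scoring rule and the ridge refit are simple stand-ins for the most-violated-constraint step and the MKL subproblem of the paper, and all names are hypothetical.

```python
import numpy as np

def feature_generating_fit(X, y, budget=10, rounds=5, lam=0.1):
    # Activate the `budget` features most correlated with the current residual
    # (a stand-in for the most-violated-constraint step of the SIP), then
    # refit on the active set (ridge regression as a stand-in subproblem).
    n, d = X.shape
    active = np.zeros(d, dtype=bool)
    w = np.zeros(d)
    for _ in range(rounds):
        resid = y - X @ w
        scores = np.abs(X.T @ resid) / n
        scores[active] = -np.inf              # never re-activate a feature
        active[np.argsort(scores)[-budget:]] = True
        Xa = X[:, active]
        wa = np.linalg.solve(Xa.T @ Xa + lam * np.eye(Xa.shape[1]), Xa.T @ y)
        w = np.zeros(d)
        w[active] = wa
    return w, active

# Toy usage: only 5 of 10,000 features carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10_000))
w_true = np.zeros(10_000)
w_true[:5] = 3.0
y = X @ w_true + 0.1 * rng.standard_normal(500)
w_hat, active = feature_generating_fit(X, y)
print(np.flatnonzero(active)[:10])
```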
Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements arising frequently in applications ranging from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables, and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to address these challenges.
We define the notion of a Ricci curvature lower bound for parametrized statistical models. Following the seminal ideas of Lott–Sturm–Villani, we define this notion based on the geodesic convexity of the Kullback–Leibler divergence in a Wasserstein statistical manifold, that is, a manifold of probability distributions endowed with a Wasserstein metric tensor structure. Within these definitions, which are based on the Fisher information matrix and Wasserstein Christoffel symbols, the Ricci curvature is related to both information geometry and Wasserstein geometry. These definitions allow us to formulate bounds on the convergence rate of Wasserstein gradient flows and information functional inequalities in parameter space. We discuss examples of Ricci curvature lower bounds and convergence rates in exponential family models.
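For concreteness, here is a hedged LaTeX rendering of the Lott–Sturm–Villani-style definition the abstract describes, together with the gradient-flow convergence it implies; the notation ($G_W$, $d_W$, $\theta^*$) is ours and may differ from the paper's.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $(\Theta, G_W)$ be a parametrized model $\{p_\theta\}_{\theta\in\Theta}$
equipped with a Wasserstein metric tensor $G_W(\theta)$. The model has Ricci
curvature lower bound $\kappa \in \mathbb{R}$ if the divergence
$\mathrm{H}(\theta) = \mathrm{KL}(p_\theta \,\|\, p_{\theta^*})$ is
$\kappa$-geodesically convex: along every constant-speed geodesic
$(\theta_t)_{t \in [0,1]}$ of $(\Theta, G_W)$,
\begin{equation}
  \mathrm{H}(\theta_t) \le (1-t)\,\mathrm{H}(\theta_0) + t\,\mathrm{H}(\theta_1)
    - \frac{\kappa}{2}\, t(1-t)\, d_W(\theta_0, \theta_1)^2 ,
\end{equation}
where $d_W$ is the geodesic distance induced by $G_W$. A bound $\kappa > 0$
yields exponential decay, $\mathrm{H}(\theta_t) \le
e^{-2\kappa t}\,\mathrm{H}(\theta_0)$, along the Wasserstein gradient flow
$\dot\theta_t = -\,G_W(\theta_t)^{-1}\,\nabla_\theta \mathrm{H}(\theta_t)$.
\end{document}
```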