Submitted to the Annals of Statistics
ARE DISCOVERIES SPURIOUS?
DISTRIBUTIONS OF MAXIMUM SPURIOUS
CORRELATIONS AND THEIR APPLICATIONS∗
By Jianqing Fank,∗∗, Qi-Man Shao†† and Wen-Xin Zhouk,‡‡
Princeton Universityk, Academy of Mathematics and Systems Science∗∗,
Chinese University of Hong Kong††
and University of California, San Diego‡‡
Over the last two decades, many exciting variable selection meth-
ods have been developed for finding a small group of covariates that
are associated with the response from a large pool. Can the discov-
eries from these data mining approaches be spurious due to high di-
mensionality and limited sample size? Can our fundamental assump-
tions about the exogeneity of the covariates needed for such variable
selection be validated with the data? To answer these questions, we
need to derive the distributions of the maximum spurious correlations
given a certain number of predictors, namely, the distribution of the
correlation of a response variable Ywith the best slinear combina-
tions of pcovariates X, even when Xand Yare independent. When
the covariance matrix of Xpossesses the restricted eigenvalue prop-
erty, we derive such distributions for both a finite sand a diverging
s, using Gaussian approximation and empirical process techniques.
However, such a distribution depends on the unknown covariance
matrix of X. Hence, we use the multiplier bootstrap procedure to ap-
proximate the unknown distributions and establish the consistency of
such a simple bootstrap approach. The results are further extended
to the situation where the residuals are from regularized fits. Our
approach is then used to construct the upper confidence limit for
the maximum spurious correlation and to test the exogeneity of the
covariates. The former provides a baseline for guarding against false
discoveries and the latter tests whether our fundamental assumptions
for high-dimensional model selection are statistically valid. Our tech-
niques and results are illustrated with both numerical examples and
real data analysis.
1. Introduction. Information technology has forever changed the data
collection process. Massive amounts of very high-dimensional or unstruc-
∗J. Fan was partially supported by NSF Grants DMS-1206464, DMS-1406266, NIH
Grant R01-GM072611-10, and National Center for Mathematics and Interdisciplinary Sci-
ences, Chinese Academy of Sciences, Beijing 100190. Q.-M. Shao was supported in part by
Hong Kong Research Grants Council GRF-403513 and GRF-14302515. W.-X. Zhou was
supported by NIH Grant R01-GM072611-10.
Keywords and phrases: High dimension, spurious correlation, bootstrap, false discovery.
1
arXiv:1502.04237v3 [math.ST] 11 Apr 2017
2J. FAN, Q.-M. SHAO AND W.-X. ZHOU
tured data are continuously produced and stored at an affordable cost. Mas-
sive and complex data and high dimensionality characterize contemporary
statistical problems in many emerging fields of science and engineering. Var-
ious statistical and machine learning methods and algorithms have been pro-
posed to find a small group of covariate variables that are associated with
given responses such as biological and clinical outcomes. These methods
have been very successfully applied to genomics, genetics, neuroscience, eco-
nomics, and finance. For an overview of high-dimensional statistical theory
and methods, see the review article by Fan and Lv (2010) and monographs
by Dudoit and van der Laan (2007), Hastie, Tibshirani and Friedman (2009),
Efron (2010) and B¨uhlmann and van de Geer (2011).
Underlying machine learning, data mining, and high-dimensional statis-
tical techniques, there are many model assumptions and even heuristic ar-
guments. For example, the LASSO [Tibshirani (1996)] and the SCAD [Fan
and Li (2001)] are based on an exogeneity assumption, meaning that all of
the covariates and the residual of the true model are uncorrelated. However,
it is nearly impossible that such a random variable, which is the part of
the response variable that can not be explained by a small group of covari-
ates and lives in a low-dimensional space spanned by the response and the
small group of variables, is uncorrelated with any of the tens of thousands
of coviariates. Indeed, Fan and Liao (2014) and Fan, Han and Liu (2014)
provide evidence that such an ideal assumption might not be valid, although
it is a necessary condition for model selection consistency. Even under the
exogenous assumption, conditions such as the restricted eigenvalue condi-
tion [Bickel, Ritov and Tsybakov (2009)] and homogeneity [Fan, Han and
Liu (2014)] are needed to ensure model selection consistency or oracle prop-
erties. Despite their critical importance, these conditions have rarely been
verified in practice. Their violations can lead to false scientific discoveries.
A simpler question is then, for a given data set, do data mining techniques
produce results that are better than spurious correlation? The answer de-
pends on not only the correlation between the fitted and observed values,
but also on the sample size, the number of variables selected, and the total
number of variables.
To better appreciate the above two questions, let us consider an example.
We take the gene expression data on 90 Asians (45 Japanese and 45 Han
Chinese) from the international ‘HapMap’ project [Thorisson et al. (2005)].
The normalized gene expression data are generated with an Illumina Sentrix
Human-6 Expression Bead Chip [Stranger et al. (2007)] and are available
on ftp://ftp.sanger.ac.uk/pub/genevar/. We take the expressions of
gene CHRNA6, a cholinergic receptor, nicotinic, alpha 6, as the response
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 3
Fig 1: Histogram of the sample correlations between the residuals and each
covariate (blue) and histogram of N(0,1/√n) random variables (green).
Yand the remaining expressions of probes as covariates Xwith dimension
p= 47292. We first fit an `1-penalized least-squares regression (LASSO)
on the data with a tuning parameter automatically selected via ten-fold
cross validation (25 genes are selected). The correlation between the LASSO-
fitted value and the response is 0.8991. Next, we refit an ordinary least-
squares regression on the selected model to calculate the fitted response and
residual vector. The sample correlation between the post-LASSO fit and
observed responses is 0.9214, a remarkable fit! But is it any better than the
spurious correlation? The model diagnostic plot, which depicts the empirical
distribution of the correlations between each covariate Xjand the residual
bεafter the LASSO fit, is given in Figure 1. Does the exogenous assumption
that E(εXj) = 0 for all j= 1, . . . , p hold?
To answer the above two important questions, we need to derive the distri-
butions of the maximum spurious correlations. Let Xbe the p-dimensional
random vector of the covariates and XSbe a subset of covariates indexed by
S. Let dcorrn(ε, αT
SXS) be the sample correlation between the random noise
ε(independent of X) and αT
SXSbased on a sample of size n, where αSis
4J. FAN, Q.-M. SHAO AND W.-X. ZHOU
a constant vector. Then, the maximum spurious correlation is defined as
(1.1) b
Rn(s, p) = max
|S|=smax
αSdcorrn(ε, αT
SXS),
when Xand εare independent, where the maximization is taken over all
p
ssubsets of size sand all of the linear combinations of the selected s
covariates. Next, let (Yi,Xi),...,(Yn,Xn) be independent and identically
distributed (i.i.d.) observations from the linear model Y=XTβ∗+ε. Assume
that scovariates are selected by a certain variable selection method for
some 1 ≤smin(p, n). If the correlation between the fitted response and
observed response is no more than the 90th or the 95th percentile of b
Rn(s, p),
it is hard to claim that the fitted value is impressive or even genuine. In this
case, the finding is hardly more impressive than the best fit using data
that consist of independent response and explanatory variables, 90% or 95%
of the time. To simplify and unify the terminology, we call this result the
spurious discovery throughout this paper.
For the aforementioned gene expression data, as 25 probes are selected,
the observed correlation of 0.9214 between the fitted value and the response
should be compared with the distribution of b
Rn(25, p). Further, a simple
method to test the null hypothesis
(1.2) E(εXj)=0,for all j= 1, . . . , p,
is to compare the maximum absolute correlation in Figure 1with the dis-
tribution of b
Rn(1, p). See additional details in Section 5.3.
The importance of such spurious correlation was recognized by Cai and
Jiang (2011), Fan, Guo and Hao (2012) and Cai, Fan and Jiang (2013).
When the data are independently and normally distributed, they derive the
distribution of b
Rn(1, p), which is equivalent to the distribution of the mini-
mum angle to the north pole among prandom points uniformly distributed
on the (n+ 1)-dimensional sphere. Fan, Guo and Hao (2012) conducted
simulations to demonstrate that the spurious correlation can be very high
when pis large and grows quickly with s. To demonstrate this effect and to
examine the impact of correlation and sample size, we conduct a similar but
more extensive simulation study based on a combination of the stepwise ad-
dition and branch-and-bound algorithms. We simulate Xfrom N(0,Ip) and
N(0,Σ0), where Σ0is block diagonal with the first block being a 500 ×500
equi-correlation matrix with a correlation 0.8 and the second block being
the (p−500) ×(p−500) identity matrix. Yis simulated independently of
Xand follows the standard normal distribution. Figure 2depicts the simu-
lation results for n= 50,100 and 200. Clearly, the distributions depend on
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 5
Fig 2: Distributions of maximum spurious correlations for p= 1000 and
s= 1,2,5 and 10 when Σis the identity matrix (left panel) or block diagonal
(right panel) with the first block being a 500 ×500 equi-correlation matrix
with a correlation 0.8 and the second block being the 500 ×500 identity
matrix. From top to bottom: n= 50,100 and 200.
(s, p, n) and Σ, the covariance matrix of X, although the dependence on Σ
does not seem very strong. However, the theoretical result of Fan, Guo and
Hao (2012) covers only the very specific case where s= 1 and Σ=Ip.
There are several challenges to deriving the asymptotic distribution of the
statistic b
Rn(s, p), as it involves combinatorial optimization. Further techni-
cal complications are added by the dependence among the covariates X.
Nevertheless, under the restricted eigenvalue condition [Bickel, Ritov and
Tsybakov (2009)] on Σ, in this paper, we derive the asymptotic distribution
of such a spurious correlation statistic for both a fixed sand a diverging s,
using the empirical process and Gaussian approximation techniques given in
Chernozhukov, Chetverikov and Kato (2014a). As expected, such distribu-
tions depend on the unknown covariance matrix Σ. To provide a consistent
estimate of the distributions of the spurious correlations, we consider the
use of a multiplier bootstrap method and demonstrate its consistency under
mild conditions. The multiplier bootstrap procedure has been widely used
due to its good numerical performance. Its theoretical validity is guaranteed
6J. FAN, Q.-M. SHAO AND W.-X. ZHOU
by the multiplier central limit theorem [van der Vaart and Wellner (1996)].
For the most advanced recent results, we refer to Chatterjee and Bose (2005),
Arlot, Blanchard and Roquain (2010) and Chernozhukov, Chetverikov and
Kato (2013). In particular, Chernozhukov, Chetverikov and Kato (2013) de-
veloped a number of non-asymptotic results on a multiplier bootstrap for
the maxima of empirical mean vectors in high dimensions with applications
to multiple hypothesis testing and parameter choice for the Dantzig selec-
tor. The use of multiplier bootstrapping enables us to empirically compute
the upper confidence limit of b
Rn(s, p) and hence decide whether discover-
ies by statistical machine learning techniques are any better than spurious
correlations.
The rest of this paper is organized as follows. Section 2discusses the
concept of spurious correlation and introduces the main conditions and no-
tation. Section 3presents the main results of the asymptotic distributions of
spurious correlations and their bootstrap approximations, which are further
extended in Section 4. Section 5identifies three important applications of
our results to high-dimensional statistical inference. Section 6presents the
numerical studies. The proof of Theorem 3.1 is provided in Section 7, and
the proofs for the remaining theoretical results are provided in the supple-
mentary material [Fan, Shao and Zhou (2017)].
2. Spurious correlation, conditions, and notation. Let ε, ε1, . . . , εn
be i.i.d. random variables with a mean of zero and a finite variance σ2>0,
and let X,X1,...,Xnbe i.i.d. p-dimensional random vectors with a mean
of zero and a covariance matrix Σ=E(XXT)=(σjk )1≤j,k≤p. Write
X= (X1, . . . , Xp)T,Xi= (Xi1, . . . , Xip)T, i = 1, . . . , n.
Assume that the two samples {εi}n
i=1 and {Xi}n
i=1 are independent. Then,
the spurious correlation (1.1) can be written as
(2.1) b
Rn(s, p) = max
α∈Sp−1:|α|0=sdcorrnε, αTX,
where the dimension pand sparsity sare allowed to grow with the sample
size n. Here dcorrn(·,·) denotes the sample Pearson correlation coefficient
and Sp−1:= {α∈Rp:|α|2= 1}is the unit sphere of Rp. Due to the anti-
symmetric property of the sample correlation under the sign transformation
of α, we have also
(2.2) b
Rn(s, p) = max
α∈Sp−1:|α|0=s|dcorrnε, αTX|,
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 7
More specifically, we can express b
Rn(s, p) as
max
S⊆[p]:|S|=smax
α∈Ss−1Pn
i=1(εi−¯εn)α,Xi,S −¯
Xn,S
qPn
i=1(εi−¯εn)2·Pn
i=1 α,Xi,S −¯
Xn,S2.(2.3)
By the scale-invariance property of b
Rn(s, p), we assume without loss of gen-
erality that σ2= 1 and Σis a correlation matrix, so that diag(Σ) = Ip.
For a random variable X, the sub-Gaussian norm kXkψ2and sub-exponential
norm kXkψ1of Xare defined, respectively, as
kXkψ2= sup
q≥1
q−1/2E|X|q1/q and kXkψ1= sup
q≥1
q−1E|X|q1/q.
A random variable Xthat satisfies kXkψ2<∞(resp., kXkψ1<∞) is called
a sub-Gaussian (resp., sub-exponential) random variable [Vershynin (2012)].
The following moment conditions for ε∈Rand X∈Rpare imposed.
Condition 2.1.There exists a random vector Usuch that X=Σ1/2U,
E(U) = 0,E(UUT) = Ipand K1:= supα∈Sp−1kαTUkψ2<∞. The random
variable εhas a zero mean and unit variance, and is sub-Gaussian with
K0:= kεkψ2<∞. Moreover, write vq=E(|ε|q)for q≥3.
The following is our assumption for the sampling process.
Condition 2.2.{εi}n
i=1 and {Xi}n
i=1 are independent random samples
from the distributions of εand X, respectively.
For 1 ≤s≤p, the s-sparse minimal and maximal eigenvalues [Bickel,
Ritov and Tsybakov (2009)] of the covariance matrix Σare defined as
φmin(s) = min
u∈Rp:1≤|u|0≤s|u|Σ/|u|22, φmax(s) = max
u∈Rp:1≤|u|0≤s|u|Σ/|u|22,
where |u|Σ= (uTΣu)1/2and |u|2= (uTu)1/2is the `2-norm of u. Con-
sequently, for 1 ≤s≤p, the s-sparse condition number of Σis given by
(2.4) γs=γs(Σ) = pφmax(s)/φmin (s).
The quantity γsplays an important role in our analysis.
The following notation is used. For the two sequences {an}and {bn}
of positive numbers, we write an=O(bn) or an.bnif there exists a
constant C > 0 such that an/bn≤Cfor all sufficiently large n; we write
8J. FAN, Q.-M. SHAO AND W.-X. ZHOU
anbnif there exist constants C1, C2>0 such that, for all nlarge enough,
C1≤an/bn≤C2; and we write an∼bnand an=o(bn) if limn→∞ an/bn= 1
and limn→∞ an/bn= 0, respectively. For a, b ∈R, we write a∨b= max(a, b)
and a∧b= min(a, b). For every vector u, we denote by |u|q=Pi≥1|ui|q1/q
for q > 0 and |u|0=Pi≥1I{ui6= 0}. We use hu,vi=uTvto denote the
inner product of two vectors uand vwith the same dimension and kMkto
denote the spectral norm of a matrix M. For every positive integer `, we write
[`] = {1,2, . . . , `}, and for any set S, we use Scto denote its complement
and |S|for its cardinality. For each p-dimensional vector uand p×ppositive
semi-definite matrix A, we write |u|A= (uTAu)1/2. In particular, put
(2.5) αΣ=α/|α|Σ
for every α∈Rpand set 0Σ=0as the convention.
3. Distributions of maximum spurious correlations. In this sec-
tion, we first derive the asymptotic distributions of the maximum spurious
correlation b
Rn(s, p). The analytic form of such asymptotic distributions can
be obtained in the isotropic case. As the asymptotic distributions of b
Rn(s, p)
depend on the unknown covariance matrix Σ, we provide a bootstrap esti-
mate and demonstrate its consistency.
3.1. Asymptotic distributions of maximum spurious correlations. In view
of (2.3), we can rewrite b
Rn(s, p) as
b
Rn(s, p) = sup
f∈F
n−1Pn
i=1(εi−¯εn)f(Xi−¯
Xn)
pn−1Pn
i=1(εi−¯εn)2·qn−1Pn
i=1 f2(Xi−¯
Xn)
,(3.1)
where ¯εn=n−1Pn
i=1 εi,¯
Xn=n−1Pn
i=1 Xiand
(3.2) F=F(s, p) = x7→ fα(x) := hα,xi:α∈ V
is a class of linear functions Rp7→ R, where V=V(s, p) = {α∈Sp−1:
|α|0=s}. The dependence of Fand Von (s, p) is suppressed.
Let Z= (Z1, . . . , Zp)Tbe a p-dimensional Gaussian random vector with
a mean of zero and the covariance matrix Σ, i.e., Zd
=N(0,Σ). Denote by
Z2
(1) ≤Z2
(2) ≤ ··· ≤ Z2
(p)the order statistics of {Z2
1, . . . , Z2
p}. The following
theorem shows that the distribution of the maximum absolute multiple cor-
relation b
Rn(s, p) can be approximated by that of the supremum of a centered
Gaussian process G∗indexed by F.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 9
Theorem 3.1.Let Conditions 2.1 and 2.2 hold, n, p ≥2and 1≤s≤p.
Then there exists a constant C > 0independent of (s, p, n)such that
sup
t≥0P√nb
Rn(s, p)≤t−PR∗(s, p)≤t
≤C(K0K1)3/4n−1/8{sbn(s, p)}7/8,(3.3)
where K0and K1are defined in Condition 2.1,bn(s, p) := log(γsp/s)∨log n
for γsas in (2.4), R∗(s, p) := supf∈F G∗fand G∗={G∗f}f∈F is a centered
Gaussian process indexed by Fdefined as, for every fα∈ F,
(3.4) G∗fα=αT
ΣZ=αTZ
√αTΣα.
In particular, if Σ=Ipand slog(pn) = o(n1/7), then as n→ ∞,
sup
t≥0Pnb
R2
n(s, p)≤t−PZ2
(p)+· ·· +Z2
(p−s+1) ≤t→0.(3.5)
Remark 3.1.The Berry-Esseen bound given in Theorem 3.1 depends
explicitly on the triplet (s, p, n), and it depends on the covariance matrix Σ
only through its s-sparse condition number γs, defined in (2.4). The proof
of (3.3) builds on a number of technical tools including a standard covering
argument, maximal and concentration inequalities for the suprema of un-
bounded empirical processes and Gaussian processes as well as a coupling in-
equality for the maxima of sums of random vectors derived in Chernozhukov,
Chetverikov and Kato (2014a). Instead, if we directly resort to the general
framework in Theorem 2.1 of Chernozhukov, Chetverikov and Kato (2014a),
the function class of interest is F=x7→ αTx
(αTΣα)1/2:α∈Sp−1,|α|0=s.
Checking high-level conditions in Theorem 2.1 can be rather complicated
and less intuitive. Also, dealing with the (uniform) entropy integral that
corresponds to the class Frelies on verifying various VC-type properties,
and thus can be fairly tedious. Following a strategy similar to that used to
prove Theorem 2.1, we provide a self-contained proof of Theorem 3.1 in Sec-
tion 7.2 by making the best use of the specific structure of F. The proof is
more intuitive and straightforward. More importantly, it leads to an explicit
non-asymptotic bound under transparent conditions.
Remark 3.2.In Theorem 3.1, the independence assumption of εand X
can be relaxed as E(εX) = 0, E(ε2|X) = σ2and E(ε4|X)≤Calmost surely,
where C > 0 is a constant.
10 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Expression (3.5) indicates that the increment n{b
R2
n(s, p)−b
R2
n(s−1, p)}
is approximately the same as Z2
(p−s+1). This can simply be seen from the
asymptotic joint distribution of b
Rn(1, p),b
Rn(2, p),..., b
Rn(s, p). The fol-
lowing proposition establishes the approximation of the joint distributions
when both the dimension pand sparsity sare allowed to diverge with the
sample size n.
Proposition 3.1.Let Conditions 2.1 and 2.2 hold with Σ=Ip. Assume
that the triplet (s, p, n)satisfies 1≤s < n ≤pand s2log p=o(n1/7). Then
as n→ ∞,
sup
0≡t0<t1<t2<···<ts<1Ps
\
k=1 b
Rn(k, p)≤tk
−Ps
\
k=1 Z2
(p−k+1) ≤n(t2
k−t2
k−1)→0.
Remark 3.3.When s= 1 and if (n, p) satisfies log p=o(n1/7), it is
straightforward to verify that, for any t∈R,
(3.6) PZ2
(p)−2 log p+ log(log p)≤t→exp(−π−1/2e−t/2) as p→ ∞.
This result is similar in nature to (5) in Fan, Guo and Hao (2012). In fact, it
is proved in Shao and Zhou (2014) that the extreme-value statistic b
Rn(1, p)
is sensitive to heavy-tailed data in the sense that, under the ultra-high di-
mensional scheme, even the law of large numbers for the maximum spurious
correlation requires exponentially light tails of the underlying distribution.
We refer readers to Theorem 2.1 in Shao and Zhou (2014) for details. There-
fore, we believe that the exponential-type moment assumptions required in
Theorem 3.1 cannot be weakened to polynomial-type ones as long as log pis
allowed to be as large as ncfor some c∈(0,1). However, it is worth mention-
ing that the factor 1/7 in Proposition 3.1 may not be optimal, and according
to the results in Shao and Zhou (2014), 1/3 is the best possible factor to
ensure that the asymptotic theory is valid. To close this gap in theory, a
significant amount of additional work and new probabilistic techniques are
needed. We do not pursue this line of research in this paper.
For a general s≥2, we establish in the following proposition the limiting
distribution of the sum of the top sorder statistics of i.i.d. chi-square random
variables with degree of freedom 1.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 11
Proposition 3.2.Assume that s≥2is a fixed integer. For any t∈R,
we have as p→ ∞,
PZ2
(p)+· ·· +Z2
(p−s+1) −sap≤t
−→ π(1−s)/2
(s−1)!Γ(s−1) Zt/s
−∞ Z(t−sv)/2
0
us−2e−udue−(s−1)v/2g(v)dv,(3.7)
where ap= 2 log p−log(log p),G(t) = exp(−π−1/2e−t/2)and g(t) = G0(t) =
e−t/2
2√πG(t). The above integral can further be expressed as
G(t/s) + π1−s/2e−t/2
(s−1)! Zt/s
−∞
eug(u)du +π(1−s)/2e−t/2
(s−1)!
×
s−2
X
j=1 G(t/s)e(j+1)t/(2s)πj/2
j
Y
`=1
(s−`)−1
j!2jZt/s
−∞
(t−sv)jev/2g(v)dv.
(3.8)
In particular, when s= 2, the last term on the right-hand side of (3.8)
vanishes so that, as p→ ∞,
PZ2
(p)+Z2
(p−1) −2ap≤t→G(t/2) + e−t/2
2√πZt/2
−∞
eu/2G(u)du.
The proofs of Propositions 3.1 and 3.2 are placed in the supplemental
material [Fan, Shao and Zhou (2017)].
3.2. Multiplier bootstrap approximation. The distribution of R∗(s, p) =
supf∈F G∗ffor G∗in (3.4) depends on the unknown Σand thus cannot
be used for statistical inference. In the following, we consider the use of
a Monte Carlo method to simulate a process that mimics G∗, now known
as the multiplier (wild) bootstrap method, which is similar to that used in
Hansen (1996), Barrett and Donald (2003) and Chernozhukov, Chetverikov
and Kato (2013), among others.
Let b
Σnbe the sample covariance matrix based on the data {Xi}n
i=1 and
ξ1, . . . , ξnbe i.i.d. standard normal random variables that are independent
of {εi}n
i=1 and {Xi}n
i=1. Then, given {Xi}n
i=1,
(3.9) Zn=n−1/2
n
X
i=1
ξi(Xi−¯
Xn)∼N(0,b
Σn).
12 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
The following result shows that the (unknown) distribution of R∗(s, p) =
supfα∈F fα(Z)
√αTΣαfor Zd
=N(0,Σ) can be consistently estimated by the con-
ditional distribution of
(3.10) RMB
n(s, p) := sup
fα∈F
fα(Zn)
qαTb
Σnα
.
Theorem 3.2.Let Conditions 2.1 and 2.2 hold. Assume that the triplet
(s, p, n)satisfies 1≤s≤pand slog(γspn) = o(n1/5). Then as n→ ∞,
(3.11) sup
t≥0PR∗(s, p)≤t−PRMB
n(s, p)≤tX1,...,XnP
−→ 0.
Remark 3.4.Together, Theorems 3.1 and 3.2 show that the maximum
spurious correlation b
Rn(s, p) can be approximated in distribution by the
multiplier bootstrap statistic n−1/2RMB
n(s, p). In practice, when the sample
size nis relatively small, the value of n−1/2RMB
n(s, p) may exceed 1, which
makes it less favorable as a proxy for spurious correlation. To address this
issue, we propose using the following corrected bootstrap approximation:
(3.12) RCMB
n(s, p) := sup
fα∈F
√n
|ξ|2
fα(Zn)
qαTb
Σnα
,
where ξ= (ξ1, . . . , ξn)Tis used in the definition of Zn. By the Cauchy-
Schwarz inequality, n−1/2RCMB
n(s, p) is always between 0 and 1. In view of
(3.10) and (3.12), RCMB
n(s, p) differs from RMB
n(s, p) only up to a multiplica-
tive random factor n−1/2|ξ|2, which in theory is concentrated around 1 with
exponentially high probability. Thus, RMB
nand RCMB
nare asymptotically
equivalent, and (3.11) remains valid with RMB
nreplaced by RCMB
n.
4. Extension to sparse linear models. Suppose that the observed
response Yand p-dimensional covariate Xfollows the sparse linear model
(4.1) Y=XTβ∗+ε,
where the regression coefficient β∗is sparse. The sparsity is typically ex-
plored by the LASSO [Tibshirani (1996)], the SCAD [Fan and Li (2001)], or
the MCP [Zhang (2010)]. Now it is well-known that, under suitable condi-
tions, the SCAD and the MCP, among other folded concave penalized least-
square estimators, also enjoy the unbiasedness property and the (strong)
oracle properties. For simplicity, we focus on the SCAD. For a given random
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 13
sample {(Xi, Yi)}n
i=1, the SCAD exploits the sparsity by pλ-regularization,
which minimizes
(4.2) (2n)−1
n
X
i=1
(Yi−XT
iβ)2+
p
X
j=1
pλ(|βj|;a)
over β= (β1, . . . , βp)T∈Rp, where pλ(·;a) denotes the SCAD penalty
function [Fan and Li (2001)], i.e., p0
λ(t;a) = λI(t≤λ) + (aλ−t)+
a−1I(t>λ) for
some a > 2, and λ=λn≥0 is a regularization parameter.
Denote by X= (X1,...,Xn)Tthe n×pdesign matrix, Y= (Y1, . . . , Yn)T
the n-dimensional response vector, and ε= (ε1, . . . , εn)T, the n-dimensional
noise vector. Without loss of generality, we assume that β∗= (βT
1,βT
2)T
with each component of β1∈Rsbeing non-zero and β2=0, such that
S0:= supp(β∗) = {1, . . . , s}is the true underlying sparse model of the
indices with s=|β∗|0. Moreover, write X= (X1,X2), where X1∈Rn×s
consists of the columns of Xindexed by S0. In this notation, Y=Xβ+ε=
X1β1+εand the oracle estimator b
βoracle has an explicit form of
b
βoracle
1= (XT
1X1)−1XT
1Y=β1+ (XT
1X1)−1XT
1ε,b
βoracle
2=0.(4.3)
In other words, the oracle estimator is the unpenalized estimator that min-
imizes Pn
i=1(Yi−XT
i,S0βS0)2over the true support set S0.
Denote by b
εoracle = (bεoracle
1,...,bεoracle
n)T=Y−XTb
βoracle the residuals af-
ter the oracle fit. Then, we can construct the maximum spurious correlation
as in (2.2), except that {εi}n
i=1 is now replaced by {bεoracle
i}n
i=1, i.e.,
b
Roracle
n(1, p)
= max
j∈[p]|Pn
i=1(bεoracle
i−eT
nb
εoracle)(Xij −¯
Xj)|
qPn
i=1(bεoracle
i−eT
nb
εoracle)2·qPn
i=1(Xij −¯
Xj)2
,(4.4)
where en= (1/n, . . . , 1/n)T∈Rnand ¯
Xj=n−1Pn
i=1 Xij. We here deal
with the specific case of a spurious correlation of size 1, as this is what is
needed for testing the exogeneity assumption (1.2).
To establish the limiting distribution of b
Roracle
n(1, p), we make the follow-
ing assumptions.
Condition 4.1.Y=Xβ∗+εwith supp (β∗) = {1, . . . , s}and ε=
(ε1, . . . , εn)Tbeing i.i.d. centered sub-Gaussian satisfying that K0=kεikψ2<
∞. The rows of X= (X1,...,Xn)Tare i.i.d. sub-Gaussian random vectors
as in Condition 2.1.
14 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
As before, we can assume that Σ=E(XiXT
i) is a correlation matrix with
diag (Σ) = Ip. Set d=p−sand partition
Σ=Σ11 Σ12
Σ21 Σ22 with Σ11 ∈Rs×s,Σ22 ∈Rd×d,Σ21 =ΣT
12.(4.5)
Let Σ22.1= (eσjk)1≤j,k≤d=Σ22 −Σ21Σ−1
11 Σ12 be the Schur complement of
Σ11 in Σ.
Condition 4.2.eσmin = min1≤j≤deσjj is bounded away from zero.
Theorem 4.1.Assume that Conditions 4.1 and 4.2 hold, and that the
triplet (s, p, n)satisfies slog p=o(√n)and log p=o(n1/7). Then the maxi-
mum spurious correlation b
Roracle
n(1, p)in (4.4)satisfies that, as n→ ∞,
sup
t≥0P√nb
Roracle
n(1, p)≤t−P|e
Z|∞≤t→0,(4.6)
where e
Zd
=N(0,Σ22.1)is a d-variate centered Gaussian random vector with
covariance matrix Σ22.1.
As pλis a folded-concave penalty function, (4.2) is a non-convex opti-
mization problem. The local linear approximation (LLA) algorithm can be
applied to produce a certain local minimum for any fixed initial solution
[Zou and Li (2008), Fan, Xue and Zou (2014)]. In particular, Fan, Xue and
Zou (2014) prove that the LLA algorithm can deliver the oracle estimator
in the folded concave penalized problem with overwhelming probability if it
is initialized by some appropriate initial estimator.
Let b
βLLA be the estimator computed via the one-step LLA algorithm
initiated by the LASSO estimator [Tibshirani (1996)]. That is,
(4.7) b
βLLA = arg min
β(2n)−1
n
X
i=1
(Yi−XT
iβ)2+
p
X
j=1
p0
λ(|b
βLASSO
j|)|βj|,
where pλis a folded concave penalty, such as the SCAD and MCP penalties,
and b
βLASSO = arg minβ{(2n)−1Pn
i=1(Yi−XT
iβ)2+λ|β|1}. Accordingly,
denote by b
RLLA
n(1, p) the maximum spurious correlation as in (4.4) with
bεoracle
ireplaced by bεLLA
i=Yi−XT
ib
βLLA. Applying Theorem 4.1, we derive
the limiting distribution of b
RLLA
n(1, p) under suitable conditions. First, let
us recall the Restricted Eigenvalue concept formulated by Bickel, Ritov and
Tsybakov (2009).
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 15
Definition 4.1.For any integer s0∈[p]and positive number c0, the
RE (s0, c0)parameter κ(s0, c0,A)of a p×pmatrix Ais defined as
(4.8) κ(s0, c0,A) := min
S⊆[p]:|S|≤s0
min
δ6=0:|δSc|1≤c0|δS|1
δTAδ
|δS|2
2
.
Theorem 4.2.Assume that Conditions 4.1 and 4.2 hold, the minimal
signal strength of β∗satisfies minj∈S0|βj|>(a+ 1)λfor a, λ as in (4.2),
and that the triplet (s, p, n)satisfies slog p=o(√n),slog p
κ(s,3+,Σ)=o(n)for
some > 0and log p=o(n1/7). If the regularization parameters (λ, λLASSO)
are such that λ≥8√s
κ(s,3,Σ)λLASSO and λLASSO ≥CK0p(log p)/n for C > 0
large enough, then as n→ ∞,
sup
t≥0P√nb
RLLA
n(1, p)≤t−P|e
Z|∞≤t→0,(4.9)
where e
Zd
=N(0,Σ22.1).
5. Applications to high-dimensional inferences. This section out-
lines three applications in high-dimensional statistics. The first determines
whether discoveries by machine learning and data mining techniques are any
better than those reached by chance. Second, we show that the distributions
of maximum spurious correlations can also be applied to model selection. In
the third application, we validate the fundamental assumption of exogeneity
(1.2) in high dimensions.
5.1. Spurious discoveries. Let qCMB
α(s, p) be the upper α-quantile of the
random variable RCMB
n(s, p) defined by (3.12). Then, an approximate 1 −α
upper confidence limit of the spurious correlation is given by qCMB
α(s, p). In
view of Theorems 3.1 and 3.2, we claim that
(5.1) Pb
Rn(s, p)≤qCMB
α(s, p)→1−α.
To see this, recall that RCMB
n=√nRMB
n/|ξ|2for ξ= (ξ1, . . . , ξn)Tas in
(3.12), and given {Xi}n
i=1,RMB
nis the supremum of a Gaussian process. Let
FMB
n(t) = P{RMB
n(s, p)≤t|X1,...,Xn}be the (conditional) distribution
function of RMB
nand define t0= inf{t:FMB
n(t)>0}. By Theorem 11.1
of Davydov, Lifshits and Smorodina (1998), FMB
nis absolutely continuous
with respect to the Lebesgue measure and is strictly increasing on (t0,∞),
indicating that P{RCMB
n≤qCMB
α(s, p)|X1,...,Xn}=αalmost surely. This,
together with (3.3) and (3.11), proves (5.1) under Conditions 2.1,2.2, and
when slog(γspn) = o(n1/7).
16 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Let b
Yibe fitted values using spredictors indexed by b
Sselected by a data-
driven technique and Yibe the associated response value. They are denoted
in the vector form by b
Yand Y, respectively. If
(5.2) dcorrn(Y,b
Y)≤qCMB
α(s, p),
then the discovery of variables b
Scan be regarded as spurious; that is, no bet-
ter than by chance. Therefore, the multiplier bootstrap quantile qCMB
α(s, p)
provides an important critical value and yardstick for judging whether the
discovery is spurious, or whether the selected set b
Sincludes too many spu-
rious variables. This yardstick is independent of the method used in the
fitting.
Remark 5.1.The problem of judging whether the discovery is spuri-
ous is intrinsically different from that of testing the global null hypothesis
H0:β∗=0, which itself is an important problem in high-dimensional statis-
tical inference and has been well-studied in the literature since the seminal
work of Goeman, van de Geer and van Houwelingen (2006). For example, the
global null hypothesis H0:β∗=0can be rejected by a test; still, the corre-
lation between Yand the variables b
Sselected by a statistical method can be
smaller than the maximum spurious correlation, and we should interpret the
findings of b
Swith caution. We need either more samples or more powerful
variable selection methods. This motivates us to derive the distribution of
the maximum spurious correlation b
Rn(s, p). This distribution serves as an
important benchmark for judging whether the discovery (of sfeatures from
pexplanatory variables based on a sample of size n) is spurious. The magni-
tude of b
Rn(s, p) gives statisticians an idea of how big a spurious correlation
can be, and therefore an idea of how much the covariates really contribute
to the regression for a given sample size.
5.2. Model selection. In the previous section, we consider the reference
distribution of the maximum spurious correlation statistic b
Rn(s, p) as a
benchmark for judging whether the discovery of ssignificant variables (among
all of the pvariables using a random sample of size n) is impressive, regard-
less of which variable selection tool is applied. In this section, we show how
the distribution of b
Rn(s, p) can be used to select a model. Intuitively, we
would like to select a model that fits better than the spurious fit. This limits
the candidate sets of models and provides an upper bound on the model
size. In our experience, this upper bound itself provides a model selector.
We now use LASSO as an illustration of the above idea. Owing to spurious
correlation, almost all of the variable selection procedures will, with high
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 17
probability, select a number of spurious variables in the model so that the
selected model is over-fitted. For example, the LASSO method with the
regularization parameter selected by cross-validation typically selects a far
larger model size, as the bias caused by the `1penalty forces the cross-
validation procedure to choose a smaller value of λ. Thus, it is important
to stop the LASSO path earlier and the quantiles of b
Rn(s, p) provide useful
guards.
Specifically, consider the LASSO estimator b
βλfor the sparse linear model
(4.1) with bsλ=|supp(b
βλ)|, where λ > 0 is the regularization parameter.
We consider the LASSO solution path with the largest knot λini := |XTY|∞
and the smallest knot λcv selected by ten-fold cross-validation. To avoid
over-fitting, we propose using qCMB
αas a guide to choose the regularization
parameter that guards us from selecting too many spurious variables. For
each λin the path, we compute dcorrn(b
Yλ,Y), the sample correlation between
the post-LASSO fitted and observed responses, and qCMB
α(bsλ, p). Let b
λαbe
the largest λsuch that the sign of dcorrn(b
Yλ,Y)−qCMB
α(bsλ, p) is nonnegative
and then flips in the subsequent knot. The selected model is given by b
Sα=
supp(b
βb
λα). As demonstrated by the simulation studies in Section 6.4, this
procedure selects a much smaller model size that is closer to the real data.
5.3. Validating exogeneity. Fan and Liao (2014) show that the exogenous
condition (1.2) is necessary for penalized least-squares to achieve a model
selection consistency. They question the validity of such an exogeneous as-
sumption, as it imposes too many equations. They argue further that even
when the exogenous model holds for important variables XS, i.e.,
(5.3) Y=XT
Sβ∗
S+ε, E(εXS) = 0,
the extra variables XN(with N=Sc) are collected in an effort to cover the
unknown set S— but no verification of the conditions
(5.4) E(εXN) = E{(Y−XT
Sβ∗
S)XN}=0
has ever been made. The equality E{(Y−XT
Sβ∗
S)Xj}= 0 in (5.4) holds
by luck for some covariate Xj, but it can not be expected that this holds
for all j∈N. They propose a focussed generalized method of moment
(FGMM) to avoid the unreasonable assumption (5.4). Recognizing (5.3) is
not identifiable in high-dimensional linear models, they impose additional
conditions such as E(εX2
S) = 0.
Despite its fundamental importance to high-dimensional statistics, there
are no available tools for validating (1.2). Regarding (1.2) as a null hypoth-
esis, an asymptotically α-level test can be used to reject assumption (1.2)
18 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
when
(5.5) b
Tn,p = max
j∈[p]√ndcorrn(Xj, ε)≥qCMB
α(1, p).
By Theorems 3.1 and 3.2, the test statistic has an approximate size α. The
p-value of the test can be computed via the distribution of the Gaussian
multiplier process RCMB
n(1, p).
As pointed out in the introduction, when the components of Xare weakly
correlated, the distribution of the maximum spurious correlation does not
depend very sensitively on Σ. See also Lemma 6 in Cai, Liu and Xia (2014).
In this case, we can approximate it by the identity matrix, and hence one
can compare the renormalized test statistic
(5.6) Jn,p =b
T2
n,p −2 log p+ log(log p)
with the limiting distribution in (3.6). The critical value for test statistic
Jn,p is
(5.7) Jα=−2 log{−√πlog(1 −α)},
and the associated p-value is given by
(5.8) exp(−π−1/2e−Jn,p/2).
Expressions (5.7) and (5.8) provide analytic forms for a quick validation of
the exogenous assumption (1.2) under weak dependence. In general, we rec-
ommend using the wild bootstrap, which takes into account the correlation
effect and provides more accurate estimates especially when the dependence
is strong. See Chang et al. (2017) for more empirical evidences.
In practice, εis typically unknown to us. Therefore, b
Tn,p in (5.5) is calcu-
lated using the fitted residuals {bεLLA
i}n
i=1. In view of Theorem 4.2, we need
to adjust the null distribution according to (4.9). By Theorem 3.2, we adjust
the definition of the process Znin (3.9) by
(5.9) ZLLA
n=n−1/2
n
X
i=1
ξi(XLLA
i−XLLA
n)∈Rp−|
b
S|,
where XLLA
i=Xi,
b
N−b
Σb
N
b
Sb
Σ−1
b
S
b
SXi,
b
Sis the residuals of Xb
Nregressed on
Xb
S, where b
Sis the set of selected variables, b
N= [p]\b
S, and b
ΣSS0denotes
the sub-matrix of b
Σncontaining entries indexed by (k, `)∈S×S0. From
(5.9), the multiplier bootstrap approximation of |e
Z|∞is RMB,LLA
n(1, p) =
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 19
|b
D−1/2ZLLA
n|∞, where b
D= diagonal matrix of the sample covariance matrix
of {XLLA
i}n
i=1. Consequently, we reject (1.2) if b
Tn,p > qMB,LLA
α(1, p), where
qMB,LLA
α(1, p) is the (conditional) upper α-quantile of RMB,LLA
n(1, p) given
{Xi}n
i=1.
Remark 5.2.To the best of our knowledge, this is the first paper to
consider testing the exogenous assumption (1.2), for which we use the maxi-
mum correlation between covariates and fitted residuals as the test statistic.
A referee kindly informed us in his/her review report that in the context
of specification testing, Chernozhukov, Chetverikov and Kato (2013) pro-
pose a similar extreme value statistic and use the multiplier bootstrap to
compute a critical value for the test. To construct marginal test statistics,
they use self-normalized covariances between generated regressors and fitted
residuals obtained via ordinary least squares, whereas we use sample cor-
relations between the covariates and fitted residuals obtained by the LLA
algorithm. We refer readers to Appendix M in the supplementary material
of Chernozhukov, Chetverikov and Kato (2013) for more details.
6. Numerical studies. In this section, Monte Carlo simulations are
used to examine the finite-sample performance of the bootstrap approxi-
mation (for a given data set) of the distribution of the maximum spurious
correlation (MSC).
6.1. Computation of spurious correlation. First, we observe that b
Rn(s, p)
in (2.2) can be written as b
R2
n(s, p) = bσ−2
εmaxS⊆[p]:|S|=svT
n,S b
Σ−1
SS vn,S , where
bσ2
ε=n−1Pn
i=1(εi−¯εn)2and vn=n−1Pn
i=1(εi−¯εn)(Xi−¯
Xn). Therefore,
the computation of b
Rn(s, p) requires solving the combinatorial optimization
problem
(6.1) b
S= arg max
S⊆[p]:|S|=svT
n,S b
Σ−1
SS vn,S .
It is computationally intensive to obtain b
Sfor large values of pand sas
one essentially needs to enumerate all p
spossible subsets of size sfrom p
covariates. A fast and easily implementable approach is to use the stepwise
addition (forward selection) algorithm as in Fan, Guo and Hao (2012), which
results in some value that is no larger than b
Rn(s, p) but avoids computing all
p
smultiple correlations in (6.1). Note that the optimization (6.1) is equiva-
lent to finding the best subset regression of size s. When pis relatively small,
say if pranges from 20 to 40, the branch-and-bound procedure is commonly
used for finding the best subset of a given size that maximizes multiple R2
20 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
[Brusco and Stahl (2005)]. However, this approach becomes computational
infeasible very quickly when there are hundreds or thousands of potential
predictors. As a trade-off between approximation accuracy and computa-
tional intensity, we propose using a two-step procedure that combines the
stepwise addition and branch-and-bound algorithms. First, we use the for-
ward selection to pick the best dvariables, say d= 40, which serves as a
pre-screening step. Second, across the d
ssubsets of size s, the branch-and-
bound procedure is implemented to select the best subset that maximizes
the multiple-R2. This subset is used as an approximate solution to (6.1).
Note that when s > 40, which is rare in many applications, we only use the
stepwise addition to reduce the computational cost.
6.2. Accuracy of the multiplier bootstrap approximation. For the first
simulation, we consider the case where the random noise εfollows the uni-
form distribution standardized so that E(ε) = 0 and E(ε2) = 1. Independent
of ε, the p-variate vector Xof covariates has i.i.d. N(0,1) components. In the
results reported in Table 1, the ambient dimension p= 2000, the sample size
ntakes a value in {400,800,1200}, and stakes a value in {1,2,5,10}. For
a given significance level α∈(0,1), let qα(s, p) be the upper α-quantile of
b
Rn(s, p) in (2.1). For each data set Xn={X1,...,Xn}, a direct application
of Theorems 3.1 and 3.2 is that
cMB(Xn, α) := PRCMB
n(s, p)≥qα(s, p)|Xn→αas n→ ∞.
The difference cMB(Xn, α)−α, however, characterizes the extent of the size
distortions and the finite-sample accuracy of the multiplier bootstrap ap-
proximation (MBA). Table 1summarizes the mean and the standard de-
viation (SD) of cMB(Xn, α) based on 200 simulated data sets with α∈
{0.05,0.1}. The α-quantile qα(s, p) is calculated from 1600 replications, and
cMB(Xn, α) for each data set is simulated based on 1600 bootstrap replica-
tions. In addition, we report in Figure 3the distributions of the maximum
spurious correlations and their multiplier bootstrap approximations condi-
tional on a given data set Xnwhen p∈ {2000,5000},s∈ {1,2,5,10}and
n= 400. Together, Table 1and Figure 3show that the multiplier bootstrap
method indeed provides a quite good approximation to the (unknown) dis-
tribution of the maximum spurious correlation.
For the second simulation, we focus on an anisotropic case where the co-
variance matrix Σof Xis non-identity, and the condition number of Σis
well-controlled. Specifically, we assume that εfollows the centered Laplace
distribution rescaled so that E(ε) = 0 and E(ε2) = 1. To introduce de-
pendence among covariates, first we denote with Aa 10 ×10 symmet-
ric positive definite matrix with a pre-specified condition number c > 1
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 21
Table 1
(Isotropic case) The mean of 200 empirical sizes cMB(·, α)×100, with its estimate of SD
in the parenthesis, when p= 2000,s= 1,2,5,10,n= 400,800,1200, and α= 0.1,0.05
Trule s= 1 s= 2 s= 5 s= 10
n α = 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05
400 9.54 4.68 9.13 4.38 9.08 3.78 8.67 4.44
(0.643) (0.294) (0.568) (0.284) (0.480) (0.245) (0.506) (0.291)
800 9.43 4.93 9.47 4.42 9.73 4.73 9.94 5.62
(0.444) (0.296) (0.474) (0.296) (0.488) (0.294) (0.557) (0.331)
1200 9.09 4.32 9.00 4.46 9.42 4.87 9.97 5.15
(0.507) (0.261) (0.542) (0.278) (0.543) (0.322) (0.579) (0.318)
0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6
Density
0
5
10
15
20
25
30 (n,p) = (400,2000)
MSC-1
MBA-1
MSC-2
MBA-2
MSC-5
MBA-5
MSC-10
MBA-10
Maximum Spurious Correlation vs Multiplier Bootstrap Approximation
0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6
Density
0
10
20
30
40 (n,p) = (400,5000)
Fig 3: Distributions of maximum spurious correlations (blue) and multiplier
bootstrap approximations (for a given data set; red) based on 1600 simula-
tions with combinations of p= 2000,5000, s= 1,2,5,10, and n= 400 when
Σis an identity matrix.
and let ρ∈(0,1). Then the p-dimensional vector Xof the covariates is
22 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
generated according to X=G1(ρ)Z1+G2(ρ)Z2, where Z1d
=N(0,A),
Z2d
=N(0,Ip−10), and G1(ρ)∈Rp×10 ,G2(ρ)∈Rp×(p−10) are given respec-
tively by G1(ρ)T= (I10,ρ
√1+ρ2I10,G11(ρ)T) with
G11(ρ) = 1−ρ
p1 + (1 −ρ)2
1 0 . . . 0
1 0 . . . 0
.
.
..
.
.··· .
.
.
1 0 . . . 0
∈R(p−20)×10
and
G2(ρ) =
010×10 010×(p−20)
1
√1+ρ2I10 010×(p−20)
0(p−20)×10 1
√1+(1−ρ)2Ip−20
.
In particular, we take c= 5 and ρ= 0.8 in the simulations reported in
Table 2, which summarizes the mean and the standard deviation (SD) of
the size cMB(Xn, α) based on 200 simulated data sets with α∈ {0.05,0.1}.
Comparing the simulation results shown in Tables 1and 2, we find that
the bootstrap approximation is fairly robust against heterogeneity in the
covariance structure of the covariates.
Table 2
(Anisotropic case) Mean of 200 empirical sizes cMB(·, α)×100, with its estimate of SD
in the parenthesis, when p= 2000,s= 1,2,5,10,n= 400,800,1200, and α= 0.1,0.05
Trule s= 1 s= 2 s= 5 s= 10
n α = 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05
400 9.83 4.39 9.04 4.75 9.27 4.65 9.34 4.53
(0.426) (0.222) (0.402) (0.208) (0.492) (0.273) (0.557) (0.291)
800 10.18 5.19 10.48 5.12 9.98 4.86 9.21 4.73
(0.556) (0.296) (0.519) (0.272) (0.576) (0.220) (0.474) (0.269)
1200 9.42 4.41 9.60 5.71 9.90 4.85 10.11 5.19
(0.500) (0.233) (0.543) (0.339) (0.553) (0.333) (0.606) (0.337)
6.3. Detecting spurious discoveries. To examine how the multiplier boot-
strap quantile qCMB
α(s, p) (see Section 5.1) serves as a benchmark for judging
whether the discovery is spurious, we compute the Spurious Discovery Prob-
ability (SDP) by simulating 200 data sets from (4.1) with n= 100,120,160,
p= 400, β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp, and standard Gaus-
sian noise εd
=N(0,1). For some integer s≤r≤p, we let xd
=N(0,Ir) be an
r-dimensional Gaussian random vector. Let Γrbe a p×rmatrix satisfying
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 23
ΓT
rΓr=Ir. The rows of the design matrix Xare sampled as i.i.d. copies from
Γrx∈Rp, where rtakes a value in {120,160,200,240,280,320,360}. To save
space, we give the numerical results for the case of non-Gaussian design and
noise in the supplementary material [Fan, Shao and Zhou (2017)].
Put Y= (Y1, . . . , Yn)Tand let b
Y=Xb
Sb
βpLASSO be the n-dimensional
vector of fitted values, where b
βpLASSO = (XT
b
SXb
S)−1XT
b
SYis the post-LASSO
estimator using covariates selected by the ten-fold cross-validated LASSO
estimator. Let bs=|b
S|0be the number of variables selected. For α∈(0,1),
the level-αSDP is defined as P{|dcorrn(Y,b
Y)| ≤ qCMB
α(bs, p)}. As the simu-
lated model is not null, this SDP is indeed a type II error. Given α= 5% and
for each simulated data set, qCMB
α(s, p) is computed based on 1000 bootstrap
replications. Then we compute the empirical SDP based on 200 simulations.
The results are given in Table 3.
In this study, the design matrix is chosen so that there is a low-dimensional
linear dependency in the high-dimensional covariates. The collected covari-
ates are highly correlated when ris much smaller than p. It is known that
collinearity and high dimensionality add difficulty to the problem of variable
selection and deteriorate the performance of the LASSO. The smaller the
ris, the more severe the problem of collinearity becomes. As reflected in
Table 3, the empirical SDP increases as rdecreases, indicating that the cor-
relation between fitted and observed responses is more likely to be smaller
than the spurious correlation.
Table 3
Empirical α-level spurious discovery probability (ESDP) based on 200 simulations when
p= 400,n= 100,120,160, and α= 5%.
Trule r= 120 r= 160 r= 200 r= 240 r= 280 r= 320 r= 360
n= 100 0.6950 0.6650 0.6000 0.5200 0.5100 0.4500 0.4000
n= 120 0.6600 0.5350 0.3350 0.3800 0.2500 0.2850 0.1950
n= 160 0.1950 0.1300 0.0500 0.0400 0.0550 0.0700 0.0250
6.4. Model selection. We demonstrate the idea in Section 5.2 through
the following toy example. Consider the linear model (4.1) with (n, p) =
(160,400) and β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp. The covariate
vector is taken to be X=Γx with x= (x1, . . . , x200)T, where x1, . . . , x200
are i.i.d. random variables following the continuous uniform distribution on
[−1,1] and Γis a 400×200 matrix satisfying ΓTΓ=I200. The noise variable
εfollows a standardized t-distribution with 4 degrees of freedom. Moreover,
let S0={j:β∗
j6= 0}be the true model.
24 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Applying ten-fold cross-validated LASSO selects 35 variables. Along the
solution path, we compute the number of correctly selected variables |b
S∩S0|,
the fitted correlation, and the upper 5%-quantile of the multiplier bootstrap
approximation of the maximum spurious correlation based on 1000 boot-
strap samples. The results are provided in Table 4, from which we see that
the cross-validation procedure under the guidance of MSC selects 15 vari-
ables including all of the signal covariates.
Table 4
Number of true positive results, the sample correlation between fitted and observed
responses, and the upper 5%-quantile of the multiplier bootstrap approximation based on
1000 bootstrap samples.
Trule |
b
Sλ∩S0|dcorrn(Y,
b
YpLASSO
λ)qCMB
0.05 (bsλ, p)
λ=0.3410
(bsλ=1) 1 0.3314 0.3040
λ=0.2703
(bsλ=2) 2 0.4802 0.3870
λ=0.2580
(bsλ=3) 3 0.5255 0.4435
λ=0.2351
(bsλ=4) 3 0.5536 0.4907
λ=0.2142
(bsλ=5) 3 0.5791 0.5297
λ=0.2044
(bsλ=6) 3 0.5971 0.5608
λ=0.1952
(bsλ=8) 3 0.6205 0.6131
λ=0.1778
(bsλ=9) 3 0.6377 0.6365
λ=0.1697
(bsλ=11) 4 0.6953 0.6758
λ=0.1620
(bsλ=14) 4 0.7380 0.7208
λ=0.1409
(bsλ=15) 40.7490 0.7346
λ=0.1345
(bsλ=19) 4 0.7685 0.7799
.
.
..
.
..
.
..
.
.
λ=0.0885
(bsλ=35) 4 0.8428 0.8847
6.5. Gene expression data. In this section, we extend the previous study
in Section 6.3 to an analysis of a real life data set. To further address the
question that for a given data set, whether the discoveries based on certain
data-mining technique are any better than spurious correlation, we consider
again the gene expression data from 90 individuals (45 Japanese and 45
Chinese, JPT-CHB) from the international ‘HapMap’ project [Thorisson et
al. (2005)] discussed in the introduction.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 25
The gene CHRNA6 is thought to be related to the activation of dopamine-
releasing neurons with nicotine, and therefore has been the subject of many
nicotine addiction studies [Thorgeirsson et al. (2010)]. We took the expres-
sions of CHRNA6 as the response Yand the remaining p= 47292 expres-
sions of probes as covariates X. For a given λ > 0, LASSO selects bsλprobes
indexed by b
Sλ. In particular, using ten-fold cross-validation to select the
tuning parameter gives bsλ0= 25 probes with λ0= 0.0674. Define fitted
vectors b
YLASSO
λ=Xb
βLASSO
λand b
YpLASSO
λ=Xb
Sλb
βpLASSO
λ, where b
βLASSO
λis
the LASSO estimator and b
βpLASSO
λis the post-LASSO estimator, which is
the least-square estimator based on the LASSO selected set.
We depict the observed correlations between the fitted value and the
response as well as the median and upper α-quantile of the multiplier boot-
strap approximation with α= 10% based on 1000 bootstrap replications in
Table 5. Even though dcorrn(Y,b
YLASSO)=0.8991 and dcorrn(Y,b
YpLASSO) =
0.9214, the discoveries appear to be no better than chance. We therefore in-
crease λ, which decreases the size of discovered probes. From Table 4, only
the discovery of three probes is above chance results at α= 10%. The three
probes are BBS1 – Homo sapiens Bardet-Biedl syndrome 1, POLE2 – Homo
sapiens polymerase (DNA directed), epsilon 2 (p59 subunit), and TG737 –
Homo sapiens Probe hTg737 (polycystic kidney disease, autosomal reces-
sive), transcript variant 2. Figure 4shows the observed correlations of the
fitted values and observed values compared to the reference null distribution.
We now use the test statistic (5.5) to test whether the null hypothesis (1.2)
holds. We take λ0= 0.0674 and compute the observed test statistic b
Tobs
n,p =
4.6318. This corresponds to √ntimes the maximum correlation presented
in Figure 1. Using the null distribution provided by (4.9), which can be
estimated via the multiplier bootstrap, yields the p-value 0.001. Further,
using the SCAD gives b
Tobs
n,p = 4.1324 and a p-value 0.0164. Both calculations
are based on 5000 bootstrap replications. Therefore, the evidence against
the exogeneity assumption is very strong. Figure 4depicts the observed test
statistics relative to the null distribution.
7. Proofs. We first collect several technical lemmas in Section 7.1 be-
fore proving our main result, Theorem 3.1 in Section 3. The proofs of Theo-
rems 3.2,4.1 and 4.2 are given in the supplemental material [Fan, Shao and
Zhou (2017)], where the proofs of Propositions 3.1 and 3.2 and Lemmas 7.2–
7.6 can also be found. Throughout, the letters C, C1, C2, . . . and c, c1, c2, . . .
denote generic positive constants that are independent of (s, p, n), whose
values may change from line to line.
26 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Table 5
Sample correlations between fitted and observed responses, and the empirical median and
upper α-quantile of the multiplier bootstrap approximation based on 1200 bootstrap
samples when α= 10%.
Trule dcorrn(Y,
b
YLASSO
λ)dcorrn(Y,
b
YpLASSO
λ)qCMB
0.5(bsλ, p)qCMB
0.1(bsλ, p)
λ=0.1789
(bsλ=2) 0.6813 0.6879 0.5585 0.5988
λ=0.1708
(bsλ=3) 0.6915 0.7010 0.6555 0.6904
λ=0.1630
(bsλ=4) 0.7059 0.7260 0.7252 0.7554
λ=0.1556
(bsλ=5) 0.7141 0.7406 0.7797 0.8044
λ=0.1292
(bsλ=8) 0.7454 0.7641 0.8828 0.8988
λ=0.1177
(bsλ=14) 0.7714 0.8307 0.9658 0.9724
λ=0.1073
(bsλ=17) 0.8026 0.8739 0.9817 0.9860
λ=0.0933
(bsλ=21) 0.8451 0.9019 0.9915 0.9945
λ=0.0891
(bsλ=23) 0.8561 0.9109 0.9937 0.9966
λ=0.0674
(bsλ=25) 0.8991 0.9214 0.9953 0.9979
7.1. Technical lemmas. The following lemma combines Propositions 5.10
and 5.16 in Vershynin (2012).
Lemma 7.1.Let X1, . . . , Xnbe independent centered random variables
and write xn= (X1, . . . , Xn)T∈Rn. Then for every a= (a1, . . . , an)T∈Rn
and every t≥0, we have
(7.1) P|aTxn| ≥ t≤2 exp −cBmin t2
B2
1|a|2
2
,t
B1|a|∞
and
(7.2) P|aTxn| ≥ t≤2 exp −cH
t2
B2
2|a|2
2,
where Bv= max1≤i≤nkXikψvfor v= 1,2and cB, cH>0are absolute
constants.
Lemma 7.2.Let Conditions 2.1 and 2.2 be fulfilled. Write
Dn=Dn(s, p) := sup
α∈V αTb
Σnα/αTΣα−1and bσ2
ε=n−1
n
X
i=1
(εi−¯εn)2,
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 27
(a)
0.55 0.6 0.65 0.7 0.75
0
50
100
150
200
250
300
350 lambda = 0.1708, s = 3
(b)
0.9 0.92 0.94 0.96 0.98 1
0
200
400
600
800
1000
1200 lambda = 0.0674, s = 25
(c)
2 3 4 5
0
500
1000
1500
2000 Validating exogeneity via LASSO
(d)
2 3 4 5
0
500
1000
1500
2000 Validating exogeneity via SCAD
Fig 4: Top panel: Distributions of the spurious correlation b
Rn(s, p) estimated
by the bootstrap approximation for (a) s= 3 and (b) s= 25 and the sample
correlation between fitted and observed responses (see Table 5). Red solid
lines are observed correlations and blue dash-dot lines mark the 90th per-
centile in (a) the median and (b) the distributions of the median. Bottom
panel: Null distributions for testing exogeneity (1.2) and its 95th percentile
(indicated by dash blue line) using bootstrap approximation (4.9) and ob-
served test statistics b
Tobs
n,p (indicated by solid red line) based on the residuals
of the LASSO and SCAD.
where b
Σn=n−1Pn
i=1(Xi−¯
Xn)(Xi−¯
Xn)Tand Vis as in (3.2). Then,
there exists a constant C1>0such that, for every t≥1,
Dn≤C1K2
1rs
nlog(γsep/s) + max rt
n, cn(s, p)t
n
(7.3)
holds with probability at least 1−8e−t, where cn(s, p) := slog(γsep/s)∨log n.
28 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Moreover, for every t > 0,
(7.4) bσ2
ε−1≤K2
0n−1t+ 4K2
0max n−1/2√t, n−1t
holds with probability greater than 1−2 exp(−cBt)−2 exp(−cHt), where
cB, cH>0are absolute constants as in (7.1)and (7.2).
The following results address the concentration and anti-concentration
phenomena of the supremum of the Gaussian process G∗indexed by F
(see (3.4)). In line with Chernozhukov, Chetverikov and Kato (2013), in-
equalities (7.5) and (7.6) below are referred to as the concentration and
anti-concentration inequalities, respectively.
Lemma 7.3.Let R∗(s, p) = supfα∈F fα(Z)/|α|Σfor F=F(s, p)given
in (3.2)and Zd
=N(0,Σ). Then there exists an absolute constant C > 0
such that, for every p≥2,1≤s≤pand t > 0,
PR∗(s, p)≥Cpslog(γsep/s) + t≤e−t2/2
(7.5)
and sup
x≥0
P|R∗(s, p)−x| ≤ t≤Ctpslog(γsep/s),(7.6)
where γs=pφmax(s)/pφmin(s).
Lemma 7.4.Suppose that a≥1and bj, cj>0for j= 1, . . . , m are pos-
itive constants. Let X1, . . . , Xmbe real-valued random variables that satisfy
P(|Xj| ≥ t)≤aexp{−t2/(2bj)},for t > 0, j = 1, . . . , m.
Then, for all m≥4/a, we have E(max1≤j≤m|Xj|)≤2plog(am) max1≤j≤mbj.
Furthermore, suppose that P(|Xj| ≥ t)≤aexp(−t/cj)holds for all t > 0
and j= 1, . . . , m. Then, for any m≥4/a, we have E(max1≤j≤m|Xj|)≤
{log(am)+1}max1≤j≤mcj.
To save space, we leave the proofs of Lemmas 7.2–7.4 to Appendix Ain
the supplemental material [Fan, Shao and Zhou (2017)].
7.2. Proof of Theorem 3.1.In view of (3.1), we have
b
Rn(s, p) = sup
α∈V
n−1Pn
i=1 αT(εiXi)−¯εnαT¯
Xn
(αTb
Σnα)1/2· {n−1Pn
i=1(εi−¯εn)2}1/2,
where Vis as in (3.2).
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 29
By Lemma 7.2, instead of dealing with b
Rn(s, p) directly, we first investi-
gate the asymptotic behavior of its standardized counterpart given by
(7.7) Rn(s, p) = sup
α∈V
n−1
n
X
i=1
αT(εiXi)
|α|Σ
= sup
α∈V
n−1
n
X
i=1
αT
Σyi,
where yi=εiXi= (Yi1, . . . , Yip)Tare i.i.d. random vectors with mean zero
and covariance matrix Σ. Let Pybe the probability measure on Rpinduced
by y=εX. Further, define rescaled versions of b
Rn(s, p) and Rn(s, p) as
(7.8) b
Ln=b
Ln(s, p) = √nb
Rn(s, p), Ln=Ln(s, p) = sup
α∈V
n−1/2
n
X
i=1
αT
Σyi.
The main strategy is to prove the Gaussian approximation of Lnby the
supremum of a Gaussian process G∗indexed by Fwith covariance function
EG∗fα1G∗fα2=αT
1Σα2
|α1|Σ· |α2|Σ
,α1,α2∈ V.
Let Zbe a p-variate centered Gaussian random vector with covariance ma-
trix Σ. Then the aforementioned Gaussian process G∗can be induced by Z
in the sense that for every α∈ V,G∗fα=αT
ΣZ. The following lemmas show
that, under certain moment conditions, the distribution of Ln=√nRn(s, p)
can be consistently estimated by that of the supremum of the Gaussian pro-
cess G∗, denoted by R∗(s, p) = supα∈V G∗fα, and b
Lnand Lnare close. We
state them first in the following two lemmas and prove them in Appendix F
of the supplemental material [Fan, Shao and Zhou (2017)].
Lemma 7.5.Under Conditions 2.1 and 2.2, there exists a random vari-
able T∗=T∗(s, p)d
= supα∈V αT
ΣZfor Zd
=N(0,Σ)such that, for any
δ∈(0, K0K1],
|Ln−T∗|.n−1c1/2
n(s, p) + K0K1n−3/2c2
n(s, p) + δ(7.9)
holds with probability at least 1−C∆n(s, p ;δ), where cn(s, p) = slog(γsep/s)∨
log nand
∆n(s, p ;δ) = (K0K1)3{sbn(s, p)}2
δ3√n+ (K0K1)4{sbn(s, p)}5
δ4n
with bn(s, p) = log(γsp/s)∨log n.
30 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Lemma 7.6.Let Conditions 2.1 and 2.2 hold. Assume that the sample
size satisfies n≥C1(K0∨K1)4cn(s, p). Then, with probability at least 1−
C2n−1/2c1/2
n(s, p),
|b
Ln−Ln|.(K0∨K1)2K0K1n−1/2cn(s, p),(7.10)
where cn(s, p) = slog(γsep/s)∨log n.
Let bn(s, p) = log(γsp/s)∨log n. Applying Lemmas 7.5 and 7.6 with
δ=δn(s, p) = (K0K1)3/4min 1, n−1/8{sbn(s, p)}3/8
yields that, with probability at least 1 −C(K0K1)3/4n−1/8{sbn(s, p)}7/8,
|b
Ln−T∗|.(K0K1)3/4n−1/8{sbn(s, p)}3/8.
Together with the inequality (7.6), this proves (3.3).
Further, using (3.2), (3.4) and the identity vTA−1v= maxα∈Ss−1(αTv)2
αTAα
that holds for any s×spositive definite matrix A, we find that with prob-
ability one,
R∗(s, p) = max
S⊆[p]:|S|=smax
α∈Ss−1
αTZS
pαTΣSS α= max
S⊆[p]:|S|=sqZT
SΣ−1
SS ZS,(7.11)
where for each S⊆[p] fixed, the second maximum over αis achieved when
α=Σ−1/2
SS ZS/|Σ−1/2
SS ZS|2, as for each p≥1 fixed, all of the coordinates of
Zare non-zero almost surely. In particular, when Σ=Ip, the right-hand
side of (7.11) is reduced to maxS⊆[p]:|S|=s|ZS|2and therefore, {R∗(s, p)}2=
maxS⊆[p]:|S|=sPj∈SZ2
j=Z2
(p)+···+Z2
(p−s+1) happens with probability one.
This and (3.3) complete the proof of (3.5).
References.
Arlot, S., Blanchard, G. and Roquain, E. (2010). Some nonasymptotic results on
resampling in high dimension. I. Confidence regions. Ann. Statist. 38 51–82.
Barrett, G. F. and Donald, S. G. (2003). Consistent tests for stochastic dominance.
Econometrica 71 71–104.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso
and Dantzig selector. Ann. Statist. 37 1705–1732.
B¨
uhlmann, P. and van de Geer, S. A. (2011). Statistics for High-Dimensional Data:
Methods, Theory and Applications. Springer, New York.
Brusco, M. J. and Stahl, S. (2005). Branch-and-Bound Applications in Combinatorial
Data Analysis. Springer, New York.
Cai, T. T. and Jiang, T. (2011). Limiting laws of coherence of random matrices with
applications to testing covariance structure and construction of compressed sensing
matrices. Ann. Statist. 39 1496–1525.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 31
Cai, T. T., Fan, J. and Jiang, T. (2013). Distributions of angles in random packing on
spheres. J. Mach. Learn. Res. 14 1837–1864.
Cai, T. T., Liu, W. and Xia, Y. (2014). Two-sample test of high dimensional means
under dependence. J. R. Stat. Soc. Ser. B 76 349–372.
Chang, J.,Zheng, C.,Zhou, W.-X. and Zhou, W. (2017). Simulation-based hypoth-
esis testing of high dimensional means under covariance heterogeneity. Biometrics. To
appear. DOI: 10.1111/biom.12695.
Chatterjee, S. and Bose, A. (2005). Generalized bootstrap for estimating equations.
Ann. Statist. 33 414–436.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations
and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann.
Statist. 41 2786–2819.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation
of suprema of empirical processes. Ann. Statist. 42 1564–1597.
Davydov, Yu. A.,Lifshits, M. A. and Smorodina, N. V. (1998). Local Properties of
Distributions of Stochastic Functionals. American Mathematical Society, Providence,
RI.
Dudoit, S. and van der Laan, M. J. (2007). Multiple Testing Procedures with Applica-
tions to Genomics. Springer, New York.
Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Test-
ing, and Prediction. IMS Monographs Vol 1. Cambridge University Press, Cambridge.
Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation
in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B 74 37–65.
Fan, J., Han, F. and Liu, H. (2014). Challenges of Big Data analysis. National Science
Review 1293–314.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
Fan, J. and Liao, Y. (2014). Endogeneity in high dimensions. Ann. Statist. 42 872–917.
Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional
feature space. Statist. Sinica 20 101–148.
Fan, J., Shao, Q.-M. and Zhou, W.-X. (2017). Supplement to “Are discoveries spurious?
Distributions of maximum spurious correlations and their applications.”
Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized
estimation. Ann. Statist. 42 819–849.
Goeman, J. J., van de Geer, S. A. and van Houwelingen, H. C. (2006). Testing
against a high dimensional alternative. J. R. Stat. Soc. Ser. B 68 477–493.
Hansen, B. E. (1996). Inference when a nuisance parameter is not identified under the
null hypothesis. Econometrica 64 413–430.
Hastie, T. J., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction (2nd ed). Springer, New York.
Shao, Q.-M. and Zhou, W.-X. (2014). Necessary and sufficient conditions for the
asymptotic distributions of coherence of ultra-high dimensional random matrices. Ann.
Probab. 42 623–648.
Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C.,
Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavar´
e, S.,
Deloukas, P. and Dermitzakis, E. T. (2007). Population genomics of human gene
expression. Nature Genetics 39 1217–1224.
Thorgeirsson, T. E., et al. (2010). Sequence variants at CHRNB3-CHRNA6 and
CYP2A6 affect smoking behavior. Nature Genetics 42 448–453.
Thorisson, G. A., Smith, A. V., Krishnan, L. and Stein, L. D. (2005). The Interna-
32 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
tional HapMap Project Web site. Genome Research 15 1592–1593.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
Ser. B 58 267–288.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical
Processes: With Applications to Statistics. Springer, New York.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices.
In Compressed Sensing: Theory and Applications (Y. Eldar and G. Kutyniok, eds.)
210–268. Cambridge University Press, Cambridge.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Ann. Statist. 38 2109–2144.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Ann. Statist. 36 1509–1533.
Submitted to the Annals of Statistics
SUPPLEMENT TO “ARE DISCOVERIES SPURIOUS?
DISTRIBUTIONS OF MAXIMUM SPURIOUS
CORRELATIONS AND THEIR APPLICATIONS”
By Jianqing Fank,∗∗, Qi-Man Shao†† and Wen-Xin Zhouk,‡‡
Princeton Universityk, Academy of Mathematics and Systems Science∗∗,
Chinese University of Hong Kong††
and University of California, San Diego‡‡
This supplemental material contains additional proofs for all the
remaining theoretical results in the main text, including Lemmas 7.2–
7.6, Theorems 3.2,4.1 and 4.2, and Propositions 3.1 and 3.2. A dis-
cussion on the moment assumptions is also included.
APPENDIX A: PROOF OF LEMMAS 7.2–7.6
Here we prove Lemmas 7.2–7.6 in Fan, Shao and Zhou (2015).
A.1. Proof of Lemma 7.2.For every α∈ V, recall that αΣ=α/|α|Σ
with |α|Σ= (αTΣα)1/2. Then, we have
αTb
Σα
αTΣα=n−1
n
X
i=1
(αT
ΣXi)2−(αT
Σ¯
Xn)2.
In view of this identity, we define
Dn,1= sup
α∈V n−1
n
X
i=1
(αT
ΣXi)2−1, Dn,2= sup
α∈V
(αT
Σ¯
Xn)2,
such that Dn≤Dn,1+Dn,2. In what follows, we bound the two terms Dn,1
and Dn,2respectively.
Let Gbe a class of functions Rp7→ Rgiven by G={x7→ gα(x) =
hαΣ,xi2:α∈ V}, and denote by PXthe probability measure on Rpinduced
by X. In this notation, we have Dn,1= supg∈G |n−1Pn
i=1 g(Xi)−PXg|. To
bound Dn,1, we follow a standard procedure: first we show concentration of
Dn,1around its expectation EDn,1, and then upper bound the expectation.
To prove concentration, applying Theorem 4 in Adamczak (2008) implies
that there exists an absolute constant C > 0 such that, for every t > 0,
(A.1) Dn,1≤2EDn,1+ max 2σG
√t
n, C t
n
max
1≤i≤nsup
g∈G |g(Xi)|
ψ1
33
34 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
holds with probability at least 1−4e−t, where σ2
G:= supg∈G Pn
i=1 PXig2. Un-
der Condition 2.1, it follows from the fact |Σ1/2αΣ|2= 1 and the definition
of k·kψ2that
σ2
G≤nsup
α∈V
E{(Σ1/2αΣ)TU}4≤16nsup
α∈Sp−1kαTUk4
ψ2≤16K4
1n,(A.2)
which further leads to σG≤4K2
1√nfor K1as in Condition 2.1. In the last
term of (A.1), note that
sup
g∈G |g(Xi)|= sup
α∈V
(αT
ΣXi)2=sup
α∈V
αT
ΣXi2
.
For every ∈(0, γ−1/2
s), a standard argument can be used to prove that
there exists an -net Nof Vsuch that d=|N|≤{(2 + )ep/(s)}sand
(A.3) sup
α∈V
αT
ΣXi≤(1 −γs)−1max
α∈N
αT
ΣXi.
See, for example, the proof of (A.14) below. In particular, under Condi-
tion 2.1, using Lemma 2.2.2 in van der Vaart and Wellner (1996) implies by
taking s= (4γs)−1and N=Nsthat
max
1≤i≤nsup
g∈G |g(Xi)|
ψ1
=
max
1≤i≤nsup
α∈V{(Σ1/2αΣ)TUi}2
ψ1
.
max
1≤i≤nmax
α∈N{(Σ1/2αΣ)TUi}2
ψ1
.{slog(γsep/s)∨log n}sup
α∈Sp−1kαTUik2
ψ2
.K2
1cn(s, p),(A.4)
where cn(s, p) = slog(γsep/s)∨log n. Consequently, combining (A.1), (A.2)
and (A.4) yields, with probability at least 1 −4e−t,
(A.5) Dn,1≤2EDn,1+CK2
1max rt
n, cn(s, p)t
n.
To bound the expectation EDn,1, we use a result that involves the generic
chaining complexity, γm(T , d), of a semi-metric space (T, d). We refer to Ta-
lagrand (2005) for a systematic introduction. A tight upper bound for EDn,1
can be obtained by a direct application of Theorem A in Mendelson (2010).
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 35
To this end, note that supα∈V kαT
ΣXikψ1= supα∈V k(Σ1/2αΣ)TUikψ1≤K1
and for every α,α0∈ V,
k(αΣ−α0
Σ)TXikψ2=
{Σ1/2(αΣ−α0
Σ)}TUi
ψ2≤K1|αΣ−α0
Σ|Σ.
Successively, it follows from Theorem A in Mendelson (2010) and Theo-
rems 1.3.6, 2.1.1 in Talagrand (2005) that
EDn,1.K2
1γ2(F,|·|Σ)
√n+γ2
2(F,|·|Σ)
n.K2
1M(s, p)
√n+M2(s, p)
n,
(A.6)
where M(s, p) := E(supα∈V αT
ΣZ) with Zd
=N(0,Σ). In addition, a similar
argument to that leading to (A.3) can be used to show that
M(s, p)≤4
3Emax
α∈N αT
ΣZ≤2plog(|N|).pslog(γsep/s).(A.7)
Next we study Dn,2. Observe that pDn,2= supα∈V |n−1Pn
i=1 αT
ΣXi|.
Again, we use a concentration inequality due to Adamczak (2008). Theo-
rem 4 there implies that, for every t≥0,
(A.8) pDn,2≤2EpDn,2+ max 2σV
√t
n, C t
n
max
1≤i≤nsup
α∈V |αT
ΣXi|
ψ1
with probability at least 1 −4e−t, where σ2
V= supα∈V Pn
i=1 E(αT
ΣXi)2=n.
Under Condition 2.1, supα∈V kαT
ΣXikψ1≤supα∈V kαT
ΣXikψ2≤K1. Recall
that s= (4γs)−1, a similar argument to that leading to (A.4) gives
max
1≤i≤nsup
α∈V |αT
ΣXi|
ψ1
.
max
1≤i≤nsup
α∈N (Σ1/2αΣ)TUi
ψ2
.c1/2
n(s, p) sup
α∈Sp−1kαTUikψ2.K1c1/2
n(s, p).(A.9)
For the expectation EpDn,2, it follows from (A.3) with s= (4γs)−1that
EpDn,2=Esup
α∈V
αT
Σ¯
Xn≤4
3Emax
α∈N αT
Σ¯
Xn.
For α∈ V and t > 0, a direct consequence of (7.2) is that P(|αT
Σ¯
Xn| ≥ t)≤
2 exp(−cHnt2/K2
1). This, together with Lemma 7.4 and the previous display
implies
(A.10) EpDn,2.K1p(s/n) log(γsep/s).
36 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Together, (A.5)–(A.10) completes the proof of (7.3).
Finally, to prove (7.4), note that |bσ2
ε−1| ≤ |n−1Pn
i=1 ε2
i−1|+ ¯ε2
n. For
t1, t2≥0, applying (7.2) and (7.1) gives P(|¯εn| ≥ t1)≤2 exp(−cHnt2
1/K2
0)
and P(|n−1Pn
i=1 ε2
i−1| ≥ t2)≤2 exp{−cBnmin(t2
2/A2, t2/A)}, respectively,
where A=kε2−1kψ1≤2kε2kψ1≤4kεk2
ψ2= 4K2
0. Consequently, taking
t1=K0n−1/2√tand t2= 4K2
0max(n−1/2√t, n−1t) proves (7.4).
A.2. Proof of Lemma 7.3. By (2.5), we have R∗(s, p) = supα∈V αT
ΣZ
and for every α∈ V,E(αT
ΣZ) = 0 and E(αT
ΣZ)2= 1. Consequently, in
view of (A.7), inequalities (7.5) and (7.6) follow from Borell’s inequality
(Proposition A.2.1 in van der Vaart and Wellner (1996)) and Lemma A.1 of
Chernozhukov, Chetverikov and Kato (2014a), respectively.
A.3. Proof of Lemma 7.4. Put B= max1≤j≤mbj. For any T > 0, we
have
Emax
1≤j≤m|Xj|=Z∞
0
Pmax
1≤j≤p|Xj|> tdt
≤T+Z∞
T
Pmax
1≤j≤m|Xj|> tdt
≤T+a
m
X
j=1 pbj·Z∞
T/√bj
exp(−t2/2) dt
≤T+a√Bm min pπ/2,√B/T exp{−T2/(2B)}.
In particular, this implies by taking T=p2Blog(am)≥√2Bthat
Emax
1≤j≤m|Xj|≤√Bp2 log(am) + {2 log(am)}−1/2
≤√2 + 1
√2 log 4 pBlog(am).
A completely analogous argument will lead to the desired bound under the
condition that P(|Xj| ≥ t)≤aexp(−t/cj) for all t > 0 and j= 1, . . . , m.
A.4. Proof of Lemma 7.5. Recall that Zd
=N(0,Σ) and write Wn=
n−1/2Pn
i=1 yiwith yi=εiXi, such that Ln= supα∈V hαΣ,Wnifor Lnas
in (7.8).
To prove (7.9), a new coupling inequality for maxima of sums of random
vectors in Chernozhukov, Chetverikov and Kato (2014a) plays an impor-
tant role in our analysis. We divide the proof into three steps. First we
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 37
discretize the index space V=V(s, p) using a net, Vε, via a standard cov-
ering argument. Then we apply the aforementioned coupling inequality to
the discretized process, and finish the proof based on the concentration and
anti-concentration inequalities for Gaussian processes.
Step 1: Discretization. The goal is to establish (A.14), which approxi-
mates the supremum over an infinite index space Vby the maximum over
its -net V.
Let Rpbe equipped with the Euclidean metric ρ(x,y) = |x−y|2for x,y∈
Rp. Subsequently, the induced metric on the space of all linear functions
x7→ fα(x) = hα,xiis defined as ρ(fα, fβ) = supx∈Sp−1|fα(x)−fβ(x)|=
supx∈Sp−1|hα−β,xi| =|α−β|2. For every ∈(0,1), denote by N(V, ρ, )
the -covering number of (V, ρ). For the unit Euclidean sphere Sp−1equipped
with the Euclidean metric ρ, it is well-known that N(Sp−1, ρ, )≤(1+ 2/)p.
Together with the decomposition
(A.11) α∈Sp−1:|α|0=s=[
S⊆[p]:|S|=sα∈Sp−1: supp(α) = S
and the binomial coefficient bound p
s≤(ep/s)s, this yields
(A.12) N(V, ρ, )≤p
s(1 + 2/)s≤ {(2 + )ep/(s)}s.
For ∈(0,1) and S⊆[p] fixed, let NS, be an -net of the unit ball in
(RS, ρ) with |NS,| ≤ (1 + 2/)s. Thus the function class N:= ∪S⊆[p]NS, =
{x7→ hα,xi:α∈ NS,, S ⊆[p]}forms an -net of (V, ρ). Denote by d=
d=|N|the cardinality of N. Then it is easy to see that d≤p
s(1 + 2/)s
and hence log d.slog{ep/(s)}.
For every α∈ V with supp(α) = S, there exists some α0∈ NS, satisfying
that supp(α0) = supp(α) and |α−α0|2≤. Further, note that
|αΣ−α0
Σ|2
Σ= 2 −2hα,Σα0i
|α|Σ|α0|Σ
=hα−α0,Σ(α−α0)i − (|α|Σ− |α0|Σ)2
|α|Σ|α0|Σ≤γ2
s|α−α0|2
2,(A.13)
from which we obtain
hαΣ,Wni=αΣ−α0
Σ,Wn+α0
Σ,Wn
=|αΣ−α0
Σ|Σh(αΣ−α0
Σ)/|αΣ−α0
Σ|2,Wni
|(αΣ−α0
Σ)/|αΣ−α0
Σ|2|Σ
+α0
Σ,Wn
≤γssup
α∈V hαΣ,Wni+α0
Σ,Wn
38 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
for γsas in (2.4), and hence
sup
α∈Sp−1: supp(α)=ShαΣ,Wni ≤ γssup
α∈V hαΣ,Wni+ max
α∈NS, hαΣ,Wni.
Taking maximum over S⊆[p] with |S|=son both sides yields
sup
α∈V hαΣ,Wni ≤ γssup
α∈V hαΣ,Wni+ max
α∈NhαΣ,Wni.
Therefore, as long as ∈(0, γ−1
s),
(A.14) max
α∈NhαΣ,Wni ≤ Ln≤(1 −γs)−1max
α∈NhαΣ,Wni.
Step 2: Coupling. This aims to carry the Gaussian approximation over
the discrete index set Vand to establish (A.15) or its more explicit bound
(A.20).
Write V={αj:j= 1, . . . , d}and let V1,...,Vnbe i.i.d. d-variate ran-
dom vectors such that Vi= (Vi1, . . . , Vid)T, where Vij =αT
j,Σyisatisfies
that E(Vij) = 0 and E(V2
ij) = 1. Define the d-variate Gaussian random vec-
tor G= (G1, . . . , Gd)T, where Gj=αT
j,ΣZ=αT
jZ/|αj|Σfor j= 1, . . . , d.
Note that, for each 1 ≤j6=`≤d,E(V1jV1`) = E(GjG`). By Corollary 4.1 of
Chernozhukov, Chetverikov and Kato (2014a), there exists a random vari-
able T∗
d
= max1≤j≤dGjsuch that, for every δ > 0,
Pmax
1≤j≤dn−1/2
n
X
i=1
Vij −T∗
≥16δ
.B1
log(dn)
δ2n+B2{log(dn)}2
δ3n3/2+B3{log(dn)}3
δ4n2+log n
n,(A.15)
where
B1=Emax
1≤j,`≤d
n
X
i=1 VijVi` −EVij Vi`,
B2=Emax
1≤j≤d
n
X
i=1 |Vij|3, B3=
n
X
i=1
Emax
1≤j≤dV4
ij.
In what follows, we bound the three terms B1–B3respectively.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 39
First, by Lemma 2.2.2 in van der Vaart and Wellner (1996) we have
Emax
1≤j≤dV4
ij=E(ε4
i)·Emax
1≤j≤d(αT
j,ΣXi)4
≤16E(ε4
i)·
max
1≤j≤d(Σ1/2αj,Σ)TUi
4
ψ2
.v4(log d)2·sup
α∈Sp−1kαTUik4
ψ2
.v4K4
1(log d)2
for v4=E(ε4) as in Condition 2.1, leading to
(A.16) B3.v4K4
1n[slog{ep/(s)}]2.
For B2, we apply Lemma 9 in Chernozhukov, Chetverikov and Kato (2015)
to obtain
B2.max
1≤j≤d
n
X
i=1
E|Vij|3+ (log d)·Emax
1≤i≤nmax
1≤j≤d|Vij|3.
For every integer q≥1, by the definition of the k·kψ2norm we have
E|Vij|q=vqE(Σ1/2αj,Σ)TUiq≤qq/2vq
(Σ1/2αj,Σ)TUi
q
ψ2≤qq/2vqKq
1,
and once again, it follows from Lemma 2.2.2 in van der Vaart and Wellner
(1996) that
Emax
1≤i≤nmax
1≤j≤d|Vij|q≤qq
max
1≤i≤nmax
1≤j≤d|εi|·|αT
j,ΣXi|
q
ψ1
.qq{log(dn)}qmax
1≤j≤d
ε·(Σ1/2αj,Σ)TU
q
ψ1
.qq{log(dn)}qkεkq
ψ2max
1≤j≤d
(Σ1/2αj,Σ)TU
q
ψ2
.qq(K0K1)q{log(dn)}q.
The last three displays together imply by taking q= 3 that
(A.17) B2.v3K3
1n+ (K0K1)3[slog{ep/(s)}]4.
Turning to B1, a direct consequence of Lemma 1 in Chernozhukov, Chetverikov
and Kato (2015) is that
B1.(log d)1/2max
1≤j≤dn
X
i=1
EV4
ij1/2
+ (log d)·Emax
1≤i≤n,1≤j≤dV4
ij1/2
.v1/2
4K2
1[ns log{ep/(s)}]1/2+ (K0K1)2[slog{ep/(s)} ∨ log n]3.(A.18)
40 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Putting (A.15)–(A.18) together, we obtain that for every δ > 0 and ∈
(0,1),
Pmax
α∈N
αT
ΣWn−T∗
≥16δ
.v1/2
4K2
1
c3/2
n(s, p, )
δ2√n+v3K3
1
c2
n(s, p, )
δ3√n+v4K4
1
c5
n(s, p, )
δ4n
+ (K0K1)2c4
n(s, p, )
δ2n+ (K0K1)3c6
n(s, p, )
δ3n3/2+log n
n,(A.19)
where cn(s, p, ) := slog{ep/(s)} ∨ log n. Because this upper bound is only
meaningful when it is less than 1, it can be further reduced to
bn(s, p, , δ) := (K0K1)3c2
n(s, p, )
δ3√n+ (K0K1)4c5
n(s, p, )
δ4n
(A.20)
for ∈(0,1) and δ∈(0, K0K1].
Step 3. For every ∈(0,1), put s=γs. A similar argument to that
leading to (A.14) now gives
max
α∈N
αT
ΣZ≤sup
α∈V
αT
ΣZ≤ssup
α∈V
αT
ΣZ+ max
α∈N
αT
ΣZ.
Further, it is concluded from (7.5) and (A.30) in the proof of Lemma 7.6
that, with probability at least 1 −Cn−1,
sup
α∈V
αT
ΣZ−max
α∈N
αT
ΣZ.c1/2
n(s, p)s
(A.21)
with cn(s, p) = slog(γsep/s)∨log n, and
sup
α∈V
αT
ΣWn−max
α∈N
αT
ΣWn
.c1/2
n(s, p) + K0K1n−1/2c2
n(s, p)s.(A.22)
For the Gaussian maxima supα∈V hαΣ,Ziand maxα∈NhαΣ,Zi, it follows
from (A.21) that for any Borel subset Bof R,
PT∗
∈ B≤Psup
α∈V
αT
ΣZ∈ BCsc1/2
n(s,p)+n−1,
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 41
where Bu:= {x∈R:|x−y| ≤ u, ∀y∈ B} for u > 0. This, together
with Lemma 4.1 in Chernozhukov, Chetverikov and Kato (2014a), a vari-
ant of Strassen’s theorem, implies that there exits a random variable T∗d
=
supα∈V hαΣ,Zisuch that
(A.23) PT∗−T∗
> C sc1/2
n(s, p)≤n−1.
Finally, assembling (A.19)–(A.22) completes the proof of (7.9) by taking
= (γsn)−1.
A.5. Proof of Lemma 7.6. Let Sn=n−1Pn
i=1 XiXT
iand write
(A.24) Dn= sup
α∈V αT
Σb
ΣnαΣ−1= sup
α∈V αT
ΣSnαΣ−1−αT
Σ¯
Xn2.
For ease of exposition, define bσ2
ε=n−1Pn
i=1(εi−¯εn)2and for α∈Rp, let
(A.25) bσ2
α=bσ2
α(Σ) = n−1
n
X
i=1
(αT
ΣXi)2−(αT
Σ¯
Xn)2.
In this notation, Dn= supα∈V |bσ2
α−1|and
b
Ln= sup
α∈V
(bσεbσα)−1n−1/2
n
X
i=1
εi·αT
ΣXi−√n¯εnαT
Σ¯
Xn.
Comparing this with Lnin (7.8), it is easy to see that
(A.26) |b
Ln−Ln| ≤ sup
α∈V (bσεbσα)−1−1·Ln+√nbσ−1
ε|¯εn| · sup
α∈V bσ−1
α|αT
Σ¯
Xn|.
In what follows, we bound the two terms on the right-hand side of (A.26)
separately, starting with the first one.
For every t > 0, let EX(t) and Eε(t) be the events that (7.3) and (7.4)
hold, respectively. In particular, taking t1=A1log nfor A1>0 and t2=
min[log n, {n/cn(s, p)}1/2] yields P{Eε(t1)c} ≤ 2n−cHA1+ 2n−cBA1and
PEX(t2)c≤8 exp − {n/cn(s, p)}1/2≤3n−1/2c1/2
n(s, p),
where cH, cB>0 are as in Lemma 7.1. Here, the last step comes from the
inequality supt≥0te−t≤e−1. On the event Eε(t1)∩ EX(t2),
bσ2
ε−1.K2
0rlog n
n≤1
2, Dn.K2
1rcn(s, p)
n≤1
2
(A.27)
42 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
whenever the sample size nsatisfies n&max{K4
0log n, K4
1cn(s, p)}. To-
gether, (A.26), (A.27) and the identity
1−(bσεbσα)−1
= (bσεbσα)−1(bσε−1)(bσα−1) + bσε−1 + bσα−1
=(bσ2
ε−1)(bσ2
α−1) + (bσ2
ε−1)(bσα+ 1) + (bσ2
α−1)(bσε+ 1)
bσεbσα(bσε+ 1)(bσα+ 1)
imply, on Eε(t1)∩ EX(t2) with nsufficiently large,
sup
α∈V (bσεbσα)−1−1.(K0∨K1)2n−1/2c1/2
n(s, p).(A.28)
Next we deal with Ln, which can be written as supα∈V n−1/2Pn
i=1 αT
Σyi,
where yi=εiXisatisfies that, under Condition 2.1,
(A.29) E(αT
Σyi)2= 1 for all α∈ V and sup
α∈V kαT
Σyikψ1≤2K0K1.
As in the proof of (A.8), using Theorem 4 in Adamczak (2008) gives, for
any t≥0,
(A.30) Ln≤2ELn+ max 2√t, C t
√n
max
1≤i≤nsup
α∈V |αT
Σyi|
ψ1
holds with probability at least 1−4e−t. For the last term of (A.30), a similar
argument to that leading to (A.8) gives, on this occasion with s= (4γs)−1
and N=Nsthat
max
1≤i≤nsup
α∈V |αT
Σyi|
ψ1
=
max
1≤i≤nsup
α∈V εi·(Σ1/2αΣ)TUi
ψ1
.
max
1≤i≤nsup
α∈N εi·(Σ1/2αΣ)TUi
ψ1
.{slog(γsp/s) + log n}kεkψ2sup
α∈Sp−1kαTUikψ2
.K0K1cn(s, p).
Here, we used the property that the cardinality of the s-net Nof V, denoted
by d=|N|, is such that log d.slog(γsep/s).
To bound ELn, observe that for every u∈Rp, supα∈V |αT
Σu|= supα∈V αT
Σu.
For Wn=n−1/2Pn
i=1 yi, it follows from (A.14) with s= (4γs)−1that
ELn=Esup
α∈V
αT
ΣWn≤4
3Emax
α∈N αT
ΣWn.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 43
For every α∈ V and t > 0, from (7.1) and (A.29) we get
PαT
ΣWn≥t≤2 exp −cBmin t2
4K2
0K2
1
,√nt
2K0K1.(A.31)
Hence, applying Lemma 7.3 with slight modification gives
Emax
α∈N αT
ΣWn.K0K1plog d+K0K1n−1/2log d.
Plugging this into (A.30) and taking t= min[log n, {n/cn(s, p)}1/2] imply,
with probability at least 1 −4 exp[−{n/cn(s, p)}1/2]≥1−2n−1/2c1/2
n(s, p),
(A.32) Ln.K0K1c1/2
n(s, p)
whenever the sample size nsatisfies n&cn(s, p).
Again, on the event Eε(t1)∩ EX(t2) with nsufficiently large as above for
(A.27), the second term on the right-hand side of (A.26) is bounded by some
multiple of √nbσ−1
ε|¯εn|pDn,2, where Dn,2is as in (A.8). Arguments similar
to those in the proof of Lemma 7.2 permit us to show that, with probability
at least 1 −2n−1/2c1/2
n(s, p),
(A.33) pDn,2.K1n−1/2c1/2
n(s, p).
Further, put Sn=Pn
i=1 εi, V 2
n=Pn
i=1 ε2
i, such that √nbσ−1
ε¯εn=V−1
nSn{1−
n−1(Sn/Vn)2}1/2. Then it follows from Theorems 2.16 and 2.19 in de la Pe˜na,
Lai and Shao (2009) that for every t∈(0,√n]
P√nbσ−1
ε|¯εn| ≥ t
≤P|Sn| ≥ t(1 + t2n−1)−1/2Vn
≤P|Sn| ≥ t(1 + t2n−1)−1/2(4√2 + 1)−1(4√n+Vn)+PV2
n≤n/2
≤4 exp(−cSNt2) + exp{−n/(8v4)},
where v4=E(ε4) and cSN >0 is an absolute constant. In particular, taking
t=A2(log n)1/2with A2>0 yields, with probability greater than 1 −
4n−cSNA2
2−exp{−n/(8v4)},
(A.34) √nbσ−1
ε|¯εn|.plog n.
Finally, combing (A.26), (A.28), (A.32), (A.33) and (A.34) completes the
proof of (7.10).
44 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
APPENDIX B: PROOF OF THEOREM 3.2
We divide the proof into three key steps. The first step is to establish
(B.1) using the results on discretization in the proof of Theorem 3.1, and
then analyze separately the order of the stochastic terms (B.2) and (B.3).
Step 1. For any α∈Rp, let kαk2
n=αTb
Σαwith b
Σ=b
Σn, and let γs=
γs(Σ) be as in (2.4). First, we prove that there exits a (γsn)−1-net of V,
denoted by Vn=Vn(s, p), such that log(|Vn|).sbn(s, p) and
supt≥0Psupα∈V hαΣ,Zi ≤ t−Psupα∈V hαn,Zni ≤ tXn
.hbγs(γsn)−1s¯
bn(s, p)+∆1/3
ns¯
bn(s, p) + log(1/∆n)2/3i,(B.1)
where αn=α/kαkn,Xn={Xi}n
i=1,bn(s, p) = log(γsp/s)∨log n,¯
bn(s, p) =
log(¯γsp/s)∨log nwith ¯γs= max(γs,bγs), and
(B.2) bγs:= γs(b
Σ) = maxu∈Sp−1:1≤|u|0≤skukn
minu∈Sp−1:1≤|u|0≤skukn
denotes the s-sparse condition number of b
Σand
(B.3) ∆n= ∆n(s, p) = max
α,β∈VnαT
ΣΣβΣ−αT
nb
Σβn
with αn=α/kαknand βn=β/kβkn.
Proof of (B.1). As in the proof of Lemma 7.5 in Appendix A.4, for every ∈
(0,1), there exists an -net Nof Vsatisfying d=|N|≤{(2 + )ep/(s)}s,
such that
sup
α∈V hαΣ,Zi − max
α∈NhαΣ,Zi≤γssup
α∈V hαΣ,Zi,(B.4)
sup
α∈V hαn,Zni − max
α∈Nhαn,Zni≤bγssup
α∈V hαn,Zni.(B.5)
For notational convenience, write d=d,N={α1,...,αd}and let
G= (G1, . . . , Gd)T= (hα1,Σ,Zi,...,hαd,Σ,Zi)T,
Gn= (Gn1, . . . , Gnd)T= (hα1,n,Zni,...,hαd,n,Zni)T
be two d-dimensional centered Gaussian random vectors. Conditional on Xn,
applying Theorem 2 in Chernozhukov, Chetverikov and Kato (2015) to G
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 45
and Gnrespectively gives
sup
t∈RPmax
1≤j≤dGj≤t−Pmax
1≤j≤dGnj ≤tXn
.(∆ log d)1/3log d+ log(1/∆)1/3,
where ∆ = ∆(N) = maxα,β∈NαT
ΣΣβΣ−αT
nb
Σβn.
By Lemma 7.3, we have for every t > 0,
Psup
α∈V hαΣ,Zi ≥ Cpslog(γsep/s) + t≤e−t2/2,
Psup
α∈V hαn,Zni ≥ Cpslog(bγsep/s) + tXn≤e−t2/2.
The last three displays, together with (B.4) imply that, for every ∈(0,1)
and t > 0,
Psup
α∈V hαn,Zni ≤ tXn
≤Pmax
α∈Nhαn,Zni ≤ tXn
≤Pmax
α∈NhαΣ,Zi ≤ t+C∆1/3(log d)1/3{log(d/∆)}1/3
≤Psup
α∈V hαΣ,Zi ≤ t+Cγsc1/2
n(s, p)
+C∆1/3(log d)1/3{log(d/∆)}1/3+n−1
≤Psup
α∈V hαΣ,Zi ≤ t+Cγscn(s, p)
+C∆1/3(log d)1/3{log(d/∆)}1/3+n−1,
where cn(s, p) = slog(γsep/s)∨log nand the last inequality comes from
(7.6). For the lower bound, in view of (B.5), it can be similarly obtained
that
Psup
α∈V hαn,Zni ≤ tXn
≥Pmax
α∈Nhαn,Zni ≤ t−Cbγspslog(bγsep/s)∨log nXn−n−1
≥Psup
α∈V hαΣ,Zi ≤ t−Cbγs{slog(bγsep/s)∨log n}
−C∆1/2(log d)1/3{log(d/∆)}1/3−n−1.
46 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Taking Vn=Nwith = (γsn)−1proves (B.1).
Step 2. Next, we study ∆nin (B.3), which is bounded by
max
α,β∈VnαT
Σ(b
Σ−Σ)βΣ+ max
α,β∈VnαT
nb
Σβn−αT
Σb
ΣβΣ:= ∆n,1+ ∆n,2.
(B.6)
In what follows, we bound the two terms on the right side separately, starting
with ∆n,2. For every α,β∈ Vn,
αT
nb
Σβn−αT
Σb
ΣβΣ
=αT
nb
Σ(βn−βΣ) + αT
nb
ΣβΣ−αT
Σb
ΣβΣ
≤ kαnknkβn−βΣkn+kβΣ−βnknkαΣ−αnkn+kβnknkαn−αΣkn
≤kαΣkn−1+kβΣkn−1+kαΣkn−1·kβΣkn−1.
Using this together with Lemma 7.2 yields, with probability at least 1 −
3n−1/2c1/2
n(s, p),
(B.7) ∆n,2.K2
1n−1/2c1/2
n(s, p).
Turning to ∆n,1, it suffices to focus on
(B.8)
max
α,β∈VnαT
Σ(Sn−Σ)βΣ= max
α,β∈Vnn−1
n
X
i=1 (αT
ΣXi)(βT
ΣXi)−αT
ΣΣβΣ.
Applying Theorem 4 in Adamczak (2008) we obtain that, with probability
at least 1 −4e−t,
max
α,β∈VnαT
Σ(Sn−Σ)βΣ
≤2Emax
α,β∈VnαT
Σ(Sn−Σ)βΣ
+ max 2σVn
√t
n+Ct
n
max
1≤i≤nmax
α,β∈Vn
αT
ΣXiXT
iβΣ
ψ1,(B.9)
where similarly to (A.2) and (A.4),
σ2
Vn:= max
α,β∈Vn
n
X
i=1
E(αT
ΣXiβT
ΣXi)2
≤max
α,β∈Vn
n
X
i=1 E(αT
ΣXi)41/2E(βT
ΣXi)41/2≤16K4
1n(B.10)
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 47
and
(B.11)
max
1≤i≤nmax
α,β∈Vn
αT
ΣXiXT
iβΣ
ψ1
.K2
1cn(s, p).
From the moment inequality kαT
ΣXiβT
ΣXikψ1≤2K2
1, another consequence
of (7.1) is that
Pn−1
n
X
i=1 (αT
ΣXi)(βT
ΣXi)−αT
ΣΣβΣ≥t
≤2 exp{−cBnmin(t2/K4
1, t/K2
1)}.
Using this together with Lemma 7.4 we get
(B.12) Emax
α,β∈VnαT
Σ(Sn−Σ)βΣ.K2
1rlog(|Vn|)
n+K2
1
log(|Vn|)
n.
Consequently, combining (B.8)–(B.12) gives, with probability at least 1 −
2n−1/2c1/2
n(s, p),
(B.13) ∆n,1.K2
1n−1/2{sbn(s, p)}1/2.
Together, (B.6), (B.7) and (B.13) imply that, with probability at least 1 −
5n−1/2c1/2
n(s, p),
(B.14) ∆n.K2
1n−1/2{sbn(s, p)}1/2.
Step 3. Finally, we study the sample s-sparse condition number bγsin (B.2).
For every u∈Rp, note that (kukn/|u|Σ)2=uT
Σb
ΣuΣ= 1 + uT
Σb
ΣuΣ−1. For
every ∈(0,1), in view of the inequality Ps
j=1 p
j≤(ep/s)sthat holds for
all 1 ≤s≤p, there exists an -net of {x7→ hu,xi:u∈Sp−1,1≤ |u|0≤s}
with its cardinality bounded by {(2 + )ep/(s)}s. Consequently, it follows
from Lemma 7.2 and the previous display that, with probability at least
1−Cn−1/2c1/2
n(s, p),
1
2≤kukn/|u|Σ2≤3
2for all u∈Sp−1satisfying 1 ≤ |u|0≤s(B.15)
and hence, bγs≤3γswhenever nsatisfies n&K4
1cn(s, p).
Assembling (B.1), (B.14) and (B.15) completes the proof of Theorem 3.2.
48 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
APPENDIX C: DISCUSSION ON THE MOMENT ASSUMPTIONS
As pointed out in Remark 3.3, the sub-exponential rate, i.e. log pnc
with some c∈(0,1), requires a sub-Gaussian condition on the sampling
distribution. In the following, we will discuss the main steps on how our
analysis can be carried over under finite moment conditions, at the cost of
imposing more stringent constraints on the dimension pas a function of the
sample size n.
Note that, inequality (A.15) in the proof of Lemma 7.5 holds with B1–B3
well-defined as long as the fourth moments of all coordinates of εXare finite.
From the proof of Lemma 7.6 we see that the main difficulty comes from
bounding
Ln=Ln(s, p) = sup
α∈V
1
√n
n
X
i=1 hαΣ, εiXii
with αΣ=α/|Σ1/2α|2and
sup
α∈V |αTb
Σα−αTΣα|,
where V=V(s, p) = {α∈Sp−1:|α|0=s}for 1 ≤s≤p. Without loss of
generality, we let b
Σ=b
Σn=n−1Pn
i=1 XiXT
i, where X1,...,Xnare i.i.d.
copies of a random vector X∈Rp.
Deviation bounds for the above two terms are given in the following lem-
mas. Comparing (C.1) and (C.3) with (A.32) and (7.3), respectively, we see
that the consistency of normal approximation requires significantly more
stringent condition on the dimension punder finite fourth moment assump-
tions. In this case, the convergence in Kolmogorov distance
sup
t≥0P√nb
Rn(s, p)≤t−PR∗(s, p)≤t
holds when slog(p) = o(log n) as n→ ∞. Because our main focus is on
characterizing spurious discoveries from variable selection methods for high-
dimensional data with low-dimensional structure, the above result becomes
less instructive and is not applicable to the statistical problems considered
in this paper. Nonetheless, the study of distributional approximation for
heavy-tailed data has its own interest and is also our ongoing work [Sun,
Zhou and Fan (2016)].
Condition C.1.The random variable εsatisfies Eε= 0,Eε2= 1 and
R0=Eε4<∞. There exists a random vector Usuch that X=Σ1/2U,
E(U) = 0,E(UUT) = Ipand R1= supu∈RpEhu,Ui4<∞.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 49
Lemma C.1.Assume that Condition C.1 holds. Then, for any δ∈(0,1),
Ln≤Cmax{(R0R1)1/4n−1/4,1}(5eγsp/s)s/4δ−1/4
(C.1)
with probability greater than 1−δ, where C > 0is an absolute constant.
Proof of Lemma C.1.As before, define Wn=n−1/2Pn
i=1 yiwith
yi=εiXifor i= 1, . . . , n. By (A.14), for any ∈(0, γ−1
s), there exists
a finite set N⊆ V such that |N| ≤ p
s(1 + 2/)sand
Ln≤(1 −γs)−1max
α∈N|hαΣ,Wni|.(C.2)
For every α∈ Nand t > 0, by Markov’s inequality and the Rosenthal
inequality we have
P(|hαΣ,Wni| > t)
≤t−4EhαΣ,Wni4
.1
t4n2n
X
i=1
EhαΣ,yii4+n
X
i=1
EhαΣ,yii22
.R0R1t−4n−1+t−4.
This, together with (C.2) with = (2γs)−1and the union bound implies that
for any δ∈(0,1), Ln.max{(R0R1)1/4n−1/4,1}δ−1/4with probability at
least 1 −(5eγsp/s)sδ. By a simple algebra, we obtain (C.1) as required.
Lemma C.2.Assume that Condition C.1 holds. Then, there exists some
absolute constants C > 0such that, for any δ∈(0,1),
sup
α∈V |αTb
Σα−αTΣα| ≤ C φmax(s)pR1s1/4r(8ep/s)s
δn
(C.3)
with probability greater than 1−δ, where φmax(s)is the s-sparse maximal
eigenvalue of Σ.
Proof of Lemma C.2.For every α∈ V, define
b
Q(α) = αTb
Σα=1
n
n
X
i=1
(αTXi)2and Q(α) = αTΣα.
Applying Chebyshev’s inequality to the quadratic form b
Q(α), we obtain
that for every δ > 0,
P|b
Q(α)−Q(α)| ≥ rVar(hα,Xi2)
δn ≤δ.(C.4)
50 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
In view of Proposition 6.2 in Catoni (2012), this upper bound is tight un-
der the finite fourth moment condition. Moreover, for any ∈(0,1/2),
it follows from Lemma 2 in the supplement to Wang, Berthet and Sam-
worth (2016) that there exists N⊆ V with cardinality at most π(1 −
2/16)−(s−1)/2√sp
s(2/)s−1such that
sup
α∈V |b
Q(α)−Q(α)| ≤ (1 −2)−1max
α∈N|b
Q(α)−Q(α)|.(C.5)
Together, (C.4) and (C.5) with = 1/4 yield
Psup
α∈V |b
Q(α)−Q(α)| ≥ rsupα∈V Var(hα,Xi2)
δn .√s(8ep/s)sδ.
Combining this with the fact that
hα,Xi2=hΣ1/2α,Ui2=αTΣα·Σ1/2α
|Σ1/2α|2
,U2
proves (C.3).
APPENDIX D: PROOF OF THEOREM 4.1
Without loss of generality, we assume that σ2=E(ε2
i) = 1 and s≤n≤
p≤en. The dependence of b
Roracle
non pwill be assumed without displaying.
Let bεi=bεoracle
i,b
β= (b
βT
1,b
βT
2)T=b
βoracle and δ= (δT
1,δT
2)T=b
β−β∗∈
Rpwith δ1=b
β1−β1∈Rsand δ2=b
β2−β2=0. As in the proof of
Theorem 3.1, we first consider the following standardized version of b
Roracle
n:
Roracle
n= max
j∈[p]n−1
n
X
i=1 bεiXij,(D.1)
Recall that δ= (δT
1,δT
2)Twith δ2=0. For every j∈[p], from the identity
XT
1X1δ1=XT
1εwe derive that
(D.2) Σ11δ1=n−1XT
1ε+b1with b1:= −n−1XT
1X1−Σ11δ1.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 51
Together with (4.4) and some simple algebra, this implies
n−1
n
X
i=1 bεiXij
=n−1ej(p)TXTε−XTXδ
=n−1
n
X
i=1
ej(p)TPXiεi
−ej(p)T0s×1
Σ21Σ−1
11 b1+n−1XT
2X1−Σ21δ1,(D.3)
where ej(p) = (0,...,0,1,0,...,0)Tis the unit vector in Rpwith 1 on the
jth position and
P=0s×s0s×d
−Σ21Σ−1
11 Id∈Rp×p.(D.4)
In view of (D.3), we define
(D.5) e
Rn= max
j∈[p]n−1
n
X
i=1
ej(p)TPXiεi.
Together, (D.1), (D.2), (D.3) and (D.5) imply
Roracle
n−e
Rn
≤max
j∈[d]ej(d)TΣ21 Σ−1/2
11 n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −IsΣ1/2
11 δ1
+ max
j∈[p]\[s]n−1
n
X
i=1 XijXT
i,1Σ−1/2
11 −ej−s(d)TΣ21Σ−1/2
11 Σ1/2
11 δ1
≤max
j∈[d]Σ−1/2
11 Σ12 ej(d)2
n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −Is
Σ1/2
11 δ12
+√sΣ1/2
11 δ12max
j∈[p]\[s]
k∈[s]n−1
n
X
i=1
(Id −E)Xij ek(s)TΣ−1/2
11 Xi,1
:= Q1+Q2,(D.6)
where (Id −E)Y:= Y−EYfor any random variable Y.
With the above preparations, the rest of the proof involves three steps:
First, we prove the Gaussian approximation of √ne
Rnby the Gaussian
maximum e
R∗:= |e
Z|∞, where e
Zd
=N(0,Σ22.1). Second, we prove that
52 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
√n(Q1+Q2) is negligible with high probability and that √nb
Roracle
nand
√ne
Rnare close. Finally, we apply an anti-concentration argument to prove
the convergence in the Kolmogorov distance.
Step 1: Gaussian approximation. First we prove that, under Condi-
tion 4.1 in the main text, there exists a random variable e
T∗d
=e
R∗such that,
for every δ∈(0, K0K1],
P√ne
Rn−e
T∗≥16δ.(K0K1)3(log p)2
δ3√n+ (K0K1)4(log p)5
δ4n.(D.7)
By the definition of Pin (D.4), we have
√ne
Rn= max
j∈[p]\[s]n−1/2
n
X
i=1
ej(p)TPXiεi,
where [p]\[s] = {s+ 1, . . . , p}. In addition, write Xi= (XT
i,1,XT
i,2)Twith
Xi,1∈Rs,Xi,2∈Rdand define e
yi=εie
Xi, where e
Xi=Xi,2−Σ21Σ−1
11 Xi,1
are such that E(e
Xi) = 0and E(e
Xie
XT
i) = Σ22.1. In this notation, we can
rewrite √ne
Rnas
(D.8) √ne
Rn= max
j∈[d]n−1/2
n
X
i=1
ej(d)Te
yi.
Next, we use the coupling inequality (A.15) below with d=p−sto the
random vectors V1,...,Vnwhich, on this occasion, are defined by Vi=
(Vi1, . . . , Vid)Twith Vij =ej(d)Te
yi. Since E(εiXi) = 0 and E(ε2
i|Xi) = 1, we
have E(Vi) = 0and E(ViVT
i) = Σ22.1. Then there exists a random variable
e
T∗d
=|e
Z|∞such that, for every δ > 0,
P√ne
Rn−e
T∗≥16δ
.B1
log p
δ2n+B2
(log p)2
δ3n3/2+B3
(log p)3
δ4n2+log n
n.(D.9)
In addition, note that the random vectors Vi= (Vi1, . . . , Vid)Tare such that
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 53
E(V2
ij) = eσj j ≤1 and
max
j∈[d]
ej(d)Te
yi
ψ1
≤2K0max
j∈[d]
ej(d)Te
Xi
ψ2
≤2K0max
j∈[p]\[s]
ej(p)TPΣ1/2Ui
ψ2
≤2K0K1max
j∈[p]\[s]ej(p)TPΣPTej(d)1/2
= 2K0K1max
j∈[d]eσ1/2
jj ≤2K0K1.(D.10)
Consequently, similar arguments to those leading to (A.16), (A.17) and
(A.18) in Appendix A.4 can be used to derive that
B1.(K0K1)2pnlog p+ (log p)3,
B2.(K0K1)3n+ (log p)4, B3.(K0K1)4n(log p)2.
Plugging the above bounds for B1–B3into (D.9) proves (D.7).
Step 2. First we prove that √nQ1and √nQ2are negligible with high prob-
ability, starting with √nQ1. Since Σ22.1=Σ22 −Σ21 Σ−1
11 Σ12 is positive
definite,
(D.11) max
j∈[d]Σ−1/2
11 Σ12 ej(d)2≤max
j∈[d]ej(d)TΣ22 ej(d)1/2= 1.
Again, from the identity XT
1X1δ1=XT
1εwe find that
δT
1(n−1XT
1X1)δ1
=δT
1Σ1/2
11 n−1Σ−1/2
11 XT
1ε(D.12)
≤Σ1/2
11 δ12·n−1Σ−1/2
11 XT
1ε2
≤Σ1/2
11 δ12·√sn−1Σ−1/2
11 XT
1ε∞.
To bound the left-hand side of (D.12) from below, note that
δT
1(n−1XT
1X1)δ1
=δT
1Σ1/2
11 n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −IsΣ1/2
11 δ1+δT
1Σ11δ1
≥1−
n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −Is
δT
1Σ11δ1.(D.13)
54 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Recall that Xi= (XT
i,1,XT
i,2)T,X1Σ−1/2
11 = (Σ−1/2
11 X1,1,...,Σ−1/2
11 Xn,1)T∈
Rn×s. Under Condition 4.1,
sup
α∈Ss−1
αTΣ−1/2
11 Xi,1
ψ2= sup
α∈Ss−1
αTΣ−1/2
11 (Is,0)Σ1/2U
ψ2
≤K1sup
α∈Ss−1Σ1/2(Is,0)TΣ−1/2
11 α2
=K1sup
α∈Ss−1|α|2=K1,
which, together with Theorem 5.39 in Vershynin (2012) yields that, for every
t≥0,
(D.14)
n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −Is
.max(δ, δ2)
holds with probability at least 1−2 exp(−cBt2), where δ=K2
1n−1/2(√s+t).
By (D.12), (D.13) and taking t=c−1/2
Bplog(2n) in (D.14), we have with
probability at least 1 −n−1,
(D.15) 1
2δT
1Σ11δ1≤n−1XT
1X1δ1≤Σ1/2
11 δ12·√sn−1Σ−1/2
11 XT
1ε∞
whenever the sample size nsatisfies n&K4
1(s+ log n).
To bound the right-hand side of (D.15), we define ξij =ej(s)TΣ−1/2
11 Xi,1εi
such that |n−1Σ−1/2
11 XT
1ε|∞= maxj∈[s]|n−1Pn
i=1 ξij|. Under Condition (4.1),
we have E(ξij ) = 0, E(ξ2
ij) = 1 and
kξijkψ1≤2kεkψ2
ej(s)TΣ−1/2
11 (Is,0)Σ1/2U
ψ2
≤2K0K1ej(s)TΣ−1/2
11 (Is,0)Σ1/22= 2K0K1.
Using the union bound and inequality (7.1) in the main text implies that,
for every t > 0,
PnΣ−1/2
11 XT
1ε∞>2K0K1max √nt, to≤2sexp(−cBt).(D.16)
Taking respectively t=c−1/2
Bplog(2n) and t=c−1
Blog(2sn) in (D.14) and
(D.16) yields, with probability at least 1 −2n−1,
n−1Σ−1/2
11 XT
1X1Σ−1/2
11 −Is
Σ1/2
11 δ12.K0K3
1n−1slog n
whenever n&K4
1(s+ log n). Combining this with (D.11), we have with the
same probability,
(D.17) √nQ1.K0K3
1n−1/2slog n
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 55
whenever n&K4
1(s+ log n).
Turning to Q2, we define ξi,jk =Xij ek(s)TΣ−1/2
11 Xi,1such that
Q21 := max
j∈[p]\[s]
k∈[s]n−1
n
X
i=1
(Id −E)Xij ek(s)TΣ−1/2
11 Xi,1
= max
j∈[p]\[s]
k∈[s]n−1
n
X
i=1 ξi,jk −Eξi,jk
and Q2≤√s|Σ1/2
11 δ1|2Q21. To bound Q21, note that kξi,jk −Eξi,jk kψ1≤
2kξi,jk kψ1and
ξi,jk
ψ1≤2kXijkψ2
ek(s)TΣ−1/2
11 Xi,1
ψ2
= 2
ej(p)TΣ1/2U
ψ2
ek(s)TΣ−1/2
11 (Is,0)Σ1/2U
ψ2
≤2K2
1Σ1/2ej(p)2·Σ1/2(Is,0)TΣ−1/2
11 ek(s)2= 2K2
1.
Then using inequality (7.1) and the union bound again, we obtain that for
every t > 0,
PQ21 ≥4K2
1max pt/n, t/n≤2(p−s) exp(−cBt).
Taking t=c−1
Blog(2pn), we conclude from the bound on |Σ1/2
11 δ1|2estab-
lished earlier that, with probability at least 1 −3n−1,
(D.18) √nQ2.K0K3
1n−1/2slog p
whenever n&K4
1(s+ log n).
Putting (D.6), (D.17) and (D.18) together implies that, with probability
at least 1 −3n−1,
√nRoracle
n−√ne
Rn.K0K3
1n−1/2slog p(D.19)
whenever n&K4
1(s+ log n).
Next, we prove that √nb
Roracle
nand √ne
Rnare close with high probability.
To this end, set b
ε= (bε1,...,bεn)Tand define bσ2=n−1Pn
i=1(bεi−¯ε)2,bσ2
j=
n−1Pn
i=1(Xij −¯
Xj)2, where ¯ε=eT
nb
εand en= (1/n, . . . , 1/n)T∈Rn. In
this notation, we have
√nb
Roracle
n=bσ−1max
j∈[p]bσ−1
jn−1/2
n
X
i=1 bεiXij −√n¯ε¯
Xj.
56 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Combined with (D.1), this implies
√nb
Roracle
n−√nRoracle
n
≤max
j∈[p](bσbσj)−1−1·√nRoracle +√nbσ−1|¯ε|max
j∈[p]bσ−1
j|¯
Xj|.(D.20)
In view of (D.19), it suffices to show that the right-hand side of (D.20)
is negligible with high probability. The following lemma provides deviation
inequalities for the variance estimators bσ2and bσ2
jas well as the sample
means |¯ε|and |¯
Xj|. The proof is deferred to Section H.
Lemma D.1.Assume that Condition 4.1 in Fan, Shao and Zhou (2015)
holds. Then, with probability at least 1−Cn−1,
max
j∈[p]|¯
Xj|.K1rlog p
n,max
j∈[p]bσ2
j−1.K2
1rlog p
n+log p
n
(D.21)
and
|¯ε|.K0K2
1
slog n
n,bσ2−1.K2
0rlog n
n+ (K0K1)2slog n
n,(D.22)
provided that n&K4
1slog n.
In addition, for √ne
Rnin (D.8), it follows from the union bound, in-
equality (7.1) in the main text and (D.10) that, with probability at least
1−2dexp(−cBt), √ne
Rn.K0K1max(√t, n−1/2t). This implies by taking
t=c−1
Blog(2pn) that, with probability at least 1 −n−1,
√ne
Rn.K0K1plog p.(D.23)
Combining (D.19), (D.20) and (D.23), we conclude from Lemma D.1 that,
with probability at least 1 −Cn−1,
√nb
Roracle −√ne
Rn.(K0∨K1)2K0K1n−1/2slog p,(D.24)
provided n&(K0∨K1)4slog p.
Step 3. From (D.7) with δ= (K0K1)3/4min{1, n−1/8(log p)3/8}and (D.24),
we obtain that, with probability at least 1 −C(K0K1)3/4n−1/8(log p)7/8,
√nb
Roracle
n−e
T∗.(K0K1)3/4(log p)3/8
n1/8+ (K0∨K1)2K0K1
slog p
√n,
(D.25)
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 57
where e
T∗d
=e
R∗. For e
R∗=|e
Z|∞, it follows from Theorem 3, (ii) in Cher-
nozhukov, Chetverikov and Kato (2015) and the fact maxj∈[d]eσjj ≤1 that,
for every > 0,
sup
t≥0
Pe
R∗−t≤≤e
Cplog d+plog(1/),(D.26)
where e
C > 0 depends only on eσmin = minj∈[d]eσjj , which under Condi-
tion 4.2, is bounded away from zero.
Finally, combining (D.25) and (D.26) leads to
sup
t≥0P√nb
Roracle
n≤t−Pe
R∗≤t
.(K0K1)3/4n−1/8(log p)7/8+ (K0∨K1)2K0K1n−1/2slog p.
The conclusion of the theorem follows immediately.
APPENDIX E: PROOF OF THEOREM 4.2
In view of Theorem 4.1, we only need to prove the strong oracle property
of b
βlla, i.e.
(E.1) Pb
βlla =b
βoracle→1 as n→ ∞.
Together, (E.1) and (4.6) prove (4.9).
To prove (E.1), define events
A1=max
1≤j≤pn−1
n
X
i=1
X2
ij ≤2,A2=nκ(s, 3,Sn)≥1
2κ(s, 3,Σ)o,(E.2)
where Sn=n−1XTX. Given {Xi}n
i=1 and on the event A1∩A2, applying The-
orem 1 and Corollary 3 in Fan, Xue and Zou (2014) gives, with conditional
probability at least 1 −2pexp(−c0nλ2
lasso/K2
0)−2(p−s) exp(−c1nλ2/K2
0)
over {εi}n
i=1, the computed estimator b
βlla equals the oracle estimator b
βoracle,
provided λ≥8√s λlasso
κ(s,3,Σ), where c0, c1>0 are absolute constants. Taking into
account the randomness of {Xi}n
i=1, we obtain that
Pb
βlla 6=b
βoracle
≤2pexp(−c0nλ2
lasso/K2
0) + 2(p−s) exp(−c1nλ2/K2
0) + P(Ac
1) + P(Ac
2).
It remains to show that the events A1and A2in (E.2) hold with over-
whelming probability. Using the union bound and the one-sided version of
58 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
inequality (7.1), we find that the probability of the complementary event
Ac
1satisfies P(Ac
1)≤pexp(−c2n/K4
1). Under Condition 4.1,Xi=Σ1/2Ui,
where U1,...,Unare i.i.d. Rp-valued isotropic random vectors. Then it
follows from Theorem 6 and Remark 15 in Rudelson and Zhou (2013) by
taking δ= 1 −√2/2, s0=s,k0= 3, q=p,α=K1,A=Σ1/2and
Ψ = (U1,...,Un)T∈Rn×pthere that, P(Ac
2)≤2 exp(−c3n/K4
1) when-
ever the sample size nsatisfies n&K4
1slog p
κ(s,3+,Σ). Here, c2, c3>0 are absolute
constants. The proof of Theorem 4.2 is then complete.
APPENDIX F: PROOF OF PROPOSITION 3.1
F.1. Preliminaries. First, we introduce basic notation and definitions
that will be used to prove Proposition 3.1.
F.1.1. m-generated convex set. For any convex set A⊆Rp, its support
function is defined as v7→ SA(v) := sup{wTv:w∈A}for v∈Sp−1, such
that Acan be written as
A=\
v∈Sp−1w∈Rp:wTv≤SA(v).
Following Chernozhukov, Chetverikov and Kato (2014b), we say that Ais
m-generated if it is generated by intersections of mhalf-spaces; that is, there
exists a subset V(A)⊆Sp−1consisting of munit vectors that are outward
normal to the faces of Asuch that
A=\
v∈V(A)w∈Rp:wTv≤SA(v).
Moreover, for m≥1 and > 0, we say that a convex set Aadmits an
approximation with precision by an m-generated convex set Amif Am⊆
A⊆Am,, where Am, := ∩v∈V(Am){w∈Rp:wTv≤SAm(v) + }.
F.1.2. Sparsely convex set. In this section, we consider a particular class
of convex sets that can be approximated by m-generated convex sets with a
pre-specified precision for some finite m≥1.
Definition F.1 (Sparsely convex sets).Let 1≤s≤pand Q≥1be two
integers. We say that A⊆Rpis an (s, Q)-sparsely convex set if A=∩Q
q=1Aq,
where for each q,Aqis a convex set and is such that the map w7→ I(w∈Aq)
depends at most on scomponents of w= (w1, . . . , wp)T∈Rp. We refer to
A=∩Q
q=1Aqas a sparse representation of A.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 59
The class of sparsely convex sets can be regarded as a generalization of
the class of the rectangles. We refer to Chernozhukov, Chetverikov and Kato
(2014b) for a detailed introduction and more concrete examples. In particu-
lar, the following result which is Lemma D.1 there shows that under suitable
conditions, sparsely convex sets can be approximated by m-generated convex
sets with pre-specified precisions.
Lemma F.1.Assume that A is an (s, Q)-sparsely convex set satisfying
(i). 0∈A, (ii). supw∈A|w|2≤Rfor some R > 0and (iii). A=∩Q
q=1Aq,
where for each q,−A1⊆µAqfor some µ≥1. Then for every γ > e/8,
there exists 0=0(γ)>0such that for any 0< < 0,Aadmits an
approximation with precision R by an m-generated convex set Amsatisfying
that (i). |v|0≤sfor all v∈ V(Am), and (ii). m≤Qγqµ+1
log 1
s2
.
F.1.3. Central limit theorem for simple convex sets. Let X1,...,Xnbe
i.i.d. p-dimensional random vectors with mean zero and covariance matrix Σ,
and let Zbe a p-dimensional centered Gaussian random vector with the same
covariance matrix. Assume that diag(Σ) = Ip. Write W=n−1/2Pn
i=1 Xi.
For a given class Aof Borel sets in Rp, the problem of bounding the quantity
ρn(A) = supA∈A |P(W∈A)−P(Z∈A)|, which characterizes the rate of
convergence to normality with respect to A, is of long-standing interest.
In this section, we focus on a particular class of convex sets for which a
Berry-Esseen theorem can be established in high dimensions.
For integers 1 ≤s≤p,m≥1 and for δ≥0, we denote by Asc(s, m, δ)
the class of convex sets in Rpsatisfying that, every A∈ Asc(s, m, δ) admits
an approximation with precision δby an m-generated convex set Amwhich
can be chosen to satisfy |v|0≤sfor all v∈Am. We refer to Asc(s, m, δ )
as a class of simple convex sets. The following Berry-Esseen-type result is
a modification of Proposition 3.2 in Chernozhukov, Chetverikov and Kato
(2014b).
Lemma F.2.There exists some integer 1≤s≤psuch that
K= sup
u∈Sp−1:|u|0≤skuTX1kψ1<∞and σ2
min = inf
u∈Sp−1:|u|0≤s
E(uTX1)2>0.
Then, there exists an absolute constant C > 0such that for any m≥1and
δ > 0,
sup
A∈Asc(s,m,δ)|P(W∈A)−P(Z∈A)|
≤Cσ−1
minhKn−1/6{log(mn)}7/6+δplog mi.
60 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
F.2. Proof of the proposition. First, we define the following stan-
dardized counterparts of √nb
Rn(k, p) for k= 1, . . . , s:
Ln(k, p) = max
u∈Sp−1:|u|0=k|uTW|= max
S⊆[p]:|S|=k|WS|2,
where W=n−1/2Pn
i=1 yiwith yi=εiXifor i= 1, . . . , n.
The following lemma shows that, after properly normalized, the joint
distribution of {Ln(k, p)}s
k=1 can be consistently estimated by that of the
top sorder statistics of i.i.d. chi-square random variables with 1 degree of
freedom. Recall that Z= (Z1, . . . , Zp)Td
=N(0,Ip), and Z2
(1) ≤Z2
(2) ≤ ··· ≤
Z2
(p)denote the order statistics of {Z2
1, . . . , Z2
p}.
Lemma F.3.Assume that Conditions 2.1 and 2.2 in the main text hold
with Σ=Ip. Then there is an absolute constant C > 0such that
sup
0<t1<···<tk≤2Ps
\
k=1 n−1/2Ln(k, p)≤tk
−Ps
\
k=1 n−1/2R∗(k, p)≤tk≤C(K0K1)1/3{s2log(pn)}7/6
n1/6,
where R∗(k, p)2= maxu∈Sp−1:|u|0=k(uTZ)2=Pk
ν=1 Z2
(ν).
Further, define
b
L=√nb
Rn(1, p),..., b
Rn(s, p)T,L=Ln(1, p), . . . , Ln(s, p)T.
Then it is easy to see that |b
L−L|∞≤max1≤k≤s|√nb
Rn(k, p)−Ln(k, p)|.
Taking V:= ∪s
k=1V(k, p) with V(k, p) = {x7→ uTx:u∈Sp−1,|u|0=k}as
in Lemma 7.2, the same conclusions there hold by a similar argument. Con-
sequently, it follows from a modification of Lemma 7.6 that, with probability
greater than 1 −C1n−1/2{cn(s, p)}1/2,
(F.1) |b
L−L|∞≤C2(K0∨K1)2K0K1n−1/2cn(s, p)
whenever n≥C3(K0∨K1)4cn(s, p), where cn(s, p) := slog(ep/s)∨log n.
Together, (F.1) and Lemma F.3 imply that for any 0 < t1<··· < ts<1
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 61
and all sufficiently large n,
Ps
\
k=1 b
Rn(k, p)≤tk
≤Ps
\
k=1 Ln(k, p)≤√n(tk+n)+C1n−1/2{cn(s, p)}1/2
≤Ps
\
k=1 n−1/2R∗(k, p)≤tk+n+C(K0K1)1/3n−1/6{s2log(pn)}7/6
≤Ps
\
k=1 n−1/2R∗(k, p)≤tk
+P√nts< R∗(s, p)≤√n(ts+n)+C(K0K1)1/3n−1/6{s2log(pn)}7/6
≤Ps
\
k=1 n−1/2Rk≤tk
+Cnpns log(ep/s)+(K0K1)1/3n−1/6{s2log(pn)}7/6,
where n=n(s, p) = C2(K0∨K1)2K0K1n−1cn(s, p)≤1 for all sufficiently
large n. A similar argument leads to the reverse inequality, and hence com-
pletes the proof.
F.3. Proof of Lemma F.2.This proof is similar to that of Proposi-
tion 3.2 in Chernozhukov, Chetverikov and Kato (2014b) with slight modi-
fication. We reproduce them here for the sake of readability.
For every A∈ Asc(s, m, δ), let Ambe the approximating m-generated
convex set of Asuch that Am⊆A⊆Am,δ. Put
ρ= max |P(W∈Am)−P(Z∈Am)|,|P(W∈Am,δ )−P(Z∈Am,δ )|.
Applying Lemma A.1 in Chernozhukov, Chetverikov and Kato (2014b) and
Theorem 20 in Klivans, O’Donnell and Servedio (2008) to the m-dimensional
Gaussian random vector (vTZ)v∈V(Am)implies that
|P(W∈A)−P(Z∈A)| ≤ σ−1
minp2 log m+ 2δ+ρ.(F.2)
Recall that K= supu∈Sp−1:|u|0≤skuTXikψ1<∞. Then, for every q≥3
and v∈ V(Am) with |v|0≤s,
n−1
n
X
i=1
E|vTXi|q≤n−1
n
X
i=1
(qkvTXikψ1)q≤(qK)q.
62 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Consequently, it follows from Proposition 2.1 in Chernozhukov, Chetverikov
and Kato (2014b) that
ρ.σ−1
minK n−1/6{log(mn)}7/6.(F.3)
Together, (F.2) and (F.3) complete the proof of the lemma.
F.4. Proof of Lemma F.3.For any 0 < t1<·· · < ts≤2, we have
PLn(1, p)≤√nt1, Ln(2, p)≤√nt2,···, Ln(s, p)≤√nts
=PW∈
s
\
k=1
Ak(wk),
where wk=nt2
kfor k= 1, . . . , s and for t≥0,
Ak(t) := w∈Rp:|wS|2
2≤tfor all S⊆[p] with |S|=k.
Put A(t) = ∩s
k=1Ak(wk), where t= (t1, . . . , ts). For every 1 ≤k≤s, let
{Sk`}(p
k)
`=1 be all the subsets of [p] with cardinality k. In this notation, we can
further write the set A(t) as
A(t) =
s
\
k=1
(p
k)
\
`=1
Ak` =
s
\
k=1
(p
k)
\
`=1 w∈Rp:|wSk` |2
2≤wk.
It is easy to see that the indicator function w= (w1, . . . , wp)T∈Rp7→
I(w∈Ak`) depends only on k(≤s) components of w. By Definition F.1,
A(t) is an (s, (ep/s)s)-sparsely convex set. Then it follows from Lemma F.1
with R= 2(pn)1/2and γ=µ= 1 that there exists some constant 0>0
such that for every ∈(0, 0) and 0 < t1<··· < ts<1, the set A(t) admits
an approximation with precision 2(pn)1/2by an m-generated convex set
Am, where
m≤ep
ssr2
log 1
s2
.
In particular, taking = (pn)−1yields, for any t= (t1, . . . , ts) with 0 < t1<
t2<··· < ts≤2, A(t)∈ Ascs, (ep/s)s{√2pn log(pn)}s2,2(pn)−1/2. This,
together with Lemma F.2 and the inequality kuTyikψ1≤2kεikψ2kuTXikψ2≤
2K0K1that holds for all u∈Sp−1completes the proof of the lemma.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 63
APPENDIX G: PROOF OF PROPOSITION 3.2
Observe that Z2
1, . . . , Z2
pare i.i.d. chi-square random variables with 1 de-
gree of freedom. For t∈Rfixed, let tp=ap+twith ap= 2 log p−log(log p)
such that as p→ ∞,
PZ2
(p)≤tp=1−P(Z2
1> tp)p→exp −1
√πe−t/2.
In other words, Z2
(p)−apconverges weakly to a Gumbel distribution with
a cumulative distribution function given by G(t) = exp(−π−1/2e−t/2) for
t∈R. Consequently, for every s≥2 fixed, the s-dimensional vector
Z2
(p)−ap, . . . , Z2
(p−s+1) −ap
has a limiting distribution with joint density function given by [David and
Nagaraja (2003)]
gs(t1, . . . , ts) = G(ts)
s
Y
j=1
g(ti)
G(ti), t1> t2>··· > ts,
where g(t) = G0(t) = e−t/2
2√πG(t) for t∈R, and it is easy to verify that
gs(t1, . . . , ts) = 1
2√πs−1
exp −1
2
s−1
X
j=1
tjg(ts), t1> t2>··· > ts.
Consequently, as p→ ∞,
PZ2
(p)+· ·· +Z2
(p−s+1) −sap≤t
→Z···Zt1+···ts≤t,t1>···>ts
gs(t1, . . . , ts)dt1··· dts
=1
2√πs−1Zt/s
−∞ Z···Zt1+···+ts−1≤t−ts,t1>···>ts−1>ts
s−1
Y
j=1
e−tj/2dtjg(ts)dts
=1
√πs−1Zt/s
−∞ Z···Zt1+···+ts−1≤(t−ts)/2,t1>···>ts−1>ts/2
s−1
Y
j=1
e−tjdtjg(ts)dts
=1
√πs−1Zt/s
−∞
dtse−(s−1)ts/2g(ts)
×Z···Zu1+···+us−1≤(t−sts)/2, u1>···>us−1>0
e−u1−···−us−1du1··· dus−1.
(G.1)
64 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Now, let E1, . . . , Es−1be i.i.d. standard exponential distributed random
variables and let E(1) ≥E(2) ≥ ··· ≥ E(s−1) be the corresponding order
statistics. It is known that the joint density function of (E(1), . . . , E(s−1) ) is
(s−1)!e−t1−···−ts−1,t1> t2>··· > ts−1≥0. Therefore, the last multiple
integral on the right side of (G.1) is equal to
1
(s−1)!PE(1) +· ·· +E(s−1) ≤(t−sts)/2
=1
(s−1)!PE1+· ·· +Es−1≤(t−sts)/2
=1
(s−1)!Γ(s−1) Z(t−sts)/2
0
us−2e−udu,
where we used the fact that E1+·· · +Es−1d
= Gamma(s−1,1). Putting
the above calculations together yields (3.7).
To prove (3.8), observe that for any a > 0 and positive integer `,
1
Γ(`)Za
0
u`−1e−udu = 1 −
`−1
X
j=0
aj
j!e−a.
Hence, for s≥2,
1
Γ(s−1) Zt/s
−∞ Z(t−sv)/2
0
us−2e−udue−(s−1)v/2g(v)dv
=Zt/s
−∞
e−(s−1)v/2g(v)dv
−
s−2
X
j=0
1
j!2jZt/s
−∞
(t−sv)je−(t−sv)/2−(s−1)v/2g(v)dv
=Zt/s
−∞
e−(s−1)v/2g(v)dv −e−t/2
s−2
X
j=0
1
j!2jZt/s
−∞
(t−sv)jev/2g(v)dv.(G.2)
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 65
Further, using integration by parts repeatedly gives
Zt/s
−∞
e−(s−1)v/2g(v)dv =Zt/s
−∞
e−(s−1)v/2dG(v)
=G(t/s)e−(s−1)t/(2s)+ (s−1)√πZt/s
−∞
e−(s−2)v/2dG(v)
=G(t/s)ne−(s−1)t/(2s)+ (s−1)√πe−(s−2)t/(2s)o
+ (s−1)(s−2)πZt/s
−∞
e−(s−3)v/2dG(v)
=···
=G(t/s)π(s−1)/2(s−1)! + e−t/2+t/(2s)
(G.3)
+e−t/2
s−2
X
j=1
πj/2e(j+1)t/(2s)
j
Y
`=1
(s−`).
The first summand of the last term on the right-hand side of (G.2) reads to
(G.4) Zt/s
−∞
ev/2g(v)dv =et/(2s)G(t/s)−√πZt/s
−∞
evg(v)dv.
Assembling (G.2)–(G.4) completes the proof of Proposition 3.2.
APPENDIX H: PROOF OF LEMMA C.1
We continue to adopt the notation in the proof of Theorem 4.1. To prove
(D.21), consider the inequality |bσ2
j−1|≤|n−1Pn
i=1 X2
ij −1|+¯
X2
j. Anal-
ogously to (7.4), for every t1, t2>0 we have, with probability at least
1−2pexp(−cHt1)−2pexp(−cBt2), maxj∈[p]|¯
Xj| ≤ K1n−1/2√t1and
max
j∈[p]bσ2
j−1≤K2
1n−1t1+ 4K2
1max n−1/2√t2, n−1t2.
In particular, taking t1=c−1
Hlog(2pn) and t2=c−1
Blog(2pn) proves (D.21).
Next we prove (D.22). Recall that Y=X1β1+εand b
ε= (bε1,...,bεn)T=
Y−X1b
β1. Therefore, we have
n
X
i=1 bε2
i=b
εTb
ε=εTε−2εTX1δ1+|X1δ1|2
2,
¯ε=eT
nb
ε=eT
nε−n−1
n
X
i=1
XT
i,1δ1.
66 J. FAN, Q.-M. SHAO AND W.-X. ZHOU
This and (D.16) yield |b
εTb
ε−εTε| ≤ 3√s|Σ−1/2
11 XT
1ε|∞·|Σ1/2
11 δ1|2and |eT
n(b
ε−
ε)| ≤ |Σ1/2
11 δ1|2·√s|n−1Pn
i=1 Σ−1/2
11 Xi,1|∞. Applying the union bound and
inequality (7.2) we obtain that, for every t > 0, |n−1/2Pn
i=1 Σ−1/2
11 Xi,1|∞≤
K1tholds with probability at least 1 −2sexp(−cHt2). Hence, taking t=
c−1/2
Hplog(2sn), we conclude from (D.12), (D.15) and (D.16) that, with
probability at least 1 −3n−1,
b
εTb
ε−εTε.(K0K1)2slog n, eT
n(b
ε−ε).K0K2
1n−1slog n,(H.1)
provided that n&K4
1(s+ log n).
Finally, from (H.1) and (7.4) we obtain, with probability at least 1−5n−1,
bσ2−1.K2
0rlog n
n+ (K0K1)2slog n
n
whenever n&K4
1slog n. This, together with (H.1) proves (D.22).
APPENDIX I: ADDITIONAL SIMULATION RESULTS
In this section, we present additional numerical results for detecting spu-
rious discoveries in the case of non-Gaussian design and noise. We con-
tinue with the setup in Section 6.3 by taking n= 120,160, p= 400 and
β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp. For r∈ {120,200,280,360},
we let x= (x1, . . . , xr)T, where x1, . . . , xrare i.i.d. random variables follow-
ing the continuous uniform distribution on [−1,1]. The rows of the design
matrix Xare sampled as i.i.d. copies from Γrx∈Rp, where Γris a p×r
matrix satisfying ΓT
rΓr=Ir. Moreover, the noise variable εfollows a stan-
dardized t-distribution with 4 degrees of freedom. We compute the empirical
SDP based on 200 simulations. The results are provided in Table 6.
Table 6
The empirical α-level Spurious Discovery Probability (ESDP) based on 200 simulations
when p= 400,n= 120,160 and α= 5%.
Trule r= 120 r= 200 r= 280 r= 360
n= 120 0.9100 0.8000 0.7200 0.5900
n= 160 0.7650 0.6100 0.3600 0.2750
REFERENCES
Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes
with applications to Markov chains. Electron. J. Probab. 13 1000–1034.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation
study. Ann. Inst. Henri Poincar´e Probab. Stat. 48 1148–1185.
DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 67
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014a). Gaussian approximation
of suprema of empirical processes. Ann. Statist. 42 1564–1597.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014b). Central limit theorems
and bootstrap in high dimensions. Available at arXiv:1412.3661.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Comparison and anti-
concentration bounds for maxima of Gaussian random vectors. Probab. Theory Relat.
Fields 162 47–70.
David, H. A. and Nagaraja, H. N. (2003). Order Statistics (3rd ed). Wiley-Interscience,
Hoboken, NJ.
de la Pe˜
na, V. H., Lai, T. L. and Shao, Q.-M. (2009). Self-Normalized Processes: Limit
Theory and Statistical Applications. Springer, Berlin.
Fan, J., Shao, Q.-M. and Zhou, W.-X. (2015). Are discoveries spurious? Distributions of
maximum spurious correlations and their applications. Available at arXiv:1502.04237.
Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized
estimation. Ann. Statist. 42 819–849.
Klivans, A., O’Donnell, R. and Servedio, R. (2008). Learning geometric concepts
via Gaussian surface area. In Proceedings of 49th IEEE Symposium on Foundations of
Computer Science 541–550.
Mendelson, S. (2010). Empirical processes with a bounded ψ1diameter. Geom. Funct.
Anal. 20 988–1027.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measure-
ments. IEEE Trans. Inform. Theory 59 3434–3447.
Sun, Q., Zhou, W.-X. and Fan, J. (2016). Adaptive Huber regression: Optimality and
phase transition. Preprint.
Talagrand, M. (2005). The Generic Chaining. Springer-Verlag, Berlin.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical
Processes: With Applications to Statistics. Springer, New York.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices.
In Compressed Sensing: Theory and Applications (Y. Eldar and G. Kutyniok, eds.)
210–268. Cambridge University Press, Cambridge.
Wang, T., Berthet, Q. and Samworth, R. J. (2016). Statistical and computational
trade-offs in estimation of sparse principal components. Ann. Statist. 44 1896–1930.