Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications

Article in The Annals of Statistics 46(3) · February 2015
DOI: 10.1214/17-AOS1575 · Source: arXiv

Submitted to the Annals of Statistics
ARE DISCOVERIES SPURIOUS?
DISTRIBUTIONS OF MAXIMUM SPURIOUS
CORRELATIONS AND THEIR APPLICATIONS
By Jianqing Fan, Qi-Man Shao and Wen-Xin Zhou
Princeton University, Academy of Mathematics and Systems Science, Chinese University of Hong Kong, and University of California, San Diego
Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $X$, even when $X$ and $Y$ are independent. When the covariance matrix of $X$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $X$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
J. Fan was partially supported by NSF Grants DMS-1206464 and DMS-1406266, NIH Grant R01-GM072611-10, and the National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190. Q.-M. Shao was supported in part by Hong Kong Research Grants Council GRF-403513 and GRF-14302515. W.-X. Zhou was supported by NIH Grant R01-GM072611-10.

Keywords and phrases: High dimension, spurious correlation, bootstrap, false discovery.

arXiv:1502.04237v3 [math.ST] 11 Apr 2017

1. Introduction. Information technology has forever changed the data collection process. Massive amounts of very high-dimensional or unstructured data are continuously produced and stored at an affordable cost. Massive and complex data and high dimensionality characterize contemporary
statistical problems in many emerging fields of science and engineering. Various statistical and machine learning methods and algorithms have been proposed to find a small group of covariate variables that are associated with given responses such as biological and clinical outcomes. These methods have been very successfully applied to genomics, genetics, neuroscience, economics, and finance. For an overview of high-dimensional statistical theory and methods, see the review article by Fan and Lv (2010) and the monographs by Dudoit and van der Laan (2007), Hastie, Tibshirani and Friedman (2009), Efron (2010) and Bühlmann and van de Geer (2011).
Underlying machine learning, data mining, and high-dimensional statis-
tical techniques, there are many model assumptions and even heuristic ar-
guments. For example, the LASSO [Tibshirani (1996)] and the SCAD [Fan
and Li (2001)] are based on an exogeneity assumption, meaning that all of
the covariates and the residual of the true model are uncorrelated. However, it is nearly impossible that such a random variable, namely the part of the response variable that cannot be explained by a small group of covariates and that lives in the low-dimensional space spanned by the response and that small group of variables, is uncorrelated with every one of tens of thousands of covariates. Indeed, Fan and Liao (2014) and Fan, Han and Liu (2014)
provide evidence that such an ideal assumption might not be valid, although
it is a necessary condition for model selection consistency. Even under the
exogenous assumption, conditions such as the restricted eigenvalue condi-
tion [Bickel, Ritov and Tsybakov (2009)] and homogeneity [Fan, Han and
Liu (2014)] are needed to ensure model selection consistency or oracle prop-
erties. Despite their critical importance, these conditions have rarely been
verified in practice. Their violations can lead to false scientific discoveries.
A simpler question is then: for a given data set, do data mining techniques produce results that are better than spurious correlation? The answer depends not only on the correlation between the fitted and observed values, but also on the sample size, the number of variables selected, and the total number of variables.
To better appreciate the above two questions, let us consider an example.
We take the gene expression data on 90 Asians (45 Japanese and 45 Han
Chinese) from the international ‘HapMap’ project [Thorisson et al. (2005)].
The normalized gene expression data are generated with an Illumina Sentrix
Human-6 Expression Bead Chip [Stranger et al. (2007)] and are available
on ftp://ftp.sanger.ac.uk/pub/genevar/. We take the expressions of
gene CHRNA6, a cholinergic receptor, nicotinic, alpha 6, as the response
Fig 1: Histogram of the sample correlations between the residuals and each
covariate (blue) and histogram of N(0,1/n) random variables (green).
$Y$ and the remaining expressions of probes as covariates $X$, with dimension $p = 47292$. We first fit an $\ell_1$-penalized least-squares regression (LASSO) on the data, with a tuning parameter automatically selected via ten-fold cross-validation (25 genes are selected). The correlation between the LASSO-fitted value and the response is 0.8991. Next, we refit an ordinary least-squares regression on the selected model to calculate the fitted response and residual vector. The sample correlation between the post-LASSO fit and the observed responses is 0.9214, a remarkable fit! But is it any better than the spurious correlation? The model diagnostic plot, which depicts the empirical distribution of the correlations between each covariate $X_j$ and the residual $\widehat{\varepsilon}$ after the LASSO fit, is given in Figure 1. Does the exogenous assumption that $E(\varepsilon X_j) = 0$ for all $j = 1, \ldots, p$ hold?
To answer the above two important questions, we need to derive the distributions of the maximum spurious correlations. Let $X$ be the $p$-dimensional random vector of the covariates and $X_S$ be a subset of covariates indexed by $S$. Let $\widehat{\mathrm{corr}}_n(\varepsilon, \alpha_S^T X_S)$ be the sample correlation between the random noise $\varepsilon$ (independent of $X$) and $\alpha_S^T X_S$ based on a sample of size $n$, where $\alpha_S$ is a constant vector. Then, the maximum spurious correlation is defined as
(1.1) $\widehat{R}_n(s, p) = \max_{|S| = s} \max_{\alpha_S} \widehat{\mathrm{corr}}_n(\varepsilon, \alpha_S^T X_S),$
when $X$ and $\varepsilon$ are independent, where the maximization is taken over all $\binom{p}{s}$ subsets of size $s$ and over all linear combinations of the selected $s$ covariates. Next, let $(Y_1, X_1), \ldots, (Y_n, X_n)$ be independent and identically distributed (i.i.d.) observations from the linear model $Y = X^T \beta + \varepsilon$. Assume that $s$ covariates are selected by a certain variable selection method for some $1 \le s \le \min(p, n)$. If the correlation between the fitted response and the observed response is no more than the 90th or the 95th percentile of $\widehat{R}_n(s, p)$, it is hard to claim that the fitted value is impressive or even genuine. In this case, the finding is hardly more impressive than the best fit using data that consist of independent response and explanatory variables, 90% or 95% of the time. To simplify and unify the terminology, we call this result a spurious discovery throughout this paper.
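For small $(n, p, s)$, the quantity $\widehat{R}_n(s, p)$ can be computed exactly by brute force: for each size-$s$ subset $S$, the best linear combination of $X_S$ is the least-squares fit of the centered noise on the centered $X_S$, so the maximal correlation over $\alpha_S$ equals the square root of that regression's $R^2$. A minimal sketch (the function name and toy sizes are our own, not from the paper):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def max_spurious_corr(eps, X, s):
    """Brute-force \\hat{R}_n(s, p): scan all size-s subsets S; the best linear
    combination of X_S is the OLS fit of eps on the centered X_S, so the
    maximal correlation over alpha_S is sqrt(R^2) of that regression."""
    n, p = X.shape
    e = eps - eps.mean()
    best = 0.0
    for S in itertools.combinations(range(p), s):
        Xs = X[:, S] - X[:, S].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xs, e, rcond=None)
        fit = Xs @ coef
        denom = (fit @ fit) * (e @ e)
        if denom > 0:
            best = max(best, (fit @ e) ** 2 / denom)
    return np.sqrt(best)

n, p = 50, 30
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)   # independent of X: any correlation is spurious
print(max_spurious_corr(eps, X, s=1), max_spurious_corr(eps, X, s=2))
```

Even with $\varepsilon$ independent of $X$, both values are well away from zero, and the $s = 2$ value dominates the $s = 1$ value, since adding a covariate never decreases $R^2$.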
For the aforementioned gene expression data, as 25 probes are selected, the observed correlation of 0.9214 between the fitted value and the response should be compared with the distribution of $\widehat{R}_n(25, p)$. Further, a simple method to test the null hypothesis
(1.2) $E(\varepsilon X_j) = 0$ for all $j = 1, \ldots, p$
is to compare the maximum absolute correlation in Figure 1 with the distribution of $\widehat{R}_n(1, p)$. See additional details in Section 5.3.
The importance of such spurious correlation was recognized by Cai and Jiang (2011), Fan, Guo and Hao (2012) and Cai, Fan and Jiang (2013). When the data are independently and normally distributed, they derive the distribution of $\widehat{R}_n(1, p)$, which is equivalent to the distribution of the minimum angle to the north pole among $p$ random points uniformly distributed on the $(n+1)$-dimensional sphere. Fan, Guo and Hao (2012) conducted simulations to demonstrate that the spurious correlation can be very high when $p$ is large and grows quickly with $s$. To demonstrate this effect and to examine the impact of correlation and sample size, we conduct a similar but more extensive simulation study based on a combination of the stepwise addition and branch-and-bound algorithms. We simulate $X$ from $N(0, I_p)$ and $N(0, \Sigma_0)$, where $\Sigma_0$ is block diagonal, with the first block being a $500 \times 500$ equi-correlation matrix with correlation 0.8 and the second block being the $(p - 500) \times (p - 500)$ identity matrix. $Y$ is simulated independently of $X$ and follows the standard normal distribution. Figure 2 depicts the simulation results for $n = 50, 100$ and $200$. Clearly, the distributions depend on $(s, p, n)$ and $\Sigma$, the covariance matrix of $X$, although the dependence on $\Sigma$ does not seem very strong. However, the theoretical result of Fan, Guo and Hao (2012) covers only the very specific case where $s = 1$ and $\Sigma = I_p$.

Fig 2: Distributions of maximum spurious correlations for $p = 1000$ and $s = 1, 2, 5$ and $10$ when $\Sigma$ is the identity matrix (left panel) or block diagonal (right panel), with the first block being a $500 \times 500$ equi-correlation matrix with correlation 0.8 and the second block being the $500 \times 500$ identity matrix. From top to bottom: $n = 50, 100$ and $200$.
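The growth of the spurious correlation with $p$ is easy to reproduce for $s = 1$, where $\widehat{R}_n(1, p)$ is simply the largest absolute marginal correlation and no combinatorial search is needed. A small Monte Carlo sketch (the sizes and replication counts are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 200

# With Y independent of X, the best single covariate still looks correlated:
# the maximum absolute marginal correlation grows with p, roughly like
# sqrt(2 log(p) / n), even though every individual correlation is null.
med = {}
for p in (10, 100, 1000):
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)
        Xc = (X - X.mean(0)) / X.std(0)
        yc = (y - y.mean()) / y.std()
        vals.append(np.abs(Xc.T @ yc).max() / n)
    med[p] = float(np.median(vals))
print(med)
```

With $n = 50$, the median spurious correlation roughly doubles as $p$ goes from 10 to 1000, illustrating why a sizable correlation alone is not evidence of a genuine discovery.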
There are several challenges to deriving the asymptotic distribution of the statistic $\widehat{R}_n(s, p)$, as it involves combinatorial optimization. Further technical complications are added by the dependence among the covariates $X$. Nevertheless, under the restricted eigenvalue condition [Bickel, Ritov and Tsybakov (2009)] on $\Sigma$, in this paper we derive the asymptotic distribution of such a spurious correlation statistic for both a fixed $s$ and a diverging $s$, using the empirical process and Gaussian approximation techniques given in Chernozhukov, Chetverikov and Kato (2014a). As expected, such distributions depend on the unknown covariance matrix $\Sigma$. To provide a consistent estimate of the distributions of the spurious correlations, we consider the use of a multiplier bootstrap method and demonstrate its consistency under mild conditions. The multiplier bootstrap procedure has been widely used due to its good numerical performance. Its theoretical validity is guaranteed by the multiplier central limit theorem [van der Vaart and Wellner (1996)].
For the most advanced recent results, we refer to Chatterjee and Bose (2005), Arlot, Blanchard and Roquain (2010) and Chernozhukov, Chetverikov and Kato (2013). In particular, Chernozhukov, Chetverikov and Kato (2013) developed a number of non-asymptotic results on a multiplier bootstrap for the maxima of empirical mean vectors in high dimensions, with applications to multiple hypothesis testing and parameter choice for the Dantzig selector. The use of multiplier bootstrapping enables us to empirically compute the upper confidence limit of $\widehat{R}_n(s, p)$ and hence decide whether discoveries by statistical machine learning techniques are any better than spurious correlations.
The rest of this paper is organized as follows. Section 2 discusses the concept of spurious correlation and introduces the main conditions and notation. Section 3 presents the main results on the asymptotic distributions of spurious correlations and their bootstrap approximations, which are further extended in Section 4. Section 5 identifies three important applications of our results to high-dimensional statistical inference. Section 6 presents the numerical studies. The proof of Theorem 3.1 is provided in Section 7, and the proofs of the remaining theoretical results are provided in the supplementary material [Fan, Shao and Zhou (2017)].
2. Spurious correlation, conditions, and notation. Let $\varepsilon, \varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. random variables with a mean of zero and a finite variance $\sigma^2 > 0$, and let $X, X_1, \ldots, X_n$ be i.i.d. $p$-dimensional random vectors with a mean of zero and a covariance matrix $\Sigma = E(X X^T) = (\sigma_{jk})_{1 \le j, k \le p}$. Write
$$X = (X_1, \ldots, X_p)^T, \qquad X_i = (X_{i1}, \ldots, X_{ip})^T, \quad i = 1, \ldots, n.$$
Assume that the two samples $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$ are independent. Then, the spurious correlation (1.1) can be written as
(2.1) $\widehat{R}_n(s, p) = \max_{\alpha \in S^{p-1} : |\alpha|_0 = s} \widehat{\mathrm{corr}}_n(\varepsilon, \alpha^T X),$
where the dimension $p$ and the sparsity $s$ are allowed to grow with the sample size $n$. Here, $\widehat{\mathrm{corr}}_n(\cdot, \cdot)$ denotes the sample Pearson correlation coefficient and $S^{p-1} := \{\alpha \in \mathbb{R}^p : |\alpha|_2 = 1\}$ is the unit sphere of $\mathbb{R}^p$. Due to the antisymmetry of the sample correlation under the sign transformation of $\alpha$, we also have
(2.2) $\widehat{R}_n(s, p) = \max_{\alpha \in S^{p-1} : |\alpha|_0 = s} |\widehat{\mathrm{corr}}_n(\varepsilon, \alpha^T X)|.$
More specifically, we can express $\widehat{R}_n(s, p)$ as
(2.3) $\displaystyle \max_{S \subseteq [p] : |S| = s}\; \max_{\alpha \in S^{s-1}} \frac{\sum_{i=1}^n (\varepsilon_i - \bar{\varepsilon}_n) \langle \alpha, X_{i,S} - \bar{X}_{n,S} \rangle}{\sqrt{\sum_{i=1}^n (\varepsilon_i - \bar{\varepsilon}_n)^2 \cdot \sum_{i=1}^n \langle \alpha, X_{i,S} - \bar{X}_{n,S} \rangle^2}}.$
By the scale-invariance of $\widehat{R}_n(s, p)$, we assume without loss of generality that $\sigma^2 = 1$ and that $\Sigma$ is a correlation matrix, so that $\mathrm{diag}(\Sigma) = I_p$.
For a random variable $X$, the sub-Gaussian norm $\|X\|_{\psi_2}$ and the sub-exponential norm $\|X\|_{\psi_1}$ of $X$ are defined, respectively, as
$$\|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (E|X|^q)^{1/q} \quad \text{and} \quad \|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1} (E|X|^q)^{1/q}.$$
A random variable $X$ that satisfies $\|X\|_{\psi_2} < \infty$ (resp., $\|X\|_{\psi_1} < \infty$) is called a sub-Gaussian (resp., sub-exponential) random variable [Vershynin (2012)]. The following moment conditions for $\varepsilon \in \mathbb{R}$ and $X \in \mathbb{R}^p$ are imposed.

Condition 2.1. There exists a random vector $U$ such that $X = \Sigma^{1/2} U$, $E(U) = 0$, $E(U U^T) = I_p$ and $K_1 := \sup_{\alpha \in S^{p-1}} \|\alpha^T U\|_{\psi_2} < \infty$. The random variable $\varepsilon$ has a zero mean and unit variance, and is sub-Gaussian with $K_0 := \|\varepsilon\|_{\psi_2} < \infty$. Moreover, write $v_q = E(|\varepsilon|^q)$ for $q \ge 3$.
The following is our assumption on the sampling process.

Condition 2.2. $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$ are independent random samples from the distributions of $\varepsilon$ and $X$, respectively.
For $1 \le s \le p$, the $s$-sparse minimal and maximal eigenvalues [Bickel, Ritov and Tsybakov (2009)] of the covariance matrix $\Sigma$ are defined as
$$\phi_{\min}(s) = \min_{u \in \mathbb{R}^p : 1 \le |u|_0 \le s} |u|_\Sigma^2 / |u|_2^2, \qquad \phi_{\max}(s) = \max_{u \in \mathbb{R}^p : 1 \le |u|_0 \le s} |u|_\Sigma^2 / |u|_2^2,$$
where $|u|_\Sigma = (u^T \Sigma u)^{1/2}$ and $|u|_2 = (u^T u)^{1/2}$ is the $\ell_2$-norm of $u$. Consequently, for $1 \le s \le p$, the $s$-sparse condition number of $\Sigma$ is given by
(2.4) $\gamma_s = \gamma_s(\Sigma) = \sqrt{\phi_{\max}(s) / \phi_{\min}(s)}.$
The quantity $\gamma_s$ plays an important role in our analysis.
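For small $p$, the $s$-sparse extreme eigenvalues, and hence $\gamma_s$, can be computed exactly by scanning the $s \times s$ principal submatrices of $\Sigma$, since a vector supported on a set $S$ attains the extreme Rayleigh quotients of $\Sigma_{SS}$. A brute-force sketch (our own illustration, infeasible for large $p$):

```python
import itertools
import numpy as np

def sparse_eigs(Sigma, s):
    """Brute-force s-sparse extreme eigenvalues of Sigma: the extremes of
    u^T Sigma u / |u|_2^2 over s-sparse u are the extreme eigenvalues of the
    s x s principal submatrices (smaller supports are covered by interlacing)."""
    p = Sigma.shape[0]
    lo, hi = np.inf, -np.inf
    for S in itertools.combinations(range(p), s):
        w = np.linalg.eigvalsh(Sigma[np.ix_(S, S)])
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi

Sigma = 0.5 * np.ones((8, 8)) + 0.5 * np.eye(8)   # equi-correlation 0.5
lo, hi = sparse_eigs(Sigma, s=3)
gamma_s = np.sqrt(hi / lo)                        # s-sparse condition number (2.4)
print(lo, hi, gamma_s)
```

For the equi-correlation matrix with correlation 0.5, every $3 \times 3$ principal submatrix has eigenvalues $\{2, 0.5, 0.5\}$, so $\phi_{\min}(3) = 0.5$, $\phi_{\max}(3) = 2$ and $\gamma_3 = 2$.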
The following notation is used. For two sequences $\{a_n\}$ and $\{b_n\}$ of positive numbers, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n / b_n \le C$ for all sufficiently large $n$; we write $a_n \asymp b_n$ if there exist constants $C_1, C_2 > 0$ such that, for all $n$ large enough, $C_1 \le a_n / b_n \le C_2$; and we write $a_n \sim b_n$ and $a_n = o(b_n)$ if $\lim_{n \to \infty} a_n / b_n = 1$ and $\lim_{n \to \infty} a_n / b_n = 0$, respectively. For $a, b \in \mathbb{R}$, we write $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For every vector $u$, we write $|u|_q = (\sum_{i \ge 1} |u_i|^q)^{1/q}$ for $q > 0$ and $|u|_0 = \sum_{i \ge 1} I\{u_i \neq 0\}$. We use $\langle u, v \rangle = u^T v$ to denote the inner product of two vectors $u$ and $v$ with the same dimension, and $\|M\|$ to denote the spectral norm of a matrix $M$. For every positive integer $\ell$, we write $[\ell] = \{1, 2, \ldots, \ell\}$, and for any set $S$, we use $S^c$ to denote its complement and $|S|$ its cardinality. For each $p$-dimensional vector $u$ and each $p \times p$ positive semi-definite matrix $A$, we write $|u|_A = (u^T A u)^{1/2}$. In particular, put
(2.5) $\alpha_\Sigma = \alpha / |\alpha|_\Sigma$
for every $\alpha \in \mathbb{R}^p$, with the convention $0_\Sigma = 0$.
3. Distributions of maximum spurious correlations. In this section, we first derive the asymptotic distributions of the maximum spurious correlation $\widehat{R}_n(s, p)$. The analytic form of such asymptotic distributions can be obtained in the isotropic case. As the asymptotic distributions of $\widehat{R}_n(s, p)$ depend on the unknown covariance matrix $\Sigma$, we provide a bootstrap estimate and demonstrate its consistency.

3.1. Asymptotic distributions of maximum spurious correlations. In view of (2.3), we can rewrite $\widehat{R}_n(s, p)$ as
(3.1) $\displaystyle \widehat{R}_n(s, p) = \sup_{f \in \mathcal{F}} \frac{n^{-1} \sum_{i=1}^n (\varepsilon_i - \bar{\varepsilon}_n) f(X_i - \bar{X}_n)}{\sqrt{n^{-1} \sum_{i=1}^n (\varepsilon_i - \bar{\varepsilon}_n)^2} \cdot \sqrt{n^{-1} \sum_{i=1}^n f^2(X_i - \bar{X}_n)}},$
where $\bar{\varepsilon}_n = n^{-1} \sum_{i=1}^n \varepsilon_i$, $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$ and
(3.2) $\mathcal{F} = \mathcal{F}(s, p) = \{x \mapsto f_\alpha(x) := \langle \alpha, x \rangle : \alpha \in \mathcal{V}\}$
is a class of linear functions from $\mathbb{R}^p$ to $\mathbb{R}$, where $\mathcal{V} = \mathcal{V}(s, p) = \{\alpha \in S^{p-1} : |\alpha|_0 = s\}$. The dependence of $\mathcal{F}$ and $\mathcal{V}$ on $(s, p)$ is suppressed.
Let $Z = (Z_1, \ldots, Z_p)^T$ be a $p$-dimensional Gaussian random vector with a mean of zero and covariance matrix $\Sigma$, i.e., $Z \overset{d}{=} N(0, \Sigma)$. Denote by $Z_{(1)}^2 \le Z_{(2)}^2 \le \cdots \le Z_{(p)}^2$ the order statistics of $\{Z_1^2, \ldots, Z_p^2\}$. The following theorem shows that the distribution of the maximum absolute multiple correlation $\widehat{R}_n(s, p)$ can be approximated by that of the supremum of a centered Gaussian process $\mathbb{G}$ indexed by $\mathcal{F}$.
Theorem 3.1. Let Conditions 2.1 and 2.2 hold, $n, p \ge 2$ and $1 \le s \le p$. Then there exists a constant $C > 0$, independent of $(s, p, n)$, such that
(3.3) $\sup_{t \ge 0} \big| P\{\sqrt{n}\, \widehat{R}_n(s, p) \le t\} - P\{R(s, p) \le t\} \big| \le C (K_0 K_1)^{3/4} n^{-1/8} \{s\, b_n(s, p)\}^{7/8},$
where $K_0$ and $K_1$ are defined in Condition 2.1, $b_n(s, p) := \log(\gamma_s p / s) \vee \log n$ for $\gamma_s$ as in (2.4), $R(s, p) := \sup_{f \in \mathcal{F}} \mathbb{G} f$, and $\mathbb{G} = \{\mathbb{G} f\}_{f \in \mathcal{F}}$ is the centered Gaussian process indexed by $\mathcal{F}$ defined as, for every $f_\alpha \in \mathcal{F}$,
(3.4) $\mathbb{G} f_\alpha = \alpha_\Sigma^T Z = \dfrac{\alpha^T Z}{\sqrt{\alpha^T \Sigma \alpha}}.$
In particular, if $\Sigma = I_p$ and $s \log(pn) = o(n^{1/7})$, then as $n \to \infty$,
(3.5) $\sup_{t \ge 0} \big| P\{n \widehat{R}_n^2(s, p) \le t\} - P\{Z_{(p)}^2 + \cdots + Z_{(p-s+1)}^2 \le t\} \big| \to 0.$
Remark 3.1. The Berry-Esseen bound given in Theorem 3.1 depends explicitly on the triplet $(s, p, n)$, and it depends on the covariance matrix $\Sigma$ only through its $s$-sparse condition number $\gamma_s$, defined in (2.4). The proof of (3.3) builds on a number of technical tools, including a standard covering argument, maximal and concentration inequalities for the suprema of unbounded empirical processes and Gaussian processes, as well as a coupling inequality for the maxima of sums of random vectors derived in Chernozhukov, Chetverikov and Kato (2014a). If, instead, we directly resort to the general framework in Theorem 2.1 of Chernozhukov, Chetverikov and Kato (2014a), the function class of interest is $\mathcal{F} = \{x \mapsto \alpha^T x / (\alpha^T \Sigma \alpha)^{1/2} : \alpha \in S^{p-1}, |\alpha|_0 = s\}$. Checking the high-level conditions of Theorem 2.1 can be rather complicated and less intuitive. Also, dealing with the (uniform) entropy integral that corresponds to the class $\mathcal{F}$ relies on verifying various VC-type properties, and thus can be fairly tedious. Following a strategy similar to that used to prove Theorem 2.1, we provide a self-contained proof of Theorem 3.1 in Section 7.2 by making the best use of the specific structure of $\mathcal{F}$. The proof is more intuitive and straightforward. More importantly, it leads to an explicit non-asymptotic bound under transparent conditions.
Remark 3.2. In Theorem 3.1, the independence assumption on $\varepsilon$ and $X$ can be relaxed to $E(\varepsilon X) = 0$, $E(\varepsilon^2 | X) = \sigma^2$ and $E(\varepsilon^4 | X) \le C$ almost surely, where $C > 0$ is a constant.
Expression (3.5) indicates that the increment $n\{\widehat{R}_n^2(s, p) - \widehat{R}_n^2(s-1, p)\}$ is approximately the same as $Z_{(p-s+1)}^2$. This can simply be seen from the asymptotic joint distribution of $\widehat{R}_n(1, p), \widehat{R}_n(2, p), \ldots, \widehat{R}_n(s, p)$. The following proposition establishes the approximation of the joint distributions when both the dimension $p$ and the sparsity $s$ are allowed to diverge with the sample size $n$.
Proposition 3.1. Let Conditions 2.1 and 2.2 hold with $\Sigma = I_p$. Assume that the triplet $(s, p, n)$ satisfies $1 \le s < n \le p$ and $s^2 \log p = o(n^{1/7})$. Then as $n \to \infty$,
$$\sup_{0 \le t_0 < t_1 < t_2 < \cdots < t_s < 1} \bigg| P\Big\{ \bigcap_{k=1}^s \big\{\widehat{R}_n(k, p) \le t_k\big\} \Big\} - P\Big\{ \bigcap_{k=1}^s \big\{Z_{(p-k+1)}^2 \le n(t_k^2 - t_{k-1}^2)\big\} \Big\} \bigg| \to 0.$$
Remark 3.3. When $s = 1$ and $(n, p)$ satisfies $\log p = o(n^{1/7})$, it is straightforward to verify that, for any $t \in \mathbb{R}$,
(3.6) $P\{Z_{(p)}^2 - 2 \log p + \log(\log p) \le t\} \to \exp(-\pi^{-1/2} e^{-t/2})$ as $p \to \infty$.
This result is similar in nature to (5) in Fan, Guo and Hao (2012). In fact, it is proved in Shao and Zhou (2014) that the extreme-value statistic $\widehat{R}_n(1, p)$ is sensitive to heavy-tailed data in the sense that, under the ultra-high dimensional scheme, even the law of large numbers for the maximum spurious correlation requires exponentially light tails of the underlying distribution. We refer readers to Theorem 2.1 in Shao and Zhou (2014) for details. Therefore, we believe that the exponential-type moment assumptions required in Theorem 3.1 cannot be weakened to polynomial-type ones as long as $\log p$ is allowed to be as large as $n^c$ for some $c \in (0, 1)$. However, it is worth mentioning that the factor 1/7 in Proposition 3.1 may not be optimal, and according to the results in Shao and Zhou (2014), 1/3 is the best possible factor to ensure that the asymptotic theory is valid. To close this gap in theory, a significant amount of additional work and new probabilistic techniques are needed. We do not pursue this line of research in this paper.
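The Gumbel-type limit (3.6) can be checked numerically: for moderately large $p$, the empirical distribution of $Z_{(p)}^2 - 2\log p + \log(\log p)$ for i.i.d. standard normal coordinates is already close to $G(t) = \exp(-\pi^{-1/2} e^{-t/2})$, although the logarithmic rate makes convergence slow. A quick Monte Carlo sketch (the sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 20000, 1000

# Z^2_(p) is the largest of p squared standard normals; center it as in (3.6).
m = np.empty(reps)
for b in range(reps):
    z = rng.standard_normal(p)
    m[b] = (z ** 2).max() - 2 * np.log(p) + np.log(np.log(p))

t = 1.0
emp = np.mean(m <= t)                              # empirical CDF at t
lim = np.exp(-np.pi ** -0.5 * np.exp(-t / 2))      # limiting value G(t) in (3.6)
print(emp, lim)
```

The two numbers agree to within a few percent at $p = 20000$; the residual gap reflects the slow, logarithmic convergence of normal extremes noted above.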
For a general $s \ge 2$, we establish in the following proposition the limiting distribution of the sum of the top $s$ order statistics of i.i.d. chi-square random variables with one degree of freedom.
Proposition 3.2. Assume that $s \ge 2$ is a fixed integer. For any $t \in \mathbb{R}$, we have, as $p \to \infty$,
(3.7) $P\{Z_{(p)}^2 + \cdots + Z_{(p-s+1)}^2 - s a_p \le t\} \to \dfrac{\pi^{(1-s)/2}}{(s-1)!\, \Gamma(s-1)} \displaystyle\int_{-\infty}^{t/s} \int_0^{(t - s v)/2} u^{s-2} e^{-u}\, du\; e^{-(s-1)v/2} g(v)\, dv,$
where $a_p = 2 \log p - \log(\log p)$, $G(t) = \exp(-\pi^{-1/2} e^{-t/2})$ and $g(t) = G'(t) = \frac{e^{-t/2}}{2\sqrt{\pi}} G(t)$. The above integral can further be expressed as
(3.8) $G(t/s) + \dfrac{\pi^{(1-s)/2} e^{-t/2}}{(s-1)!} \displaystyle\int_{-\infty}^{t/s} e^{u} g(u)\, du + \dfrac{\pi^{(1-s)/2} e^{-t/2}}{(s-1)!} \times \displaystyle\sum_{j=1}^{s-2} G(t/s)\, e^{(j+1)t/(2s)} \pi^{j/2} \prod_{\ell=1}^{j} (s - \ell)^{-1} \dfrac{1}{j!\, 2^j} \int_{-\infty}^{t/s} (t - s v)^j e^{v/2} g(v)\, dv.$
In particular, when $s = 2$, the last term on the right-hand side of (3.8) vanishes, so that, as $p \to \infty$,
$$P\{Z_{(p)}^2 + Z_{(p-1)}^2 - 2 a_p \le t\} \to G(t/2) + \frac{e^{-t/2}}{2\pi} \int_{-\infty}^{t/2} e^{u/2} G(u)\, du.$$
The proofs of Propositions 3.1 and 3.2 are placed in the supplemental
material [Fan, Shao and Zhou (2017)].
3.2. Multiplier bootstrap approximation. The distribution of $R(s, p) = \sup_{f \in \mathcal{F}} \mathbb{G} f$ for $\mathbb{G}$ in (3.4) depends on the unknown $\Sigma$ and thus cannot be used directly for statistical inference. In the following, we consider the use of a Monte Carlo method to simulate a process that mimics $\mathbb{G}$, now known as the multiplier (wild) bootstrap method, which is similar to that used in Hansen (1996), Barrett and Donald (2003) and Chernozhukov, Chetverikov and Kato (2013), among others.
Let $\widehat{\Sigma}_n$ be the sample covariance matrix based on the data $\{X_i\}_{i=1}^n$, and let $\xi_1, \ldots, \xi_n$ be i.i.d. standard normal random variables that are independent of $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$. Then, given $\{X_i\}_{i=1}^n$,
(3.9) $Z_n = n^{-1/2} \displaystyle\sum_{i=1}^n \xi_i (X_i - \bar{X}_n) \sim N(0, \widehat{\Sigma}_n).$
The following result shows that the (unknown) distribution of $R(s, p) = \sup_{f_\alpha \in \mathcal{F}} \frac{f_\alpha(Z)}{\sqrt{\alpha^T \Sigma \alpha}}$ for $Z \overset{d}{=} N(0, \Sigma)$ can be consistently estimated by the conditional distribution of
(3.10) $R_n^{\mathrm{MB}}(s, p) := \sup_{f_\alpha \in \mathcal{F}} \dfrac{f_\alpha(Z_n)}{\sqrt{\alpha^T \widehat{\Sigma}_n \alpha}}.$
Theorem 3.2. Let Conditions 2.1 and 2.2 hold. Assume that the triplet $(s, p, n)$ satisfies $1 \le s \le p$ and $s \log(\gamma_s p n) = o(n^{1/5})$. Then as $n \to \infty$,
(3.11) $\sup_{t \ge 0} \big| P\{R(s, p) \le t\} - P\{R_n^{\mathrm{MB}}(s, p) \le t \mid X_1, \ldots, X_n\} \big| \overset{P}{\to} 0.$
Remark 3.4. Together, Theorems 3.1 and 3.2 show that the maximum spurious correlation $\widehat{R}_n(s, p)$ can be approximated in distribution by the multiplier bootstrap statistic $n^{-1/2} R_n^{\mathrm{MB}}(s, p)$. In practice, when the sample size $n$ is relatively small, the value of $n^{-1/2} R_n^{\mathrm{MB}}(s, p)$ may exceed 1, which makes it less favorable as a proxy for a spurious correlation. To address this issue, we propose using the following corrected bootstrap approximation:
(3.12) $R_n^{\mathrm{CMB}}(s, p) := \sup_{f_\alpha \in \mathcal{F}} \dfrac{\sqrt{n}}{|\xi|_2} \dfrac{f_\alpha(Z_n)}{\sqrt{\alpha^T \widehat{\Sigma}_n \alpha}},$
where $\xi = (\xi_1, \ldots, \xi_n)^T$ is used in the definition of $Z_n$. By the Cauchy-Schwarz inequality, $n^{-1/2} R_n^{\mathrm{CMB}}(s, p)$ always lies between 0 and 1. In view of (3.10) and (3.12), $R_n^{\mathrm{CMB}}(s, p)$ differs from $R_n^{\mathrm{MB}}(s, p)$ only up to a multiplicative random factor $n^{1/2} |\xi|_2^{-1}$, which in theory is concentrated around 1 with exponentially high probability. Thus, $R_n^{\mathrm{MB}}$ and $R_n^{\mathrm{CMB}}$ are asymptotically equivalent, and (3.11) remains valid with $R_n^{\mathrm{MB}}$ replaced by $R_n^{\mathrm{CMB}}$.
4. Extension to sparse linear models. Suppose that the observed response $Y$ and the $p$-dimensional covariate $X$ follow the sparse linear model
(4.1) $Y = X^T \beta + \varepsilon,$
where the regression coefficient $\beta$ is sparse. The sparsity is typically explored by the LASSO [Tibshirani (1996)], the SCAD [Fan and Li (2001)], or the MCP [Zhang (2010)]. It is now well known that, under suitable conditions, the SCAD and the MCP, among other folded concave penalized least-squares estimators, also enjoy the unbiasedness property and the (strong) oracle properties. For simplicity, we focus on the SCAD. For a given random sample $\{(X_i, Y_i)\}_{i=1}^n$, the SCAD exploits the sparsity by $p_\lambda$-regularization, which minimizes
(4.2) $(2n)^{-1} \displaystyle\sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \sum_{j=1}^p p_\lambda(|\beta_j|; a)$
over $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$, where $p_\lambda(\cdot; a)$ denotes the SCAD penalty function [Fan and Li (2001)], i.e., $p_\lambda'(t; a) = \lambda I(t \le \lambda) + \frac{(a\lambda - t)_+}{a - 1} I(t > \lambda)$ for some $a > 2$, and $\lambda = \lambda_n \ge 0$ is a regularization parameter.
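The SCAD derivative above is piecewise linear and vanishes beyond $a\lambda$, which is what removes the LASSO's bias on large coefficients. A direct, vectorized transcription (the default $a = 3.7$ follows the recommendation in Fan and Li (2001)):

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD penalty derivative p'_lam(t; a): equals lam for t <= lam,
    decays linearly as (a*lam - t)_+ / (a - 1) for t > lam,
    and is exactly zero once t >= a*lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

print(scad_deriv([0.05, 0.2, 1.0], lam=0.1))
```

With $\lambda = 0.1$ and $a = 3.7$: a coefficient of magnitude 0.05 is penalized at the full LASSO rate $\lambda$, one of magnitude 0.2 at the reduced rate $(0.37 - 0.2)/2.7$, and one of magnitude 1.0 (beyond $a\lambda = 0.37$) not at all.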
Denote by $\mathbf{X} = (X_1, \ldots, X_n)^T$ the $n \times p$ design matrix, $Y = (Y_1, \ldots, Y_n)^T$ the $n$-dimensional response vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ the $n$-dimensional noise vector. Without loss of generality, we assume that $\beta = (\beta_1^T, \beta_2^T)^T$ with each component of $\beta_1 \in \mathbb{R}^s$ being non-zero and $\beta_2 = 0$, so that $S_0 := \mathrm{supp}(\beta) = \{1, \ldots, s\}$ is the true underlying sparse model, with $s = |\beta|_0$. Moreover, write $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, where $\mathbf{X}_1 \in \mathbb{R}^{n \times s}$ consists of the columns of $\mathbf{X}$ indexed by $S_0$. In this notation, $Y = \mathbf{X}\beta + \varepsilon = \mathbf{X}_1 \beta_1 + \varepsilon$, and the oracle estimator $\widehat{\beta}^{\mathrm{oracle}}$ has the explicit form
(4.3) $\widehat{\beta}_1^{\mathrm{oracle}} = (\mathbf{X}_1^T \mathbf{X}_1)^{-1} \mathbf{X}_1^T Y = \beta_1 + (\mathbf{X}_1^T \mathbf{X}_1)^{-1} \mathbf{X}_1^T \varepsilon, \qquad \widehat{\beta}_2^{\mathrm{oracle}} = 0.$
In other words, the oracle estimator is the unpenalized estimator that minimizes $\sum_{i=1}^n (Y_i - X_{i, S_0}^T \beta_{S_0})^2$ over the true support set $S_0$.
Denote by $\widehat{\varepsilon}^{\mathrm{oracle}} = (\widehat{\varepsilon}_1^{\mathrm{oracle}}, \ldots, \widehat{\varepsilon}_n^{\mathrm{oracle}})^T = Y - \mathbf{X} \widehat{\beta}^{\mathrm{oracle}}$ the residuals after the oracle fit. Then, we can construct the maximum spurious correlation as in (2.2), except that $\{\varepsilon_i\}_{i=1}^n$ is now replaced by $\{\widehat{\varepsilon}_i^{\mathrm{oracle}}\}_{i=1}^n$, i.e.,
(4.4) $\widehat{R}_n^{\mathrm{oracle}}(1, p) = \max_{j \in [p]} \dfrac{\big| \sum_{i=1}^n (\widehat{\varepsilon}_i^{\mathrm{oracle}} - e_n^T \widehat{\varepsilon}^{\mathrm{oracle}})(X_{ij} - \bar{X}_j) \big|}{\sqrt{\sum_{i=1}^n (\widehat{\varepsilon}_i^{\mathrm{oracle}} - e_n^T \widehat{\varepsilon}^{\mathrm{oracle}})^2} \cdot \sqrt{\sum_{i=1}^n (X_{ij} - \bar{X}_j)^2}},$
where $e_n = (1/n, \ldots, 1/n)^T \in \mathbb{R}^n$ and $\bar{X}_j = n^{-1} \sum_{i=1}^n X_{ij}$. We here deal with the specific case of a spurious correlation of size 1, as this is what is needed for testing the exogeneity assumption (1.2).
To establish the limiting distribution of $\widehat{R}_n^{\mathrm{oracle}}(1, p)$, we make the following assumptions.

Condition 4.1. $Y = \mathbf{X}\beta + \varepsilon$ with $\mathrm{supp}(\beta) = \{1, \ldots, s\}$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ consisting of i.i.d. centered sub-Gaussian variables satisfying $K_0 = \|\varepsilon_i\|_{\psi_2} < \infty$. The rows of $\mathbf{X} = (X_1, \ldots, X_n)^T$ are i.i.d. sub-Gaussian random vectors as in Condition 2.1.
As before, we can assume that $\Sigma = E(X_i X_i^T)$ is a correlation matrix with $\mathrm{diag}(\Sigma) = I_p$. Set $d = p - s$ and partition
(4.5) $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ with $\Sigma_{11} \in \mathbb{R}^{s \times s}$, $\Sigma_{22} \in \mathbb{R}^{d \times d}$, $\Sigma_{21} = \Sigma_{12}^T$.
Let $\Sigma_{22.1} = (\widetilde{\sigma}_{jk})_{1 \le j, k \le d} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$ be the Schur complement of $\Sigma_{11}$ in $\Sigma$.
Condition 4.2. $\widetilde{\sigma}_{\min} = \min_{1 \le j \le d} \widetilde{\sigma}_{jj}$ is bounded away from zero.
Theorem 4.1. Assume that Conditions 4.1 and 4.2 hold, and that the triplet $(s, p, n)$ satisfies $s \log p = o(n)$ and $\log p = o(n^{1/7})$. Then the maximum spurious correlation $\widehat{R}_n^{\mathrm{oracle}}(1, p)$ in (4.4) satisfies, as $n \to \infty$,
(4.6) $\sup_{t \ge 0} \big| P\{\sqrt{n}\, \widehat{R}_n^{\mathrm{oracle}}(1, p) \le t\} - P\{|\widetilde{Z}|_\infty \le t\} \big| \to 0,$
where $\widetilde{Z} \overset{d}{=} N(0, \Sigma_{22.1})$ is a $d$-variate centered Gaussian random vector with covariance matrix $\Sigma_{22.1}$.
As $p_\lambda$ is a folded-concave penalty function, (4.2) is a non-convex optimization problem. The local linear approximation (LLA) algorithm can be applied to produce a certain local minimum for any fixed initial solution [Zou and Li (2008), Fan, Xue and Zou (2014)]. In particular, Fan, Xue and Zou (2014) prove that the LLA algorithm delivers the oracle estimator in the folded concave penalized problem with overwhelming probability if it is initialized by an appropriate initial estimator.
Let $\widehat{\beta}^{\mathrm{LLA}}$ be the estimator computed via the one-step LLA algorithm initialized by the LASSO estimator [Tibshirani (1996)]. That is,
(4.7) $\widehat{\beta}^{\mathrm{LLA}} = \arg\min_{\beta} \Big\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \sum_{j=1}^p p_\lambda'(|\widehat{\beta}_j^{\mathrm{LASSO}}|) |\beta_j| \Big\},$
where $p_\lambda$ is a folded concave penalty, such as the SCAD and MCP penalties, and $\widehat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \{(2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda_{\mathrm{LASSO}} |\beta|_1\}$. Accordingly, denote by $\widehat{R}_n^{\mathrm{LLA}}(1, p)$ the maximum spurious correlation as in (4.4) with $\widehat{\varepsilon}_i^{\mathrm{oracle}}$ replaced by $\widehat{\varepsilon}_i^{\mathrm{LLA}} = Y_i - X_i^T \widehat{\beta}^{\mathrm{LLA}}$. Applying Theorem 4.1, we derive the limiting distribution of $\widehat{R}_n^{\mathrm{LLA}}(1, p)$ under suitable conditions. First, let us recall the restricted eigenvalue concept formulated by Bickel, Ritov and Tsybakov (2009).
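The one-step LLA estimator (4.7) only requires a weighted-LASSO solver: solve the plain LASSO once, convert the magnitudes of its coefficients into per-coordinate weights through the SCAD derivative, and solve one more weighted problem. A proximal-gradient (ISTA) sketch, not the paper's implementation; the solver details, step counts, and toy tuning values are our own choices:

```python
import numpy as np

def weighted_lasso(X, y, w, iters=500):
    """ISTA solver for (2n)^{-1} |y - X b|_2^2 + sum_j w_j |b_j|."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n         # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(iters):
        z = b - X.T @ (X @ b - y) / (n * L)   # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - w / L, 0.0)  # soft-thresholding
    return b

def one_step_lla(X, y, lam, lam_lasso, a=3.7):
    """One-step LLA as in (4.7): LASSO initializer, then one weighted-LASSO
    step with weights p'_lam(|b_j^LASSO|; a) given by the SCAD derivative."""
    p = X.shape[1]
    b0 = weighted_lasso(X, y, np.full(p, lam_lasso))   # LASSO initializer
    t = np.abs(b0)
    w = np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
    return weighted_lasso(X, y, w)

# Toy check: strong sparse signal, moderate noise, SCAD-sized tuning values.
rng = np.random.default_rng(4)
n, p = 200, 50
beta = np.zeros(p); beta[:3] = (2.0, -1.5, 1.0)
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)
b = one_step_lla(X, y, lam=0.2, lam_lasso=0.1)
print(np.flatnonzero(np.abs(b) > 0.5))
```

Because the initial LASSO estimates of the three strong coefficients exceed $a\lambda$, their weights in the second step are zero, so those coordinates are effectively unpenalized, which is the debiasing mechanism behind the oracle property.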
Definition 4.1. For any integer $s_0 \in [p]$ and positive number $c_0$, the $\mathrm{RE}(s_0, c_0)$ parameter $\kappa(s_0, c_0, A)$ of a $p \times p$ matrix $A$ is defined as
(4.8) $\kappa(s_0, c_0, A) := \min_{S \subseteq [p] : |S| \le s_0}\; \min_{\delta \neq 0 : |\delta_{S^c}|_1 \le c_0 |\delta_S|_1} \dfrac{\delta^T A \delta}{|\delta_S|_2^2}.$
Theorem 4.2. Assume that Conditions 4.1 and 4.2 hold, that the minimal signal strength of $\beta$ satisfies $\min_{j \in S_0} |\beta_j| > (a + 1)\lambda$ for $a, \lambda$ as in (4.2), and that the triplet $(s, p, n)$ satisfies $s \log p = o(n)$, $\frac{s \log p}{\kappa(s, 3 + \epsilon, \Sigma)} = o(n)$ for some $\epsilon > 0$, and $\log p = o(n^{1/7})$. If the regularization parameters $(\lambda, \lambda_{\mathrm{LASSO}})$ are such that $\lambda \ge \frac{8 s}{\kappa(s, 3, \Sigma)} \lambda_{\mathrm{LASSO}}$ and $\lambda_{\mathrm{LASSO}} \ge C K_0 \sqrt{(\log p)/n}$ for $C > 0$ large enough, then as $n \to \infty$,
(4.9) $\sup_{t \ge 0} \big| P\{\sqrt{n}\, \widehat{R}_n^{\mathrm{LLA}}(1, p) \le t\} - P\{|\widetilde{Z}|_\infty \le t\} \big| \to 0,$
where $\widetilde{Z} \overset{d}{=} N(0, \Sigma_{22.1})$.
5. Applications to high-dimensional inference. This section outlines three applications in high-dimensional statistics. The first determines whether discoveries by machine learning and data mining techniques are any better than those reached by chance. Second, we show that the distributions of maximum spurious correlations can also be applied to model selection. In the third application, we validate the fundamental assumption of exogeneity (1.2) in high dimensions.
5.1. Spurious discoveries. Let $q^{\mathrm{CMB}}_\alpha(s,p)$ be the upper $\alpha$-quantile of the random variable $R^{\mathrm{CMB}}_n(s,p)$ defined by (3.12). Then an approximate $1-\alpha$ upper confidence limit for the spurious correlation is given by $q^{\mathrm{CMB}}_\alpha(s,p)$. In view of Theorems 3.1 and 3.2, we claim that

(5.1)   $P\{\hat R_n(s,p) \le q^{\mathrm{CMB}}_\alpha(s,p)\} \to 1-\alpha.$

To see this, recall that $R^{\mathrm{CMB}}_n = \sqrt{n}\,R^{\mathrm{MB}}_n/|\xi|_2$ for $\xi = (\xi_1,\ldots,\xi_n)^T$ as in (3.12), and that, given $\{X_i\}_{i=1}^n$, $R^{\mathrm{MB}}_n$ is the supremum of a Gaussian process. Let $F^{\mathrm{MB}}_n(t) = P\{R^{\mathrm{MB}}_n(s,p) \le t \mid X_1,\ldots,X_n\}$ be the (conditional) distribution function of $R^{\mathrm{MB}}_n$ and define $t_0 = \inf\{t : F^{\mathrm{MB}}_n(t) > 0\}$. By Theorem 11.1 of Davydov, Lifshits and Smorodina (1998), $F^{\mathrm{MB}}_n$ is absolutely continuous with respect to the Lebesgue measure and strictly increasing on $(t_0, \infty)$, indicating that $P\{R^{\mathrm{CMB}}_n \ge q^{\mathrm{CMB}}_\alpha(s,p) \mid X_1,\ldots,X_n\} = \alpha$ almost surely. This, together with (3.3) and (3.11), proves (5.1) under Conditions 2.1 and 2.2 and when $s\log(\gamma_s pn) = o(n^{1/7})$.
J. FAN, Q.-M. SHAO AND W.-X. ZHOU
Let $\hat Y_i$ be the fitted values using $s$ predictors indexed by $\hat S$, selected by a data-driven technique, and $Y_i$ be the associated response values. Denote them in vector form by $\hat Y$ and $Y$, respectively. If

(5.2)   $\widehat{\mathrm{corr}}_n(Y, \hat Y) \le q^{\mathrm{CMB}}_\alpha(s,p),$

then the discovery of the variables $\hat S$ can be regarded as spurious, that is, no better than chance. Therefore, the multiplier bootstrap quantile $q^{\mathrm{CMB}}_\alpha(s,p)$ provides an important critical value and yardstick for judging whether the discovery is spurious, or whether the selected set $\hat S$ includes too many spurious variables. This yardstick is independent of the method used in the fitting.
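For $s = 1$, the yardstick in (5.2) is easy to compute. The sketch below is our illustration of the idea using Gaussian multipliers conditional on the design; the general-$s$ statistic in (3.12) requires the best-subset computation discussed in Section 6.1, and the function names here are ours:

```python
import numpy as np

def msc_bootstrap_quantile(X, alpha=0.05, B=1000, rng=None):
    """Upper alpha-quantile of the (s = 1) maximum spurious correlation,
    approximated conditional on X: draw multipliers xi ~ N(0, I_n) and
    record max_j |corr_n(xi, X_j)| over bootstrap draws."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                     # center covariates
    Xc = Xc / np.linalg.norm(Xc, axis=0)        # scale columns to unit norm
    stats = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)
        xic = xi - xi.mean()
        xic /= np.linalg.norm(xic)
        stats[b] = np.max(np.abs(Xc.T @ xic))   # max absolute sample correlation
    return np.quantile(stats, 1 - alpha)

def is_spurious(y, y_hat, q_alpha):
    """Criterion (5.2): flag the discovery as spurious when the correlation
    between fitted and observed responses does not exceed the bootstrap
    quantile of the maximum spurious correlation."""
    r = np.corrcoef(y, y_hat)[0, 1]
    return r <= q_alpha
```

A fit whose correlation with the observed responses falls below `msc_bootstrap_quantile(X)` is no better than what pure noise achieves against the same design.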
Remark 5.1. The problem of judging whether a discovery is spurious is intrinsically different from that of testing the global null hypothesis $H_0: \beta = 0$, which is itself an important problem in high-dimensional statistical inference and has been well-studied in the literature since the seminal work of Goeman, van de Geer and van Houwelingen (2006). For example, the global null hypothesis $H_0: \beta = 0$ can be rejected by a test; still, the correlation between $Y$ and the variables $\hat S$ selected by a statistical method can be smaller than the maximum spurious correlation, and we should interpret the findings of $\hat S$ with caution: we need either more samples or more powerful variable selection methods. This motivates us to derive the distribution of the maximum spurious correlation $\hat R_n(s,p)$. This distribution serves as an important benchmark for judging whether the discovery (of $s$ features from $p$ explanatory variables based on a sample of size $n$) is spurious. The magnitude of $\hat R_n(s,p)$ gives statisticians an idea of how large a spurious correlation can be, and therefore of how much the covariates really contribute to the regression for a given sample size.
5.2. Model selection. In the previous section, we considered the reference distribution of the maximum spurious correlation statistic $\hat R_n(s,p)$ as a benchmark for judging whether the discovery of $s$ significant variables (among all $p$ variables, using a random sample of size $n$) is impressive, regardless of which variable selection tool is applied. In this section, we show how the distribution of $\hat R_n(s,p)$ can be used to select a model. Intuitively, we would like to select a model that fits better than a spurious fit. This limits the candidate sets of models and provides an upper bound on the model size. In our experience, this upper bound itself provides a model selector.

We now use LASSO to illustrate the above idea. Owing to spurious correlation, almost all variable selection procedures will, with high probability, select a number of spurious variables, so that the selected model is over-fitted. For example, the LASSO method with the regularization parameter selected by cross-validation typically selects a far larger model, as the bias caused by the $\ell_1$ penalty forces the cross-validation procedure to choose a smaller value of $\lambda$. It is therefore important to stop the LASSO path early, and the quantiles of $\hat R_n(s,p)$ provide useful guards.
Specifically, consider the LASSO estimator $\hat\beta_\lambda$ for the sparse linear model (4.1) with $\hat s_\lambda = |\mathrm{supp}(\hat\beta_\lambda)|$, where $\lambda > 0$ is the regularization parameter. We consider the LASSO solution path with the largest knot $\lambda_{\mathrm{ini}} := |X^T Y|_\infty$ and the smallest knot $\lambda_{\mathrm{cv}}$ selected by ten-fold cross-validation. To avoid over-fitting, we propose using $q^{\mathrm{CMB}}_\alpha$ as a guide for choosing the regularization parameter, guarding against the selection of too many spurious variables. For each $\lambda$ in the path, we compute $\widehat{\mathrm{corr}}_n(\hat Y_\lambda, Y)$, the sample correlation between the post-LASSO fitted and observed responses, and $q^{\mathrm{CMB}}_\alpha(\hat s_\lambda, p)$. Let $\hat\lambda_\alpha$ be the largest $\lambda$ such that $\widehat{\mathrm{corr}}_n(\hat Y_\lambda, Y) - q^{\mathrm{CMB}}_\alpha(\hat s_\lambda, p)$ is nonnegative and then flips sign at the subsequent knot. The selected model is given by $\hat S_\alpha = \mathrm{supp}(\hat\beta_{\hat\lambda_\alpha})$. As demonstrated by the simulation studies in Section 6.4, this procedure selects a much smaller model whose size is closer to that of the true model.
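The early-stopping rule can be sketched as follows. This is our illustration: the knots `lambdas` and their supports `supports` are assumed precomputed by any LASSO path solver (ordered from the largest knot down), and `q_alpha` is an assumed callable returning $q^{\mathrm{CMB}}_\alpha(s,p)$, for example via the bootstrap of Section 5.1:

```python
import numpy as np

def select_model_along_path(X, y, lambdas, supports, q_alpha):
    """Pick the largest knot whose post-LASSO fit beats the
    spurious-correlation benchmark, with the sign flipping at the
    subsequent knot (Section 5.2 rule, sketched)."""
    def gap(S):
        if len(S) == 0:
            return -np.inf
        beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)  # post-LASSO refit
        y_hat = X[:, S] @ beta
        return np.corrcoef(y, y_hat)[0, 1] - q_alpha(len(S))
    gaps = [gap(S) for S in supports]
    for k in range(len(gaps) - 1):                 # scan from largest lambda
        if gaps[k] >= 0 and gaps[k + 1] < 0:       # nonnegative, then flips
            return lambdas[k], supports[k]
    return None
```

The returned support plays the role of $\hat S_\alpha$; returning `None` signals that no knot on the supplied path satisfies the rule.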
5.3. Validating exogeneity. Fan and Liao (2014) show that the exogeneity condition (1.2) is necessary for penalized least-squares to achieve model selection consistency. They question the validity of such an exogeneity assumption, as it imposes too many equations. They argue further that even when the exogenous model holds for the important variables $X_S$, that is,

(5.3)   $Y = X_S^T \beta_S + \varepsilon, \qquad E(\varepsilon X_S) = 0,$

the extra variables $X_N$ (with $N = S^c$) are collected in an effort to cover the unknown set $S$, but no verification of the conditions

(5.4)   $E(\varepsilon X_N) = E\{(Y - X_S^T \beta_S) X_N\} = 0$

has ever been made. The equality $E\{(Y - X_S^T \beta_S) X_j\} = 0$ in (5.4) may hold by luck for some covariate $X_j$, but it cannot be expected to hold for all $j \in N$. They propose a focused generalized method of moments (FGMM) to avoid the unreasonable assumption (5.4). Recognizing that (5.3) is not identifiable in high-dimensional linear models, they impose additional conditions such as $E(\varepsilon X_S^2) = 0$.
Despite its fundamental importance to high-dimensional statistics, there are no available tools for validating (1.2). Regarding (1.2) as a null hypothesis, an asymptotically $\alpha$-level test rejects assumption (1.2) when

(5.5)   $\hat T_{n,p} = \max_{j \in [p]} \sqrt{n}\,|\widehat{\mathrm{corr}}_n(X_j, \varepsilon)| \ge \sqrt{n}\, q^{\mathrm{CMB}}_\alpha(1, p).$

By Theorems 3.1 and 3.2, the test statistic has an approximate size $\alpha$. The p-value of the test can be computed via the distribution of the Gaussian multiplier process $R^{\mathrm{CMB}}_n(1,p)$.

As pointed out in the introduction, when the components of $X$ are weakly correlated, the distribution of the maximum spurious correlation does not depend very sensitively on $\Sigma$. See also Lemma 6 in Cai, Liu and Xia (2014). In this case, we can approximate $\Sigma$ by the identity matrix, and hence one can compare the renormalized test statistic

(5.6)   $J_{n,p} = \hat T_{n,p}^2 - 2\log p + \log(\log p)$

with the limiting distribution in (3.6). The critical value for the test statistic $J_{n,p}$ is

(5.7)   $J_\alpha = -2\log\{-\pi^{1/2}\log(1-\alpha)\},$

and the associated p-value is given by

(5.8)   $1 - \exp(-\pi^{-1/2} e^{-J_{n,p}/2}).$

Expressions (5.7) and (5.8) provide analytic forms for a quick validation of the exogeneity assumption (1.2) under weak dependence. In general, we recommend using the wild bootstrap, which takes into account the correlation effect and provides more accurate estimates, especially when the dependence is strong. See Chang et al. (2017) for more empirical evidence.
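Under weak dependence, the analytic calibration in (5.6)-(5.8) takes only a few lines. The sketch below is our illustration (function and argument names are ours), taking the noise $\varepsilon$ as given and using the limiting distribution $\exp(-\pi^{-1/2} e^{-t/2})$:

```python
import numpy as np

def exogeneity_test(X, eps, alpha=0.05):
    """Analytic exogeneity check under weak dependence:
    T = max_j sqrt(n) |corr_n(X_j, eps)|, renormalized as J in (5.6),
    compared with the Gumbel-type critical value (5.7); p-value per (5.8)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    ec = eps - eps.mean()
    corr = (Xc.T @ ec) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(ec))
    T = np.sqrt(n) * np.max(np.abs(corr))           # statistic in (5.5)
    J = T**2 - 2.0 * np.log(p) + np.log(np.log(p))  # renormalization (5.6)
    J_alpha = -2.0 * np.log(-np.sqrt(np.pi) * np.log(1.0 - alpha))  # (5.7)
    p_value = 1.0 - np.exp(-np.exp(-J / 2.0) / np.sqrt(np.pi))      # (5.8)
    return J, J_alpha, p_value, bool(J >= J_alpha)
```

A large $J_{n,p}$ relative to $J_\alpha$ indicates that some covariate correlates with the noise beyond what the maximum-of-$p$ null calibration allows.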
In practice, $\varepsilon$ is typically unknown. Therefore, $\hat T_{n,p}$ in (5.5) is calculated using the fitted residuals $\{\hat\varepsilon^{\mathrm{LLA}}_i\}_{i=1}^n$. In view of Theorem 4.2, we need to adjust the null distribution according to (4.9). By Theorem 3.2, we adjust the definition of the process $Z_n$ in (3.9) to

(5.9)   $Z^{\mathrm{LLA}}_n = n^{-1/2}\sum_{i=1}^n \xi_i\,(X^{\mathrm{LLA}}_i - \bar X^{\mathrm{LLA}}_n) \in \mathbb{R}^{p-|\hat S|},$

where $X^{\mathrm{LLA}}_i = X_{i,\hat N} - \hat\Sigma_{\hat N\hat S}\hat\Sigma^{-1}_{\hat S\hat S}X_{i,\hat S}$ is the vector of residuals of $X_{\hat N}$ regressed on $X_{\hat S}$, $\hat S$ is the set of selected variables, $\hat N = [p]\setminus\hat S$, and $\hat\Sigma_{SS'}$ denotes the sub-matrix of $\hat\Sigma_n$ containing the entries indexed by $(k,\ell) \in S\times S'$. From (5.9), the multiplier bootstrap approximation of $|\tilde Z|_\infty$ is $R^{\mathrm{MB,LLA}}_n(1,p) = |\hat D^{-1/2} Z^{\mathrm{LLA}}_n|_\infty$, where $\hat D$ is the diagonal matrix of the sample covariance matrix of $\{X^{\mathrm{LLA}}_i\}_{i=1}^n$. Consequently, we reject (1.2) if $\hat T_{n,p} > q^{\mathrm{MB,LLA}}_\alpha(1,p)$, where $q^{\mathrm{MB,LLA}}_\alpha(1,p)$ is the (conditional) upper $\alpha$-quantile of $R^{\mathrm{MB,LLA}}_n(1,p)$ given $\{X_i\}_{i=1}^n$.
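The residual-adjusted bootstrap quantile in (5.9) can be sketched as follows (our illustration; `S_hat` denotes the selected set $\hat S$, and the function name is ours):

```python
import numpy as np

def rmb_lla_quantile(X, S_hat, alpha=0.05, B=1000, rng=None):
    """Multiplier bootstrap quantile per (5.9): regress X_N on X_S through
    the sample covariance blocks, then bootstrap the normalized maximum
    |D^{-1/2} Z_n^LLA|_inf over Gaussian multiplier draws."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    N_hat = np.setdiff1d(np.arange(p), S_hat)
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    # residuals of X_N regressed on X_S: X_{i,N} - Sigma_NS Sigma_SS^{-1} X_{i,S}
    W = Sigma[np.ix_(N_hat, S_hat)] @ np.linalg.solve(
        Sigma[np.ix_(S_hat, S_hat)], Xc[:, S_hat].T)
    XL = Xc[:, N_hat] - W.T
    XLc = XL - XL.mean(axis=0)
    sd = XLc.std(axis=0)                       # diagonal of D^{1/2}
    stats = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)
        Z = XLc.T @ xi / np.sqrt(n)            # Z_n^LLA in (5.9)
        stats[b] = np.max(np.abs(Z / sd))      # |D^{-1/2} Z|_inf
    return np.quantile(stats, 1 - alpha)
```

The test then compares $\hat T_{n,p}$, computed from the LLA residuals, with this quantile.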
Remark 5.2.To the best of our knowledge, this is the first paper to
consider testing the exogenous assumption (1.2), for which we use the maxi-
mum correlation between covariates and fitted residuals as the test statistic.
A referee kindly informed us in his/her review report that in the context
of specification testing, Chernozhukov, Chetverikov and Kato (2013) pro-
pose a similar extreme value statistic and use the multiplier bootstrap to
compute a critical value for the test. To construct marginal test statistics,
they use self-normalized covariances between generated regressors and fitted
residuals obtained via ordinary least squares, whereas we use sample cor-
relations between the covariates and fitted residuals obtained by the LLA
algorithm. We refer readers to Appendix M in the supplementary material
of Chernozhukov, Chetverikov and Kato (2013) for more details.
6. Numerical studies. In this section, Monte Carlo simulations are
used to examine the finite-sample performance of the bootstrap approxi-
mation (for a given data set) of the distribution of the maximum spurious
correlation (MSC).
6.1. Computation of spurious correlation. First, we observe that $\hat R_n(s,p)$ in (2.2) can be written as $\hat R_n^2(s,p) = \hat\sigma_\varepsilon^{-2}\max_{S\subseteq[p]:|S|=s} v_{n,S}^T \hat\Sigma^{-1}_{SS} v_{n,S}$, where $\hat\sigma^2_\varepsilon = n^{-1}\sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n)^2$ and $v_n = n^{-1}\sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n)(X_i - \bar X_n)$. Therefore, the computation of $\hat R_n(s,p)$ requires solving the combinatorial optimization problem

(6.1)   $\hat S = \arg\max_{S\subseteq[p]:|S|=s} v_{n,S}^T \hat\Sigma^{-1}_{SS} v_{n,S}.$

It is computationally intensive to obtain $\hat S$ for large values of $p$ and $s$, as one essentially needs to enumerate all $\binom{p}{s}$ possible subsets of size $s$ from the $p$ covariates. A fast and easily implementable approach is the stepwise addition (forward selection) algorithm, as in Fan, Guo and Hao (2012), which results in a value no larger than $\hat R_n(s,p)$ but avoids computing all $\binom{p}{s}$ multiple correlations in (6.1). Note that the optimization (6.1) is equivalent to finding the best subset regression of size $s$. When $p$ is relatively small, say between 20 and 40, the branch-and-bound procedure is commonly used for finding the best subset of a given size that maximizes the multiple $R^2$ [Brusco and Stahl (2005)]. However, this approach quickly becomes computationally infeasible when there are hundreds or thousands of potential predictors. As a trade-off between approximation accuracy and computational cost, we propose a two-step procedure that combines the stepwise addition and branch-and-bound algorithms. First, we use forward selection to pick the best $d$ variables, say $d = 40$, which serves as a pre-screening step. Second, across the $\binom{d}{s}$ subsets of size $s$, the branch-and-bound procedure is implemented to select the best subset that maximizes the multiple $R^2$. This subset is used as an approximate solution to (6.1). When $s > 40$, which is rare in applications, we use only the stepwise addition to reduce the computational cost.
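The stepwise-addition step can be sketched as follows (our illustration of the greedy maximization of $v_{n,S}^T\hat\Sigma^{-1}_{SS}v_{n,S}$; it returns a lower bound on $\hat R_n(s,p)$, with the branch-and-bound refinement omitted):

```python
import numpy as np

def forward_msc(X, eps, s):
    """Greedy (stepwise-addition) approximation of R_n(s, p):
    grow the subset S one index at a time, each time adding the covariate
    that most increases v_S^T Sigma_SS^{-1} v_S."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    ec = eps - eps.mean()
    Sigma = Xc.T @ Xc / n          # sample covariance of X
    v = Xc.T @ ec / n              # sample covariance with the noise
    sigma2 = ec @ ec / n           # sample variance of the noise
    S = []
    for _ in range(s):
        best_j, best_val = None, -np.inf
        for j in range(p):
            if j in S:
                continue
            T = S + [j]
            val = v[T] @ np.linalg.solve(Sigma[np.ix_(T, T)], v[T])
            if val > best_val:
                best_j, best_val = j, val
        S.append(best_j)
    return np.sqrt(best_val / sigma2), S
```

For $s = 1$ the greedy value is exact and equals the largest absolute sample correlation between the noise and a single covariate.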
6.2. Accuracy of the multiplier bootstrap approximation. For the first simulation, we consider the case where the random noise $\varepsilon$ follows the uniform distribution standardized so that $E(\varepsilon) = 0$ and $E(\varepsilon^2) = 1$. Independent of $\varepsilon$, the $p$-variate vector $X$ of covariates has i.i.d. $N(0,1)$ components. In the results reported in Table 1, the ambient dimension is $p = 2000$, the sample size $n$ takes a value in $\{400, 800, 1200\}$, and $s$ takes a value in $\{1, 2, 5, 10\}$. For a given significance level $\alpha \in (0,1)$, let $q_\alpha(s,p)$ be the upper $\alpha$-quantile of $\hat R_n(s,p)$ in (2.1). For each data set $\mathcal{X}_n = \{X_1,\ldots,X_n\}$, a direct application of Theorems 3.1 and 3.2 gives

$c_{\mathrm{MB}}(\mathcal{X}_n, \alpha) := P\{R^{\mathrm{CMB}}_n(s,p) \ge q_\alpha(s,p) \mid \mathcal{X}_n\} \to \alpha$ as $n\to\infty$.

The difference $c_{\mathrm{MB}}(\mathcal{X}_n,\alpha) - \alpha$, therefore, characterizes the extent of the size distortion and the finite-sample accuracy of the multiplier bootstrap approximation (MBA). Table 1 summarizes the mean and the standard deviation (SD) of $c_{\mathrm{MB}}(\mathcal{X}_n,\alpha)$ based on 200 simulated data sets with $\alpha \in \{0.05, 0.1\}$. The quantile $q_\alpha(s,p)$ is calculated from 1600 replications, and $c_{\mathrm{MB}}(\mathcal{X}_n,\alpha)$ for each data set is simulated based on 1600 bootstrap replications. In addition, we report in Figure 3 the distributions of the maximum spurious correlations and their multiplier bootstrap approximations conditional on a given data set $\mathcal{X}_n$ when $p \in \{2000, 5000\}$, $s \in \{1, 2, 5, 10\}$ and $n = 400$. Together, Table 1 and Figure 3 show that the multiplier bootstrap method indeed provides a quite good approximation to the (unknown) distribution of the maximum spurious correlation.
For the second simulation, we focus on an anisotropic case where the covariance matrix $\Sigma$ of $X$ is non-identity and the condition number of $\Sigma$ is well-controlled. Specifically, we assume that $\varepsilon$ follows the centered Laplace distribution rescaled so that $E(\varepsilon) = 0$ and $E(\varepsilon^2) = 1$. To introduce dependence among the covariates, we first denote by $A$ a $10 \times 10$ symmetric positive definite matrix with a pre-specified condition number $c > 1$
Table 1
(Isotropic case) The mean of 200 empirical sizes $c_{\mathrm{MB}}(\cdot,\alpha)\times 100$, with its SD estimate in parentheses, when $p = 2000$, $s = 1, 2, 5, 10$, $n = 400, 800, 1200$, and $\alpha = 0.1, 0.05$

          s = 1             s = 2             s = 5             s = 10
  n     α=0.1   α=0.05    α=0.1   α=0.05    α=0.1   α=0.05    α=0.1   α=0.05
 400     9.54    4.68      9.13    4.38      9.08    3.78      8.67    4.44
       (0.643) (0.294)   (0.568) (0.284)   (0.480) (0.245)   (0.506) (0.291)
 800     9.43    4.93      9.47    4.42      9.73    4.73      9.94    5.62
       (0.444) (0.296)   (0.474) (0.296)   (0.488) (0.294)   (0.557) (0.331)
1200     9.09    4.32      9.00    4.46      9.42    4.87      9.97    5.15
       (0.507) (0.261)   (0.542) (0.278)   (0.543) (0.322)   (0.579) (0.318)
[Figure 3: two density panels, (n, p) = (400, 2000) and (n, p) = (400, 5000), each overlaying the MSC-s and MBA-s curves for s = 1, 2, 5, 10.]

Fig 3: Distributions of maximum spurious correlations (blue) and multiplier bootstrap approximations (for a given data set; red) based on 1600 simulations with combinations of p = 2000, 5000, s = 1, 2, 5, 10, and n = 400 when Σ is an identity matrix.
and let $\rho \in (0,1)$. Then the $p$-dimensional vector $X$ of the covariates is generated according to $X = G_1(\rho)Z_1 + G_2(\rho)Z_2$, where $Z_1 \overset{d}{=} N(0, A)$, $Z_2 \overset{d}{=} N(0, I_{p-10})$, and $G_1(\rho) \in \mathbb{R}^{p\times 10}$, $G_2(\rho) \in \mathbb{R}^{p\times(p-10)}$ are given respectively by $G_1(\rho)^T = \big(I_{10},\ \tfrac{\rho}{\sqrt{1+\rho^2}} I_{10},\ G_{11}(\rho)^T\big)$ with

$G_{11}(\rho) = \dfrac{1-\rho}{\sqrt{1+(1-\rho)^2}}\begin{pmatrix} 1 & 0 & \cdots & 0\\ 1 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 1 & 0 & \cdots & 0 \end{pmatrix} \in \mathbb{R}^{(p-20)\times 10}$

and

$G_2(\rho) = \begin{pmatrix} 0_{10\times 10} & 0_{10\times(p-20)}\\ \tfrac{1}{\sqrt{1+\rho^2}} I_{10} & 0_{10\times(p-20)}\\ 0_{(p-20)\times 10} & \tfrac{1}{\sqrt{1+(1-\rho)^2}} I_{p-20} \end{pmatrix}.$
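The block construction above can be sampled in a few lines. The sketch below is our illustration of this simulation design (the function and argument names are ours; it requires $p > 20$):

```python
import numpy as np

def anisotropic_design(n, p, rho, A, rng=None):
    """Sample n rows of X = G1(rho) Z1 + G2(rho) Z2 with Z1 ~ N(0, A)
    (A is 10 x 10) and Z2 ~ N(0, I_{p-10}), following the block
    structure of G1 and G2."""
    rng = np.random.default_rng(rng)
    a = rho / np.sqrt(1 + rho**2)
    b = 1.0 / np.sqrt(1 + rho**2)
    c = (1 - rho) / np.sqrt(1 + (1 - rho)**2)
    d = 1.0 / np.sqrt(1 + (1 - rho)**2)
    G1 = np.zeros((p, 10))
    G1[:10] = np.eye(10)                   # top block: I_10
    G1[10:20] = a * np.eye(10)             # middle block
    G1[20:, 0] = c                         # G11: constant first column
    G2 = np.zeros((p, p - 10))
    G2[10:20, :10] = b * np.eye(10)
    G2[20:, 10:] = d * np.eye(p - 20)
    Z1 = rng.multivariate_normal(np.zeros(10), A, size=n)  # n x 10
    Z2 = rng.standard_normal((n, p - 10))                  # n x (p-10)
    return Z1 @ G1.T + Z2 @ G2.T
```

By construction the scaling constants pair up so that, when $A$ has unit diagonal, every coordinate of $X$ has variance one while blocks of covariates remain correlated through $Z_1$.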
In particular, we take $c = 5$ and $\rho = 0.8$ in the simulations reported in Table 2, which summarizes the mean and the standard deviation (SD) of the size $c_{\mathrm{MB}}(\mathcal{X}_n,\alpha)$ based on 200 simulated data sets with $\alpha \in \{0.05, 0.1\}$. Comparing the simulation results shown in Tables 1 and 2, we find that the bootstrap approximation is fairly robust against heterogeneity in the covariance structure of the covariates.
Table 2
(Anisotropic case) Mean of 200 empirical sizes $c_{\mathrm{MB}}(\cdot,\alpha)\times 100$, with its SD estimate in parentheses, when $p = 2000$, $s = 1, 2, 5, 10$, $n = 400, 800, 1200$, and $\alpha = 0.1, 0.05$

          s = 1             s = 2             s = 5             s = 10
  n     α=0.1   α=0.05    α=0.1   α=0.05    α=0.1   α=0.05    α=0.1   α=0.05
 400     9.83    4.39      9.04    4.75      9.27    4.65      9.34    4.53
       (0.426) (0.222)   (0.402) (0.208)   (0.492) (0.273)   (0.557) (0.291)
 800    10.18    5.19     10.48    5.12      9.98    4.86      9.21    4.73
       (0.556) (0.296)   (0.519) (0.272)   (0.576) (0.220)   (0.474) (0.269)
1200     9.42    4.41      9.60    5.71      9.90    4.85     10.11    5.19
       (0.500) (0.233)   (0.543) (0.339)   (0.553) (0.333)   (0.606) (0.337)
6.3. Detecting spurious discoveries. To examine how the multiplier bootstrap quantile $q^{\mathrm{CMB}}_\alpha(s,p)$ (see Section 5.1) serves as a benchmark for judging whether a discovery is spurious, we compute the spurious discovery probability (SDP) by simulating 200 data sets from (4.1) with $n = 100, 120, 160$, $p = 400$, $\beta = (1, 0, 0.8, 0, 0.6, 0, 0.4, 0, \ldots, 0)^T \in \mathbb{R}^p$, and standard Gaussian noise $\varepsilon \overset{d}{=} N(0,1)$. For some integer $s \le r \le p$, we let $x \overset{d}{=} N(0, I_r)$ be an $r$-dimensional Gaussian random vector. Let $\Gamma_r$ be a $p \times r$ matrix satisfying $\Gamma_r^T\Gamma_r = I_r$. The rows of the design matrix $X$ are sampled as i.i.d. copies of $\Gamma_r x \in \mathbb{R}^p$, where $r$ takes a value in $\{120, 160, 200, 240, 280, 320, 360\}$. To save space, we give the numerical results for the case of non-Gaussian design and noise in the supplementary material [Fan, Shao and Zhou (2017)].
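This low-rank design can be sketched as follows. The particular orthonormal $\Gamma_r$ is not specified in the text; generating it by a QR factorization, as done here, is our choice:

```python
import numpy as np

def low_rank_design(n, p, r, rng=None):
    """Rows of X are i.i.d. copies of Gamma_r x with x ~ N(0, I_r) and
    Gamma_r a p x r matrix with orthonormal columns, so that the p
    covariates satisfy an exact (p - r)-dimensional linear dependency."""
    rng = np.random.default_rng(rng)
    Gamma, _ = np.linalg.qr(rng.standard_normal((p, r)))  # orthonormal columns
    x = rng.standard_normal((n, r))                       # rows x_i ~ N(0, I_r)
    return x @ Gamma.T                                    # n x p design of rank r
```

The smaller $r$ is relative to $p$, the stronger the collinearity among the collected covariates, which is exactly the regime stressed in this experiment.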
Put $Y = (Y_1,\ldots,Y_n)^T$ and let $\hat Y = X_{\hat S}\hat\beta^{\mathrm{pLASSO}}$ be the $n$-dimensional vector of fitted values, where $\hat\beta^{\mathrm{pLASSO}} = (X_{\hat S}^T X_{\hat S})^{-1} X_{\hat S}^T Y$ is the post-LASSO estimator using the covariates selected by the ten-fold cross-validated LASSO estimator. Let $\hat s = |\hat S|$ be the number of variables selected. For $\alpha \in (0,1)$, the level-$\alpha$ SDP is defined as $P\{|\widehat{\mathrm{corr}}_n(Y, \hat Y)| \le q^{\mathrm{CMB}}_\alpha(\hat s, p)\}$. As the simulated model is not null, this SDP is in fact a type II error. Given $\alpha = 5\%$ and for each simulated data set, $q^{\mathrm{CMB}}_\alpha(s,p)$ is computed based on 1000 bootstrap replications. We then compute the empirical SDP based on 200 simulations. The results are given in Table 3.
In this study, the design matrix is chosen so that there is a low-dimensional linear dependency among the high-dimensional covariates. The collected covariates are highly correlated when $r$ is much smaller than $p$. It is known that collinearity and high dimensionality add difficulty to the problem of variable selection and deteriorate the performance of the LASSO; the smaller $r$ is, the more severe the collinearity becomes. As reflected in Table 3, the empirical SDP increases as $r$ decreases, indicating that the correlation between the fitted and observed responses is more likely to fall below the spurious correlation.
Table 3
Empirical $\alpha$-level spurious discovery probability (ESDP) based on 200 simulations when $p = 400$, $n = 100, 120, 160$, and $\alpha = 5\%$

          r=120   r=160   r=200   r=240   r=280   r=320   r=360
n = 100   0.6950  0.6650  0.6000  0.5200  0.5100  0.4500  0.4000
n = 120   0.6600  0.5350  0.3350  0.3800  0.2500  0.2850  0.1950
n = 160   0.1950  0.1300  0.0500  0.0400  0.0550  0.0700  0.0250
6.4. Model selection. We demonstrate the idea in Section 5.2 through the following toy example. Consider the linear model (4.1) with $(n, p) = (160, 400)$ and $\beta = (1, 0, 0.8, 0, 0.6, 0, 0.4, 0, \ldots, 0)^T \in \mathbb{R}^p$. The covariate vector is taken to be $X = \Gamma x$ with $x = (x_1, \ldots, x_{200})^T$, where $x_1$