Yannis G. Yatracos
Yau Mathematical Sciences Center, Tsinghua University, Beijing; Beijing Institute of Mathematical Sciences and Applications
Machine Learning
mathscidoc:2206.41013
2022.5
In a Data-Generating Experiment (DGE), the data, X, are often obtained either from a Black-Box with inputs θ and Y, or from a Quantile function or a learning machine, f(Y, θ); θ is unknown, an element of a metric space (Θ, ρ), and Y is random. If X has an intractable or unknown c.d.f., F_θ, non-identifiability of θ cannot be confirmed and, when present, limits, among other things, the predictive accuracy of the learned model, f(Y, \hat{θ}), where \hat{θ} is an estimate of θ. In Machine Learning, non-identifiability of θ is ubiquitous and its extent is a criterion for selecting a learning machine. Empirical indices, EDI and PPVI, are introduced using P-values of Kolmogorov-Smirnov tests: i) to confirm almost surely, using generated data, the discrimination of θ from θ^∗, namely that the Kolmogorov distance, d_K(F_θ, F_{θ^∗}), is positive; ii) to confirm identifiability of θ ∈ Θ by repeating i) for θ^∗ in a sieve of Θ, since neighboring parameter values are in practice indistinguishable; and iii) most important, to compare EDI-graphs of DGEs, preferring more discrimination and less non-identifiability among parameters, and to select one DGE to use. In applications, EDI-graphs confirm non-identifiability in mixture models and in models parametrised with sums of parameters. EDI and PPVI explain why Tukey's g-and-h model (DGE1) has better g-discrimination than the g-and-k model (DGE2) (with h_0 = k_0), unless the sample size is extremely large. EDI-graphs indicate that Normal learning machines have better parameter discrimination than Sigmoid learning machines and their parameters are non-identifiable.
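The discrimination check in i) can be illustrated with a minimal sketch, under stated assumptions. The abstract does not give the definitions of EDI or PPVI, so the code below does not implement them; it only averages two-sample Kolmogorov-Smirnov P-values over repeated generated samples, which is one hypothetical way to summarise whether θ and θ^∗ are discriminated. The toy DGE `dge_sum` (X = θ1 + θ2 + Y, with Y standard normal), the function names, the sample size, and the number of repetitions are all assumptions chosen for illustration; the P-value uses the standard asymptotic Kolmogorov series.

```python
import math
import random


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_n(t) - G_m(t)|."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        t = min(xs[i], ys[j])
        # Advance both empirical c.d.f.s past the current point t.
        while i < n and xs[i] <= t:
            i += 1
        while j < m and ys[j] <= t:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic P-value from the Kolmogorov series (an approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series is numerically 1 in this range
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def dge_sum(theta, n, rng):
    """Hypothetical toy DGE: X = theta1 + theta2 + Y, Y ~ N(0, 1)."""
    return [theta[0] + theta[1] + rng.gauss(0.0, 1.0) for _ in range(n)]


def mean_pvalue(dge, theta, theta_star, n=200, reps=200, seed=0):
    """Average KS P-value over repeated samples (illustrative summary only,
    not the paper's EDI or PPVI)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = dge(theta, n, rng)
        x_star = dge(theta_star, n, rng)
        total += ks_pvalue(ks_statistic(x, x_star), n, n)
    return total / reps


if __name__ == "__main__":
    # Same sum of parameters: the two c.d.f.s coincide, so P-values stay large.
    print(mean_pvalue(dge_sum, (1.0, 2.0), (2.0, 1.0)))
    # Different sum: the Kolmogorov distance is positive, so P-values are small.
    print(mean_pvalue(dge_sum, (1.0, 2.0), (1.0, 3.0)))
```

In this toy model, θ = (1, 2) and θ^∗ = (2, 1) generate the same distribution, so the average P-value stays far from zero, while θ^∗ = (1, 3) shifts the distribution and drives it toward zero. Repeating the comparison over a grid of θ^∗ values would mimic the sieve in ii), again only as an illustration.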
@inproceedings{yannis2022selection,
title={Selection of data-generating experiments identifiability and expected P-values},
author={Yannis G. Yatracos},
url={http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20220621164030684601424},
year={2022},
}
Yannis G. Yatracos. Selection of data-generating experiments identifiability and expected P-values. 2022. http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20220621164030684601424.