Yongkun Li (Department of Mathematical Sciences, Tsinghua University); Lily He (Department of Mathematical Sciences, Tsinghua University); Rong Lucy He (Department of Biological Sciences, Chicago State University); Stephen S.-T. Yau (corresponding author; Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:1903.42002
With the sharp increase in biological sequence data, traditional sequence alignment methods have become
unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence
analysis. Among these, many kinds of feature vector methods have been established and applied to the
reconstruction of species phylogeny. The vectors consist of numerical features chosen for particular
biological problems. The features may come from the primary sequences or from the secondary or
three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector
based only on the primary sequences of organisms to build their phylogeny. Three chemical and physical
properties of primary sequences (purine, pyrimidine and keto) are incorporated into the vector. Using
each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of
letters, so three sequences are constructed, one per property. For each letter of each sequence we
calculate the number of occurrences of the letter, its average position and the variation of its
positions in the sequence. Tested on several datasets of mammals, viruses and bacteria, this new tool
is fast and accurate for inferring the phylogeny of organisms.
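The construction described in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The abstract does not spell out the exact two-letter groupings, so the grouping is left as a parameter. The sets `PURINES` and `KETO`, the function names, the 1-based positions, and the use of the population variance for "variation" are all assumptions made here. With three groupings, this layout would give 3 × 2 × 3 = 18 numbers per sequence.

```python
from statistics import fmean, pvariance

# Standard chemical groupings of nucleotides; which groupings the paper
# actually uses is NOT stated in the abstract, so these are assumptions.
PURINES = frozenset("AG")   # A and G are purines; C and T are pyrimidines
KETO = frozenset("GT")      # G and T are keto bases; A and C are amino bases


def two_letter_features(seq, group):
    """Features of the two-letter sequence induced by `group`.

    Each base is treated as "in group" or "not in group". For each of the
    two letters we return: count, mean position, variance of positions.
    Positions are 1-based (an assumption), and the "variation" is taken
    to be the population variance (also an assumption).
    """
    seq = seq.upper()
    inside = [i for i, base in enumerate(seq, start=1) if base in group]
    outside = [i for i, base in enumerate(seq, start=1) if base not in group]
    features = []
    for positions in (inside, outside):
        if positions:
            features += [len(positions), fmean(positions), pvariance(positions)]
        else:
            # A letter that never occurs contributes zeros.
            features += [0, 0.0, 0.0]
    return features


def feature_vector(seq, groups=(PURINES, KETO)):
    """Concatenate the six features of every grouping into one vector."""
    vector = []
    for group in groups:
        vector.extend(two_letter_features(seq, group))
    return vector


if __name__ == "__main__":
    print(feature_vector("AGCT"))
```

Vectors computed this way for different genomes could then be compared with an ordinary distance such as the Euclidean distance; the choice of distance is likewise not specified in the abstract.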
Lily He (Department of Mathematical Sciences, Tsinghua University); Yongkun Li (Department of Mathematical Sciences, Tsinghua University); Rong Lucy He (Department of Biological Sciences, Chicago State University); Stephen S.-T. Yau (corresponding author; Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:1903.42001
Journal of Theoretical Biology, 427, 41-52, 2017.6
Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply in the past decade. Traditionally, protein sequences are compared through multiple sequence alignment. However, alignment methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes, and the computation is very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, the polar requirement and the chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method uses not only the chemical properties of amino acids but also their counts and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
Processing streaming data as they arrive is often necessary in high-dimensional data analysis. In this paper, we analyze the convergence of a subspace online PCA iteration, as a follow-up to the recent work of Li, Wang, Liu, and Zhang [Math. Program., Ser. B, DOI 10.1007/s10107-017-1182-z], who considered only the most significant principal component, i.e., a single vector. Under the sub-Gaussian assumption, we obtain a finite-sample error bound that closely matches the minimax information lower bound of Vu and Lei [Ann. Statist. 41:6 (2013), 2905-2947].
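To make the idea of a subspace online PCA iteration concrete, here is a minimal sketch of one common form: an Oja-style update followed by re-orthonormalization. The abstract does not state the exact iteration analyzed in the paper, so this should be read as a generic illustration only. The function names, the step size `eta`, the Gram-Schmidt orthonormalization, and the synthetic Gaussian data are all assumptions introduced here.

```python
import math
import random


def orthonormalize(cols):
    """Modified Gram-Schmidt on a list of column vectors (lists of floats)."""
    basis = []
    for v in cols:
        w = list(v)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, w))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis


def online_pca_step(cols, x, eta):
    """One Oja-style update U <- orth(U + eta * x x^T U) for a new sample x."""
    updated = []
    for u in cols:
        proj = sum(a * b for a, b in zip(x, u))  # scalar x^T u
        updated.append([ui + eta * proj * xi for ui, xi in zip(u, x)])
    return orthonormalize(updated)


def run_demo(n_samples=2000, eta=0.01, seed=0):
    """Stream synthetic Gaussian samples with covariance diag(10, 8, 0.1, 0.1, 0.1)
    and track a 2-dimensional subspace estimate (hypothetical test setup)."""
    rng = random.Random(seed)
    scales = [math.sqrt(v) for v in (10.0, 8.0, 0.1, 0.1, 0.1)]
    cols = orthonormalize(
        [[rng.gauss(0.0, 1.0) for _ in scales] for _ in range(2)]
    )
    for _ in range(n_samples):
        x = [s * rng.gauss(0.0, 1.0) for s in scales]  # one streamed sample
        cols = online_pca_step(cols, x, eta)
    return cols


if __name__ == "__main__":
    estimate = run_demo()
    # Squared mass of the estimate on the first two coordinates; a value
    # near 2 means the estimate spans the top-2 principal subspace.
    print(sum(c[0] ** 2 + c[1] ** 2 for c in estimate))
```

Each sample is used once and then discarded, so memory stays proportional to the size of the subspace estimate rather than to the length of the stream.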
We propose to combine cepstrum and nonlinear time–frequency (TF) analysis
to study multiple-component oscillatory signals with time-varying frequency and
amplitude and with a time-varying non-sinusoidal oscillatory pattern. The concept of
cepstrum is applied to eliminate the influence of the wave-shape function on the TF analysis,
and we propose a new algorithm, named the de-shape synchrosqueezing transform (de-shape
SST). A mathematical model, the adaptive non-harmonic model, is introduced,
and the de-shape SST algorithm is theoretically analyzed. In addition to simulated
signals, several different physiological, musical and biological signals are analyzed to
illustrate the proposed algorithm.