Prochlorococcus marinus, one of the most abundant marine cyanobacteria in the global ocean, is classified into low-light (LL) and high-light (HL) adapted ecotypes. The two ecotypes differ in their ecophysiological characteristics, most notably in whether they are adapted for growth at high or low light intensities. However, some evolutionary relationships within the Prochlorococcus phylogeny remain unresolved, such as whether the strains SS120 and MIT9211 form a monophyletic group. We use the natural vector (NV) method, which is alignment-free and requires no model assumptions, to represent the sequences and infer the phylogeny of Prochlorococcus. This study adds the covariances of amino acids in a protein sequence to the natural vector. Based on these new natural vectors, we compute the Hausdorff distance between two clades, which represents their dissimilarity. This method enables us to systematically analyze both the dataset of ribosomal proteomes and the dataset of 16S-23S rRNA sequences in order to reconstruct the phylogeny of Prochlorococcus. Furthermore, we apply classification to inspect the relationship between SS120 and MIT9211. From the reconstructed phylogenetic trees and the classification results, we may conclude that SS120 does not cluster with MIT9211. This study demonstrates a new method for performing phylogenetic analysis. The results confirm that these two strains do not form a monophyletic clade in the phylogeny of Prochlorococcus.
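As a rough illustration of the clade-to-clade comparison described above, the following sketch computes the standard Hausdorff distance between two finite sets of feature vectors. The use of Euclidean distance between vectors and the toy two-dimensional points are assumptions made for illustration; the abstract does not specify these details, and real natural vectors have many more components.

```python
import math

def directed_hausdorff(set_a, set_b):
    # For every vector in set_a, find the distance to its nearest vector in set_b,
    # then keep the largest of those nearest-neighbour distances.
    return max(min(math.dist(a, b) for b in set_b) for a in set_a)

def hausdorff(set_a, set_b):
    # The Hausdorff distance is the larger of the two directed distances.
    return max(directed_hausdorff(set_a, set_b), directed_hausdorff(set_b, set_a))

# Hypothetical toy "clades", each a list of feature vectors (one per strain).
clade_a = [(0.0, 0.0), (1.0, 0.0)]
clade_b = [(0.0, 3.0), (4.0, 0.0)]
print(hausdorff(clade_a, clade_b))
```

A small value means every member of each clade has a close counterpart in the other clade; a large value means at least one member is far from the whole of the other clade.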
Structures and functions of proteins play essential roles in biological processes. The functions of newly discovered proteins can be predicted by comparing their structures with those of proteins of known function. Many approaches have been proposed for measuring protein structure similarity, such as the template-modeling (TM)-score method, the GRaphlet (GR)-Align method, and the commonly used root-mean-square deviation (RMSD) measure. However, alignment-based structure comparison is time-consuming on large datasets, and its accuracy still has room to improve. In this study, we introduce a new three-dimensional (3D) Yau–Hausdorff distance between any two 3D objects. In particular, the 3D Yau–Hausdorff distance can be used to measure the similarity or dissimilarity of two proteins of any size and does not require aligning and superimposing the two structures. We apply structural similarity to study functional similarity and perform phylogenetic analysis on several datasets. The results show that the 3D Yau–Hausdorff distance could serve as a more precise and effective method for discovering biological relationships between proteins than other structure comparison methods.
Background: In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of these data cannot keep up with the rate at which new data become available.
Results: In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, the augmented 18-dimensional natural vector makes good use of the information available in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorithm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus Megacollybia to validate our method. We also studied how effective our method is at determining the genetic distance within and between species in our barcoding dataset.
Conclusions: The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.
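For readers unfamiliar with the "original natural vector" mentioned in the Results, the sketch below computes the commonly used 12-dimensional version for a DNA sequence: for each nucleotide, its count, its mean position, and a normalised second central moment of its positions. The normalisation by (count × sequence length) follows the usual natural-vector convention and is an assumption here, as is the 1-based position indexing. The six covariance components that extend the vector to 18 dimensions are not reproduced, because the abstract does not state their formula.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., second moments...] for seq (12 numbers for DNA)."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        # 1-based positions at which this nucleotide occurs.
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: use zeros for all three of its components.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalised by n_k * n (assumed convention).
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments

print(natural_vector("ACGTA"))
```

Vectors of this kind can be passed directly to a standard classifier as fixed-length features, which is what makes the approach alignment-free.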
This study quantitatively validates the principle that the biological properties associated with a given genotype are determined by the distribution of amino acids. In order to visualize this central law of molecular biology, each protein was represented by a point in 250-dimensional space based on its amino acid distribution. Proteins from the same family are found to cluster together, leading to the principle that the convex hull surrounding the protein points of one family does not intersect the convex hulls of other protein families. This principle was verified computationally for all available and reliable protein kinases and human proteins. In addition, we generated 2,328,761 figures to show that the convex hulls of different families are disjoint from each other. The high and robust classification accuracy (95.75% and 97.5%), together with reasonable phylogenetic trees, further validates our methods.
Analyzing phylogenetic relationships using mathematical methods has always been important in bioinformatics. Quantitative research can interpret raw biological data in a precise way. Multiple Sequence Alignment (MSA) is frequently used to analyze biological evolution, but it is very time-consuming. When the scale of the data is large, alignment methods cannot finish the calculation in reasonable time. Therefore, we present a new method that uses moments of the cumulative Fourier power spectrum to cluster DNA sequences. Each sequence is translated into a vector in Euclidean space, and distances between the vectors reflect the relationships between sequences. The mapping between the spectra and the moment vectors is one-to-one, which means that no information in the power spectra is lost during the calculation. We cluster and classify several datasets, including Influenza A, primate, and human rhinovirus (HRV) datasets, to build phylogenetic trees. The results show that the proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs for the cumulative Fourier power spectrum are available on GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
Rui Dong (Tsinghua University), Hui Zheng (The University of Illinois at Chicago), Kun Tian (Tsinghua University), Shek-Chung Yau (The Hong Kong University of Science and Technology), Weiguang Mao (Tsinghua University), Wenping Yu (Nankai University), Changchuan Yin (The University of Illinois at Chicago), Chenglong Yu (South Australian Health and Medical Research Institute), Rong Lucy He (Chicago State University), Jie Yang (The University of Illinois at Chicago), Stephen S.-T. Yau (Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42004
We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system to serve people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single- and multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system computes natural vectors and their distances for submitted genomes, provides an online interface for accessing and using the database for viral classification and prediction, and runs back-end processes for automatic and manual updating of the database content to synchronize it with GenBank. Genome data submitted in FASTA format are processed, and the prediction results, consisting of the 5 closest neighbors and their classifications, are returned by email. Considering the one-to-one correspondence between sequence and natural vector, its time efficiency, and its high accuracy, the natural vector is a significant advance over alignment methods, which makes VirusDB a useful database for further research.
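The prediction step described above (returning the closest reference genomes and their classifications) can be sketched as a nearest-neighbour lookup over stored vectors. This is only an illustration under stated assumptions: the function name `nearest_neighbors`, the Euclidean metric, and the tiny reference table are hypothetical and are not taken from the VirusDB implementation.

```python
import math

def nearest_neighbors(query, references, k=5):
    """Return the k reference entries closest to `query` as (name, label, distance) tuples.

    `references` maps a sequence name to a (vector, classification label) pair.
    """
    scored = [
        (name, label, math.dist(query, vector))
        for name, (vector, label) in references.items()
    ]
    # Sort by distance so the most similar reference genomes come first.
    scored.sort(key=lambda entry: entry[2])
    return scored[:k]

# Hypothetical reference table; real natural vectors have many more components.
references = {
    "ref_1": ((0.0, 0.0), "family_X"),
    "ref_2": ((3.0, 4.0), "family_Y"),
    "ref_3": ((1.0, 0.0), "family_X"),
}
print(nearest_neighbors((0.0, 0.0), references, k=2))
```

A predicted classification could then be taken from the labels of the returned neighbours, for example by majority vote.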
Yongkun Li (Department of Mathematical Sciences, Tsinghua University), Lily He (Department of Mathematical Sciences, Tsinghua University), Rong Lucy He (Department of Biological Sciences, Chicago State University), Stephen S.-T. Yau (corresponding author; Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42003
Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated in Uganda in 1947 and has been an emerging pathogen since 2007. However, because of inconsistencies among alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains, in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to reconstruct the genetic relationships among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originated in Africa and then spread to Asia, the Pacific islands, and throughout the Americas. We also perform phylogenetic analysis of 53 viruses in the genus Flavivirus, to which ZIKV belongs, using complete proteins. Our conclusion is consistent with the classification by hosts and transmission vectors.
Yongkun Li (Department of Mathematical Sciences, Tsinghua University), Lily He (Department of Mathematical Sciences, Tsinghua University), Rong Lucy He (Department of Biological Sciences, Chicago State University), Stephen S.-T. Yau (corresponding author; Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42002
With the sharp increase in available biological sequences, traditional sequence alignment methods have become unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors consist of numerical features chosen for particular biological problems. The features may come from the primary sequences or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector based only on the primary sequences of organisms to build their phylogeny. Three chemical and physical properties of primary sequences (purine, pyrimidine and keto) are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters, so three sequences are constructed according to the three properties. For each letter of each sequence, we calculate the number of occurrences of the letter, its average position, and the variation of its positions in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
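The construction described above (three two-letter recodings, and for each letter a count, an average position and a positional variation) can be sketched as follows. Several details are assumptions, since the abstract does not give them: the three groupings used here are the conventional purine/pyrimidine, amino/keto and weak/strong splits, the "variation" is taken to be the population variance of the 1-based positions, and the name `property_vector` is hypothetical.

```python
from statistics import mean, pvariance

# Assumed two-letter recodings of the nucleotide alphabet (conventional IUPAC-style groups).
GROUPINGS = {
    "purine/pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino/keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak/strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}

def property_vector(seq):
    """Return 18 numbers: 3 groupings x 2 letters x (count, mean position, variance)."""
    features = []
    for mapping in GROUPINGS.values():
        # Recode the sequence into a two-letter alphabet for this property.
        # Characters other than A, C, G, T raise KeyError in this sketch.
        recoded = [mapping[ch] for ch in seq.upper()]
        for letter in sorted(set(mapping.values())):
            positions = [i + 1 for i, ch in enumerate(recoded) if ch == letter]
            if positions:
                features += [len(positions), mean(positions), pvariance(positions)]
            else:
                # Letter absent from the recoded sequence: use zeros.
                features += [0, 0.0, 0.0]
    return features

print(property_vector("ACGT"))
```

Because every sequence maps to a vector of the same length regardless of its own length, pairwise distances between such vectors can be fed to any standard tree-building or clustering routine.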
Lily He (Department of Mathematical Sciences, Tsinghua University), Yongkun Li (Department of Mathematical Sciences, Tsinghua University), Rong Lucy He (Department of Biological Sciences, Chicago State University), Stephen S.-T. Yau (corresponding author; Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42001
Journal of Theoretical Biology, 427, 41-52, 2017.6
Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply in the past decade. Traditionally, protein sequences are compared using multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, polar requirement, and chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method utilizes not only the chemical properties of amino acids but also their numbers and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
Processing streaming data as they arrive is often necessary for high-dimensional data analysis. In this paper, we analyze the convergence of a subspace online PCA iteration, as a follow-up to the recent work of Li, Wang, Liu, and Zhang [Math. Program., Ser. B, DOI 10.1007/s10107-017-1182-z], who considered only the most significant principal component, i.e., a single vector. Under the sub-Gaussian assumption, we obtain a finite-sample error bound that closely matches the minimax information lower bound of Vu and Lei [Ann. Statist. 41:6 (2013), 2905-2947].