Data-based detection and quantification of causation in complex, nonlinear dynamical systems is of paramount importance to science, engineering, and beyond. Inspired by the cross-map-based techniques that have been widely used in recent years, we develop a general framework to advance towards a comprehensive understanding of dynamical causal mechanisms, which is consistent with the natural interpretation of causality. In particular, instead of measuring the smoothness of the cross-map as conventionally implemented, we define causation by directly measuring the scaling law for the continuity of the investigated dynamical system. The uncovered scaling law enables accurate, reliable, and efficient detection of causation and assessment of its strength in general complex dynamical systems, outperforming existing representative methods. The continuity scaling-based framework is rigorously established and demonstrated using datasets from model complex systems and the real world.
Rui Dong (Yau Mathematical Sciences Center, Tsinghua University, Beijing, China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China); Taojun Hu (Department of Biostatistics, School of Public Health, Peking University, Beijing 100191, China); Yunjun Zhang (Department of Biostatistics, School of Public Health, Peking University, Beijing 100191, China); Yang Li (Chongqing School, University of Chinese Academy of Sciences, Chongqing 400020, China); Xiao-Hua Zhou (Department of Biostatistics, School of Public Health, Peking University, Beijing 100191, China; Beijing International Center for Mathematical Research, Peking University, Beijing 100191, China)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2204.42002
Omicron, the latest SARS-CoV-2 Variant of Concern (VOC), first appeared in Africa in November 2021. At present, the question of whether a new VOC will out-compete the currently predominant variant is important for governments seeking to determine if current surveillance strategies and responses are appropriate and reasonable. Based on both virus genomes and daily-confirmed cases, we compare the additive differences in growth rates and reproductive numbers (R_0) between VOCs and their predominant variants through a Bayesian framework and phylo-dynamics analysis. We also evaluate the effects of current policies and vaccinations against VOCs and predominant variants. The model also predicts the date on which a VOC may become dominant, based on simulation and on real data from the early stage. The results suggest that the overall additive difference in growth rates between B.1.617.2 and predominant variants was 0.44 (95% confidence interval (CI): −0.38, 1.25) in February 2021, and that the VOC had a relatively high R_0. The additive difference in the growth rate of BA.1 in the United Kingdom was 6.82 times the difference between Delta and Alpha, and the model successfully predicted the dominating process of Alpha, Delta and Omicron. Current vaccination strategies remain similarly effective against Delta compared to the previous variants. Our model provides a reliable Bayesian framework to predict the spread trends of VOCs based on early-stage data and to evaluate the effects of public health policies, which may help us better prepare for the upcoming Omicron variant, which is now spreading at an unprecedented speed.
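To make the link between an additive growth-rate difference and a predicted dominance date concrete, here is a minimal sketch in Python. It assumes a deterministic two-variant competition model in which the odds of the new variant grow exponentially at the rate difference; this is a textbook simplification and not the Bayesian framework of the paper. The function names, the initial frequency and the rate used in the usage note are illustrative assumptions.

```python
import math


def variant_frequency(t, p0, delta_r):
    """Frequency of the new variant after t days.

    Assumes two variants growing exponentially, so the odds p/(1-p)
    grow as exp(delta_r * t), where delta_r is the additive difference
    in growth rates (per day) and p0 is the frequency at t = 0.
    """
    odds = (p0 / (1.0 - p0)) * math.exp(delta_r * t)
    return odds / (1.0 + odds)


def days_to_dominance(p0, delta_r, threshold=0.5):
    """Days until the new variant reaches `threshold` frequency.

    Returns 0.0 if it is already there and None if it never gets there
    (non-positive growth-rate difference).
    """
    if p0 >= threshold:
        return 0.0
    if delta_r <= 0:
        return None
    odds0 = p0 / (1.0 - p0)
    target_odds = threshold / (1.0 - threshold)
    return math.log(target_odds / odds0) / delta_r
```

With the hypothetical inputs `p0=0.01` and `delta_r=0.1` per day, `days_to_dominance` gives roughly 46 days. A Bayesian treatment would replace the single `delta_r` with a posterior distribution and report a credible interval for the date.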
Rui Dong (Yau Mathematical Sciences Center, Tsinghua University, Beijing, China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China); Shaojun Pei (Department of Mathematical Sciences, Tsinghua University, Beijing, China); Mengcen Guan (Department of Mathematical Sciences, Tsinghua University, Beijing, China); Shek-Chung Yau (Information Technology Services Center, The Hong Kong University of Science and Technology, Kowloon, Hong Kong, China); Changchuan Yin (Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL, United States); Rong L. He (Department of Biological Sciences, Chicago State University, Chicago, IL, United States); Stephen S.-T. Yau (Department of Mathematical Sciences, Tsinghua University, Beijing, China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2204.42001
A comprehensive description of human genomes is essential for understanding human evolution and relationships between modern populations. However, most published literature focuses on local alignment comparison of several genes rather than the complete evolutionary record of individual genomes. Using data from the 1,000 Genomes Project, we reconstructed 2,504 individual genomes and propose the Divided Natural Vector method to analyze the distribution of nucleotides in the genomes. Comparisons based on autosomes, sex chromosomes and mitochondrial genomes reveal the genetic relationships between populations, and different inheritance patterns lead to different phylogenetic results. Results based on mitochondrial genomes confirm the “out-of-Africa” hypothesis and suggest that humans, at least females, most likely originated in eastern Africa. The reconstructed genomes are stored on our server and can be further used for any genome-scale analysis of humans (http://yaulab.math.tsinghua.edu.cn/2022_1000genomesprojectdata/). This project provides the complete genomes of thousands of individuals and lays the groundwork for genome-level analyses of the genetic relationships between populations and the origin of humans.
Xiaojie Qiu (Whitehead Institute for Biomedical Research, Cambridge, MA, USA); Yan Zhang (Department of Computational and System Biology, University of Pittsburgh, Pittsburgh, PA, USA); Jorge D. Martin-Rufino (Broad Institute of MIT and Harvard, Cambridge, MA, USA); Chen Weng (Whitehead Institute for Biomedical Research, Cambridge, MA, USA); Shayan Hosseinzadeh (Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA); Jianhua Xing (Department of Computational and System Biology, University of Pittsburgh, Pittsburgh, PA, USA); Jonathan Weissman (Whitehead Institute for Biomedical Research, Cambridge, MA, USA)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2202.42001
Single-cell (sc)RNA-seq, together with RNA velocity and metabolic labeling, reveals cellular states and transitions at unprecedented resolution. Fully exploiting these data, however, requires kinetic models capable of unveiling governing regulatory functions. Here, we introduce an analytical framework dynamo (https://github.com/aristoteleo/dynamo-release), which infers absolute RNA velocity, reconstructs continuous vector fields that predict cell fates, employs differential geometry to extract underlying regulations, and ultimately predicts optimal reprogramming paths and perturbation outcomes. We highlight dynamo’s power to overcome fundamental limitations of conventional splicing-based RNA velocity analyses to enable accurate velocity estimations on a metabolically labeled human hematopoiesis scRNA-seq dataset. Furthermore, differential geometry analyses reveal mechanisms driving early megakaryocyte appearance and elucidate asymmetrical regulation within the PU.1-GATA1 circuit. Leveraging the least-action-path method, dynamo accurately predicts drivers of numerous hematopoietic transitions. Finally, in silico perturbations predict cell-fate diversions induced by gene perturbations. Dynamo, thus, represents an important step in advancing quantitative and predictive theories of cell-state transitions.
Forest above-ground biomass (AGB) can be estimated based on light detection and ranging (LiDAR) point clouds. This paper introduces an accurate and detailed quantitative structure model (AdQSM), which can estimate the AGB of large tropical trees. AdQSM is based on the reconstruction of 3D tree models from terrestrial laser scanning (TLS) point clouds. It represents a tree as a set of closed and complete convex polyhedra. We use AdQSM to model 29 trees of 18 species scanned by TLS at three study sites (the dense tropical forests of Peru, Indonesia, and Guyana). Destructively sampled tree geometry measurements are used as reference values to evaluate the accuracy of diameter at breast height (DBH), tree height, tree volume, branch volume, and AGB estimated from AdQSM. After AdQSM reconstructs the structure and volume of each tree, AGB is derived by combining the volume with the wood density of the specific tree species obtained from destructive sampling. The AGB estimates from AdQSM and the post-harvest reference measurements show satisfactory agreement: the coefficient of variation of root mean square error (CV-RMSE) and the concordance correlation coefficient (CCC) are 20.37% and 0.97, respectively. AdQSM provides accurate tree volume estimation, regardless of the characteristics of the tree structure, without major systematic deviations. We compared the accuracy of AdQSM and TreeQSM in modeling the volume of the 29 trees. For AdQSM, the coefficient of determination (R2), relative bias (rBias), and CV-RMSE of tree volume against the reference values are 0.96, 6.98%, and 22.62%, respectively. For TreeQSM, the corresponding R2, rBias, and CV-RMSE are 0.94, −9.69%, and 23.20%, respectively. The CCCs between the reference values and the volume estimates from AdQSM and TreeQSM are 0.97 and 0.96, respectively. AdQSM also models the branches in detail. The branch volume from AdQSM is compared with the destructive measurement reference data; the R2, rBias, and CV-RMSE of branch volume are 0.97, 12.38%, and 36.86%, respectively. The DBH and height of the harvested trees were used as reference values to test the accuracy of AdQSM’s estimates of DBH and tree height. The R2, rBias, and CV-RMSE of DBH are 0.94, −5.01%, and 9.06%, respectively, and those of tree height are 0.95, 1.88%, and 5.79%, respectively. This paper provides not only a new QSM method for estimating AGB based on TLS point clouds but also the potential for further development and testing of allometric equations.
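The agreement statistics quoted in this abstract can be computed from paired estimates and reference values as in the sketch below. The abstract does not give the formulas, so the definitions used here are assumptions: rBias and CV-RMSE are normalized by the mean reference value and expressed in percent, and CCC is Lin's concordance correlation coefficient with population (divide-by-n) moments. The function names are hypothetical.

```python
import math


def r_bias(est, ref):
    """Relative bias in percent: mean error divided by the mean reference."""
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_err = sum(e - r for e, r in zip(est, ref)) / n
    return 100.0 * mean_err / mean_ref


def cv_rmse(est, ref):
    """Coefficient of variation of the RMSE in percent (RMSE / mean reference)."""
    n = len(ref)
    mean_ref = sum(ref) / n
    rmse = math.sqrt(sum((e - r) ** 2 for e, r in zip(est, ref)) / n)
    return 100.0 * rmse / mean_ref


def ccc(est, ref):
    """Lin's concordance correlation coefficient using population moments."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    var_e = sum((e - mean_e) ** 2 for e in est) / n
    var_r = sum((r - mean_r) ** 2 for r in ref) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(est, ref)) / n
    return 2.0 * cov / (var_e + var_r + (mean_e - mean_r) ** 2)
```

Unlike a plain correlation coefficient, CCC penalizes both scale and location shifts, which is why an estimator that systematically doubles every volume can correlate perfectly with the reference and still score a low CCC.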
Laser scanning is an effective tool for acquiring geometric attributes of trees and vegetation, which lays a solid foundation for 3-dimensional tree modelling. Existing studies on tree modelling from laser scanning data are vast. However, some works cannot guarantee sufficient modelling accuracy, while some other works are mainly rule-based and therefore highly depend on user inputs. In this paper, we propose a novel method to accurately and automatically reconstruct detailed 3D tree models from laser scans. We first extract an initial tree skeleton from the input point cloud by establishing a minimum spanning tree using the Dijkstra shortest-path algorithm. Then, the initial tree skeleton is pruned by iteratively removing redundant components. After that, an optimization-based approach is performed to fit a sequence of cylinders to approximate the geometry of the tree branches. Experiments on various types of trees from different data sources demonstrate the effectiveness and robustness of our method. The overall fitting error (i.e., the distance between the input points and the output model) is less than 10 cm. The reconstructed tree models can be further applied in the precise estimation of tree attributes, urban landscape visualization, etc. The source code of this work is freely available at https://github.com/tudelft3d/adtree.
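The skeleton-extraction step described in this abstract can be illustrated with a small sketch. It assumes the points are (x, y, z) tuples, connects points closer than a hypothetical `radius` parameter, takes the lowest point as the trunk base, and runs Dijkstra from there; the resulting parent pointers form a tree rooted at the base. This is a brute-force illustration, not the code from the linked repository, and the neighbourhood rule and root choice are assumptions.

```python
import heapq
import math


def shortest_path_skeleton(points, radius):
    """Build a rooted tree over a point cloud with Dijkstra's algorithm.

    Returns (root, parent, dist): the root index, the parent index of
    every point (None for the root and for unreachable points), and the
    path length from the root to every point.
    """
    n = len(points)
    root = min(range(n), key=lambda i: points[i][2])  # lowest z = trunk base
    dist = [math.inf] * n
    parent = [None] * n
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v in range(n):
            if v == u:
                continue
            w = math.dist(points[u], points[v])
            # Only points within `radius` count as neighbours.
            if w <= radius and d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    return root, parent, dist
```

The all-pairs neighbour scan is quadratic in the number of points; a real point cloud would need a spatial index such as a k-d tree. Pruning and cylinder fitting would then operate on the `parent` structure.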
The stochasticity of gene expression is manifested in the fluctuations of mRNA and protein copy numbers within a cell lineage over time. While data of this type can be obtained for many generations, most mathematical models are unsuitable to interpret such data since they assume non-growing cells. Here we develop a theoretical approach that quantitatively links the frequency content of lineage data to subcellular dynamics. We elucidate how the position, height, and width of the peaks in the power spectrum provide a distinctive fingerprint that encodes a wealth of information about mechanisms controlling transcription, translation, replication, degradation, bursting, promoter switching, cell cycle duration, cell division, gene dosage compensation, and cell size homeostasis. Predictions are confirmed by analysis of single-cell Escherichia coli data obtained using fluorescence microscopy. Furthermore, by matching the experimental and theoretical power spectra, we infer the temperature-dependent gene expression parameters, without the need for measurements relating fluorescence intensities to molecule numbers.
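As a rough illustration of what the position of a peak in the power spectrum means for lineage data, the sketch below estimates a plain periodogram of an evenly sampled time series and returns the frequency of its largest peak. It is a generic signal-processing example, not the theory developed in the paper; the function names, the normalization and the sawtooth test signal in the usage note are assumptions.

```python
import cmath
import math


def periodogram(x, dt):
    """Periodogram of an evenly sampled series x with sampling interval dt.

    Returns (freqs, power) for the positive frequencies k / (n * dt),
    k = 1 .. n // 2, after removing the mean.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    freqs, power = [], []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * j / n)
                    for j, c in enumerate(centered))
        freqs.append(k / (n * dt))
        power.append(abs(coeff) ** 2 * dt / n)
    return freqs, power


def dominant_frequency(x, dt):
    """Frequency at which the periodogram is largest."""
    freqs, power = periodogram(x, dt)
    return freqs[power.index(max(power))]
```

For a signal that ramps up over 10 samples and then resets, as a crude stand-in for growth followed by division, `dominant_frequency` returns 0.1 cycles per sample, i.e. the inverse of the cycle duration. Peak height and width would additionally require averaging over lineages or windows, which this sketch omits.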
Hau-Tieng Wu (Department of Mathematics, Duke University, Durham); Tze Leung Lai (Department of Statistics, Stanford University, Stanford); Gabriel G. Haddad (Department of Pediatrics and Rady Children’s Hospital, University of California); Alysson Muotri (Department of Cellular & Molecular Medicine and Department of Pediatrics)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2105.45001
Herein we describe new frontiers in mathematical modeling and statistical analysis of oscillatory biomedical signals, motivated by our recent studies of network formation in the human brain during the early stages of life and studies forty years ago on cardiorespiratory patterns during sleep in infants and animal models. The frontiers involve new nonlinear-type time-frequency analysis of signals with multiple oscillatory components, and efficient particle filters for joint state and parameter estimators together with uncertainty quantification in hidden Markov models and empirical Bayes inference.
Songting Li (Shanghai Jiao Tong University); Nan Liu (Beijing Normal University); Li Yao (Beijing Normal University); Xiaohui Zhang (Beijing Normal University); Dongzhuo Zhou (Shanghai Jiao Tong University); David Cai (New York University)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2104.42005
The interplay between excitatory and inhibitory neurons imparts rich functions of the brain. To understand the synaptic mechanisms underlying neuronal computations, a fundamental approach is to study the dynamics of excitatory and inhibitory synaptic inputs of each neuron. The traditional method of determining input conductance, which has been applied for decades, employs the synaptic current-voltage (I-V) relation obtained via voltage clamp. Due to the space clamp effect, the measured conductance is different from the local conductance on the dendrites. Therefore, the interpretation of the measured conductance remains to be clarified. Using theoretical analysis, electrophysiological experiments, and realistic neuron simulations, here we demonstrate that there does not exist a transform between the local conductance and the conductance measured by the traditional method, due to the neglect of a nonlinear interaction between the clamp current and the synaptic current in the traditional method. Consequently, the conductance determined by the traditional method may not correlate with the local conductance on the dendrites, and its value could be unphysically negative as observed in experiment. To circumvent the challenge of the space clamp effect and elucidate synaptic impact on neuronal information processing, we propose the concept of effective conductance which is proportional to the local conductance on the dendrite and reflects directly the functional influence of synaptic inputs on somatic membrane potential dynamics, and we further develop a framework to determine the effective conductance accurately. Our work suggests re-examination of previous studies involving conductance measurement and provides a reliable approach to assess synaptic influence on neuronal computation.
Songting Li (Shanghai Jiao Tong University); Nan Liu (Beijing Normal University); Xiaohui Zhang (Beijing Normal University); David McLaughlin (Courant Institute, New York University); Douglas Zhou (Shanghai Jiao Tong University); David Cai (Courant Institute, New York University)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2104.42004
Proceedings of the National Academy of Sciences of the United States of America, 116, (30), 15244-15252, 2019.7
Complex dendrites in general present formidable challenges to understanding neuronal information processing. To circumvent the difficulty, a prevalent viewpoint simplifies the neuronal morphology as a point representing the soma, and the excitatory and inhibitory synaptic currents originated from the dendrites are treated as linearly summed at the soma. Despite its extensive applications, the validity of the synaptic current description remains unclear, and the existing point neuron framework fails to characterize the spatiotemporal aspects of dendritic integration supporting specific computations. Using electrophysiological experiments, realistic neuronal simulations, and theoretical analyses, we demonstrate that the traditional assumption of linear summation of synaptic currents is oversimplified and underestimates the inhibition effect. We then derive a form of synaptic integration current within the point neuron framework to capture dendritic effects. In the derived form, the interaction between each pair of synaptic inputs on the dendrites can be reliably parameterized by a single coefficient, suggesting the inherent low-dimensional structure of dendritic integration. We further generalize the form of synaptic integration current to capture the spatiotemporal interactions among multiple synaptic inputs and show that a point neuron model with the synaptic integration current incorporated possesses the computational ability of a spatial neuron with dendrites, including direction selectivity, coincidence detection, logical operation, and a bilinear dendritic integration rule discovered in experiment. Our work amends the modeling of synaptic inputs and improves the computational power of a modeling neuron within the point neuron framework.
Budding yeast, which undergoes polarized growth during budding and mating, has been a useful model system to study cell polarization. Bud sites are selected differently in haploid and diploid yeast cells: haploid cells bud in an axial manner, while diploid cells bud in a bipolar manner. While previous studies have focused on the molecular details of bud site selection and polarity establishment, not much is known about how different budding patterns give rise to different functions at the population level. In this paper, we develop a two-dimensional agent-based model to study budding yeast colonies with cell-type specific biological processes, such as budding, mating, mating type switch, consumption of nutrients, and cell death. The model demonstrates that the axial budding pattern enhances mating probability at an early stage and the bipolar budding pattern improves colony development under nutrient limitation. Our results suggest that the frequency of mating type switch might control the trade-off between diploidization and inbreeding. The effect of cellular aging is also studied through our model. Based on the simulations, colonies initiated by an aged haploid cell show a reduced mating probability at an early stage, which recovers as the rejuvenated offspring become the majority. Colonies initiated with aged diploid cells do not show a disadvantage in colony expansion, possibly because young cells contribute the most to colony expansion.
Randomness often plays an important role in the spatial and temporal dynamics of biological systems. General stochastic simulation methods may lead to excessive computational cost for a system in which a large number of molecules are involved. Therefore, multi-scale hybrid simulation methods become important for stochastic simulations. Here we build a spatially hybrid method which couples two approaches: discrete stochastic simulation and continuous stochastic differential equations. In our method, the locations of the interfaces between the two approaches change according to the distribution of molecules in a one-dimensional domain. To balance accuracy and efficiency, the time step of the numerical method for the continuous stochastic differential equations is adapted to the dynamics of the molecules near the adaptive interfaces. The simulation results for a linear system and two nonlinear biological systems in different one-dimensional domains demonstrate the effectiveness and advantage of our new hybrid method with the adaptive time step control.
High-throughput biological technologies (e.g. ChIP-seq, RNA-seq and single-cell RNA-seq) rapidly accelerate the accumulation of genome-wide omics data in diverse interrelated biological scenarios (e.g. cells, tissues and conditions). Integration and differential analysis are two common paradigms for exploring and analyzing such data. However, current integrative methods usually ignore the differential part, and typical differential analysis methods either fail to identify combinatorial patterns of difference or require matched dimensions of the data. Here, we propose a flexible framework CSMF to combine them into one paradigm to simultaneously reveal Common and Specific patterns via Matrix Factorization from data generated under interrelated biological scenarios. We demonstrate the effectiveness of CSMF with four representative applications including pairwise ChIP-seq data describing the chromatin modification map between K562 and Huvec cell lines; pairwise RNA-seq data representing the expression profiles of two different cancers; RNA-seq data of three breast cancer subtypes; and single-cell RNA-seq data of human embryonic stem cell differentiation at six time points. Extensive analysis yields novel insights into hidden combinatorial patterns in these multi-modal data. Results demonstrate that CSMF is a powerful tool to uncover common and specific patterns with significant biological implications from data of interrelated biological scenarios.
Persistent homology is constrained to purely topological persistence, while multiscale graphs account only for geometric information. This work introduces persistent spectral theory to create a unified low-dimensional multiscale paradigm for revealing topological persistence and extracting geometric shapes from high-dimensional datasets. For a point-cloud dataset, a filtration procedure is used to generate a sequence of chain complexes and associated families of simplicial complexes and chains, from which we construct persistent combinatorial Laplacian matrices. We show that a full set of topological persistence can be completely recovered from the harmonic persistent spectra, that is, the spectra that have zero eigenvalues, of the persistent combinatorial Laplacian matrices. However, non-harmonic spectra of the Laplacian matrices induced by the filtration offer another powerful tool for data analysis, modeling, and prediction. In this work, fullerene stability is predicted by using both harmonic spectra and non-harmonic persistent spectra, while the latter spectra are successfully devised to analyze the structure of fullerenes and model protein flexibility, which cannot be straightforwardly extracted from the current persistent homology. The proposed method is found to provide excellent predictions of the protein B-factors for which current popular biophysical models break down.
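The statement that topological information sits in the zero eigenvalues can be checked in the simplest case, dimension 0: the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components. The sketch below builds the graph Laplacian of a point cloud at several distance thresholds and counts the kernel dimension by exact Gaussian elimination. It covers only the 0-dimensional Laplacian at a single filtration value at a time, not the full persistent Laplacian of the paper; the function names and the distance-threshold filtration are assumptions.

```python
import math
from fractions import Fraction


def graph_laplacian(points, radius):
    """0-dimensional combinatorial Laplacian (degree minus adjacency).

    Two points are joined by an edge when their distance is <= radius.
    """
    n = len(points)
    lap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                lap[i][j] = lap[j][i] = -1
                lap[i][i] += 1
                lap[j][j] += 1
    return lap


def nullity(matrix):
    """Kernel dimension, i.e. the multiplicity of the zero eigenvalue."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0])
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows))
                      if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            if factor != 0:
                rows[r] = [a - factor * b
                           for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return n_cols - rank
```

For the four collinear points `(0, 0)`, `(1, 0)`, `(5, 0)`, `(6, 0)`, the nullity falls from 4 to 2 to 1 as the threshold grows through 0.5, 1.5 and 5.0, tracking how components merge along the filtration. The non-zero eigenvalues, which this sketch does not compute, carry the additional geometric information the abstract refers to.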
We present a new Matched Interface and Boundary (MIB) regularization method for treating charge singularity in solvated biomolecules whose electrostatics are described by the Poisson–Boltzmann (PB) equation. In a regularization method, by decomposing the potential function into two or three components, the singular component can be analytically represented by the Green’s function, while other components possess a higher regularity. Our new regularization combines the efficiency of two-component schemes with the accuracy of the three-component schemes. Based on this regularization, a new MIB finite difference algorithm is developed for solving both linear and nonlinear PB equations, where the nonlinearity is handled by using the inexact Newton method. Compared with the existing MIB PB solver based on a three-component regularization, the present algorithm is simpler to implement because it avoids solving a boundary value Poisson equation inside the molecular interface and computing the related interface jump conditions numerically. Moreover, the new MIB algorithm is computationally less expensive while maintaining the same second-order accuracy. This is numerically verified by calculating the electrostatic potential and solvation energy on the Kirkwood sphere, for which analytical solutions are available, and on a series of proteins of various sizes.
Shenggao Zhou (Soochow University); R. G. Weiss (ETH Zurich); Li-Tien Cheng (University of California, San Diego); Joachim Dzubiella (University of Freiburg); J. Andrew McCammon (University of California, San Diego); Bo Li (University of California, San Diego)
Numerical Analysis and Scientific Computing; Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:2005.25001
Proceedings of the National Academy of Sciences of the United States of America, 116, (30), 14989–14994, 2019.7
Ligand-receptor binding and unbinding are fundamental biomolecular processes and particularly essential to drug efficacy. Environmental water fluctuations, however, impact the corresponding thermodynamics and kinetics and thereby challenge theoretical descriptions. Here, we devise a holistic, implicit-solvent, multi-method approach to predict the (un)binding kinetics for a generic ligand-pocket model. We use the variational implicit-solvent model (VISM) to calculate the solute-solvent interfacial structures and the corresponding free energies, and combine the VISM with the string method to obtain the minimum energy paths and transition states between the various metastable (“dry” and “wet”) hydration states. The resulting dry-wet transition rates are then used in spatially dependent multi-state continuous-time Markov chain Brownian dynamics simulations, and the related Fokker–Planck equation calculations, of the ligand stochastic motion, providing the mean first-passage times for binding and unbinding. We find that the hydration transitions significantly slow down the binding process, in semi-quantitative agreement with existing explicit-water simulations, but significantly accelerate the unbinding process. Moreover, our methods allow the characterization of non-equilibrium hydration states of pocket and ligand during the ligand movement, for which we find substantial memory and hysteresis effects for binding versus unbinding. Our study thus provides a significant step forward towards efficient, physics-based interpretation and predictions of the complex kinetics in realistic ligand-receptor systems.
Many cellular processes are governed by stochastic reaction events. These events do not necessarily occur in single steps of individual molecules, and, conversely, each birth or death of a macromolecule (e.g., protein) could involve several small reaction steps, creating a memory between individual events and thus leading to nonmarkovian reaction kinetics. Characterizing these kinetics is challenging. Here, we develop a systematic approach for a general reaction network with arbitrary intrinsic waiting-time distributions, which includes the stationary generalized chemical-master equation (sgCME), the stationary generalized Fokker–Planck equation, and the generalized linear-noise approximation. The first formulation converts a nonmarkovian issue into a markovian one by introducing effective transition rates (that explicitly decode the effect of molecular memory) for the reactions in an equivalent reaction network with the same substrates but without molecular memory. Nonmarkovian features of the reaction kinetics can be revealed by solving the sgCME. The latter 2 formulations can be used in the fast evaluation of fluctuations. These formulations can have broad applications, and, in particular, they may help us discover new biological knowledge underlying memory effects. When they are applied to generalized stochastic models of gene-expression regulation, we find that molecular memory is in effect equivalent to a feedback and can induce bimodality, fine-tune the expression noise, and induce switching.
Yujie Ye (Department of Biochemistry and Cellular and Molecular Biology, The University of Tennessee, Knoxville, Tennessee, United States of America); Xin Kang (Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China); Jordan Bailey (Department of Biochemistry and Cellular and Molecular Biology, The University of Tennessee, Knoxville, Tennessee, United States of America); Chunhe Li (Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China); Tian Hong (Department of Biochemistry and Cellular and Molecular Biology, The University of Tennessee, Knoxville, Tennessee, United States of America)
Multistep cell fate transitions with stepwise changes of transcriptional profiles are common to many developmental, regenerative and pathological processes. The multiple intermediate cell lineage states can serve as differentiation checkpoints or branching points for channeling cells to more than one lineage. However, mechanisms underlying these transitions remain elusive. Here, we explored gene regulatory circuits that can generate multiple intermediate cellular states with stepwise modulations of transcription factors. With unbiased searching in the network topology space, we found a motif family containing a large set of networks that can give rise to four attractors with the stepwise regulations of transcription factors, which limit the reversibility of three consecutive steps of the lineage transition. We found that there is an enrichment of these motifs in a transcriptional network controlling early T cell development, and a mathematical model based on this network recapitulates multistep transitions in early T cell lineage commitment. By calculating the energy landscape and minimum action paths for the T cell model, we quantified the stochastic dynamics of the critical factors in response to the differentiation signal with fluctuations. These results are in good agreement with experimental observations and they suggest the stable characteristics of the intermediate states in T cell differentiation. These dynamical features may help to direct the cells to correct lineages during development. Our findings provide general design principles for multistep cell lineage transitions and new insights into early T cell development. The network motifs containing a large family of topologies can be useful for analyzing diverse biological systems with multistep transitions.
HIV-1 is the most common and pathogenic strain of human immunodeficiency virus and consists of many subtypes. To study the differences among HIV-1 subtypes in infection, diagnosis and drug design, it is important to identify HIV-1 subtypes from clinical HIV-1 samples. In this work, we propose an effective numeric representation called Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using this representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses correctly. SNV is based on the distribution of nucleotides in HIV-1 viral sequences. It not only computes the number of nucleotides, but also describes the position and variance of nucleotides in viruses. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms three popular methods, Kameris, Comet and REGA, with almost 100% sensitivity and specificity and in much less time. Our subtyping algorithm works especially well for circulating recombinant forms (CRFs) consisting of a few sequences. Our approach is also able to separate unique recombinant forms (URFs) from other subtypes with 100% sensitivity and specificity. Moreover, phylogenetic trees based on the SNV representation are constructed using full-length HIV-1 genomes and pol genes respectively, where viruses from the same subtype are clustered together correctly.
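The three ingredients named in this abstract (the number, position and variance of each nucleotide) match the basic natural vector, which the sketch below computes for a whole sequence. For each base it records the count, the mean position (1-based) and the second central moment of the positions scaled by the count and the sequence length. How SNV splits a sequence into subsequences is not described in the abstract and is not implemented here; the function name and the handling of absent bases are assumptions.

```python
def natural_vector(seq, alphabet="ACGT"):
    """12-dimensional natural vector of a nucleotide sequence.

    Layout: counts for each base, then mean positions, then normalized
    second central moments sum((p - mu)^2) / (n_k * n). Bases that do
    not occur contribute zeros (an assumption of this sketch).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu)
        moments.append(d2)
    return counts + means + moments
```

Because every sequence maps to a fixed-length numeric vector regardless of its length, sequences can be compared with ordinary distances or fed to a classifier such as linear discriminant analysis without any alignment step.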
Ting-Li Chen (Institute of Statistical Science, Academia Sinica); Dai-Ni Hsieh (Institute of Statistical Science, Academia Sinica); Hung Hung (Institute of Epidemiology and Preventive Medicine); I-Ping Tu (Institute of Statistical Science, Academia Sinica); Pei-Shien Wu (Dept. of Biostatistics, Duke University); Yi-Ming Wu (Institute of Chemistry, Academia Sinica); Wei-Hau Chang (Institute of Chemistry, Academia Sinica); Su-Yun Huang (Institute of Statistical Science, Academia Sinica)
Statistics Theory and Methods; Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:2004.33002
The Annals of Applied Statistics, 8, (1), 259-285, 2014
Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful tool for obtaining three-dimensional (3D) structures of biological macromolecules in native states. A minimum cryo-EM image data set for deriving a meaningful reconstruction is comprised of thousands of randomly orientated projections of identical particles photographed with a small number of electrons. The computation of 3D structure from 2D projections requires clustering, which aims to enhance the signal-to-noise ratio in each view by grouping similarly oriented images. Nevertheless, the prevailing clustering techniques are often compromised by three characteristics of cryo-EM data: high noise content, high dimensionality and large number of clusters. Moreover, since clustering requires registering images of similar orientation into the same pixel coordinates by 2D alignment, it is desired that the clustering algorithm can label misaligned images as outliers. Herein, we introduce a clustering algorithm γ-SUP to model the data with a q-Gaussian mixture and adopt the minimum γ-divergence for estimation, and then use a self-updating procedure to obtain the numerical solution. We apply γ-SUP to the cryo-EM images of two benchmark macromolecules, RNA polymerase II and ribosome. In the former case, simulated images were chosen to decouple clustering from alignment to demonstrate γ-SUP is more robust to misalignment outliers than the existing clustering methods used in the cryo-EM community. In the latter case, the clustering of real cryo-EM data by our γ-SUP method eliminates noise in many views to reveal true structure features of ribosome at the projection
Classification of DNA sequences is an important issue in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are popular nowadays, but the manual intervention they require usually decreases their accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence by a point in R^18. By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only captures the distribution of each nucleotide, but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in R^18. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time efficiency of the proposed method.
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types, including simulated DNA sequences, exon-intron and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
Next-generation sequencing technology enables the routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole-genome sequencing has been of importance in the epidemiologic analysis of bacterial pathogens. However, few whole-genome sequencing-based genotyping pipelines are available for practical applications. Here, we present a whole-genome sequencing-based single nucleotide polymorphism (SNP) genotyping method and apply it to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genotyping method calls genome variants using next-generation sequencing reads of whole genomes and calculates the pair-wise Jaccard distances of the genome variants. The method may reveal the high-resolution whole-genome SNP profiles and the structural variants of different isolates of methicillin-resistant S. aureus (MRSA) and methicillin-susceptible S. aureus (MSSA) strains. The phylogenetic analysis of whole genomes and particular regions may monitor and track the evolution and the transmission dynamics of bacterial pathogens. The computer programs of the whole genome sequencing-based SNP genotyping methods are available to the public at https://github.com/
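The pair-wise Jaccard distance mentioned in this abstract can be illustrated with a minimal sketch. It assumes each isolate's called variants are available as a set of `(position, reference, alternate)` tuples; that encoding, the function names and the toy isolates are hypothetical and are not taken from the authors' programs.

```python
def jaccard_distance(variants_a, variants_b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two variant sets."""
    a, b = set(variants_a), set(variants_b)
    union = a | b
    if not union:
        # Two isolates with no called variants are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Pair-wise distances for a dict mapping isolate name -> variant set."""
    names = sorted(profiles)
    return {(p, q): jaccard_distance(profiles[p], profiles[q])
            for p in names for q in names}


# Hypothetical toy profiles: (position, reference base, alternate base).
isolates = {
    "isolate_1": {(101, "A", "G"), (250, "C", "T"), (900, "G", "A")},
    "isolate_2": {(250, "C", "T"), (900, "G", "A"), (1500, "T", "C")},
}
matrix = distance_matrix(isolates)
```

The resulting matrix could then be passed to any distance-based tree-building routine; which one the authors use is not stated here.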
Myxobacteria are social bacteria that can glide in two dimensions and form counterpropagating, interacting waves. Here, we present a novel age-structured, continuous macroscopic model for the movement of myxobacteria. The derivation is based on microscopic interaction rules that can be formulated as a particle-based model and set within the Self-Organized Hydrodynamics (SOH) framework. The strength of this combined approach is that microscopic knowledge or data can be incorporated easily into the particle model, whilst the continuous model allows for easy numerical analysis of the different effects. However, we found that the derived macroscopic model lacks a diffusion term in the density equations, which is necessary to control the number of waves, indicating that a higher order approximation during the derivation is crucial. Upon ad hoc addition of the diffusion term, we found very good agreement between the age-structured model and the biology. In particular, we analyzed the influence of a refractory (insensitivity) period following a reversal of movement. Our analysis reveals that the refractory period is not necessary for wave formation, but essential to wave synchronization, indicating separate molecular mechanisms.
Although deep learning approaches have had tremendous success in image, video and audio processing, computer vision, and speech recognition, their applications to three-dimensional (3D) biomolecular structural data sets have been hindered by the geometric and biological complexity. To address this problem we introduce the element-specific persistent homology (ESPH) method. ESPH represents 3D complex geometry by one-dimensional (1D) topological invariants and retains important biological information via a multichannel image-like representation. This representation reveals hidden structure-function relationships in biomolecules. We further integrate ESPH and deep convolutional neural networks to construct a multichannel topological neural network (TopologyNet) for the predictions of protein-ligand binding affinities and protein stability changes upon mutation. To overcome the deep learning limitations from small and noisy training sets, we propose a multi-task multichannel topological convolutional neural network (MM-TCNN). We demonstrate that TopologyNet outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation induced globular protein folding free energy changes, and mutation induced membrane protein folding free energy changes. Availability: weilab.math.msu.edu/TDL/
Huanfei Man (Soochow University); Siyang Leng (Fudan University); Kazuyuki Aihara (Tokyo University); Wei Lin (Fudan University); Luonan Chen (Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1904.42006
Best Paper Award in 2019
Proceedings of the National Academy of Sciences of the United States of America, 115, (43), E9994-E10002, 2018.10
Future state prediction for nonlinear dynamical systems is a challenging task, particularly when only a few time series samples for high-dimensional variables are available from real-world systems. In this work, we propose a model-free framework, named randomly distributed embedding (RDE), to achieve accurate future state prediction based on short-term high-dimensional data. Specifically, from the observed data of high-dimensional variables, the RDE framework randomly generates a sufficient number of low-dimensional “nondelay embeddings” and maps each of them to a “delay embedding,” which is constructed from the data of a to-be-predicted target variable. Any of these mappings can act as a low-dimensional weak predictor for future state prediction, and all such mappings together generate a distribution of predicted future states. This distribution patches all pieces of association information from the various embeddings, unbiasedly or biasedly, into the whole dynamics of the target variable and, after being processed by appropriate estimation strategies, yields a stronger predictor that is more reliable and robust. By applying the RDE framework to data from both representative models and real-world systems, we reveal that high dimensionality is no longer an obstacle but a source of information crucial to accurate prediction from short-term data, even under noise deterioration.
Comparing DNA and protein sequence groups plays an important role in biological evolutionary relationship research. Despite many methods available for sequence comparison, only a few can be used for group comparison. In this study, we propose a novel approach using convex hulls. We use statistical information contained within the sequences to represent each sequence as a point in high dimensional space. We find that the points belonging to one biological group are located in a different region of space than points belonging to other biological groups. To be more precise, the convex hull of the points from one group is disjoint from the convex hulls of points from other groups. This finding allows us to do phylogenetic analysis for groups in an efficient way. Five different theorems are presented for checking whether two convex hulls intersect or are disjoint. Test results for datasets related to HRV, HPV, Ebolavirus, PKC and protein phosphatase domains demonstrate that our method performs well and provides a new tool for studying group phylogeny. More significantly, the convex analysis presents a new way to search for sequences belonging to a biological group by examining points within the group’s convex hull.
Prochlorococcus marinus, one of the most abundant marine cyanobacteria in the global ocean, is classified into low-light (LL) and high-light (HL) adapted ecotypes. These two ecotypes differ in their ecophysiological characteristics, especially in whether they are adapted for growth at high-light or low-light intensities. However, some evolutionary relationships in the Prochlorococcus phylogeny remain to be resolved, such as whether the strains SS120 and MIT9211 form a monophyletic group. We use the Natural Vector (NV) method to represent the sequences in order to identify the phylogeny of Prochlorococcus. The natural vector method is alignment-free and makes no model assumptions. This study adds the covariances of amino acids in protein sequences to the natural vector method. Based on these new natural vectors, we can compute the Hausdorff distance between two clades, which represents their dissimilarity. This method enables us to systematically analyze both the dataset of ribosomal proteomes and the dataset of 16S-23S rRNA sequences in order to reconstruct the phylogeny of Prochlorococcus. Furthermore, we apply classification to inspect the relationship of SS120 and MIT9211. From the reconstructed phylogenetic trees and classification results, we may conclude that SS120 does not cluster with MIT9211. This study demonstrates a new method for performing phylogenetic analysis. The results confirm that these two strains do not form a monophyletic clade in the phylogeny of Prochlorococcus.
Structures and functions of proteins play various essential roles in biological processes. The functions of newly discovered proteins can be predicted by comparing their structures with those of known functional proteins. Many approaches have been proposed for measuring protein structure similarity, such as the template-modeling (TM)-score method, the GRaphlet (GR)-Align method and the commonly used root-mean-square deviation (RMSD) measures. However, alignment-based comparison of protein structures is time-consuming on large datasets, and its accuracy still has room for improvement. In this study, we introduce a new three-dimensional (3D) Yau–Hausdorff distance between any two 3D objects. The (3D) Yau–Hausdorff distance can be used in particular to measure the similarity/dissimilarity of two proteins of any size and does not require aligning and superimposing the two structures. We apply structural similarity to study function similarity and perform phylogenetic analysis on several datasets. The results show that the (3D) Yau–Hausdorff distance could serve as a more precise and effective method to discover biological relationships between proteins than other methods of structure comparison.
Background: In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of this data cannot keep up with the rate of increase of available data.
Results: In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, this augmented 18-dimensional natural vector makes good use of the available information in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorithm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus Megacollybia to validate our method. We also studied how effective our method was in determining the genetic distance within and between species in our barcoding dataset.
Conclusions: The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.
This study quantitatively validates the principle that the biological properties associated with a given genotype are determined by the distribution of amino acids. In order to visualize this central law of molecular biology, each protein was represented by a point in 250-dimensional space based on its amino acid distribution. Proteins from the same family are found to cluster together, leading to the principle that the convex hull surrounding protein points from the same family does not intersect with the convex hulls of other protein families. This principle was verified computationally for all available and reliable protein kinases and human proteins. In addition, we generated 2,328,761 figures to show that the convex hulls of different families were disjoint from each other. The high and robust classification accuracy (95.75% and 97.5%), together with reasonable phylogenetic trees, further validates our methods.
Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolution, but is very time-consuming. When the scale of data is large, alignment methods cannot finish the calculation in reasonable time. Therefore, we present a new method using moments of the cumulative Fourier power spectrum for clustering DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors reflect the relationships between sequences. The mapping between the spectra and the moment vector is one-to-one, which means that no information in the power spectra is lost during the calculation. We cluster and classify several datasets, including Influenza A, primates, and human rhinovirus (HRV) datasets, to build the phylogenetic trees. Results show that the proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
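To make the pipeline in this abstract concrete, the following is a rough sketch of its general shape: one indicator sequence per nucleotide, a discrete Fourier power spectrum, its running sum, and a few sample moments collected into a vector. The exact moment definition and normalization used by the authors are not given in the abstract, so the choices below (keeping the zero-frequency term, plain mean and central moments of orders 2 and 3) are illustrative assumptions, as are all function names.

```python
import cmath
import math


def power_spectrum(indicator):
    """|DFT|^2 of a 0/1 indicator sequence, by direct O(N^2) summation."""
    n = len(indicator)
    spectrum = []
    for f in range(n):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, v in enumerate(indicator))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cumulative(values):
    """Running sum of a list of numbers."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def moment_vector(seq, n_moments=3):
    """Per nucleotide: mean and central moments (orders 2..n_moments) of the
    cumulative power spectrum. Normalization here is an assumption."""
    features = []
    for base in "ACGT":
        indicator = [1.0 if ch == base else 0.0 for ch in seq.upper()]
        cps = cumulative(power_spectrum(indicator))
        mean = sum(cps) / len(cps)
        features.append(mean)
        for order in range(2, n_moments + 1):
            features.append(sum((c - mean) ** order for c in cps) / len(cps))
    return features


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d = euclidean(moment_vector("ACGTACGTAA"), moment_vector("ACGTTCGTAA"))
```

A production version would use a fast Fourier transform instead of the direct sum; the direct form is kept here only so the sketch needs no third-party packages.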
Rui Dong (Tsinghua University); Hui Zheng (The University of Illinois at Chicago); Kun Tian (Tsinghua University); Shek-Chung Yau (The Hong Kong University of Science and Technology); Weiguang Mao (Tsinghua University); Wenping Yu (Nankai University); Changchuan Yin (The University of Illinois at Chicago); Chenglong Yu (South Australian Health and Medical Research Institute); Rong Lucy He (Chicago State University); Jie Yang (The University of Illinois at Chicago); Stephen S.-T. Yau (Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42004
We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system to serve people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single/multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system computes natural vectors and their distances based on submitted genomes, provides an online interface for accessing and using the database for viral classification and prediction, and runs back-end processes for automatic and manual updating of database content to synchronize with GenBank. Genome data submitted in FASTA format will be processed, and the prediction results, with the 5 closest neighbors and their classifications, will be returned by email. Considering the one-to-one correspondence between sequence and natural vector, its time efficiency, and its high accuracy, the natural vector is a significant advance compared with alignment methods, which makes VirusDB a useful database for further research.
Yongkun Li (Department of Mathematical Sciences, Tsinghua University); Lily He (Department of Mathematical Sciences, Tsinghua University); Rong Lucy He (Department of Biological Sciences, Chicago State University); Stephen S.-T. Yau (Corresponding author) (Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42003
Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated in Uganda in 1947 and has been an emerging threat since 2007. However, because of the inconsistency of alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to construct the genetic relationship among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originated in Africa, then spread to Asia, the Pacific islands, and throughout the Americas. We also perform the phylogeny analysis for 53 viruses in the genus Flavivirus, to which ZIKV belongs, using complete proteins. Our conclusion is consistent with the classification by hosts and transmission vectors.
Yongkun Li (Department of Mathematical Sciences, Tsinghua University); Lily He (Department of Mathematical Sciences, Tsinghua University); Rong Lucy He (Department of Biological Sciences, Chicago State University); Stephen S.-T. Yau (Corresponding author) (Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42002
With the sharp increase in biological sequences, traditional sequence alignment methods have become unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these methods, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors basically consist of typical numerical features for certain biological problems. The features may come from the primary sequences, or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector, based only on the primary sequences of organisms, to build their phylogeny. Three chemical and physical properties of primary sequences: purine, pyrimidine and keto are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters. Therefore, three sequences are constructed according to the three properties. For each letter of each sequence, we calculate the number of occurrences of the letter, the average position of the letter and the variation of the position of the letter in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
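The recoding step described above can be sketched as follows. The abstract does not spell out the three two-letter groupings, so the ones used here (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding, with their standard IUPAC letters) are an assumption, as is the use of the plain variance for "variation of the position". With three recoded sequences, two letters each and three numbers per letter, the sketch yields an 18-dimensional vector.

```python
# Assumed two-letter groupings (IUPAC codes); not confirmed by the abstract.
GROUPINGS = {
    "purine/pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino/keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak/strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def recode(seq, mapping):
    """Rewrite a nucleotide sequence over a two-letter alphabet."""
    return "".join(mapping[ch] for ch in seq.upper())


def letter_features(seq, letter):
    """Count, mean position (1-based) and variance of position of a letter."""
    positions = [i for i, ch in enumerate(seq, start=1) if ch == letter]
    count = len(positions)
    if count == 0:
        return [0, 0.0, 0.0]
    mean = sum(positions) / count
    variance = sum((p - mean) ** 2 for p in positions) / count
    return [count, mean, variance]


def feature_vector(seq):
    """Concatenate the per-letter features over the three recoded sequences."""
    features = []
    for mapping in GROUPINGS.values():
        recoded = recode(seq, mapping)
        for letter in sorted(set(mapping.values())):
            features.extend(letter_features(recoded, letter))
    return features


vec = feature_vector("ACGTTGCA")
```

Sequences can then be compared through any vector distance, for example the Euclidean distance between their feature vectors.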
Lily He (Department of Mathematical Sciences, Tsinghua University); Yongkun Li (Department of Mathematical Sciences, Tsinghua University); Rong Lucy He (Department of Biological Sciences, Chicago State University); Stephen S.-T. Yau (Corresponding author) (Department of Mathematical Sciences, Tsinghua University)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1903.42001
Journal of Theoretical Biology, 427, 41-52, 2017.6
Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply in the past decade. Traditionally, comparison of protein sequences is carried out through multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, such as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids: the hydropathy index, polar requirement and chemical composition of the side chain, we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only utilizes the chemical properties of amino acids but also takes their numbers and positions into account. The results on beta-globin, mammals, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
Processing streaming data as they arrive is often necessary for high dimensional data analysis. In this paper, we analyze the convergence of a subspace online PCA iteration, as a follow-up to the recent work of Li, Wang, Liu, and Zhang [Math. Program., Ser. B, DOI 10.1007/s10107-017-1182-z], who considered the case of the most significant principal component only, i.e., a single vector. Under the sub-Gaussian assumption, we obtain a finite-sample error bound that closely matches the minimax information lower bound of Vu and Lei [Ann. Statist. 41:6 (2013), 2905-2947].
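For readers unfamiliar with online PCA, here is a minimal sketch of one common form of subspace iteration: an Oja-type update of each basis vector with every arriving sample, followed by Gram-Schmidt re-orthonormalization. Whether this matches the exact iteration analyzed in the paper is an assumption; the constant step size, the function names and the synthetic Gaussian stream are illustrative only.

```python
import math
import random


def orthonormalize(vectors):
    """Gram-Schmidt: return orthonormal vectors spanning the same space."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            dot = sum(a * b for a, b in zip(w, u))
            w = [a - dot * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis


def online_pca(stream, dim, k, step=0.005, seed=0):
    """Track a k-dimensional principal subspace, one sample at a time."""
    rng = random.Random(seed)
    basis = orthonormalize(
        [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)])
    for x in stream:
        updated = []
        for u in basis:
            proj = sum(a * b for a, b in zip(x, u))  # x^T u
            # Oja-type step: u <- u + step * (x^T u) * x
            updated.append([b + step * proj * a for a, b in zip(x, u)])
        basis = orthonormalize(updated)
    return basis


# Synthetic stream: independent coordinates with standard deviations 3, 1, 0.5,
# so the top-2 principal subspace is spanned by the first two axes.
data_rng = random.Random(1)
scales = [3.0, 1.0, 0.5]
stream = [[s * data_rng.gauss(0, 1) for s in scales] for _ in range(4000)]
basis = online_pca(stream, dim=3, k=2)
leak = sum(u[2] ** 2 for u in basis)  # energy outside the true subspace
```

Each sample is used once and then discarded, which is what makes the scheme suitable for streaming data; only the `k` basis vectors are kept in memory.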
We propose to combine cepstrum and nonlinear time–frequency (TF) analysis to study multiple component oscillatory signals with time-varying frequency and amplitude and with time-varying non-sinusoidal oscillatory pattern. The concept of cepstrum is applied to eliminate the wave-shape function influence on the TF analysis, and we propose a new algorithm, named de-shape synchrosqueezing transform (de-shape SST). A mathematical model, the adaptive non-harmonic model, is introduced and the de-shape SST algorithm is theoretically analyzed. In addition to simulated signals, several different physiological, musical and biological signals are analyzed to illustrate the proposed algorithm.
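As background for the cepstrum ingredient only (the de-shape SST itself is not implemented here), the sketch below computes a real cepstrum, i.e. the inverse Fourier transform of the log magnitude spectrum. Periodic structure in the log spectrum, such as the regularly spaced harmonics of a non-sinusoidal oscillation, shows up as a peak in the quefrency domain. The toy signal (an impulse plus a delayed, attenuated copy) and all names are illustrative assumptions.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct O(N^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(signal, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum; `floor` avoids log(0)."""
    log_magnitude = [math.log(max(abs(c), floor)) for c in dft(signal)]
    return [c.real for c in dft(log_magnitude, inverse=True)]


# Toy signal: unit impulse plus an echo of amplitude 0.5 delayed by 8 samples.
n, delay = 64, 8
signal = [0.0] * n
signal[0] = 1.0
signal[delay] = 0.5
cepstrum = real_cepstrum(signal)

# Strongest positive quefrency in the first half recovers the delay.
peak = max(range(1, n // 2 + 1), key=lambda q: cepstrum[q])
```

In this toy case the log spectrum ripples with a period set by the delay, and the cepstrum concentrates that ripple at quefrency 8.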
Chenglong Yu (South Australian Health and Medical Research Institute); Bernhard T. Baune (University of Adelaide); Julio Licinio (South Australian Health and Medical Research Institute); Ma-Li Wong (South Australian Health and Medical Research Institute)
Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:1703.42005
Major depressive disorder (MDD) is highly prevalent, resulting in an exceedingly high disease burden. The identification of generic risk factors could lead to advances in prevention and therapeutics. Current approaches examine genotyping data to identify specific variations between cases and controls. Compared to genotyping, whole-genome sequencing (WGS) allows for the detection of private mutations. In this proof-of-concept study, we establish a conceptually novel computational approach that clusters subjects based on the entirety of their WGS. Those clusters predicted MDD diagnosis. This strategy yielded encouraging results, showing that depressed Mexican-American participants were grouped closer together; in contrast, ethnically matched controls grouped away from MDD patients. This implies that within the same ancestry, the WGS data of an individual can be used to check whether this individual is within or closer to MDD subjects or to controls. We propose a novel strategy to apply WGS data to clinical medicine by facilitating diagnosis through genetic clustering. Further studies utilising our method should examine larger WGS datasets on other ethnic groups.
The International Committee on Taxonomy of Viruses authorizes and organizes the taxonomic classification of viruses. Thus far, the detailed classifications for all viruses are neither complete nor free from dispute. For example, the current missing label rates in GenBank are 12.1% for family label and 30.0% for genus label. Using the proposed Natural Vector representation, all 2,044 single-segment referenced viral genomes in GenBank can be embedded in R^12. Unlike other approaches, this allows us to determine phylogenetic relations for all viruses at any level (e.g., Baltimore class, family, subfamily, genus, and species) in real time. Additionally, the proposed graphical representation for virus phylogeny provides a visualization of the distribution of viruses in R^12. Unlike the commonly used tree visualization methods which suffer from uniqueness and existence problems, our representation always exists and is unique. This approach is successfully used to predict and correct viral classification information, as well as to identify viral origins; e.g. a recent public health threat, the West Nile virus, is closer to the Japanese encephalitis antigenic complex based on our visualization. Based on cross validation results, the accuracy rates of our predictions are as high as 98.2% for Baltimore class labels, 96.6% for family labels, 99.7% for subfamily labels and 97.2% for genus labels.
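The abstract states only that each genome is embedded in R^12. A minimal sketch of one commonly published form of the natural vector is given below, under the assumption that the 12 components are, for each of A, C, G and T, the count n_k, the mean position μ_k, and the second central moment of position normalized by n_k times the sequence length n. Function names and the toy sequences are hypothetical.

```python
import math


def natural_vector(seq, alphabet="ACGT"):
    """12-dimensional vector: counts, mean positions, normalized 2nd moments."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        positions = [i for i, ch in enumerate(seq, start=1) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three entries are set to zero.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two natural vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


nv = natural_vector("AAT")
distance = euclidean(natural_vector("ACGTAC"), natural_vector("ACGTTC"))
```

Because every genome maps to a fixed-length vector regardless of its length, distances between genomes reduce to vector distances, which is what makes real-time comparison at any taxonomic level plausible.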
Current methods cannot tell us concretely what the nature of the protein universe is. They are based on different models of amino acid substitution and on multiple sequence alignment, which is an NP-hard problem and requires manual intervention. Protein structural analysis also gives a direction for mapping the protein universe. Unfortunately, only a minuscule fraction of proteins' 3-dimensional structures are currently known. Furthermore, the phylogenetic tree representations are not unique for any existing tree construction methods. Here we develop a novel method to realize the nature of the protein universe. We show that the protein universe can be realized as a protein space in 60-dimensional Euclidean space using a distance based on a normalized distribution of amino acids. Every protein is in one-to-one correspondence with a point in protein space, where proteins with similar properties stay close together. Thus the distance between two points in protein space represents the biological distance of the corresponding two proteins. We also propose a natural graphical representation for inferring phylogenies. The representation is natural and unique based on the biological distances of proteins in protein space. This will solve the fundamental question of how proteins are distributed in the protein universe.
The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.
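The set-to-set comparison described in this abstract can be illustrated with a short sketch of the Hausdorff distance between two finite sets of vectors, using the Euclidean distance between individual vectors. The two-dimensional toy sets stand in for the sets of natural vectors that represent two bacteria; they and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_hausdorff(set_a, set_b):
    """Largest distance from a point of set_a to its nearest point in set_b."""
    return max(min(euclidean(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty finite sets."""
    return max(directed_hausdorff(set_a, set_b),
               directed_hausdorff(set_b, set_a))


# Toy stand-ins for the vector sets of two organisms.
organism_1 = [(0.0, 0.0), (1.0, 0.0)]
organism_2 = [(0.0, 0.0), (4.0, 0.0)]
h = hausdorff(organism_1, organism_2)
```

Applying `hausdorff` to every pair of organisms gives a distance matrix from which a tree can be reconstructed by a distance-based method.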