HIV-1 is the most common and pathogenic strain of human immunodeficiency virus consisting of many subtypes. To study the difference among HIV-1 subtypes in infection, diagnosis and drug design, it is important to identify HIV-1 subtypes from clinical HIV-1 samples. In this work, we propose an effective numeric representation called Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using the representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses correctly. SNV is based on distribution of nucleotides in HIV-1 viral sequences. It not only computes the number of nucleotides, but also describes the position and variance of nucleotides in viruses. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms the three popular methods, Kameris, Comet and REGA, with almost 100% Sensitivity and Specificity, also with much less time. Our subtyping algorithm especially works better for circulating recombinant forms (CRFs) consisting of a few sequences. Our approach is also powerful to separate unique recombinant forms (URFs) from other subtypes with 100% Sensitivity and Specificity. Moreover, phylogenetic trees based on SNV representation are constructed using full-length HIV-1 genomes and pol genes respectively, where viruses from the same subtype are clustered together correctly.
Ting-Li ChenInstitute of Statistical Science, Academia SinicaDai-Ni HsiehInstitute of Statistical Science, Academia SinicaHung HungInstitute of Epidemiology and Preventive Medicine I-Ping TuInstitute of Statistical Science, Academia SinicaPei-Shien WuDept. of Biostatistics, Duke UniversityYi-Ming WuInstitute of Chemistry, Academia SinicaWei-Hau ChangInstitute of Chemistry, Academia SinicaSu-Yun HuangInstitute of Statistical Science, Academia Sinica
Statistics Theory and MethodsData Analysis, Bio-Statistics, Bio-Mathematicsmathscidoc:2004.33002
The Annals of Applied Statistics , 8, (1), 259-285, 2014
Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful
tool for obtaining three-dimensional (3D) structures of biological macromolecules
in native states. A minimum cryo-EM image data set for deriving a
meaningful reconstruction is comprised of thousands of randomly orientated
projections of identical particles photographed with a small number of electrons.
The computation of 3D structure from 2D projections requires clustering,
which aims to enhance the signal to noise ratio in each view by grouping
similarly oriented images. Nevertheless, the prevailing clustering techniques
are often compromised by three characteristics of cryo-EM data: high noise
content, high dimensionality and large number of clusters. Moreover, since
clustering requires registering images of similar orientation into the same
pixel coordinates by 2D alignment, it is desired that the clustering algorithm
can label misaligned images as outliers. Herein, we introduce a clustering algorithm
γ-SUP to model the data with a q-Gaussian mixture and adopt the
minimum γ-divergence for estimation, and then use a self-updating procedure
to obtain the numerical solution. We apply γ-SUP to the cryo-EM images
of two benchmark macromolecules, RNA polymerase II and ribosome.
In the former case, simulated images were chosen to decouple clustering from
alignment to demonstrate γ-SUP is more robust to misalignment outliers than
the existing clustering methods used in the cryo-EM community. In the latter
case, the clustering of real cryo-EM data by our γ-SUP method eliminates
noise in many views to reveal true structure features of ribosome at the projection
Classification of DNA sequences is an important issue in the bioinformatics study, yet most existing methods for phylogenetic analysis including Multiple Sequence Alignment (MSA) are time-consuming and computationally expensive. The alignment-free methods are popular nowadays, whereas the manual intervention in those methods usually decreases the accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in R18. By calculating the Accumulated Indicator Functions of nucleotides, we can further find an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only can capture the distribution of each nucleotide, but also provide the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in R18. The tests of ANV of datasets of different sizes and types have proved the accuracy and time-efficiency of the new proposed ANV method.
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, the Multiple Sequence Alignment (MSA) methods have been impractical to use due to its algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors can measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types including simulated DNA sequences, exon-intron and complete genomes. The results show that our method is more accurate and efficient for performing hierarchical clustering than other alignment-free methods and MSA methods.
Next-generation sequencing technology enables the routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole-genome sequencing has been of importance in the epidemiologic analysis of bacterial pathogens. However, few whole-genome sequencing-based genotyping pipelines are available for practical applications. Here, we present the whole-genome sequencing-based single nucleotide polymorphism(SNP) genotyping method and apply to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genotyping method calls genome variants using next-generation sequencing reads of whole genomes and calculates the pair-wise Jaccard distances of the genome variants. The method may reveal the high-resolution whole-genome SNP profiles and the structural variants of different isolates of methicillin-resistant S. aureus(MRSA) and methicillin-susceptible S. aureus(MSSA) strains. The phylogenetic analysis of whole genomes and particular regions may monitor and track the evolution and the transmission dynamic of bacterial pathogens. The computer pro-
grams of the whole genome sequencing-based SNP genotyping methods are available to the public at https://github. com/
Myxobacteria are social bacteria, that can glide in two dimensions and form counterpropagating, interacting waves. Here, we present a novel age-structured, continuous macroscopic model for the movement of myxobacteria. The derivation is based on microscopic interaction rules that can be formulated as a particle-based model and set within the Self-Organized Hydrodynamics (SOH) framework. The strength of this combined approach is that microscopic knowledge or data can be incorporated easily into the particle model, whilst the continuous model allows for easy numerical analysis of the diﬀerent eﬀects. However, we found that the derived macroscopic model lacks a diﬀusion term in the density equations, which is necessary to control the number of waves, indicating that a higher order approximation during the derivation is crucial. Upon ad hoc addition of the diﬀusion term, we found very good agreement between the age-structured model and the biology. In particular, we analyzed the inﬂuence of a refractory (insensitivity) period following a reversal of movement. Our analysis reveals that the refractory period is not necessary for wave formation, but essential to wave synchronization, indicating separate molecular mechanisms.
Although deep learning approaches have had tremendous success in image, video and audio processing, computer vision, and speech recognition, their applications to three-dimensional (3D) biomolecular structural data sets have been hindered by the geometric and biological complexity. To address this problem we introduce the element-specific persistent homology (ESPH) method. ESPH represents 3D complex geometry by one-dimensional (1D) topological invariants and retains important biological information via a multichannel image-like representation. This representation reveals hidden structure-function relationships in biomolecules. We further integrate ESPH and deep convolutional neural networks to construct a multichannel topological neural network (TopologyNet) for the predictions of protein-ligand binding affinities and protein stability changes upon mutation. To overcome the deep learning limitations from small and noisy training sets, we propose a multi-task multichannel topological convolutional neural network (MM-TCNN). We demonstrate that TopologyNet outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation induced globular protein folding free energy changes, and mutation induced membrane protein folding free energy changes. Availability: weilab.math.msu.edu/TDL/
Huanfei Man Soochow UniversitySiyang LengFudan UniversityKazuyuki AiharaTokyo UniversityWei LinFudan UniversityLuonan Chen Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences
Data Analysis, Bio-Statistics, Bio-Mathematicsmathscidoc:1904.42006
Best Paper Award in 2019
Proceedings of the National Academy of Sciences of the United States of America, 115, (43), E9994-E10002, 2018.10
Future state prediction for nonlinear dynamical systems is a challenging task, particularly when only a few time series samples for high-dimensional variables are available from real-world systems. In this work, we propose a model-free framework, named randomly distributed embedding (RDE), to achieve accurate future state prediction based on short-term high-dimensional data. Specifically, from the observed data of high-dimensional variables, the RDE framework randomly generates a sufficient number of low-dimensional “nondelay embeddings” and maps each of them to a “delay embedding,” which is constructed from the data of a to be predicted target variable. Any of these mappings can perform as a low-dimensional weak predictor for future state prediction, and all of such mappings generate a distribution of predicted future states. This distribution actually patches all pieces of association information from various embeddings unbiasedly or biasedly into the whole dynamics of the target variable, which after operated by appropriate estimation strategies, creates a stronger predictor for achieving prediction in a more reliable and robust form. Through applying the RDE framework to data from both representative models and real-world systems, we reveal that a high-dimension feature is no longer an obstacle but a source of information crucial to the accurate prediction for short-term data, even under noise deterioration.
Comparing DNA and protein sequence groups plays an important role in biological evolutionary relationship research. Despite many methods available for sequence comparison, only a few can be used for group comparison. In this study, we propose a novel approach using convex hulls. We use statistical information contained within the sequences to represent each sequence as a point in high dimensional space. We find that the points belonging to one biological group are located in a different region of space than points belonging to other biological groups. To be more precise, the convex hull of the points from one group are disjoint from the convex hulls of points from other groups. This finding allows us to do phylogenetic analysis for groups in an efficient way. Five different theorems are presented for checking whether two convex hulls intersect or are disjoint. Test results for datasets related to HRV, HPV, Ebolavirus, PKC and protein phosphatase domains demonstrate that our method performs well and provides a new tool for studying group phylogeny. More significantly, the convex analysis presents a new way to search for sequences belonging to a biological group by examining points within the group’s convex hull.