HIV-1 is the most common and most pathogenic strain of human immunodeficiency virus and comprises many subtypes. To study the differences among HIV-1 subtypes in infection, diagnosis and drug design, it is important to identify the subtypes of clinical HIV-1 samples. In this work, we propose an effective numeric representation called the Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using this representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses. SNV is based on the distribution of nucleotides in HIV-1 viral sequences: it records not only the number of nucleotides, but also their position and variance in the sequence. To validate our alignment-free method, 6,902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms three popular methods, Kameris, Comet and REGA, achieving almost 100% sensitivity and specificity in much less time. Our subtyping algorithm works especially well for circulating recombinant forms (CRFs) represented by only a few sequences. Our approach can also separate unique recombinant forms (URFs) from other subtypes with 100% sensitivity and specificity. Moreover, phylogenetic trees based on the SNV representation are constructed from full-length HIV-1 genomes and pol genes respectively, and in these trees viruses from the same subtype cluster together correctly.
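To make the count, position and variance description concrete, here is a minimal Python sketch of a natural-vector style encoding. It assumes the commonly used normalization (variance of positions divided by the sequence length), which the abstract does not state. The `segmented_natural_vector` helper is a hypothetical reading of "subsequence" (split the sequence into contiguous segments and concatenate the per-segment vectors), not the authors' confirmed construction.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return counts, mean positions and normalized position variances.

    Output layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T].
    The normalization D2_k = sum((p - mu_k)^2) / (n_k * n) is an assumption.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, variances = [], [], []
    for base in alphabet:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three statistics are set to zero
            counts.append(0)
            means.append(0.0)
            variances.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        variances.append(d2_k)
    return counts + means + variances


def segmented_natural_vector(seq, segments=2):
    """Hypothetical helper: concatenate natural vectors of contiguous segments."""
    n = len(seq)
    # Integer boundaries that split the sequence into near-equal parts
    bounds = [i * n // segments for i in range(segments + 1)]
    vector = []
    for start, end in zip(bounds, bounds[1:]):
        vector.extend(natural_vector(seq[start:end]))
    return vector
```

Calling `natural_vector("AATA")` gives a 12-dimensional vector; a classifier such as linear discriminant analysis would then operate on these fixed-length vectors rather than on aligned sequences.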
Ting-Li Chen (Institute of Statistical Science, Academia Sinica); Dai-Ni Hsieh (Institute of Statistical Science, Academia Sinica); Hung Hung (Institute of Epidemiology and Preventive Medicine); I-Ping Tu (Institute of Statistical Science, Academia Sinica); Pei-Shien Wu (Dept. of Biostatistics, Duke University); Yi-Ming Wu (Institute of Chemistry, Academia Sinica); Wei-Hau Chang (Institute of Chemistry, Academia Sinica); Su-Yun Huang (Institute of Statistical Science, Academia Sinica)
Statistics Theory and Methods; Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:2004.33002
The Annals of Applied Statistics, 8(1), 259-285, 2014
Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful
tool for obtaining three-dimensional (3D) structures of biological macromolecules
in native states. A minimum cryo-EM image data set for deriving a
meaningful reconstruction comprises thousands of randomly oriented
projections of identical particles photographed with a small number of electrons.
The computation of 3D structure from 2D projections requires clustering,
which aims to enhance the signal-to-noise ratio in each view by grouping
similarly oriented images. Nevertheless, the prevailing clustering techniques
are often compromised by three characteristics of cryo-EM data: high noise
content, high dimensionality and large number of clusters. Moreover, since
clustering requires registering images of similar orientation into the same
pixel coordinates by 2D alignment, it is desirable that the clustering algorithm
can label misaligned images as outliers. Herein, we introduce a clustering algorithm,
γ-SUP, which models the data with a q-Gaussian mixture, adopts the
minimum γ-divergence for estimation, and uses a self-updating procedure
to obtain the numerical solution. We apply γ-SUP to the cryo-EM images
of two benchmark macromolecules, RNA polymerase II and the ribosome.
In the former case, simulated images were chosen to decouple clustering from
alignment and to demonstrate that γ-SUP is more robust to misalignment outliers than
the existing clustering methods used in the cryo-EM community. In the latter
case, the clustering of real cryo-EM data by our γ-SUP method eliminates
noise in many views to reveal true structural features of the ribosome at the projection
Classification of DNA sequences is an important problem in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are now popular, but the manual intervention they require usually decreases accuracy. In addition, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence by a point in R^18. By calculating the Accumulated Indicator Functions of the nucleotides, we obtain an Accumulated Natural Vector for each sequence. This vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in R^18. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time efficiency of the proposed method.
Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between these vectors measure the dissimilarity between DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
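The following Python sketch illustrates only the first ingredients named above: a nucleotide indicator sequence, its discrete Fourier power spectrum, the running (cumulative) sum of that spectrum, and a Euclidean distance between descriptors. It does not reproduce the 28-dimensional moment vector, the phase spectrum or the covariance terms. The function names and the choice to include the zero-frequency term are assumptions made for illustration.

```python
import cmath
import math


def cumulative_power_spectrum(seq, base):
    """Cumulative DFT power spectrum of the 0/1 indicator sequence of `base`."""
    seq = seq.upper()
    n = len(seq)
    indicator = [1.0 if ch == base else 0.0 for ch in seq]
    cumulative, running = [], 0.0
    for k in range(n):
        # Direct O(n^2) DFT coefficient at frequency k (fine for short sequences)
        coeff = sum(u * cmath.exp(-2j * math.pi * k * t / n)
                    for t, u in enumerate(indicator))
        running += abs(coeff) ** 2
        cumulative.append(running)
    return cumulative


def spectral_descriptor(seq, alphabet="ACGT"):
    """Illustrative descriptor: concatenated cumulative spectra of all bases.

    Its length depends on the sequence length, unlike the fixed-length
    moment vector described in the abstract.
    """
    descriptor = []
    for base in alphabet:
        descriptor.extend(cumulative_power_spectrum(seq, base))
    return descriptor


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Because this illustrative descriptor grows with sequence length, it can only compare sequences of equal length; summarizing the spectra by moments, as the abstract describes, is what allows sequences of different lengths to be compared in a common space.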
Next-generation sequencing technology enables the routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole-genome sequencing has been important in the epidemiologic analysis of bacterial pathogens. However, few whole-genome sequencing-based genotyping pipelines are available for practical applications. Here, we present a whole-genome sequencing-based single nucleotide polymorphism (SNP) genotyping method and apply it to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genotyping method calls genome variants from next-generation sequencing reads of whole genomes and calculates the pairwise Jaccard distances between the variant sets. The method may reveal high-resolution whole-genome SNP profiles and the structural variants of different isolates of methicillin-resistant S. aureus (MRSA) and methicillin-susceptible S. aureus (MSSA) strains. Phylogenetic analysis of whole genomes and of particular regions may help monitor and track the evolution and transmission dynamics of bacterial pathogens. The computer programs of the whole-genome sequencing-based SNP genotyping method are available to the public at https://github.com/
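The pairwise Jaccard distance step can be sketched as follows. This is a minimal illustration, assuming each isolate's called variants are represented as a set of `(position, reference, alternate)` tuples. The isolate names and variants are invented for the example and the variant-calling step itself is not shown.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two variant sets."""
    union = a | b
    if not union:
        # Two empty sets are treated as identical (a convention chosen here)
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(variants_by_isolate):
    """Return {(isolate_x, isolate_y): distance} for every unordered pair."""
    distances = {}
    for x, y in combinations(sorted(variants_by_isolate), 2):
        distances[(x, y)] = jaccard_distance(
            variants_by_isolate[x], variants_by_isolate[y]
        )
    return distances


# Hypothetical variant calls: (position, reference base, alternate base)
isolates = {
    "isolate_A": {(101, "A", "G"), (250, "C", "T"), (900, "G", "A")},
    "isolate_B": {(250, "C", "T"), (900, "G", "A"), (1200, "T", "C")},
    "isolate_C": set(),
}
distances = pairwise_jaccard(isolates)
```

The resulting distance table can be passed to any distance-based tree-building or clustering routine to obtain a phylogeny of the isolates.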