Classification of DNA sequences is an important problem in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are now popular, but the manual intervention they require usually decreases accuracy. In addition, most methods neglect the interactions among nucleotides. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence as a point in R^18. By calculating the Accumulated Indicator Functions of the nucleotides, we obtain an Accumulated Natural Vector for each sequence. This vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Global comparison of DNA sequences or genomes can thus be done easily in R^18. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time-efficiency of the proposed method.
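The abstract does not state the formulas behind the ANV, so the following is only a rough sketch of the accumulated-indicator idea. It assumes that the accumulated indicator function of a nucleotide is the running count of its occurrences along the sequence, and that the covariance among nucleotides can be illustrated by the covariance of these running counts. The function names are hypothetical, and the sketch does not reproduce the 18 components of the published vector.

```python
import numpy as np

NUCLEOTIDES = "ACGT"


def accumulated_indicators(seq):
    """Return a 4 x len(seq) array; row k is the running count of NUCLEOTIDES[k]."""
    seq = seq.upper()
    # Indicator function: 1 where the nucleotide occurs at a position, else 0.
    indicators = np.array(
        [[1 if base == nt else 0 for base in seq] for nt in NUCLEOTIDES]
    )
    # Assumed accumulated indicator function: cumulative sum along the sequence.
    return np.cumsum(indicators, axis=1)


def accumulated_covariance(seq):
    """4 x 4 population covariance matrix of the accumulated indicator functions."""
    acc = accumulated_indicators(seq)
    # Rows are variables (nucleotides), columns are positions; bias=True divides by N.
    return np.cov(acc, bias=True)


if __name__ == "__main__":
    print(accumulated_indicators("ACGT"))
    print(accumulated_covariance("ACGTTGCA"))
```

The six distinct off-diagonal entries of the covariance matrix correspond to the six nucleotide pairs, which is one plausible way pairwise interactions could enter a fixed-length vector.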
Genome comparison is a vital research area in bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between these vectors measure the dissimilarity between DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
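As a hedged illustration of two ingredients named above, the sketch below computes a cumulative Fourier power spectrum for one nucleotide and the pairwise Euclidean distances between feature vectors. It assumes the spectrum is taken over the 0/1 indicator sequence of the nucleotide and that the zero-frequency term is included; the paper may define it differently. The construction of the 28-dimensional moment vector is not given in the abstract and is not attempted here. Function names are hypothetical.

```python
import numpy as np


def cumulative_power_spectrum(seq, nucleotide):
    """Cumulative sum of the Fourier power spectrum of a nucleotide's indicator sequence."""
    indicator = np.array(
        [1.0 if base == nucleotide else 0.0 for base in seq.upper()]
    )
    spectrum = np.fft.fft(indicator)   # discrete Fourier transform
    power = np.abs(spectrum) ** 2      # power spectrum |U(f)|^2
    return np.cumsum(power)            # assumed cumulative form, DC term included


def euclidean_distance_matrix(vectors):
    """Pairwise Euclidean distances between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    diff = v[:, None, :] - v[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


if __name__ == "__main__":
    print(cumulative_power_spectrum("ACGTACGT", "A"))
    # Toy 2-dimensional vectors standing in for the 28-dimensional ones.
    print(euclidean_distance_matrix([[0.0, 0.0], [3.0, 4.0]]))
```

Because the spectra of sequences of different lengths have different lengths, a fixed-length summary such as a moment vector is needed before distances can be taken; the distance function above applies only once such vectors are available.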
Next-generation sequencing technology enables the routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole-genome sequencing has been important in the epidemiologic analysis of bacterial pathogens. However, few whole-genome sequencing-based genotyping pipelines are available for practical applications. Here, we present a whole-genome sequencing-based single nucleotide polymorphism (SNP) genotyping method and apply it to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genotyping method calls genome variants from next-generation sequencing reads of whole genomes and calculates the pairwise Jaccard distances between the sets of genome variants. The method may reveal high-resolution whole-genome SNP profiles and structural variants of different isolates of methicillin-resistant S. aureus (MRSA) and methicillin-susceptible S. aureus (MSSA) strains. Phylogenetic analysis of whole genomes and particular regions may help monitor and track the evolution and transmission dynamics of bacterial pathogens. The computer programs of the whole-genome sequencing-based SNP genotyping method are available to the public at https://github.com/
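The pairwise Jaccard distance step can be sketched as follows. This is a minimal illustration, not the published pipeline: it assumes each isolate's called variants are represented as a set of hypothetical (position, reference, alternate) tuples, and the isolate names and variants are invented toy data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B| between two variant sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty variant sets are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(variant_sets):
    """Map each unordered pair of isolate names to its Jaccard distance."""
    names = sorted(variant_sets)
    return {
        (x, y): jaccard_distance(variant_sets[x], variant_sets[y])
        for x, y in combinations(names, 2)
    }


if __name__ == "__main__":
    # Toy variant calls: (position, reference base, alternate base).
    isolates = {
        "isolate_1": {(101, "A", "G"), (250, "C", "T")},
        "isolate_2": {(101, "A", "G"), (300, "G", "A")},
        "isolate_3": {(900, "T", "C")},
    }
    for pair, dist in pairwise_jaccard(isolates).items():
        print(pair, round(dist, 3))
```

The resulting distances can be passed to any standard distance-based tree-building or clustering routine.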
Myxobacteria are social bacteria that can glide in two dimensions and form counterpropagating, interacting waves. Here, we present a novel age-structured, continuous macroscopic model for the movement of myxobacteria. The derivation is based on microscopic interaction rules that can be formulated as a particle-based model and set within the Self-Organized Hydrodynamics (SOH) framework. The strength of this combined approach is that microscopic knowledge or data can be incorporated easily into the particle model, whilst the continuous model allows for easy numerical analysis of the different effects. However, we found that the derived macroscopic model lacks a diffusion term in the density equations, which is necessary to control the number of waves, indicating that a higher-order approximation during the derivation is crucial. Upon ad hoc addition of the diffusion term, we found very good agreement between the age-structured model and the biology. In particular, we analyzed the influence of a refractory (insensitivity) period following a reversal of movement. Our analysis reveals that the refractory period is not necessary for wave formation but is essential to wave synchronization, indicating separate molecular mechanisms.
Although deep learning approaches have had tremendous success in image, video and audio processing, computer vision, and speech recognition, their application to three-dimensional (3D) biomolecular structural data sets has been hindered by the geometric and biological complexity of such data. To address this problem, we introduce the element-specific persistent homology (ESPH) method. ESPH represents 3D complex geometry by one-dimensional (1D) topological invariants and retains important biological information via a multichannel image-like representation. This representation reveals hidden structure-function relationships in biomolecules. We further integrate ESPH and deep convolutional neural networks to construct a multichannel topological neural network (TopologyNet) for the prediction of protein-ligand binding affinities and protein stability changes upon mutation. To overcome the limitations that small and noisy training sets impose on deep learning, we propose a multi-task multichannel topological convolutional neural network (MM-TCNN). We demonstrate that TopologyNet outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation-induced globular protein folding free energy changes, and mutation-induced membrane protein folding free energy changes. Availability: weilab.math.msu.edu/TDL/