Virus classification in 60-dimensional protein space

Yongkun Li, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
Kun Tian, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
Changchuan Yin, Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
Rong Lucy He, Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
Stephen S.-T. Yau, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China

Statistics Theory and Methods mathscidoc:1611.33001

Molecular Phylogenetics and Evolution, 2016, (99), 10, 2016.3
Because sequences diverge so widely among viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. Alignment-free methods for whole-genome comparison and phylogenetic tree reconstruction have therefore attracted increasing attention. Among these methods, the recently proposed "Natural Vector (NV) representation" has been used successfully to study the phylogeny of multi-segmented viruses in a 12-dimensional genome space derived from the nucleotide sequence structure. However, whether proteomes are preferable to genomes for determining viral phylogeny has not been investigated in depth. As the translated products of genes, proteins directly shape the viral structure and are vital to all metabolic pathways. In this study, we combine the NV representation of a protein sequence with the Hausdorff distance, which is suited to comparing point sets, to construct a 60-dimensional protein space, and we use it to analyze the evolutionary relationships of 4021 viruses from their whole proteomes in the current NCBI Reference Sequence Database (RefSeq). We also use the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. For Baltimore II viruses, for example, prediction accuracy is as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we find that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates obtained with proteomes are still comparable to those obtained with genomes.
Virus classification; Hausdorff distance; Natural vector; Natural graphical representation
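
The abstract does not spell out how a protein becomes a point in 60-dimensional space, so the following is only a minimal sketch under stated assumptions. It assumes the commonly used natural vector definition: for each of the 20 standard amino acids, the count, the mean position, and the normalized second central moment of its positions, giving 20 x 3 = 60 components. A proteome is then treated as a set of such points, and two proteomes are compared with the Hausdorff distance under the Euclidean metric. The function names (`natural_vector`, `hausdorff`, `proteome_points`), the component ordering and the handling of non-standard residues are illustrative choices, not taken from the paper.

```python
import math

# Assumption: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(seq):
    """Return a 60-dimensional vector (counts, mean positions, second moments).

    Positions are 1-based. Residues outside AMINO_ACIDS are ignored when
    counting, but still contribute to the sequence length n (an assumption).
    """
    n = len(seq)
    positions = {aa: [] for aa in AMINO_ACIDS}
    for i, residue in enumerate(seq.upper(), start=1):
        if residue in positions:
            positions[residue].append(i)

    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        pos = positions[aa]
        n_k = len(pos)
        if n_k == 0:
            # Absent amino acid: all three components are set to zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(pos) / n_k
        # Second central moment, normalized by n_k * n.
        d2 = sum((p - mu) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys.
        return max(min(euclidean(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


def proteome_points(proteins):
    """Map a list of protein sequences to a list of natural vectors."""
    return [natural_vector(p) for p in proteins]


if __name__ == "__main__":
    # Toy sequences, purely illustrative; not real viral proteins.
    virus_a = proteome_points(["MKTAYIAK", "MSLLTEVE"])
    virus_b = proteome_points(["MKTAYLAK"])
    print(hausdorff(virus_a, virus_b))
```

This brute-force version compares every pair of points, which is adequate for small proteomes; a dataset on the scale described in the abstract would likely call for vectorized distance computations.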
@article{yongkun2016virus,
  title={Virus classification in 60-dimensional protein space},
  author={Li, Yongkun and Tian, Kun and Yin, Changchuan and He, Rong Lucy and Yau, Stephen S.-T.},
  url={http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20161126210635333180661},
  journal={Molecular Phylogenetics and Evolution},
  volume={2016},
  number={99},
  pages={10},
  year={2016},
}
Yongkun Li, Kun Tian, Changchuan Yin, Rong Lucy He, and Stephen S.-T. Yau. Virus classification in 60-dimensional protein space. 2016. Vol. 2016. In Molecular Phylogenetics and Evolution. pp.10. http://archive.ymsc.tsinghua.edu.cn/pacm_paperurl/20161126210635333180661.