Background: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied families of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.

Results: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments.

Conclusion: The performance of CASTOR, together with its genericity and robustness, could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.
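To make the in silico digestion step concrete, the following is a minimal Python sketch of how a genome could be turned into an RFLP-style feature vector. It is an illustration under stated assumptions, not CASTOR's implementation: the abstract does not name the two metrics, so fragment count and mean fragment length are used here as stand-ins; the enzyme panel is a small hand-picked set of well-known enzymes; the genome is treated as linear; and the names `ENZYMES`, `digest` and `feature_vector` are hypothetical.

```python
from statistics import mean

# Assumed enzyme panel: recognition site and the offset inside the site
# where the top strand is cut (e.g. EcoRI cuts G^AATTC, offset 1).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
    "HaeIII": ("GGCC", 2),
}


def digest(sequence, site, offset):
    """Return fragment lengths after cutting a linear sequence at every site."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + offset)
        # Advance by one base so overlapping occurrences are also found.
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def feature_vector(sequence):
    """Concatenate two assumed metrics per enzyme: fragment count, mean length."""
    features = []
    for name in sorted(ENZYMES):  # fixed enzyme order keeps vectors comparable
        site, offset = ENZYMES[name]
        lengths = digest(sequence, site, offset)
        features.append(len(lengths))
        features.append(mean(lengths) if lengths else 0.0)
    return features
```

Vectors built this way for a set of labelled genomes could then be passed to any standard classifier. Which enzymes and which metrics are actually informative for a given virus family is a question this sketch leaves open.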