Host proteins were anticipated to be uninformative for the classification of sequences

We used the hmmsearch algorithm in HMMER3 to assess the ability of each vFam to correctly recruit the viral sequences removed from it, simulating the detection of an unknown virus. We additionally compared the E-value of each removed sequence to the E-values for known non-viral sequences as well as the viral sequences from other vFams. These non-viral sequences were drawn from a large collection of ~150,000 full-length protein sequences from well-studied prokaryotic and eukaryotic organisms, such as H. sapiens and E. coli, potentially found in the metagenomes of eukaryotic hosts. Each vFam was given a "recall" score, representing the fraction of left-out sequences recalled with an E-value ≤10. Each vFam was additionally given a "strict recall" score, representing the fraction of left-out sequences recalled with an E-value lower than those of all non-viral sequences and all viral sequences from other vFams. In cross-validation of the full set of vFams, 96% of the sequences were recalled and 91% of the sequences were strictly recalled by the correct vFam. While 4,337 of the vFams recalled 100% of their left-out sequences in cross-validation tests, only a subset of 3,931 vFams also strictly recalled 100% of their left-out sequences. This difference in the ability to distinguish viral sequences could be attributed to less robust vFams as well as to vFams derived from sequences with non-viral homologs. vFams comprising viral homologs of host proteins were anticipated to be uninformative for the classification of sequences as viral or non-viral. Non-robust vFams built from sequences that may not be true homologs of one another could be equally uninformative. We therefore used the results of the cross-validation experiment to filter the set of vFams to those predicted to perform well in the context of experimental metagenomic datasets.
Specifically, we retained those vFams that strictly recalled at least 80% of the training sequences in the cross-validation tests. We chose a threshold of 80% to maintain broad coverage of viral taxonomy without a large sacrifice in vFam performance.
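As an illustration only, the two scores and the 80% filter described above could be computed along the lines of the following Python sketch. It is not the code used in the study. The container `CrossValidationResult`, its field names, and the function names are hypothetical. The sketch also assumes that a sequence with no reported hit is represented by an E-value of infinity, and that a strictly recalled sequence must also satisfy the E-value ≤10 cutoff; the text does not state either point explicitly.

```python
from dataclasses import dataclass
from typing import Sequence

E_VALUE_CUTOFF = 10.0           # recall cutoff stated in the text (E-value <= 10)
STRICT_RECALL_THRESHOLD = 0.80  # retention threshold stated in the text


@dataclass
class CrossValidationResult:
    """Hypothetical container for one vFam's cross-validation E-values."""
    name: str
    # E-values of the sequences left out of this vFam, scored against it.
    left_out_evalues: Sequence[float]
    # E-values of non-viral sequences and of viral sequences from other
    # vFams, scored against this same vFam.
    background_evalues: Sequence[float]


def recall(result: CrossValidationResult) -> float:
    """Fraction of left-out sequences with an E-value at or below the cutoff."""
    if not result.left_out_evalues:
        return 0.0
    hits = sum(e <= E_VALUE_CUTOFF for e in result.left_out_evalues)
    return hits / len(result.left_out_evalues)


def strict_recall(result: CrossValidationResult) -> float:
    """Fraction of left-out sequences that beat every background sequence.

    Assumption: the sequence must also pass the ordinary recall cutoff.
    """
    if not result.left_out_evalues:
        return 0.0
    # Lowest (best) background E-value; infinity when nothing else hit.
    best_background = min(result.background_evalues, default=float("inf"))
    hits = sum(
        e <= E_VALUE_CUTOFF and e < best_background
        for e in result.left_out_evalues
    )
    return hits / len(result.left_out_evalues)


def passes_filter(result: CrossValidationResult) -> bool:
    """Retain a vFam that strictly recalls at least 80% of its sequences."""
    return strict_recall(result) >= STRICT_RECALL_THRESHOLD


# Toy data: four left-out sequences and two background hits.
demo = CrossValidationResult(
    name="vFam_demo",
    left_out_evalues=[1e-50, 1e-5, 2.0, 50.0],
    background_evalues=[1e-3, 5.0],
)
print(recall(demo), strict_recall(demo), passes_filter(demo))
```

With this toy input, three of the four left-out sequences fall at or below the cutoff, but only two score better than the best background hit (1e-3), so the hypothetical vFam would be recalled at 75%, strictly recalled at 50%, and removed by the filter.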
