Introduction

High-throughput sequencing has rapidly accelerated the discovery of small RNA and DNA viruses, but has produced markedly fewer large virus draft genomes (>101 kb) due to the limitations of de novo assembly techniques. Despite their relatively small genome sizes, viruses such as herpesviruses, adenoviruses, and large bacteriophages are infrequently assembled into complete genomes. Herpesviruses in particular contain a number of features that are challenging for de novo assembly, including high G+C content, numerous homopolymers, inverted structural repeats, and many tandem short sequence repeats (SSRs). To address this, we developed a fully automated computational pipeline for Virus Genome Assembly (VirGA). VirGA comprises four sequential steps that conduct 1) preprocessing of sequencing reads, 2) de novo assembly, 3) genome linearization and annotation, and 4) assembly assessment, allowing high-throughput generation of draft genomes. The pipeline was designed with both desktop PCs and scientific computing clusters in mind, with frequent built-in use of multi-threading and parallelization, and native support for job scheduling and software module systems. Similarly, all steps are designed to serve assembly novices and bioinformaticians alike, with push-button ease of use that still allows in-depth parameter control when desired. Since VirGA is meant as a rapid solution for generating numerous draft genomes, quality control and reporting strategies are implemented at every step. All scripts and settings used for each run are permanently stored with the output, to allow easy record-keeping and preservation of parameters for future replication. Post-assembly computational remedies ameliorate genome gaps and misassemblies, and comparison to a reference strain identifies gross errors in coding regions. Upon the pipeline’s conclusion, a comprehensive HTML report is generated that details assembly metrics and provides helpful visualizations.
VirGA’s throughput and accuracy will allow traditional virology wet labs to easily generate their own draft genomes and discover phenotypic causation through comparative genomics.
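The four-step structure and the practice of storing run settings alongside the output can be illustrated with a minimal Python sketch. This is not VirGA's code: the stage functions, the `run_pipeline` helper, the `min_length` parameter, and the file names `run_parameters.json` and `report.json` are all hypothetical placeholders chosen for illustration, and each stage body is a trivial stand-in for the real work.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the four steps; each takes the running state
# and its own parameters, and returns an updated state.

def preprocess(state, params):
    # Placeholder filter: keep reads at or above a minimum length.
    reads = [r for r in state["reads"] if len(r) >= params.get("min_length", 0)]
    return {**state, "reads": reads}

def assemble(state, params):
    # Placeholder: real de novo assembly is far more involved;
    # here every surviving read is simply treated as a contig.
    return {**state, "contigs": list(state["reads"])}

def linearize_and_annotate(state, params):
    # Placeholder: join contigs in their given order into one sequence.
    return {**state, "genome": "".join(state["contigs"])}

def assess(state, params):
    # Placeholder metrics standing in for an assembly report.
    report = {
        "contig_count": len(state["contigs"]),
        "genome_length": len(state["genome"]),
    }
    return {**state, "report": report}

STAGES = [
    ("preprocess", preprocess),
    ("assemble", assemble),
    ("linearize_and_annotate", linearize_and_annotate),
    ("assess", assess),
]

def run_pipeline(reads, params, out_dir):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Store the exact settings with the output so the run can be replicated.
    (out_dir / "run_parameters.json").write_text(
        json.dumps(params, indent=2, sort_keys=True)
    )
    state = {"reads": list(reads)}
    for name, stage in STAGES:
        state = stage(state, params.get(name, {}))
    (out_dir / "report.json").write_text(json.dumps(state["report"], indent=2))
    return state

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = run_pipeline(
            ["ACGT", "AC", "GGCCA"], {"preprocess": {"min_length": 3}}, tmp
        )
        print(result["report"])
```

Writing the parameters before any stage runs means the record survives even if a later stage fails, which is one way to keep every run traceable.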