The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.

Citation: Hubisz MJ, Lin MF, Kellis M, Siepel A (2011) Error and Error Mitigation in Low-Coverage Genome Assemblies. PLoS ONE 6(2):

Editor: Thomas Mailund, Aarhus University, Denmark

Received: Novem Accepted: Janu Published: February 14, 2011

Copyright: © 2011 Hubisz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by NSF Faculty Early Career Development grants DBI-0644111 (AS) and DBI-0644282 (MK, MFL), NIH grant U54 HG004555-01 (MK, MFL), and a David and Lucile Packard Fellowship for Science and Engineering (AS, MJH). No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Species and genome assemblies considered in this study.

While these low-coverage assemblies are valuable for many purposes, reduced sequencing redundancy has some inevitable costs. For example, low-coverage assemblies have decreased levels of contiguity, which can severely limit their usefulness in identifying rearrangements, duplications, and repetitive elements. In addition, low-coverage assemblies necessarily have elevated levels of sequencing error, that is, miscalled bases and erroneous insertions and deletions, which might otherwise be corrected through redundant sequencing of the same genomic region. This issue of sequencing error in 2× genomes is our focus in this article. To our knowledge, this issue has not been studied in detail, despite its potential importance in many applications in comparative genomics, including phylogenetic modeling, the detection of positive selection, and comparative gene finding. Several potential limitations of low-coverage assemblies were examined by Margulies et al.
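The abstract reports that most errors come from a small fraction of low-quality bases and that sequencing error is well predicted by quality scores. As a rough illustration of how quality scores relate to an error rate per kilobase, and of what masking low-quality bases could look like, here is a minimal Python sketch. It is not the authors' SEM method: the function names, the threshold of 20, and the use of `N` as the masking character are assumptions made for this example. The only standard fact relied on is the Phred definition, under which a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability (standard definition)."""
    return 10 ** (-q / 10)


def expected_errors_per_kb(quals):
    """Expected number of miscalled bases per 1000 bases, given per-base Phred scores."""
    if not quals:
        return 0.0
    total = sum(phred_to_error_prob(q) for q in quals)
    return 1000 * total / len(quals)


def mask_low_quality(seq, quals, min_q=20):
    """Replace bases whose quality is below min_q with 'N'.

    min_q=20 is an illustrative threshold, not a value taken from the article.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))


if __name__ == "__main__":
    seq = "ACGTACGT"
    quals = [40, 35, 12, 30, 8, 25, 40, 15]  # made-up scores for illustration
    print(mask_low_quality(seq, quals))
    print(expected_errors_per_kb(quals))
```

Masking at a fixed threshold trades some correct bases for fewer errors, which is the kind of over-correction cost the abstract mentions. A realistic pipeline would also need to handle indels and alignment context, which this sketch ignores.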