Next-generation sequencing (NGS) technologies have superseded traditional Sanger sequencing approach in many experimental settings, given their tremendous yield and affordable cost. Nowadays it is possible to sequence any microbial organism or meta-genomic sample within hours, and to obtain a whole human genome in weeks. Nonetheless, NGS technologies are error-prone. Correcting errors is a challenge due to multiple factors, including the data sizes, the machine-specific and non-at-random characteristics of errors, and the error distributions. Errors in NGS experiments can hamper the subsequent data analysis and inference. This work proposes an error correction method based on the de Bruijn graph that permits its execution on Gigabyte-sized data sets using normal desktop/laptop computers, ideal for genome sizes in the Megabase range, e.g. bacteria. The implementation makes extensive use of hashing techniques, and implements an A∗ algorithm for optimal error correction, minimizing the distance between an erroneous read and its possible replacement with the Needleman-Wunsch score. Our approach outperforms other popular methods both in terms of random access memory usage and computing times.
|Titolo:||A∗ fast and scalable high-throughput sequencing data error correction via oligomers|
|Autori interni:||MILICCHIO, Franco|
|Data di pubblicazione:||2016|
|Appare nelle tipologie:||4.1 Contributo in Atti di convegno|