Open Access Highly Accessed Research

How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?

Xiaoqing Yu1, Kishore Guda2,3, Joseph Willis4, Martina Veigl2, Zhenghe Wang5, Sanford Markowitz3, Mark D Adams5 and Shuying Sun1,2*

Author Affiliations

1 Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, 44106, USA

2 Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, 44106, USA

3 Department of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA

4 Department of Pathology, Case Western Reserve University, Cleveland, OH, 44106, USA

5 J. Craig Venter Institute, 10355 Science Center Dr, San Diego, CA, 92121, USA

BioData Mining 2012, 5:6  doi:10.1186/1756-0381-5-6

Published: 18 June 2012

Abstract

Background

Next-generation sequencing technologies generate a large number of short reads that are used to address a variety of biological questions. However, sequencing reads often have low quality at the 3’ end, and many are generated from repetitive regions of a genome. It is unclear how different alignment programs perform in these two cases. To investigate this question, we use both real and simulated data with these issues to evaluate the performance of four commonly used algorithms: SOAP2, Bowtie, BWA, and Novoalign.

Methods

The performance of the alignment algorithms is measured in two ways: by the concordance between each pair of aligners (for real sequencing data, where the true read origins are unknown) and by the accuracy of alignment for simulated reads.
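
The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define concordance precisely, so here it is assumed to be the fraction of reads aligned by both programs that are placed at the same location, and accuracy is assumed to be the fraction of aligned simulated reads placed at their true origin. The function names and the (chromosome, position, strand) representation are hypothetical.

```python
from itertools import combinations


def pairwise_concordance(alignments):
    """Concordance for every pair of aligners.

    alignments: dict mapping aligner name -> dict mapping read ID ->
    (chromosome, position, strand). Only reads aligned by BOTH programs
    are compared (an assumption; the paper may define this differently).
    Returns a dict keyed by (aligner_a, aligner_b) in sorted name order.
    """
    result = {}
    for a, b in combinations(sorted(alignments), 2):
        shared = alignments[a].keys() & alignments[b].keys()
        if not shared:
            result[(a, b)] = None  # no reads in common, nothing to compare
            continue
        agree = sum(1 for r in shared if alignments[a][r] == alignments[b][r])
        result[(a, b)] = agree / len(shared)
    return result


def simulated_accuracy(aligned, truth):
    """Fraction of aligned simulated reads placed at their true origin.

    aligned and truth both map read ID -> (chromosome, position, strand).
    """
    if not aligned:
        return None
    correct = sum(1 for r, loc in aligned.items() if truth.get(r) == loc)
    return correct / len(aligned)
```

In practice the read locations would be parsed from each aligner's output; that step is omitted here because it depends on the output format of each program.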

Results

Our results show that, for sequencing data with reads that have relatively good quality or that have had low quality bases trimmed off, all four alignment programs perform similarly. We also show that trimming off low quality ends markedly increases the number of aligned reads and improves the consistency among different aligners, especially for low quality data. Novoalign, however, is more sensitive to improvements in data quality: trimming off low quality ends significantly increases the concordance between Novoalign and the other aligners. For reads from repetitive regions, our simulation data show that such reads tend to be aligned incorrectly, and that suppressing reads with multiple hits can improve alignment accuracy.
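
The two preprocessing ideas in these results, trimming low quality 3’ ends and suppressing reads with multiple hits, can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding and a cutoff of Q20, neither of which is stated in the abstract, and it uses a simple rule (drop the trailing run of low quality bases) that is not necessarily the trimming method used in the study. The function names are hypothetical.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove the trailing run of bases whose quality is below min_q.

    seq: read sequence; qual: quality string of the same length.
    offset=33 assumes Phred+33 encoding; min_q=20 is an illustrative cutoff.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def suppress_multi_hits(hits):
    """Keep only reads that align to exactly one location.

    hits: dict mapping read ID -> list of candidate locations.
    Returns a dict mapping read ID -> its single location.
    """
    return {read: locs[0] for read, locs in hits.items() if len(locs) == 1}
```

For example, a read with quality string `IIIII###` (five Q40 bases followed by three Q2 bases) would be cut back to its first five bases, and a read reported at two genomic locations would be dropped from the filtered set.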

Conclusions

This study provides a systematic comparison of commonly used alignment algorithms in the context of sequencing data with varying qualities and from repetitive regions. Our approach can be applied to different sequencing data sets generated from different platforms. It can also be utilized to study the performance of other alignment programs.

Keywords:
Next generation sequencing; Alignment; Sequencing quality; SOAP2; Bowtie; BWA; Novoalign