Analysis of Short-read Aligners using Genome Sequence Complexity

Abstract

Next-generation sequencing technologies can provide large numbers of short reads inexpensively and accurately. Researchers have proposed many different methods to align short reads to reference genomes. Nevertheless, long repeats, which are known to be abundant in eukaryotic genomes, have caused considerable difficulty for genome assembly methods that rely on short-read alignment. Although a few researchers have studied the sequence complexity of genomes in terms of repeats, none have quantitatively related such complexity to the difficulty of short-read alignment and assembly. In this paper, we investigate several measures of genome sequence complexity with the goal of quantifying the difficulty of short-read alignment. Using genomic data from 17 different organisms and testing against 12 state-of-the-art short-read aligners, we found a very strong correlation between the performance of virtually all of these aligners and measures of genome sequence complexity. Further, we show how these measures might be used to analyze and predict the performance of aligners and, more importantly, to select the best aligners for specific genomes.

Publication
Proceedings of The 12th IEEE International Conference on Knowledge and Systems Engineering