A group of scientists from the US Food and Drug Administration's (FDA) Division of Bioinformatics and Biostatistics has reviewed the algorithms used to align the data generated by next-generation sequencing (NGS) tests to reference genomes.
NGS tests hold great promise for precision medicine due to their ability to quickly sequence the human genome and identify thousands of genetic variants.
By identifying genetic variants that are associated with different diseases, researchers hope to develop more targeted treatments or identify which patients will respond better to existing treatments.
In recent years, the cost of genetic sequencing has decreased rapidly, from billions of dollars to as low as $1000. Because of the shift in cost and availability of NGS platforms, the authors say, "The bottleneck of genomics has shifted from sequencing experiments to analyzing and interpreting the sequencing data."
According to the authors, efficiently and accurately mapping the data generated by NGS tests to the human genome "is one of the major challenges in NGS data analysis."
When sequencing an individual's genome, cells are collected and broken into fragments before a polymerase chain reaction is performed to create a library that will be read by an NGS platform.
The resulting sequence includes millions of "raw reads" containing anywhere from a few hundred to tens of thousands of base pairs and can take up several gigabytes of memory.
Before further analyses can be done to detect genetic variations, the data from the raw reads must be aligned to a reference human genome. This is done using one of many recently developed alignment algorithms.
Choosing an appropriate algorithm and its parameters, the authors say, is an essential step in the process of analyzing genetic data for all applications.
In their review, the authors discuss the two primary strategies for aligning reads and look at various algorithms falling under each strategy.
The first strategy, "seed-and-extend," takes short sequences (seeds) from a read and looks for exact or inexact matches between those seeds and the reference sequence; each match is then extended to cover the rest of the read.
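To make the seed-and-extend idea concrete, here is a minimal Python sketch. It is not taken from the review or from any particular aligner: the function names (`build_index`, `seed_and_extend`) and the parameters `k` and `max_mismatches` are hypothetical, and the extension step only counts mismatches without gaps, whereas production aligners typically extend seeds with gapped dynamic programming.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches):
    """Return (start, mismatches) for the best ungapped placement, or None.

    Assumption for illustration: extension is a plain mismatch count over
    the full read; insertions and deletions are not handled.
    """
    best = None
    tried = set()  # candidate start positions already extended
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in tried:
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                if best is None or mismatches < best[1]:
                    best = (start, mismatches)
    return best


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_index(reference, k=4)
read = "TAGCCTATAG"  # reference[8:18] with one substituted base
print(seed_and_extend(read, reference, index, k=4, max_mismatches=2))
```

Shorter seeds find more candidate positions and so raise sensitivity, but each extra candidate costs an extension step, which is one simple way to see the speed and sensitivity trade-off.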
The second strategy looks for sequences that are within a certain number of editing operations (insertion, deletion and substitution) of each other and matches these sequences to small read fragments called q-grams.
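One common way to read the q-gram approach is as a filter: if a read of length m matches a reference window within e edits, the two must share at least m + 1 - q(e + 1) q-grams, so windows that share fewer can be discarded before the expensive edit-distance check. The Python sketch below illustrates that idea under simplifying assumptions. The function names and parameters are hypothetical, and it compares the read only against reference windows of the same length as the read, which real tools do not require.

```python
from collections import Counter


def qgrams(s, q):
    """Count every substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counting repeats."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def qgram_filter_align(read, reference, q, max_edits):
    """Return [(start, distance)] for windows within max_edits of the read."""
    # q-gram lemma threshold; if it is not positive the filter cannot
    # reject anything and every window is verified.
    threshold = len(read) + 1 - q * (max_edits + 1)
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if threshold > 0 and shared_qgrams(read, window, q) < threshold:
            continue  # too few shared q-grams: skip the costly check
        distance = edit_distance(read, window)
        if distance <= max_edits:
            hits.append((start, distance))
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCTATAG"  # reference[8:18] with one substituted base
print(qgram_filter_align(read, reference, q=3, max_edits=1))
```

The filter never discards a true match within the allowed number of edits; it only saves work, and how much it saves depends on the choice of q and the edit budget.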
In choosing an appropriate algorithm, researchers must determine the appropriate trade-off between the speed and sensitivity of the algorithm. Because roughly 50% of the human genome has repeated or similar genetic sequences, mapping reads to a reference genome can be difficult and time-consuming.
Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine