Title | ||
---|---|---|
FragGeneScan-Plus for scalable high-throughput short-read open reading frame prediction |
Abstract | ||
---|---|---|
A fundamental step in the analysis of environmental sequence information is the prediction of potential genes or open reading frames (ORFs) encoding the metabolic potential of individual cells and entire microbial communities. FragGeneScan, a software tool designed to predict intact and incomplete ORFs on short sequencing reads, combines codon usage bias, sequencing error models, and start/stop codon patterns in a hidden Markov model to find the most likely path of hidden states for a given input sequence. It provides a promising route for gene recovery in environmental datasets with incomplete assemblies. However, the current implementation of FragGeneScan does not scale efficiently with increasing input data size. Thus, FragGeneScan cannot be applied to contemporary environmental datasets, which can exceed hundreds of Gb. Here, we present FragGeneScan-Plus, an improved implementation of the FragGeneScan gene prediction model that leverages algorithmic thread synchronization and efficient in-memory data management to utilize multiple CPU cores without blocking I/O operations. FragGeneScan-Plus can process data approximately 5 times faster than FragGeneScan using a single core and approximately 50 times faster using eight hyper-threaded cores when benchmarked against simulated and real-world environmental datasets. |
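
The abstract's phrase "the most likely path of hidden states" refers to Viterbi-style decoding of a hidden Markov model. The sketch below illustrates that idea on a toy model only. The two states, the nucleotide emission table, and every probability are hypothetical values chosen for illustration; they are not FragGeneScan's actual model, which also incorporates codon usage bias, sequencing error models, and start/stop codon patterns.

```python
import math

# Hypothetical two-state toy model; all probabilities are illustrative
# assumptions and are NOT taken from FragGeneScan.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(observations, states, start, trans, emit):
    """Return (log_probability, path) of the most likely hidden-state path."""
    if not observations:
        return 0.0, []

    # best[s]: log-probability of the best path so far that ends in state s.
    best = {
        s: math.log(start[s]) + math.log(emit[s][observations[0]])
        for s in states
    }
    backpointers = []  # one dict per step: state -> best predecessor

    for obs in observations[1:]:
        prev = best
        best = {}
        pointers = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            p = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            best[s] = prev[p] + math.log(trans[p][s]) + math.log(emit[s][obs])
            pointers[s] = p
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


if __name__ == "__main__":
    log_prob, path = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
    print(log_prob, path)
```

With these illustrative numbers, the read `AATTGGCC` is decoded as four `noncoding` positions followed by four `coding` positions. Working in log space avoids numerical underflow on longer reads.
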
Year | DOI | Venue |
---|---|---|
2015 | 10.1109/CIBCB.2015.7300341 | 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) |
Keywords | Field | DocType |
FragGeneScan-Plus,scalable high-throughput short-read open reading frame prediction,environmental sequence information,potential genes,open reading frames,metabolic potential,individual cells,microbial communities,software,short sequencing reads,codon usage bias,sequencing error models,hidden Markov model,hidden states,gene recovery,environmental datasets,gene prediction model,algorithmic thread synchronization,in-memory data management,multiple CPU cores | Single-core,Data mining,Computer science,Open reading frame,Artificial intelligence,Multi-core processor,Gene prediction,Bioinformatics,Hidden Markov model,Synchronization (computer science),Machine learning,Encoding (memory),Scalability | Conference |
Citations | PageRank | References |
0 | 0.34 | 9 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Dongjae Kim | 1 | 4 | 1.46 |
Aria S. Hahn | 2 | 4 | 1.80 |
Shang-Ju Wu | 3 | 3 | 0.76 |
Niels W. Hanson | 4 | 29 | 3.58 |
Kishori M. Konwar | 5 | 107 | 17.49 |
Steven J. Hallam | 6 | 34 | 3.97 |