[BioRuby] Benchmarking FASTA file parsing

Sun Aug 15 05:58:35 UTC 2010

On Sat, 14 Aug 2010 23:52:57 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:

> To my understanding, the subparsing of the definition occurs only
> when needed, ie when entry_id, identifiers, gi, etc. is called, in  
> current code.
> If only definition is called, it is not further parsed.

Right.

> Careful coding to reduce object creation might contribute to speed up.
> One of questionable variable is
> @entry_overrun
> Is this variable and attr_reader :entry_overrun
> really required yet or is just a trace of older code? > Goto-San

@entry_overrun serves two purposes.

1. Adjustment of file position.
The separator used to read a fasta entry is "\n>", but the ">"
belongs to the next entry. To compensate, the trailing ">" is
stored in @entry_overrun. The Bio::FlatFile wrapper uses the
content of @entry_overrun on the next read.
It is also used to get correct file positions when
indexing fasta files.
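
As a rough sketch of this adjustment (not the actual Bio::FlatFile
code; the method name split_fasta_chunks and the local variable
overrun are made up for illustration), reading with the "\n>"
separator and handing the trailing ">" over to the next entry could
look like this:

```ruby
require 'stringio'

# Hypothetical helper, NOT the BioRuby implementation.
# Reads chunks separated by "\n>" and moves the trailing ">" of each
# chunk to the front of the next one.
def split_fasta_chunks(io)
  overrun = ''   # plays the role of @entry_overrun
  chunks = []
  while (raw = io.gets("\n>"))
    chunk = overrun + raw
    if chunk.end_with?("\n>")
      chunk = chunk[0...-1]   # the last ">" belongs to the next entry
      overrun = '>'
    else
      overrun = ''            # last entry: nothing to carry over
    end
    chunks << chunk
  end
  chunks
end

io = StringIO.new(">test1\nATATATAT\n>test2\nGCGCGCGC\n")
chunks = split_fasta_chunks(io)
chunks.each { |c| p c }
```

Each chunk then starts with its own ">" and ends before the next one.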

2. Integrity of data format
In Bio::FastaFormat.new(str), if str contains two or more
fasta entries, a naive parser could produce a wrong sequence.
For example, for ">test1\nATATATAT\n>test2\nGCGCGCGC\n",
the sequence could become "ATATATAT>test2GCGCGCGC" unless the
trailing entries are cut off. In addition, storing the
removed part in @entry_overrun may help users debug their
code and might prevent data loss.
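
The difference can be reproduced with plain strings. This is only a
sketch: the "naive parser" here is an assumption (drop the definition
line, then strip whitespace), and the variable names are made up for
illustration.

```ruby
str = ">test1\nATATATAT\n>test2\nGCGCGCGC\n"

# Assumed naive parser: drop the definition line, strip all whitespace.
body = str.sub(/\A>.*\n/, '')
naive_seq = body.gsub(/\s+/, '')

# With the cut: remove everything from the next line-initial ">" onward
# and keep the removed part (nil when there is no trailing entry).
overrun = body.slice!(/^>.*/m)
clean_seq = body.gsub(/\s+/, '')

puts naive_seq        # ATATATAT>test2GCGCGCGC
puts clean_seq        # ATATATAT
puts overrun.inspect  # ">test2\nGCGCGCGC\n"
```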

In the current code, both 1 and 2 are done at once
by the lines
     @data.sub!(/^>.*/m, '')  # remove trailing entries for sure
     @entry_overrun = $&

Step 1 could be skipped when reading all data at once without
tracking file positions. Step 2 could be skipped if we can ignore
the mistake of passing two or more entries to
Bio::FastaFormat.new.
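
For instance, when the whole text is already in memory and file
positions do not matter, the entries could be split in one pass, so
no ">" needs to be carried over. This is just a sketch with made-up
variable names, not something BioRuby currently does:

```ruby
text = ">test1\nATATATAT\n>test2\nGCGCGCGC\n"

# Split at every line that starts with ">", keeping the ">" with its
# own entry (zero-width lookahead, so nothing is consumed).
entries = text.split(/^(?=>)/)

entries.each { |e| p e }
```

Each element holds exactly one entry, so the cutting in 2 would have
nothing to remove.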

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org