[BioRuby] Benchmarking FASTA file parsing
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Sun Aug 15 01:58:35 EDT 2010
On Sat, 14 Aug 2010 23:52:57 +0900
Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
> To my understanding, the subparsing of the definition occurs only
> when needed, ie when entry_id, identifiers, gi, etc. is called, in
> current code.
> If only definition is called, it is not further parsed.
Right.
> Careful coding to reduce object creation might contribute to speed up.
> One of questionable variable is
> @entry_overrun
> Is this variable and attr_reader :entry_overrun
> really required yet or is just a trace of older code? > Goto-San
The @entry_overrun has two purposes.
1. Adjustment of file position.
The separator used to read a fasta entry is "\n>", but the ">"
belongs to the next entry. To adjust for this, the trailing
">" is stored in @entry_overrun. The Bio::FlatFile wrapper
uses the content of @entry_overrun on the next read.
In addition, it is used to get correct file positions when
indexing fasta files.
2. Integrity of data format
In Bio::FastaFormat.new(str), if the str contains two or more
fasta entries, the sequence could be wrong with a naive parser.
For example, for ">test1\nATATATAT\n>test2\nGCGCGCGC\n",
the sequence could be "ATATATAT>test2GCGCGCGC" without cutting
off the trailing entries. In addition, storing the
removed text in @entry_overrun may help users debug their
code and might prevent data loss.
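To illustrate point 1, here is a minimal sketch in plain Ruby. It is
not the Bio::FlatFile implementation; the method name each_fasta_chunk
and the use of StringIO are assumptions made for the example. It reads
with "\n>" as the separator, moves the trailing ">" to the next entry,
and uses the carried-over length to compute where each entry really
starts.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): yields [start_offset, entry]
# for each fasta entry read from io with "\n>" as the separator.
def each_fasta_chunk(io)
  overrun = ''
  loop do
    # The real start of the entry is before the ">" already consumed.
    start = io.pos - overrun.bytesize
    chunk = io.gets("\n>")
    break if chunk.nil?
    chunk = overrun + chunk
    if chunk.end_with?("\n>")
      chunk = chunk[0...-1] # drop the ">" that belongs to the next entry
      overrun = '>'
    else
      overrun = ''
    end
    yield start, chunk
  end
end

io = StringIO.new(">test1\nATATATAT\n>test2\nGCGCGCGC\n")
entries = []
each_fasta_chunk(io) { |start, chunk| entries << [start, chunk] }
# entries[0] is [0,  ">test1\nATATATAT\n"]
# entries[1] is [16, ">test2\nGCGCGCGC\n"]
```

Without the adjustment, the second entry would lack its leading ">"
and its recorded position would be off by one byte.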
Indeed, in the current code, both 1 and 2 are done at once
by the lines
@data.sub!(/^>.*/m, '') # remove trailing entries for sure
@entry_overrun = $&
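The effect of these two lines can be tried in isolation. The sketch
below is only an illustration under assumptions: naive_seq and
trimmed_seq are hypothetical stand-ins, not the Bio::FastaFormat code,
and the way the definition line is stripped is simplified.

```ruby
# Hypothetical naive parser: drops the first definition line only,
# then joins the remaining lines.
def naive_seq(str)
  body = str.sub(/\A>[^\n]*\n/, '')
  body.delete("\n")
end

# Hypothetical parser using the same sub!/$& idiom as the lines above.
def trimmed_seq(str)
  data = str.sub(/\A>[^\n]*\n/, '')
  data.sub!(/^>.*/m, '') # cut everything from the next ">" line onward
  overrun = $&           # the removed text, or nil if nothing matched
  [data.delete("\n"), overrun]
end

str = ">test1\nATATATAT\n>test2\nGCGCGCGC\n"

naive = naive_seq(str)
# naive is "ATATATAT>test2GCGCGCGC"

seq, overrun = trimmed_seq(str)
# seq is "ATATATAT", overrun is ">test2\nGCGCGCGC\n"

single_seq, single_overrun = trimmed_seq(">test1\nATATATAT\n")
# single_seq is "ATATATAT", single_overrun is nil
```

When the input holds a single entry, sub! matches nothing and $& is
nil, so the overrun stays empty.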
Item 1 could be skipped when reading all data at once without
needing file positions. Item 2 could be skipped if we can ignore
the mistake of passing two or more entries to
Bio::FastaFormat.new.
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org