[BioRuby] GSoC weekly status report No.4

Marjan Povolni marian.povolny at gmail.com
Mon Jun 18 18:28:12 UTC 2012


http://blog.mpthecoder.com/post/25375170121/gsoc-weekly-status-report-no-4

During the last week I added combining records into features, as well as
connecting the features into parent-child relationships. Validation
messages now include file names and line numbers, so they look like errors
reported by a compiler, which feels most natural to me.
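
To illustrate what a compiler-style validation message might look like,
here is a minimal Python sketch. The exact layout the library prints is
not shown in this report, so the GCC-like "file:line: error: message"
form, the function name and the sample file name are all assumptions.

```python
def format_validation_error(filename, line_number, message):
    """Build a compiler-style message (assumed layout, not the library's own)."""
    return f"{filename}:{line_number}: error: {message}"


# Hypothetical file name and message, for illustration only.
print(format_validation_error("m_hapla.gff3", 42, "invalid number of columns"))
```

Many editors can parse this shape and jump straight to the offending line,
which is one practical reason to mimic compiler output.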

Combining the records into features works by keeping a forward cache of a
number of features (1000 by default, configurable). That means the parsing
results will be correct only if records which are part of the same feature
are at most 1000 features apart, or whatever cache size has been set. The
first implementation, which compared the IDs of records directly, required
10 min for a 233 MB file. After switching to comparing hash values of the
IDs first, and comparing the IDs themselves only when the hashes match, the
parsing time went down to 45 s. After fixing a bug, the time is now 10
seconds for the 233 MB m_hapla file :)
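
The forward cache and the hash-first comparison described above can be
sketched roughly as follows. This is Python rather than D, and it is not
the library's actual code: the names Feature and combine_records, the use
of zlib.crc32 as the 32-bit hash, the FIFO eviction policy and the
assumption that every record carries an ID are all mine.

```python
import zlib
from collections import deque


class Feature:
    """A feature: one ID plus every record that shares it."""

    def __init__(self, feature_id):
        self.id = feature_id
        # crc32 stands in for whatever 32-bit hash the real parser uses.
        self.id_hash = zlib.crc32(feature_id.encode("utf-8"))
        self.records = []


def combine_records(records, cache_size=1000):
    """Yield features built from (id, data) records via a bounded cache."""
    cache = deque()  # oldest feature on the left
    for record_id, data in records:
        wanted = zlib.crc32(record_id.encode("utf-8"))
        for feature in cache:
            # Cheap integer comparison first; compare the strings only
            # when the hashes match.
            if feature.id_hash == wanted and feature.id == record_id:
                feature.records.append(data)
                break
        else:
            # No cached feature has this ID. Make room if the cache is
            # full, then start a new feature.
            if len(cache) == cache_size:
                yield cache.popleft()
            feature = Feature(record_id)
            feature.records.append(data)
            cache.append(feature)
    # End of input: everything still cached is complete.
    yield from cache


records = [("cds1", "part 1"), ("cds2", "part 1"), ("cds1", "part 2")]
for feature in combine_records(records, cache_size=2):
    print(feature.id, feature.records)
```

With cache_size=1 the same input would yield cds1 twice, once per record,
which is the kind of incorrect result mentioned above when records of one
feature lie further apart than the cache can hold.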

Linking the features into parent-child relationships works similarly, by
using 32-bit hashes most of the time instead of comparing strings. With
this functionality turned on, the same file is parsed in 13 seconds.
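
A comparable sketch for the parent-child linking, again in Python and
again only an illustration: the Feature class, link_features, the
single-parent simplification (GFF3 allows several parents per feature)
and zlib.crc32 as the 32-bit hash are assumptions, not the D library's
API.

```python
import zlib


class Feature:
    """A feature with an optional reference to its parent's ID."""

    def __init__(self, feature_id, parent_id=None):
        self.id = feature_id
        self.id_hash = zlib.crc32(feature_id.encode("utf-8"))
        self.parent_id = parent_id
        self.parent = None
        self.children = []


def link_features(features):
    """Attach each feature to its parent and return the top-level ones."""
    for child in features:
        if child.parent_id is None:
            continue
        wanted = zlib.crc32(child.parent_id.encode("utf-8"))
        for candidate in features:
            # Compare 32-bit hashes first; fall back to the string
            # comparison only when they are equal.
            if candidate.id_hash == wanted and candidate.id == child.parent_id:
                child.parent = candidate
                candidate.children.append(child)
                break
    # Features whose parent was absent or not found stay at the top level.
    return [f for f in features if f.parent is None]


features = [
    Feature("gene1"),
    Feature("mrna1", parent_id="gene1"),
    Feature("cds1", parent_id="mrna1"),
]
for root in link_features(features):
    print(root.id, [child.id for child in root.children])
```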

All the measurements have been done using the benchmark utility, which has
a few more options for setting what should be run.

Otherwise I did more refactoring: I moved all the gff3_* files into a gff3
directory, so the D modules are now bio.gff3.*, and the parsing functions
are now static methods of the GFF3File and GFF3Data classes, etc.

For the coming week, I would like to add filtering to the D library, which I
can then use to implement iteration over genes, mRNAs, CDS features, etc.
After that the library should be pretty much complete feature-wise, at
least per what was promised in the project proposal, so I’ll continue by
defining the C API and developing the Ruby gem.

--
Marjan



