[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Peter Cock p.j.a.cock at googlemail.com
Thu May 2 12:54:52 UTC 2013


On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu <zhigangwu.bgi at gmail.com> wrote:
> Hi Peter and all,
> Thanks for the long explanation.
> I now have a much better understanding of this project, though I am still
> confused about how to implement the lazy-loading parser for feature-rich files (EMBL,
> GenBank, GFF3).

Hi Zhigang,

I'd considered two ideas for GenBank/EMBL:

Lazy parsing of the feature table: The existing iterator approach reads
in a GenBank file record by record, and parses everything into objects
(a SeqRecord object with the sequence as a Seq object and the
features as a list of SeqFeature objects). I did some profiling a while
ago, and the feature processing accounted for much of that time. So
during the initial parse the features could be stored in memory as a
list of strings, and only parsed into SeqFeature objects if the user
tries to access the SeqRecord's features property.

This would only require a fairly simple subclass of SeqRecord which
turns the features list into a property, populating the list of
SeqFeature objects when it is first accessed.
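
Something along these lines might work - just a rough sketch, where
_parse_feature_table is a made-up placeholder standing in for the
existing GenBank/EMBL feature table parsing code, not a real Biopython
function:

    from Bio.SeqRecord import SeqRecord


    def _parse_feature_table(raw_lines):
        # Placeholder: a real implementation would reuse the existing
        # feature table parsing in Bio.GenBank to turn the cached raw
        # lines into SeqFeature objects.
        return []


    class LazyFeatureSeqRecord(SeqRecord):
        """SeqRecord subclass deferring feature parsing until first access."""

        def __init__(self, seq, raw_feature_lines=None, **kwargs):
            SeqRecord.__init__(self, seq, **kwargs)
            # Keep the unparsed feature table (a list of strings) rather
            # than building SeqFeature objects up front.
            self._raw_feature_lines = raw_feature_lines or []
            self._features = None

        @property
        def features(self):
            # Only parse the cached feature table when first requested.
            if self._features is None:
                self._features = _parse_feature_table(self._raw_feature_lines)
            return self._features

        @features.setter
        def features(self, value):
            self._features = value

The setter is there because the existing parser (and plenty of user
code) assigns to record.features directly.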

In the situation where the user never uses the features, this should
be much faster, and save some memory as well (that would need to
be confirmed by measurement - but a list of strings should take less
RAM than a list of SeqFeature objects with all the sub-objects like
the locations and annotations).

In the situation where the user does access the features, the simplest
behaviour would be to process the cached raw feature table into a
list of SeqFeature objects. The overall runtime and memory usage
would be about what we have now. This would not require any
file seeking, and could be used within the existing SeqIO interface
where we make a single pass through the file for parsing - this is
vital in order to cope with handles like stdin and network handles
where you cannot seek backwards in the file.

That is the simpler idea - it offers some real benefits, but is not too ambitious.
If you are already familiar with the GenBank/EMBL file format and
our current parser and the SeqRecord object, then I think a week
is reasonable.

A full index based approach would mean scanning the GenBank,
EMBL or GFF file and recording information about where each
feature is on disk (file offset) and the feature location coordinates.
This could be recorded in an efficient index structure (I was thinking
something based on BAM's BAI or Heng Li's improved version CSI).
The idea here is that when the user wants to look at features in a
particular region of the genome (e.g. they have a mutation or SNP
in region 1234567 on chr5) then only the annotation in that part
of the genome needs to be loaded from the disk.
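
As a very rough illustration of what such an index could record
(ignoring the BAI/CSI binning scheme, and only handling simple
start..end locations rather than joins, complements or fuzzy ends),
the scan might look like this - index_features and features_in_region
are illustrative names, not existing Biopython code:

    import re

    # One (start, end, file_offset) tuple per feature; a real index would
    # use a binning scheme like BAI/CSI rather than a flat sorted list.
    _SIMPLE_LOCATION = re.compile(br"^ {5}\S+ +<?(\d+)\.\.>?(\d+)")


    def index_features(filename):
        """Scan a GenBank file, recording each feature's span and offset."""
        index = []
        with open(filename, "rb") as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                match = _SIMPLE_LOCATION.match(line)
                if match:
                    start, end = int(match.group(1)), int(match.group(2))
                    index.append((start, end, offset))
                offset = handle.tell()
                line = handle.readline()
        index.sort()
        return index


    def features_in_region(index, start, end):
        """Return file offsets of features overlapping the given region."""
        return [offset for (f_start, f_end, offset) in index
                if f_start <= end and f_end >= start]

The recorded offsets would let you seek back to just those features and
parse them on demand, rather than building every SeqFeature up front.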

This would likely require API changes or additions, for example
the SeqRecord currently holds the SeqFeature objects as a
simple list - with no built-in co-ordinate access.
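
To make that concrete, a region query might eventually look something
like this (purely hypothetical - no such method exists on the
SeqRecord today):

    # Hypothetical API: fetch only the features overlapping a region,
    # loading them from the on-disk index as needed.
    for feature in record.get_features_in_region(1234000, 1235000):
        print("%s %s" % (feature.type, feature.location))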

As I wrote in the original outline email, there is scope for a very
ambitious project working in this area - but some of these ideas
would require more background knowledge or preparation:
http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html

Anything looking to work with GFF (in the broad sense of GFF3
and/or GTF) would ideally incorporate Brad Chapman's existing
work: http://biopython.org/wiki/GFF_Parsing

Regards,

Peter


