[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent
Zhigang Wu
zhigang.wu at email.ucr.edu
Thu May 2 20:18:03 EDT 2013
On Thu, May 2, 2013 at 5:54 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu <zhigangwu.bgi at gmail.com>
> wrote:
> > Hi Peter and all,
> > Thanks for the long explanation.
> > I got a much better understanding of this project, though I am still
> > confused about how to implement the lazy-loading parser for feature-rich
> > files (EMBL, GenBank, GFF3).
>
> Hi Zhigang,
>
> I'd considered two ideas for GenBank/EMBL,
>
> Lazy parsing of the feature table: The existing iterator approach reads
> in a GenBank file record by record, and parses everything into objects
> (a SeqRecord object with the sequence as a Seq object and the
> features as a list of SeqFeature objects). I did some profiling a while
> ago, and of this the feature processing is the slow part, therefore during
> the initial parse the features could be stored in memory as a list of
> strings, and only parsed into SeqFeature objects if the user tries to
> access the SeqRecord's feature property.
>
> It would require a fairly simple subclassing of the SeqRecord to make
> the features list into a property in order to populate the list of
> SeqFeatures when first accessed.
>
> In the situation where the user never uses the features, this should
> be much faster, and save some memory as well (that would need to
> be confirmed by measurement - but a list of strings should take less
> RAM than a list of SeqFeature objects with all the sub-objects like
> the locations and annotations).
>
I agree. This would save some memory.
> In the situation where the user does access the features, the simplest
> behaviour would be to process the cached raw feature table into a
> list of SeqFeature objects. The overall runtime and memory usage
> would be about what we have now. This would not require any
> file seeking, and could be used within the existing SeqIO interface
> where we make a single pass through the file for parsing - this is
> vital in order to cope with handles like stdin and network handles
> where you cannot seek backwards in the file.
>
Yes, I agree. In this sense the name "lazy-loading" is a little misleading,
because this would still load everything into memory at the beginning and
only delay parsing the features until one is requested.
"Lazy parsing" seems like a more appropriate name.
> That is the simpler idea, some real benefits, but not too ambitious.
> If you are already familiar with the GenBank/EMBL file format and
> our current parser and the SeqRecord object, then I think a week
> is reasonable.
>
>
No, I am not quite familiar with these.
> A full index based approach would mean scanning the GenBank,
> EMBL or GFF file and recording information about where each
> feature is on disk (file offset) and the feature location coordinates.
> This could be recorded in an efficient index structure (I was thinking
> something based on BAM's BAI or Heng Li's improved version CSI).
> The idea here is that when the user wants to look at features in a
> particular region of the genome (e.g. they have a mutation or SNP
> in region 1234567 on chr5) then only the annotation in that part
> of the genome needs to be loaded from the disk.
>
> This would likely require API changes or additions, for example
> the SeqRecord currently holds the SeqFeature objects as a
> simple list - with no built-in co-ordinate access.
>
> As I wrote in the original outline email, there is scope for a very
> ambitious project working in this area - but some of these ideas
> would require more background knowledge or preparation:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html
>
>
Hmm, this is really about indexing a big file. Don't you think it is a
little off topic for a "lazy-loading parser"?
But it does seem interesting and challenging, and would definitely be
useful.
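To make sure I follow, the index would store something along these lines?
(A crude sketch only - it ignores join/complement locations and uses a
linear scan instead of the BAI/CSI-style binning you mention, and it
assumes a plain GenBank file on disk.)

    import re

    # Crude location pattern - real GenBank locations (join, complement,
    # fuzzy positions) would need the proper location parser.
    _LOCATION = re.compile(r"(\d+)\.\.(\d+)")

    def index_genbank_features(path):
        """Record (start, end, file_offset) for each feature key line."""
        index = []
        with open(path, "rb") as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                text = line.decode("ascii", "replace")
                # Feature key lines have the key in column 6, e.g.
                # "     gene            1234..5678"
                if text.startswith("     ") and not text.startswith("      "):
                    match = _LOCATION.search(text)
                    if match:
                        index.append((int(match.group(1)),
                                      int(match.group(2)),
                                      offset))
        return index

    def features_overlapping(index, start, end):
        """Linear scan; a real index would bin by coordinate (BAI/CSI)."""
        return [hit for hit in index if hit[0] <= end and hit[1] >= start]

Then only the features returned by features_overlapping() would need to be
seeked to and parsed into SeqFeature objects.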
> Anything looking to work with GFF (in the broad sense of GFF3
> and/or GTF) would ideally incorporate Brad Chapman's existing
> work: http://biopython.org/wiki/GFF_Parsing
>
Yes, I will definitely take a look at Brad's GFF parser.
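Going by the examples on that wiki page, its basic interface looks roughly
like this (file name is just an example):

    from BCBio import GFF

    with open("example.gff3") as in_handle:
        for rec in GFF.parse(in_handle):
            # Each rec is a SeqRecord with SeqFeature objects, so any
            # lazy-loading work should fit the same object model.
            print(rec.id, len(rec.features))

so it already returns SeqRecord/SeqFeature objects.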
> Regards,
>
> Peter
>
Thanks again for the long explanation.
Zhigang