[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent
Alex Leach
albl500 at york.ac.uk
Thu May 2 09:54:37 EDT 2013
Hi again,
Thought I'd contribute some thoughts... Hope I'm not intruding too much on
the discussion.
On Thu, 02 May 2013 13:54:52 +0100, Peter Cock <p.j.a.cock at googlemail.com>
wrote:
>
> It would require a fairly simple subclassing of the SeqRecord to make
> the features list into a property in order to populate the list of
> SeqFeatures when first accessed.
>
Yes. You can turn a plain attribute into a property, backed by a method
that runs on access, quite easily using the built-in property decorator.
Here[1] is a pretty good example, description and justification.
[1] -
http://stackoverflow.com/questions/6618002/python-property-versus-getters-and-setters
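To make the idea concrete, here is a minimal sketch of a lazily populated
features list. It is not Biopython's actual SeqRecord: the LazyRecord class,
its constructor arguments and toy_parser are hypothetical names made up for
illustration, and the "parser" just splits a string.

```python
class LazyRecord(object):
    """Hypothetical stand-in for a SeqRecord subclass (not Biopython's API)."""

    def __init__(self, raw_feature_lines, parse_feature):
        self._raw_features = list(raw_feature_lines)  # cheap: plain strings
        self._parse_feature = parse_feature           # callable: str -> feature
        self._features = None                         # nothing parsed yet

    @property
    def features(self):
        # Parse the cached raw text on first access only, then reuse the list.
        if self._features is None:
            self._features = [self._parse_feature(line)
                              for line in self._raw_features]
            self._raw_features = None  # the raw strings are no longer needed
        return self._features

    @features.setter
    def features(self, value):
        # Assigning a list explicitly bypasses the lazy parsing.
        self._features = list(value)
        self._raw_features = None


calls = []  # records every line the toy parser is asked to handle


def toy_parser(line):
    """Assumed toy parser: 'gene 1..90' -> ('gene', '1..90')."""
    calls.append(line)
    key, location = line.split()
    return (key, location)


record = LazyRecord(["gene 1..90", "CDS 10..80"], toy_parser)
```

Until record.features is read, toy_parser is never called; after the first
read the parsed list is reused, so repeated access costs nothing extra.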
> In the situation where the user never uses the features, this should
> be much faster, and save some memory as well (that would need to
> be confirmed by measurement - but a list of strings should take less
> RAM than a list of SeqFeature objects with all the sub-objects like
> the locations and annotations).
>
> In the situation where the user does access the features, the simplest
> behaviour would be to process the cached raw feature table into a
> list of SeqFeature objects. The overall runtime and memory usage
> would be about what we have now. This would not require any
> file seeking, and could be used within the existing SeqIO interface
> where we make a single pass through the file for parsing - this is
> vital in order to cope with handles like stdin and network handles
> where you cannot seek backwards in the file.
I think the Pythonic way here would be to follow the "Easier to Ask for
Forgiveness than Permission" (EAFP) idiom[2], i.e. try to seek the file
handle first, and if that raises an IOError, catch the exception and fall
back to caching the input stream data, perhaps writing it to a temporary
file on disk.
[2] - http://docs.python.org/2/glossary.html#term-eafp
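A rough sketch of that fallback, assuming a text-mode handle. The names
make_seekable and PipeLike are hypothetical, and PipeLike is only a toy
stand-in for stdin or a network handle. Note that the fallback copies the
rest of the stream to disk up front, which may or may not be acceptable.

```python
import io
import shutil
import tempfile


def make_seekable(handle):
    """Return a handle that supports seek(), spooling to disk if needed."""
    try:
        # EAFP: just try it. Non-seekable streams raise IOError, which is
        # an alias of OSError on Python 3 (io.UnsupportedOperation is a
        # subclass of it).
        handle.seek(handle.tell())
        return handle
    except (IOError, OSError):
        # Fallback: copy the remaining stream into an anonymous temporary
        # file, which is deleted automatically when closed.
        spooled = tempfile.TemporaryFile(mode="w+")
        shutil.copyfileobj(handle, spooled)
        spooled.seek(0)
        return spooled


class PipeLike(io.StringIO):
    """Toy stand-in for stdin or a socket: readable but not seekable."""

    def seekable(self):
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

    def tell(self):
        raise io.UnsupportedOperation("tell")


handle = make_seekable(PipeLike(u"LOCUS       TEST\n//\n"))
first_pass = handle.read()
handle.seek(0)               # works now, thanks to the temporary file
second_pass = handle.read()
```

A handle that is already seekable is returned unchanged, so ordinary files
pay no extra cost.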
>
> That is the simpler idea, some real benefits, but not too ambitious.
> If you are already familiar with the GenBank/EMBL file format and
> our current parser and the SeqRecord object, then I think a week
> is reasonable.
>
> A full index based approach would mean scanning the GenBank,
> EMBL or GFF file and recording information about where each
> feature is on disk (file offset) and the feature location coordinates.
> This could be recorded in an efficient index structure (I was thinking
> something based on BAM's BAI or Heng Li's improved version CSI).
> The idea here is that when the user wants to look at features in a
> particular region of the genome (e.g. they have a mutation or SNP
> in region 1234567 on chr5) then only the annotation in that part
> of the genome needs to be loaded from the disk.
Thought I'd add that Blast uses ISAM-format index files for maintaining
indexes to its databases[3]. I'm not familiar with Biopython's BioSQL
module at all, but if an SQL-backed index is an option, a nice feature of
sqlite is that you can hold temporary databases in memory[4].
[3] -
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbisam_8hpp.html
[4] -
http://docs.python.org/2/library/sqlite3.html#using-sqlite3-efficiently
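As a sketch of how an in-memory sqlite index of feature offsets might look:
the table layout, column names, function names and example rows below are
all assumptions for illustration, not anything in Biopython or BioSQL. A
plain B-tree index like this is also not equivalent to the binning schemes
used by BAI or CSI; it only shows the region lookup.

```python
import sqlite3


def build_index(rows):
    """Build a throwaway in-memory index from
    (seq_id, feat_start, feat_end, file_offset) tuples."""
    conn = sqlite3.connect(":memory:")  # lives in RAM, gone when closed
    conn.execute(
        "CREATE TABLE feature_index ("
        " seq_id TEXT, feat_start INTEGER,"
        " feat_end INTEGER, file_offset INTEGER)"
    )
    conn.executemany("INSERT INTO feature_index VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        "CREATE INDEX idx_region"
        " ON feature_index (seq_id, feat_start, feat_end)"
    )
    conn.commit()
    return conn


def offsets_overlapping(conn, seq_id, start, end):
    """Return file offsets of features overlapping [start, end] on seq_id."""
    cursor = conn.execute(
        "SELECT file_offset FROM feature_index"
        " WHERE seq_id = ? AND feat_start <= ? AND feat_end >= ?"
        " ORDER BY file_offset",
        (seq_id, end, start),
    )
    return [row[0] for row in cursor]


# Made-up example rows: two features on chr5 and one on chr1.
index = build_index([
    ("chr5", 1234000, 1235000, 4096),
    ("chr5", 2000000, 2001000, 8192),
    ("chr1", 1234000, 1235000, 12288),
])
hits = offsets_overlapping(index, "chr5", 1234567, 1234567)
```

With these rows, hits holds only the offset of the first chr5 feature, so
only that part of the file would need to be read and parsed.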
Cheers,
Alex
>
> This would likely require API changes or additions, for example
> the SeqRecord currently holds the SeqFeature objects as a
> simple list - with no built-in co-ordinate access.
>
> As I wrote in the original outline email, there is scope for a very
> ambitious project working in this area - but some of these ideas
> would require more background knowledge or preparation:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html
>
> Anything looking to work with GFF (in the broad sense of GFF3
> and/or GTF) would ideally incorporate Brad Chapman's existing
> work: http://biopython.org/wiki/GFF_Parsing
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
---
Alex Leach. BSc, MRes
PhD Student
Chong & Redeker Labs
Department of Biology
University of York
YO10 5DD
Tel: 07940 480 771
EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm