[Biopython-dev] GSoC 2014: another aspirant here

Thu Mar 13 15:04:34 EDT 2014

Thank you Bow,

I'll need to digest this a bit, but you have given me a good direction. My
inclination for the proposal is to focus on sequential file formats used to
transmit 'databases' of sequences (like fasta, embl, uniprot-xml, swiss,
and others) and to mostly ignore formats used to convey alignments (i.e.,
anything covered exclusively by parsers in AlignIO). If this is a poor
direction, please tell me so that I can add to my preparation.

-Evan

Evan Parker
Ph.D. Candidate
Dept. of Chemistry - Lebrilla Lab
University of California, Davis

On Thu, Mar 13, 2014 at 2:04 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:

> Hi Evan,
>
> Thank you for your interest in the project :). It's good to know
> you're already quite familiar with SeqIO as well.
>
> My replies are below.
>
> > 1) Should the lazy loading be done primarily in the context of records
> > returned from the SeqIO.index() dict-like object, or should the lazy
> > loading be available to the generator made by SeqIO.parse()? The project
> > idea in the wiki mentions adding lazy-loading to SeqIO.parse() but it seems
> > to me that the best implementation of lazy loading in these two SeqIO
> > functions would be significantly different. My initial impression of the
> > project would be for SeqIO.parse() to stage a file segment and selectively
> > generate information when called while SeqIO.index() would use a more
> > detailed map created at instantiation to pull information selectively.
>
> We don't necessarily have to be restricted to SeqIO.index() objects
> here. You'll notice, of course, that SeqIO.index() indexes only whole
> records; its granularity does not extend down to individual
> subsequences. What we're looking for is compatibility with our existing
> SeqIO parsers.
> The lazy parser may well be a new object implemented alongside SeqIO,
> but the parsing logic itself (the one whose invocation is delayed by
> the lazy parser) should rely on existing parsers.
>
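
A minimal sketch of the "new object alongside SeqIO" idea above, under
stated assumptions: the class name LazyRecord and the injected `parse`
callable are hypothetical and not part of Biopython. In a real
implementation `parse` might wrap an existing SeqIO parser (for example
SeqIO.read on a StringIO holding the record text), so the parsing logic
itself stays where it already lives and only its invocation is delayed.

```python
class LazyRecord:
    """Hold raw record text; parse it only when an attribute is first needed.

    Hypothetical illustration, not a Biopython class.
    """

    def __init__(self, raw_text, parse):
        self._raw_text = raw_text
        self._parse = parse      # existing parser: str -> record object
        self._record = None      # filled in on first attribute access

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so the private
        # attributes set in __init__ never reach this method.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._record is None:
            # The delayed call into the existing parsing logic.
            self._record = self._parse(self._raw_text)
        return getattr(self._record, name)
```

Creating a LazyRecord costs almost nothing; the first access to something
like `.id` or `.seq` triggers a single parse, and later accesses reuse the
parsed object.
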
> > 2) Is slower instantiation an acceptable trade-off for memory efficiency?
> > In the current implementation of SeqIO.index(), sequence files are read
> > twice, once to annotate beginning points of entries and a second time to
> > load the SeqRecord requested by __getitem__(). A lazy-loading parser could
> > amplify this issue if it works by indexing locations other than the start
> > of the record. The alternative approach of passing the complete textual
> > sequence record and selectively parsing would be easier to implement (and
> > would include dual compatibility with parse and index) but it seems that it
> > would be slower when called and potentially less memory efficient.
>
> I think this will depend on what you want to store in the indices and
> how you store them, which will most likely differ per sequence file
> format. Coming up with this, we expect, is an important part of the
> project implementation. Doing a first pass for indexing is acceptable.
> Instantiation of the object using the index doesn't necessarily have
> to be slow. Retrieval of the actual (sub)sequence will be slower since
> we will touch the disk and do the actual parsing by then. But this can
> also be improved, perhaps by caching the result so subsequent
> retrieval is faster. One important point (and the use case that we
> envision for this project) is that subsequences in large sequence
> files (genome assemblies, for example) can be retrieved quite quickly.
>
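
The caching suggestion above could be sketched roughly as follows. This
assumes some slow reader, here a hypothetical `read_from_disk(start, end)`
callable that seeks into the file and parses the requested span; the
wrapper name `make_cached_fetcher` is likewise made up for illustration.
Only functools.lru_cache from the standard library is used.

```python
from functools import lru_cache


def make_cached_fetcher(read_from_disk, maxsize=128):
    """Wrap a slow (start, end) -> str reader with a bounded LRU cache.

    `read_from_disk` is a placeholder for whatever touches the file and
    parses the span; repeated requests for the same span skip it.
    """

    @lru_cache(maxsize=maxsize)
    def fetch(start, end):
        return read_from_disk(start, end)

    return fetch
```

The first request for a given (start, end) pair pays the disk and parsing
cost; identical later requests are served from memory. A bounded `maxsize`
keeps the cache from undoing the memory savings that lazy loading is meant
to provide.
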
> Take a look at some existing indexing implementations, such as
> faidx[1] for FASTA files and BAM indexing[2]. Looking at the tabix[3]
> tool may also help. The faidx indexing, for example, relies on every
> sequence line within a FASTA record having the same length, which means
> the byte position of any subsequence can be computed from the record's
> file offset and its line length, without reading the rest of the record.
>
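
The fixed-line-length trick described above can be sketched in a few lines
of plain Python. The parameter names are illustrative: `seq_start` is the
byte offset of the first base of a record, `line_bases` the number of bases
per line, and `line_width` the number of bytes per line including the line
terminator (loosely mirroring what a .fai index records). This is a sketch
of the arithmetic only, not the samtools implementation.

```python
def base_offset(seq_start, line_bases, line_width, pos):
    """Byte offset of 0-based base `pos` in a record with fixed-length lines."""
    full_lines, within = divmod(pos, line_bases)
    return seq_start + full_lines * line_width + within


def fetch(path, seq_start, line_bases, line_width, start, end):
    """Return bases [start, end) of one record without reading the rest."""
    if end <= start:
        return ""
    first = base_offset(seq_start, line_bases, line_width, start)
    last = base_offset(seq_start, line_bases, line_width, end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the byte range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

For a record whose header is ">chr1" followed by a newline, `seq_start` is
6; with 10 bases per line and "\n" line endings, `line_width` is 11. Only
the bytes spanning the requested bases are read, which is what makes
retrieval from very large files fast.
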
> Hope this gives you some useful hints. Good luck with your proposal :).
>
> Cheers,
> Bow
>
> [1] http://samtools.sourceforge.net/samtools.shtml
> [2] http://samtools.github.io/hts-specs/SAMv1.pdf
> [3] http://bioinformatics.oxfordjournals.org/content/27/5/718
>