[Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent

Zhigang Wu zhigang.wu at email.ucr.edu
Thu May 2 21:18:43 UTC 2013


Hi Chris and All,

In your comments on my proposal, you mentioned that some GFF files can be
gigabytes in size. After seeing that comment, I wanted to get a rough idea
of how large the GFF files people typically work with are. I mainly work
on plants and am not very familiar with animals. Below is a list of the
animal and plant genomes which, to my knowledge from reading papers, most
people work with.

organism (genome size)    size of GFF         URL to the FTP *folder* (not a huge file, so feel free to click)

arabidopsis (~120MB)      44MB
    ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/
rice (~450MB)             77MB
    ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/
corn (3GB)                87MB
    http://ftp.maizesequence.org/release-5b/filtered-set/
D. melanogaster           450MB
    ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/
C. elegans                (site going down)
    http://wiki.wormbase.org/index.php/Downloads#GFF2
H. sapiens (3GB)          170MB
    http://galaxy.raetschlab.org/library_common/browse_library?show_deleted=False&cntrller=library&use_panels=False&id=2f94e8ae9edff68a

My point is that caching GFF files in memory may not be as bad as we
thought. Any comments or suggestions are welcome.
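
As a quick sanity check, a few lines of Python show what caching a whole
GFF3 file in memory actually costs (the file name below is a placeholder;
substitute any of the files listed above):

# Rough measure of the memory cost of caching a whole GFF3 file.
# "all.gff3" is a placeholder path.
features_by_seqid = {}
with open("all.gff3") as handle:
    for line in handle:
        if line.startswith("#"):
            continue  # skip comment/pragma lines
        seqid = line.split("\t", 1)[0]
        features_by_seqid.setdefault(seqid, []).append(line)

n_lines = sum(len(v) for v in features_by_seqid.values())
n_bytes = sum(len(line) for v in features_by_seqid.values() for line in v)
print("Cached %i feature lines (~%.0f MB of text)" % (n_lines, n_bytes / 1e6))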

Best,


Zhigang




On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell <chris.mit7 at gmail.com> wrote:

> Hi Zhigang,
>
> I left some comments on your proposal.  As I said there, I think you need
> to find and look at a variety of GFF/GTF files to see where your
> implementation breaks down.  Also, for parsing, I would focus on optimizing
> how quickly the user can access attributes; those are the bits people care
> most about (where is gene X? what is the FPKM of isoform Y? etc.).
>
> Chris
>
>
> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu <zhigangwu.bgi at gmail.com> wrote:
>
>> Hi Peter and all,
>> Thanks for the long explanation.
>> I now have a much better understanding of this project, though I am
>> still confused about how to implement the lazy-loading parser for
>> feature-rich files (EMBL, GenBank, GFF3).
>> Since the deadline is pretty close, I decided to post my draft proposal
>> for this project. It would be great if you all could give me some
>> comments and suggestions. The proposal is available here:
>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing
>>
>> Thank you all in advance.
>>
>>
>> Zhigang
>>
>>
>>
>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu <zhigangwu.bgi at gmail.com>
>> > wrote:
>> > > Peter,
>> > >
>> > > Thanks for the detailed explanation. It's very helpful. I am not quite
>> > > sure about the goal of the lazy-loading parser, so let me try to
>> > > summarize what the goals of lazy-loading are and how it would work.
>> > > Please correct me if necessary. Below I use a FASTA/FASTQ file as an
>> > > example; the idea should apply generally to other formats such as
>> > > GenBank/EMBL, as you mentioned.
>> > >
>> > > Lazy-loading is useful under the assumption that, given a large file,
>> > > we are interested in part of its information but not all of it. For
>> > > example, given a FASTA file containing the Arabidopsis genome, we may
>> > > only be interested in the sequence of chr5 from positions 2000-3000.
>> > > Rather than parsing the whole file and storing every record in memory,
>> > > as most parsers do, during the indexing step a lazy-loading parser
>> > > stores only a little positional information, such as the access
>> > > positions (readily usable for seek) of all chromosomes (chr1, chr2,
>> > > chr3, chr4, chr5, ...), and maybe finer-grained offsets such as the
>> > > access position of every 1000bp of each sequence in the given file.
>> > > After indexing, we store this information in a dictionary like the
>> > > following: {'chr1': {0: access_pos, 1000: access_pos,
>> > > 2000: access_pos, ...}, 'chr2': {0: access_pos, 1000: access_pos,
>> > > 2000: access_pos, ...}, 'chr3': ...}.
>> > >
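>> > > As a concrete illustration, here is a minimal sketch of that indexing
>> > > step in plain Python (no Biopython; it assumes a FASTA file with one
>> > > '>' header line per record):
>> > >
>> > > def index_fasta(path, step=1000):
>> > >     """Store a seek offset roughly every `step` bases of each record."""
>> > >     index = {}
>> > >     name = None
>> > >     bases = 0      # bases seen so far in the current record
>> > >     next_mark = 0  # base count at which to store the next offset
>> > >     with open(path, "rb") as handle:
>> > >         while True:
>> > >             offset = handle.tell()  # offset of the line about to be read
>> > >             line = handle.readline()
>> > >             if not line:
>> > >                 break
>> > >             if line.startswith(b">"):
>> > >                 name = line[1:].split()[0].decode()
>> > >                 index[name] = {}
>> > >                 bases = 0
>> > >                 next_mark = 0
>> > >             elif name:
>> > >                 if bases >= next_mark:
>> > >                     index[name][bases] = offset
>> > >                     next_mark = bases + step
>> > >                 bases += len(line.strip())
>> > >     return index
>> > >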
>> > > Compared to the usual parser, which parses the whole file, we gain
>> > > three benefits: speed, lower memory usage, and random access. Speed is
>> > > gained because we skip a lot of parsing during the indexing step. Going
>> > > back to my example, once we have the dictionary we can simply seek to
>> > > the access position of chr5:2000 and start reading and parsing from
>> > > there. Memory usage is lower because we only store the access positions
>> > > of each record, as a dictionary, in memory.
>> > >
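>> > > Retrieval with such an index might then look like this (again just a
>> > > sketch; the file name is a placeholder):
>> > >
>> > > def fetch(path, index, name, start, end):
>> > >     """Return bases [start:end) of record `name` using the index."""
>> > >     marks = index[name]
>> > >     key = max(k for k in marks if k <= start)  # nearest mark <= start
>> > >     chunks = []
>> > >     got = key
>> > >     with open(path, "rb") as handle:
>> > >         handle.seek(marks[key])
>> > >         for line in iter(handle.readline, b""):
>> > >             if line.startswith(b">"):
>> > >                 break  # reached the next record
>> > >             chunks.append(line.strip())
>> > >             got += len(line.strip())
>> > >             if got >= end:
>> > >                 break
>> > >     return b"".join(chunks)[start - key:end - key].decode()
>> > >
>> > > e.g. fetch("athaliana.fa", index, "chr5", 2000, 3000) would read only
>> > > a handful of lines from disk.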
>> > >
>> > > Best,
>> > >
>> > > Zhigang
>> >
>> > Hi Zhigang,
>> >
>> > Yes - that's the basic idea of a disk-based lazy loader. Here
>> > the data stays on the disk until needed, so generally this is
>> > very low memory, but it can be slow as it needs to read from
>> > the disk. An existing example already in Biopython is our
>> > BioSQL bindings, which present a SeqRecord subclass that
>> > only retrieves values from the database on demand.
>> >
>> > Note in the case of FASTA, we might want to use the existing
>> > FAI index files from Heng Li's faidx tool (or another existing
>> > index scheme). That relies on each record using a consistent
>> > line wrapping length, so that seek offsets can be easily
>> > calculated.
>> >
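>> > (For reference, the faidx-style offset arithmetic is just the
>> > following; the argument names mirror the .fai columns:)
>> >
>> > def fai_offset(seq_start, line_bases, line_width, position):
>> >     """Map a 0-based base position to a file offset, faidx-style."""
>> >     # seq_start: offset of the record's first sequence byte
>> >     # line_bases: bases per line; line_width: bytes per line
>> >     # (including the newline)
>> >     full_lines, remainder = divmod(position, line_bases)
>> >     return seq_start + full_lines * line_width + remainder
>> >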
>> > An alternative idea is to load the data into memory (so that the
>> > file is not touched again, useful for stream processing where
>> > you cannot seek within the input data) but it is only parsed into
>> > Python objects on demand. This would use a lot more memory,
>> > but should be faster as there is no disk seeking and reading
>> > (other than the one initial read). For FASTA this wouldn't help
>> > much but it might work for EMBL/GenBank.
>> >
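>> > A toy sketch of that in-memory idea - keep each record's raw text
>> > and only parse it into a SeqRecord on first access (the class name
>> > is made up):
>> >
>> > from io import StringIO
>> > from Bio import SeqIO
>> >
>> > class LazyRecord:
>> >     """Caches raw GenBank text; parses it only when first accessed."""
>> >     def __init__(self, raw_text):
>> >         self._raw = raw_text
>> >         self._record = None
>> >
>> >     @property
>> >     def record(self):
>> >         if self._record is None:  # parse on demand, then cache
>> >             self._record = SeqIO.read(StringIO(self._raw), "genbank")
>> >         return self._record
>> >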
>> > Something to beware of with any lazy loading / lazy parsing is
>> > what happens if the user tries to edit the record. Do you want
>> > to allow this (it makes the code more complex) or not (simpler,
>> > and still very useful)?
>> >
>> > In terms of usage examples, for things like raw NGS data this
>> > is (currently) made up of lots and lots of short sequences (under
>> > 1000bp). Lazy loading here is unlikely to be very helpful - unless
>> > perhaps you can make the FASTQ parser faster this way?
>> > (Once the reads are assembled or mapped to a reference,
>> > random access to lookup reads by their mapped location is
>> > very very important, thus the BAI indexing of BAM files).
>> >
>> > In terms of this project, I was thinking about a SeqRecord
>> > style interface extending Bio.SeqIO (but you can suggest
>> > something different for your project).
>> >
>> > What I saw as the main use case here is large datasets like
>> > whole chromosomes in FASTA format or richly annotated
>> > formats like EMBL, GenBank or GFF3. Right now if I am
>> > doing something with (for example) the annotated human
>> > chromosomes, loading these as GenBank files is quite
>> > slow (it takes a fair amount of memory too, but that isn't
>> > my main worry). A lazy loading approach should let me
>> > 'load' the GenBank files almost instantly, and delay
>> > reading specific features or sequence from the disk
>> > until needed.
>> >
>> > For example, I might have a list of genes for which I wish
>> > to extract the annotation or sequence for - and there is no
>> > need to load all the other features or the rest of the genome.
>> >
>> > (Note we can already do this by loading GenBank files
>> > into a BioSQL database and accessing them that way.)
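>> >
>> > (For anyone unfamiliar, that BioSQL route looks roughly like this;
>> > the connection details and file names below are placeholders:)
>> >
>> > from Bio import SeqIO
>> > from BioSQL import BioSeqDatabase
>> >
>> > # placeholder connection settings
>> > server = BioSeqDatabase.open_database(driver="MySQLdb", user="biosql",
>> >                                       passwd="secret", host="localhost",
>> >                                       db="bioseqdb")
>> > db = server["human"]  # assumes this sub-database was created earlier
>> > db.load(SeqIO.parse("chr21.gb", "genbank"))
>> > server.commit()
>> >
>> > # later: fetch one record; data is pulled from the database on demand
>> > record = db.lookup(accession="NC_000021")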
>> >
>> > Regards,
>> >
>> > Peter
>> >
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>
>
>


