[Biopython-dev] GSoC draft proposal - lazy loading SeqIO parsers

Wibowo Arindrarto w.arindrarto at gmail.com
Wed Mar 19 19:42:50 UTC 2014


Hi Evan,

Looks like this is shaping up in a good direction :). In addition to
Peter's earlier comments, I also have some remarks:

* How would the indices of the files be stored? Are they simply stored
in memory or as files? Is their creation invisible to the user (i.e.
invoking the `lazy=True` argument is enough to create the index) or
does the user need to create the index explicitly? For
`SeqIO.index(lazy=True)` in particular, does this mean that we will
have two indices then (one for the currently implemented SQLite
database that stores offsets for record positions and the other to
store other information necessary for the lazy parser)?

* It would be nice to also have some notes on the relation between
SeqRecProxy and SeqRecord (is it a subclass perhaps, or are they
separate classes that inherit from a common base class?). As an
alternative, it is also possible to have a regular SeqRecord object,
but with lazy Seq objects and lazy annotation objects instead.

* Have you thought about what to store in the indices of the different
formats? It's a good idea to explain this further in your proposal
(e.g. what to store when indexing GenBank files, UniprotXML files,
etc.). It doesn't have to be concrete (it will be in the code anyway),
but having an idea of the possible implementations you have in mind
would be nice.

* And finally, the schedule. It looks like the early weeks will be
quite packed, considering your other obligations. I think it is
expected that students spend close to 8 hours per day (or 40 hours per
week) during the coding period. Of course this is much more sensible
when the student does not have other pressing obligations. I do agree
with Peter here that you have to at least discuss this with your PhD
supervisor. I personally do not mind that for the week you have the
conference the workload is reduced. But in the first four weeks, I
would prefer that you have more time to spend on GSoC.
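
To make the questions about index storage and contents a bit more
concrete, here is a minimal sketch of one option: a file-backed index
that stores, per record, a byte offset and a length. This is only an
illustration under my own assumptions. The table layout and the names
`build_offset_index` and `fetch_raw` are hypothetical (not existing
Biopython API), it handles plain FASTA only, and it uses nothing but
the standard library `sqlite3` module.

```python
import sqlite3


def build_offset_index(fasta_path, index_path):
    """Record the byte offset and length of each FASTA record in SQLite.

    Hypothetical helper for illustration; pass ":memory:" as index_path
    to keep the index in memory instead of on disk.
    """
    con = sqlite3.connect(index_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(key TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
    )
    rows = []
    key, start, pos = None, 0, 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                # A new header closes the previous record, if any.
                if key is not None:
                    rows.append((key, start, pos - start))
                fields = line[1:].split()
                key = fields[0].decode("ascii") if fields else ""
                start = pos
            pos += len(line)
    if key is not None:
        rows.append((key, start, pos - start))
    con.executemany("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)", rows)
    con.commit()
    return con


def fetch_raw(fasta_path, con, key):
    """Return the raw bytes of one record, reading only that record."""
    row = con.execute(
        "SELECT start, length FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    start, length = row
    with open(fasta_path, "rb") as handle:
        handle.seek(start)
        return handle.read(length)
```

A lazy parser would presumably need extra columns per format (for
example the offsets of the sequence and feature blocks in a GenBank
record), which is the part I would like to see spelled out in the
proposal.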

Cheers & good luck,
Bow


On Wed, Mar 19, 2014 at 6:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Mar 19, 2014 at 4:49 PM, Evan Parker <eparker at ucdavis.edu> wrote:
>> Hi all,
>>
>> I have a rough draft of my GSoC proposal and would appreciate comments from
>> anybody who might be willing to eventually mentor this project, or anybody
>> who has opinions on implementation. It's about 3 pages of text + several
>> figures.
>>
>> I'll be submitting a final draft Friday on the GSoC website pending your
>> comments.
>>
>> Thank you,
>> -Evan
>
> Hi Evan,
>
> That's a nice job so far - although questions about your time
> availability will be raised (sadly the GSoC schedule isn't fair to
> students depending on regional University term schedules).
> However, you are a PhD student (which is normally full time).
> You will need to clear this with your PhD supervisors - since
> you would be spending a large chunk of time not working
> directly on your thesis project, and there can be strict
> deadlines for completion.
>
> Here's a selection of points in no particular order:
>
> Have you looked at Bio.SeqIO.index_db(...) which works
> like Bio.SeqIO.index(...) but stores the offsets etc in an
> SQLite database?
>
> When pondering how to design this kind of thing myself,
> I had suspected multiple SeqRecProxy classes might be
> needed (one per file format potentially), although run
> time selection of internal parsing methods might work too.
>
> I would also ask why not have the slicing of a SeqRecProxy
> return another SeqRecProxy? This means creating a new
> proxy object with different offset values - but would be fast.
> Only when the seq/annotation/etc is accessed would the
> proxy have to go to the disk drive. This becomes more
> interesting when accessing the features in the slice of
> interest (e.g. if the full record was for a whole chromosome
> and only region [1000:2000] was of interest).
>
> This idea about windows onto the data is key to how
> the SAM/BAM file format is used (coordinate sorting
> with an index). Are you familiar with that, or tabix?
>
> Another open question is what to do with file handles -
> specifically the question of when to close them? e.g.
> via garbage collection, context managers, etc. See
> for example this blog post - the lazy parsing approach
> may result in ResourceWarnings as a side effect:
> http://emptysqua.re/blog/against-resourcewarnings-in-python-3/
>
> I appreciate you are unlikely to have ready answers to
> all of that - I've probably given you a whole load more
> background reading. I hope some of the other Biopython
> developers (or GSoC mentors on other OBF projects -
> you could post this to the OBF GSoC mailing list too)
> will have further feedback.
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
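
On Peter's point about slicing a proxy into another proxy, a rough
sketch of the idea follows. It is not a proposed design: the class
name `LazySeqProxy` and its attributes are hypothetical, and it
assumes the sequence sits on a single line so that residue positions
map directly to byte offsets. Slicing only adjusts the offsets; the
file is opened (and closed again via a context manager) only when
`.seq` is accessed, which is one possible answer to the file handle
question.

```python
class LazySeqProxy:
    """Hypothetical proxy for a single-line sequence stored in a file."""

    def __init__(self, path, seq_start, seq_len):
        self._path = path
        self._seq_start = seq_start  # byte offset of the first residue
        self._seq_len = seq_len      # number of residues in this window
        self.reads = 0               # disk accesses, for demonstration only

    def __len__(self):
        return self._seq_len

    def __getitem__(self, item):
        # Slicing never touches the disk; it returns a narrower proxy.
        if not isinstance(item, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = item.indices(self._seq_len)
        if step != 1:
            raise ValueError("this sketch only supports a step of 1")
        length = max(0, stop - start)
        return LazySeqProxy(self._path, self._seq_start + start, length)

    @property
    def seq(self):
        # The handle lives only for the duration of this read.
        self.reads += 1
        with open(self._path, "rb") as handle:
            handle.seek(self._seq_start)
            return handle.read(self._seq_len).decode("ascii")
```

With a whole chromosome on disk, `proxy[1000:2000].seq` would then
read only the 1000 residues of interest. Line-wrapped files would need
an offset calculation that accounts for the line length, which this
sketch leaves out.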


