[Biopython-dev] detailed plan - Indexing & Lazy-loading Sequence Parsers

Peter Cock p.j.a.cock at googlemail.com
Wed Mar 19 18:10:54 UTC 2014


On Wed, Mar 19, 2014 at 5:45 PM, Nejat Arinik <nejat.arinik at insa-lyon.fr> wrote:
>
> Hi all,
>
> I would like to show you my detailed plan, month by month.
> https://docs.google.com/document/d/1IKJAs4u4rAVnmaDh0LPyMrd_MuqELqblmveeeOO36aE/edit
>
> I know it is not tidy yet, but I just want to get your ideas on the
> plan. I will finish it tonight. Do you think I have understood the
> subject correctly? Could this plan be a solution? Thanks in advance.
>
> PS: My English is not good, so writing a detailed proposal plan is a
> little bit difficult, but I am trying. I hope that is not a big problem :)
> Unfortunately I am more comfortable with French.
>
> Nejat


Hi Nejat,

I can try to answer some of the questions at the start of the document:

Q: Lazy-load ~= load partially (depending on demand)?

A: Yes. For example, only load the sequence if the user tries to
access it. This should speed up tasks like counting the records or
building a list of all the record identifiers.
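Something along these lines (just a sketch I am making up here to show
the idea, not a design the parser has to follow): a lazily-loaded
record might only remember where its sequence lives in the file, and
read it on first access:

    class LazyRecord:
        """Hypothetical lazily-loaded record; only the ID is parsed up front."""

        def __init__(self, record_id, handle, offset, length):
            self.id = record_id
            self._handle = handle    # open file handle shared with the parser
            self._offset = offset    # where this record's sequence starts
            self._length = length    # how many bytes of sequence to read
            self._seq = None         # sequence not loaded yet

        @property
        def seq(self):
            if self._seq is None:    # read from disk on first access only
                self._handle.seek(self._offset)
                self._seq = self._handle.read(self._length).replace("\n", "")
            return self._seq

Counting records or listing their identifiers never touches .seq, so
the sequence data is never read from disk.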

Q: How big are small to medium sized sequences/genomes in general, and
how long do they take to load?

A: Bacterial genomes are usually small enough to load into memory
without worrying about RAM. Eukaryote genomes (e.g. mouse, human,
plants) are typically large enough that you may not want to load an
entire annotated chromosome into memory.

Q: Is a Python dictionary used for the SeqRecord object?

A: Yes, the SeqRecord object uses a Python dictionary for the
annotations property, and a dictionary-like object for the
letter_annotations property. The SeqRecord object also uses Python
lists, and the Biopython Seq object.
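For example, with the current Bio.SeqRecord API (the sequence and
annotation values here are made up):

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    record = SeqRecord(Seq("ACGTACGT"), id="example1", description="toy record")

    # annotations is a plain Python dictionary:
    record.annotations["molecule_type"] = "DNA"

    # letter_annotations is a dictionary-like object whose values must
    # have one entry per letter of the sequence:
    record.letter_annotations["phred_quality"] = [40, 38, 35, 30, 40, 38, 35, 30]

    # features is a Python list, and .seq is a Biopython Seq object:
    print(record.annotations)
    print(len(record.features), record.seq)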

Q: Will writing data back to the file be supported? If so, what is the
relation with BioSQL? Any modification, such as an update, would need
careful attention.

A: The SeqRecord-like objects from the lazy-parsers could be read-only.
However, if they act enough like the original SeqRecord, then they can
be used with Bio.SeqIO.write(...) to save them to disk. It would be
nice if (like the BioSQL SeqRecord-like objects) it was possible to
modify the records in memory.
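For example, today's fully-loaded records already round-trip like this
(the file names are just placeholders), and ideally the lazy records
would work the same way:

    from Bio import SeqIO

    # Parse records from one file and write them straight back out;
    # SeqIO.write returns the number of records written.
    records = SeqIO.parse("example.gbk", "genbank")
    count = SeqIO.write(records, "copy.gbk", "genbank")
    print("Wrote %i records" % count)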

Q: For very large indexing jobs, could we index on multiple machines
running simultaneously, and then merge the indexes?

A: This seems too complicated. If building the index is slow, I
suggest saving the index on disk (e.g. as an SQLite database). For
comparison, see the BAM and tabix index files, or Biopython's
Bio.SeqIO.index_db(...) function.
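For example (the file name and record identifier are just
placeholders), Bio.SeqIO.index_db(...) keeps the lookup table in an
SQLite file, so a slow index build only has to happen once:

    from Bio import SeqIO

    # Build (or reuse) an on-disk SQLite index of the FASTA file:
    index = SeqIO.index_db("example.idx", "example.fasta", "fasta")
    print("Indexed %i records" % len(index))

    # Records are only parsed from disk when you ask for them:
    record = index["some_identifier"]
    print(record.id, len(record))
    index.close()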

Regards,

Peter


