[Biopython-dev] GSoC python variant update 4

Mic mictadlo at gmail.com
Mon Jun 4 01:11:56 EDT 2012


Hi Lenna,
Big companies are using NoSQL databases ( http://en.wikipedia.org/wiki/NoSQL ).

What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or
http://en.wikipedia.org/wiki/Storm_%28software%29 )?

Cheers,
Mic


On Mon, Jun 4, 2012 at 12:39 PM, Lenna Peterson <arklenna at gmail.com> wrote:

> Blog post (entirely reproduced in this email):
> http://arklenna.tumblr.com/post/24378549953/
>
> I started implementing storage of VCF data in `SeqRecord` and
> `SeqFeature`. I digressed, spending a few days experimenting with
> overloading `__getattr__()` in lieu of manually writing properties.
> Then it occurred to me that if, as Reece pointed out, a variant
> doesn't contain the actual sequence but a reference to the sequence,
> the advantages to using `SeqRecord` are minimal or possibly negative.
>
> In my experience, SQL gives the highest performance for filtering
> large amounts of data. SQL also has the advantage of scalability:
> SQLite now ships with Python, users can choose to run their own
> MySQL/PGSQL server, and I've read about a few approaches to GPU
> accelerated SQL.
>
> My initial glances at BioSQL, GMOD, etc. didn't show anything
> specifically designed for variants (again, a focus on storage of the
> sequence itself), so I implemented my own interface. Currently, the
> `parse_all()` method is very slow (approximately 260 seconds for a
> file with 240,000 variants, whereas parsing alone takes 5-10 seconds)
> and I am investigating why. My first step will be to reduce commit
> frequency.
>
> With a SQL backend, it seems superfluous to have a dedicated variant
> representation within Python. The SQL result object should allow for
> straightforward retrieval of data by name. I'm storing "misc" data in
> a SQL text field using JSON, which is also easy to access.
>
> Next:
>
> * Looking at BioSQL/GMOD etc to see if there is an existing standard I
> should be using/following
> * Deciding the extent of the convenience functions I wish to implement
> * Thinking about the most efficient way to filter records on the way
> into the SQL database
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
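
A minimal sketch of the "reduce commit frequency" step and the JSON "misc"
column described in the quoted post, assuming SQLite through Python's built-in
`sqlite3` module. The table layout, the column names, the `store_variants()`
helper and its `batch_size` parameter are all hypothetical; none of them come
from Lenna's code.

```python
import json
import sqlite3


def store_variants(conn, variants, batch_size=10000):
    """Insert variant dicts, committing once per batch instead of once per row.

    Hypothetical helper: the schema below is an assumption for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
    )
    insert_sql = "INSERT INTO variant VALUES (?, ?, ?, ?, ?)"
    batch = []
    total = 0
    for v in variants:
        # Free-form annotations are serialised to JSON and kept in a text column.
        batch.append(
            (v["chrom"], v["pos"], v["ref"], v["alt"], json.dumps(v.get("misc", {})))
        )
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            conn.commit()  # one commit per batch
            total += len(batch)
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        conn.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets result columns be retrieved by name

# Made-up example records, not real VCF data.
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "misc": {"DP": 14}},
    {"chrom": "1", "pos": 250, "ref": "C", "alt": "T", "misc": {"DP": 3}},
    {"chrom": "2", "pos": 75, "ref": "G", "alt": "A"},
]
stored = store_variants(conn, variants, batch_size=2)

rows = conn.execute(
    "SELECT * FROM variant WHERE chrom = ? ORDER BY pos", ("1",)
).fetchall()
first = rows[0]
depth = json.loads(first["misc"])["DP"]  # decode the JSON "misc" field
```

Whether batching actually removes the slowdown would need to be measured
against the real `parse_all()` code; this only illustrates the idea.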

