[Biopython-dev] GSoC python variant update 4

Mon Jun 4 02:39:47 UTC 2012

Blog post (entirely reproduced in this email):
http://arklenna.tumblr.com/post/24378549953/

I started implementing storage of VCF data in `SeqRecord` and
`SeqFeature`. I digressed, spending a few days experimenting with
overloading `__getattr__()` in lieu of manually writing properties.
Then it occurred to me that if, as Reece pointed out, a variant
contains a reference to the sequence rather than the sequence itself,
the advantages of using `SeqRecord` are minimal or possibly negative.
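
For readers unfamiliar with the `__getattr__()` approach mentioned
above, here is a minimal sketch of the general idea. It is not the
code from my branch: the `Variant` class, its fields, and the `info`
dictionary are hypothetical names chosen for illustration.

```python
class Variant(object):
    """Hypothetical variant holder, used only to illustrate __getattr__."""

    def __init__(self, chrom, pos, ref, alt, info=None):
        self.chrom = chrom
        self.pos = pos
        self.ref = ref
        self.alt = alt
        # Free-form key/value data, e.g. from a VCF INFO column.
        self.info = dict(info or {})

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so the
        # regular attributes set in __init__ never reach this method.
        # Going through __dict__ avoids infinite recursion if `info`
        # itself has not been set yet.
        info = self.__dict__.get("info", {})
        try:
            return info[name]
        except KeyError:
            raise AttributeError(name)


v = Variant("20", 14370, "G", "A", info={"DP": 14, "AF": 0.5})
depth = v.DP  # resolved through __getattr__, no property written by hand
```

The trade-off is that such attributes are invisible to `dir()` and to
tab completion unless `__dir__()` is overridden as well.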

In my experience, SQL gives the best performance for filtering large
amounts of data. It also scales well: SQLite now ships with Python,
users can choose to run their own MySQL or PostgreSQL server, and
I've read about a few approaches to GPU-accelerated SQL.

My initial glances at BioSQL, GMOD, etc. didn't show anything
specifically designed for variants (again, the focus is on storing
the sequence itself), so I implemented my own interface. Currently,
the `parse_all()` method is very slow (approximately 260 seconds for
a file with 240,000 variants, whereas parsing alone takes 5-10
seconds) and I am investigating why. My first step will be to reduce
commit frequency.
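
As a rough sketch of what reducing commit frequency could look like
with the standard `sqlite3` module: rows are buffered and written with
`executemany()`, with one commit per batch rather than one per row.
The table layout, the `insert_variants()` helper, and the batch size
are all assumptions for illustration, not the schema I am using, and
I have not yet confirmed that commits are the actual bottleneck.

```python
import sqlite3


def insert_variants(conn, variants, batch_size=10000):
    """Insert (chrom, pos, ref, alt) tuples, committing once per batch."""
    sql = "INSERT INTO variant (chrom, pos, ref, alt) VALUES (?, ?, ?, ?)"
    batch = []
    for row in variants:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()  # one commit per batch instead of per row
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        conn.executemany(sql, batch)
        conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
# Made-up rows standing in for parsed VCF records.
rows = (("20", 1000 + i, "G", "A") for i in range(25000))
insert_variants(conn, rows, batch_size=10000)
count = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
```

Because `variants` can be a generator, the whole file never has to be
held in memory at once.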

With a SQL backend, it seems superfluous to have a dedicated variant
representation within Python. The SQL result object should allow for
straightforward retrieval of data by name. I'm storing "misc" data in
a SQL text field using JSON, which is also easy to access.
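
To illustrate retrieval by name and the JSON "misc" field, here is a
small sketch using `sqlite3.Row` and the standard `json` module. The
table and column names are hypothetical, and the values are made up.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3.Row lets result columns be accessed by name as well as index.
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE variant "
    "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
)

# Fields without a dedicated column are serialized into the misc column.
misc_in = {"DP": 14, "DB": True}
conn.execute(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    ("20", 14370, "G", "A", json.dumps(misc_in)),
)
conn.commit()

row = conn.execute(
    "SELECT * FROM variant WHERE pos = ?", (14370,)
).fetchone()
chrom = row["chrom"]            # access by column name
misc = json.loads(row["misc"])  # back to a plain dict
```

One limitation of this layout is that values inside the JSON text
cannot be filtered by a plain SQL `WHERE` clause, so anything that
needs to be queried often probably deserves its own column.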

Next:

* Looking at BioSQL, GMOD, etc. to see whether there is an existing
standard I should be following
* Deciding the extent of the convenience functions I wish to implement
* Thinking about the most efficient way to filter records on the way
into the SQL database