[Biopython-dev] [GSoC] GSoC python variant update 4

Mon Jun 4 12:04:15 EDT 2012

Lenna,
Thanks for the summary. A couple of thoughts on the directions:

- For property access, I think the best approach would be to store all
  of the arbitrary key/value pairs from INFO in SeqRecord annotations,
  then use hand-coded @properties to expose only the most useful. That
  gives people attribute access to the most useful ones (as determined
  by you) but still lets anyone dig in and get custom ones.

- If you'd like to explore an SQL backend, you should have a look at
  Gemini:

  https://github.com/arq5x/gemini

  which stores variants in a SQLite database along with associated
  annotations. It's a flat structure based on adding and exposing useful
  annotations on variants:

  https://github.com/arq5x/gemini/blob/master/gemini/database.py

  Inventing a new SQL storage representation is a lot of work, so it
  might be good to build on what other folks are currently doing and
  try to provide a Biopython-friendly front end, much as you're
  exploring with PyVCF.
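
A rough sketch of the first point, to make the idea concrete. The
VariantRecord class and its property names are hypothetical, and a
plain dict stands in for SeqRecord annotations so the example runs
without Biopython. DP and AF are the standard VCF INFO keys for read
depth and allele frequency; MYTAG is a made-up custom key.

```python
class VariantRecord(object):
    """Hypothetical stand-in for a SeqRecord-like object built from one VCF line."""

    def __init__(self, chrom, pos, info):
        self.chrom = chrom
        self.pos = pos
        # Keep every INFO key/value pair, as SeqRecord.annotations would.
        self.annotations = dict(info)

    @property
    def depth(self):
        """Read depth (INFO key DP), or None if the key is absent."""
        return self.annotations.get("DP")

    @property
    def allele_frequency(self):
        """Allele frequency (INFO key AF), or None if the key is absent."""
        return self.annotations.get("AF")


rec = VariantRecord("chr1", 12345, {"DP": 14, "AF": 0.5, "MYTAG": "custom"})
common = rec.depth                   # hand-coded property for a common key
custom = rec.annotations["MYTAG"]    # anything else via the dictionary
```

The common keys get attribute access, and everything else remains
reachable through the dictionary without any extra code.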

Hope these are useful. Let me know if you have any questions at all,
Brad

> Blog post (entirely reproduced in this email):
> http://arklenna.tumblr.com/post/24378549953/
>
> I started implementing storage of VCF data in `SeqRecord` and
> `SeqFeature`. I digressed, spending a few days experimenting with
> overloading `__getattr__()` in lieu of manually writing properties.
> Then it occurred to me that if, as Reece pointed out, a variant
> doesn't contain the actual sequence but a reference to the sequence,
> the advantages to using `SeqRecord` are minimal or possibly negative.
>
> In my experience, the highest performance for filtering large amounts
> of data is SQL. SQL has the advantage of scalability: SQLite now ships
> with Python, users can choose to run their own MySQL/PGSQL server, and
> I've read about a few approaches to GPU accelerated SQL.
>
> My initial glances at BioSQL, GMOD, etc. didn't show anything
> specifically designed for variants (again, a focus on storage of the
> sequence itself) so I implemented my own interface. Currently, the
> `parse_all()` method is very slow (approximately 260 seconds for a
> file with 240,000 variants, whereas parsing alone takes 5-10 seconds)
> and I am investigating why. My first step will be to reduce commit
> frequency.
>
> With a SQL backend, it seems superfluous to have a dedicated variant
> representation within Python. The SQL result object should allow for
> straightforward retrieval of data by name. I'm storing "misc" data in
> a SQL text field using JSON, which is also easy to access.
>
> Next:
>
> * Looking at BioSQL/GMOD etc to see if there is an existing standard I
> should be using/following
> * Deciding the extent of the convenience functions I wish to implement
> * Thinking about the most efficient way to filter records on the way
> into the SQL database