[Biopython-dev] GSoC python variant update 5

Reece Hart reece at harts.net
Tue Jun 19 08:51:26 EDT 2012


On Sun, Jun 17, 2012 at 9:21 PM, Lenna Peterson <arklenna at gmail.com> wrote:

> Latest post: http://arklenna.tumblr.com/post/25343434817/
>

Hi Lenna-

Thanks for making the time to update your blog.

As with James and Brad, I doubt the suitability of SQL for this project.
However, I learn things when I'm wrong, so this should work out either way!

I don't understand your "SQL diagram" (more properly, an
"entity-relationship diagram", or ERD). It would help me -- and perhaps
you too -- if you added more detail to the ERD and then parsed a few lines
from a VCF file into your schema by hand (e.g., as a set of TSV files or
Google Docs spreadsheets).
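
As a rough illustration of that hand-parsing exercise, here is a minimal
sketch that splits two illustrative VCF data lines into three flat tables
and prints each as TSV. The table layout (variant, alt_allele, info) and
every column name are assumptions made up for this example; they are not
your schema and not anything Biopython defines. Only the eight fixed VCF
columns are handled.

```python
import csv
import io

# Illustrative VCF data lines (tab-separated), fixed columns only:
# CHROM POS ID REF ALT QUAL FILTER INFO
VCF_LINES = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10",
]


def split_into_tables(lines):
    """Split VCF data lines into rows for three hypothetical tables."""
    variants, alt_alleles, info_rows = [], [], []
    for variant_id, line in enumerate(lines, start=1):
        chrom, pos, name, ref, alt, qual, filt, info = line.split("\t")
        variants.append((variant_id, chrom, int(pos), name, ref, qual, filt))
        # One row per alternate allele, so multi-allelic sites stay normalized.
        for allele in alt.split(","):
            alt_alleles.append((variant_id, allele))
        # One row per INFO key; flag-style keys get an empty value.
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_rows.append((variant_id, key, value))
    return variants, alt_alleles, info_rows


def to_tsv(header, rows):
    """Render a header plus rows as a TSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


variants, alt_alleles, info_rows = split_into_tables(VCF_LINES)
print(to_tsv(("variant_id", "chrom", "pos", "name", "ref", "qual", "filter"),
             variants))
print(to_tsv(("variant_id", "alt"), alt_alleles))
print(to_tsv(("variant_id", "key", "value"), info_rows))
```

Writing the rows out this way makes it easy to see where a schema gets
awkward, for example with multiple ALT alleles or free-form INFO keys.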

It's also worthwhile to look at other people's schemas for similar data.
http://www.ensembl.org/info/docs/variation/variation-database-schema.pdf is
a good place to start.

In any case, VCF parsing is just one special case of general variant
representation, which is the primary goal for this project. Therefore, it
would be worthwhile now to test whatever scheme you propose against other
formats (GFF and HGVS have been discussed). I don't mean that you should
implement them now, but rather just make sure that you're heading in a
direction that's compatible with other planned uses.
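
One cheap way to run that kind of compatibility check is to write down a
format-neutral record and see whether more than one format can fill it.
The sketch below is only an illustration: the Variant tuple and both
helper functions are hypothetical names, and the HGVS pattern covers
nothing beyond simple genomic ("g.") single-base substitutions. It also
leaves the sequence identifier as each format gives it, so mapping a VCF
chromosome name to an accession is a separate step not shown here.

```python
import re
from collections import namedtuple

# Hypothetical format-neutral record; positions are 1-based, as in both
# VCF and HGVS genomic coordinates.
Variant = namedtuple("Variant", ["seq_id", "pos", "ref", "alt"])

# Matches only simple genomic substitutions such as "ACC:g.123A>G".
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<seq_id>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def from_vcf_fields(chrom, pos, ref, alt):
    """Build one Variant per ALT allele from VCF column values."""
    return [Variant(chrom, int(pos), ref, allele) for allele in alt.split(",")]


def from_hgvs(expression):
    """Build a Variant from a simple HGVS genomic substitution string."""
    match = HGVS_SUBSTITUTION.match(expression)
    if match is None:
        raise ValueError("unsupported HGVS expression: %r" % expression)
    return Variant(
        match.group("seq_id"),
        int(match.group("pos")),
        match.group("ref"),
        match.group("alt"),
    )


vcf_variant = from_vcf_fields("20", "14370", "G", "A")[0]
hgvs_variant = from_hgvs("NC_000020.10:g.14370G>A")

# Position and alleles agree even though the sequence identifiers differ.
print(vcf_variant[1:] == hgvs_variant[1:])
```

Insertions, deletions, and GFF features with start/end ranges would not
fit this four-field record as it stands, which is exactly the sort of gap
worth finding before committing to a schema.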

-Reece

