[Biopython-dev] [GSoC] GSoC python variant update 6

Reece Hart reece at harts.net
Thu Jul 5 19:40:02 UTC 2012


On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson <arklenna at gmail.com> wrote:

> For a Python variant object, are there any organizational choices that
> would make it easier for future conversion of a variant to HGVS
> syntax? (this is primarily directed at Reece but I'm open to all
> suggestions)
>

Oh, no, things directed at me!

That's a broad question. I'll try to answer without being long-winded.

The essential elements of a sequence variant are a reference to a sequence,
the location, and specifics about the operation. The name, allelic depth,
etc. are all distinct from these elements and I would store them separately
in a format-specific record or as a subclass.

I don't have much experience with FeatureLocations, but that might be
appropriate. Depending on how far you plan to go with VCF, you'll have to
deal with Locations for breakpoints.

For the Occam's Razor version of a model for variation, I'd float this in the
community:

variation := <accession, start, stop, pre_seq, post_seq>

And I'd test this against representing:

   - a single SNP in VCF
   - a compound het from VCF
   - a variant in RNA
   - a variant in CDS coords
   - a variant in a protein sequence
   - a trinucleotide repeat

(Which the simple model above fails, BTW.)
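To make the five-field model concrete, here is a minimal sketch of it as a
Python dataclass. The class name Variation, the apply() helper, the
"demo_accession" identifier, and the use of interbase coordinates for
start/stop are all assumptions made for illustration; nothing here is an
existing Biopython API. It covers a substitution and an insertion, and it
makes no attempt at the cases that the simple model handles poorly (e.g.,
repeats or compound genotypes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variation:
    """Hypothetical <accession, start, stop, pre_seq, post_seq> record.

    Assumes interbase coordinates: start is 0-based inclusive, stop is exclusive.
    """
    accession: str  # identifier of the reference sequence
    start: int
    stop: int
    pre_seq: str    # reference sequence occupying [start, stop)
    post_seq: str   # sequence that replaces pre_seq

    def __post_init__(self):
        if not 0 <= self.start <= self.stop:
            raise ValueError("need 0 <= start <= stop")
        if len(self.pre_seq) != self.stop - self.start:
            raise ValueError("pre_seq length must equal stop - start")

    def apply(self, reference: str) -> str:
        """Return a copy of the reference with this variation applied."""
        if reference[self.start:self.stop] != self.pre_seq:
            raise ValueError("pre_seq does not match the reference")
        return reference[:self.start] + self.post_seq + reference[self.stop:]


ref = "ACTGAC"
snv = Variation("demo_accession", 3, 4, "G", "T")        # substitution
insertion = Variation("demo_accession", 4, 4, "", "AA")  # zero-width span

print(snv.apply(ref))        # ACTTAC
print(insertion.apply(ref))  # ACTGAAAC
```

A zero-width span (start == stop) is what lets an insertion fit the same
five fields as a substitution, which is one argument for interbase
coordinates in this kind of model.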

What makes the uber variant problem hard, I think, is several competing
design axes: 1) sequence type (DNA, RNA, protein), 2) coordinate systems
(really, CDS in a transcript record), 3) diversity of variant types (SNV,
indel, repeat, etc), 4) diversity of auxiliary data (e.g., genotype info
from VCF).

HGVS makes us think beyond VCF data alone: in particular, it adds the
nuance of coordinate systems and multiple sequence types.  I suspect you
should be considering mixins and/or subclassing for some of these needs.

I don't know how to solve any of this complexity. What I do know is that 1)
it's too much just for your project, 2) it would be nice to have a design
that can be easily extended beyond your project, and 3) therefore, part of
your project should be to pave the way for extensions without tackling
them. It's also a good time to put stakes in the ground around internal
conventions, such as always representing variants in interbase
coordinates (= 0-based, right-open).
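For anyone unfamiliar with interbase coordinates, the conversion from the
1-based, closed intervals used by formats such as VCF is small enough to
sketch. The function names to_interbase and from_interbase are made up for
this example.

```python
from typing import Tuple


def to_interbase(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert a 1-based, closed interval to interbase (0-based, right-open)."""
    if not 1 <= start_1based <= end_1based:
        raise ValueError("need 1 <= start <= end")
    return (start_1based - 1, end_1based)


def from_interbase(start: int, stop: int) -> Tuple[int, int]:
    """Convert an interbase span back to a 1-based, closed interval."""
    if not 0 <= start < stop:
        # A zero-width span (a pure insertion point) has no closed equivalent.
        raise ValueError("need 0 <= start < stop")
    return (start + 1, stop)


ref = "ACTGAC"
start, stop = to_interbase(4, 4)  # the single base at 1-based position 4
print((start, stop))              # (3, 4)
print(ref[start:stop])            # G
print(from_interbase(start, stop))  # (4, 4)
```

Interbase spans map directly onto Python slices, and the length of a span
is simply stop - start.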

And, if you end up handling just VCF variants, that's cool too.

-Reece


