[Biopython-dev] GSoC python variant update 2

Reece Hart reece at harts.net
Sun May 20 16:01:54 UTC 2012


On Wed, May 16, 2012 at 1:01 PM, Lenna Peterson <arklenna at gmail.com> wrote:

> I don't think `SeqFeature` or an extension thereof would be appropriate
> for storing Variant data; therefore, I intend to make a new structure based
> on `_Record` and `_Call` in PyVCF.
>


> Brad> I don't think we should add a new representation class unless we
> Brad> explicitly need to store additional information.
>
> The reason I suggested a new representation class is so data from all
> parsers can be stored in the same way.


Lenna makes a very sound point. A Variant class should be able to represent
all variant types, and therefore represent *only* the salient features of a
generalized variant. It should not be specific to a particular format.

For instance, _Record expects a CHROM, but this immediately eliminates its
use for transcript-based variants (NM or ENST). QUAL, FILTER, INFO, and
FORMAT are not intrinsic properties of a variant. Don't get me wrong --
it's exactly right for a *VCF* variant. However, _Record was never intended
to be the variant abstraction that I think we should be aiming for at this
time. Being VCF-specific isn't bad, but let's make sure the name accurately
reflects the level of abstraction.

Here's a counterexample:

  variant = < ref_ac, var_type, loc, pre, post, rpt_count >

  ref_ac    -- accession
  var_type  -- type of variant/coordinate system (genomic, cds, protein)
  loc       -- location of the variant
  pre       -- "before" seq (aka reference); empty if insertion
  post      -- "after" seq (alt); empty if deletion or repeat
  rpt_count -- min, max count for repeats

I implemented variants roughly this way once
(http://bitbucket.org/reece/bio-hgvs-perl). This structure is agnostic
to the peculiarities of any particular format. I show it as an example,
not a proposal.
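
To make the tuple above concrete, here is a minimal Python sketch of the
same six fields. It is only an illustration under several assumptions:
the `Variant` namedtuple, the `classify` helper, the (start, end) form of
`loc`, the use of None for "no repeat", and the accession strings are all
hypothetical and are not taken from bio-hgvs-perl or PyVCF. It sticks to
`collections.namedtuple`, which is available from Python 2.6 onward.

```python
from collections import namedtuple

# Hypothetical format-agnostic variant; field names follow the tuple above.
Variant = namedtuple(
    "Variant", ["ref_ac", "var_type", "loc", "pre", "post", "rpt_count"]
)


def classify(variant):
    """Return a coarse label derived only from pre, post, and rpt_count.

    Assumption: rpt_count is None unless the variant is a repeat, in which
    case it is a (min, max) tuple.
    """
    if variant.rpt_count is not None:
        return "repeat"
    if variant.pre and variant.post:
        return "substitution"
    if variant.post:
        return "insertion"   # empty "before" sequence
    if variant.pre:
        return "deletion"    # empty "after" sequence
    return "unknown"


# Illustrative values only; the accessions are placeholders, not real records.
# loc is assumed to be a (start, end) pair in the var_type coordinate system.
sub = Variant("NM_000000.0", "cds", (215, 215), "C", "G", None)
ins = Variant("NM_000000.0", "cds", (215, 216), "", "GA", None)
dele = Variant("ENST00000000000", "cds", (10, 12), "ATG", "", None)
rpt = Variant("NC_000000.0", "genomic", (100, 102), "CAG", "", (10, 35))

for v in (sub, ins, dele, rpt):
    print("%s %s -> %s" % (v.ref_ac, v.var_type, classify(v)))
```

Note that nothing here mentions CHROM, QUAL, FILTER, INFO, or FORMAT; a
VCF-specific record could carry those alongside (or convert to) a
structure like this one.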



> Therefore, I am planning to write my project to be compatible with Python
> 2.6 and delaying its inclusion in the main Biopython branch until a future
> 2.6+ Biopython release.
>

Has anyone ever polled to see what versions of Python people are using? I
wonder whether we should even care about 2.6 (never mind 2.5). My guess is
that 2.5 and 2.6 are tails of the distribution (as is 3.0, but at least
it's ascending). I would be content to focus exclusively on 2.7 and 3.0.

-Reece


