[Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions

Brad Chapman chapmanb at 50mail.com
Mon Mar 26 07:07:36 EDT 2012


Chaitanya;
Thanks for the interest and specific questions.

> 1. For the implementation of variants, which would be better: creating
> a new SeqVariant class from scratch or extending the SeqFeature class
> to accommodate variants? I guess a separate class would be better.

My preference would be to see how far the SeqFeature class can take you
before implementing a new class. It should be general enough to handle
variant data, but the bigger challenge might be designing a lightweight
representation that is compatible with existing SeqFeatures.

> 2. While looking at the Biopython wiki I came across an implementation
> of GFF at
> https://github.com/chapmanb/bcbb/tree/master/gff
> As GVF is an extension of GFF3, this module could be used for reading
> GVF files too. Is this module a good starting point to modify for GVF
> support?

That would be perfect. We're hoping to merge this into the Biopython
code base before the next release. There is also an existing VCF parser
we'd love to use here:

https://github.com/jamescasbon/PyVCF
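
To illustrate why a GFF3 parser carries over to GVF, here is a minimal
standalone sketch in Python 3. It does not use the bcbb GFF module or
PyVCF; the function name parse_gvf_line and the example record are
made up for illustration. It assumes the standard nine tab-separated
GFF3 columns, with the variant-specific data (Variant_seq,
Reference_seq) carried in the attributes column:

```python
from urllib.parse import unquote


def parse_gvf_line(line):
    """Parse one tab-delimited GVF/GFF3 feature line into a dict.

    Hypothetical helper for illustration only, not part of any library.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols

    # Column 9 holds key=value pairs separated by ';'; a value may be a
    # comma-separated list, and GFF3 uses URL-style percent escapes.
    attributes = {}
    for pair in attr_text.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive, as in GFF3
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Made-up example record, not taken from a real data set.
example = (
    "chr16\tsamtools\tSNV\t49291141\t49291141\t.\t+\t.\t"
    "ID=ID_1;Variant_seq=A,G;Reference_seq=G"
)
record = parse_gvf_line(example)
```

A real implementation would also need to handle the GVF pragma lines
(those starting with ##) and comments, which this sketch ignores.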

> 3. I've been going through the VCF documentation. SNPs, insertions
> and deletions can be represented just as they are in VCF: the
> object would have a start position, the length of the reference
> sequence (no need to store the sequence itself) and a list of
> alternate sequence objects. I still have to look into SVs (structural
> variants), rearrangements and imprecise variant information, so this
> representation is only for SNPs and small indels. GVF has a very
> similar format for small indels and SNPs, except that it provides an
> extra end position column, which is not required if we have the
> reference sequence.
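
The representation described in the question above could be sketched
roughly as follows. This is only one possible shape, written in
Python 3 with the standard library; the class name SmallVariant, its
field names and the 0-based coordinate convention are assumptions for
illustration, not an existing Biopython API. Alternates are plain
strings here rather than sequence objects:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SmallVariant:
    """Hypothetical lightweight record for SNPs and small indels."""

    chrom: str
    start: int        # assumed 0-based start on the reference
    ref_length: int   # length of the reference allele; sequence not stored
    alts: List[str] = field(default_factory=list)

    @property
    def end(self):
        # The end (exclusive) follows from start and reference length,
        # so it does not need to be stored separately.
        return self.start + self.ref_length

    def kind(self, alt):
        """Classify one alternate allele by comparing lengths."""
        if len(alt) == self.ref_length:
            return "SNP" if self.ref_length == 1 else "substitution"
        return "insertion" if len(alt) > self.ref_length else "deletion"


# A SNP with two alternate alleles, and a two-base deletion written
# VCF-style with one anchoring base (reference length 3, alternate length 1).
snp = SmallVariant("chr1", 99, 1, ["A", "G"])
deletion = SmallVariant("chr1", 199, 3, ["T"])
```

The end property shows why a separate end column is redundant for
small variants once the reference length is known. Structural and
imprecise variants would need more than this.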

This sounds good. My general suggestion is to start writing your
proposal as soon as possible. A concrete first draft will help with more
detailed comments. The wiki has good information on the project plan:

http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply

and the NESCent wiki has some examples of well-written proposals from
previous years:

http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application

One of the key aspects is having a detailed week-by-week outline of your
plans for the summer.

Thanks again for the interest,
Brad

