[Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions

Tue Mar 27 18:57:45 UTC 2012

Hi,
I have uploaded the first draft of my project proposal. I will add
more sections to the project plan in a day or two; I just wanted to get
the initial draft up. I hope to improve the proposal with your
feedback.

Regards,
Chaitanya

On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Chaitanya;
> Thanks for the interest and specific questions.
>
>> 1. For the implementation of variants, which would be better: to
>> create a new SeqVariant class from scratch, or to extend the
>> SeqFeature class to accommodate variants? I guess a separate class
>> would be better.
>
> My preference would be to see how far the SeqFeature class can take you
> before implementing a new class. It should be general enough to handle
> variant data, but the bigger challenge might be designing a lightweight
> representation that is compatible with existing SeqFeatures.
>
>> 2. While looking at the Biopython wiki I came across an implementation
>> of GFF at
>> https://github.com/chapmanb/bcbb/tree/master/gff
>> As GVF is an extension of GFF3, this module could be used for reading
>> GVF files too. Would this module be a good starting point for adding
>> GVF support?
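
As a rough illustration of why a GFF3 reader is a plausible starting
point for GVF: a GVF record uses the same nine tab-separated columns as
GFF3 and carries the variant details in the attributes column. The
sketch below uses only the Python standard library and is not the bcbb
GFF module; the function name parse_gvf_line and the sample record are
made up for illustration.

```python
from urllib.parse import unquote

# The first eight GFF3 columns; the ninth is the attributes column.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gvf_line(line):
    """Split one GVF/GFF3 record into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d"
                         % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are ';'-separated tag=value pairs; values may be
    # ','-separated lists and are URL-escaped in GFF3.
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Made-up sample record, not taken from a real data set.
sample = ("chr16\tsamtools\tSNV\t49291141\t49291141\t.\t+\t.\t"
          "ID=ID_1;Variant_seq=A,G;Reference_seq=G")
record = parse_gvf_line(sample)
```

Here the alternate alleles end up in record["attributes"]["Variant_seq"],
so the GVF-specific work would mostly be interpreting attribute tags
rather than changing how the columns are read.
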
>
> That would be perfect. We're hoping to merge this into the Biopython
> code base before the next release. There is also an existing VCF parser
> we'd love to use here:
>
> https://github.com/jamescasbon/PyVCF
>
>> 3. I've been going through the VCF documentation. SNPs, insertions
>> and deletions can be represented just as they are in VCF: the
>> object would have a start position, the length of the reference
>> sequence (no need to store the sequence itself) and a list of
>> alternate sequence objects. I still have to look into SVs
>> (structural variants), rearrangements and imprecise variant
>> information, so this representation is only for SNPs and small
>> indels. GVF has a very similar format for small indels and SNPs,
>> except that it provides an extra end position column, which is not
>> required if we have the reference sequence.
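
The representation described in point 3 could be sketched roughly as
follows. This is only one possible reading of it, not a proposed
Biopython API: the class name SmallVariant, its field names, the
0-based start coordinate and the use of plain strings for the alternate
sequences are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SmallVariant:
    """A SNP or small indel, VCF style (hypothetical class)."""
    chrom: str
    start: int        # assumed 0-based start of the reference allele
    ref_length: int   # only the length is kept, not the sequence
    alts: List[str] = field(default_factory=list)

    @property
    def end(self):
        # The end position that GVF stores explicitly can be derived
        # from the start and the reference length.
        return self.start + self.ref_length

    def kind(self, alt):
        """Classify one alternate allele by comparing lengths."""
        if len(alt) == self.ref_length:
            return "SNP" if self.ref_length == 1 else "substitution"
        if len(alt) > self.ref_length:
            return "insertion"
        return "deletion"


# Made-up examples: a SNP with two alternates, and a deletion written
# VCF style (three reference bases replaced by one).
snp = SmallVariant("chr1", start=99, ref_length=1, alts=["A", "T"])
deletion = SmallVariant("chr1", start=199, ref_length=3, alts=["G"])
```

Structural variants, rearrangements and imprecise variants are
deliberately left out, as in the description above.
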
>
> This sounds good. My general suggestion is to start writing your
> proposal as soon as possible. A concrete first draft will help with more
> detailed comments. The wiki has good information on the project plan:
>
> http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
>
> and the NESCent wiki has some examples of well-written proposals from
> previous years:
>
> http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
>
> One of the key aspects is having a detailed week-by-week outline of your
> plans for the summer.
>
> Thanks again for the interest,
> Brad