[Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions

Brad Chapman chapmanb at 50mail.com
Wed Mar 28 00:43:33 UTC 2012


Chaitanya;
The easiest way to work on your proposal is to write it in a
public Google Doc and then share it with the list. I don't yet have
access to everything in Melange for GSoC, and I'd imagine others who
might have thoughts are in the same boat. As a side benefit, a shared
doc also makes it much easier to collaborate on edits and notes.

Brad

> Hi,
> I have uploaded the first draft of my project proposal. I will add
> more sections to the project plan in a day or two. Just wanted to have
> the initial draft up. I hope to write a better proposal with your
> feedback.
> 
> Regards,
> Chaitanya
> 
> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >
> > Chaitanya;
> > Thanks for the interest and specific questions.
> >
> >> 1. For the implementation of variants, which would be better: to create
> >> a new SeqVariant class from scratch or to extend the SeqFeature class
> >> to accommodate variants? I guess a separate class would be better.
> >
> > My preference would be to see how far the SeqFeature class can take you
> > before implementing a new class. It should be general enough to handle
> > variant data, but the bigger challenge might be designing a lightweight
> > representation that is compatible with existing SeqFeatures.
> >
> >> 2. While looking at the Biopython wiki I came across an implementation
> >> of GFF at
> >> https://github.com/chapmanb/bcbb/tree/master/gff
> >> As GVF is an extension of GFF3, this module could be used for reading
> >> GVF files too. Would it be a good starting point for adding GVF support?
> >
> > That would be perfect. We're hoping to merge this into the Biopython
> > code base before the next release. There is also an existing VCF parser
> > we'd love to use here:
> >
> > https://github.com/jamescasbon/PyVCF
> >
> >> 3. I've been going through the VCF documentation. SNPs, insertions
> >> and deletions can be represented just as they are in VCF: the
> >> object would have a start position, the length of the reference
> >> sequence (no need to store the sequence itself) and a list of
> >> alternate sequence objects. I still have to look into SVs
> >> (structural variants), rearrangements and imprecise variant
> >> information, so this representation covers only SNPs and small
> >> indels. GVF has a very similar format for small indels and SNPs,
> >> except that it provides an extra end position column, which is not
> >> required if we have the reference sequence.
> >
> > This sounds good. My general suggestion is to start writing your
> > proposal as soon as possible. A concrete first draft will help with more
> > detailed comments. The wiki has good information on the project plan:
> >
> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
> >
> > and the NESCent wiki has some examples of well-written proposals from
> > previous years:
> >
> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
> >
> > One of the key aspects is having a detailed week-by-week outline of your
> > plans for the summer.
> >
> > Thanks again for the interest,
> > Brad
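
The representation described in point 3 above (a start position, the
length of the reference allele, and a list of alternates) can be
sketched in a few lines. This is only an illustration, not the planned
Biopython design: the SmallVariant class and the from_vcf_line helper
are hypothetical names, the 0-based start is an assumption chosen to
match how SeqFeature locations are counted, and the two input lines are
made-up sample data in VCF column order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmallVariant:
    """Hypothetical lightweight record for a SNP or small indel."""
    chrom: str
    start: int        # 0-based start of the reference allele (assumption)
    ref_length: int   # length of the reference allele; the sequence is not kept
    alts: List[str]   # alternate allele sequences

    @property
    def end(self) -> int:
        # The exclusive end is derived from start and reference length,
        # so a separately stored end column carries no extra information.
        return self.start + self.ref_length


def from_vcf_line(line: str) -> SmallVariant:
    """Build a SmallVariant from one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # VCF POS is 1-based; convert it to a 0-based start.
    return SmallVariant(chrom, int(pos) - 1, len(ref), alt.split(","))


# Sample data lines: one SNP and one small indel with two alternates.
snp = from_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\t.")
indel = from_vcf_line("20\t1234567\tmicrosat1\tGTC\tG,GTCT\t50\tPASS\t.")

print(snp.start, snp.end, snp.alts)        # 14369 14370 ['A']
print(indel.start, indel.end, indel.alts)  # 1234566 1234569 ['G', 'GTCT']
```

The same three fields would map naturally onto a feature location plus
qualifiers if the SeqFeature route works out, which is one reason to
try that before writing a new class.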


