[Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions

Fri Mar 30 01:13:46 UTC 2012

Chaitanya;
Thanks for making this available. It's a great start; the next step is
to make your project plan much more detailed. I left specific comments
inline in the proposal. Let us know when you have a revised version and
we can work on it further. Thanks again,
Brad

> Here's the Google Doc link; I have made it editable too.
> 
> https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit
> 
> On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >
> > Chaitanya;
> > The easiest way to work on your proposal is to write it in a
> > public Google Doc and then share with the list. I don't yet have access
> > to all of the Melange GSoC project and I'd imagine others who might
> > have thoughts are in the same boat. As a side benefit it's also much
> > easier to collaborate on editing and notes.
> >
> > Brad
> >
> >> Hi,
> >> I have uploaded the first draft of my project proposal. I will add
> >> more sections to the project plan in a day or two. Just wanted to have
> >> the initial draft up. I hope to write a better proposal with your
> >> feedback.
> >>
> >> Regards,
> >> Chaitanya
> >>
> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> >> >
> >> > Chaitanya;
> >> > Thanks for the interest and specific questions.
> >> >
> >> >> 1. For the implementation of variants, which would be better: creating
> >> >> a new SeqVariant class from scratch, or extending the SeqFeature class
> >> >> to accommodate variants? I guess a separate class would be better.
> >> >
> >> > My preference would be to see how far the SeqFeature class can take you
> >> > before implementing a new class. It should be general enough to handle
> >> > variant data, but the bigger challenge might be designing a lightweight
> >> > representation that is compatible with existing SeqFeatures.
> >> >
> >> >> 2. While looking at the Biopython wiki I came across an implementation
> >> >> of GFF at
> >> >> https://github.com/chapmanb/bcbb/tree/master/gff
> >> >> As GVF is an extension of GFF3, this module could be used for reading
> >> >> GVF files too. Would this module be a good starting point for adding
> >> >> GVF support?
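
To illustrate why a GFF3 reader is a plausible base for GVF, here is a minimal sketch using only the Python standard library. It does not use the bcbb GFF module; `parse_gvf_line` is a hypothetical helper and the sample line is made up for illustration. It assumes a GVF record is a nine-column, tab-separated GFF3 line whose attribute column carries variant-specific tags such as `Variant_seq` and `Reference_seq`.

```python
from urllib.parse import unquote

# The first eight GFF3 columns; the ninth holds the key=value attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gvf_line(line):
    """Parse one GVF/GFF3 feature line into a dictionary (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # GFF3 allows comma-separated multiple values for an attribute
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Illustrative record, not taken from a real data set
example = "chr16\tdbSNP\tSNV\t49291141\t49291141\t.\t+\t.\tID=ID_1;Variant_seq=A,G;Reference_seq=G"
record = parse_gvf_line(example)
print(record["type"], record["attributes"]["Variant_seq"])
```

The only GVF-specific part is how the attribute tags are interpreted afterwards, which is why extending an existing GFF3 parser looks attractive.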
> >> >
> >> > That would be perfect. We're hoping to merge this into the Biopython
> >> > code base before the next release. There is also an existing VCF parser
> >> > we'd love to use here:
> >> >
> >> > https://github.com/jamescasbon/PyVCF
> >> >
> >> >> 3. I've been going through the VCF documentation. SNPs, insertions
> >> >> and deletions can be represented just as they are in VCF: the
> >> >> object would have a start position, the length of the reference
> >> >> sequence (no need to store the sequence itself) and a list of
> >> >> alternate sequence objects. I still have to look into SVs (structural
> >> >> variants), rearrangements and imprecise variant information, so this
> >> >> representation is only for SNPs and small indels. GVF has a very
> >> >> similar format for small indels and SNPs, except that it provides an
> >> >> extra end position column, which is not required if we have the
> >> >> reference sequence.
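
The representation described in point 3 could be sketched roughly as follows. This is an illustration only, not Biopython code: `SmallVariant` and `from_vcf_fields` are hypothetical names, alternate alleles are kept as plain strings rather than sequence objects, and the 0-based start is an assumption made here (VCF POS is 1-based).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SmallVariant:
    """Hypothetical container for a SNP or small indel (not a Biopython class)."""
    chrom: str
    start: int       # 0-based start (assumption); VCF POS itself is 1-based
    ref_length: int  # length of the reference allele; the bases are not stored
    alts: List[str] = field(default_factory=list)

    @property
    def end(self) -> int:
        """Exclusive end coordinate, derived from start and ref_length."""
        return self.start + self.ref_length


def from_vcf_fields(chrom: str, pos: int, ref: str, alt: str) -> SmallVariant:
    """Build a SmallVariant from the CHROM, POS, REF and ALT columns of a VCF record."""
    return SmallVariant(chrom=chrom, start=pos - 1,
                        ref_length=len(ref), alts=alt.split(","))


snp = from_vcf_fields("20", 14370, "G", "A")
indel = from_vcf_fields("20", 1234567, "GTC", "G,GTCT")
print(snp.end, indel.end)
```

Because the end is computed from the start and the reference length, the explicit end column that GVF provides carries no extra information for SNPs and small indels.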
> >> >
> >> > This sounds good. My general suggestion is to start writing your
> >> > proposal as soon as possible. A concrete first draft will help with more
> >> > detailed comments. The wiki has good information on the project plan:
> >> >
> >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
> >> >
> >> > and the NESCent wiki has some examples of well-written proposals from
> >> > previous years:
> >> >
> >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
> >> >
> >> > One of the key aspects is having a detailed week-by-week outline of your
> >> > plans for the summer.
> >> >
> >> > Thanks again for the interest,
> >> > Brad