[Biopython] GSOC 2012: Representation and manipulation of genomic variants

Mon Mar 26 11:16:27 UTC 2012

Chris,
Welcome and thanks for the interest in the project.

> I'm currently working on large -omics based data (whole genome alignments,
> RNA-Seq) so I have a flavor of what formats end users will encounter (I've
> worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
> Affy arrays for SNPs/CNVs) and more importantly, I know how the end user
> will want to utilize the data.  By far, I see the biggest hurdle is to
> arrange several types of data representations into a universal reference
> frame (for instance bam files being 0 based, sam being 1 based, CG vcf
> files being 0 based, closed interval versus half open, etc etc etc).  I've
> written parsers for my own use that interconvert between formats and can
> read/output GFF/VCF files, and this would be a great opportunity to expand
> on my existing toolset and get valuable feedback from others in the
> community.

I agree with Peter: you want to convert everything to standard Python
0-based coordinates internally. The goal is to have one consistent data
structure so your code is independent of the input and output formats.
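
As a rough illustration of what I mean, here is a minimal sketch of
converting at the boundary. It assumes the internal form is 0-based and
half-open, like Python slices; the half-open part is my assumption and
not something we have settled on. The Interval class and the two helper
functions are hypothetical names for this example, not existing
Biopython code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical internal interval: 0-based, half-open [start, end)."""
    chrom: str
    start: int
    end: int

    def __len__(self):
        # With half-open coordinates the length is a plain difference.
        return self.end - self.start


def from_one_based_closed(chrom, start, end):
    """Convert 1-based, closed coordinates (the GFF convention) on input."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the closed end equals the half-open end.
    return Interval(chrom, start - 1, end)


def to_one_based_closed(interval):
    """Convert back to 1-based, closed coordinates on output."""
    return interval.chrom, interval.start + 1, interval.end


# The internal form can be used directly as a Python slice.
seq = "ACGTACGT"
feature = from_one_based_closed("chr1", 3, 5)
subseq = seq[feature.start:feature.end]  # bases 3 through 5, 1-based
```

The parsers and writers are then the only places that need to know
about each format's convention; everything in between works on
Interval objects.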

There are some existing VCF and GFF parsers we were targeting for
inclusion:

https://github.com/jamescasbon/PyVCF
http://biopython.org/wiki/GFF_Parsing

but it would be great to see code you've written as well.

I am repeating myself, but my general suggestion is to start writing your
proposal as soon as possible. A concrete first draft will help with more
detailed comments. The wiki has good information on the project plan:

http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply

and the NESCent wiki has some examples of well-written proposals from
previous years:

http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application

One of the key aspects is having a detailed week-by-week outline of your
plans for the summer.

Brad