[Biopython] GSOC 2012: Representation and manipulation of genomic variants

Sun Mar 25 15:13:47 EDT 2012

Hey everyone,

I'm interested in undertaking this project.  I'm currently a PhD student in
Biochemical, Cellular, & Molecular Biology at Johns Hopkins School of
Medicine, and I've been a hobby programmer for several years.  I primarily
code in Python and C++.  I'm a core developer of Mudlet, which is in C++
and has a fair user base.  For Python, I have nothing published for general
consumption yet, though I expect to release a mass spectrometry toolset
within the next year.

I'm currently working with large -omics datasets (whole genome alignments,
RNA-Seq), so I have a feel for the formats end users will encounter (I've
worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
Affy arrays for SNPs/CNVs) and, more importantly, I know how the end user
will want to use the data.  By far the biggest hurdle I see is bringing
the various data representations into a universal reference frame (for
instance, BAM files being 0-based, SAM being 1-based, CG VCF files being
0-based, closed versus half-open intervals, etc.).  I've
written parsers for my own use that interconvert between formats and can
read/output GFF/VCF files, and this would be a great opportunity to expand
on my existing toolset and get valuable feedback from others in the
community.
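
To make the reference-frame point concrete, here is a minimal sketch of
one way it could be handled: pick a single internal convention (0-based,
half-open here) and convert at the boundaries.  The Interval class and its
method names are hypothetical, made up for illustration; they are not
existing Biopython API and not my actual parsers.

```python
from typing import NamedTuple, Tuple


class Interval(NamedTuple):
    """Hypothetical genomic span, stored as 0-based, half-open [start, end)."""

    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    @classmethod
    def from_one_based_closed(cls, chrom: str, start: int, end: int) -> "Interval":
        # 1-based closed [start, end] -> 0-based half-open [start - 1, end)
        if start < 1 or end < start:
            raise ValueError("expected 1 <= start <= end")
        return cls(chrom, start - 1, end)

    def to_one_based_closed(self) -> Tuple[str, int, int]:
        # 0-based half-open [start, end) -> 1-based closed [start + 1, end]
        return (self.chrom, self.start + 1, self.end)

    @property
    def length(self) -> int:
        # In half-open coordinates the length is simply end - start.
        return self.end - self.start


# A single-base position written as 1-based closed (100, 100):
snp = Interval.from_one_based_closed("chr1", 100, 100)
print(snp)                        # the internal 0-based, half-open form
print(snp.length)                 # number of bases covered
print(snp.to_one_based_closed())  # converted back for 1-based output
```

Keeping one internal convention means each parser or writer needs only a
single conversion at its edge, instead of a converter for every pair of
formats.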

Thanks,
Chris