[Biopython] GSOC 2012: Representation and manipulation of genomic variants
Chris Mitchell
chris.mit7 at gmail.com
Sun Mar 25 19:13:47 UTC 2012
Hey everyone,
I'm interested in undertaking this project. I'm currently a PhD student in
Biochemical, Cellular, & Molecular Biology at Johns Hopkins School of
Medicine, and I've been a hobby programmer for several years. I primarily
code in Python and C++. I'm a core developer of Mudlet, which is written in
C++ and has a fair user base. For Python, I have nothing published for general
consumption yet, though I will more than likely be releasing a mass
spectrometry toolset in the coming year.
I'm currently working with large -omics datasets (whole genome alignments,
RNA-Seq), so I have a feel for which formats end users will encounter (I've
worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
Affy arrays for SNPs/CNVs) and, more importantly, I know how the end user
will want to use the data. By far the biggest hurdle I see is arranging
several types of data representation into a universal reference
frame (for instance BAM files being 0-based, SAM being 1-based, CG VCF
files being 0-based, closed versus half-open intervals, etc.). I've
written parsers for my own use that interconvert between formats and can
read/output GFF/VCF files, and this would be a great opportunity to expand
on my existing toolset and get valuable feedback from others in the
community.
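As a rough illustration of the coordinate problem described above, here is a
minimal sketch of one way to normalize positions into a single reference
frame. It assumes 0-based, half-open intervals as the internal convention;
the `Interval` class and its method names are hypothetical, chosen only for
this example, and are not part of Biopython or of any existing parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical container: a region stored as 0-based, half-open [start, end)."""

    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def __post_init__(self):
        # Reject coordinates that cannot describe a valid region.
        if self.start < 0 or self.end < self.start:
            raise ValueError(f"invalid interval: {self.start}..{self.end}")

    @classmethod
    def from_one_based_closed(cls, chrom, start, end):
        """Build from 1-based, closed coordinates (the GFF-style convention)."""
        # Only the start shifts: the closed end at position N (1-based)
        # equals the exclusive end N in 0-based, half-open terms.
        return cls(chrom, start - 1, end)

    @classmethod
    def from_zero_based_half_open(cls, chrom, start, end):
        """Build from 0-based, half-open coordinates (the BED-style convention)."""
        return cls(chrom, start, end)

    def to_one_based_closed(self):
        """Return (chrom, start, end) in 1-based, closed coordinates."""
        return (self.chrom, self.start + 1, self.end)

    def __len__(self):
        # With half-open intervals the length is a plain difference.
        return self.end - self.start


# The same 101-base region described in two conventions compares equal.
gff_style = Interval.from_one_based_closed("chr1", 100, 200)
bed_style = Interval.from_zero_based_half_open("chr1", 99, 200)
```

With this layout, each parser converts once on input and each writer converts
once on output, so code in between never needs to know which convention a
file used.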
Thanks,
Chris