[Biopython] GSoC - BioPython and PyCogent Interoperability

Thu Apr 8 09:41:26 UTC 2010

I am a junior Computer Science major with heavy bioinformatics leanings
at Harvey Mudd College. I know that it is very late for new Summer of
Code applications, but I was wondering if you could have a look at my
proposed schedule, give me some pointers, and answer a few questions.
I am also considering applying for the project on adding more ways to
use R through Python, but I was unsure which project has more users
who want it completed.

Questions:
What is meant by Biopython's "acquired sequences"? I can't seem to
find any information about what "acquired sequences" are or where they
are described, so I do not discuss them in my current proposal.

For the creation of workflows, are there existing use cases and test
cases, or would I be best off looking for some in papers and trying to
mimic them? Right now, I have one example paper where the
interoperability would have been helpful.

Are there any other use cases I should immediately consider in my
proposal?

My current proposed schedule:

For Biopython and PyCogent interoperability:
Week 1: Familiarization with the code and soliciting requests. What
seems intuitive to me might not seem so to others, so it would be
best to spend this time identifying a group of people who would
benefit highly from the interoperability and asking them what they
would look for. For example, would they rather use one package, save
the data, and then use the other, or would they want to use the two
together directly? Basically, I want to get a good idea of how this
code will be used before making my own decisions about how I think
people will use it. It is also important here to create data sets
that can be used later in the process.

Weeks 2 and 3: Write code to convert between PyCogent and Biopython
objects. The core objects in each package seem like they should not be
too difficult to convert. This step will involve looking into the
documentation and code of PyCogent and Biopython to determine what the
core objects of each contain. One possible problem here is if either
the PyCogent or the Biopython core objects use heavy subclassing, as
working out subclassing in Python has been a nightmare in the past.
Testing at this point will likely involve going through the entire
round-trip conversion and checking that everything looks the same.
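
As a rough sketch of what such a round-trip test could look like: the
two classes below are hypothetical stand-ins written for illustration,
not the real Biopython or PyCogent objects, and the field names and
converter functions are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BioLikeRecord:
    """Hypothetical stand-in for a Biopython-style sequence record."""
    id: str
    seq: str
    description: str = ""
    annotations: dict = field(default_factory=dict)


@dataclass
class CogentLikeSequence:
    """Hypothetical stand-in for a PyCogent-style sequence object."""
    name: str
    raw: str
    info: dict = field(default_factory=dict)


def bio_to_cogent(record):
    # The target has no description slot, so stash it in the info dict.
    info = dict(record.annotations)
    info["description"] = record.description
    return CogentLikeSequence(name=record.id, raw=record.seq, info=info)


def cogent_to_bio(sequence):
    # Pull the description back out; everything else becomes annotations.
    info = dict(sequence.info)
    description = info.pop("description", "")
    return BioLikeRecord(id=sequence.name, seq=sequence.raw,
                         description=description, annotations=info)


def round_trips(record):
    # A record survives if converting there and back gives an equal object.
    return cogent_to_bio(bio_to_cogent(record)) == record
```

A record whose annotations already contain a "description" key would
not survive this particular mapping, which is the kind of mismatch a
round-trip test is meant to catch.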

Week 4: Ensure that the conversions allow data from one package to be
used in the other. The workflow from codon usage to clustering code
can be tested here. One possible test set is from Sharp et al. (1986),
who found different codon usage for different genes. Additionally, it
should be considered how codon usage can help with making biologically
accurate clusters.
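
To make the codon usage step concrete, here is a minimal sketch in
plain Python. It uses neither package; the function names and the
choice of a Manhattan distance between codon frequency profiles are
assumptions made for illustration, not something taken from the paper
or from either library.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds):
    """Return each codon's share of all codons in the sequence."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if not total:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(cds_a, cds_b):
    """Manhattan distance between two codon frequency profiles."""
    freq_a = codon_frequencies(cds_a)
    freq_b = codon_frequencies(cds_b)
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)
```

A pairwise matrix of such distances between genes is one possible
input for the clustering code.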

Week 5: Familiarize myself with phyloXML and make it interoperable
with PyCogent. phyloXML support has already been added to Biopython,
so making phyloXML work with PyCogent could be based on how it was
adapted for Biopython. A clear risk here is that the phyloXML API in
PyCogent may not end up giving an intuitive interface for using
phyloXML.

Weeks 6 and 7: Adapt PyCogent to query genomics databases. Currently
there is at least some support in PyCogent for querying Ensembl. It
seems like it would be useful to query other genomics databases such
as NCBI's Entrez. Unfortunately, it seems like NCBI only has Perl
queries into their MySQL database. Ideally, if everything before this
point has gone well, the conversion of PyCogent to Biopython forms
should already be accounted for.

Weeks 8-12: Slip days and additional features. The initial set of use
cases will surely expand, and this extra time allows for those use
cases to be accounted for.

Thanks,
Singer Ma