[Biopython] Google Summer of Code 2014: Student application

Lluís Revilla lluis.revilla at gmail.com
Wed Mar 19 10:12:26 UTC 2014


Dear Eric and all.

Here is a summary of some of the comments you made on the proposal:

   1. It is a bit broad (Eric)
   2. Does it provide a common visual representation of the different
   inputs? (Christian)
   3. Is it supposed to actually rank different tools / outputs? If so, that
   is a surprisingly hard problem (Christian and bow)
   4. Difficult to do and difficult to fit into Biopython (bow)
   5. Useful just once for each task (bow)
   6. More useful to write parsers using a common object model, but
   generalizing their outputs is also not a trivial task (bow)

And here are my comments:

   1. It is intended to be broad, so that it applies not just to gene
   predictors but also to RNA secondary structure predictors, ncRNA
   predictors, functional site predictors, and protein secondary or even
   tertiary structure predictors.
   2. Well, my initial thought was to compare their results, but to do so
   they need to be in the same format, so a common visual representation
   could be added.
   3. If there is a reference against which to compare the programs, it is
   not so hard, but then comparing the programs with each other loses its
   point. What I actually did was rank them according to how much they share
   with each other and how much they differ. If they are supposed to do the
   same thing, their results should tend to be the same. At least this can
   set apart some strongly deviating programs, although it doesn't prove
   that those are the wrong ones.
   4. I agree, that is why I mailed the list: to find out whether it would
   fit, and how useful it would be.
   5. Even if it is useful only once per task, program versions change and
   then they need to be evaluated again (this would work as long as they
   keep the same output format). Also, not all projects look for the same
   type of result, even for the same task. Some would like to check against
   a reference what happens with the false positive predicted genes, or want
   the lowest rate of false predictions even if they recover just 40% of the
   annotated genes. But in the main it is true that it would be used just
   once.
   6. Since it would be part of my idea, I could write the parsers. The
   common object could hold the essential information, and each parser could
   then add the output information particular to its program.
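
As a rough sketch of the ranking described in point 3, assuming each
program's output has already been reduced to a set of (start, end, strand)
tuples: every pair of programs is scored with the Jaccard index (shared
predictions divided by all predictions), and each program is ranked by its
mean score against the others. The tool names, the coordinates, and the
choice of exact-match Jaccard are my own assumptions for illustration, not
what my current script does.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of predictions shared by two sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_by_agreement(predictions):
    """Rank tools by their mean Jaccard similarity to every other tool."""
    totals = {name: 0.0 for name in predictions}
    for x, y in combinations(predictions, 2):
        score = jaccard(predictions[x], predictions[y])
        totals[x] += score
        totals[y] += score
    others = max(len(predictions) - 1, 1)  # avoid dividing by zero
    means = ((name, total / others) for name, total in totals.items())
    return sorted(means, key=lambda item: item[1], reverse=True)


# Hypothetical data: each predicted gene is a (start, end, strand) tuple.
predictions = {
    "tool_a": {(100, 400, "+"), (900, 1500, "-"), (2000, 2600, "+")},
    "tool_b": {(100, 400, "+"), (900, 1500, "-"), (2000, 2650, "+")},
    "tool_c": {(5000, 5300, "+")},
}
ranking = rank_by_agreement(predictions)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

A low score only flags a program as deviating from the rest; as said above,
it does not show that the program is wrong. Exact coordinate matching is
also strict, so a real comparison would probably need to tolerate small
differences in gene boundaries.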

In short:

My idea seems either too difficult or beyond my skills to complete, and
there are doubts about whether it fits in the Biopython library. If it is
more useful, I can change my proposal to writing parsers for gene predictors
or any other program whose output Biopython does not parse yet.
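
To make the parser alternative (and the common object from point 6) more
concrete, here is a minimal sketch. It assumes a hypothetical PredictedGene
record holding the essential fields, with an "extras" dictionary for
whatever is particular to each program. The class, the function name, and
the example line are made up for illustration and are not existing Biopython
API. The only format detail relied on is that a GFF feature line has nine
tab-separated columns.

```python
from dataclasses import dataclass, field


@dataclass
class PredictedGene:
    """Hypothetical common record shared by all parsers."""
    seqid: str
    start: int
    end: int
    strand: str
    source: str
    extras: dict = field(default_factory=dict)  # program-specific fields


def parse_gff_line(line):
    """Turn one GFF feature line into a PredictedGene, or None if skipped."""
    if line.startswith("#"):
        return None  # comment or directive line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None  # not a feature line
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return PredictedGene(
        seqid=seqid,
        start=int(start),
        end=int(end),
        strand=strand,
        source=source,
        extras={"type": ftype, "score": score, "phase": phase,
                "attributes": attrs},
    )


# Made-up example line; "example_tool" is not a real program.
line = "contig1\texample_tool\tCDS\t100\t400\t35.2\t+\t0\tID=gene_1"
gene = parse_gff_line(line)
print(gene.seqid, gene.start, gene.end, gene.strand)
```

A parser for a program with its own output format would fill the same
essential fields and put everything else into "extras", so the comparison
code only ever needs to look at the common part.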

Thanks all for your comments and feedback. I will be glad to read more
comments and to improve or change my proposal.

Best,

Lluís


2014-03-19 0:30 GMT+01:00 Eric Talevich <eric.talevich at gmail.com>:

> On Mon, Mar 17, 2014 at 12:09 PM, Lluís Revilla <lluis.revilla at gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am a Biotechnology student and I want to contribute to Biopython. I have
>> read the wiki GSoC page and I found two ideas. But I think I don't have the
>> desired skills: I am not very familiar yet with Biopython's existing
>> sequence parsing ("Indexing & Lazy-loading Sequence Parsers"), or with
>> JavaScript ("Interactive GenomeDiagram Module"). So I am thinking of making
>> a proposal for the Google Summer of Code about a comparison tool.
>>
>> My idea comes from the following: several times I have been in charge of
>> selecting a tool for a certain process, e.g. producing a list of predicted
>> genes, a list of possible structures, a list of alignments...
>>
>> But in bioinformatics there are usually many programs that do the same
>> thing. They typically use a different algorithm or a different training
>> data set (prokaryote, eukaryote), or have different specifications. And
>> they return a more or less sophisticated list in some standard format:
>> FASTA, GFF, GenBank...
>>
>> The problem when starting a project is to select, from these different
>> programs, which one to use for the task, e.g.: which gene predictor is
>> better for prokaryotes: Glimmer, EasyGene, GeneMarker, Prodigal,
>> AUGUSTUS...? The answer will be specific to the project, but sometimes it's
>> difficult to be sure that it is a good selection. (Other times it is good
>> enough to do what the majority does.) But that does not solve the problem
>> when new algorithms appear, or when comparing different program versions.
>>
>> To cover this problem I would like to develop for Biopython a module that
>> compares the outputs of the different programs to assess which one is
>> better for the task.
>> Currently I have developed a parser for the aforementioned programs, and it
>> compares them in a (very) crude way. I would like to develop it further and
>> release it to the Biopython community.
>>
>> What are your thoughts about this idea?
>> Thanks,
>>
>> Lluís
>>
>
> Hi Lluís,
>
> This is an interesting idea, though a bit broad. You could maybe find some
> inspiration or focus by looking at Critical Assessment of Function
> Prediction (CAFA):
> http://biofunctionprediction.org/
>
> Perhaps Iddo Friedberg or another AFP enthusiast could comment on how this
> project could support benchmarking of automated annotations.
>
> On the technical side, I also recommend looking at nestly, a program that
> will execute another specific command-line program with a variety of
> different parameters and automatically organize, summarize and compare the
> outputs.
> http://fhcrc.github.io/nestly/
>
> All the best,
> Eric
>



