[Open-bio-l] a common repository for test datasets/use cases for all Bio* projects

Peter biopython at maubp.freeserve.co.uk
Thu Dec 4 18:43:51 UTC 2008


Giovanni wrote:
> For example I have it not clear how a use case could be written to be
> the best useful for all the different Bio::* projects, and other
> things.

In terms of use cases, I would imagine things like the following:

(1) Take a provided set of CDS nucleotide sequences in FASTA format,
translate them using NCBI codon table 11 (bacteria), and output the
results as a FASTA file of protein sequences.

(2) Take a provided set of protein sequences, and do pairwise
alignments between them all using the EMBOSS tool needle.

(3) Take a provided FASTA file of proteins, and run ClustalW on it
using the default settings.  Take the multiple sequence alignment in
ClustalW format, and convert it into Stockholm format.  Then build a
neighbour joining tree using the QuickTree program (which cannot read
in ClustalW files directly).  Finally, load the tree file and produce
a cladogram where the taxon/leaf XXX is highlighted in red.

(4) Take a provided author name and keyword, and query the NCBI Entrez
web interface to get a list of matching papers.  Download these
references (maybe in MEDLINE format, maybe as XML) and parse the
result into a CSV file for input into your reference manager (e.g.
EndNote - or generate a BibTeX file for use with LaTeX).

(5) Take a provided species name, and use NCBI Entrez to download
all matching EST sequences to a FASTA format file.

(6) Take a provided FASTA file of proteins and use standalone NCBI
BLASTP to search them against the NR database using an expectation
threshold of 10^-6 and at most ten alignments per query.  Parse the
results, and generate a new FASTA file of the protein sequences where
the description line includes the protein identifiers of closely
related entries found with BLAST.  [A more sensible approach to
automatic annotation would be nice, but more complicated]
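
To make use case (1) concrete, here is a minimal sketch in plain
Python with no toolkit dependency, so it is not any Bio* project's
actual solution.  The function names (read_fasta, translate_cds,
translate_fasta) are made up for illustration.  It assumes each record
is a complete, in-frame CDS: the first codon is written as M if it is
one of the table 11 start codons, a single trailing stop codon is
dropped, and any codon containing an ambiguous base becomes X.

```python
import io
from itertools import product

# NCBI translation table 11, listed in the usual TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
# Start codons permitted by table 11; all are read as M at position one.
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate_cds(nucleotides):
    """Translate one complete CDS (assumption: in frame, full length)."""
    seq = nucleotides.upper().replace("U", "T")
    if len(seq) % 3:
        raise ValueError("CDS length is not a multiple of three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Drop a single terminal stop codon, if present.
    if codons and CODON_TABLE.get(codons[-1]) == "*":
        codons.pop()
    protein = []
    for index, codon in enumerate(codons):
        if index == 0 and codon in START_CODONS:
            protein.append("M")
        else:
            # Codons with ambiguous bases are not in the table.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def translate_fasta(in_handle, out_handle, width=60):
    """Write a protein FASTA record for each CDS; return the count."""
    count = 0
    for header, nucleotides in read_fasta(in_handle):
        protein = translate_cds(nucleotides)
        out_handle.write(">%s\n" % header)
        for start in range(0, len(protein), width):
            out_handle.write(protein[start:start + width] + "\n")
        count += 1
    return count
```

With real files this would be called with two open file handles, one
for the provided CDS file and one for the protein output.  A use case
write-up would also need to say what should happen with internal stop
codons or partial CDS records, which this sketch simply passes
through or rejects.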

Ideally these would each have a short motivational section explaining
why you might want to do this particular task.  There is probably a
balance to strike between use cases that are trivial and ones that
are too complex.

These could be compiled on a shared OBF wiki, together with any input
files required.  It would be up to the individual projects to write
their own sample code to do this task - perhaps hosted on the Bio*
project specific wiki pages, but linked to from the use case.

Potentially this would be a huge project, but it would also be a nice
resource [provided it was maintained and kept up to date as the
toolkits evolve].  Perhaps this is too ambitious?

> On the biopython list they told me that a big issue is the license
> with which the data is released. I don't have any inconvenience in
> contributing examples with a GPL or without license, but I understand
> other people could do.
> Somebody told me that there were some interesting discussions on
> scipy.org, but I couldn't find them.

Licensing and copyright are valid concerns.  Also the different Bio*
projects use different licenses - and I suspect none of them are
compatible with the GPL.  Any licence would have to allow all the Bio*
projects to copy the example files into their code with no strings
attached - ideally just "public domain" or MIT/BSD style.

I would like to see a general collection of real world samples of each
file format (these could be pointed to by any shared file format
documentation).  Between all the Bio* projects we probably have a good
collection already - but the provenance of each file would have to be
looked at as well as the licence. In addition, artificial hand-edited
files that include valid but unusual content could be useful for
testing the Bio* projects' parsers.  I don't think this actually
needs to be in a repository, but that would be nice for tracking
ownership.
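
As a rough sketch of how a hand-edited test file might be shared, the
Python below pairs a small FASTA text containing unusual features
(Windows line endings, a blank line inside a record, mixed case, a ">"
inside a description, trailing spaces) with the records a parser would
be expected to return.  Everything here is hypothetical: the sample
data, the names UNUSUAL_FASTA, EXPECTED, simple_parser and
check_parser, and the choice to compare only identifiers and
case-preserved sequences.  Whether each feature counts as "valid"
FASTA is itself something the projects would have to agree on.

```python
import io

# Hypothetical hand-edited input exercising unusual layout.
UNUSUAL_FASTA = (
    ">seq1 description with > inside and trailing spaces   \r\n"
    "acgtACGT\r\n"
    "\r\n"
    "NNNN\r\n"
    ">seq2\n"
    "MKV*\n"
)

# Expected (identifier, sequence) pairs, agreed on in advance.
EXPECTED = [
    ("seq1", "acgtACGTNNNN"),
    ("seq2", "MKV*"),
]


def simple_parser(handle):
    """Stand-in for a toolkit parser; returns (identifier, sequence)."""
    records = []
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            fields = line[1:].split(None, 1)
            identifier = fields[0] if fields else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


def check_parser(parser, text=UNUSUAL_FASTA, expected=EXPECTED):
    """Return a list of mismatch messages; an empty list means a pass."""
    got = [tuple(record) for record in parser(io.StringIO(text))]
    problems = []
    if len(got) != len(expected):
        problems.append("expected %d records, got %d"
                        % (len(expected), len(got)))
    for want, have in zip(expected, got):
        if want != have:
            problems.append("expected %r, got %r" % (want, have))
    return problems
```

Each project could wrap its own parser in a small adapter returning
the same pairs and pass that to check_parser, so the shared file and
its expected result stay toolkit-neutral.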

I think it would be up to the individual projects to pull in any files
of interest for use in their own test suites (essentially copying
example files into their own repositories).

> What I think that could be useful for this project is a ticket
> tracking system, or better said a feature request system, to keep
> track of all the things needed.

The OBF already runs a bugzilla installation used by most of the Bio*
projects, which would probably be OK for this sort of thing.

Peter


