From pmr at ebi.ac.uk  Thu Dec  4 11:49:57 2008
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 04 Dec 2008 16:49:57 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
 for	all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <49380A35.10909@ebi.ac.uk>

Giovanni Marco Dall'Olio wrote:
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parser are able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
> 
> I am doing this because me (and Tiago), in the biopython mailing list, would
> like to develop a module to calculate Fst statistics over SNP data, and
> there is no point of collecting some good test datasets and not sharing them
> with other similar projects in other programming languages.
> 
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different comunities.
> 
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but , it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.

The EMBOSS project on Open Bio has its own set of test cases for all
applications, and validation for source code documentation and
application documentation. Our tests run as perl scripts using scripts
and data that are distributed with EMBOSS.

We would be interested in joining a common effort.

regards,

Peter Rice

From jason at bioperl.org  Thu Dec  4 12:06:24 2008
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 4 Dec 2008 09:06:24 -0800
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>

I don't know if this is really the best email list for this --  
although not sure what other common list should be used.

We actually started a project like this many moons ago, but no one
contributed examples...

http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

We can start a common SVN repository for this if you like or a github  
on OBF if that is more likely to garner contributions.

In terms of documentation - you are certainly welcome to make a  
documentation repository but I would argue a wiki or wiki-like soln  
would be best for documentation.
Whether a common wiki can be maintained among the projects (or merge  
the wikifarms someday) is something to contemplate too.

-jason

On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:

> Hi!
> My name is Giovanni, I come from biopython's mailing list.
>
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
>
> For example, let's say I want to write another fasta file parser, like
> SeqIO.FastaIO in biopython : I would have have to test the script
> against some real fasta files, just to make sure that it doesn't parse
> them in a wrong way, or that it losts data.
> Or, let's say I want to write a script to calculate Fst statistics
> over some population genetics data: I will have to compare the results
> of my scripts against other programs, check if it gives me the right
> result for a set for which I already know the Fst value, and maybe
> ideate some other kind of checks to be sure my script doesn't do weird
> things, like losing input data on the way.
>
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parser are able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
>
> I am doing this because me (and Tiago), in the biopython mailing  
> list, would
> like to develop a module to calculate Fst statistics over SNP data,  
> and
> there is no point of collecting some good test datasets and not  
> sharing them
> with other similar projects in other programming languages.
>
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different comunities.
>
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but , it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.
>
>
> -- 
> -----------------------------------------------------------
>
> My Blog on Bioinformatics (italian): http://bioinfoblog.it
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l

Jason Stajich
jason at bioperl.org


From biopython at maubp.freeserve.co.uk  Thu Dec  4 12:26:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:26:06 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <320fb6e00812040926g7ea92397r19af618d8b50d143@mail.gmail.com>

On Thu, Dec 4, 2008 at 5:06 PM, Jason Stajich <jason at bioperl.org> wrote:
>
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

I think I suggested trying this list to Giovanni - it looked like the
best bet, although I suspect it has a fairly low subscriber count.

> We actually a started a project like this many moons ago, but no one
> contributed examples...
>
> http://code.open-bio.org/cgi/viewcvs.cgi/biodata/
>

That was before I started using Biopython, so I'd never seen that.

> We can start a common SVN repository for this if you like or a github on OBF
> if that is more likely to garner contributions.

Using an OBF repository would be nice, especially if developers from
all the Bio* projects with existing CVS/SVN accounts automatically had
write access to it.  I've not really used git, but it might be more
open to newcomers.

> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.
> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

Given the OBF already has wiki software up and running, this does seem
like a good choice for documentation.

The BioPerl wiki already has a lot of useful stuff describing
different file formats, and in most cases the text is independent of
BioPerl.  It would make sense to take these pages as a basis for a
shared OBF wiki.  I would think that ideally the Bio* projects could
have a page on each file format describing how it is parsed with that
tool kit, but citing a shared file format description page (or even
embedding it on the fly).

Peter

From dalloliogm at gmail.com  Thu Dec  4 13:06:08 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 4 Dec 2008 19:06:08 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>

On 12/4/08, Jason Stajich <jason at bioperl.org> wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

Hi Jason,
thank you very much for the answer.
I was going to post this mail again to the other OpenBio lists to see
if there were more people interested, but I decided to wait a bit
since I am not extremely confident with these concepts myself yet and
wanted to study them a bit more.

To be honest, I was going to discuss testing with some
professional programmers I met some months ago, who seemed very
confident with the concepts of testing, not to say obsessed with
them.

For example, it is not clear to me how a use case could be written to
be most useful for all the different Bio::* projects, among other
things.


>  We actually a started a project like this many moons ago, but no one
> contributed examples...
>
>  http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

I see, thank you very much for the link.
On the biopython list I was told that a big issue is the license
under which the data is released. I have no problem contributing
examples under the GPL or without a license, but I understand that
other people might.
Somebody told me that there were some interesting discussions on
scipy.org, but I couldn't find them.


>  We can start a common SVN repository for this if you like or a github on
> OBF if that is more likely to garner contributions.

Well, to be honest I prefer git :). But any revision control system
is fine with me; moreover, these are examples and not code, so the
choice is less important.

What I think could be useful for this project is a ticket
tracking system, or rather a feature request system, to keep
track of all the things needed.

I once used a system called assembla:
- http://www.assembla.com/spaces/biotest/tickets
It seems very nice to use, but it is not open source and maybe
bugzilla would suffice.


>  In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.

Well, basically I have three years ahead of me in which I will work
in the same field (I am a first-year PhD student in a population
genetics laboratory) and in theory I will have to write a lot of test
cases and controls anyway, which I don't mind contributing.

However, as I was saying before, I am not very experienced in writing
use cases, and it will take me a while (let's say some months) to learn
how to write them well.


>  Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

I agree, for example bioperl's wiki has many useful descriptions of
file formats that the other Bio* projects lack.


>  -jason
>
>
>  On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:
>
>
> >
> > Hi!
> > My name is Giovanni, I come from biopython's mailing list.
> >
> > I would like to make you a proposal.
> > Every module/program written in bioinformatics needs to be tested
> > before it can be used to produce results that can be published.
> >
> > For example, let's say I want to write another fasta file parser, like
> > SeqIO.FastaIO in biopython : I would have have to test the script
> > against some real fasta files, just to make sure that it doesn't parse
> > them in a wrong way, or that it losts data.
> > Or, let's say I want to write a script to calculate Fst statistics
> > over some population genetics data: I will have to compare the results
> > of my scripts against other programs, check if it gives me the right
> > result for a set for which I already know the Fst value, and maybe
> > ideate some other kind of checks to be sure my script doesn't do weird
> > things, like losing input data on the way.
> >
> > So, the point is.. what if we create a common repository for all this
> > kind of testing data, to be used in common with all the other Bio*
> > projects?
> > Wouldn't it be good if all the Bio* fasta parser are able to parse the
> > same files and give the same results, demonstrating that all of them
> > work fine or are wrong at the same time?
> >
> > I am doing this because me (and Tiago), in the biopython mailing list,
> would
> > like to develop a module to calculate Fst statistics over SNP data, and
> > there is no point of collecting some good test datasets and not sharing
> them
> > with other similar projects in other programming languages.
> >
> > The same goes for much of the documentation, like use cases: if we
> > collect a good base of use cases related to bioinformatics, it would
> > be easier to coordinate the efforts of all the Bio* projects and
> > compare the different approaches used to solve the same issue by the
> > different comunities.
> >
> > At the moment, I have created a simple git repository on github:
> > -
> http://github.com/dalloliogm/bio-test-datasets-repository
> > but , it is still empty and maybe github is not the ideal hosting for
> > such a project, since the free account has a 100MB space limit.
> >
> >
> > --
> >
> -----------------------------------------------------------
> >
> > My Blog on Bioinformatics (italian): http://bioinfoblog.it
>
>  Jason Stajich
>  jason at bioperl.org
>
>
>
>


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Dec  4 13:43:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:43:51 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
	<5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
Message-ID: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>

Giovanni wrote:
> For example I have it not clear how a use case could be written to be
> the best useful for all the different Bio::* projects, and other
> things.

In terms of use cases, I would imagine things like the following:

(1) Take a provided set of CDS nucleotide sequences in FASTA format,
translate them using NCBI codon table 11 (bacteria), and output the
results as a FASTA file of protein sequences.

(2) Take a provided set of protein sequences, and do pairwise
alignments between them all using the EMBOSS tool needle.

(3) Take a provided FASTA file of proteins, and run ClustalW on it
using the default settings.  Take the multiple sequence alignment in
ClustalW format, and convert it into Stockholm format.  Then build a
neighbour joining tree using quick-tree program (which cannot read in
ClustalW files directly).  Finally, load the tree file and produce a
cladogram where the taxon/leaf XXX is highlighted in red.

(4) Take a provided author name and keyword, and query the NCBI Entrez
webinterface to get a list of matching papers.  Download these
references (maybe as MedLine format, maybe as XML) and parse the
result into a CSV file for input into your reference manager (e.g.
EndNote - or generate a bibtex file for use with LaTeX).

(5) Take a provided species name, and use NCBI Entrez to download
all matching EST sequences to a FASTA format file.

(6) Take a provided FASTA file of proteins and use standalone NCBI
BLASTP to search them against the NR database using an expectation
threshold of 10^-6 and at most ten alignments per query.  Parse the
results, and generate a new FASTA file of the protein sequences where
the description line includes the protein identifiers of closely
related entries found with BLAST.  [A more sensible approach to
automatic annotation would be nice, but more complicated]
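
To make use case (1) concrete, here is a minimal sketch in plain Python
rather than in any particular Bio* toolkit. The function names
(parse_fasta, translate_cds, translate_fasta) are made up for
illustration. The amino acid string is the one NCBI lists for table 11,
which matches the standard table. The sketch translates every codon
literally, so it does not turn an alternative start codon such as GTG or
TTG into M, and it drops a single trailing stop.

```python
"""Sketch of use case (1): translate CDS records from FASTA with NCBI table 11."""
from itertools import product

BASES = "TCAG"
# Amino acids for NCBI translation table 11, listed in TCAG codon order.
TABLE_11_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TABLE_11_AAS)
}


def parse_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def translate_cds(nucleotides):
    """Translate one CDS codon by codon; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    if len(seq) % 3:
        raise ValueError("CDS length is not a multiple of three")
    protein = [CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3)]
    # Assumption: a single terminal stop codon is not wanted in the output.
    if protein and protein[-1] == "*":
        protein.pop()
    return "".join(protein)


def translate_fasta(lines):
    """Return protein FASTA text for the CDS records found in `lines`."""
    out = []
    for title, seq in parse_fasta(lines):
        out.append(">" + title)
        out.append(translate_cds(seq))
    return "\n".join(out) + "\n"
```

Each toolkit's own sample code would replace these helpers with its
native parser and translation call; the point of a shared use case is
that they should all agree on the result for the same input file.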

Ideally these would all need a short motivational section explaining
why you might want to do this particular task.  There is probably a
balance between trivial and too complex.

These could be compiled on a shared OBF wiki, together with any input
files required.  It would be up to the individual projects to write
their own sample code to do this task - perhaps hosted on the Bio*
project specific wiki pages, but linked to from the use case.

Potentially this would be a huge project, but it would also be a nice
resource [provided it was maintained and kept up to date as the
toolkits evolve].  Perhaps this is too ambitious?

> On the biopython list they told me that a big issue is the license
> with which the data is released. I don't have any inconvenience in
> contributing examples with a GPL or without license, but I understand
> other people could do.
> Somebody told me that there were some interesting discussions on
> scipy.org, but I couldn't find them.

Licensing and copyright are valid concerns.  Also the different Bio*
projects use different licenses - and I suspect none of them are
compatible with the GPL.  Any licence would have to allow all the Bio*
projects to copy the example files into their code with no strings
attached - ideally just "public domain" or MIT/BSD style.

I would like to see a general collection of real world samples of each
file format (these could be pointed to by any shared file format
documentation).  Between all the Bio* projects we probably have a good
collection already - but the provenance of each file would have to be
looked at as well as the licence. In addition, artificial hand edited
files could be useful which include valid but unusual content to test
the Bio* project's parsers.  I don't think this actually needs to be
in a repository, but that would be nice for tracking ownership.
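
One possible way to track provenance and licence, sketched under the
assumption that each example file gets a small manifest entry. The field
names and both helper functions below are hypothetical; the checksum is
only there so that copies pulled into other repositories can be matched
back to the shared original.

```python
import hashlib


def manifest_entry(name, data, source, licence):
    """Describe one example file: where it came from and how it may be used."""
    return {
        "file": name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "source": source,    # e.g. a database accession or "hand-edited"
        "licence": licence,  # e.g. "public domain" or "MIT"
    }


def missing_provenance(entries):
    """Return the names of files whose source or licence is not recorded."""
    return [e["file"] for e in entries if not e["source"] or not e["licence"]]
```

A project copying files into its own test suite could run
missing_provenance over the manifest first and skip anything it reports.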

I think it would be up to the individual projects to pull in any files
of interest for use in their own test suites (essentially copying
example files into their own repositories).

> What I think that could be useful for this project is a ticket
> tracking system, or better said a feature request system, to keep
> track of all the things needed.

The OBF already runs a bugzilla installation used by most of the Bio*
projects, which would probably be OK for this sort of thing.

Peter

From dalloliogm at gmail.com  Wed Dec 10 06:31:13 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Dec 2008 12:31:13 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
	<5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
	<320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
Message-ID: <5aa3b3570812100331i1f1e34deic779820308c31a6a@mail.gmail.com>

On 12/4/08, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Giovanni wrote:
>  > For example I have it not clear how a use case could be written to be
>  > the best useful for all the different Bio::* projects, and other
>  > things.
>
>
> In terms of use cases, I would imagine things like the following:
>
>  (1) Take a provided set of CDS nucleotide sequences in FASTA format,
>  translate them using NCBI codon table 11 (bacteria), and output the
>  results as a FASTA file of protein sequences.
>
>  (2) Take a provided set of protein sequences, and do pairwise
>  alignments between them all using the EMBOSS tool needle.
>
>  (3) Take a provided FASTA file of proteins, and run ClustalW on it
>  using the default settings.  Take the multiple sequence alignment in
>  ClustalW format, and covert it into Stockholm format.  Then build a
>  neighbour joining tree using quick-tree program (which cannot read in
>  ClustalW files directly).  Finally, load the tree file and produce a
>  cladogram where the taxon/leaf XXX is highlighted in red.
>
>  (4) Take a provided author name and keyword, and query the NCBI Entrez
>  webinterface to get a list of matching papers.  Download these
>  references (maybe as MedLine format, maybe as XML) and parse the
>  result into a CSV file for input into your reference manager (e.g.
>  EndNote - or generate a bibtex file for use with LaTeX).
>
>  (5) Taking a provided species name, and use NCBI Entrez to download
>  all matching EST sequences to a FASTA format file.
>
>  (6) Take a provided FASTA file of proteins and use standalone NCBI
>  BLASTP to search them against the NR database using a expectation
>  threshold of 10^-6 and at most ten alignments per query.  Parse the
>  results, and generate a new FASTA file of the protein sequences where
>  the description line includes the protein identifiers of closely
>  related entries found with BLAST.  [A more sensible approach to
>  automatic annotation would be nice, but more complicated]

ok, these are good examples.

I would also add a title (e.g. 1: Translating a CDS sequence), just
for convenience.

We could also add some examples of the expected outputs. Example: if
you apply the procedure in case 1 to the file COX1_cds.fasta, you
obtain exactly the file COX1_protein.fasta.
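
A check like this could be sketched as below. Toolkits wrap sequence
lines at different widths, so the sketch assumes the comparison should
be made record by record rather than byte by byte. fasta_records and
same_fasta are hypothetical helper names, not functions from any Bio*
project.

```python
def fasta_records(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            # Join wrapped sequence lines so line width does not matter.
            records[-1][1] += line
    return [(title, seq) for title, seq in records]


def same_fasta(produced_text, expected_text):
    """True if both texts hold the same records in the same order."""
    return fasta_records(produced_text) == fasta_records(expected_text)
```

With the files from the example, a project's test would read the text
its own code produced from COX1_cds.fasta and pass it to same_fasta
together with the contents of COX1_protein.fasta.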

Moreover, a possible approach is to write a script that executes the
same actions described in the use cases.
I saw people doing this to test web applications (zope). They wrote
some scripts using perl's LWP or python webbrowser libraries to
execute all the actions that a user can perform in a use case
scenario.
However, this is too much work for now; better to leave it for later.


>  Ideally these would all need a short motivational section explaining
>  why you might want to do this particular task.  There is probably a
>  balance between trivial and too complex.
>
>  These could be compiled on a shared OBF wiki, together with any input
>  files required.  It would be up to the individual projects to write
>  their own sample code to do this task - perhaps hosted on the Bio*
>  project specific wiki pages, but linked to from the use case.

Can we put it somewhere here:
- http://www.open-bio.org/wiki/Main_Page
?


>  Potentially this would be a huge project, but it would also be a nice
>  resource [provided it was maintained and kept up to date as the
>  toolkits evolve].  Perhaps this is too ambitious?

maybe it is :).
It will be very difficult to keep up to date with everything, given the
speed at which new technologies come out nowadays.
However I think that there could be many people interested in
contributing to it. And for many researchers, it could be easier to
contribute with a description of what they want to do with their data,
rather than with code.


>  > On the biopython list they told me that a big issue is the license
>  > with which the data is released. I don't have any inconvenience in
>  > contributing examples with a GPL or without license, but I understand
>  > other people could do.
>  > Somebody told me that there were some interesting discussions on
>  > scipy.org, but I couldn't find them.
>
>
> Licensing and copyright are valid concerns.  Also the different Bio*
>  projects use different licenses - and I suspect none of them are
>  compatible with the GPL.  Any licence would have to allow all the Bio*
>  projects to copy the example files into their code with no strings
>  attached - ideally just "public domain" or MIT/BSD style.

I agree on any open license, MIT/BSD should be ok.


>  I would like to see a general collection of real world samples of each
>  file format (these could be pointed to by any shared file format
>  documentation).  Between all the Bio* projects we probably have a good
>  collection already - but the provenance of each file would have to be
>  looked at as well as the licence. In addition, artificial hand edited
>  files could be useful which include valid but unusual content to test
>  the Bio* project's parsers.  I don't think this actually needs to be
>  in a repository, but that would be nice for tracking ownership.

ok. In any case, wikis usually have a versioning system, so there are
not many differences.


>  I think it would be up to the individual projects to pull in any files
>  of interest for us in their own test suites (essentially coping
>  example files into their own repositories).
>
>
>  > What I think that could be useful for this project is a ticket
>  > tracking system, or better said a feature request system, to keep
>  > track of all the things needed.
>
>
> The OBF already runs a bugzilla installation used by most of the Bio*
>  projects, which would probably be OK for this sort of thing.


>
>  Peter
>


-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it

From pmr at ebi.ac.uk  Thu Dec  4 16:49:57 2008
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 04 Dec 2008 16:49:57 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
 for	all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <49380A35.10909@ebi.ac.uk>

Giovanni Marco Dall'Olio wrote:
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parser are able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
> 
> I am doing this because me (and Tiago), in the biopython mailing list, would
> like to develop a module to calculate Fst statistics over SNP data, and
> there is no point of collecting some good test datasets and not sharing them
> with other similar projects in other programming languages.
> 
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different comunities.
> 
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but , it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.

The EMBOSS project on Open Bio has its own set of test cases for all
applications, and validation for source code documentation and
application documentation. Our tests run as perl scripts using scripts
and data that are distributed with EMBOSS.

We would be interested in joining a common effort.

regards,

Peter Rice


From jason at bioperl.org  Thu Dec  4 17:06:24 2008
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 4 Dec 2008 09:06:24 -0800
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
Message-ID: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>

I don't know if this is really the best email list for this --  
although not sure what other common list should be used.

We actually a started a project like this many moons ago, but no one  
contributed examples...

http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

We can start a common SVN repository for this if you like or a github  
on OBF if that is more likely to garner contributions.

In terms of documentation - you are certainly welcome to make a  
documentation repository but I would argue a wiki or wiki-like soln  
would be best for documentation.
Whether a common wiki can be maintained among the projects (or merge  
the wikifarms someday) is something to contemplate too.

-jason

On Oct 28, 2008, at 4:06 AM, Giovanni Marco Dall'Olio wrote:

> Hi!
> My name is Giovanni, I come from biopython's mailing list.
>
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
>
> For example, let's say I want to write another fasta file parser, like
> SeqIO.FastaIO in biopython: I would have to test the script
> against some real fasta files, just to make sure that it doesn't parse
> them in a wrong way, or lose data.
> Or, let's say I want to write a script to calculate Fst statistics
> over some population genetics data: I will have to compare the results
> of my script against other programs, check whether it gives me the right
> result for a set for which I already know the Fst value, and maybe
> devise some other kinds of checks to be sure my script doesn't do weird
> things, like losing input data on the way.
>
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?
> Wouldn't it be good if all the Bio* fasta parsers were able to parse the
> same files and give the same results, demonstrating that all of them
> work fine or are wrong at the same time?
>
> I am doing this because Tiago and I, on the biopython mailing list, would
> like to develop a module to calculate Fst statistics over SNP data, and
> there is no point in collecting some good test datasets and not sharing them
> with other similar projects in other programming languages.
>
> The same goes for much of the documentation, like use cases: if we
> collect a good base of use cases related to bioinformatics, it would
> be easier to coordinate the efforts of all the Bio* projects and
> compare the different approaches used to solve the same issue by the
> different communities.
>
> At the moment, I have created a simple git repository on github:
> - http://github.com/dalloliogm/bio-test-datasets-repository
> but it is still empty and maybe github is not the ideal hosting for
> such a project, since the free account has a 100MB space limit.

Jason Stajich
jason at bioperl.org


From biopython at maubp.freeserve.co.uk  Thu Dec  4 17:26:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:26:06 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <320fb6e00812040926g7ea92397r19af618d8b50d143@mail.gmail.com>

On Thu, Dec 4, 2008 at 5:06 PM, Jason Stajich <jason at bioperl.org> wrote:
>
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

I think I suggested trying this list to Giovanni - it looked like the
best bet, although I suspect it has a fairly low subscriber count.

> We actually started a project like this many moons ago, but no one
> contributed examples...
>
> http://code.open-bio.org/cgi/viewcvs.cgi/biodata/
>

That was before I started using Biopython, so I'd never seen that.

> We can start a common SVN repository for this if you like or a github on OBF
> if that is more likely to garner contributions.

Using an OBF repository would be nice, especially if developers from
all the Bio* projects with existing CVS/SVN accounts automatically had
write access to it.  I've not really used git, but it might be more
open to newcomers.

> In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.
> Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

Given the OBF already has wiki software up and running, this does seem
like a good choice for documentation.

The BioPerl wiki already has a lot of useful stuff describing
different file formats, and in most cases the text is independent of
BioPerl.  It would make sense to take these pages as a basis for a
shared OBF wiki.  I would think that ideally the Bio* projects could
have a page on each file format describing how it is parsed with that
tool kit, but citing a shared file format description page (or even
embedding it on the fly).

Peter


From dalloliogm at gmail.com  Thu Dec  4 18:06:08 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 4 Dec 2008 19:06:08 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
Message-ID: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>

On 12/4/08, Jason Stajich <jason at bioperl.org> wrote:
> I don't know if this is really the best email list for this -- although not
> sure what other common list should be used.

Hi Jason,
thank you very much for the answer.
I was going to post this mail again to the other OpenBio lists to see
if more people were interested, but I decided to wait a bit
since I am not yet very confident with these concepts myself and
wanted to study them a bit more.

To be honest, I was going to discuss testing with some
professional programmers I met some months ago, who seemed very
comfortable with the concepts of testing, not to say
obsessed with them.

For example, it is not clear to me how a use case should be written to be
as useful as possible for all the different Bio* projects, among other
things.


>  We actually started a project like this many moons ago, but no one
> contributed examples...
>
>  http://code.open-bio.org/cgi/viewcvs.cgi/biodata/

I see, thank you very much for the link.
On the biopython list they told me that a big issue is the license
under which the data is released. I have no problem
contributing examples under the GPL or without a license, but I understand
other people might.
Somebody told me that there were some interesting discussions on
scipy.org, but I couldn't find them.


>  We can start a common SVN repository for this if you like or a github on
> OBF if that is more likely to garner contributions.

Well, to be honest I prefer git :). But any revision control system
is fine with me; moreover, these are examples and not code, so the choice
matters less.

What I think could be useful for this project is a ticket
tracking system, or rather a feature request system, to keep
track of all the things needed.

I once used a system called assembla:
- http://www.assembla.com/spaces/biotest/tickets
which seems very nice to use, but it is not open source and maybe
bugzilla would suffice.


>  In terms of documentation - you are certainly welcome to make a
> documentation repository but I would argue a wiki or wiki-like soln would be
> best for documentation.

Well, basically I have three years ahead of me in which I will work
in the same field (I am a first-year PhD student in a population
genetics laboratory) and in theory I will have to write a lot of test
cases and controls anyway, which I don't mind contributing.

However, as I was saying before, I am not very experienced in writing
use cases, and it will take me a while (let's say some months) to learn
how to write them well.


>  Whether a common wiki can be maintained among the projects (or merge the
> wikifarms someday) is something to contemplate too.

I agree; for example, bioperl's wiki has many useful descriptions of
file formats that the other Bio* projects lack.




-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Thu Dec  4 18:43:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:43:51 +0000
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
	<5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
Message-ID: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>

Giovanni wrote:
> For example I have it not clear how a use case could be written to be
> the best useful for all the different Bio::* projects, and other
> things.

In terms of use cases, I would imagine things like the following:

(1) Take a provided set of CDS nucleotide sequences in FASTA format,
translate them using NCBI codon table 11 (bacteria), and output the
results as a FASTA file of protein sequences.

(2) Take a provided set of protein sequences, and do pairwise
alignments between them all using the EMBOSS tool needle.

(3) Take a provided FASTA file of proteins, and run ClustalW on it
using the default settings.  Take the multiple sequence alignment in
ClustalW format, and convert it into Stockholm format.  Then build a
neighbour joining tree using the quick-tree program (which cannot read in
ClustalW files directly).  Finally, load the tree file and produce a
cladogram where the taxon/leaf XXX is highlighted in red.

(4) Take a provided author name and keyword, and query the NCBI Entrez
web interface to get a list of matching papers.  Download these
references (maybe as MedLine format, maybe as XML) and parse the
result into a CSV file for input into your reference manager (e.g.
EndNote - or generate a BibTeX file for use with LaTeX).

(5) Take a provided species name, and use NCBI Entrez to download
all matching EST sequences to a FASTA format file.

(6) Take a provided FASTA file of proteins and use standalone NCBI
BLASTP to search them against the NR database using an expectation
threshold of 10^-6 and at most ten alignments per query.  Parse the
results, and generate a new FASTA file of the protein sequences where
the description line includes the protein identifiers of closely
related entries found with BLAST.  [A more sensible approach to
automatic annotation would be nice, but more complicated]
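
As a rough illustration of what sample code for use case (1) could look
like, here is a minimal sketch in plain Python that deliberately uses none
of the Bio* toolkits. The function names and the toy record are made up
for this example. It relies on the fact that NCBI table 11 assigns the
same amino acids as the standard table; the extra start codons that table
11 allows are ignored here, so a real toolkit may treat the first codon
differently.

```python
from typing import Dict, Iterator, Tuple

# Standard amino acid assignments, with codons enumerated in TCAG order.
# NCBI table 11 shares these assignments with the standard table; it differs
# in which codons may act as start codons, which this sketch ignores.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def translate(cds: str) -> str:
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        amino = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X for ambiguous codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def translate_fasta(text: str, width: int = 60) -> str:
    """Return protein FASTA text for every CDS record in the input."""
    out = []
    for title, seq in read_fasta(text):
        protein = translate(seq)
        out.append(">" + title)
        out.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(out) + "\n"


# A made-up record, not real data: ATG GCC AAA TGG TAA
example = ">toy_cds made-up example\nATGGCCAAA\nTGGTAA\n"
print(translate_fasta(example), end="")
```

Each Bio* project would presumably replace all of this with a few calls
into its own parser and translation routines; the point of the sketch is
only to pin down what the task asks for.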

Ideally these would all need a short motivational section explaining
why you might want to do this particular task.  There is probably a
balance between trivial and too complex.

These could be compiled on a shared OBF wiki, together with any input
files required.  It would be up to the individual projects to write
their own sample code to do this task - perhaps hosted on the Bio*
project specific wiki pages, but linked to from the use case.

Potentially this would be a huge project, but it would also be a nice
resource [provided it was maintained and kept up to date as the
toolkits evolve].  Perhaps this is too ambitious?

> On the biopython list they told me that a big issue is the license
> with which the data is released. I don't have any inconvenience in
> contributing examples with a GPL or without license, but I understand
> other people could do.
> Somebody told me that there were some interesting discussions on
> scipy.org, but I couldn't find them.

Licensing and copyright are valid concerns.  Also the different Bio*
projects use different licenses - and I suspect none of them are
compatible with the GPL.  Any licence would have to allow all the Bio*
projects to copy the example files into their code with no strings
attached - ideally just "public domain" or MIT/BSD style.

I would like to see a general collection of real world samples of each
file format (these could be pointed to by any shared file format
documentation).  Between all the Bio* projects we probably have a good
collection already - but the provenance of each file would have to be
looked at as well as the licence. In addition, artificial hand-edited
files could be useful which include valid but unusual content to test
the Bio* projects' parsers.  I don't think this actually needs to be
in a repository, but that would be nice for tracking ownership.
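
To make the idea of hand-edited "valid but unusual" files a bit more
concrete, here is a small sketch in plain Python. Everything in it is an
assumption made for illustration: the case names, the toy sequences, and
the choice of which oddities count as valid FASTA (the format has no
strict specification, so each case would need to be agreed on). The
harness takes any parser as a callable, so each project could plug in
its own.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (title line without ">", sequence)

# Hand-written inputs with the records a parser is expected to return.
# All names and sequences are made up for illustration.
CASES: List[Tuple[str, str, List[Record]]] = [
    ("plain", ">seq1 first\nACGT\nACGT\n", [("seq1 first", "ACGTACGT")]),
    ("blank lines", ">seq1\n\nACGT\n\n", [("seq1", "ACGT")]),
    ("no trailing newline", ">seq1\nACGT", [("seq1", "ACGT")]),
    ("windows line endings", ">seq1\r\nACGT\r\n", [("seq1", "ACGT")]),
    ("greater-than in title", ">seq1 a>b\nACGT\n", [("seq1 a>b", "ACGT")]),
]


def run_cases(parser: Callable[[str], List[Record]]) -> List[str]:
    """Return the names of the cases the parser gets wrong."""
    failed = []
    for name, text, expected in CASES:
        try:
            ok = list(parser(text)) == expected
        except Exception:  # a crash counts as a failure too
            ok = False
        if not ok:
            failed.append(name)
    return failed


def naive_parser(text: str) -> List[Record]:
    """A deliberately simplistic parser, used here only as a test subject."""
    records = []
    for block in text.split(">")[1:]:
        lines = block.split("\n")
        records.append((lines[0], "".join(lines[1:])))
    return records


print(run_cases(naive_parser))
```

The naive parser copes with the ordinary cases but trips over the Windows
line endings and the ">" inside a title, which is the sort of thing such
shared files would be meant to expose.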

I think it would be up to the individual projects to pull in any files
of interest for use in their own test suites (essentially copying
example files into their own repositories).

> What I think that could be useful for this project is a ticket
> tracking system, or better said a feature request system, to keep
> track of all the things needed.

The OBF already runs a bugzilla installation used by most of the Bio*
projects, which would probably be OK for this sort of thing.

Peter


From dalloliogm at gmail.com  Wed Dec 10 11:31:13 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Dec 2008 12:31:13 +0100
Subject: [Open-bio-l] a common repository for test datasets/use cases
	for all Bio* projects
In-Reply-To: <320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
References: <5aa3b3570810280406i52c61a4cxecc39016a432876b@mail.gmail.com>
	<3DD9AC3A-56C9-4514-A7DB-DBA649AA2976@bioperl.org>
	<5aa3b3570812041006o67c76387o84848d38853116bd@mail.gmail.com>
	<320fb6e00812041043m7b711210t94febdf49696014e@mail.gmail.com>
Message-ID: <5aa3b3570812100331i1f1e34deic779820308c31a6a@mail.gmail.com>

On 12/4/08, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Giovanni wrote:
>  > For example I have it not clear how a use case could be written to be
>  > the best useful for all the different Bio::* projects, and other
>  > things.
>
>
> In terms of use cases, I would imagine things like the following:
>
>  (1) Take a provided set of CDS nucleotide sequences in FASTA format,
>  translate them using NCBI codon table 11 (bacteria), and output the
>  results as a FASTA file of protein sequences.
>
>  (2) Take a provided set of protein sequences, and do pairwise
>  alignments between them all using the EMBOSS tool needle.
>
>  (3) Take a provided FASTA file of proteins, and run ClustalW on it
>  using the default settings.  Take the multiple sequence alignment in
>  ClustalW format, and convert it into Stockholm format.  Then build a
>  neighbour joining tree using the quick-tree program (which cannot read in
>  ClustalW files directly).  Finally, load the tree file and produce a
>  cladogram where the taxon/leaf XXX is highlighted in red.
>
>  (4) Take a provided author name and keyword, and query the NCBI Entrez
>  web interface to get a list of matching papers.  Download these
>  references (maybe as MedLine format, maybe as XML) and parse the
>  result into a CSV file for input into your reference manager (e.g.
>  EndNote - or generate a BibTeX file for use with LaTeX).
>
>  (5) Take a provided species name, and use NCBI Entrez to download
>  all matching EST sequences to a FASTA format file.
>
>  (6) Take a provided FASTA file of proteins and use standalone NCBI
>  BLASTP to search them against the NR database using an expectation
>  threshold of 10^-6 and at most ten alignments per query.  Parse the
>  results, and generate a new FASTA file of the protein sequences where
>  the description line includes the protein identifiers of closely
>  related entries found with BLAST.  [A more sensible approach to
>  automatic annotation would be nice, but more complicated]

ok, these are good examples.

I would also add a title (e.g. 1: Translating a CDS sequence), just
for convenience.

We could also add some examples of the expected outputs. Example: if
you apply the procedure in case 1 to the file COX1_cds.fasta, you
should obtain exactly the file COX1_protein.fasta.
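
A check of this kind could be sketched as follows, in plain Python. This
is only an assumption about how "exactly the same file" might be tested:
here two FASTA texts count as equal when they contain the same
identifiers and sequences in the same order, ignoring line wrapping,
letter case and the free-text part of the title. The function names and
the toy records are invented for the example.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (identifier, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            records.append((words[0] if words else "", ""))
        elif line and records:
            identifier, sequence = records[-1]
            records[-1] = (identifier, sequence + line.upper())
    return records


def compare_fasta(actual: str, expected: str) -> List[str]:
    """Return a list of human-readable differences (empty if none)."""
    got, want = parse_fasta(actual), parse_fasta(expected)
    problems = []
    if len(got) != len(want):
        problems.append(f"record count: got {len(got)}, expected {len(want)}")
    for number, (g, w) in enumerate(zip(got, want), start=1):
        if g[0] != w[0]:
            problems.append(f"record {number}: identifier {g[0]!r} != {w[0]!r}")
        if g[1] != w[1]:
            problems.append(f"record {number}: sequence differs")
    return problems


# Toy strings standing in for a script's output and the expected file.
expected = ">COX1 toy expected output\nMAKW\n"
same_but_rewrapped = ">COX1\nMA\nKW\n"
truncated = ">COX1\nMAK\n"

print(compare_fasta(same_but_rewrapped, expected))
print(compare_fasta(truncated, expected))
```

In practice the two strings would be read from the script's output and
from the shared COX1_protein.fasta; a byte-for-byte comparison would be
stricter, but would also fail on harmless differences such as line width.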

Moreover, a possible approach is to write a script that executes the
same actions described in the use cases.
I have seen people do this to test web applications (zope). They wrote
scripts using perl's LWP or python's webbrowser libraries to
execute all the actions that a user can perform in a use case
scenario.
However, this is too much work for now; better to leave it for later.


>  Ideally these would all need a short motivational section explaining
>  why you might want to do this particular task.  There is probably a
>  balance between trivial and too complex.
>
>  These could be compiled on a shared OBF wiki, together with any input
>  files required.  It would be up to the individual projects to write
>  their own sample code to do this task - perhaps hosted on the Bio*
>  project specific wiki pages, but linked to from the use case.

Can we put it somewhere here:
- http://www.open-bio.org/wiki/Main_Page
?


>  Potentially this would be a huge project, but it would also be a nice
>  resource [provided it was maintained and kept up to date as the
>  toolkits evolve].  Perhaps this is too ambitious?

maybe it is :).
It will be very difficult to keep everything up to date, given the
speed at which new technologies come out nowadays.
However, I think that many people could be interested in
contributing to it. And for many researchers, it could be easier to
contribute a description of what they want to do with their data
than to contribute code.


>  > On the biopython list they told me that a big issue is the license
>  > with which the data is released. I don't have any inconvenience in
>  > contributing examples with a GPL or without license, but I understand
>  > other people could do.
>  > Somebody told me that there were some interesting discussions on
>  > scipy.org, but I couldn't find them.
>
>
> Licensing and copyright are valid concerns.  Also the different Bio*
>  projects use different licenses - and I suspect none of them are
>  compatible with the GPL.  Any licence would have to allow all the Bio*
>  projects to copy the example files into their code with no strings
>  attached - ideally just "public domain" or MIT/BSD style.

I agree on any open license; MIT/BSD should be ok.


>  I would like to see a general collection of real world samples of each
>  file format (these could be pointed to by any shared file format
>  documentation).  Between all the Bio* projects we probably have a good
>  collection already - but the provenance of each file would have to be
>  looked at as well as the licence. In addition, artificial hand edited
>  files could be useful which include valid but unusual content to test
>  the Bio* project's parsers.  I don't think this actually needs to be
>  in a repository, but that would be nice for tracking ownership.

ok. In any case, wikis usually have a versioning system, so there is
not much difference.


>  I think it would be up to the individual projects to pull in any files
>  of interest for use in their own test suites (essentially copying
>  example files into their own repositories).
>
>
>  > What I think that could be useful for this project is a ticket
>  > tracking system, or better said a feature request system, to keep
>  > track of all the things needed.
>
>
> The OBF already runs a bugzilla installation used by most of the Bio*
>  projects, which would probably be OK for this sort of thing.




-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it