[BioPython] a common repository for test datasets/use cases for all Bio* projects

Tue Oct 28 13:56:34 UTC 2008

Chris Fields wrote:
> All,
>
> An open-bio repository had started up for this use at one point, 
> though I don't think it made the transition to subversion yet (and it 
> never really took off, not sure why).  You should try contacting 
> open-bio support and maybe Jason or Chris D. can answer this in a bit 
> more detail.
>
> chris
>
> On Oct 28, 2008, at 5:55 AM, Peter wrote:
>
>> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I would like to make you a proposal.
>>> Every module/program written in bioinformatics needs to be tested
>>> before it can be used to produce results that can be published.
>>> ...
>>> So, the point is.. what if we create a common repository for all this
>>> kind of testing data, to be used in common with all the other Bio*
>>> projects?
>>
>> You made some other good points, and this is a good idea.  In
>> practice the licences are usually OK for us to "borrow" example input
>> files from each other (and this does happen), but a more organised
>> system to encourage interchange of examples would be good.
>>
>> I think this sounds like an excellent topic for the (currently very
>> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
>> discussion, one of the OBF mailing lists, this should cover all the
>> Bio* project members interested).  See
>> http://lists.open-bio.org/mailman/listinfo
>>
>> Peter
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
There has been some discussion of data sets on the SciPy mailing lists 
that you should look up.

One of the most critical questions you must address is copyright and 
who owns the data sets (credit where credit is due). Any data in the 
repository will be distributed in some form, which brings in copyright 
issues. This is also country specific, because whether a data set can 
be copyrighted at all, and on what terms, varies (I am not a lawyer, so 
I cannot say more). Science Commons has useful information, especially 
its FAQ on databases, 
http://sciencecommons.org/resources/faq/databases/, which states "In 
the United States, data will be protected by copyright only if they 
express creativity".

I believe you would need to be very strict about what is acceptable, 
because once the data is distributable you cannot rely on the user 
being responsible:
1) If the data has been used in a publication, an extremely clear 
statement from the owner (publisher) that it can be made available is 
required.
2) If the data is derived from publicly available sources that allow 
this, e.g. UniProt (http://www.uniprot.org/help/license), then it must 
be exactly recreatable from that source, which means recording the 
specific release, since databases change.
3) If the data is from private sources, it must be released under a 
suitable license that cannot be superseded by publication or a change 
in ownership.
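
To make the points above concrete, here is a minimal sketch in Python of 
a provenance record that a repository could keep next to each file. All 
names (DatasetRecord, make_record, verify) are hypothetical and only 
illustrate the idea; nothing like this exists in any Bio* project as far 
as I know. The checksum is what makes "exactly recreatable" checkable.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical provenance record for one test data file."""
    name: str         # file name inside the repository
    source: str       # where the data came from, e.g. "UniProt"
    release: str      # exact release of the source database
    license_url: str  # where the redistribution terms are stated
    sha256: str       # checksum of the file contents as submitted


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(name: str, source: str, release: str,
                license_url: str, data: bytes) -> DatasetRecord:
    """Build a record, computing the checksum from the data itself."""
    return DatasetRecord(name, source, release, license_url, sha256_of(data))


def verify(record: DatasetRecord, data: bytes) -> bool:
    """True only if the data is byte-for-byte what was recorded."""
    return sha256_of(data) == record.sha256
```

Someone who regenerates the file from the stated source and release can 
then call verify() to confirm they obtained exactly the same data.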

Also, submitted data should not change once it is in the repository, 
even if it contains errors. For example, Fisher's iris data at 
http://archive.ics.uci.edu/ml/datasets/Iris has documented errors. 
Rather than editing a data set in place, it would be better to publish 
corrections under a new version number.
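
As a rough illustration of the version-number idea, the sketch below (in 
Python) refuses to overwrite a version that has already been published, 
so a correction must go in under a new number. The class and method 
names are hypothetical, and a real repository would store files on disk 
or in version control rather than in memory.

```python
class VersionedDatasets:
    """Hypothetical in-memory store where published versions are frozen."""

    def __init__(self):
        self._versions = {}  # maps (name, version) -> bytes

    def publish(self, name: str, version: int, data: bytes) -> None:
        """Add a new version; an existing version is never replaced."""
        key = (name, version)
        if key in self._versions:
            raise ValueError(
                f"{name} v{version} is already published; "
                "use a new version number"
            )
        self._versions[key] = bytes(data)

    def get(self, name: str, version: int) -> bytes:
        """Return the data exactly as it was published for that version."""
        return self._versions[(name, version)]

    def latest(self, name: str) -> int:
        """Return the highest version number published under this name."""
        versions = [v for (n, v) in self._versions if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions)
```

Tests written against version 1 keep giving the same results after a 
corrected version 2 appears, which is the point of not changing the 
data in place.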

Regards
Bruce