[BioPython] a common repository for test datasets/use cases for all Bio* projects
Bruce Southey
bsouthey at gmail.com
Tue Oct 28 13:56:34 UTC 2008
Chris Fields wrote:
> All,
>
> An open-bio repository had started up for this use at one point,
> though I don't think it made the transition to subversion yet (and it
> never really took off, not sure why). You should try contacting
> open-bio support and maybe Jason or Chris D. can answer this in a bit
> more detail.
>
> chris
>
> On Oct 28, 2008, at 5:55 AM, Peter wrote:
>
>> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I would like to make you a proposal.
>>> Every module/program written in bioinformatics needs to be tested
>>> before it can be used to produce results that can be published.
>>> ...
>>> So, the point is.. what if we create a common repository for all this
>>> kind of testing data, to be used in common with all the other Bio*
>>> projects?
>>
>> You made some other good points, and this is a good idea. In
>> practice the licences are usually OK for us to "borrow" example input
>> files from each other (and this does happen), but a more organised
>> system to encourage interchange of examples would be good.
>>
>> I think this sounds like an excellent topic for the (currently very
>> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
>> discussion, one of the OBF mailing lists, this should cover all the
>> Bio* project members interested). See
>> http://lists.open-bio.org/mailman/listinfo
>>
>> Peter
>> _______________________________________________
>> BioPython mailing list - BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
There has been some discussion on the scipy lists about data sets that you
should look up.
One of the most critical questions that you must address is copyright
and who owns the data sets (credit where credit is due). Ultimately any
data in the repository will be distributed in some form, and that brings
in copyright issues. This is also country specific, because there is the
question of whether or not a data set can be copyrighted at all and on
what terms; I am not a lawyer, so I cannot say. Science Commons has
other useful information, especially the FAQ on databases,
http://sciencecommons.org/resources/faq/databases/, which states "In the
United States, data will be protected by copyright only if they express
creativity".
I do believe you would need to be very strict about what is acceptable,
because once the data is distributable you cannot rely on the user being
responsible:
1) If the data has been used for publication, an extremely clear statement
from the owner (publisher) that it can be made available is required.
2) If the data is created from publicly available sources that allow it,
e.g. UniProt (http://www.uniprot.org/help/license), then exactly
recreatable sets must be made available so that the same data can be
obtained again from that source (this must include the specific release,
as databases change).
3) If the data is from private sources then it must be released under a
suitable license that cannot be superseded by publication or a change in
ownership.
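As a rough illustration of how the three rules above could be checked at submission time, here is a minimal Python sketch. Everything in it is hypothetical: the DatasetRecord class, its field names, the three origin labels and the check_record function are assumptions made for this example, not part of any existing Bio* project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical provenance record for one submitted test data set."""
    name: str
    origin: str                           # "published", "public_source" or "private"
    license: Optional[str] = None         # license the data is released under (rule 3)
    permission: Optional[str] = None      # owner/publisher statement (rule 1)
    source_release: Optional[str] = None  # specific release of the source (rule 2)
    recipe: Optional[str] = None          # how to re-obtain the same data (rule 2)


def check_record(rec: DatasetRecord) -> List[str]:
    """Return the reasons to reject a record; an empty list means acceptable."""
    problems = []
    if rec.origin == "published":
        if not rec.permission:
            problems.append("missing owner/publisher permission statement")
    elif rec.origin == "public_source":
        if not rec.source_release:
            problems.append("missing source release")
        if not rec.recipe:
            problems.append("missing recipe to recreate the set")
    elif rec.origin == "private":
        if not rec.license:
            problems.append("missing license")
    else:
        problems.append("unknown origin: %r" % rec.origin)
    return problems
```

A submission would be accepted only when check_record returns an empty list; what counts as a sufficient permission statement or a suitable license is still a decision for a person, not for the code.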
Also, the submitted data should not change even if there are errors. For
example, Fisher's iris data at
http://archive.ics.uci.edu/ml/datasets/Iris has documented errors.
Rather than editing a data set in place, it would be better to publish
corrections under new version numbers.
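One way to keep submitted data fixed while still allowing corrections is sketched below in Python, under stated assumptions: each version is identified by a SHA-256 digest, and a correction is registered as a new version instead of replacing the old one. The publish and verify functions and the in-memory registry are hypothetical names chosen for illustration only.

```python
import hashlib
from typing import Dict, Tuple

# Hypothetical registry: (data set name, version) -> SHA-256 hex digest
Registry = Dict[Tuple[str, int], str]


def publish(registry: Registry, name: str, content: bytes) -> int:
    """Register content as the next version of name; never overwrite."""
    version = 1 + max((v for (n, v) in registry if n == name), default=0)
    registry[(name, version)] = hashlib.sha256(content).hexdigest()
    return version


def verify(registry: Registry, name: str, version: int, content: bytes) -> bool:
    """Check that content is byte-for-byte the registered version."""
    return registry[(name, version)] == hashlib.sha256(content).hexdigest()
```

With this scheme a test suite can pin the exact version it was written against, errors included, and a corrected file does not silently change earlier results.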
Regards
Bruce