From hamish.mcwilliam at bioinfo-user.org.uk Thu Dec 15 13:01:40 2011
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Thu, 15 Dec 2011 18:01:40 +0000
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

Message-ID:

Hi Chris,

> That might be the best source to pull from. Does it archive old file
> examples (such as older SwissProt/GenBank/EMBL)?

EDAM itself does not store entry data, and at the moment it does not
describe the changes to formats over time, although I'm sure this
could be added along with links to sample entries in the various data
archives.

If you only need a few sample entries, see the appropriate database archive:

- EMBL-Bank Sequence Version Archive (EMBL-SVA):
  http://www.ebi.ac.uk/cgi-bin/sva/sva.pl
  E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go
- UniProtKB Sequence/Annotation Version Archive (UniSave):
  http://www.ebi.ac.uk/uniprot/unisave/
  E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go
- NCBI Entrez Revision History.
  E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist

If you need more entries...

For Swiss-PROT and UniProtKB old versions of the data are available on
the FTP sites, for example from EMBL-EBI:
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/
- ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/

For GenBank, Don Gilbert collected various old releases a while back
(http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html),
these are available via the BioMirrors (http://www.bio-mirror.net/).
NCBI may also be able to provide old releases on request.

For EMBL-Bank old releases can be made available on request, contact
ENA (http://www.ebi.ac.uk/ena/about/contact) for more information.
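The per-entry lookups above lend themselves to scripting from a list of
identifiers. The following is a minimal sketch only, not EBI's client
code. It assumes the dbfetch URL pattern seen in the example URLs
elsewhere in this thread (query parameters db, id, format and style,
with an archived entry version written as ACCESSION.N). The function
names, the DBFETCH_BASE constant and the sample identifier list are
hypothetical, and the service address may have moved since these
messages were written.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address copied from the example URLs in this thread. Treat it as an
# assumption: service locations change over time.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, version=None, fmt=None, style=None):
    """Build a dbfetch-style URL for one entry (hypothetical helper)."""
    if version is not None:
        # Archived versions are addressed as ACCESSION.N, e.g. P00002.1
        entry_id = f"{entry_id}.{version}"
    params = {"db": db, "id": entry_id}
    if fmt is not None:
        params["format"] = fmt
    if style is not None:
        params["style"] = style
    return f"{DBFETCH_BASE}?{urlencode(params)}"


def fetch_entry(url, timeout=30):
    """Download the raw text of one entry. Requires network access."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical identifier list: (database, accession, version or None).
SAMPLE_ENTRIES = [
    ("unisave", "P00002", 1),     # a specific archived version
    ("unisave", "P00002", None),  # whatever the archive currently serves
]

if __name__ == "__main__":
    for db, accession, version in SAMPLE_ENTRIES:
        print(dbfetch_url(db, accession, version))
```

Keeping only the identifier list under version control and calling
fetch_entry() on demand is one way to avoid shipping the data itself.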
All the best,

Hamish

> chris
>
> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote:
>
>> I just checked with Jon and he was happy to forward this back to
>> the list, and also added a couple of URLs that I'd asked about:
>>
>> http://bioportal.bioontology.org/ontologies/44600
>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM
>>
>> Peter
>>
>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote:
>>> Hi Peter (and Peter)
>>>
>>> Just a quick note to say that all (well, nearly all) common bioinformatics data formats are
>>> catalogued in the EDAM ontology:
>>>
>>> http://sourceforge.net/projects/edamontology/files
>>> http://edamontology.sourceforge.net/
>>>
>>> OK - there's bound to be some we've missed :)
>>>
>>> Anyhow, I thought it might help to structure any effort to document data formats (an effort which
>>> I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format"
>>> definitions is a link to the format specification, or failing that, an example.
>>>
>>> Cheers both
>>>
>>> Jon

From cjfields at illinois.edu Thu Dec 15 14:07:44 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 15 Dec 2011 13:07:44 -0600
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

Message-ID: <4EEA4580.7010901@illinois.edu>

Hamish,

Reason I ask, the various Bio* and EMBOSS projects have a share of old (and
possibly duplicate) data examples, but it might be nice to standardize on a
common set of records, simply for less data duplication.

As an example, have a git repo of purely data or links to data that we could
'git submodule' in for code distribution, release, and testing purposes, but
that wouldn't bloat the code repository.

chris

_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l

From p.j.a.cock at googlemail.com Fri Dec 16 07:11:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Dec 2011 12:11:51 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu>
Message-ID:

On Thu, Dec 15, 2011 at 10:01 PM, Hamish McWilliam wrote:
>> Just a quick update on this: the old OBDA specs were still in CVS in
>> the obda-specs module (the old obda site had the module wrong).
>> I ran git cvsimport on that after I copied the CVS repo to my laptop,
>> so it's now on github:
>>
>> https://github.com/OBF/OBDA
>>
>> We could probably work on updates from there.
>
> At the risk of derailing the current thread... a few comments on the
> "modules" in the old OBDA:

Well, given the broad title of OBDA redux, why not?

> - BioCorba: while CORBA may live on in some embedded applications it
> has mostly been replaced by SOAP and REST web services. I suspect
> there are few copies of the BioCorba IDLs surviving today. Possibly of
> historic interest, but since it doesn't actually include the IDLs it
> is not really of any use.

As far as I know, BioCorba is defunct.

> - biofetch: originally implemented in EBI's dbfetch, also implemented
> by BioRuby as biofetch which had a few extensions. EBI's dbfetch has
> since been reimplemented and attempts to be compatible but only
> provides partial support along with various extensions, including
> those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.
> I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure
> of the current status for BioJava and BioPython.

Current Biopython doesn't have anything for this, but I would
probably want to implement this as a client, not a server.

> - BioSQL: as you all know over at http://www.biosql.org/. The document
> should probably be updated to point there.

Agreed, done:
https://github.com/OBF/OBDA/commit/5798f0b4a0e3b7fd0595e0ab3017d3afdda53549

> - bioindex: the flat-file and BDB indexing formats. To which the
> SQLite option will be added?

Basically yes.

> - naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide
> possible replacements when required.

This also has implications for the bioindex code as we need to
specify the file format being indexed (e.g. FASTA or GenBank).

> - bioregistry: database discovery and meta-data. From having tried to
> implement this, the bioregistry is too limited in the available
> meta-data to be very useful, especially when it comes to data format
> handling. Compare with the database definitions in EMBOSS
> (http://emboss.sourceforge.net/docs/themes/Databases.html) and the
> dbfetch meta-data
> (http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information).

There was some partial code for this in Biopython, but it was
deprecated and removed some time ago.

> - XEmbl: REST and SOAP access to EMBL-Bank entries in XML.
> The EBI's XEmbl service was replaced by the dbfetch
> (http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch
> (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services,
> since these provide roughly the same functionality with wider data
> format support.

Presumably the XML format for EMBL is now one of the INSDC formats
also used at the NCBI? In any case, that whole folder is purely
describing an (obsolete) EBI service, so can we just delete it?

> Since I've been attempting to get dbfetch to support the biofetch and
> bioregistry specifications, my interest is much more at the web
> service end of things. I can certainly see options for using the
> current alternatives in dbfetch and EMBOSS to revise the
> specifications for biofetch and bioregistry.
>
> Hamish

How does biofetch/bioregistry compare to DAS?

Separately, I suggest we rename the OBDA/preamble.txt file to
README (or README.*) so it gets shown in GitHub, and then
update it following this discussion with some context (like
dates and the current status of the different parts).

We should probably make the old OBDA CVS read only now.
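On the bioindex point above (an SQLite option, and the need to record
which file format is being indexed), here is a rough sketch of the idea
only. It is not the OBDA flat-file/BDB specification and not the
indexing code in Biopython or BioPerl: the table names, column names
and both function names are invented for illustration. It stores the
byte offset and length of each FASTA record, plus a 'format' entry in a
metadata table, so the raw text of a record can be fetched by ID.

```python
import sqlite3


def build_index(fasta_path, db_path=":memory:"):
    """Index a FASTA file by record ID (hypothetical schema, not OBDA's)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
    )
    # Record the indexed file format alongside the offsets.
    conn.execute("INSERT OR REPLACE INTO meta VALUES ('format', 'fasta')")

    records = []
    current_id, start, offset = None, 0, 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if current_id is not None:
                    records.append((current_id, start, offset - start))
                parts = line[1:].split()
                current_id = parts[0].decode("ascii") if parts else ""
                start = offset
            offset += len(line)
    if current_id is not None:
        records.append((current_id, start, offset - start))

    conn.executemany("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def get_raw(conn, fasta_path, record_id):
    """Return the raw text of one record, located through the index."""
    row = conn.execute(
        "SELECT start, length FROM offsets WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        return handle.read(row[1]).decode("ascii")
```

A GenBank or EMBL variant would only need a different record-start test
and a different value in the meta table.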
Peter

From hamish.mcwilliam at bioinfo-user.org.uk Thu Dec 15 15:36:35 2011
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Thu, 15 Dec 2011 20:36:35 +0000
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4EEA4580.7010901@illinois.edu>
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

<4EEA4580.7010901@illinois.edu>
Message-ID:

Chris,

> Reason I ask, the various Bio* and EMBOSS projects have a share of old (and
> possibly duplicate) data examples, but it might be nice to standardize on a
> common set of records, simply for less data duplication.
>
> As an example, have a git repo of purely data or links to data that we could
> 'git submodule' in for code distribution, release, and testing purposes, but
> that wouldn't bloat the code repository.

It is debatable whether version control is necessary for this: each
sample entry is a snapshot obtained from the data source, so there is
only ever one version of each file, and a file for each format version
is required for testing purposes anyway. So for a test data archive
plain old FTP would be sufficient, with fetch scripts if required.

Since the historic data is most useful for compatibility testing and
the archives all have web services attached (e.g. EMBL-SVA and UniSave
are available through dbfetch, see
http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases), fetching
the required entries when/if necessary seems a more appropriate
approach. For example I doubt if many people need to test
compatibility with the Swiss-PROT 9.0 (1988) entry format
(http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1),
so there is little need to duplicate this data in every developer's
set-up. In contrast users expect everything to be tested with the
current entry format, also available from UniSave
(http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but
the only way to be sure it is the current data is to fetch it from one
of the primary sources.

Given the major databases implement versioning schemes, fetching
specific versions of entries is simple. For less well known databases,
databases which are no longer active or application results (the evils
of NCBI BLAST output parsing come to mind) having a standard set would
be useful. However in the case of the databases the data resources may
provide appropriate mechanisms to fetch this data, so building fixed
sets may be unnecessary for anything other than caching. Unless you
are talking large data sets, in which case you are going to want them
to be optional anyway, and you certainly don't want to put them under
version control.

So for the databases with archives I would tend towards just keeping
identifier lists for representative entries and fetching the required
entry flavours when/if required. This prevents duplication, ensures
current data tests are against the current data, and provides the
option of shipping the fetch script instead of the data for cases
where copyright licensing is an issue. For the rest a collection of
static files on an FTP/web site would have it covered.

Hamish

--
----
"Saying the internet has changed dramatically over the last five years
is cliché - the internet is always changing dramatically"
- Craig Labovitz, Arbor Networks.

From hamish.mcwilliam at bioinfo-user.org.uk Thu Dec 15 17:01:17 2011
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Thu, 15 Dec 2011 22:01:17 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu>
References:

<6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu>
Message-ID:

> Just a quick update on this: the old OBDA specs were still in CVS in the obda-specs module (the old obda site had the module wrong). I ran git cvsimport on that after I copied the CVS repo to my laptop, so it's now on github:
>
> https://github.com/OBF/OBDA
>
> We could probably work on updates from there.

At the risk of derailing the current thread... a few comments on the
"modules" in the old OBDA:

- BioCorba: while CORBA may live on in some embedded applications it
has mostly been replaced by SOAP and REST web services. I suspect
there are few copies of the BioCorba IDLs surviving today. Possibly of
historic interest, but since it doesn't actually include the IDLs it
is not really of any use.
- biofetch: originally implemented in EBI's dbfetch, also implemented
by BioRuby as biofetch which had a few extensions. EBI's dbfetch has
since been reimplemented and attempts to be compatible but only
provides partial support along with various extensions, including
those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.
I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure
of the current status for BioJava and BioPython.
- BioSQL: as you all know over at http://www.biosql.org/. The document
should probably be updated to point there.
- bioindex: the flat-file and BDB indexing formats. To which the
SQLite option will be added?
- naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide
possible replacements when required.
- bioregistry: database discovery and meta-data. From having tried to
implement this, the bioregistry is too limited in the available
meta-data to be very useful, especially when it comes to data format
handling. Compare with the database definitions in EMBOSS
(http://emboss.sourceforge.net/docs/themes/Databases.html) and the
dbfetch meta-data
(http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information).
- XEmbl: REST and SOAP access to EMBL-Bank entries in XML.
The EBI's XEmbl service was replaced by the dbfetch
(http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch
(http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services,
since these provide roughly the same functionality with wider data
format support.

Since I've been attempting to get dbfetch to support the biofetch and
bioregistry specifications, my interest is much more at the web
service end of things. I can certainly see options for using the
current alternatives in dbfetch and EMBOSS to revise the
specifications for biofetch and bioregistry.

Hamish

> chris
>
> On Nov 18, 2011, at 7:45 AM, Fields, Christopher J wrote:
>
>> On Nov 18, 2011, at 5:21 AM, Peter Cock wrote:
>>
>>> On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote:
>>>> ...
>>>> And which are the information you want to extract once you
>>>> have your index ?
>>>>
>>>
>>> Biopython and BioPerl have their SeqIO parsers hooked up
>>> to indexing code. This means you can access a record via its
>>> ID, and it is parsed for you on demand - just like if you'd
>>> iterated over the file in order parsing the records one by one.
>>>
>>> Biopython (not sure about BioPerl) can also just fetch the raw
>>> text of that record.
>>
>> Re: BioPerl, I'm not sure about the OBDA implementations, but I know the older Bio::Index modules allow this. I would be surprised if the OBDA-specific code didn't, but adding this should be easy.
>>
>> chris

From cjfields at illinois.edu Fri Dec 16 16:12:45 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 16 Dec 2011 15:12:45 -0600
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

<4EEA4580.7010901@illinois.edu>
Message-ID: <4EEBB44D.4050905@illinois.edu>

On 12/15/2011 02:36 PM, Hamish McWilliam wrote:
> Chris,
>
>> Reason I ask, the various Bio* and EMBOSS projects have a share of old (and
>> possibly duplicate) data examples, but it might be nice to standardize on a
>> common set of records, simply for less data duplication.
>>
>> As an example, have a git repo of purely data or links to data that we could
>> 'git submodule' in for code distribution, release, and testing purposes, but
>> that wouldn't bloat the code repository.
> It is debatable if version control is necessary for this, each sample
> entry is a snapshot obtained from the data source thus there is only
> ever one version of each file, and a file for each format version is
> required for testing purposes anyway. So for a test data archive plain
> old FTP would be sufficient, with fetch scripts if required.

I agree; that or a combination of the two where appropriate. Caveats below.

> Since the historic data is most useful for compatibility testing and
> the archives all have web services attached (e.g. EMBL-SVA and UniSave
> are available through dbfetch, see
> http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases) fetching
> the required entries when/if necessary seems a more appropriate
> approach. For example I doubt if many people need to test
> compatibility with the Swiss-PROT 9.0 (1988) entry format
> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1),
> so there is little need to duplicate this data in every developer's
> set-up. In contrast users expect everything to be tested with the
> current entry format, also available from UniSave
> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but
> the only way to be sure it is the current data is to fetch it from one
> of the primary sources.
>
> Given the major databases implement versioning schemes fetching
> specific versions of entries is simple. For less well known databases,
> databases which are no longer active or application results (the evils
> of NCBI BLAST output parsing come to mind) having a standard set would
> be useful. However in the case of the databases the data resources may
> provide appropriate mechanisms to fetch this data, so building fixed
> sets may be unnecessary for anything other than caching. Unless you
> are talking large data sets, in which case you are going to want them
> to be optional anyway, and you certainly don't want to put them under
> version control.

The key concerning word there is 'may', and I hesitate to rely on the
certainty that some web services will be available indefinitely. A
recent talk I attended (I believe at the last Galaxy Conference)
mentioned the percentage of published web services that have persistent
URLs over the years (e.g. found in the same location or are redirected
to a new location). The number is depressingly low, I want to say
25-40%, but not sure.

Even for major databases, we have had web service apps move or
disappear; I just fixed one related to NCBI revision history. In most
cases there are notifications, but not always.

Just from that perspective alone, I find static files to be a nice
fallback.

> So for the databases with archives I would tend towards just keeping
> identifier lists for representative entries and fetching the required
> entry flavours when/if required. This prevents duplication, ensures
> current data tests are against the current data, and provides the
> option of shipping the fetch script instead of the data for cases
> where copyright licensing is an issue. For the rest a collection of
> static files on an FTP/web site would have it covered.
>
> Hamish

It's not an unreasonable expectation that some parsers would need
support for both old and new ('old' being something within a sane time
period). Regardless, most (all?) OBF projects have been around long
enough that this isn't often an issue, but there are data formats that
rapidly evolve (NCBI BLAST text being an example). Even that is a bit
of an exception, as NCBI has long recommended not relying on text
parsing as being stable as they reserve the right to add changes that
may break things (so for users caveat emptor).

chris

From hamish.mcwilliam at bioinfo-user.org.uk Tue Dec 20 06:27:12 2011
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Tue, 20 Dec 2011 11:27:12 +0000
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4EEBB44D.4050905@illinois.edu>
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

<4EEA4580.7010901@illinois.edu> <4EEBB44D.4050905@illinois.edu> Message-ID: Hi Chris, >>> Reason I ask, the various Bio* and EMBOSS projects have a share of old >>> (and >>> possibly duplicate) data examples, but it might be nice to standardize on >>> a >>> common set of records, simply for less data duplication. >>> >>> As an example, have a git repo of purely data or links to data that we >>> could >>> 'git submodule' in for code distribution, release, and testing purposes, >>> but >>> that wouldn't bloat the code repository. >> >> It is debatable if version control is necessary for this, each sample >> entry is a snapshot obtained from the data source thus there is only >> ever one version of each file, and a file for each format version is >> required for testing purposes anyway. So for a test data archive plain >> old FTP would be sufficient, with fetch scripts if required. > > I agree; that or a combination of the two where appropriate. ?Caveats below. > >> Since the historic data is most useful for compatibility testing and >> the archives all have web services attached (e.g. EMBL-SVA and UniSave >> are available through dbfetch, see >> http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases) fetching >> the required entries when/if necessary seems a more appropriate >> approach. For example I doubt if many people need to test >> compatibility with the Swiss-PROT 9.0 (1988) entry format >> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1), >> so there is little need to duplicate this data in every developers >> set-up. In contrast users expect everything to be tested with the >> current entry format, also available from UniSave >> (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but >> the only way to be sure it is the current data is to fetch it from one >> of the primary sources. >> >> Given the major databases implement versioning schemes fetching >> specific versions of entries is simple. 
For less well known databases, >> databases which are no longer active or application results (the evils >> of NCBI BLAST output parsing come to mind) having a standard set would >> be useful. However in the case of the databases the data resources may >> provide appropriate mechanisms to fetch this data, so building fixed >> sets may be unnecessary for anything other than caching. Unless you >> are talking large data sets, in which case you are going to want them >> to be optional anyway, and you certainly don't want to put them under >> version control. > > The key concerning word there is 'may', and I hesitate to rely on the > certainty that some web services will be available indefinitely. ?A recent > talk I attended (I believe at the last Galaxy Conference) mentioned the > percentage of published web services that have persistent URLs over the > years (e.g. found in the same location or are redirected to a new location). > ?The number is depressingly low, I want to say 25-40%, but not sure. One of the things to remember here is that the web services which can be used to fetch the data depend on usage to justify their funding and thus their future. Not using these services makes it more difficult for the service provider(s) to justify the resource costs involved in maintaining and running the services. For example: the support for EMBL-Bank entry names (which were removed in EMBL-Bank release 87, June 2006) in EBI dbfetch (e.g. http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=embl&id=BUM&format=embl&style=raw) is present due to the use of the entry name 'BUM' in the test and example code for Bio::DB::EMBL, this feature is very rarely used elsewhere. It is well accepted that there is a lot of churn in the world of web services, and the causes for this are well known. 
The various web service registry projects (see http://www.ebi.ac.uk/Tools/webservices/tutorials/05_registries for a selection) attempt to provide some form of tracking, and a method for finding replacement services. Projects such as ELIXIR (http://www.elixir-europe.org/) hope to address some of the causes. > Even for major databases, we have had web service apps move or disappear, > just fixed one related to NCBI revision history. In most cases there are > notifications, but not always. All the major providers give notification of changes, and if possible attempt to provide backward compatibility and transition periods to allow users to migrate. Things could be clearer of course, and there will always be some services which exhibit a higher rate of change, with or without notice. PhD research projects come to mind as a type of service which often changes rapidly before a short period of stability, followed by unavailability or a long tail of no maintenance. > Just from that perspective alone, I find static files to be a nice fallback. As a fallback it may be a suitable option; this is covered by my earlier comment about caching. However, I would try to avoid duplicating the data from the start and only provide static copies if an alternative source is not available. >> So for the databases with archives I would tend towards just keeping >> identifier lists for representative entries and fetching the required >> entry flavours when/if required. This prevents duplication, ensures >> current data tests are against the current data, and provides the >> option of shipping the fetch script instead of the data for cases >> where copyright licensing is an issue. For the rest a collection of >> static files on an FTP/web site would have it covered. > > It's not an unreasonable expectation that some parsers would need support > for both old and new ('old' being something within a sane time period).
Having recently done some work with the patent side of things, I am no longer sure that there is such a thing as a "sane time period". For that community there is a requirement to be able to access all data since the beginning of time, or at least make a reasonable attempt to get hold of that data, in order to judge if a patent application can be granted or a grant overturned. However they are a special case... for most purposes I suspect that coverage of the last 5 years is probably enough, as long as it is clear for each release what the data support interval is. I'm sure there will be a few cases where a longer interval may be necessary. For example: EMBL-Bank changed the format of the ID line in release 87 (June 2006), but many of the other databases using the EMBL-Bank format (e.g. IMGT/HLA) have not switched to use the new format. > Regardless, most (all?) OBF projects have been around long enough this isn't > often an issue, but there are data formats that rapidly evolve (NCBI BLAST > text being an example). Even that is a bit of an exception, as NCBI has > long recommended not relying on text parsing as being stable as they reserve > the right to add changes that may break things (so for users caveat emptor). NCBI BLAST has long been a favourite example of this, and NCBI say that the ASN.1 or the XML should be used if you want to parse it. Given the intermediate format support in NCBI BLAST+ this is now less of an issue, since obtaining multiple formats from the results of a single search is possible in the standalone version as well as in the web service(s), so you can have the text report alongside a parseable representation if required.
There are still many cases, though, where the text report has to be used and multi-version support is required. For example, MView (http://bio-mview.sourceforge.net/) supports many different versions of the BLAST format since those files tend to be what users have saved from their searches, and the search may have been some years back. It also depends on how detailed your parsing is: the formats have tended to change slowly in structure, but the specific details often change rapidly. This has effects when performing some types of format conversion, or performing data verification checks. All the best, Hamish > chris > >>> On 12/15/2011 12:01 PM, Hamish McWilliam wrote: >>>> >>>> Hi Chris, >>>> >>>>> That might be the best source to pull from. Does it archive old file >>>>> examples (such as older SwissProt/GenBank/EMBL)? >>>> >>>> EDAM itself does not store entry data, and at the moment it does not >>>> describe the changes to formats over time, although I'm sure this >>>> could be added along with links to sample entries in the various data >>>> archives. >>>> >>>> If you only need a few sample entries, see the appropriate database >>>> archive: >>>> >>>> - EMBL-Bank Sequence Version Archive (EMBL-SVA): >>>> http://www.ebi.ac.uk/cgi-bin/sva/sva.pl. >>>> E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go >>>> - UniProtKB Sequence/Annotation Version Archive (UniSave): >>>> http://www.ebi.ac.uk/uniprot/unisave/ >>>> E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go >>>> - NCBI Entrez Revision History. >>>> E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist >>>> >>>> If you need more entries...
>>>> >>>> For Swiss-PROT and UniProtKB old versions of the data are available on >>>> the FTP sites, for example from EMBL-EBI: >>>> - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/ >>>> - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/ >>>> >>>> For GenBank, Don Gilbert collected various old releases a while back >>>> (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html), >>>> these are available via the BioMirrors (http://www.bio-mirror.net/). >>>> NCBI may also be able to provide old releases on request. >>>> >>>> For EMBL-Bank old releases can be made available on request, contact >>>> ENA (http://www.ebi.ac.uk/ena/about/contact) for more information. >>>> >>>> All the best, >>>> >>>> Hamish >>>> >>>>> chris >>>>> >>>>> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote: >>>>> >>>>>> I just checked with Jon and he was happy to forward this back to >>>>>> the list, and also added a couple of URLs that I'd asked about: >>>>>> >>>>>> http://bioportal.bioontology.org/ontologies/44600 >>>>>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM >>>>>> >>>>>> Peter >>>>>> >>>>>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote: >>>>>>> >>>>>>> Hi Peter (and Peter) >>>>>>> >>>>>>> Just a quick note to say that all (well, nearly all) common >>>>>>> bioinformatics data formats are >>>>>>> catalogued in the EDAM ontology: >>>>>>> >>>>>>> http://sourceforge.net/projects/edamontology/files >>>>>>> http://edamontology.sourceforge.net/ >>>>>>> >>>>>>> OK - there's bound to be some we've missed :) >>>>>>> >>>>>>> Anyhow, I thought it might help to structure any effort to document >>>>>>> data formats (an effort which >>>>>>> I wholeheartedly approve of by the way). One thing I'd like to add to >>>>>>> the EDAM "format" >>>>>>> definitions is a link to the format specification, or failing that, an >>>>>>> example.
>>>>>>> >>>>>>> Cheers both >>>>>>> >>>>>>> Jon >>>> >>>> _______________________________________________ >>>> >>>> Open-Bio-l mailing list >>>> Open-Bio-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>> >>> >> >> > -- ---- "Saying the internet has changed dramatically over the last five years is cliché - the internet is always changing dramatically" - Craig Labovitz, Arbor Networks. From cjfields at illinois.edu Thu Dec 1 03:50:03 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Dec 2011 03:50:03 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

Message-ID: <6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu> Just a quick update on this: the old OBDA specs were still in CVS in the obda-specs module (the old obda site had the module wrong). I ran git cvsimport on that after I copied the CVS repo to my laptop, so it's now on github: https://github.com/OBF/OBDA We could probably work on updates from there. chris On Nov 18, 2011, at 7:45 AM, Fields, Christopher J wrote: > On Nov 18, 2011, at 5:21 AM, Peter Cock wrote: > >> On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote: >>> ... >>> And which are the information you want to extract once you >>> have your index ? >>> >> >> Biopython and BioPerl have their SeqIO parsers hooked up >> to indexing code. This means you can access a record via its >> ID, and it is parsed for you on demand - just like if you'd >> iterated over the file in order parsing the records one by one. >> >> Biopython (not sure about BioPerl) can also just fetch the raw >> text of that record. > > Re: BioPerl, I'm not sure about the OBDA implementations, but I know the older Bio::Index modules allow this. I would be surprised if the OBDA-specific code didn't, but adding this should be easy. > > chris From cjfields at illinois.edu Thu Dec 15 19:07:44 2011 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 15 Dec 2011 13:07:44 -0600 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

Message-ID: <4EEA4580.7010901@illinois.edu> Hamish, Reason I ask, the various Bio* and EMBOSS projects have a share of old (and possibly duplicate) data examples, but it might be nice to standardize on a common set of records, simply for less data duplication. As an example, have a git repo of purely data or links to data that we could 'git submodule' in for code distribution, release, and testing purposes, but that wouldn't bloat the code repository. chris On 12/15/2011 12:01 PM, Hamish McWilliam wrote: > Hi Chris, > >> That might be the best source to pull from. Does it archive old file examples (such as older SwissProt/GenBank/EMBL)? > EDAM itself does not store entry data, and at the moment it does not > describe the changes to formats over time, although I'm sure this > could be added along with links to sample entries in the various data > archives. > > If you only need a few sample entries, see the appropriate database archive: > > - EMBL-Bank Sequence Version Archive (EMBL-SVA): > http://www.ebi.ac.uk/cgi-bin/sva/sva.pl. > E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go > - UniProtKB Sequence/Annotation Version Archive (UniSave): > http://www.ebi.ac.uk/uniprot/unisave/ > E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go > - NCBI Entrez Revision History. > E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist > > If you need more entries... > > For Swiss-PROT and UniProtKB old versions of the data are available on > the FTP sites, for example from EMBL-EBI: > - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/ > - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/ > > For GenBank, Don Gilbert collected various old releases a while back > (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html), > these are available via the BioMirrors (http://www.bio-mirror.net/). > NCBI may also be able to provide old releases on request. 
> > For EMBL-Bank old releases can be made available on request, contact > ENA (http://www.ebi.ac.uk/ena/about/contact) for more information. > > All the best, > > Hamish > >> chris >> >> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote: >> >>> I just checked with Jon and he was happy to forward this back to >>> the list, and also added a couple of URLs that I'd asked about: >>> >>> http://bioportal.bioontology.org/ontologies/44600 >>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM >>> >>> Peter >>> >>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote: >>>> Hi Peter (and Peter) >>>> >>>> Just a quick note to say that all (well, nearly all) common bioinformatics data formats are >>>> catalogued in the EDAM ontology: >>>> >>>> http://sourceforge.net/projects/edamontology/files >>>> http://edamontology.sourceforge.net/ >>>> >>>> OK - there's bound to be some we've missed :) >>>> >>>> Anyhow, I thought it might help to structure any effort to document data formats (an effort which >>>> I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format" >>>> definitions is a link to the format specification, or failing that, an example. >>>> >>>> Cheers both >>>> >>>> Jon > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From p.j.a.cock at googlemail.com Fri Dec 16 12:11:51 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Dec 2011 12:11:51 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

<6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu> Message-ID: On Thu, Dec 15, 2011 at 10:01 PM, Hamish McWilliam wrote: >> Just a quick update on this: the old OBDA specs were still in CVS in >> the obda-specs module (the old obda site had the module wrong). >> I ran git cvsimport on that after I copied the CVS repo to my laptop, >> so it's now on github: >> >> https://github.com/OBF/OBDA >> >> We could probably work on updates from there. > > At the risk of derailing the current thread... a few comments on the > "modules" in the old OBDA: Well, given the broad title of OBDA redux, why not? > - BioCorba: while CORBA may live on in some embedded applications it > has mostly been replaced by SOAP and REST web services. I suspect > there are few copies of the BioCorba IDLs surviving today. Possibly of > historic interest, but since it doesn't actually include the IDLs it > is not really of any use. As far as I know, BioCorba is defunct. > - biofetch: originally implemented in EBI's dbfetch, also implemented > by BioRuby as biofetch which had a few extensions. EBI's dbfetch has > since been reimplemented and attempts to be compatible but only > provides partial support along with various extensions, including > those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp. > I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure > of the current status for BioJava and BioPython. Current Biopython doesn't have anything for this, but I would probably want to implement this as a client not a server. > - BioSQL: as you all know over at http://www.biosql.org/. The document > should probably be updated to point there. Agreed, done: https://github.com/OBF/OBDA/commit/5798f0b4a0e3b7fd0595e0ab3017d3afdda53549 > - bioindex: the flat-file and BDB indexing formats. To which the > SQLite option will be added? Basically yes. > - naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide > possible replacements when required.
This also has implications for the bioindex code as we need to specify the file format being indexed (e.g. FASTA or GenBank). > - bioregistry: database discovery and meta-data. From having tried to > implement this, the bioregistry is too limited in the available > meta-data to be very useful, especially when it comes to data format > handling. Compare with the database definitions in EMBOSS > (http://emboss.sourceforge.net/docs/themes/Databases.html) and the > dbfetch meta-data > (http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information). There was some partial code for this in Biopython, but it was deprecated and removed some time ago. > - XEmbl: REST and SOAP access to EMBL-Bank entries in XML. > The EBI's XEmbl service was replaced by the dbfetch > (http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch > (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services, > since these provide roughly the same functionality with wider data > format support. Presumably the XML format for EMBL is now one of the INSDC formats also used at the NCBI? In any case, that whole folder is purely describing an (obsolete) EBI service, so can we just delete it? > Since I've been attempting to get dbfetch to support the biofetch and > bioregistry specifications, my interest is much more at the web > service end of things. I can certainly see options for using the > current alternatives in dbfetch and EMBOSS to revise the > specifications for biofetch and bioregistry. > > Hamish How does biofetch/bioregistry compare to DAS? Separately, I suggest we rename the OBDA/preamble.txt file to README (or README.*) so it gets shown in GitHub, and then update it following this discussion with some context (like dates and the current status of the different parts). We should probably make the old OBDA CVS read only now.
Peter From hamish.mcwilliam at bioinfo-user.org.uk Thu Dec 15 20:36:35 2011 From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam) Date: Thu, 15 Dec 2011 20:36:35 +0000 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <4EEA4580.7010901@illinois.edu> References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

<4EEA4580.7010901@illinois.edu> Message-ID: Chris, > Reason I ask, the various Bio* and EMBOSS projects have a share of old (and > possibly duplicate) data examples, but it might be nice to standardize on a > common set of records, simply for less data duplication. > > As an example, have a git repo of purely data or links to data that we could > 'git submodule' in for code distribution, release, and testing purposes, but > that wouldn't bloat the code repository. It is debatable if version control is necessary for this: each sample entry is a snapshot obtained from the data source, so there is only ever one version of each file, and a file for each format version is required for testing purposes anyway. So for a test data archive plain old FTP would be sufficient, with fetch scripts if required. Since the historic data is most useful for compatibility testing and the archives all have web services attached (e.g. EMBL-SVA and UniSave are available through dbfetch, see http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases), fetching the required entries when/if necessary seems a more appropriate approach. For example I doubt if many people need to test compatibility with the Swiss-PROT 9.0 (1988) entry format (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1), so there is little need to duplicate this data in every developer's set-up. In contrast users expect everything to be tested with the current entry format, also available from UniSave (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but the only way to be sure it is the current data is to fetch it from one of the primary sources. Given that the major databases implement versioning schemes, fetching specific versions of entries is simple. For less well known databases, databases which are no longer active, or application results (the evils of NCBI BLAST output parsing come to mind), having a standard set would be useful.
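To illustrate how little is needed for the versioned fetches mentioned above, here is a minimal sketch that builds dbfetch URLs from an identifier list. Only the URL pattern is taken from the examples quoted in this message; the helper name dbfetch_url, its parameters and the SAMPLE_ENTRIES list are hypothetical, and the service may well have moved or changed since 2011, so treat this as an assumption-laden sketch rather than a description of the dbfetch API.

```python
from urllib.parse import urlencode

# Base URL as quoted in this thread (2011); the service may have moved since.
DBFETCH_BASE = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, version=None, fmt=None, style=None):
    """Build a dbfetch query URL for one entry (hypothetical helper)."""
    # Versioned identifiers append ".N" to the accession, following the
    # P00002.1 example quoted above.
    ident = entry_id if version is None else f"{entry_id}.{version}"
    params = [("db", db), ("id", ident)]
    if fmt is not None:
        params.append(("format", fmt))
    if style is not None:
        params.append(("style", style))
    return f"{DBFETCH_BASE}?{urlencode(params)}"


# Representative entries kept as an identifier list rather than as files.
# Each tuple is (database, accession, version), with None meaning "current".
SAMPLE_ENTRIES = [
    ("unisave", "P00002", 1),     # version 1, the old entry format cited above
    ("unisave", "P00002", None),  # whatever the archive currently serves
]

urls = [dbfetch_url(db, acc, ver) for db, acc, ver in SAMPLE_ENTRIES]
for url in urls:
    print(url)
```

A test suite could ship only the identifier list and a small fetch script along these lines, and download (and optionally cache) the entries when the tests are run.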
However, in the case of the databases, the data resources may provide appropriate mechanisms to fetch this data, so building fixed sets may be unnecessary for anything other than caching. The exception is large data sets, in which case you are going to want them to be optional anyway, and you certainly don't want to put them under version control. So for the databases with archives I would tend towards just keeping identifier lists for representative entries and fetching the required entry flavours when/if required. This prevents duplication, ensures current data tests are against the current data, and provides the option of shipping the fetch script instead of the data for cases where copyright licensing is an issue. For the rest a collection of static files on an FTP/web site would have it covered. Hamish > On 12/15/2011 12:01 PM, Hamish McWilliam wrote: >> >> Hi Chris, >> >>> That might be the best source to pull from. Does it archive old file >>> examples (such as older SwissProt/GenBank/EMBL)? >> >> EDAM itself does not store entry data, and at the moment it does not >> describe the changes to formats over time, although I'm sure this >> could be added along with links to sample entries in the various data >> archives. >> >> If you only need a few sample entries, see the appropriate database >> archive: >> >> - EMBL-Bank Sequence Version Archive (EMBL-SVA): >> http://www.ebi.ac.uk/cgi-bin/sva/sva.pl. >> E.g. http://www.ebi.ac.uk/cgi-bin/sva/sva.pl/?query=V00077&search=Go >> - UniProtKB Sequence/Annotation Version Archive (UniSave): >> http://www.ebi.ac.uk/uniprot/unisave/ >> E.g. http://www.ebi.ac.uk/uniprot/unisave/?query=P00002&search=Go >> - NCBI Entrez Revision History. >> E.g. http://www.ncbi.nlm.nih.gov/nuccore/V00077?report=girevhist >> >> If you need more entries...
>> >> For Swiss-PROT and UniProtKB old versions of the data are available on >> the FTP sites, for example from EMBL-EBI: >> - ftp://ftp.ebi.ac.uk/pub/databases/uniprot/previous_releases/ >> - ftp://ftp.ebi.ac.uk/pub/databases/swissprot/sw_old_releases/ >> >> For GenBank, Don Gilbert collected various old releases a while back >> (http://www.bio.net/bionet/mm/genbankb/2006-October/000251.html), >> these are available via the BioMirrors (http://www.bio-mirror.net/). >> NCBI may also be able to provide old releases on request. >> >> For EMBL-Bank old releases can be made available on request, contact >> ENA (http://www.ebi.ac.uk/ena/about/contact) for more information. >> >> All the best, >> >> Hamish >> >>> chris >>> >>> On Nov 30, 2011, at 8:49 AM, Peter Cock wrote: >>> >>>> I just checked with Jon and he was happy to forward this back to >>>> the list, and also added a couple of URLs that I'd asked about: >>>> >>>> http://bioportal.bioontology.org/ontologies/44600 >>>> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM >>>> >>>> Peter >>>> >>>> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote: >>>>> >>>>> Hi Peter (and Peter) >>>>> >>>>> Just a quick note to say that all (well, nearly all) common >>>>> bioinformatics data formats are >>>>> catalogued in the EDAM ontology: >>>>> >>>>> http://sourceforge.net/projects/edamontology/files >>>>> http://edamontology.sourceforge.net/ >>>>> >>>>> OK - there's bound to be some we've missed :) >>>>> >>>>> Anyhow, I thought it might help to structure any effort to document >>>>> data formats (an effort which >>>>> I wholeheartedly approve of by the way). One thing I'd like to add to >>>>> the EDAM "format" >>>>> definitions is a link to the format specification, or failing that, an >>>>> example.
>>>>> >>>>> Cheers both >>>>> >>>>> Jon -- ---- "Saying the internet has changed dramatically over the last five years is cliché - the internet is always changing dramatically" - Craig Labovitz, Arbor Networks. From hamish.mcwilliam at bioinfo-user.org.uk Thu Dec 15 22:01:17 2011 From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam) Date: Thu, 15 Dec 2011 22:01:17 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: <6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu> References:

<6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu> Message-ID: > Just a quick update on this: the old OBDA specs were still in CVS in the obda-specs module (the old obda site had the module wrong). I ran git cvsimport on that after I copied the CVS repo to my laptop, so it's now on github: > > https://github.com/OBF/OBDA > > We could probably work on updates from there. At the risk of derailing the current thread... a few comments on the "modules" in the old OBDA: - BioCorba: while CORBA may live on in some embedded applications it has mostly been replaced by SOAP and REST web services. I suspect there are few copies of the BioCorba IDLs surviving today. Possibly of historic interest, but since it doesn't actually include the IDLs it is not really of any use. - biofetch: originally implemented in EBI's dbfetch, also implemented by BioRuby as biofetch which had a few extensions. EBI's dbfetch has since been reimplemented and attempts to be compatible but only provides partial support along with various extensions, including those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp. I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure of the current status for BioJava and BioPython. - BioSQL: as you all know over at http://www.biosql.org/. The document should probably be updated to point there. - bioindex: the flat-file and BDB indexing formats. To which the SQLite option will be added? - naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide possible replacements when required. - bioregistry: database discovery and meta-data. From having tried to implement this, the bioregistry is too limited in the available meta-data to be very useful, especially when it comes to data format handling. Compare with the database definitions in EMBOSS (http://emboss.sourceforge.net/docs/themes/Databases.html) and the dbfetch meta-data (http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information).
- XEmbl: REST and SOAP access to EMBL-Bank entries in XML. The EBI's XEmbl service was replaced by the dbfetch (http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services, since these provide roughly the same functionality with wider data format support. Since I've been attempting to get dbfetch to support the biofetch and bioregistry specifications, my interest is much more at the web service end of things. I can certainly see options for using the current alternatives in dbfetch and EMBOSS to revise the specifications for biofetch and bioregistry. Hamish > > chris > > On Nov 18, 2011, at 7:45 AM, Fields, Christopher J wrote: > >> On Nov 18, 2011, at 5:21 AM, Peter Cock wrote: >> >>> On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote: >>>> ... >>>> And which are the information you want to extract once you >>>> have your index ? >>>> >>> >>> Biopython and BioPerl have their SeqIO parsers hooked up >>> to indexing code. This means you can access a record via its >>> ID, and it is parsed for you on demand - just like if you'd >>> iterated over the file in order parsing the records one by one. >>> >>> Biopython (not sure about BioPerl) can also just fetch the raw >>> text of that record. >> >> Re: BioPerl, I'm not sure about the OBDA implementations, but I know the older Bio::Index modules allow this. I would be surprised if the OBDA-specific code didn't, but adding this should be easy. >> >> chris From cjfields at illinois.edu Fri Dec 16 21:12:45 2011 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 16 Dec 2011 15:12:45 -0600 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>

<4EEA4580.7010901@illinois.edu> Message-ID: <4EEBB44D.4050905@illinois.edu> On 12/15/2011 02:36 PM, Hamish McWilliam wrote: > Chris, > >> Reason I ask, the various Bio* and EMBOSS projects have a share of old (and >> possibly duplicate) data examples, but it might be nice to standardize on a >> common set of records, simply for less data duplication. >> >> As an example, have a git repo of purely data or links to data that we could >> 'git submodule' in for code distribution, release, and testing purposes, but >> that wouldn't bloat the code repository. > It is debatable if version control is necessary for this, each sample > entry is a snapshot obtained from the data source thus there is only > ever one version of each file, and a file for each format version is > required for testing purposes anyway. So for a test data archive plain > old FTP would be sufficient, with fetch scripts if required. I agree; that or a combination of the two where appropriate. Caveats below. > Since the historic data is most useful for compatibility testing and > the archives all have web services attached (e.g. EMBL-SVA and UniSave > are available through dbfetch, see > http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases) fetching > the required entries when/if necessary seems a more appropriate > approach. For example I doubt if many people need to test > compatibility with the Swiss-PROT 9.0 (1988) entry format > (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002.1), > so there is little need to duplicate this data in every developers > set-up. In contrast users expect everything to be tested with the > current entry format, also available from UniSave > (http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=unisave&id=P00002), but > the only way to be sure it is the current data is to fetch it from one > of the primary sources. > > Given the major databases implement versioning schemes fetching > specific versions of entries is simple. 
For less well known databases, > databases which are no longer active or application results (the evils > of NCBI BLAST output parsing come to mind) having a standard set would > be useful. However in the case of the databases the data resources may > provide appropriate mechanisms to fetch this data, so building fixed > sets may be unnecessary for anything other than caching. Unless you > are talking large data sets, in which case you are going to want them > to be optional anyway, and you certainly don't want to put them under > version control. The key word of concern there is 'may'; I hesitate to count on some web services being available indefinitely. A recent talk I attended (I believe at the last Galaxy Conference) mentioned the percentage of published web services that keep persistent URLs over the years (e.g. still found in the same location or redirected to a new one). The number is depressingly low; I want to say 25-40%, but I'm not sure. Even for major databases, we have had web service apps move or disappear; I just fixed one related to NCBI revision history. In most cases there are notifications, but not always. From that perspective alone, I find static files to be a nice fallback. > So for the databases with archives I would tend towards just keeping > identifier lists for representative entries and fetching the required > entry flavours when/if required. This prevents duplication, ensures > current data tests are against the current data, and provides the > option of shipping the fetch script instead of the data for cases > where copyright licensing is an issue. For the rest a collection of > static files on an FTP/web site would have it covered. > > Hamish It's not an unreasonable expectation that some parsers would need support for both old and new formats ('old' being something within a sane time period). Regardless, most (all?)
OBF projects have been around long enough that this isn't often an issue, but there are data formats that evolve rapidly (NCBI BLAST text being an example). Even that is a bit of an exception, as NCBI has long recommended against relying on the text output being stable for parsing, since they reserve the right to make changes that may break things (so for users, caveat emptor). chris
From hamish.mcwilliam at bioinfo-user.org.uk Tue Dec 20 11:27:12 2011 From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam) Date: Tue, 20 Dec 2011 11:27:12 +0000 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <4EEBB44D.4050905@illinois.edu> References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>