From hamish.mcwilliam at bioinfo-user.org.uk Thu Jan 12 11:49:25 2012
From: hamish.mcwilliam at bioinfo-user.org.uk (Hamish McWilliam)
Date: Thu, 12 Jan 2012 16:49:25 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References: <6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu>
Message-ID:

Hi Peter,

On 16 December 2011 12:11, Peter Cock wrote:
> On Thu, Dec 15, 2011 at 10:01 PM, Hamish McWilliam wrote:
>>> Just a quick update on this: the old OBDA specs were still in CVS in
>>> the obda-specs module (the old obda site had the module wrong).
>>> I ran git cvsimport on that after I copied the CVS repo to my laptop,
>>> so it's now on github:
>>>
>>> https://github.com/OBF/OBDA
>>>
>>> We could probably work on updates from there.
>>
>> At the risk of derailing the current thread... a few comments on the
>> "modules" in the old OBDA:
>
> Well, given the broad title of OBDA redux, why not?

Exactly :-)

>> - BioCorba: while CORBA may live on in some embedded applications it
>> has mostly been replaced by SOAP and REST web services. I suspect
>> there are few copies of the BioCorba IDLs surviving today. Possibly of
>> historic interest, but since it doesn't actually include the IDLs it
>> is not really of any use.
>
> As far as I know, BioCorba is defunct.
>
>> - biofetch: originally implemented in EBI's dbfetch, also implemented
>> by BioRuby as biofetch which had a few extensions. EBI's dbfetch has
>> since been reimplemented and attempts to be compatible but only
>> provides partial support along with various extensions, including
>> those from BioRuby. See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.
>> I'm aware of client support in BioPerl, BioRuby and EMBOSS, not sure
>> of the current status for BioJava and BioPython.
>
> Current Biopython doesn't have anything for this, but I would probably
> want to implement this as a client not a server.

While there is an example implementation of a biofetch server in BioPerl
(http://search.cpan.org/~cjfields/BioPerl/examples/db/dbfetch), it is the
client implementations that have been the main focus in the various
projects. In BioPerl: Bio::Biblio, Bio::DB::BioFetch, Bio::DB::EMBL,
Bio::DB::RefSeq and Bio::DB::SwissProt use either dbfetch or biofetch; in
BioRuby: Bio::Fetch provides an interface to biofetch servers, including
the EBI's dbfetch.

>> - BioSQL: as you all know over at http://www.biosql.org/. The document
>> should probably be updated to point there.
>
> Agreed, done:
> https://github.com/OBF/OBDA/commit/5798f0b4a0e3b7fd0595e0ab3017d3afdda53549
>
>> - bioindex: the flat-file and BDB indexing formats. To which the
>> SQLite option will be added?
>
> Basically yes.
>
>> - naming: obsolete URN scheme. Various ontologies (e.g. EDAM) provide
>> possible replacements when required.
>
> This also has implications for the bioindex code as we need to
> specify the file format being indexed (e.g. FASTA or GenBank).

And possibly a layer of semantics for the database and data in the
database.

>> - bioregistry: database discovery and meta-data. From having tried to
>> implement this, the bioregistry is too limited in the available
>> meta-data to be very useful, especially when it comes to data format
>> handling. Compare with the database definitions in EMBOSS
>> (http://emboss.sourceforge.net/docs/themes/Databases.html) and the
>> dbfetch meta-data
>> (http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp#Meta-information).

For the current EMBOSS documentation for the database definitions see
http://emboss.open-bio.org/html/adm/ch04s01.html.

> There was some partial code for this in Biopython, but it was
> deprecated and removed some time ago.

While the bioregistry stuff is conceptually quite useful...
A common format in which data services advertise the data they provide
and the interfaces for accessing it has obvious benefits for client
software. So does the notion of a site describing its own services in a
standardized way, so that clients and crawlers can discover the
available data sources at runtime without the inherent problems that
centralized repositories present. But the current specification is too
limited, since it does not allow for the specification of data formats,
or of database and data semantics. Use of a richer format and
convergence with the equivalent configuration files in EMBOSS could
revive the concept, and make implementing the client support worthwhile
again.

>> - XEmbl: REST and SOAP access to EMBL-Bank entries in XML.
>> The EBI's XEmbl service was replaced by the dbfetch
>> (http://www.ebi.ac.uk/Tools/dbfetch/) and WSDbfetch
>> (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) services,
>> since these provide roughly the same functionality with wider data
>> format support.
>
> Presumably the XML format for EMBL is now one of the INSDC
> formats also used at the NCBI? In any case, that whole folder
> is purely describing an (obsolete) EBI service, so can we just
> delete it?

The XML formats were not described as part of the XEmbl specification,
but instead were external XML formats (BSML and Agave XML) which have
not been adopted. The current XML formats for the INSDC member databases
are in two categories:

1. INSD XML (http://insdc.org/xmlstatus.html)
2. Member database specific formats, for example ENA EMBL-Bank XML (see
   http://www.ebi.ac.uk/ena/about/embl_bank_format).

The XEmbl service specification itself is obsolete and can be removed.

>> Since I've been attempting to get dbfetch to support the biofetch and
>> bioregistry specifications, my interest is much more at the web
>> service end of things.
>> I can certainly see options for using the
>> current alternatives in dbfetch and EMBOSS to revise the
>> specifications for biofetch and bioregistry.
>>
>> Hamish
>
> How does biofetch/bioregistry compare to DAS?

biofetch specifies an HTTP GET based interface to data resources. The
databases and data formats available depend on the specific
implementation, and will generally include the main distribution formats
for the database and commonly used formats for the specific type of data
involved. For example, EBI's dbfetch provides EMBL-Bank data in:

- EMBL flatfile format
- EMBL XML
- INSD XML
- Fasta sequence format
- SeqXML

bioregistry describes available databases at a site, providing details
of how to talk to the data source and the relevant parameters required
to access a specific database. For example, for EMBL-Bank via dbfetch:

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/Tools/dbfetch/dbfetch
dbname=embl

DAS is a protocol and set of data formats focused around delivery of
sequence and sequence feature data. A DAS server provides meta-data
about its capabilities and the data available through it, but knows
nothing about other DAS servers. The DAS Registry
(http://www.dasregistry.org/) provides information about registered DAS
servers and addresses this limitation, but is centralized and DAS
specific. Alternative registries (see
http://www.ebi.ac.uk/Tools/webservices/tutorials/05_registries) address
the service type limitation, but are still centralized resources.

DAS and biofetch are complementary: DAS provides granularity and mash-up
capabilities, while biofetch provides original and common data formats.
bioregistry appears to be unused currently, but aims to provide a format
for sharing information about data services.
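
To make the relationship between the two specifications concrete, here is a minimal Python sketch (not taken from any of the Bio* projects) that reads the bioregistry entry quoted above and turns it into a biofetch style HTTP GET URL. The query parameter names (db, id, format, style) are an assumption based on the dbfetch syntax page linked earlier in this thread, the helper names load_registry and biofetch_url are hypothetical, and the accession J00231 is only a placeholder. Nothing is sent over the network; the sketch only builds the URL.

```python
import configparser
from urllib.parse import urlencode

# Registry entry copied from the example in the message above.
REGISTRY_TEXT = """\
[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/Tools/dbfetch/dbfetch
dbname=embl
"""


def load_registry(text):
    """Parse bioregistry-style INI text into {section: {key: value}}.

    Hypothetical helper: it relies on the entry being INI-like, as in
    the example, and does not implement the full bioregistry spec.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def biofetch_url(entry, entry_id, fmt="default", style="raw"):
    """Build a biofetch GET URL from one registry entry.

    The parameter names db/id/format/style are assumed, not confirmed
    by this thread.
    """
    if entry.get("protocol") != "biofetch":
        raise ValueError("registry entry does not use the biofetch protocol")
    query = urlencode(
        {"db": entry["dbname"], "id": entry_id, "format": fmt, "style": style}
    )
    return entry["location"] + "?" + query


registry = load_registry(REGISTRY_TEXT)
# "J00231" is a placeholder accession used purely for illustration.
url = biofetch_url(registry["embl"], "J00231", fmt="fasta")
print(url)
```

The registry entry carries no information about which formats the server offers, so the caller has to know in advance that "fasta" is a sensible value, which is exactly the limitation discussed above.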
Convergence of the bioregistry format with the database configurations
in EMBOSS and with service meta-data such as that provided by dbfetch
would simplify client development and the maintenance of database
configurations in supporting systems.

> Separately, I suggest we rename the OBDA/preamble.txt
> file to README (or README.*) so it gets shown in GitHub,
> and then update it following this discussion with some
> context (like dates, current status of the different parts).

Sounds good to me.

> We should probably make the old OBDA CVS read only now.

I assume a pointer has been added to the contents of the OBDA CVS to
point to the new location on github, in which case making it read only
would be sensible.

Hamish

From pedrolopes at ua.pt Mon Jan 16 06:54:43 2012
From: pedrolopes at ua.pt (Pedro Lopes)
Date: Mon, 16 Jan 2012 11:54:43 +0000
Subject: [Open-bio-l] [SWAT4LS] Sponsorship Opportunity for International School on Semantic Web Applications & Tools for Life Sciences
Message-ID:

*Dear sirs,

IEETA/University of Aveiro (http://www.ieeta.pt), in cooperation with
the SWAT4LS group (http://www.swat4ls.org/), will host the
"International School on Semantic Web Applications and Tools for Life
Sciences" between May 2nd and 5th. This will be a scientific event
focused on the practical learning of new technologies and strategies
associated with the Semantic Web development paradigm. This includes
ontology modeling and creation, performance enhancements, data
integration, and LinkedData services, amongst many others. This event
will gather contributions and active participation from research staff,
international project leaders, private companies and other area experts.

For this event, whose access is limited to 40 seats, the Organizing
Committee intends to offer an advanced knowledge acquisition experience,
based on the interactions, in a privileged get-together environment,
between the scientific and business communities.

The Organizing Committee would like to have your company's official
sponsorship, which would undoubtedly contribute to this event's
credibility as a discussion forum for the various stakeholders involved
in the Portuguese innovative biomedical technologies scene.

Your sponsorship can take the form of distinct contributions, namely
monetary support for meals (lunches, coffee breaks, gala dinner) or
speakers (travelling, accommodation). Additionally, we can also schedule
a presentation slot for your company, where you can highlight your
products, introducing them to a highly qualified and interested
audience. You can find further information regarding available
sponsorship in the attached file (SWAT4LSAveiro_Sponsorship).

If you wish to sponsor this event, your company's logo will be included
in all event dissemination materials, namely posters, web site, mail and
email, for which we ask for your explicit authorization as well as image
usage rules.*

All contacts to conference organizers should be forwarded to:

SWAT4LS Aveiro 2012
A/c Pedro Lopes
IEETA
Campus Universitário de Santiago
3810-193 Aveiro
email: schools at swat4ls.org
web: http://www.swat4ls.org/schools/aveiro2012/
tel. 234 370 500

We look forward to hearing from you,

Best regards,

Pedro @pedrolopes
Bioinformatics Research & Development | http://bioinformatics.ua.pt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SWAT4LSAveiro_Promo.pdf
Type: application/pdf
Size: 336956 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SWAT4LSAveiro_Sponsorship.pdf
Type: application/pdf
Size: 327284 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Fri Jan 20 05:46:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jan 2012 10:46:18 +0000
Subject: [Open-bio-l] NCBI adoption of AGP v2.0 and new qualifiers in GenBank/EMBL
Message-ID:

Dear all,

I just spotted this via the @NCBI twitter feed,
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_spec_change.shtml

In addition to the NCBI switch from AGP v1.1 to v2.0, the INSDC have
recently added a new feature type called "assembly_gap", and the
associated qualifiers "gap_type" and "linkage_evidence" to the INSDC
Feature Table Definitions. Quoting from version 10.0, dated Dec 2011
http://www.insdc.org/documents/feature_table.html#7.2

> Feature Key           assembly_gap
>
>
> Definition            gap between two components of a CON record that is
>                       part of a genome assembly;
>
> Mandatory qualifiers  /estimated_length=unknown or <integer>
>                       /gap_type="TYPE"
>                       /linkage_evidence="TYPE" (Note: Mandatory only if the
>                       /gap_type is "within scaffold" or "repeat within
>                       scaffold". If there are multiple types of linkage_evidence
>                       they will appear as multiple /linkage_evidence="TYPE"
>                       qualifiers. For all other types of assembly_gap
>                       features, use of the /linkage_evidence qualifier is
>                       invalid.)
>
> Comment               the location span of the assembly_gap feature for an
>                       unknown gap is 100 bp, with the 100 bp indicated as
>                       100 "n"'s in sequence.
>

i.e. DDBJ, ENA & GenBank flat-files will start to use the "assembly_gap"
features to display information derived from version 2.0 AGP files from
10th Feb 2012. Probably this will affect the XML variants as well.

Unless any of the parsers/writers for GenBank or EMBL flat files use a
white list approach, the new feature key and qualifiers shouldn't cause
a problem.

Peter
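
For parser authors who do want to validate the new feature, the qualifier rules quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as quoted from the feature table, not code from Biopython or any other project: the function name check_assembly_gap and the dict-of-lists representation of qualifiers are hypothetical, and the example values "paired-ends" and "centromere" are used for illustration only.

```python
# /gap_type values for which /linkage_evidence is mandatory, per the
# quoted feature table text.
LINKAGE_REQUIRED = {"within scaffold", "repeat within scaffold"}


def check_assembly_gap(qualifiers):
    """Return a list of problems found in one assembly_gap feature.

    `qualifiers` maps a qualifier name (without the leading slash) to a
    list of its values. This representation is an assumption made for
    the sketch, not the data model of any particular parser.
    """
    problems = []

    # /estimated_length and /gap_type are listed as mandatory.
    for name in ("estimated_length", "gap_type"):
        if not qualifiers.get(name):
            problems.append("missing /" + name)

    gap_types = qualifiers.get("gap_type", [])
    has_linkage = bool(qualifiers.get("linkage_evidence"))
    needs_linkage = any(g in LINKAGE_REQUIRED for g in gap_types)

    # Mandatory only for the two "within scaffold" gap types...
    if needs_linkage and not has_linkage:
        problems.append("missing /linkage_evidence")
    # ...and invalid for every other gap type.
    if has_linkage and gap_types and not needs_linkage:
        problems.append("/linkage_evidence not allowed for this /gap_type")

    return problems


ok = check_assembly_gap({
    "estimated_length": ["unknown"],
    "gap_type": ["within scaffold"],
    "linkage_evidence": ["paired-ends"],  # illustrative value
})
missing = check_assembly_gap({
    "estimated_length": ["unknown"],
    "gap_type": ["within scaffold"],
})
print(ok, missing)
```

A parser that simply passes unknown feature keys and qualifiers through unchanged needs none of this; the check only matters for code that validates or writes records.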