From biopython at maubp.freeserve.co.uk  Tue Nov 20 14:36:34 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Nov 2007 19:36:34 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
Message-ID: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>

Dear all,

I'm one of the Biopython developers.  I've recently got going with
BioSQL and have been getting to grips with the Biopython BioSQL
interface.  I'm aware that we need to try and be consistent with
BioPerl and BioJava, so I'd like to pose my first question related to
that.

When loading GenBank records, many features have db_xref qualifiers,
e.g. from a random CDS feature in E. coli K12:

/db_xref="ASAP:1309"
/db_xref="GI:16128366"
/db_xref="ECOCYC:EG10213"
/db_xref="GeneID:945313"

Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
"GeneID" before recording these entries in the seqfeature_dbxref
and dbxref tables.  For example, "GI" becomes "GeneIndex".
Biopython's current mapping is as follows:

# Dictionary of database types, keyed by GenBank db_xref abbreviation
db_dict = {'GeneID': 'Entrez',
           'GI': 'GeneIndex',
           'COG': 'COG',
           'CDD': 'CDD',
           'DDBJ': 'DNA Databank of Japan',
           'Entrez': 'Entrez',
           'GeneIndex': 'GeneIndex',
           'PUBMED': 'PubMed',
           'taxon': 'Taxon',
           'ATCC': 'ATCC',
           'ISFinder': 'ISFinder',
           'GOA': 'Gene Ontology Annotation',
           'ASAP': 'ASAP',
           'PSEUDO': 'PSEUDO',
           'InterPro': 'InterPro',
           'GEO': 'Gene Expression Omnibus',
           'EMBL': 'EMBL',
           'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
           'ECOCYC': 'EcoCyc',
           'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
           }

In my testing, I've found several GenBank db_xref abbreviations for
which we don't have a mapping defined, such as "LocusID", "dbSNP",
"MGD", "MIM", or from an EMBL file, "REMTREMBL".

I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
similar mapping in their BioSQL code (or GenBank parser), so that
Biopython can follow your example.

Thank you,

Peter

P.S.
See also Biopython bug 2405
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

From arareko at campus.iztacala.unam.mx  Thu Nov 22 11:37:24 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 22 Nov 2007 10:37:24 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <4745B044.5090102@campus.iztacala.unam.mx>

Hi Peter,

In BioPerl, there's no such mapping for db_xref's that I'm aware of.
Each parser handles db_xref records on its own. Take a look at the
Bio::SeqIO::genbank code, inside the next_seq() method for example:

http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup

Regards,
Mauricio.

Peter wrote:
> Dear all,
>
> I'm one of the Biopython developers.  I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface.  I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
>
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
>
> /db_xref="ASAP:1309"
> /db_xref="GI:16128366"
> /db_xref="ECOCYC:EG10213"
> /db_xref="GeneID:945313"
>
> Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before recording these entries in the seqfeature_dbxref
> and dbxref tables.  For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
>
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
>
> In my testing, I've found several GenBank db_xref abbreviations for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
>
> Thank you,
>
> Peter
>
> P.S.
> See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu  Thu Nov 22 19:42:12 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 22 Nov 2007 18:42:12 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <4745B044.5090102@campus.iztacala.unam.mx>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
Message-ID: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>

I think SeqIO checks the name for parsing reasons only, in cases where
the format changes based on the source (such as GenPept DBSOURCE
data).  I don't think we go beyond that in BioPerl, probably b/c
modifying or expanding names for data persistence would lead to
volatile coding issues (i.e. consistency between parsers, constant
updating to cover new crossrefs, etc).

I would definitely suggest retaining the original DB as it appears in
the dbxref for consistency/sanity; if needed return expanded names
using a different method if they are designated.

chris

On Nov 22, 2007, at 10:37 AM, Mauricio Herrera Cuadra wrote:

> Hi Peter,
>
> In BioPerl, there's no such mapping for db_xref's that I'm aware of.
> Each parser handles db_xref records on its own. Take a look at the
> Bio::SeqIO::genbank code, inside the next_seq() method for example:
>
> http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/
> Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup
>
> Regards,
> Mauricio.
>
> Peter wrote:
>> Dear all,
>>
>> I'm one of the Biopython developers.
>> I've recently got going with
>> BioSQL and have been getting to grips with the Biopython BioSQL
>> interface.  I'm aware that we need to try and be consistent with
>> BioPerl and BioJava, so I'd like to pose my first question related to
>> that.
>>
>> When loading GenBank records, many features have db_xref qualifiers,
>> e.g. from a random CDS feature in E. coli K12:
>>
>> /db_xref="ASAP:1309"
>> /db_xref="GI:16128366"
>> /db_xref="ECOCYC:EG10213"
>> /db_xref="GeneID:945313"
>>
>> Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
>> "GeneID" before recording these entries in the seqfeature_dbxref
>> and dbxref tables.  For example, "GI" becomes "GeneIndex".
>> Biopython's current mapping is as follows:
>>
>> # Dictionary of database types, keyed by GenBank db_xref abbreviation
>> db_dict = {'GeneID': 'Entrez',
>>            'GI': 'GeneIndex',
>>            'COG': 'COG',
>>            'CDD': 'CDD',
>>            'DDBJ': 'DNA Databank of Japan',
>>            'Entrez': 'Entrez',
>>            'GeneIndex': 'GeneIndex',
>>            'PUBMED': 'PubMed',
>>            'taxon': 'Taxon',
>>            'ATCC': 'ATCC',
>>            'ISFinder': 'ISFinder',
>>            'GOA': 'Gene Ontology Annotation',
>>            'ASAP': 'ASAP',
>>            'PSEUDO': 'PSEUDO',
>>            'InterPro': 'InterPro',
>>            'GEO': 'Gene Expression Omnibus',
>>            'EMBL': 'EMBL',
>>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>>            'ECOCYC': 'EcoCyc',
>>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>>            }
>>
>> In my testing, I've found several GenBank db_xref abbreviations for
>> which we don't have a mapping defined, such as "LocusID", "dbSNP",
>> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>>
>> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
>> similar mapping in their BioSQL code (or GenBank parser), so that
>> Biopython can follow your example.
>>
>> Thank you,
>>
>> Peter
>>
>> P.S.
>> See also Biopython bug 2405
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
>
> -- 
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From biopython at maubp.freeserve.co.uk  Sat Nov 24 04:16:49 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Nov 2007 09:16:49 +0000
Subject: [BioSQL-l] [BioPython] BioSQL : GenBank db_xref names in dbxref
	table
In-Reply-To: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
	<47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
Message-ID: <320fb6e00711240116g4819fc81g202fda35801f19f2@mail.gmail.com>

Thank you Chris and Mauricio,

On 11/23/07, Chris Fields wrote:
> I think [BioPerl's] SeqIO checks the name for parsing reasons only, in
> cases where the format changes based on the source (such as GenPept
> DBSOURCE data).  I don't think we go beyond that in Bioperl, probably
> b/c modifying or expanding names for data persistence would lead to
> volatile coding issues (i.e. consistency between parsers, constant
> updating to cover new crossrefs, etc).

And in Biopython's case, we get annoying warnings if it hasn't seen
the term before!
Which is why I filed Biopython bug 2405 in the first place :)
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

> I would definitely suggest retaining the original DB as it appears in
> the dbxref for consistency/sanity; if needed return expanded names
> using a different method if they are designated.

Sounds good to me.

Peter

From holland at ebi.ac.uk  Mon Nov 26 04:05:37 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 09:05:37 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <474A8C61.5080504@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there.

BioJava uses the labels as-is from the file without trying to translate
them further. The only exception is taxon xrefs, which get translated
into taxon objects; everything else is unchanged.

cheers,
Richard

Peter wrote:
> Dear all,
>
> I'm one of the Biopython developers.  I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface.  I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
>
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
>
> /db_xref="ASAP:1309"
> /db_xref="GI:16128366"
> /db_xref="ECOCYC:EG10213"
> /db_xref="GeneID:945313"
>
> Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before recording these entries in the seqfeature_dbxref
> and dbxref tables.  For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
>
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
>
> In my testing, I've found several GenBank db_xref abbreviations for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
>
> Thank you,
>
> Peter
>
> P.S. See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
>

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel.
+44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSoxh4C5LeMEKA/QRApBrAKCQDwWTHF9OQHA61PeUR/gUKdBj3wCffzDJ
7qoEUN+9XnMNkVe7wOeERbU=
=80+z
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Mon Nov 26 14:10:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Nov 2007 19:10:31 +0000
Subject: [BioSQL-l] Authority in biodatabase table
Message-ID: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>

Thanks for all the replies on the db_xref issue.

Today I'd like to ask if there are any established guidelines for the
biodatabase table - in particular for how to use the "authority" field
in the biodatabase table, and if there is any agreed terminology for
the named "sub databases" defined therein, i.e. what should I call them
in our documentation.

By default, unless the user specifies an authority, we end up with a
NULL when creating entries in the biodatabase table using Biopython.
For example:

from BioSQL import BioSeqDatabase
# Connect to an existing BioSQL schema on a local MySQL server
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
                     passwd="", host="localhost", db="bioseqdb")
# No authority is given here, so that column is left as NULL
db = server.new_database("orchids", description="Just for testing")
server.adaptor.commit()

I'd like to give some sensible defaults in any worked examples.  Apart
from simple test cases (like above), sensible examples that came to
mind would be creating a "sub database" to contain:
(*) an entire GenBank release
(*) the latest SwissProt release

What would you use in these cases?  In fact, what does your
biodatabase table contain right now?
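For readers without a BioSQL server to hand, the NULL-by-default
behaviour described above can be reproduced with a rough sketch. It
assumes a cut-down biodatabase table holding only the name, authority
and description columns, uses Python's built-in sqlite3 module in
place of MySQL, and introduces a hypothetical new_database() helper
that is not Biopython's method of the same name. The authority value
"NCBI" is purely illustrative, not an agreed convention.

```python
import sqlite3

# Cut-down stand-in for the BioSQL biodatabase table (an assumption:
# only the columns discussed in this thread are included).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE biodatabase (
        biodatabase_id INTEGER PRIMARY KEY,
        name           VARCHAR(128) NOT NULL UNIQUE,
        authority      VARCHAR(128),
        description    TEXT
    )
    """
)


def new_database(conn, name, authority=None, description=None):
    """Hypothetical helper: add one sub database row; authority is optional."""
    conn.execute(
        "INSERT INTO biodatabase (name, authority, description) VALUES (?, ?, ?)",
        (name, authority, description),
    )
    conn.commit()


# No authority supplied, so the column is stored as NULL (None in Python).
new_database(conn, "orchids", description="Just for testing")
# "NCBI" is only an illustrative value, not an established guideline.
new_database(conn, "genbank", authority="NCBI", description="GenBank release")

rows = conn.execute(
    "SELECT name, authority FROM biodatabase ORDER BY biodatabase_id"
).fetchall()
print(rows)
```

The first row comes back with None for authority, which is the NULL
in question; whether a worked example should fill it in is exactly
what is being asked here.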
Thank you all,

Peter

From holland at ebi.ac.uk  Tue Nov 27 03:39:52 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 08:39:52 +0000
Subject: [BioSQL-l] Authority in biodatabase table
In-Reply-To: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
References: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
Message-ID: <474BD7D8.2050006@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

BioJava leaves authority blank at the moment.

cheers,
Richard

Peter wrote:
> Thanks for all the replies on the db_xref issue.
>
> Today I'd like to ask if there are any established guidelines for the
> biodatabase table - in particular for how to use the "authority" field
> in the biodatabase table, and if there is any agreed terminology for
> the named "sub databases" defined therein, i.e. what should I call them
> in our documentation.
>
> By default, unless the user specifies an authority, we end up with a
> NULL when creating entries in the biodatabase table using Biopython.
> For example:
>
> from BioSQL import BioSeqDatabase
> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
>                      passwd = "", host = "localhost", db="bioseqdb")
> db = server.new_database("orchids", description="Just for testing")
> server.adaptor.commit()
>
> I'd like to give some sensible defaults in any worked examples.  Apart
> from simple test cases (like above), sensible examples that came to
> mind would be creating a "sub database" to contain:
> (*) an entire GenBank release
> (*) the latest SwissProt release
>
> What would you use in these cases?  In fact, what does your
> biodatabase table contain right now?
>
> Thank you all,
>
> Peter
>

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel.
+44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS9fY4C5LeMEKA/QRAnPBAJ48/fwxyJ0rBnIJNZnTpZexXAs6iQCgnuTq
D4MzSndO+Osf/lzSVqQOArQ=
=rOVk
-----END PGP SIGNATURE-----
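
Returning to the db_xref thread in this archive, the approach Chris
suggested (store the database name exactly as it appears in the file,
and offer expanded names through a separate method) could look roughly
like the sketch below. The function names split_db_xref() and
expanded_name() are hypothetical and are not part of Biopython,
BioPerl or BioJava; the lookup table is a small subset of the db_dict
quoted in the thread, and the "LocusID:1234" accession is made up for
illustration.

```python
# Display names only; a subset of the db_dict quoted in the thread.
EXPANDED_NAMES = {
    "GeneID": "Entrez",
    "GI": "GeneIndex",
    "ECOCYC": "EcoCyc",
    "ASAP": "ASAP",
}


def split_db_xref(value):
    """Split 'DB:accession' at the first colon, keeping the DB name as-is."""
    db, sep, accession = value.partition(":")
    if not sep or not db or not accession:
        raise ValueError("not a DB:accession pair: %r" % (value,))
    return db, accession


def expanded_name(db):
    """Return a longer display name if one is known, else the name unchanged."""
    return EXPANDED_NAMES.get(db, db)


qualifiers = [
    "ASAP:1309",
    "GI:16128366",
    "ECOCYC:EG10213",
    "GeneID:945313",
    "LocusID:1234",  # made-up accession for an abbreviation with no mapping
]

for qualifier in qualifiers:
    db, accession = split_db_xref(qualifier)
    # The original name is what would be recorded; the expanded name
    # is only looked up for display.
    print(db, accession, expanded_name(db))
```

Because an unknown abbreviation such as "LocusID" simply falls back to
itself, nothing needs a warning or a constantly updated table in order
to be stored.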