From biopython at maubp.freeserve.co.uk  Tue Nov 20 14:36:34 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Nov 2007 19:36:34 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
Message-ID: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>

Dear all,

I'm one of the Biopython developers.  I've recently got going with
BioSQL and have been getting to grips with the Biopython BioSQL
interface.  I'm aware that we need to try and be consistent with
BioPerl and BioJava, so I'd like to pose my first question related to
that.

When loading GenBank records, many features have db_xref qualifiers,
e.g. from a random CDS feature in E. coli K12:

                     /db_xref="ASAP:1309"
                     /db_xref="GI:16128366"
                     /db_xref="ECOCYC:EG10213"
                     /db_xref="GeneID:945313"

Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
"GeneID" before recording these entries in the seqfeature_dbxref
and dbxref tables.  For example, "GI" becomes "GeneIndex".
Biopython's current mapping is as follows:

# Dictionary of database types, keyed by GenBank db_xref abbreviation
db_dict = {'GeneID': 'Entrez',
           'GI': 'GeneIndex',
           'COG': 'COG',
           'CDD': 'CDD',
           'DDBJ': 'DNA Databank of Japan',
           'Entrez': 'Entrez',
           'GeneIndex': 'GeneIndex',
           'PUBMED': 'PubMed',
           'taxon': 'Taxon',
           'ATCC': 'ATCC',
           'ISFinder': 'ISFinder',
           'GOA': 'Gene Ontology Annotation',
           'ASAP': 'ASAP',
           'PSEUDO': 'PSEUDO',
           'InterPro': 'InterPro',
           'GEO': 'Gene Expression Omnibus',
           'EMBL': 'EMBL',
           'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
           'ECOCYC': 'EcoCyc',
           'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
           }

In my testing, I've found several GenBank db_xref abbreviations for
which we don't have a mapping defined, such as "LocusID", "dbSNP",
"MGD", "MIM", or from an EMBL file, "REMTREMBL".
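
A minimal sketch of the lookup described above, assuming a db_xref value
is always of the form "DB:accession".  The helper name split_db_xref and
the trimmed-down mapping are hypothetical and are not Biopython's actual
loader code; the "MIM" accession is a placeholder.  It shows one possible
way to handle an abbreviation that is missing from the mapping: keep it
exactly as written in the file.

```python
# Hypothetical helper for illustration only; not Biopython's loader code.
# A small subset of the mapping quoted above.
KNOWN_DB_NAMES = {
    "GeneID": "Entrez",
    "GI": "GeneIndex",
    "ECOCYC": "EcoCyc",
}


def split_db_xref(value, mapping=KNOWN_DB_NAMES):
    """Split 'DB:accession' and translate DB only if a mapping exists."""
    dbname, sep, accession = value.partition(":")
    if not sep:
        raise ValueError("db_xref without a ':' separator: %r" % value)
    # Unknown abbreviations fall back to the name as written in the file.
    return mapping.get(dbname, dbname), accession


for xref in ["GI:16128366", "ECOCYC:EG10213", "MIM:12345"]:
    print(split_db_xref(xref))
```

With this fallback an unmapped name such as "MIM" is stored unchanged
rather than triggering a warning.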

I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
similar mapping in their BioSQL code (or GenBank parser), so that
Biopython can follow your example.

Thank you,

Peter

P.S. See also Biopython bug 2405
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

From arareko at campus.iztacala.unam.mx  Thu Nov 22 11:37:24 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 22 Nov 2007 10:37:24 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <4745B044.5090102@campus.iztacala.unam.mx>

Hi Peter,

In BioPerl, there's no such mapping for db_xref's that I'm aware of. 
Each parser handles db_xref records on its own. Take a look at the 
Bio::SeqIO::genbank code, inside the next_seq() method for example:

http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup

Regards,
Mauricio.


-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM


From cjfields at uiuc.edu  Thu Nov 22 19:42:12 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 22 Nov 2007 18:42:12 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <4745B044.5090102@campus.iztacala.unam.mx>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
Message-ID: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>

I think SeqIO checks the name for parsing reasons only, in cases  
where the format changes based on the source (such as GenPept  
DBSOURCE data).  I don't think we go beyond that in Bioperl, probably  
b/c modifying or expanding names for data persistence would lead to  
volatile coding issues (i.e. consistency between parsers, constant  
updating to cover new crossrefs, etc).

I would definitely suggest retaining the original DB as it appears in
the dbxref for consistency/sanity; if expanded names are needed, return
them using a different method, where such names are designated.
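
One way to read this suggestion, as a rough sketch only: store the
database name exactly as it appears in the file, and expose any expanded
name through a separate accessor.  The class DbXref, its method names and
the small EXPANDED table are hypothetical and do not correspond to
existing BioPerl or Biopython code.

```python
class DbXref:
    """Hypothetical cross-reference holder; the raw name is what gets stored."""

    # Optional display names; anything missing simply has no expansion.
    EXPANDED = {
        "GI": "GeneIndex",
        "GEO": "Gene Expression Omnibus",
    }

    def __init__(self, dbname, accession):
        self.dbname = dbname        # kept exactly as written in the record
        self.accession = accession

    @classmethod
    def from_qualifier(cls, value):
        """Build a DbXref from a 'DB:accession' qualifier value."""
        dbname, _, accession = value.partition(":")
        return cls(dbname, accession)

    def expanded_name(self):
        """Return a longer display name if one is designated, else the raw name."""
        return self.EXPANDED.get(self.dbname, self.dbname)


xref = DbXref.from_qualifier("GI:16128366")
print(xref.dbname, xref.expanded_name(), xref.accession)
```

The persisted value never depends on the lookup table, so new
cross-reference names need no code changes before they can be loaded.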

chris


Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From biopython at maubp.freeserve.co.uk  Sat Nov 24 04:16:49 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Nov 2007 09:16:49 +0000
Subject: [BioSQL-l] [BioPython] BioSQL : GenBank db_xref names in dbxref
	table
In-Reply-To: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
	<47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
Message-ID: <320fb6e00711240116g4819fc81g202fda35801f19f2@mail.gmail.com>

Thank you Chris and Mauricio,

On 11/23/07, Chris Fields <cjfields at uiuc.edu> wrote:
> I think [BioPerl's] SeqIO checks the name for parsing reasons only, in
> cases where the format changes based on the source (such as GenPept
> DBSOURCE data).  I don't think we go beyond that in Bioperl, probably
> b/c modifying or expanding names for data persistence would lead to
> volatile coding issues (i.e. consistency between parsers, constant
> updating to cover new crossrefs, etc).

And in Biopython's case, we get annoying warnings if it hasn't seen
the term before!  Which is why I filed Biopython bug 2405 in the first
place :)
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

> I would definitely suggest retaining the original DB as it appears in
> the dbxref for consistency/sanity; if needed return expanded names
> using a different method if they are designated.

Sounds good to me.

Peter

From holland at ebi.ac.uk  Mon Nov 26 04:05:37 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 09:05:37 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <474A8C61.5080504@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there. BioJava uses the labels as-is from the file without trying to
translate them further. The exceptions are taxon xrefs which get
translated into taxon objects, but everything else is unchanged.

cheers,
Richard


- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSoxh4C5LeMEKA/QRApBrAKCQDwWTHF9OQHA61PeUR/gUKdBj3wCffzDJ
7qoEUN+9XnMNkVe7wOeERbU=
=80+z
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Mon Nov 26 14:10:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Nov 2007 19:10:31 +0000
Subject: [BioSQL-l] Authority in biodatabase table
Message-ID: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>

Thanks for all the replies on the db_xref issue.

Today I'd like to ask if there are any established guidelines for the
biodatabase table - in particular for how to use the "authority" field
in the biodatabase table, and if there is any agreed terminology for
the named "sub databases" defined therein i.e. what should I call them
in our documentation.

By default, unless the user specifies an authority, we end up with a
NULL when creating entries in the biodatabase table using Biopython.
For example:

from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
                     passwd = "", host = "localhost", db="bioseqdb")
db = server.new_database("orchids", description="Just for testing")
server.adaptor.commit()
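
To make the NULL default concrete, here is a self-contained sketch using
Python's built-in sqlite3 module in place of a real BioSQL server.  The
table is a simplified stand-in for the biodatabase table (name, authority
and description columns only), the new_database function below is a
hypothetical imitation rather than Biopython's method, and the authority
value "NCBI" is only a placeholder, not an agreed convention.

```python
import sqlite3

# In-memory database with a simplified stand-in for the biodatabase table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE biodatabase (
           biodatabase_id INTEGER PRIMARY KEY,
           name TEXT NOT NULL UNIQUE,
           authority TEXT,
           description TEXT
       )"""
)


def new_database(conn, name, authority=None, description=None):
    """Hypothetical helper: add one named sub database to the table."""
    conn.execute(
        "INSERT INTO biodatabase (name, authority, description) "
        "VALUES (?, ?, ?)",
        (name, authority, description),
    )
    conn.commit()


# No authority given, so the column is stored as NULL (None in Python).
new_database(conn, "orchids", description="Just for testing")
# Placeholder authority value, purely for illustration.
new_database(conn, "genbank", authority="NCBI")

rows = conn.execute(
    "SELECT name, authority FROM biodatabase ORDER BY biodatabase_id"
).fetchall()
print(rows)  # [('orchids', None), ('genbank', 'NCBI')]
```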

I'd like to give some sensible defaults in any worked examples.  Apart
from simple test cases (like above), sensible examples that came to
mind would be creating a "sub database" to contain:
(*) an entire GenBank release
(*) the latest SwissProt release

What would you use in these cases?  In fact, what does your
biodatabase table contain right now?

Thank you all,

Peter

From holland at ebi.ac.uk  Tue Nov 27 03:39:52 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 08:39:52 +0000
Subject: [BioSQL-l] Authority in biodatabase table
In-Reply-To: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
References: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
Message-ID: <474BD7D8.2050006@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

BioJava leaves authority blank at the moment.

cheers,
Richard


- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS9fY4C5LeMEKA/QRAnPBAJ48/fwxyJ0rBnIJNZnTpZexXAs6iQCgnuTq
D4MzSndO+Osf/lzSVqQOArQ=
=rOVk
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Tue Nov 20 19:36:34 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Nov 2007 19:36:34 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
Message-ID: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>

Dear all,

I'm one of the Biopython developers.  I've recently got going with
BioSQL and have been getting to grips with the Biopython BioSQL
interface.  I'm aware that we need to try and be consistent with
BioPerl and BioJava, so I'd like to pose my first question related to
that.

When loading GenBank records, many features have db_xref qualifiers,
e.g. from a random CDS feature in E. coli K12:

                     /db_xref="ASAP:1309"
                     /db_xref="GI:16128366"
                     /db_xref="ECOCYC:EG10213"
                     /db_xref="GeneID:945313"

Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
"GeneID" before using recording these entries in the seqfeature_dbxref
and dbxref tables.  For example, "GI" becomes "GeneIndex".
Biopython's current mapping is as follows:

# Dictionary of database types, keyed by GenBank db_xref abbreviation
db_dict = {'GeneID': 'Entrez',
           'GI': 'GeneIndex',
           'COG': 'COG',
           'CDD': 'CDD',
           'DDBJ': 'DNA Databank of Japan',
           'Entrez': 'Entrez',
           'GeneIndex': 'GeneIndex',
           'PUBMED': 'PubMed',
           'taxon': 'Taxon',
           'ATCC': 'ATCC',
           'ISFinder': 'ISFinder',
           'GOA': 'Gene Ontology Annotation',
           'ASAP': 'ASAP',
           'PSEUDO': 'PSEUDO',
           'InterPro': 'InterPro',
           'GEO': 'Gene Expression Omnibus',
           'EMBL': 'EMBL',
           'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
           'ECOCYC': 'EcoCyc',
           'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
           }

In my testing, I've found several GenBank db_xref abbreviation for
which we don't have a mapping defined, such as "LocusID", "dbSNP",
"MGD", "MIM", or from an EMBL file, "REMTREMBL".

I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
similar mapping in their BioSQL code (or GenBank parser), so that
Biopython can follow your example.

Thank you,

Peter

P.S. See also Biopython bug 2405
http://bugzilla.open-bio.org/show_bug.cgi?id=2405


From arareko at campus.iztacala.unam.mx  Thu Nov 22 16:37:24 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 22 Nov 2007 10:37:24 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <4745B044.5090102@campus.iztacala.unam.mx>

Hi Peter,

In BioPerl, there's no such mapping for db_xref's that I'm aware of. 
Each parser handles db_xref records on its own. Take a look at the 
Bio::SeqIO::genbank code, inside the next_seq() method for example:

http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup

Regards,
Mauricio.

Peter wrote:
> Dear all,
> 
> I'm one of the Biopython developers.  I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface.  I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
> 
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
> 
>                      /db_xref="ASAP:1309"
>                      /db_xref="GI:16128366"
>                      /db_xref="ECOCYC:EG10213"
>                      /db_xref="GeneID:945313"
> 
> Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before using recording these entries in the seqfeature_dbxref
> and dbxref tables.  For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
> 
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
> 
> In my testing, I've found several GenBank db_xref abbreviation for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
> 
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
> 
> Thank you,
> 
> Peter
> 
> P.S. See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
> 

-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Gen?tica
Unidad de Morfofisiolog?a y Funci?n
Facultad de Estudios Superiores Iztacala, UNAM


From cjfields at uiuc.edu  Fri Nov 23 00:42:12 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 22 Nov 2007 18:42:12 -0600
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <4745B044.5090102@campus.iztacala.unam.mx>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
Message-ID: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>

I think SeqIO checks the name for parsing reasons only, in cases  
where the format changes based on the source (such as GenPept  
DBSOURCE data).  I don't think we go beyond that in Bioperl, probably  
b/c modifying or expanding names for data persistence would lead to  
volatile coding issues (i.e. consistency between parsers, constant  
updating to cover new crossrefs, etc).

I would definitely suggest retaining the original DB as it appears in  
the dbxref for consistency/sanity; if needed return expanded names  
using a different method if they are designated.

chris

On Nov 22, 2007, at 10:37 AM, Mauricio Herrera Cuadra wrote:

> Hi Peter,
>
> In BioPerl, there's no such mapping for db_xref's that I'm aware of.
> Each parser handles db_xref records on its own. Take a look at the
> Bio::SeqIO::genbank code, inside the next_seq() method for example:
>
> http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/ 
> Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup
>
> Regards,
> Mauricio.
>
> Peter wrote:
>> Dear all,
>>
>> I'm one of the Biopython developers.  I've recently got going with
>> BioSQL and have been getting to grips with the Biopython BioSQL
>> interface.  I'm aware that we need to try and be consistent with
>> BioPerl and BioJava, so I'd like to pose my first question related to
>> that.
>>
>> When loading GenBank records, many features have db_xref qualifiers,
>> e.g. from a random CDS feature in E. coli K12:
>>
>>                      /db_xref="ASAP:1309"
>>                      /db_xref="GI:16128366"
>>                      /db_xref="ECOCYC:EG10213"
>>                      /db_xref="GeneID:945313"
>>
>> Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
>> "GeneID" before using recording these entries in the  
>> seqfeature_dbxref
>> and dbxref tables.  For example, "GI" becomes "GeneIndex".
>> Biopython's current mapping is as follows:
>>
>> # Dictionary of database types, keyed by GenBank db_xref abbreviation
>> db_dict = {'GeneID': 'Entrez',
>>            'GI': 'GeneIndex',
>>            'COG': 'COG',
>>            'CDD': 'CDD',
>>            'DDBJ': 'DNA Databank of Japan',
>>            'Entrez': 'Entrez',
>>            'GeneIndex': 'GeneIndex',
>>            'PUBMED': 'PubMed',
>>            'taxon': 'Taxon',
>>            'ATCC': 'ATCC',
>>            'ISFinder': 'ISFinder',
>>            'GOA': 'Gene Ontology Annotation',
>>            'ASAP': 'ASAP',
>>            'PSEUDO': 'PSEUDO',
>>            'InterPro': 'InterPro',
>>            'GEO': 'Gene Expression Omnibus',
>>            'EMBL': 'EMBL',
>>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>>            'ECOCYC': 'EcoCyc',
>>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>>            }
>>
>> In my testing, I've found several GenBank db_xref abbreviation for
>> which we don't have a mapping defined, such as "LocusID", "dbSNP",
>> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>>
>> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
>> similar mapping in their BioSQL code (or GenBank parser), so that
>> Biopython can follow your example.
>>
>> Thank you,
>>
>> Peter
>>
>> P.S. See also Biopython bug 2405
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>
> -- 
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Gen?tica
> Unidad de Morfofisiolog?a y Funci?n
> Facultad de Estudios Superiores Iztacala, UNAM
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From biopython at maubp.freeserve.co.uk  Sat Nov 24 09:16:49 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Nov 2007 09:16:49 +0000
Subject: [BioSQL-l] [BioPython] BioSQL : GenBank db_xref names in dbxref
	table
In-Reply-To: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
	<4745B044.5090102@campus.iztacala.unam.mx>
	<47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
Message-ID: <320fb6e00711240116g4819fc81g202fda35801f19f2@mail.gmail.com>

Thank you Chris and Mauricio,

On 11/23/07, Chris Fields <cjfields at uiuc.edu> wrote:
> I think [BioPerl's] SeqIO checks the name for parsing reasons only, in
> cases where the format changes based on the source (such as GenPept
> DBSOURCE data).  I don't think we go beyond that in Bioperl, probably
> b/c modifying or expanding names for data persistence would lead to
> volatile coding issues (i.e. consistency between parsers, constant
> updating to cover new crossrefs, etc).

And in Biopython's case, we get annoying warnings if it hasn't seen
the term before!  Which is way I filed Biopython bug 2405 in the first
place :)
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

> I would definitely suggest retaining the original DB as it appears in
> the dbxref for consistency/sanity; if needed return expanded names
> using a different method if they are designated.

Sounds good to me.

Peter


From holland at ebi.ac.uk  Mon Nov 26 09:05:37 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 26 Nov 2007 09:05:37 +0000
Subject: [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <474A8C61.5080504@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there. BioJava uses the labels as-is from the file without trying to
translate them further. The exceptions are taxon xrefs which get
translated into taxon objects, but everything else is unchanged.

cheers,
Richard

Peter wrote:
> Dear all,
> 
> I'm one of the Biopython developers.  I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface.  I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
> 
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
> 
>                      /db_xref="ASAP:1309"
>                      /db_xref="GI:16128366"
>                      /db_xref="ECOCYC:EG10213"
>                      /db_xref="GeneID:945313"
> 
> Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before using recording these entries in the seqfeature_dbxref
> and dbxref tables.  For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
> 
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
> 
> In my testing, I've found several GenBank db_xref abbreviations for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
> 
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
> 
> Thank you,
> 
> Peter
> 
> P.S. See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHSoxh4C5LeMEKA/QRApBrAKCQDwWTHF9OQHA61PeUR/gUKdBj3wCffzDJ
7qoEUN+9XnMNkVe7wOeERbU=
=80+z
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Mon Nov 26 19:10:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Nov 2007 19:10:31 +0000
Subject: [BioSQL-l] Authority in biodatabase table
Message-ID: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>

Thanks for all the replies on the db_xref issue.

Today I'd like to ask if there are any established guidelines for the
biodatabase table - in particular for how to use the "authority" field
in the biodatabase table, and if there is any agreed terminology for
the named "sub databases" defined therein i.e. what should I call them
in our documentation.

By default, unless the user specifies an authority, we end up with a
NULL when creating entries in the biodatabase table using Biopython.
For example:

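from BioSQL import BioSeqDatabase

# Connect to an existing BioSQL schema on a local MySQL server
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
                     passwd="", host="localhost", db="bioseqdb")
# No authority is given here, so biodatabase.authority is left NULL
db = server.new_database("orchids", description="Just for testing")
server.adaptor.commit()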

I'd like to give some sensible defaults in any worked examples.  Apart
from simple test cases (like above), sensible examples that came to
mind would be creating a "sub database" to contain:
(*) an entire GenBank release
(*) the latest SwissProt release

What would you use in these cases?  In fact, what does your
biodatabase table contain right now?
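
To make the NULL default concrete, here is a minimal sketch using
Python's sqlite3 module with a cut-down stand-in for the biodatabase
table (the real BioSQL schema differs in details, so treat the CREATE
TABLE statement as an assumption).  The "genbank" row and its "NCBI"
authority are purely hypothetical values, not an agreed convention.

```python
import sqlite3

# In-memory database with a simplified stand-in for the BioSQL
# biodatabase table (assumption: only these four columns matter here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biodatabase ("
    " biodatabase_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL UNIQUE,"
    " authority TEXT,"
    " description TEXT)"
)

# No authority supplied: the column is left NULL.
conn.execute(
    "INSERT INTO biodatabase (name, description) VALUES (?, ?)",
    ("orchids", "Just for testing"),
)
# Hypothetical row with an explicit authority.
conn.execute(
    "INSERT INTO biodatabase (name, authority, description) VALUES (?, ?, ?)",
    ("genbank", "NCBI", "Hypothetical example row"),
)
conn.commit()

# The kind of query that answers "what does your table contain?"
rows = conn.execute(
    "SELECT name, authority, description FROM biodatabase"
    " ORDER BY biodatabase_id"
).fetchall()
for row in rows:
    print(row)
```

The same SELECT can be run against a real BioSQL database to see which
authority values, if any, are in use.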

Thank you all,

Peter


From holland at ebi.ac.uk  Tue Nov 27 08:39:52 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Tue, 27 Nov 2007 08:39:52 +0000
Subject: [BioSQL-l] Authority in biodatabase table
In-Reply-To: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
References: <320fb6e00711261110g63c156a1w8b76a797fe12e2b1@mail.gmail.com>
Message-ID: <474BD7D8.2050006@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

BioJava leaves authority blank at the moment.

cheers,
Richard

Peter wrote:
> Thanks for all the replies on the db_xref issue.
> 
> Today I'd like to ask if there are any established guidelines for the
> biodatabase table - in particular for how to use the "authority" field
> in the biodatabase table, and if there is any agreed terminology for
> the named "sub databases" defined therein i.e. what should I call them
> in our documentation.
> 
> By default, unless the user specifies an authority, we end up with a
> NULL when creating entries in the biodatabase table using Biopython.
> For example:
> 
> from BioSQL import BioSeqDatabase
> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
>                      passwd = "", host = "localhost", db="bioseqdb")
> db = server.new_database("orchids", description="Just for testing")
> server.adaptor.commit()
> 
> I'd like to give some sensible defaults in any worked examples.  Apart
> from simple test cases (like above), sensible examples that came to
> mind would be creating a "sub database" to contain:
> (*) an entire GenBank release
> (*) the latest SwissProt release
> 
> What would you use in these cases?  In fact, what does your
> biodatabase table contain right now?
> 
> Thank you all,
> 
> Peter

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHS9fY4C5LeMEKA/QRAnPBAJ48/fwxyJ0rBnIJNZnTpZexXAs6iQCgnuTq
D4MzSndO+Osf/lzSVqQOArQ=
=rOVk
-----END PGP SIGNATURE-----