[BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL

Peter biopython at maubp.freeserve.co.uk
Thu May 14 18:20:47 UTC 2009


Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL.  This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL.  This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ).  However, newer SwissProt files have many DE lines with
additional structure.  The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with Perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX \
    --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM
bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string, without the newlines (I'm not
sure whether BioPerl strips those when parsing the file or when loading
it into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
...                                       passwd="XXX", host="localhost",
...                                       db="biosql_test")
>>> db = server["biosql-test"]  # namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql.  Again, this stores the
full description from the DE line, although with the newlines
embedded.  So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.
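
One way to check that the two stored descriptions agree apart from white
space is to collapse every run of white space (newlines included) into a
single space before comparing.  A minimal sketch, where
normalize_whitespace is a hypothetical helper and the two strings are
abbreviated stand-ins for the values each toolkit stored:

```python
def normalize_whitespace(text):
    """Collapse any run of white space (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


# Abbreviated, illustrative values (not the full descriptions from the database).
bioperl_style = (
    "RecName: Full=11S globulin seed storage protein 2; "
    "AltName: Full=Alpha-globulin;"
)
biopython_style = (
    "RecName: Full=11S globulin seed storage protein 2;\n"
    "AltName: Full=Alpha-globulin;"
)

# True when the two descriptions differ only in white space.
same = normalize_whitespace(bioperl_style) == normalize_whitespace(biopython_style)
print(same)
```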

However, how does this look in BioPerl 1.6?  If this is the same, are
there any plans to change this?  For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.
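
To make the annotations idea concrete, here is a rough sketch of how the
DE lines could be grouped by RecName, AltName, Contains and Flags.  The
function parse_de_lines and the dictionary layout are hypothetical; this
is not an existing Biopython or BioPerl API, and it assumes the
indentation shown in the Q9XHP0 example (three spaces after DE at the top
level, more inside a Contains block):

```python
KNOWN_CATEGORIES = ("RecName", "AltName", "Flags")


def parse_de_lines(lines):
    """Group SwissProt DE lines into a dict keyed by category.

    Each "Contains:" block becomes its own nested dict in a list
    stored under the "Contains" key.
    """
    top = {}
    target = top      # dict currently receiving values
    category = None   # last category seen, used for continuation lines
    for line in lines:
        if not line.startswith("DE"):
            continue
        body = line[2:].rstrip()
        text = body.strip()
        if not text:
            continue
        indent = len(body) - len(body.lstrip())
        if text == "Contains:":
            # Start a new nested block for a contained chain.
            target = {}
            top.setdefault("Contains", []).append(target)
            category = None
            continue
        head, sep, rest = text.partition(":")
        if sep and head in KNOWN_CATEGORIES:
            if indent <= 3:
                target = top  # shallow indentation: back at the top level
            category = head
            text = rest.strip()
        if category is None:
            continue  # unrecognised line; ignored in this sketch
        for value in text.split(";"):
            value = value.strip()
            if value:
                target.setdefault(category, []).append(value)
    return top


DE_LINES = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;
""".splitlines()

annotations = parse_de_lines(DE_LINES)
print(annotations["RecName"])        # ['Full=11S globulin seed storage protein 2']
print(annotations["Flags"])          # ['Precursor']
print(len(annotations["Contains"]))  # 2
```

Whether the bioentry description should then hold only the RecName, or
still the whole flattened string, is exactly the open question here.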

Thanks

Peter


