[Bioperl-l] SwissProt DE lines and bioentry.description field in BioSQL

Sat May 16 22:34:57 UTC 2009

Don't you love SwissProt (or UniProt as we must call it now I  
suppose). They (understandably) try to squeeze ever more annotation  
into the existing tags, rather than adding new tags.

So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the  
description line as we know it. The rest, I would say, is annotation,  
such as two alternative names, amino acid chains contained in the full  
record (shouldn't this be feature annotation, really? and indeed it is  
- why it needs to be repeated here is beyond me) and their names as  
well as alternative names, and the fact that the sequence is a  
precursor form.

Leaving all this in one string has the advantage that we can round- 
trip it (and there is probably hardly any other way to accomplish  
that), but clearly in terms of semantics this isn't the sequence  
description as we know it anymore.

Does anyone else think that completely changing the semantics of  
sequence annotation fields is a bad idea? <sigh/>

My inclination from a BioPerl perspective is to extract the part  
following 'RecName: Full=' as the description, and attach the rest as  
annotation. We could in fact use the TagTree class for this. I'm cross- 
posting to BioPerl too to gather what other BioPerl'ers think about  
this.
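
To make the proposed split concrete, here is a rough sketch in plain Python (standard library only, not BioPerl and not the Biopython parser). The function name split_de_lines and the tuple layout are invented for illustration, and the indentation rule used to tell top-level entries from those under 'Contains:' is an assumption based on the example record above, not on the full UniProt format.

```python
import re

def split_de_lines(de_lines):
    """Split SwissProt DE lines into (description, annotations).

    description: the value of the first top-level 'RecName: Full='.
    annotations: list of (section, category, field, value) tuples in
    file order, e.g. ("Contains#1", "AltName", "Full", "...").
    Hypothetical helper; the layout is an assumption for illustration.
    """
    description = None
    annotations = []
    section = "main"
    n_sub = 0          # counter for Contains:/Includes: blocks
    category = None    # last seen RecName/AltName/Flags/...
    for line in de_lines:
        if not line.startswith("DE"):
            continue
        raw = line[2:]
        indent = len(raw) - len(raw.lstrip())
        body = raw.strip()
        if not body:
            continue
        if body in ("Contains:", "Includes:"):
            n_sub += 1
            section = "%s#%d" % (body[:-1], n_sub)
            category = None
            continue
        m = re.match(r"(\w+):\s*(.*)", body)
        if m:
            category, body = m.group(1), m.group(2)
            # Assumption: a category at the shallow indent is top-level again.
            if indent <= 3:
                section = "main"
        body = body.rstrip(";")
        field, sep, value = body.partition("=")
        if not sep:
            # Entries such as 'Flags: Precursor' carry no 'key=' part.
            field, value = None, body
        if (description is None and section == "main"
                and category == "RecName" and field == "Full"):
            description = value
        else:
            annotations.append((section, category, field, value))
    return description, annotations
```

Fed the ten DE lines above, this should give '11S globulin seed storage protein 2' as the description and keep everything else, in order, as annotation tuples. Keeping the order and the section labels is what would let the original DE block be regenerated for round-tripping.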

	-hilmar

On May 14, 2009, at 2:20 PM, Peter wrote:

> Hi,
>
> This is cross-posted between biopython-dev and biosql-l as it regards
> parsing the description (DE) lines in SwissProt files and how they are
> stored in BioSQL.  This follows from an earlier discussion on
> biopython-dev.
>
> Older SwissProt files just had one or two DE lines, and it made sense
> to treat this as a simple string mapped onto the description field in
> the bioentry table in BioSQL.  This appears to be what happens with
> BioPerl 1.5.x and in Biopython (although the details regarding white
> space differ).  However, newer SwissProt files have many DE lines with
> additional structure.  The example Michiel gave earlier on the
> biopython-dev list was:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
> This has the following DE lines:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> I had to fight with perl to get my old copy of BioPerl working again
> (some weak reference thing), but I managed, and then loaded this file
> into my test BioSQL database with:
>
> $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
> XXX --namespace biosql_test --format swiss Q9XHP0.txt
>
> Then I looked at the resulting description in the main bioentry table:
>
> $ mysql --user=root -p biosql_test -e 'SELECT description FROM
> bioentry WHERE accession="Q9XHP0";'
>
> This is stored as one huge long string (without the newlines, I'm not
> sure if BioPerl strips those in parsing the file, or when loading it
> into the database):
>
> RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
> globulin seed storage protein II; AltName: Full=Alpha-globulin;
> Contains: RecName: Full=11S globulin seed storage protein 2 acidic
> chain; AltName: Full=11S globulin seed storage protein II acidic
> chain; Contains: RecName: Full=11S globulin seed storage protein 2
> basic chain; AltName: Full=11S globulin seed storage protein II basic
> chain; Flags: Precursor;
>
> For Biopython, I emptied the database then did:
>
> >>> from Bio import SeqIO
> >>> from BioSQL import BioSeqDatabase
> >>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
> ...     passwd="XXX", host="localhost", db="biosql_test")
> >>> db = server["biosql-test"]  # namespace
> >>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
> 1
> >>> server.commit()
>
> As before, I looked in the table with mysql.  Again - this stores the
> full description from the DE line, although with the newlines
> embedded.  So, Biopython is consistent with my old copy of BioPerl
> (1.5.x) if we ignore the white space.
>
> However, how does this look in BioPerl 1.6?  If this is the same, are
> there any plans to change this?  For Biopython we have discussed
> recording most of the DE information under the annotations instead
> (keyed off RecName, AltName, Contains, Flags), but I would like to be
> consistent with BioPerl+BioSQL.
>
> Thanks
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================