[Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL

Sat May 16 23:16:05 UTC 2009

On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:

> Don't you love SwissProt (or UniProt as we must call it now I  
> suppose). They (understandably) try to squeeze ever more annotation  
> into the existing tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is  
> the description line as we know it. The rest, I would say, is  
> annotation, such as two alternative names, amino acid chains  
> contained in the full record (shouldn't this be feature annotation,  
> really? and indeed it is - why it needs to be repeated here is  
> beyond me) and their names as well as alternative names, and the  
> fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can
> round-trip it (and there is probably hardly any other way to
> accomplish that), but clearly in terms of semantics this isn't the
> sequence description as we know it anymore.
>
> Does anyone else think too that completely changing the semantics of  
> sequence annotation fields is a bad idea? <sigh/>
>
> My inclination from a BioPerl perspective is to extract the part  
> following 'RecName: Full=' as the description, and attach the rest  
> as annotation. We could in fact use the TagTree class for this. I'm  
> cross-posting to BioPerl too to gather what other BioPerl'ers think  
> about this.
>
> 	-hilmar
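
As a rough illustration of the split Hilmar proposes above (the top-level
'RecName: Full=' value becomes the description, everything else becomes
annotation), here is a minimal sketch in Python rather than BioPerl. The
function name split_de and the dictionary layout are made up for this
example and are not any existing BioPerl or BioSQL API. It also assumes
only 'Full=' names occur, as in the record quoted above; 'Short=' or
'EC=' entries are simply ignored.

```python
import re

# Hypothetical sample: the DE block from the quoted record, unwrapped.
DE_BLOCK = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;
"""

# Matches "RecName: Full=...;" and "AltName: Full=...;" (assumption: Full= only).
NAME_RE = re.compile(r"^(RecName|AltName): Full=(.+?);?$")


def split_de(block):
    """Split a DE block into (description, annotation).

    description is the top-level RecName; annotation holds the top-level
    alternative names, one dict per 'Contains:' section, and the flags.
    """
    description = None
    annotation = {"AltName": [], "Contains": [], "Flags": []}
    section = annotation  # top level until a 'Contains:' section starts
    for line in block.splitlines():
        if not line.startswith("DE"):
            continue
        body = line[2:].strip()
        if body == "Contains:":
            section = {"RecName": None, "AltName": []}
            annotation["Contains"].append(section)
        elif body.startswith("Flags:"):
            flags = body[len("Flags:"):].split(";")
            annotation["Flags"].extend(f.strip() for f in flags if f.strip())
        else:
            match = NAME_RE.match(body)
            if not match:
                continue  # anything else is skipped in this sketch
            kind, name = match.groups()
            if kind == "AltName":
                section["AltName"].append(name)
            elif section is annotation:
                description = name
            else:
                section["RecName"] = name
    return description, annotation


description, annotation = split_de(DE_BLOCK)
print(description)
print(annotation["Flags"])
```

Note that this is lossy on purpose: the original string can only be
round-tripped if the annotation is written back out in the same order.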

This is much like the GN issues we've run into before, and we *could*
set this up using TagTree or similar. In the gene name case, the data
is stored in a text tree as follows:

gene_names:
   gene_name:
     Name: GC1QBP
     Synonyms: HABP1
     Synonyms: SF2P32
     Synonyms: C1QBP

That could be changed to an XML string:

<?xml version="1.0" encoding="UTF-8"?>
<gene_names>
   <gene_name>
     <Name>GC1QBP</Name>
     <Synonyms>HABP1</Synonyms>
     <Synonyms>SF2P32</Synonyms>
     <Synonyms>C1QBP</Synonyms>
   </gene_name>
</gene_names>
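
To show that the conversion is mechanical, here is a small sketch (in
Python, using only the standard library) that turns the indented text
tree above into the equivalent XML. It assumes the layout is exactly as
shown: one 'key: value' pair per line, nesting expressed by indentation,
a single top-level node, and keys that are valid XML element names. The
function name tree_to_xml is made up for this example; it is not how
TagTree itself does it.

```python
import xml.etree.ElementTree as ET

TREE = """\
gene_names:
   gene_name:
     Name: GC1QBP
     Synonyms: HABP1
     Synonyms: SF2P32
     Synonyms: C1QBP
"""


def tree_to_xml(text):
    """Convert an indented 'key: value' text tree into an XML element."""
    root = None
    stack = []  # (indent, element) pairs for the nodes currently open
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(":")
        # Close every node at the same or a deeper indentation level.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            elem = ET.SubElement(stack[-1][1], key)
        elif root is None:
            elem = root = ET.Element(key)
        else:
            raise ValueError("more than one top-level node")
        value = value.strip()
        if value:
            elem.text = value
        stack.append((indent, elem))
    return root


root = tree_to_xml(TREE)
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

The output is not pretty-printed and has no XML declaration; both are
easy to add if the string needs to match the form above exactly.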

Thinking about this, we should try to coalesce around a standard
rather than forcing the other Bio* projects into a specific format.

chris