[Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL

Peter biopython at maubp.freeserve.co.uk
Sat May 16 23:14:54 UTC 2009


On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
>  So, of the following structure:
>
>  DE   RecName: Full=11S globulin seed storage protein 2;
>  DE   AltName: Full=11S globulin seed storage protein II;
>  DE   AltName: Full=Alpha-globulin;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  DE     AltName: Full=11S globulin seed storage protein II acidic chain;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
>  DE     AltName: Full=11S globulin seed storage protein II basic chain;
>  DE   Flags: Precursor;
>
>  really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
>  Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
>  Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>

+1
That's pretty much what I thought on seeing this the first time.

>  My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that BioPerl 1.6.x, like BioPerl 1.5.x, currently just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation in
a nested structure.  However, in order to use the BioSQL annotation
mechanisms, I think a simple flat structure is required :(
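
To make the flat option concrete, here is a rough Python sketch of the
split Hilmar describes: the text after the first top-level
'RecName: Full=' becomes the description, and everything else becomes
flat (tag, value) pairs. This is not existing Biopython, BioPerl or
BioSQL code. The function name split_de_lines and the dotted tag names
(e.g. "Contains.AltName.Full") are invented for illustration, and it
only handles the new-style DE lines shown above.

```python
def split_de_lines(de_lines):
    """Return (description, annotations) for new-style SwissProt DE lines.

    annotations is a flat list of (tag, value) tuples in file order.
    The dotted tag names are invented here for illustration.
    """
    description = None
    annotations = []
    section = ""   # "Contains" or "Includes" while inside such a block
    category = ""  # last RecName/AltName/... seen, reused by continuation lines
    for line in de_lines:
        body = line[5:] if line.startswith("DE   ") else line
        indented = body.startswith(" ")
        text = body.strip().rstrip(";")
        if not text:
            continue
        if text in ("Contains:", "Includes:"):
            section = text[:-1]
            continue
        if not indented:
            section = ""  # a top-level line closes any Contains/Includes block
        head, sep, rest = text.partition(": ")
        if sep and "=" not in head:
            category = head
        else:
            rest = text  # continuation line such as "Short=..."
        key, eq, value = rest.partition("=")
        if not eq:
            key, value = "", rest  # e.g. "Flags: Precursor" has no key=value
        tag = ".".join(part for part in (section, category, key) if part)
        if description is None and tag == "RecName.Full":
            description = value
        else:
            annotations.append((tag, value))
    return description, annotations


EXAMPLE = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE   Flags: Precursor;",
]

description, annotations = split_de_lines(EXAMPLE)
for tag, value in annotations:
    print(tag, "=", value)
```

Note that the flat tags alone do not say which 'Contains' block a name
belongs to; only the order of the pairs preserves that, so any flat
storage would have to keep the order to allow round-tripping.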

Peter


