[Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL
Chris Fields
cjfields at illinois.edu
Sat May 16 23:16:05 UTC 2009
On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:
> Don't you love SwissProt (or UniProt as we must call it now I
> suppose). They (understandably) try to squeeze ever more annotation
> into the existing tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE RecName: Full=11S globulin seed storage protein 2;
> DE AltName: Full=11S globulin seed storage protein II;
> DE AltName: Full=Alpha-globulin;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE AltName: Full=11S globulin seed storage protein II acidic chain;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE AltName: Full=11S globulin seed storage protein II basic chain;
> DE Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is
> the description line as we know it. The rest, I would say, is
> annotation, such as two alternative names, amino acid chains
> contained in the full record (shouldn't this be feature annotation,
> really? and indeed it is - why it needs to be repeated here is
> beyond me) and their names as well as alternative names, and the
> fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can round-
> trip it (and there is probably hardly any other way to accomplish
> that), but clearly in terms of semantics this isn't the sequence
> description as we know it anymore.
>
> Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>
>
> My inclination from a BioPerl perspective is to extract the part
> following 'RecName: Full=' as the description, and attach the rest
> as annotation. We could in fact use the TagTree class for this. I'm
> cross-posting to BioPerl too to gather what other BioPerl'ers think
> about this.
>
> -hilmar
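A rough sketch of the split described above, in Python purely for illustration (this is not BioPerl or TagTree code). The function name `split_de`, the tuple layout, and the choice to file `Flags:` under the whole entry rather than under the last `Contains:` block are all assumptions made here:

```python
import re

# One statement: optional category ("RecName:"), optional key ("Full="), value.
STMT = re.compile(r"(?:(RecName|AltName|SubName|Flags):\s*)?(?:(\w+)=)?(.*)")

def split_de(de_lines):
    """Return (description, annotation) from a list of UniProt DE lines.

    annotation is a list of (section, category, key, value) tuples.
    """
    # Join the payloads first so a value wrapped over two lines is whole again.
    text = " ".join(re.sub(r"^DE\s+", "", line).strip() for line in de_lines)
    description, annotation = None, []
    section, category, n_sections = "entry", None, 0
    # Statements end in ';', except the bare 'Contains:' / 'Includes:' headers.
    for m in re.finditer(r"\s*(?:(Contains|Includes):|([^;]+);)", text):
        if m.group(1):
            n_sections += 1
            section, category = f"{m.group(1)} {n_sections}", None
            continue
        cat, key, value = STMT.match(m.group(2).strip()).groups()
        category = cat or category  # e.g. a 'Short=' line inherits its category
        if (description is None and section == "entry"
                and category == "RecName" and key == "Full"):
            description = value.strip()
            continue
        # Assumption: Flags describe the whole entry, not the last sub-chain.
        sec = "entry" if category == "Flags" else section
        annotation.append((sec, category, key, value.strip()))
    return description, annotation

example = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Flags: Precursor;",
]
description, annotation = split_de(example)
print(description)
for row in annotation:
    print(row)
```

Keeping the leftovers as ordered tuples rather than a hash preserves repeated `AltName` entries and their order, which is what a round trip back to DE lines would need.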
This is much like the GN issues we've run into before, and we *could*
set this up using TagTree or similar. In the gene name case, the data
is stored in a text tree as follows:
gene_names:
  gene_name:
    Name: GC1QBP
    Synonyms: HABP1
    Synonyms: SF2P32
    Synonyms: C1QBP
That could be changed to an XML string:
<?xml version="1.0" encoding="UTF-8"?>
<gene_names>
  <gene_name>
    <Name>GC1QBP</Name>
    <Synonyms>HABP1</Synonyms>
    <Synonyms>SF2P32</Synonyms>
    <Synonyms>C1QBP</Synonyms>
  </gene_name>
</gene_names>
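To make the tree-to-XML idea concrete, here is a small sketch in Python using only the standard library's `xml.etree.ElementTree`. It is not how TagTree does it; the nested list-of-pairs layout and the helper names `tree_to_element` and `element_to_tree` are invented for the example:

```python
import xml.etree.ElementTree as ET

def tree_to_element(tag, children):
    """Build an Element from (tag, value) pairs; value is a str or a nested list."""
    elem = ET.Element(tag)
    for child_tag, value in children:
        if isinstance(value, list):
            elem.append(tree_to_element(child_tag, value))
        else:
            ET.SubElement(elem, child_tag).text = value
    return elem

def element_to_tree(elem):
    """Inverse of tree_to_element for the children of elem."""
    return [
        (child.tag, element_to_tree(child) if len(child) else (child.text or ""))
        for child in elem
    ]

# The gene name tree from above; pairs (not a dict) because Synonyms repeats.
gene_names = [
    ("gene_name", [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]),
]

root = tree_to_element("gene_names", gene_names)
xml_string = ET.tostring(root, encoding="unicode")  # no XML declaration, no indentation
print(xml_string)

# Parsing the string back gives the same tree, so the form round-trips.
assert element_to_tree(ET.fromstring(xml_string)) == gene_names
```

The point of the round-trip check is that any of the Bio* projects could read the string back with a stock XML parser, without knowing anything about the text-tree layout.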
Thinking about this, we should try to coalesce around a standard
rather than forcing the other Bio* projects into a specific format.
chris