[Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 21 07:26:00 EDT 2009


On Tue, Apr 21, 2009 at 12:12 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Dear all,
>
> I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files.
>
> For these DE lines:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> a SwissProt record created by Bio.SwissProt contains the following:
> >>> print swiss_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
>  RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
>  RecName: Full=11S globulin seed storage protein 2 basic chain;
>  AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> but a SeqRecord returned by Bio.SeqIO contains this:
>
> >>> print seq_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 acidic chain;
> AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 basic chain;
> AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> So Bio.SeqIO strips the leading spaces from the indented lines, but Bio.SwissProt keeps them.
> For consistency, I think it's better to decide on one of these two styles.
> My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO?

Have you got a link for the full record in your example?

For interaction with other Bio.SeqIO formats, I generally expect the
description to be a single line string (with no embedded newlines).
If you look at the (old) SwissProt files in our unit tests, the
current Bio.SeqIO behaviour makes sense - the DE line(s) just encode a
fairly short, simple string.  It looks like the SwissProt format has
changed.  Perhaps we should parse the new extended DE lines more
carefully, split these entries up, and record them in the
SeqRecord.annotations dictionary?
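
As a rough sketch of what that splitting might look like, here is a
standalone helper. It is not existing Biopython code: the function names
parse_de_lines and one_line_description and the layout of the returned
dictionary are assumptions made for illustration, and only the DE
constructs shown in the example above (RecName/AltName, Contains, Flags)
are handled.

```python
# Hypothetical helpers, not part of Biopython; names and the returned
# layout are assumptions made for illustration only.

DE_LINES = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def parse_de_lines(lines):
    """Split new-style DE lines into names, 'Contains' sections and flags.

    Each name is stored as a (category, key, value) tuple, for example
    ("RecName", "Full", "Alpha-globulin").
    """
    parsed = {"names": [], "contains": [], "flags": []}
    target = parsed["names"]  # list that currently receives name entries
    category = None           # last seen RecName/AltName style category
    for line in lines:
        if not line.startswith("DE"):
            continue
        # Drop the line code, surrounding whitespace and trailing ';'.
        text = line[2:].strip().rstrip(";")
        if text == "Contains:":
            # Start a new sub-section; following names belong to it.
            target = []
            parsed["contains"].append(target)
            category = None
        elif text.startswith("Flags:"):
            flags = text[len("Flags:"):].split(";")
            parsed["flags"].extend(f.strip() for f in flags if f.strip())
        elif "=" in text:
            head, _, value = text.partition("=")
            if ":" in head:
                # "RecName: Full" -> category "RecName", key "Full"
                category, _, head = head.partition(":")
                category = category.strip()
            target.append((category, head.strip(), value.strip()))
    return parsed


def one_line_description(lines):
    """Join the DE lines into a single string with no embedded newlines."""
    return " ".join(line[2:].strip() for line in lines if line.startswith("DE"))


if __name__ == "__main__":
    parsed = parse_de_lines(DE_LINES)
    print(parsed["names"])
    print(parsed["contains"])
    print(parsed["flags"])
    print(one_line_description(DE_LINES))
```

The dictionary returned by parse_de_lines could then be stored under a
key of our choosing in SeqRecord.annotations, while the description
itself stays a single-line string as one_line_description produces.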

Peter


