[Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt
Peter Cock
p.j.a.cock at googlemail.com
Tue Apr 21 08:04:44 EDT 2009
On Tue, Apr 21, 2009 at 12:55 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
>> Have you got a link for the full record in your example?
>>
> You can find it here:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
>> For interaction with other Bio.SeqIO formats, I generally
>> expect the description to be a single line string (with no
>> embedded newlines).
>
>> It looks like the SwissProt format has changed, and we
>> should be parsing the new extended DE lines more
>> carefully, and splitting these entries up and recording
>> them in the SeqRecord.annotations dictionary?
>>
> That sounds reasonable. The dictionary will have to be nested though. Something like this:
>
> annotations["RecName"] = ["Full=11S globulin seed storage protein 2"]
> annotations["AltName"] = ["Full=11S globulin seed storage protein II", "Full=Alpha-globulin"]
> annotations["Contains"] = [
>     {"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
>      "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
>     {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
>      "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
> ]
> annotations["Flags"] = "Precursor"
>
Possible, but we couldn't store those dictionaries in BioSQL. A
list of strings should work, although it isn't as elegant. Maybe
something along these lines?
annotations["RecName"] = ["Full: 11S globulin seed storage protein 2;"]
annotations["AltName"] = ["Full: 11S globulin seed storage protein II",
                          "Full: Alpha-globulin"]
annotations["Contains"] = [
    "RecName: Full=11S globulin seed storage protein 2 acidic chain;\n"
    "AltName: Full=11S globulin seed storage protein II acidic chain;",
    "RecName: Full=11S globulin seed storage protein 2 basic chain;\n"
    "AltName: Full=11S globulin seed storage protein II basic chain;",
]
annotations["Flags"] = "Precursor"
Or for "Contains" just have a flat list of strings, one for each name
(here four names).
Or for "Contains" just drop the AltName entries, and simply have a
list of the RecName entries (here two names).
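
To make the options concrete, here is a rough sketch of the
list-of-strings idea. It is not what Bio.SwissProt does today. The DE
lines are reconstructed from the names quoted in this thread rather
than copied from the record, so treat them as an assumption, and
parse_de_lines and split_names are hypothetical helper names. Unlike
the example above, the values keep the "Full=" prefix from the file,
and "Flags" comes out as a one-item list.

```python
# DE lines reconstructed from the names quoted in this thread
# (an assumption; compare against the real Q9XHP0 record).
DE_LINES = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def parse_de_lines(lines):
    """Hypothetical helper: map extended DE lines to a dict of string lists."""
    annotations = {}
    for line in lines:
        body = line[2:]                    # drop the "DE" line code
        nested = body.startswith("     ")  # five spaces: inside a Contains block
        text = body.strip()
        if text == "Contains:":
            # Start a new block; its names are appended to this string.
            annotations.setdefault("Contains", []).append("")
        elif nested:
            blocks = annotations["Contains"]
            blocks[-1] = (blocks[-1] + "\n" + text).lstrip("\n")
        else:
            key, value = text.split(": ", 1)
            annotations.setdefault(key, []).append(value.rstrip(";"))
    return annotations


def split_names(block):
    """Hypothetical helper: (kind, name) pairs from one Contains block string."""
    pairs = []
    for entry in block.splitlines():
        kind, value = entry.rstrip(";").split(": ", 1)
        pairs.append((kind, value.split("=", 1)[1]))
    return pairs


annotations = parse_de_lines(DE_LINES)

# Alternative 1: a flat list with every name from the Contains blocks.
all_names = [name
             for block in annotations["Contains"]
             for _, name in split_names(block)]

# Alternative 2: keep only the RecName entries.
rec_names = [name
             for block in annotations["Contains"]
             for kind, name in split_names(block)
             if kind == "RecName"]
```

Every value here is a string or a list of strings, which is the
property that matters for BioSQL. The sketch does not handle
"Includes:" blocks or Short=/EC= sub-entries.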
Peter