[Biopython-dev] New: Uniprot XML parser
Andrea Pierleoni
andrea at biocomp.unibo.it
Thu Jan 21 12:01:30 UTC 2010
>> Anyhow, unit testing is coming (thanks to Mauro) together with a detailed
>> comparison between the two parsed seqrecords.
>
> Great.
>
> Peter
>
As mentioned earlier, Mauro did a code review and added unit tests for the
parser in Tests/test_Uniprot.py.
The updated version is available in the GitHub repository:
http://github.com/apierleoni/biopython
Since this version is mature enough, I spent some time comparing the output
of this UniProt XML (UP) parser with that of the SwissProt (SP) plain text
parser. The comparison was done using the Q13639 UniProt entry.
These are the main differences between the two generated SeqRecords:
- id: is the same (first accession)
- name: is the same
- description: UP reports the full name value of the recommended name, while
additional names and synonyms are in the annotations. SP reports a
long string containing everything, parsed as it is from the plain
text.
- dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
NCBI Taxonomy and Swiss-Prot/TrEMBL dbxrefs
- seq: is the same
- features: missing in SP (I have to check with Peter's patch)
- annotations:
- - identical annotations: accessions, keywords, taxonomy, organism
- - mapped annotations:
date_last_annotation_update in UP ---> modified in SP
date_last_sequence_update in UP ---> sequence_modified in SP
gene_name_primary in UP ---> gene_name in SP
>>> SP.annotations['gene_name']
'Name=HTR4;'
>>> UP.annotations['gene_name_primary']
'HTR4'
ncbi_taxid in SP ---> UP dbxrefs, since it is mapped as a
dbReference in the XML file
- - references: has some minor differences.
Final semicolon and double quote missing in UP for both author
and title fields.
In UP reference comments are reported as:
"PublicationType | PublicationDate | Scope | Tissue"
For the submission publication type, the db is reported in the
comments and not in the journal field.
- - comments: here come the big differences.
SP has all comments in a single string.
UP comments are mapped to several annotation entries, using the comment
type and attributes to build the annotation key.
Eg.
comment_function --> list of "function" type comment strings
comment_subcellularlocation_location --> list of "location"
strings in the subcellularlocation comment field
The comment tree in the XML could easily be mapped to a nested comment
dictionary tree, but this would not be BioSQL safe.
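
To make the annotation mapping and the flat comment keys above more concrete,
here is a rough sketch in plain Python. It is not the parser code: the helper
names (SP_TO_UP_KEYS, sp_primary_gene_name, flatten_comments) and the nested
dictionary layout are assumptions made up for illustration, and the comment
strings are placeholders, not real Q13639 data.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of Biopython or of the UniProt XML parser discussed in this thread.

# SP annotation key -> UP annotation key, as listed in the comparison.
SP_TO_UP_KEYS = {
    "modified": "date_last_annotation_update",
    "sequence_modified": "date_last_sequence_update",
    "gene_name": "gene_name_primary",
}


def sp_primary_gene_name(sp_gene_name):
    """Pull the primary name out of an SP string such as 'Name=HTR4;'."""
    for field in sp_gene_name.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep and key == "Name":
            return value
    return None


def flatten_comments(comment_tree):
    """Flatten a nested comment tree into flat annotation keys.

    Assumed input layout (hypothetical):
        {comment_type: [strings]}  or  {comment_type: {attribute: [strings]}}
    Output keys follow the pattern described in the mail:
        comment_<type>  and  comment_<type>_<attribute>
    """
    flat = {}
    for ctype, content in comment_tree.items():
        if isinstance(content, dict):
            for attr, values in content.items():
                flat["comment_%s_%s" % (ctype, attr)] = list(values)
        else:
            flat["comment_%s" % ctype] = list(content)
    return flat


# Placeholder strings stand in for the real comment text.
tree = {
    "function": ["<function text>"],
    "subcellularlocation": {"location": ["<location text>"]},
}
flat = flatten_comments(tree)
print(sorted(flat))
# ['comment_function', 'comment_subcellularlocation_location']
print(sp_primary_gene_name("Name=HTR4;"))
# HTR4
```

Every value in the flattened dictionary is a plain list of strings under a
single string key, which is the reason for preferring flat keys over a nested
tree here.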
Andrea