[Biopython-dev] New: Uniprot XML parser
Mauro
mauro at biodec.com
Thu Jan 21 20:09:28 UTC 2010
On 01/21/2010 01:01 PM, Andrea Pierleoni wrote:
>
>>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>>> detailed
>>> comparison between the two parsed seqrecords.
>>
>> Great.
>>
>> Peter
>>
>
>
> As mentioned earlier, Mauro did a code review and added unit tests for the
> parser in Tests/test_Uniprot.py.
> The updated version is available in the GitHub repository:
> http://github.com/apierleoni/biopython
>
> Since this version is mature enough, I spent some time comparing the output
> of this UniProt XML (UP) parser with that of the SwissProt (SP) plain text parser.
> The comparison was done using the Q13639 UniProt entry.
I also wrote a test for this case. Currently the test fails; you can see
the report made by Andrea below. If we agree on the differences between
the two SeqRecords, I will do the work to change the test.
Mauro.
>
> These are the main differences between the two generated SeqRecords:
>
> - id: is the same (first accession)
> - name: is the same
> - description: UP reports the recommended name's full name value, while
> additional names and synonyms are in the annotations. SP reports a
> long string containing everything parsed as it is from the plain
> text.
> - dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
> NCBI Taxonomy and Swiss-Prot/TrEMBL dbxrefs
> - seq: is the same
> - features: missing in SP (I have to check with Peter's patch)
> - annotations:
> - - identical annotations: accessions, keywords, taxonomy, organism
> - - mapped annotations:
> date_last_annotation_update in UP ---> modified in SP
> date_last_sequence_update in UP ---> sequence_modified in SP
> gene_name_primary in UP ---> gene_name in SP
> >>> SP.annotations['gene_name']
> 'Name=HTR4;'
> >>> UP.annotations['gene_name_primary']
> 'HTR4'
> ncbi_taxid in SP ---> UP dbxrefs, since it is mapped as a
> dbReference in the XML file
> - - references: have some minor differences.
> The final semicolon and double quotes are missing in UP for both the author
> and title fields.
> In UP, reference comments are reported as:
> "PublicationType | PublicationDate | Scope | Tissue"
> For the submission publication type, the db is reported in the comments
> and not in the journal field.
> - - comments: here come the big differences.
> SP has all comments in a single string.
> UP comments are mapped to several annotation entries, using the comment
> type and attributes to build the annotation key.
> E.g.
> comment_function --> list of "function" type comment strings
> comment_subcellularlocation_location --> list of "location"
> strings in the subcellularlocation comment field
>
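For the comparison test, the mapped annotations above could be handled by
renaming the SP keys to the UP ones before comparing. This is only a rough
sketch: SP_TO_UP_KEYS, normalise_gene_name and sp_to_up_annotations are
made-up names, not part of either parser, and the sample dictionary is
illustrative. Only the gene name value is converted; the date values are
passed through unchanged, on the assumption that their formats would be
checked separately.

```python
# Hypothetical mapping from SwissProt (SP) annotation keys to the UniProt
# XML (UP) keys, taken from the list of mapped annotations in this thread.
SP_TO_UP_KEYS = {
    "modified": "date_last_annotation_update",
    "sequence_modified": "date_last_sequence_update",
    "gene_name": "gene_name_primary",
}


def normalise_gene_name(value):
    """Turn an SP gene name such as 'Name=HTR4;' into 'HTR4'."""
    for part in value.split(";"):
        part = part.strip()
        if part.startswith("Name="):
            return part[len("Name="):]
    # No 'Name=' field found: return the value untouched.
    return value


def sp_to_up_annotations(sp_annotations):
    """Return a copy of the SP annotations using the UP key names."""
    result = {}
    for key, value in sp_annotations.items():
        new_key = SP_TO_UP_KEYS.get(key, key)
        if key == "gene_name":
            value = normalise_gene_name(value)
        result[new_key] = value
    return result


# Illustrative input only; the keywords value is a placeholder.
sp = {"gene_name": "Name=HTR4;", "keywords": ["placeholder keyword"]}
converted = sp_to_up_annotations(sp)
print(converted["gene_name_primary"])
```

With this, a test could compare the converted SP annotations against the UP
ones key by key, and list only the keys that still differ.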
> The comment tree in the XML could easily be mapped to a comment dictionary
> tree, but this would not be BioSQL safe.
>
>
> Andrea
>
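To illustrate the flat keys versus the dictionary tree: a nested comment
dictionary can be collapsed into the kind of keys Andrea describes
(comment_function, comment_subcellularlocation_location). This is a minimal
sketch and not the code in the parser; flatten_comments is a hypothetical
helper and the comment strings are placeholders. As far as I understand,
the flat form is what keeps the annotations storable as simple key/value
entries.

```python
def flatten_comments(tree, prefix="comment"):
    """Collapse a nested comment dictionary into flat annotation keys.

    Nested keys are joined with underscores, and every leaf value ends up
    as a list of strings.
    """
    flat = {}
    for key, value in tree.items():
        flat_key = "%s_%s" % (prefix, key)
        if isinstance(value, dict):
            # Recurse, carrying the key built so far as the new prefix.
            flat.update(flatten_comments(value, flat_key))
        else:
            if not isinstance(value, list):
                value = [value]
            flat.setdefault(flat_key, []).extend(value)
    return flat


# Placeholder comment tree, not real data from the Q13639 entry.
tree = {
    "function": ["placeholder function text"],
    "subcellularlocation": {"location": ["placeholder location"]},
}
flat = flatten_comments(tree)
for key in sorted(flat):
    print(key, flat[key])
```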
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev