[Biopython-dev] New: Uniprot XML parser

Mauro mauro at biodec.com
Thu Jan 21 15:09:28 EST 2010


On 01/21/2010 01:01 PM, Andrea Pierleoni wrote:
>
>>> Anyhow, unit testing is coming (thanks to Mauro) together with a
>>> detailed
>>> comparison between the two parsed seqrecords.
>>
>> Great.
>>
>> Peter
>>
>
>
> As mentioned earlier, Mauro did a code review and added unit tests for the
> parser in Tests/test_Uniprot.py.
> The updated version is available on the github repository:
> http://github.com/apierleoni/biopython
>
> Since this version is mature enough, I spent some time comparing the output
> of this UniProt XML (UP) parser with that of the SwissProt (SP) plain text
> parser.
> This comparison was done using the Q13639 UniProt entry.

I also wrote a test for this case. Currently the test fails; you can see
the report made by Andrea below. If we agree on the differences between
the two SeqRecords, I will update the test accordingly.

Mauro.

>
> These are the main differences between the two generated SeqRecords:
>
> - id:  is the same (first accession)
> - name: is the same
> - description: UP reports the full name value of the recommended name, while
>         additional names and synonyms are in the annotations. SP reports a
>         long string containing everything, parsed as it is from the plain
>         text.
> - dbxrefs: UP reports all the dbxrefs of SP, adding DOI, MEDLINE, PubMed,
>         NCBI Taxonomy and Swiss-Prot/TrEMBL dbxrefs
> - seq: is the same
> - features: missing in SP (I have to check with Peter's patch)
> - annotations:
> - - identical annotations: accessions, keywords, taxonomy, organism
> - - mapped annotations:
>         date_last_annotation_update in UP--->  modified in SP
>         date_last_sequence_update in UP--->  sequence_modified in SP
>         gene_name_primary in UP--->  gene_name in SP
>                 >>>  SP.annotations['gene_name']
>                 'Name=HTR4;'
>                 >>>  UP.annotations['gene_name_primary']
>                 'HTR4'
>         ncbi_taxid in SP --->  UP dbxrefs, since it is mapped as a
>                  dbReference in the XML file
> - - references: there are some minor differences.
>          The final semicolon and double quote are missing in UP for both
>              the author and title fields.
>          In UP, reference comments are reported as:
>              "PublicationType | PublicationDate | Scope | Tissue"
>          For the submission publication type, the db is reported in the
>              comments and not in the journal field.
> - - comments: here come the big differences.
>         SP has all comments in a single string.
>         UP comments are mapped to several annotation entries, using the comment
>            type and attributes to build the annotation key.
>            E.g.
>            comment_function -->  list of "function" type comment strings
>            comment_subcellularlocation_location -->  list of "location"
>                 strings in the subcellularlocation comment field
>
>         The comments tree in the XML could easily be mapped to a comment
>         dictionary tree, but this would not be BioSQL safe.
>
>
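
The mapped annotation keys listed above could be handled in a comparison
test roughly as follows. This is only a sketch on plain dictionaries: the
key names and the two gene name values come from the report, while the
helper names (rename_up_keys, sp_primary_gene_name) are hypothetical and
not part of either parser.

```python
# UP key -> SP key, as listed under "mapped annotations".
UP_TO_SP_KEYS = {
    "date_last_annotation_update": "modified",
    "date_last_sequence_update": "sequence_modified",
    "gene_name_primary": "gene_name",
}


def rename_up_keys(up_annotations):
    """Return a copy of the UP annotations using the SP key names."""
    return {UP_TO_SP_KEYS.get(key, key): value
            for key, value in up_annotations.items()}


def sp_primary_gene_name(sp_gene_name):
    """Extract 'HTR4' from an SP-style string such as 'Name=HTR4;'."""
    for field in sp_gene_name.split(";"):
        field = field.strip()
        if field.startswith("Name="):
            return field[len("Name="):]
    return None


# Toy annotation dicts; a real test would take them from the SeqRecords.
up = {"gene_name_primary": "HTR4", "accessions": ["Q13639"]}
sp = {"gene_name": "Name=HTR4;", "accessions": ["Q13639"]}

renamed = rename_up_keys(up)
same_gene = renamed["gene_name"] == sp_primary_gene_name(sp["gene_name"])
```

Keys that are identical in both parsers (accessions, keywords, taxonomy,
organism) pass through rename_up_keys unchanged and can be compared directly.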
> Andrea
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


