[Biopython-dev] New: Uniprot XML parser
Andrea Pierleoni
andrea at biocomp.unibo.it
Tue Jul 27 16:37:59 UTC 2010
>> XML descriptions are clearer, but have some problems as well.
>> Some features do not have a start and end point; in this case I skipped
>> them.
>
> If you have some specific examples (IDs) to hand that would be useful.
>
Try this:
http://www.uniprot.org/uniprot/Q8NE62.xml
The "error" refers to the old '?' symbol in feature positions.
The entry carries this feature:
<feature type="transit peptide" description="Mitochondrion" status="potential">
  <location>
    <begin position="1"/>
    <end status="unknown"/>
  </location>
</feature>
I'm currently skipping all the features/comments carrying a
status="unknown" attribute in the start or end position, or both.
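A minimal sketch of that skipping rule, using only the standard library rather than the actual parser code. The helper name `feature_span` is hypothetical, and the snippet is parsed without the XML namespace that a full UniProt file declares, so the tag lookups would need adjusting for a real entry:

```python
import xml.etree.ElementTree as ET

# The feature from Q8NE62 quoted above, without the UniProt namespace.
FEATURE_XML = """
<feature type="transit peptide" description="Mitochondrion" status="potential">
  <location>
    <begin position="1"/>
    <end status="unknown"/>
  </location>
</feature>
"""


def feature_span(feature):
    """Return (begin, end) as ints, or None if either position is unknown.

    Hypothetical helper: it only handles <begin>/<end> pairs, not
    single-residue <position> locations.
    """
    location = feature.find("location")
    if location is None:
        return None
    span = []
    for tag in ("begin", "end"):
        elem = location.find(tag)
        if elem is None or elem.get("status") == "unknown":
            return None
        position = elem.get("position")
        if position is None:
            return None
        span.append(int(position))
    return tuple(span)


feature = ET.fromstring(FEATURE_XML.strip())
span = feature_span(feature)
if span is None:
    print("skipping feature:", feature.get("type"))
else:
    print("feature spans", span)
```

Returning None instead of raising keeps the decision (skip, or store with an unknown end) in the calling code.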
other examples:
3HIDH_DICDI
ADAM1_RAT
ADAM1_RAT
ADM1B_MOUSE
ADM1B_MOUSE
CARDH_CYNCA
CARDH_CYNCA
CHDH_HUMAN
COQ41_PARTE
COQ4_CHAGB
COQ4_LEIMA
COX11_DICDI
COX11_DICDI
COX16_NEUCR
...
>
> I agree we're not going to get 100% identical records.
good
>
> Perhaps you are using "schema" in a different way that I would. All the
> projects use the same schema (where I mean database tables), but
> there are differences in the details of how each file format gets parsed
> and ends up stored in those tables.
Yes, I'm referring to the data schema in general, not strictly the BioSQL schema.
I don't intend to change the BioSQL schema.
>
>> I think we have 3 choices:
>>
>> 1) follow BioPerl, whatever they do (could be good)
>> 2) try to define our own rules (bad)
>> 3) set a defined open schema and propose it to BioSQL (good)
>
> If in (3) you mean we should have some clear examples of major file
> formats and how each field should end up in BioSQL, I agree. In the
> short to medium term I regard the bioperl-db mapping as the reference
> implementation (although their code does continue to change), i.e. (1).
>
> I found one of the threads I was thinking about in the archive,
> http://bioperl.org/pipermail/biosql-l/2010-January/001672.html
> http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html
> http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html
So does it make sense to follow their code and their changes?
This would be valid just for BioPerl and Biopython.
>
>> In my parser I'm storing information from the comments as annotations
>> in the seqrecords, building annotation keys on the basis of the XML
>> tree. This is a quick and dirty hack, but it can be done much better.
>>
>> We could store complex comment fields as XML, but I'm not inclined
>> to use just a big XML string in the comment field.
>
> Some sort of nested structure like a dictionary? Are you familiar
> with the Perl TagTree, which is what BioPerl are using here? I think
> Richard Holland said (in the above linked thread) that BioJava just
> sticks the DE section as an XML string into their record object
> (and thus puts XML in the BioSQL database?).
>
I'm not familiar with TagTree, but I looked at it during that
discussion, and I do not see any advantage in using it explicitly
in the db fields instead of XML.
I would save XML text in the DB, easily readable from every language
and even by humans. XML text can also be queried easily. Then I'd represent
this XML as a nested dictionary structure similar to the Perl TagTree.
I don't know if there is any Python implementation of this...
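To illustrate the idea, here is a rough sketch of turning an XML fragment into a nested dictionary with the standard library. This is not TagTree and not the parser's code; the function name `element_to_dict`, the "@" prefix for attributes and the "#text" key are all conventions I'm assuming for the example, and the sample comment element is only illustrative:

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an Element into a nested dict (hypothetical convention).

    Attributes are stored under "@name", element text under "#text",
    and children under their tag name as a list, since a tag may repeat.
    """
    node = {"@" + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


# Illustrative comment element, not copied from a real entry.
COMMENT_XML = '<comment type="function"><text>Binds DNA.</text></comment>'

tree = element_to_dict(ET.fromstring(COMMENT_XML))
print(tree)
```

Keeping every child in a list makes the structure uniform at the cost of some extra indexing; mixed content (text after child elements) is ignored in this sketch.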
>> Also keep in mind that the "comment" field is no longer called comments
>> on the UniProt web site but "general annotations", so maybe it makes
>> sense to store this data as annotations in some other place.
>
> Sounds sensible.
You can use XML here too, if needed.
Also, by using XML we would be able to store dictionary-containing seqrecords
in a BioSQL db. A big plus to me.
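As a sketch of what I mean, a nested annotation dictionary can be serialised to an XML string (which could then go into a text field) and read back. Nothing here touches BioSQL itself; the function names, the "item" tag used for list entries and the sample annotations are assumptions made up for the example:

```python
import xml.etree.ElementTree as ET


def value_to_element(tag, value):
    """Build an Element from a str, list or dict (hypothetical helper).

    Dict keys must be valid XML tag names; list entries are wrapped
    in <item> elements; everything else is stored as text.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(value_to_element(key, child))
    elif isinstance(value, (list, tuple)):
        for item in value:
            elem.append(value_to_element("item", item))
    else:
        elem.text = str(value)
    return elem


def element_to_value(elem):
    """Inverse of value_to_element; all leaf values come back as str."""
    children = list(elem)
    if not children:
        return elem.text or ""
    if all(child.tag == "item" for child in children):
        return [element_to_value(child) for child in children]
    return {child.tag: element_to_value(child) for child in children}


# Made-up annotations, only to exercise the round trip.
annotations = {
    "function": "Binds DNA",
    "subcellular_location": {"location": "Mitochondrion",
                             "keywords": ["membrane", "matrix"]},
}

xml_text = ET.tostring(value_to_element("annotations", annotations),
                       encoding="unicode")
restored = element_to_value(ET.fromstring(xml_text))
print(xml_text)
```

The round trip is lossy for non-string leaves (numbers come back as strings), for empty containers, and for a dict whose only key is "item", so a real mapping would need a stricter convention.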
>
> I'll (re-)post that as a specific query on the open-bio-l mailing list...
>
It looks like everybody agrees on "uniprot-xml".
Andrea