[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython

Eric Talevich eric.talevich at gmail.com
Thu Jun 18 16:22:17 EDT 2009


On Thu, Jun 18, 2009 at 5:35 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> If you can show us a sample record, I would be better able to comment
> on how I would store it in a SeqRecord.
>

Here are a couple of examples from files in Test/PhyloXML/. From
phyloxml_examples.xml, a contrived demonstration of various features:

<clade>
  <name>A</name>
  <taxonomy>
    <scientific_name>E. coli</scientific_name>
  </taxonomy>
  <sequence>
    <annotation>
      <desc>alcohol dehydrogenase</desc>
      <confidence type="probability">0.99</confidence>
    </annotation>
  </sequence>
</clade>

<clade>
  <taxonomy>
    <scientific_name>Caenorhabditis elegans</scientific_name>
  </taxonomy>
  <sequence id_source="z">
    <symbol>ADHX</symbol>
    <accession source="ncbi">Q17335</accession>
    <name>alcohol dehydrogenase</name>
    <annotation ref="InterPro:IPR002085"/>
  </sequence>
</clade>

(An extra level of context is shown -- information that doesn't fit into a
SeqRecord could conceivably be moved up into the Clade object.)
Assuming values of the SeqRecord.annotations dictionary can also be
dictionaries, this isn't too hard to convert to primitive types.
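
As a rough sketch of what "convert to primitive types" could mean here, the
snippet below flattens the second <sequence> element into nested dicts and
strings using only the standard library. The function name
element_to_primitives and the "value" key are made up for illustration, and
the XML namespace that real phyloXML files declare is ignored; the result is
the kind of structure that could be stored in a SeqRecord's annotations.

```python
import xml.etree.ElementTree as ET

# The second clade from the examples above, without the phyloXML namespace.
CLADE_XML = """
<clade>
  <taxonomy>
    <scientific_name>Caenorhabditis elegans</scientific_name>
  </taxonomy>
  <sequence id_source="z">
    <symbol>ADHX</symbol>
    <accession source="ncbi">Q17335</accession>
    <name>alcohol dehydrogenase</name>
    <annotation ref="InterPro:IPR002085"/>
  </sequence>
</clade>
"""


def element_to_primitives(elem):
    """Recursively convert an element into strings, dicts and lists.

    Hypothetical helper: XML attributes and child tags share one dict,
    so an attribute and a child with the same name would collide.
    """
    text = (elem.text or "").strip()
    if not elem.attrib and len(elem) == 0:
        return text  # plain leaf element: just its text
    result = dict(elem.attrib)
    if text:
        result["value"] = text  # "value" is an arbitrary key for the text
    for child in elem:
        converted = element_to_primitives(child)
        if child.tag in result:
            # Repeated tag: collect the values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(converted)
        else:
            result[child.tag] = converted
    return result


sequence_elem = ET.fromstring(CLADE_XML).find("sequence")
annotations = element_to_primitives(sequence_elem)
```

Here annotations["accession"] is itself a dictionary holding the source and
the accession value, which is the nested-dictionary case mentioned above.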

Another example from apaf.xml, which appears to be real data:

<sequence>
  <domain_architecture length="1249">
    <domain from="6" to="90" confidence="7.0E-26">CARD</domain>
    <domain from="109" to="414" confidence="7.2E-117">NB-ARC</domain>
    <domain from="605" to="643" confidence="2.4E-6">WD40</domain>
    <domain from="647" to="685" confidence="1.1E-12">WD40</domain>
    <domain from="689" to="729" confidence="2.4E-7">WD40</domain>
    <domain from="733" to="771" confidence="4.7E-14">WD40</domain>
    <domain from="872" to="910" confidence="2.5E-8">WD40</domain>
    <domain from="993" to="1031" confidence="4.6E-6">WD40</domain>
    <domain from="1075" to="1113" confidence="6.3E-7">WD40</domain>
    <domain from="1117" to="1155" confidence="1.4E-7">WD40</domain>
    <domain from="1168" to="1204" confidence="0.3">WD40</domain>
  </domain_architecture>
</sequence>

The DomainArchitecture element refers to domains in a protein sequence,
according to the spec. This could be reasonably represented as a list of
SeqFeature objects, I see now. But when converting a SeqRecord back to
PhyloXML, not all SeqFeatures would be protein domains, and I don't know
what to do with those.
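
One possible way around the round-trip problem, sketched with a plain
dataclass standing in for SeqFeature so it runs without Biopython: tag each
feature parsed from <domain> with a marker type, and write back only the
features carrying that marker. The Feature class, the "protein_domain"
marker, both function names and the extra "signal_peptide" feature are all
assumptions for illustration. Coordinates are copied as-is from the
from/to attributes, with no index conversion.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

DOMAIN_TYPE = "protein_domain"  # hypothetical marker for PhyloXML domains


@dataclass
class Feature:
    """Minimal stand-in for a SeqFeature: a typed span with qualifiers."""
    type: str
    start: int
    end: int
    qualifiers: dict


# First three domains from the apaf.xml example above.
SEQUENCE_XML = """
<sequence>
  <domain_architecture length="1249">
    <domain from="6" to="90" confidence="7.0E-26">CARD</domain>
    <domain from="109" to="414" confidence="7.2E-117">NB-ARC</domain>
    <domain from="605" to="643" confidence="2.4E-6">WD40</domain>
  </domain_architecture>
</sequence>
"""


def domains_to_features(sequence_elem):
    """Turn each <domain> element into a Feature tagged as a domain."""
    return [
        Feature(
            type=DOMAIN_TYPE,
            start=int(dom.get("from")),
            end=int(dom.get("to")),
            qualifiers={
                "name": dom.text,
                "confidence": float(dom.get("confidence")),
            },
        )
        for dom in sequence_elem.iter("domain")
    ]


def features_to_domain_architecture(features, length):
    """Write back only the features tagged as domains; skip the rest."""
    arch = ET.Element("domain_architecture", length=str(length))
    for feat in features:
        if feat.type != DOMAIN_TYPE:
            continue  # not a protein domain, so it has no place here
        dom = ET.SubElement(arch, "domain", {
            "from": str(feat.start),
            "to": str(feat.end),
            "confidence": repr(feat.qualifiers["confidence"]),
        })
        dom.text = feat.qualifiers["name"]
    return arch


features = domains_to_features(ET.fromstring(SEQUENCE_XML))
# A made-up non-domain feature, to show that it is dropped on output.
features.append(Feature("signal_peptide", 1, 5, {}))
arch = features_to_domain_architecture(features, length=1249)
```

The non-domain features are silently dropped in this sketch; whether they
should instead be stored elsewhere in the PhyloXML tree is still open.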

The new SeqRecord chapter is very informative -- I was originally just
looking at the wiki and epydoc pages. Still unclear: why doesn't the
SeqRecord constructor take annotations as an optional argument? Should it?

Thanks,
Eric

