[Biopython] help with seqxml format

Alan alanwilter at gmail.com
Wed Jan 22 16:28:57 UTC 2014


I have an input fasta file (test.fasta), like:

>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA
SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP
RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ
YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE
GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT
NASLPLNQSSIPWQVFFMLKVSFLLVCIL

Then I am trying this:

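from Bio import SeqIO
from Bio.Alphabet import generic_protein

# Read every record from the FASTA file as a protein sequence
with open("test.fasta") as handle:
    records = list(SeqIO.parse(handle, "fasta", generic_protein))
aa = records[0]

# Serialise the first record as SeqXML
print(aa.format('seqxml'))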
<?xml version="1.0" encoding="utf-8"?>
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4"
xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd">
 <entry id="tr|A0A4W9|A0A4W9_MOUSE">
  <description>growth regulator 1</description>

<AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
 </entry>
</seqXML>

Note that the output above does not carry over all of the information
in the FASTA header.
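
For reference, the remaining header fields could be pulled out by hand
with a regular expression. This is only a rough sketch using the Python
standard library: the helper name parse_uniprot_header is made up for
illustration (it is not part of Biopython), and it assumes the
UniProt-style layout "db|accession|name description OS=... GN=... PE=...
SV=..." of the header shown above.

```python
import re

# Hypothetical helper, not part of Biopython. Assumes a UniProt-style
# description: "db|ACCESSION|ENTRY_NAME free text OS=... GN=... PE=... SV=..."
HEADER_RE = re.compile(
    r"^(?P<db>\w+)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_uniprot_header(description):
    """Split a UniProt-style FASTA description into a dict of fields."""
    match = HEADER_RE.match(description)
    if match is None:
        raise ValueError("not a UniProt-style header: %r" % description)
    info = match.groupdict()
    rest = info.pop("rest")
    # With one capturing group, re.split returns
    # [free text, key, value, key, value, ...]
    parts = FIELD_RE.split(rest)
    info["description"] = parts[0].strip()
    for key, value in zip(parts[1::2], parts[2::2]):
        info[key] = value.strip()
    return info


header = ("tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 "
          "OS=Mus musculus GN=Negr1 PE=2 SV=1")
fields = parse_uniprot_header(header)
print(fields["accession"], fields["description"], fields["OS"], fields["GN"])
```

The same function could be called on aa.description, since the full
header line ends up there after parsing.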

What I want is to tweak this so the output looks more like this:
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4"
xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd"
source="QfO http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2014_04">
 <entry id="A0A4W9" source="UniProtKB">
  <species name="Mus musculus" ncbiTaxID="10090" />
  <description>Neuronal growth regulator 1</description>

 <AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
  <DBRef type="gene" source="GN" id="Negr1" />
  <DBRef type="gene" source="Gene" id="TK0418" />
  <property name="PE" value="2" />
  <property name="SV" value="1" />
 </entry>
</seqXML>

Updating aa.id and aa.description is not a problem, and some of the
information (such as ncbiTaxID and the species name) I will have to
supply from elsewhere. But how do I add the extra attributes to <seqXML>
and <entry> (e.g. source), or create the <species>, <DBRef> etc.
elements?
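
To make the target concrete, here is a rough sketch that builds the
structure by hand with Python's standard xml.etree.ElementTree, bypassing
Biopython's SeqXML writer entirely. The helper name build_seqxml and the
dictionary keys are made up for illustration, and the sequence is
truncated for brevity. Whether the same result can be obtained through
Bio.SeqIO directly is the open question.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Make ElementTree emit the conventional "xsi" prefix for this namespace
ET.register_namespace("xsi", XSI)


def build_seqxml(entries, source, source_version):
    """Hypothetical helper: build a seqXML tree from plain dictionaries."""
    root = ET.Element("seqXML", {
        "seqXMLversion": "0.4",
        "{%s}noNamespaceSchemaLocation" % XSI:
            "http://www.seqxml.org/0.4/seqxml.xsd",
        "source": source,
        "sourceVersion": source_version,
    })
    for item in entries:
        node = ET.SubElement(
            root, "entry", {"id": item["id"], "source": item["source"]})
        ET.SubElement(node, "species",
                      {"name": item["species"], "ncbiTaxID": item["taxid"]})
        ET.SubElement(node, "description").text = item["description"]
        ET.SubElement(node, "AAseq").text = item["sequence"]
        for ref_type, ref_source, ref_id in item["dbrefs"]:
            ET.SubElement(node, "DBRef",
                          {"type": ref_type, "source": ref_source, "id": ref_id})
        for name, value in item["properties"]:
            ET.SubElement(node, "property", {"name": name, "value": value})
    return root


example = {
    "id": "A0A4W9",
    "source": "UniProtKB",
    "species": "Mus musculus",
    "taxid": "10090",
    "description": "Neuronal growth regulator 1",
    "sequence": "MVLLAQGACC",  # truncated here; use the full sequence in practice
    "dbrefs": [("gene", "GN", "Negr1")],
    "properties": [("PE", "2"), ("SV", "1")],
}
root = build_seqxml([example],
                    source="QfO http://www.ebi.ac.uk/reference_proteomes/",
                    source_version="2014_04")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

This writes everything on a single line; indentation would need to be
added separately if it matters.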

Many thanks in advance,

Alan



