[Biopython] help with seqxml format
Alan
alanwilter at gmail.com
Wed Jan 22 16:28:57 UTC 2014
I have an input fasta file (test.fasta), like:
>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA
SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP
RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ
YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE
GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT
NASLPLNQSSIPWQVFFMLKVSFLLVCIL
Then I am trying this:
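from Bio import SeqIO
from Bio.Alphabet import generic_protein

# Read every record from the FASTA file, tagging each sequence as protein
with open("test.fasta") as handle:
    records = list(SeqIO.parse(handle, "fasta", generic_protein))
aa = records[0]
# Render the first record as SeqXML
print(aa.format("seqxml"))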
<?xml version="1.0" encoding="utf-8"?>
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4"
xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd">
<entry id="tr|A0A4W9|A0A4W9_MOUSE">
<description>growth regulator 1</description>
<AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
</entry>
</seqXML>
Note that in the output above, SeqIO.parse is not picking up all the
information in the FASTA header.
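As a rough illustration of the header fields involved, a UniProt-style header like the one in test.fasta could be split by hand. This is only a sketch: parse_uniprot_header is a hypothetical helper, not part of Biopython, and it assumes the "db|accession|name description KEY=value ..." layout seen above, with two-letter uppercase keys such as OS, GN, PE and SV.

```python
import re

# Matches "KEY=value" where KEY is two uppercase letters and the value runs
# until the next " KEY=" or the end of the header (assumed layout).
FIELD_RE = re.compile(r"\b([A-Z]{2})=(.*?)(?=\s+[A-Z]{2}=|$)")


def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into its parts (hypothetical helper)."""
    header = header.lstrip(">").strip()
    # The first whitespace-separated token is "db|accession|entry_name".
    first, _, rest = header.partition(" ")
    db, accession, entry_name = first.split("|")
    # Everything before the first KEY= field is the free-text description.
    first_field = re.search(r"\s[A-Z]{2}=", rest)
    description = rest[: first_field.start()] if first_field else rest
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description.strip(),
        "fields": dict(FIELD_RE.findall(rest)),
    }


info = parse_uniprot_header(
    ">tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 "
    "OS=Mus musculus GN=Negr1 PE=2 SV=1"
)
print(info["accession"], info["description"], info["fields"])
```

The resulting dictionary could then be used to set aa.id and aa.description, or to supply the values for the extra elements.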
But I want to tweak this to output something more like this:
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4"
xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd"
source="QfO http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2014_04">
<entry id="A0A4W9" source="UniProtKB">
<species name="Mus musculus" ncbiTaxID="10090" />
<description>Neuronal growth regulator 1</description>
<AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
<DBRef type="gene" source="GN" id="Negr1" />
<DBRef type="gene" source="Gene" id="TK0418" />
<property name="PE" value="2" />
<property name="SV" value="1" />
</entry>
</seqXML>
Updating aa.id and aa.description would not be a problem, and some of the
information (such as ncbiTaxID and the species name) I have to provide from
elsewhere. But how do I add the extra attributes to <seqXML> and
<entry source>, or create the <species>, <DBRef>, etc. elements?
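For reference, the target layout can be produced without Bio.SeqIO at all, using only the standard library's xml.etree.ElementTree. This is a minimal sketch that sidesteps the SeqXML writer rather than configuring it; build_seqxml and the shape of its entry dictionary are made up for illustration, and the sequence is shortened to the first line of the FASTA record to keep the example small.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def build_seqxml(entry, source, source_version):
    """Build a one-entry seqXML document as a string (hypothetical helper)."""
    # Make the serializer emit the conventional "xsi" prefix.
    ET.register_namespace("xsi", XSI)
    root = ET.Element("seqXML", {
        "seqXMLversion": "0.4",
        "{%s}noNamespaceSchemaLocation" % XSI: "http://www.seqxml.org/0.4/seqxml.xsd",
        "source": source,
        "sourceVersion": source_version,
    })
    node = ET.SubElement(root, "entry", {"id": entry["id"], "source": entry["source"]})
    ET.SubElement(node, "species", {
        "name": entry["species"],
        "ncbiTaxID": entry["ncbi_taxid"],
    })
    ET.SubElement(node, "description").text = entry["description"]
    ET.SubElement(node, "AAseq").text = entry["sequence"]
    # Each DBRef is a (type, source, id) tuple; each property a (name, value) pair.
    for ref_type, ref_source, ref_id in entry["dbrefs"]:
        ET.SubElement(node, "DBRef", {"type": ref_type, "source": ref_source, "id": ref_id})
    for name, value in entry["properties"]:
        ET.SubElement(node, "property", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


xml_text = build_seqxml(
    {
        "id": "A0A4W9",
        "source": "UniProtKB",
        "species": "Mus musculus",
        "ncbi_taxid": "10090",
        "description": "Neuronal growth regulator 1",
        # Shortened: only the first line of the sequence from test.fasta.
        "sequence": "MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA",
        "dbrefs": [("gene", "GN", "Negr1")],
        "properties": [("PE", "2"), ("SV", "1")],
    },
    source="QfO http://www.ebi.ac.uk/reference_proteomes/",
    source_version="2014_04",
)
print(xml_text)
```

Building the tree by hand gives full control over the attributes and child elements, at the cost of not reusing the SeqRecord objects that SeqIO.parse returns.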
Many thanks in advance,
Alan