[Biopython] help with seqxml format

Wed Jan 22 16:53:08 UTC 2014

On Wed, Jan 22, 2014 at 4:28 PM, Alan <alanwilter at gmail.com> wrote:
> I have an input fasta file (test.fasta), like:
>
> >tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
> MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA
> SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP
> RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ
> YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE
> GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT
> NASLPLNQSSIPWQVFFMLKVSFLLVCIL
>
> Then I am trying this:
>
> from Bio import SeqIO
> from Bio.Alphabet import generic_protein
> handle = open("test.fasta")
> records = list(SeqIO.parse(handle, "fasta", generic_protein))
> aa = records[0]
>
> print aa.format('seqxml')
> <?xml version="1.0" encoding="utf-8"?>
> <seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> seqXMLversion="0.4" xsi:noNamespaceSchemaLocation="
> http://www.seqxml.org/0.4/seqxml.xsd">
>  <entry id="tr|A0A4W9|A0A4W9_MOUSE">
>   <description>growth regulator 1</description>
>
> <AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
>  </entry>
> </seqXML>
>
> Note above that my SeqIO.parse is not picking up all the info in the FASTA
> header.

Odd, what does aa.description give you?

> But I want to tweak this to output something more like this:
> ...
>   <species name="Mus musculus" ncbiTaxID="10090" />
>   <description>Neuronal growth regulator 1</description>
>
> Updating aa.id and aa.description wouldn't be a problem, and some info I
> have to provide from elsewhere (like ncbiTaxID and species name). But how
> do I add the details in <seqXML> and <entry source>, or create <species>,
> <DBRef> etc.?

Set record.annotations["organism"] and record.annotations["ncbi_taxid"]
to suitable strings, and set the list record.dbxrefs = ["db:identifier", ...].
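
For example, here is a minimal sketch of that. It assumes a recent
Biopython (1.78 or later, on Python 3), where Bio.Alphabet no longer exists,
the molecule type is given in the annotations, and SeqXML is written to a
binary handle. The shortened sequence and the "UniProtKB:A0A4W9"
cross-reference are illustrative values only, not something taken from
your file:

```python
from io import BytesIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Shortened sequence for illustration; normally the record comes from SeqIO.parse.
record = SeqRecord(
    Seq("MVLLAQGACCSNQWLAAVLLSLCSCLPAGQ"),
    id="tr|A0A4W9|A0A4W9_MOUSE",
    description="Neuronal growth regulator 1",
)

# Assumption: recent Biopython needs the molecule type to pick <AAseq>.
record.annotations["molecule_type"] = "protein"

# These two annotations together should produce the <species> element.
record.annotations["organism"] = "Mus musculus"
record.annotations["ncbi_taxid"] = "10090"

# Each "db:identifier" string should become one <DBRef> element
# (the database name here is a made-up example).
record.dbxrefs = ["UniProtKB:A0A4W9"]

# Write to an in-memory binary handle and decode the result for printing.
handle = BytesIO()
SeqIO.write([record], handle, "seqxml")
xml_text = handle.getvalue().decode("utf-8")
print(xml_text)
```

On an older Biopython the details will differ (for instance the alphabet
argument instead of "molecule_type"), which is one reason the version matters.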

Also what version of Biopython are you using?

Peter