[Biopython] help with seqxml format

Alan alanwilter at gmail.com
Thu Jan 23 13:40:42 UTC 2014


Thanks Peter,

I am using the latest version 1.63.

I found some mistakes of my own; aa.description is fine:

print aa
ID: tr|A0A4W9|A0A4W9_MOUSE
Name: tr|A0A4W9|A0A4W9_MOUSE
Description: tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus
musculus GN=Negr1 PE=2 SV=1
Number of features: 0
Seq('MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRC...CIL',
ProteinAlphabet())

print aa.format('seqxml')
<?xml version="1.0" encoding="utf-8"?>
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4" xsi:noNamespaceSchemaLocation="
http://www.seqxml.org/0.4/seqxml.xsd">
 <entry id="tr|A0A4W9|A0A4W9_MOUSE">
  <description>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus
musculus GN=Negr1 PE=2 SV=1</description>

<AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
 </entry>
</seqXML>

aa.id = 'A0A4W9'
aa.description = 'Neuronal growth regulator 1'
aa.annotations = {'PE': '2', 'ncbi_taxid': '10090',
                  'organism': 'Mus musculus', 'source': 'UniProtKB', 'SV': '1'}
aa.dbxrefs = ['GN:Negr1']

which now gives:
print aa.format('seqxml')
<?xml version="1.0" encoding="utf-8"?>
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
seqXMLversion="0.4" xsi:noNamespaceSchemaLocation="
http://www.seqxml.org/0.4/seqxml.xsd">
 <entry source="UniProtKB" id="A0A4W9">
  <species name="Mus musculus" ncbiTaxID="10090"></species>
  <description>Neuronal growth regulator 1</description>

<AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
  <DBRef source="GN" id="Negr1"></DBRef>
  <property name="SV" value="1"></property>
  <property name="PE" value="2"></property>
 </entry>
</seqXML>
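
The values assigned by hand above could probably be pulled out of the FASTA header instead. Below is a minimal sketch, assuming the header follows the UniProt layout shown in this thread (db|accession|entry name, then free text, then two-letter KEY=value tags). The helper name parse_uniprot_header is hypothetical, not part of Biopython, and the NCBI taxon ID is not in the header, so it still has to come from elsewhere.

```python
import re


def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header into accession, description, tags.

    Hypothetical helper (not a Biopython function). Assumes the layout
    'db|ACCESSION|ENTRY_NAME free text OS=... GN=... PE=... SV=...'.
    """
    first, _, rest = header.lstrip(">").partition(" ")
    parts = first.split("|")
    # Fall back to the whole first word if it is not 'db|acc|name'.
    accession = parts[1] if len(parts) >= 3 else first
    # Splitting on ' XX=' yields [description, key1, value1, key2, value2, ...].
    tokens = re.split(r"\s+([A-Z]{2})=", rest)
    description = tokens[0].strip()
    fields = dict(zip(tokens[1::2], (t.strip() for t in tokens[2::2])))
    return accession, description, fields


header = ("tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 "
          "OS=Mus musculus GN=Negr1 PE=2 SV=1")
accession, description, fields = parse_uniprot_header(header)

# Same keys as used on the record above; 'ncbi_taxid' must be added separately.
annotations = {"organism": fields["OS"], "PE": fields["PE"],
               "SV": fields["SV"], "source": "UniProtKB"}
dbxrefs = ["GN:" + fields["GN"]]
```

The results can then be assigned to aa.id, aa.description, aa.annotations and aa.dbxrefs exactly as in the lines above.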

This is almost what I want. The only thing I'd like to add is
source="QfO http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2013_04"
as attributes on the <seqXML> root tag. How would I do that,
please?
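
One workaround that does not depend on Biopython at all is to post-process the generated XML with the standard library. The sketch below is written for Python 3 and assumes the text comes from aa.format('seqxml'); the function name add_seqxml_source is hypothetical, and the sample document is a shortened stand-in with a truncated sequence. Bio.SeqIO.SeqXmlIO.SeqXmlWriter may also accept source and source_version arguments in its constructor, which would be the cleaner route, but that is an assumption to check against the documentation for your version.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Keep the usual 'xsi' prefix instead of an auto-generated 'ns0' on output.
ET.register_namespace("xsi", XSI)


def add_seqxml_source(xml_text, source, source_version):
    """Return xml_text with source/sourceVersion set on the root element.

    Hypothetical helper; the returned string has no XML declaration.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    root.set("source", source)
    root.set("sourceVersion", source_version)
    return ET.tostring(root, encoding="unicode")


# Shortened stand-in for the string returned by aa.format('seqxml').
seqxml_text = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'seqXMLversion="0.4" '
    'xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd">\n'
    ' <entry source="UniProtKB" id="A0A4W9">\n'
    '  <description>Neuronal growth regulator 1</description>\n'
    '  <AAseq>MVLLAQGACC</AAseq>\n'
    ' </entry>\n'
    '</seqXML>\n'
)

patched = add_seqxml_source(
    seqxml_text, "QfO http://www.ebi.ac.uk/reference_proteomes/", "2013_04"
)
print(patched)
```

Because the serialised string drops the <?xml ...?> declaration, it would need to be written back by hand if the consumer requires it.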

Many thanks again,

Alan



On 22 January 2014 16:53, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Wed, Jan 22, 2014 at 4:28 PM, Alan <alanwilter at gmail.com> wrote:
> > I have an input fasta file (test.fasta), like:
> >
> >>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus
> > GN=Negr1 PE=2 SV=1
> > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA
> > SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP
> > RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ
> > YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE
> > GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT
> > NASLPLNQSSIPWQVFFMLKVSFLLVCIL
> >
> > Then I am trying this:
> >
> > from Bio import SeqIO
> > from Bio.Alphabet import generic_protein
> > handle = open("test.fasta")
> > records = list(SeqIO.parse(handle, "fasta", generic_protein))
> > aa = records[0]
> >
> > print aa.format('seqxml')
> > <?xml version="1.0" encoding="utf-8"?>
> > <seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> > seqXMLversion="0.4" xsi:noNamespaceSchemaLocation="
> > http://www.seqxml.org/0.4/seqxml.xsd">
> >  <entry id="tr|A0A4W9|A0A4W9_MOUSE">
> >   <description>growth regulator 1</description>
> >
> >
> <AAseq>MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL</AAseq>
> >  </entry>
> > </seqXML>
> >
> > Note above that my SeqIO.parse is not picking up all the info in the
> > FASTA header.
>
> Odd, what does aa.description give you?
>
> > But I want to tweak this to output something more like this:
> > ...
> >   <species name="Mus musculus" ncbiTaxID="10090" />
> >   <description>Neuronal growth regulator 1</description>
> >
> > The aa.id and aa.description wouldn't be a problem to update, and some
> > info I have to provide from elsewhere (like ncbiTaxID and species name),
> > but how do I add the details in the <seqXML>, <entry source> or create
> > <species>, <DBRef> etc.?
>
> Set record.annotations["organism"] and record.annotations["ncbi_taxid"]
> to suitable strings, and the list record.dbxrefs = ["db:identifier", ...].
>
> Also what version of Biopython are you using?
>
> Peter
>



