[BioPython] Problem with blast xml

Sebastian Bassi sbassi at gmail.com
Thu Oct 4 06:47:44 UTC 2007


I have a problem that does not originate in Biopython, but it
affects the Biopython (1.43) BLAST XML parser.
I have two xml files, one can be parsed and the other can't.
Here are the commands I run to get the xml files:

sbassi@xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TAB.fasta \
    -e 0.0001 -m 7 -o TABB2.xml
sbassi@xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TABv2.fasta \
    -e 0.0001 -m 7 -o TABB2v2.xml

The only relevant difference is the input file: the sequences differ,
but the output files should have the same format (shouldn't they?).
When I parse the files, I find that this is not the case.
This is the file that can be parsed without problem:

>>> from Bio.Blast import NCBIXML
>>> bout=open('bioinfo/INTA/TABB2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 31'
>>> y.query
u'fragment 67'
>>> x.alignments
[<Bio.Blast.Record.Alignment instance at 0xb659850c>]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a3c6c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3cec>,
<Bio.Blast.Record.Alignment instance at 0xb65a3d8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3f8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e4c>,
<Bio.Blast.Record.Alignment instance at 0xb65aa1ac>]

Now the XML file that seems to be malformed:

>>> bout=open('bioinfo/INTA/TABB2v2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 1'
>>> y.query
u'fragment 57'
>>> x.alignments
[]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a374c>]

The first record has an empty list of alignments.
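
If the empty list only means that no hits were reported for that query (an assumption, not something confirmed here), one way to cope is to skip such records while iterating. The following is a minimal sketch: `records_with_hits` is a hypothetical helper name, and the demo objects are plain stand-ins that mimic the two records shown above, not real Biopython records.

```python
from types import SimpleNamespace


def records_with_hits(records):
    """Yield only the records whose alignments list is non-empty."""
    for record in records:
        if record.alignments:
            yield record


# Stand-in objects mimicking the two parsed records (hypothetical data,
# used so the sketch runs without Biopython or the XML files).
demo = [
    SimpleNamespace(query="fragment 1", alignments=[]),
    SimpleNamespace(query="fragment 57", alignments=["alignment"]),
]

kept = [record.query for record in records_with_hits(demo)]
print(kept)
```

With real data, the same generator could presumably wrap the iterator returned by NCBIXML.parse(bout), since it only relies on the alignments attribute.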

Here is a fragment of the "normal" one (TABB2.xml):

      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>31</Iteration_iter-num>
      <Iteration_query-ID>lcl|31_0</Iteration_query-ID>
      <Iteration_query-def>fragment 31 </Iteration_query-def>
      <Iteration_query-len>1174</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|1788520|gb|AE000309.1|AE000309</Hit_id>
          <Hit_def>Escherichia coli K-12 MG1655 section 199 of 400 of the complete genome</Hit_def>
          <Hit_accession>AE000309</Hit_accession>
          <Hit_len>13453</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>

Here is a fragment of the "malformed" one (TABB2v2.xml):

      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>400</Statistics_db-num>
          <Statistics_db-len>4662239</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.710603</Statistics_kappa>
          <Statistics_lambda>1.37406</Statistics_lambda>
          <Statistics_entropy>1.30725</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>57</Iteration_iter-num>
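
To check which iterations in a file lack hits without going through the Biopython parser, the XML can be inspected directly with the standard library. This is only a sketch: the inline sample is a trimmed stand-in that reuses the tag names from the fragments above, and `iterations_without_hits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a BLAST XML report; the tag names are copied from
# the fragments above, the content is illustrative only.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_hsp-len>0</Statistics_hsp-len>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>57</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def iterations_without_hits(root):
    """Return the iter-num text of every <Iteration> that contains no <Hit>."""
    missing = []
    for iteration in root.iter("Iteration"):
        if iteration.find("Iteration_hits/Hit") is None:
            missing.append(iteration.findtext("Iteration_iter-num"))
    return missing


root = ET.fromstring(SAMPLE)
no_hits = iterations_without_hits(root)
print(no_hits)
```

For a file on disk, the root could be obtained with ET.parse('TABB2v2.xml').getroot() instead of ET.fromstring.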

Why is this happening? Is this expected behavior?

I uploaded the XML files here:
http://www.bioinformatica.info/TABB2.xml
http://www.bioinformatica.info/TABB2v2.xml

-- 
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318


