[BioPython] Problem with blast xml

Michiel De Hoon mdehoon at c2b2.columbia.edu
Fri Oct 5 01:01:59 UTC 2007


Can you create two minimal XML files that demonstrate the problem?
For example, by removing records from the two files you have and checking if
parsing still works for one and fails for the other.
By doing so, you may be able to identify exactly what the essential
difference between the two files is.
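
One way to do this trimming without editing the files by hand is sketched
below. It uses only the Python standard library; the function name
keep_iterations and the SAMPLE string are made up for illustration, and the
element names are the ones visible in the fragments quoted further down.
Note that ElementTree does not write back the XML declaration or DOCTYPE, so
those may need to be re-added if the parser turns out to depend on them.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a BLAST XML file (hypothetical data, not the real output).
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration><Iteration_iter-num>1</Iteration_iter-num></Iteration>
    <Iteration><Iteration_iter-num>57</Iteration_iter-num></Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def keep_iterations(xml_text, keep):
    """Return XML text with only the <Iteration> elements whose
    zero-based position is in `keep`; everything else is left as-is."""
    root = ET.fromstring(xml_text)
    container = root.find("BlastOutput_iterations")
    # Copy the list first so removing elements does not disturb the loop.
    for index, iteration in enumerate(list(container.findall("Iteration"))):
        if index not in keep:
            container.remove(iteration)
    return ET.tostring(root, encoding="unicode")


# Keep only the second record, then try parsing the result again.
trimmed = keep_iterations(SAMPLE, {1})
```

Repeating this with different values of `keep` narrows down which record
makes the difference.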

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032



-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Sebastian Bassi
Sent: Thu 10/4/2007 2:47 AM
To: biopython at biopython.org
Subject: [BioPython] Problem with blast xml
 
I have a problem that does not originate in Biopython, but it
affects the Biopython (1.43) BLAST XML parser.
I have two XML files; one can be parsed and the other can't.
Here are the commands I ran to get the XML files:

sbassi at xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TAB.fasta \
    -e 0.0001 -m 7 -o TABB2.xml
sbassi at xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TABv2.fasta \
    -e 0.0001 -m 7 -o TABB2v2.xml

The only relevant difference is the input file: the sequences are
different, but the output files should have the same format (shouldn't
they?).
When I parse the files, I find that this is not the case.
This is the file that can be parsed without problems:

>>> from Bio.Blast import NCBIXML
>>> bout=open('bioinfo/INTA/TABB2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 31'
>>> y.query
u'fragment 67'
>>> x.alignments
[<Bio.Blast.Record.Alignment instance at 0xb659850c>]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a3c6c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3cec>,
<Bio.Blast.Record.Alignment instance at 0xb65a3d8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3f8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e4c>,
<Bio.Blast.Record.Alignment instance at 0xb65aa1ac>]

Now the file that seems to be malformed:

>>> bout=open('bioinfo/INTA/TABB2v2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 1'
>>> y.query
u'fragment 57'
>>> x.alignments
[]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a374c>]

The first record (fragment 1) has an empty alignments list.
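
To see how many records are affected, the records can be sorted by whether
their alignments list is empty. This is only a sketch: split_by_hits is a
made-up helper, and the demo objects merely stand in for the records that
NCBIXML.parse() returns, assuming nothing beyond the .query and .alignments
attributes used in the sessions above.

```python
from types import SimpleNamespace


def split_by_hits(records):
    """Return (queries with alignments, queries without alignments)."""
    with_hits, without_hits = [], []
    for record in records:
        # An empty alignments list is falsy, so this separates the two cases.
        target = with_hits if record.alignments else without_hits
        target.append(record.query)
    return with_hits, without_hits


# Stand-in records (hypothetical); real ones would come from NCBIXML.parse().
demo = [
    SimpleNamespace(query="fragment 1", alignments=[]),
    SimpleNamespace(query="fragment 57", alignments=[object()]),
]
print(split_by_hits(demo))  # (['fragment 57'], ['fragment 1'])
```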

Here is a fragment of the "normal" one (TABB2.xml):

      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>31</Iteration_iter-num>
      <Iteration_query-ID>lcl|31_0</Iteration_query-ID>
      <Iteration_query-def>fragment 31 </Iteration_query-def>
      <Iteration_query-len>1174</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|1788520|gb|AE000309.1|AE000309</Hit_id>
          <Hit_def>Escherichia coli K-12 MG1655 section 199 of 400 of the complete genome</Hit_def>
          <Hit_accession>AE000309</Hit_accession>
          <Hit_len>13453</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>

Here is a fragment of the "malformed" one (TABB2v2.xml):

      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>400</Statistics_db-num>
          <Statistics_db-len>4662239</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.710603</Statistics_kappa>
          <Statistics_lambda>1.37406</Statistics_lambda>
          <Statistics_entropy>1.30725</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>57</Iteration_iter-num>
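
The difference between the two fragments (an <Iteration> that carries only
<Iteration_stat> and no <Iteration_hits>) can also be checked directly on
the XML, independent of Biopython. A minimal sketch with the standard
library follows; iterations_without_hits and SAMPLE are illustrative names
and data, and the element names are taken from the fragments above.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the two kinds of iteration (hypothetical data).
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat><Statistics/></Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>57</Iteration_iter-num>
      <Iteration_hits><Hit><Hit_num>1</Hit_num></Hit></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def iterations_without_hits(xml_text):
    """Return the iter-num text of every <Iteration> that has no <Hit>."""
    root = ET.fromstring(xml_text)
    missing = []
    for iteration in root.iter("Iteration"):
        hits = iteration.find("Iteration_hits")
        # Treat a missing <Iteration_hits> and an empty one the same way.
        if hits is None or hits.find("Hit") is None:
            missing.append(iteration.findtext("Iteration_iter-num"))
    return missing


print(iterations_without_hits(SAMPLE))  # ['1']
```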

Why is this happening? Is this expected behavior?

I uploaded the xml files here:
http://www.bioinformatica.info/TABB2.xml
http://www.bioinformatica.info/TABB2v2.xml

-- 
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318
_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython




