[BioPython] Problem with blast xml
Sebastian Bassi
sbassi at gmail.com
Thu Oct 4 06:47:44 UTC 2007
I am having a problem that does not originate in Biopython, but it
affects the Biopython (1.43) BLAST XML parser.
I have two XML files: one can be parsed and the other can't.
Here are the commands I run to get the xml files:
sbassi@xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TAB.fasta \
    -e 0.0001 -m 7 -o TABB2.xml
sbassi@xubuntu:~/blast-2.2.16/bin$ ./blastall -p blastn \
    -d /media/vic300/BLASTdb/ecoli.nt \
    -i /media/vic300/INTA/mitofragsB2-TABv2.fasta \
    -e 0.0001 -m 7 -o TABB2v2.xml
The only relevant difference is the input file. The sequences are
different, but the output files should have the same format (shouldn't
they?).
When I parse the files, I find that this is not the case.
This is the file that can be parsed without problems:
>>> from Bio.Blast import NCBIXML
>>> bout=open('bioinfo/INTA/TABB2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 31'
>>> y.query
u'fragment 67'
>>> x.alignments
[<Bio.Blast.Record.Alignment instance at 0xb659850c>]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a3c6c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3cec>,
<Bio.Blast.Record.Alignment instance at 0xb65a3d8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3f8c>,
<Bio.Blast.Record.Alignment instance at 0xb65a3e4c>,
<Bio.Blast.Record.Alignment instance at 0xb65aa1ac>]
Now the file whose XML seems to be malformed:
>>> bout=open('bioinfo/INTA/TABB2v2.xml')
>>> b_records=NCBIXML.parse(bout)
>>> x=b_records.next()
>>> y=b_records.next()
>>> x.query
u'fragment 1'
>>> y.query
u'fragment 57'
>>> x.alignments
[]
>>> y.alignments
[<Bio.Blast.Record.Alignment instance at 0xb65a374c>]
The first record (fragment 1) has an empty list of alignments.
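As a possible workaround on the parsing side, records can be separated by whether they carry any alignments. This is only a minimal sketch: `split_by_hits` is a hypothetical helper, not part of Biopython, and the demo objects below are stand-ins that mimic just the `query` and `alignments` attributes shown in the sessions above. With real data one would presumably pass it the iterator returned by `NCBIXML.parse(bout)`.

```python
from types import SimpleNamespace


def split_by_hits(records):
    """Split records into (with alignments, without alignments).

    Hypothetical helper: it only assumes each record has an
    `alignments` list, as the parsed records above do.
    """
    with_hits, without_hits = [], []
    for record in records:
        # An empty list is falsy, so this separates the empty records.
        if record.alignments:
            with_hits.append(record)
        else:
            without_hits.append(record)
    return with_hits, without_hits


# Stand-in records mimicking the two parsed from TABB2v2.xml.
demo = [
    SimpleNamespace(query="fragment 1", alignments=[]),
    SimpleNamespace(query="fragment 57", alignments=["alignment"]),
]

with_hits, without_hits = split_by_hits(demo)
print([r.query for r in with_hits])
print([r.query for r in without_hits])
```

This does not explain why the empty record appears; it only keeps it from getting in the way of downstream code.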
Here is a fragment of the "normal" one (TABB2.xml):
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>31</Iteration_iter-num>
<Iteration_query-ID>lcl|31_0</Iteration_query-ID>
<Iteration_query-def>fragment 31 </Iteration_query-def>
<Iteration_query-len>1174</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|1788520|gb|AE000309.1|AE000309</Hit_id>
<Hit_def>Escherichia coli K-12 MG1655 section 199 of 400 of
the complete genome</Hit_def>
<Hit_accession>AE000309</Hit_accession>
<Hit_len>13453</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
Here is a fragment of the "malformed" one (TABB2v2.xml):
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_stat>
<Statistics>
<Statistics_db-num>400</Statistics_db-num>
<Statistics_db-len>4662239</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.710603</Statistics_kappa>
<Statistics_lambda>1.37406</Statistics_lambda>
<Statistics_entropy>1.30725</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
<Iteration>
<Iteration_iter-num>57</Iteration_iter-num>
Why is this happening? Is this expected behavior?
I uploaded the xml files here:
http://www.bioinformatica.info/TABB2.xml
http://www.bioinformatica.info/TABB2v2.xml
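The difference between the two fragments can also be checked without Biopython, using only the standard library. The sketch below lists the iterations that have no `<Hit>` element. `SAMPLE` is a heavily cut-down, hand-written stand-in for TABB2v2.xml (the contents of iteration 57 are assumed), and `iterations_without_hits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for TABB2v2.xml; only the tags needed for the check.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>400</Statistics_db-num>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>57</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def iterations_without_hits(root):
    """Return the iter-num of every <Iteration> that contains no <Hit>."""
    missing = []
    for iteration in root.iter("Iteration"):
        hits = iteration.find("Iteration_hits")
        # Either the <Iteration_hits> block is absent or it is empty.
        if hits is None or hits.find("Hit") is None:
            missing.append(iteration.findtext("Iteration_iter-num"))
    return missing


root = ET.fromstring(SAMPLE)
print(iterations_without_hits(root))
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring(SAMPLE)`.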
--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318