[BioPython] problems when parsing blast output

Tue Jan 17 06:25:42 EST 2006

OK, thanks for the extra information Alessandro.

It looks like the current BLAST parser doesn't like the current blastpgp 
output.

A quick Google search suggests that it used to work. My guess is that 
the NCBI recently changed the format to add this extra reference:

Reference for composition-based statistics:
Schaffer, Alejandro A., L. Aravaind, Thomas L. Madden,
Sergei Shavirin, John L. Spouge, Yuri I. Wolf,
Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements",  Nucleic Acids Res. 
29:2994-3005.

If I delete this from the blast.output file you sent me, then your 
example code works fine.

If you are running the blast search separately, and then trying to parse 
the output in Python, this short-term fix should get you up and running.
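
As a rough sketch of that workaround, something like the following could 
strip the extra reference before the file reaches the parser. It only 
uses the standard library and is written for a current Python rather 
than 2.4. It assumes the reference block starts with the header line 
shown above and ends at the first blank line; the function names and the 
blast.cleaned.output file name are made up for illustration.

```python
# Header line that introduces the extra reference in the blastpgp output.
REFERENCE_HEADER = "Reference for composition-based statistics:"


def strip_composition_reference(text):
    """Return text without the composition-based statistics reference.

    Assumption: the block runs from the header line up to and including
    the first blank line that follows it.
    """
    kept = []
    skipping = False
    for line in text.splitlines(keepends=True):
        if line.startswith(REFERENCE_HEADER):
            skipping = True  # start of the block to drop
            continue
        if skipping:
            if not line.strip():
                skipping = False  # the blank line closes the block
            continue  # drop block lines and the closing blank line
        kept.append(line)
    return "".join(kept)


def clean_file(src, dst):
    """Write a copy of src to dst with the reference block removed."""
    with open(src) as handle:
        cleaned = strip_composition_reference(handle.read())
    with open(dst, "w") as handle:
        handle.write(cleaned)


# Example call (the output file name is hypothetical):
# clean_file("blast.output", "blast.cleaned.output")
```

The cleaned file can then be handed to the same parsing code as before. 
If the output holds several result blocks that each repeat the 
reference, every occurrence is removed.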

You could also try getting BLAST to produce XML output.  There was a 
recent post on the list from someone having problems with XML output and 
multiple inputs, along with a suggested workaround.

I have logged a bug for this issue (and attached your test file to it):

http://bugzilla.open-bio.org/show_bug.cgi?id=1929

Hopefully someone will tackle this soon - I'm off sick today, and should 
really be resting.

Peter

Alessandro S. Nascimento wrote:
> Hi Peter,
> 
> as you will see in the attached script file, I tried to parse my blast 
> output in two ways described in the biopython cookbook.
> I'm using Kubuntu Linux, python 2.4.2. I'm not completely sure about my 
> biopython version, because it was installed from the debian repositories 
> through apt-get, but it seems to be version 1.30.
> 
> I also ran the blastpgp search separately using the parameters "blastpgp -i 
> seqinput -o blast.output -j 50  -v 10000 -b 10000 -d ../db/nr -h 0.001". 
> A smaller blast result, which gives me the same result from my python 
> script, is also attached.
> 
> My goal is to get a large number of sequences using blastpgp, filter 
> them by length and identities (e.g. > 30 and < 90), compare the 
> results one to another using blast2seq and align them using clustalw for 
> statistical analysis. I have tried to do it using bioperl, but ran into 
> bugs when working with a large number of sequences. So I am trying 
> python now. This should be something quite simple.  (I guess)
> 
> Any help will be greatly appreciated!
> 
> Thank you so much,
> 
> 
> Alessandro