[BioPython] Changes in NCBI BLAST output format !!??

Tue Jul 19 10:00:42 EDT 2005

Hi Aurelie,

I ended up just writing my own parser... it wasn't as hard as I thought 
it would be.  The BLAST output is pretty straightforward.  What I wanted 
to do with BLAST was pretty simple, so I don't know if this will help 
you or not.  I wanted to get UIDs for the top hits, then retrieve the 
sequences.  I used the following regular expression to get the 
information from the top of the BLAST report (the part with the links to 
lower in the page):

alignre = re.compile(r'<a href = #(\d+)> *(\d+)</a> *(\d*e-\d+|\d+\.\d+)')

This regular expression contains 3 groups: the UID, score, and expect 
value, so I used the RE with:

uid, score, expect = alignre.search(line).groups()

I used a bit of other code to make sure that the line I'm looking at 
('line') contains these items.  It's kind of dirty, but it worked for me.
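
For illustration, here is a minimal sketch of what that kind of check could 
look like.  It is not the code I actually ran: the function name top_hits 
and the sample lines are made up for this example, and it assumes the 
summary lines look the way the regular expression expects.  Lines that do 
not match are simply skipped.

```python
import re

# Pattern from the message; the '.' in the expect value is escaped here so
# that it only matches a literal decimal point.
alignre = re.compile(r'<a href = #(\d+)> *(\d+)</a> *(\d*e-\d+|\d+\.\d+)')


def top_hits(lines):
    """Yield (uid, score, expect) for every line that matches alignre.

    Hypothetical helper: lines without a match are ignored instead of
    raising AttributeError on alignre.search(line).groups().
    """
    for line in lines:
        match = alignre.search(line)
        if match is None:
            continue  # not a hit-summary line
        uid, score, expect = match.groups()
        # The pattern allows expect values with no leading digit
        # (for example 'e-100'); float() needs a mantissa, so add one.
        if expect.startswith('e'):
            expect = '1' + expect
        yield uid, int(score), float(expect)


# Made-up sample lines, shaped only to fit the pattern above.
sample = [
    'Sequences producing significant alignments:',
    '<a href = #12345> 200</a> 2e-50',
    '<a href = #678>  54</a>   0.005',
]
hits = list(top_hits(sample))
```

The UIDs are kept as strings because they are only used for lookups; the 
score and expect value are converted so the hits can be sorted or filtered.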

Hopefully this will give you ideas as to what you can do to extract the 
information you need from the BLAST report.

Jessica

aurelie.bornot at free.fr wrote:

>Hi !
>
>I've got the same problem as Jessica Leigh (in the Discussion List) :
>When I try to parse a BLAST file with a script that worked until the beginning
>of July, I get this syntax error :
>
>Line does not contain 'Database':
>(Blank line)
>
>It seems that NCBI has made changes:
>
>-"Old" blast file :
><p>
><b>Query=</b> sequence
>         (569 letters)
>
><p>
><b>Database:</b> All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
>GSS,environmental samples or phase 0, 1 or 2 HTGS sequences)
>           3,047,402 sequences; 13,743,552,639 total letters
>
><p> <p>If you have any problems or questions with the results...
>
>-New Blast file :
>
><b>Query=</b> sequence
>         (540 letters)
>
>
><b>Database:</b> All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
>GSS,environmental samples or phase 0, 1 or 2 HTGS sequences)
>           3,312,348 sequences; 14,588,094,788 total letters
>
><p> <p>If you have any problems or questions with the...
>
>
>The <p> tags before Query and Database are missing!
>And the fact is that in Python24\Lib\site-packages\Bio\Blast\NCBIWWW.py, it
>seems that the code to find "Database" uses the <p> :
>
>def _scan_database_info(self, uhandle, consumer):
>        attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
>        read_and_call(uhandle, consumer.database_info, contains='Database')
>        ....
>
>
>I'm not sure I fully understand what is happening...
>But could someone help?
>I don't know what to do. Is it possible to correct the problem easily?
>
>Thanks a lot !!
>Aurelie
>
>--------------
>Aurelie BORNOT
>MNHN
>Paris
>_______________________________________________
>BioPython mailing list  -  BioPython at biopython.org
>http://biopython.org/mailman/listinfo/biopython