[Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour

Tue Aug 9 20:09:05 UTC 2011

I'm reviving this thread to see what the current status is (if anything has changed).  The bioperl parser has the same problem; at the moment we're bascially stuck until NCBI gives some indication as to whether this is a bug or not.  Any word back from them yet?

(and agreed, it would be nice to have an external bug tracker from NCBI).

chris

On May 4, 2011, at 5:36 AM, Peter Cock wrote:

> On Wed, May 4, 2011 at 10:59 AM, Michal <mictadlo at gmail.com> wrote:
>> Hi Peter,
>> Do you have the script which read
>> 
>> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>> 
>> 
>> and what would be the correct output?
>> 
>> Thank you in advance.
>> 
>> Cheers,
>> Michal
> 
> Hi Michal,
> 
> I'm not quite sure what you're asking, but I'll try. First, the three
> data files:
> 
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/rhodopsin_proteins.fasta
> 
> The query file has four sequences,
> 
> $ grep -c "^>" four_human_proteins.fasta
> 4
> 
> $ grep "^>" four_human_proteins.fasta
>> sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1
>> sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2
>> sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4
>> sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1
> 
> Based on past experience, I would expect 4 iteration blocks in the
> XML, but in this case I have 24:
> 
> $ grep "<Iteration>" -c blastp_four_human_vs_rhodopsin.xml
> 24
> 
> Notice we get 6 iterations for each query (4 times 6 is 24):
> 
> $ grep "<Iteration_query-ID>" blastp_four_human_vs_rhodopsin.xml
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>      <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> 
> Now, using the two FASTA files directly and re-running blastp, what do I get?
> 
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> | grep "<Iteration>" -c
> 24
> 
> Or again with -parse_deflines, which changes how the hit ID/def is presented:
> 
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> -parse_deflines | grep "<Iteration>" -c
> 24
> 
> How about older versions?
> 
> $ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> BLAST engine error: XML formatting is only supported for a database search
> 
> I'll have to make a blast database first...
> 
> $ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in
> rhodopsin_proteins.fasta -dbtype prot
> 
> Building a new DB, current time: 05/04/2011 11:22:57
> New DB name:   rhodopsin_proteins.fasta
> New DB title:  rhodopsin_proteins.fasta
> Sequence type: Protein
> Keep Linkouts: T
> Keep MBits: T
> Maximum file size: 1073741824B
> Adding sequences from FASTA; added 6 sequences in 0.105655 seconds.
> 
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 |
> grep "<Iteration>" -c
> 4
> 
> Look - just four identifiers as I expect! This also works if the database
> is built with the -parse_seqids switch.
> 
> The same happens with older versions of BLAST+, one <Iteration>
> block per query, so four iteration blocks for this example. I tried all
> of 2.2.21+, 2.2.22+, 2.2.23+ and 2.2.24+ (running makeblastdb to
> give a fresh database, then blastp).
> 
> That seems to demonstrate that bug is specific to the XML output
> from FASTA vs FASTA (not FASTA vs DB), which is a new feature
> in NCBI BLAST 2.2.25+
> 
> I will raise this with the NCBI, and report back.
> 
> However, even if the NCBI fix it in the next release, we (Bio*) may
> want to update our parsers to cope with this quirk, or at least put a
> warning in our BLAST XML parser documentation, as there will be
> lots of installations of NCBI BLAST 2.2.25+ in the wild.
> 
> Peter