From p.j.a.cock at googlemail.com Tue May 3 05:24:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 May 2011 10:24:08 +0100
Subject: [Open-bio-l] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References:
Message-ID:

Hello all,

I've CC'd the BioPerl, BioRuby, BioJava and Biopython development mailing
lists to make sure you're aware of this, but can we continue any discussion
on the cross-project open-bio-l mailing list please?

I noticed that recent versions of BLAST are not using a single
block for each query, which was the historical behaviour and assumed
by the Biopython BLAST XML parser. This may be a bug in BLAST.
See link below for an example.

Has anyone else noticed this, and has it been reported to the NCBI yet?

Thanks,

Peter

(Not for the first time, I wish there was a public bug tracker for BLAST,
or at least a private bug tracker so we could talk about issues with an
NCBI assigned reference number.)

---------- Forwarded message ----------
From: Peter Cock
Date: Wed, Apr 20, 2011 at 6:08 PM
Subject: Interesting BLAST 2.2.25+ XML behaviour
To: Biopython-Dev Mailing List

Hi all,

Have a look at this XML file from a FASTA vs FASTA search
using blastp from BLAST 2.2.25+ (current release), which
is a test file I created for the BLAST+ wrappers in Galaxy:

https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml

I just put it through the Biopython BLAST XML parser, and
was surprised not to get four records back (since as you
might guess from the filename, there were four queries).

It appears this version of BLAST+ is incrementing the
iteration counter for each match... or something like that.

Has anyone else noticed this? I wonder if it is accidental...

Peter

From cjfields at illinois.edu Tue May 3 09:31:55 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 3 May 2011 08:31:55 -0500
Subject: [Open-bio-l] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References:
Message-ID: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu>

Haven't tried this using the latest BLAST+ myself, but it doesn't surprise
me too much. Also agree re: some kind of bug tracking with NCBI; I believe
they have an internal one, but it would be nice to have a public interface
to it.

chris

On May 3, 2011, at 4:24 AM, Peter Cock wrote:

> Hello all,
>
> I've CC'd the BioPerl, BioRuby, BioJava and Biopython development mailing
> lists to make sure you're aware of this, but can we continue any discussion
> on the cross-project open-bio-l mailing list please?
>
> I noticed that recent versions of BLAST are not using a single
> block for each query, which was the historical behaviour and assumed
> by the Biopython BLAST XML parser. This may be a bug in BLAST.
> See link below for an example.
>
> Has anyone else noticed this, and has it been reported to the NCBI yet?
>
> Thanks,
>
> Peter
>
> (Not for the first time, I wish there was a public bug tracker for BLAST,
> or at least a private bug tracker so we could talk about issues with an
> NCBI assigned reference number.)
>
> ---------- Forwarded message ----------
> From: Peter Cock
> Date: Wed, Apr 20, 2011 at 6:08 PM
> Subject: Interesting BLAST 2.2.25+ XML behaviour
> To: Biopython-Dev Mailing List
>
>
> Hi all,
>
> Have a look at this XML file from a FASTA vs FASTA search
> using blastp from BLAST 2.2.25+ (current release), which
> is a test file I created for the BLAST+ wrappers in Galaxy:
>
> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>
> I just put it through the Biopython BLAST XML parser, and
> was surprised not to get four records back (since as you
> might guess from the filename, there were four queries).
>
> It appears this version of BLAST+ is incrementing the
> iteration counter for each match... or something like that.
>
> Has anyone else noticed this? I wonder if it is accidental...
>
> Peter
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From p.j.a.cock at googlemail.com Wed May 4 06:36:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 4 May 2011 11:36:57 +0100
Subject: [Open-bio-l] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To: <4DC12371.3040204@gmail.com>
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Wed, May 4, 2011 at 10:59 AM, Michal wrote:
> Hi Peter,
> Do you have the script which read
>
> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>
>
> and what would be the correct output?
>
> Thank you in advance.
>
> Cheers,
> Michal

Hi Michal,

I'm not quite sure what you're asking, but I'll try.

First, the three data files:

$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta
$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/rhodopsin_proteins.fasta

The query file has four sequences:

$ grep -c "^>" four_human_proteins.fasta
4

$ grep "^>" four_human_proteins.fasta
>sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1
>sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2
>sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4
>sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1

Based on past experience, I would expect 4 iteration blocks in the XML,
but in this case I have 24:

$ grep "" -c blastp_four_human_vs_rhodopsin.xml
24

Notice we get 6 iterations for each query (4 times 6 is 24):

$ grep "" blastp_four_human_vs_rhodopsin.xml
sp|Q9BS26|ERP44_HUMAN
sp|Q9BS26|ERP44_HUMAN
sp|Q9BS26|ERP44_HUMAN
sp|Q9BS26|ERP44_HUMAN
sp|Q9BS26|ERP44_HUMAN
sp|Q9BS26|ERP44_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|Q9NSY1|BMP2K_HUMAN
sp|P06213|INSR_HUMAN
sp|P06213|INSR_HUMAN
sp|P06213|INSR_HUMAN
sp|P06213|INSR_HUMAN
sp|P06213|INSR_HUMAN
sp|P06213|INSR_HUMAN
sp|P08100|OPSD_HUMAN
sp|P08100|OPSD_HUMAN
sp|P08100|OPSD_HUMAN
sp|P08100|OPSD_HUMAN
sp|P08100|OPSD_HUMAN
sp|P08100|OPSD_HUMAN

Now, using the two FASTA files directly and re-running blastp, what do I get?

$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 | grep "" -c
24

Or again with -parse_deflines, which changes how the hit ID/def is presented:

$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 -parse_deflines | grep "" -c
24

How about older versions?

$ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
BLAST engine error: XML formatting is only supported for a database search

I'll have to make a blast database first...

$ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in rhodopsin_proteins.fasta -dbtype prot

Building a new DB, current time: 05/04/2011 11:22:57
New DB name: rhodopsin_proteins.fasta
New DB title: rhodopsin_proteins.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 6 sequences in 0.105655 seconds.

$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 | grep "" -c
4

Look - just four identifiers as I expect! This also works if the database
is built with the -parse_seqids switch.

The same happens with older versions of BLAST+: one block per query, so
four iteration blocks for this example. I tried all of 2.2.21+, 2.2.22+,
2.2.23+ and 2.2.24+ (running makeblastdb to give a fresh database, then
blastp).

That seems to demonstrate that the bug is specific to the XML output from
FASTA vs FASTA searches (not FASTA vs DB), which is a new feature in NCBI
BLAST 2.2.25+.

I will raise this with the NCBI, and report back.

However, even if the NCBI fix it in the next release, we (Bio*) may want to
update our parsers to cope with this quirk, or at least put a warning in our
BLAST XML parser documentation, as there will be lots of installations of
NCBI BLAST 2.2.25+ in the wild.

Peter
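
For anyone wanting to experiment with coping with this quirk on the parser side, here is a minimal sketch in Python using only the standard library. It is not the Biopython parser or any other Bio* project's code. It assumes the usual NCBI BLAST XML element names (Iteration, Iteration_query-ID, Hit, Hit_id) and assumes that the extra iteration blocks for one query are adjacent and share the same query ID, as the grep output in this thread suggests. The function name merge_iterations_by_query and the toy XML (queries q1/q2, subjects s1/s2) are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

# Toy input, invented for illustration: two queries (q1, q2) whose results
# are spread over four <Iteration> blocks, mimicking the behaviour reported
# in this thread. Real BLAST XML carries many more fields per block.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>q1</Iteration_query-ID>
      <Iteration_hits><Hit><Hit_id>s1</Hit_id></Hit></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>q1</Iteration_query-ID>
      <Iteration_hits></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>q1</Iteration_query-ID>
      <Iteration_hits><Hit><Hit_id>s2</Hit_id></Hit></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>4</Iteration_iter-num>
      <Iteration_query-ID>q2</Iteration_query-ID>
      <Iteration_hits><Hit><Hit_id>s1</Hit_id></Hit></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def merge_iterations_by_query(xml_source):
    """Collapse adjacent <Iteration> blocks that share a query ID.

    xml_source is a filename or a file-like object. Returns a list of
    (query_id, [hit_id, ...]) tuples, one per run of identical query IDs.
    """
    root = ET.parse(xml_source).getroot()
    merged = []
    # groupby only merges *consecutive* blocks, so two distinct queries that
    # happen to share an ID but are not adjacent stay separate.
    for query_id, blocks in groupby(
        root.iter("Iteration"),
        key=lambda block: block.findtext("Iteration_query-ID"),
    ):
        hit_ids = [
            hit.findtext("Hit_id")
            for block in blocks
            for hit in block.iter("Hit")
        ]
        merged.append((query_id, hit_ids))
    return merged


if __name__ == "__main__":
    records = merge_iterations_by_query(io.BytesIO(SAMPLE_XML.encode("utf-8")))
    for query_id, hit_ids in records:
        print(query_id, hit_ids)
```

Grouping only adjacent blocks keeps the sketch safe for the historical one-block-per-query output, where it changes nothing. Whether the query ID is the right key for every BLAST version is an open question; the query definition line might be a safer key in some cases.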