[Biopython-dev] BUG: blastparser: expect(2)
Jeffrey Chang
jchang at SMI.Stanford.EDU
Fri Aug 11 15:08:29 EDT 2000
Thanks for the bug report! It's nice to know what Expect(2) means
now. I think I've fixed the bug in the CVS tree. Here's my CVS log
message roughly describing the fixes:
fixes for bug found by Thomas Ponten Sicheritz.
When running blastall with gapped alignment turned off (-g F), the
descriptions include an extra column N that indicates the number of
alignments for the same subject.
                                                              Score     E
Sequences producing significant alignments:                   (bits)  Value  N
This is also reported in the alignments as "Expect(??) = XXXX".
Thus, I have added a num_alignments member to the Record.HSP and
Record.Description classes. I made changes to the scanners in
NCBIWWW and NCBIStandalone and the appropriate consumers in NCBIStandalone
to parse this information.
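
As a rough illustration of the HSP side of this (a sketch only, not the code in NCBIStandalone), the optional "(N)" after "Expect" can be pulled out with a regular expression. The helper name parse_expect and the choice to return None when "(N)" is absent are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not the actual Biopython scanner code.
# Matches both "Expect = 13" and "Expect(2) = 5e-45".
_EXPECT_RE = re.compile(r"Expect(?:\((\d+)\))?\s*=\s*(\S+)")


def parse_expect(line):
    """Return (num_alignments, expect) from an HSP score line."""
    match = _EXPECT_RE.search(line)
    if match is None:
        raise ValueError("no Expect field in %r" % line)
    # Assumption: a missing "(N)" is reported as None.
    num_alignments = int(match.group(1)) if match.group(1) else None
    value = match.group(2).rstrip(",")
    # BLAST can print values such as "e-100" with no leading digit.
    if value[:1] in ("e", "E"):
        value = "1" + value
    return num_alignments, float(value)


print(parse_expect(" Score = 46.1 bits (23), Expect(2) = 5e-45"))
print(parse_expect(" Score = 28.2 bits (14), Expect = 13"))
```

Both score lines come from the report quoted below; the first carries the alignment count, the second does not.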
Plus, I added a new event called "description_header" so that I will
know whether to expect the "N" term in the descriptions.
I updated the docs and regression tests accordingly.
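
The description side can be sketched the same way. This is an illustration under stated assumptions, not the NCBIStandalone implementation: header_has_n and parse_description are hypothetical names, and the sketch assumes the trailing columns are whitespace-separated, so the line is split from the right to keep spaces inside the title.

```python
# Hypothetical helpers for illustration; not the actual Biopython consumers.

def header_has_n(header_line):
    """True if the description header ends with the extra "N" column."""
    fields = header_line.split()
    return bool(fields) and fields[-1] == "N"


def parse_description(line, has_n):
    """Split a description line into (title, score, e, num_alignments)."""
    ncols = 3 if has_n else 2
    # Split from the right so whitespace inside the title is preserved.
    fields = line.strip().rsplit(None, ncols)
    if len(fields) != ncols + 1:
        raise ValueError("cannot parse description line %r" % line)
    title = fields[0]
    score = int(fields[1])
    e = float(fields[2])
    num_alignments = int(fields[3]) if has_n else None
    return title, score, e, num_alignments


header = "Sequences producing significant alignments: (bits) Value N"
line = "ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont... 153 5e-45 2"
print(parse_description(line, header_has_n(header)))
```

Calling parse_description(line, has_n=False) on the same line treats "5e-45" as the score and raises a ValueError, which is the same kind of failure as in the traceback quoted below.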
Keep the bug reports coming in!
Thanks,
Jeff
On Fri, 11 Aug 2000 thomas at cbs.dtu.dk wrote:
> Hi,
>
> The blastparser fails while reading a blastall result with the "-g = F" option.
> (-g Perfom gapped alignment (not available with tblastx) [T/F] default = T)
>
> Expect(2) means that there are 2 alignments for the same Sbjct:
>
> c ya
> -thomas
> example code
> ##############################################
> from Bio.Blast import NCBIStandalone
> from Bio.Data import IUPACData
>
> file = 'test.blastn'
> parser = NCBIStandalone.BlastParser()
> iter = NCBIStandalone.Iterator(handle = open(file), parser = parser)
>
> while 1:
>     rec = iter.next()
>     if not rec: break
> #############
>
> results in:
> ##############################################
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 587, in _parse
>     dh.score = _safe_int(dh.score)
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 1469, in _safe_int
>     return long(str)
> ValueError: invalid literal for long(): 5e-45
> #########
>
> the blast file:
> ##############################################
> BLASTN 2.0.14 [Jun-29-2000]
>
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs", Nucleic Acids Res. 25:3389-3402.
>
> Query= HUMAGCGB
> (100 letters)
>
> Database: ./ensembl.cdna
> 37,720 sequences; 24,543,038 total letters
>
> Searching..................................................done
>
>                                                               Score     E
> Sequences producing significant alignments:                   (bits)  Value  N
>
> ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont...     153   5e-45  2
> ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont...      28      13  1
>
> >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001
> Length = 2673
>
> Score = 46.1 bits (23), Expect(2) = 5e-45
> Identities = 23/23 (100%)
> Strand = Plus / Plus
>
>
> Query: 1    atggagaccgtggtttgcccaag 23
>             |||||||||||||||||||||||
> Sbjct: 1742 atggagaccgtggtttgcccaag 1764
>
>
> Score = 153 bits (77), Expect(2) = 5e-45
> Identities = 77/77 (100%)
> Strand = Plus / Plus
>
>
> Query: 24   gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823
>
>
> Query: 84   ttcaccatatgaggaac 100
>             |||||||||||||||||
> Sbjct: 1824 ttcaccatatgaggaac 1840
>
>
> >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001
> Length = 1530
>
> Score = 28.2 bits (14), Expect = 13
> Identities = 14/14 (100%)
> Strand = Plus / Plus
>
>
> Query: 26 cctgggaagagagg 39
>           ||||||||||||||
> Sbjct: 57 cctgggaagagagg 70
>
>
> Database: ./ensembl.cdna
> Posted date: Aug 3, 2000 1:07 PM
> Number of letters in database: 24,543,038
> Number of sequences in database: 37,720
>
> Lambda K H
> 1.37 0.711 1.31
>
>
> Matrix: blastn matrix:1 -3
> Number of Hits to DB: 3
> Number of Sequences: 37720
> Number of extensions: 3
> Number of successful extensions: 3
> Number of sequences better than 10.0: 2
> length of query: 100
> length of database: 24,543,038
> effective HSP length: 16
> effective length of query: 84
> effective length of database: 23,939,518
> effective search space: 2010919512
> effective search space used: 2010919512
> T: 0
> A: 0
> X1: 6 (11.9 bits)
> X2: 10 (19.8 bits)
> S1: 12 (24.3 bits)
> S2: 14 (28.2 bits)
> BLASTN 2.0.14 [Jun-29-2000]
>
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs", Nucleic Acids Res. 25:3389-3402.
>
> Query= HUMAGCGB
> (100 letters)
>
> Database: ./ensembl.cdna
> 37,720 sequences; 24,543,038 total letters
>
> Searching..................................................done
>
>                                                               Score     E
> Sequences producing significant alignments:                   (bits)  Value  N
>
> ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont...     153   5e-45  2
> ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont...      28      13  1
>
> >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001
> Length = 2673
>
> Score = 46.1 bits (23), Expect(2) = 5e-45
> Identities = 23/23 (100%)
> Strand = Plus / Plus
>
>
> Query: 1    atggagaccgtggtttgcccaag 23
>             |||||||||||||||||||||||
> Sbjct: 1742 atggagaccgtggtttgcccaag 1764
>
>
> Score = 153 bits (77), Expect(2) = 5e-45
> Identities = 77/77 (100%)
> Strand = Plus / Plus
>
>
> Query: 24   gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823
>
>
> Query: 84   ttcaccatatgaggaac 100
>             |||||||||||||||||
> Sbjct: 1824 ttcaccatatgaggaac 1840
>
>
> >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001
> Length = 1530
>
> Score = 28.2 bits (14), Expect = 13
> Identities = 14/14 (100%)
> Strand = Plus / Plus
>
>
> Query: 26 cctgggaagagagg 39
>           ||||||||||||||
> Sbjct: 57 cctgggaagagagg 70
>
>
> Database: ./ensembl.cdna
> Posted date: Aug 3, 2000 1:07 PM
> Number of letters in database: 24,543,038
> Number of sequences in database: 37,720
>
> Lambda K H
> 1.37 0.711 1.31
>
>
> Matrix: blastn matrix:1 -3
> Number of Hits to DB: 3
> Number of Sequences: 37720
> Number of extensions: 3
> Number of successful extensions: 3
> Number of sequences better than 10.0: 2
> length of query: 100
> length of database: 24,543,038
> effective HSP length: 16
> effective length of query: 84
> effective length of database: 23,939,518
> effective search space: 2010919512
> effective search space used: 2010919512
> T: 0
> A: 0
> X1: 6 (11.9 bits)
> X2: 10 (19.8 bits)
> S1: 12 (24.3 bits)
> S2: 14 (28.2 bits)
> ########
>
>
> --
> Sicheritz Ponten Thomas E. CBS, Department of Biotechnology
> blippblopp at linux.nu The Technical University of Denmark
> CBS: +45 45 252485 Building 208, DK-2800 Lyngby
> Fax +45 45 931585 http://www.cbs.dtu.dk/thomas/index.html
>
> De Chelonian Mobile ... The Turtle Moves ...
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
>