[Biopython-dev] BUG: blastparser: expect(2)

Jeffrey Chang jchang at SMI.Stanford.EDU
Fri Aug 11 15:08:29 EDT 2000


Thanks for the bug report!  It's nice to know what Expect(2) means
now.  I think I've fixed the bug in the CVS tree.  Here's my CVS log
message roughly describing the fixes:

fixes for bug found by Thomas Ponten Sicheritz.
When running blastall with gapped alignment turned off (-g=F), the
descriptions will include an extra term N that indicates the number of
alignments for the same subject.
                                                               Score     E
Sequences producing significant alignments:                    (bits)  Value  N

This is also reported in the alignments as "Expect(??) = XXXX".
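
As a rough illustration of that line format (a sketch only, not the code
committed to CVS), the optional count can be pulled out with a regular
expression. The helper name parse_expect and the pattern are hypothetical,
and defaulting to 1 when no count is printed is an assumption based on the
sample output in the report.

```python
import re

# Hypothetical pattern: matches both "Expect = 13" and "Expect(2) = 5e-45".
_EXPECT_RE = re.compile(r"Expect(?:\((\d+)\))?\s*=\s*(\S+)")


def parse_expect(line):
    """Return (expect, num_alignments) from an HSP score line."""
    match = _EXPECT_RE.search(line)
    if match is None:
        raise ValueError("no Expect term in line: %r" % line)
    # Assumption: a missing "(n)" means a single alignment for the subject.
    num_alignments = int(match.group(1)) if match.group(1) else 1
    expect = float(match.group(2).rstrip(","))
    return expect, num_alignments
```

For example, parse_expect(" Score = 46.1 bits (23), Expect(2) = 5e-45")
gives an expect value of 5e-45 and a count of 2.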

Thus, I have added a num_alignments member to the Record.HSP and 
Record.Description classes.  I made changes to the scanners in 
NCBIWWW and NCBIStandalone and the appropriate consumers in NCBIStandalone
to parse this information.

Plus, I added a new event called "description_header" so that I will
know whether to expect the "N" term in the descriptions.
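
The idea can be sketched roughly as follows. This is an illustration under
stated assumptions, not the actual scanner/consumer code: header_has_n and
parse_description are hypothetical names, and the sketch assumes the score
in a description line is always an integer number of bits.

```python
def header_has_n(header_line):
    """True if the description header ends with the extra N column."""
    return header_line.split()[-1:] == ["N"]


def parse_description(line, has_n):
    """Split a description line into (title, score, e_value, num_alignments)."""
    # Split from the right so that spaces inside the title are left alone.
    n_fields = 3 if has_n else 2
    parts = line.strip().rsplit(None, n_fields)
    if len(parts) != n_fields + 1:
        raise ValueError("cannot parse description line: %r" % line)
    title = parts[0]
    score = int(parts[1])        # assumption: score is an integer bit score
    e_value = float(parts[2])
    num_alignments = int(parts[3]) if has_n else None
    return title, score, e_value, num_alignments
```

Calling parse_description with has_n=False on a line that does carry the N
column puts "5e-45" into the score field and raises ValueError, which is
the same kind of failure as in the original report.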

I updated the docs and regression tests accordingly.




Keep the bug reports coming in!

Thanks,
Jeff




On Fri, 11 Aug 2000 thomas at cbs.dtu.dk wrote:

> Hi,
> 
> The blastparser fails while reading a blastall result with the "-g = F" option.
> (-g  Perfom gapped alignment (not available with tblastx) [T/F] default = T)
> 
> Expect(2) means that there are 2 alignments for the same Sbjct:
> 
> c ya
> -thomas
> example code
> ##############################################
> from Bio.Blast import NCBIStandalone
> from Bio.Data import IUPACData
> 
> file = 'test.blastn'
> parser = NCBIStandalone.BlastParser()
> iter = NCBIStandalone.Iterator(handle = open(file), parser = parser)
> 
> while 1:
>     rec = iter.next()
>     if not rec: break
> #############
> 
> results in:
> ##############################################
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 587, in _parse
>     dh.score = _safe_int(dh.score)
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 1469, in _safe_int
>     return long(str)
> ValueError: invalid literal for long(): 5e-45
> #########
> 
> the blast file:
> ##############################################
> BLASTN 2.0.14 [Jun-29-2000]
> 
> 
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs",  Nucleic Acids Res. 25:3389-3402.
> 
> Query= HUMAGCGB
>          (100 letters)
> 
> Database: ./ensembl.cdna
>            37,720 sequences; 24,543,038 total letters
> 
> Searching..................................................done
> 
>                                                                Score     E
> Sequences producing significant alignments:                    (bits)  Value  N
> 
> ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont...   153  5e-45  2
> ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont...    28     13  1
> 
> >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001
>           Length = 2673
> 
>  Score = 46.1 bits (23), Expect(2) = 5e-45
>  Identities = 23/23 (100%)
>  Strand = Plus / Plus
> 
>                                    
> Query: 1    atggagaccgtggtttgcccaag 23
>             |||||||||||||||||||||||
> Sbjct: 1742 atggagaccgtggtttgcccaag 1764
> 
> 
>  Score =  153 bits (77), Expect(2) = 5e-45
>  Identities = 77/77 (100%)
>  Strand = Plus / Plus
> 
>                                                                         
> Query: 24   gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823
> 
>                              
> Query: 84   ttcaccatatgaggaac 100
>             |||||||||||||||||
> Sbjct: 1824 ttcaccatatgaggaac 1840
> 
> 
> >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001
>           Length = 1530
> 
>  Score = 28.2 bits (14), Expect =    13
>  Identities = 14/14 (100%)
>  Strand = Plus / Plus
> 
>                         
> Query: 26 cctgggaagagagg 39
>           ||||||||||||||
> Sbjct: 57 cctgggaagagagg 70
> 
> 
>   Database: ./ensembl.cdna
>     Posted date:  Aug 3, 2000  1:07 PM
>   Number of letters in database: 24,543,038
>   Number of sequences in database:  37,720
>   
> Lambda     K      H
>     1.37    0.711     1.31 
> 
> 
> Matrix: blastn matrix:1 -3
> Number of Hits to DB: 3
> Number of Sequences: 37720
> Number of extensions: 3
> Number of successful extensions: 3
> Number of sequences better than 10.0: 2
> length of query: 100
> length of database: 24,543,038
> effective HSP length: 16
> effective length of query: 84
> effective length of database: 23,939,518
> effective search space: 2010919512
> effective search space used: 2010919512
> T: 0
> A: 0
> X1: 6 (11.9 bits)
> X2: 10 (19.8 bits)
> S1: 12 (24.3 bits)
> S2: 14 (28.2 bits)
> ########
> 
> 
> -- 
> Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
> blippblopp at linux.nu         The Technical University of Denmark
> CBS:  +45 45 252485         Building 208, DK-2800 Lyngby
> Fax   +45 45 931585         http://www.cbs.dtu.dk/thomas/index.html
> 
> 	De Chelonian Mobile ... The Turtle Moves ...
> 