From biopython-bug-admin at bioperl.org Sat Aug 5 05:05:12 2000
From: biopython-bug-admin at bioperl.org (biopython-bug-admin@bioperl.org)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Notification: incoming/12
Message-ID: <200008050905.FAA24975@pw600a.bioperl.org>

JitterBug notification

new message incoming/12

Message summary for PR#12
    From: thomas@cbs.dtu.dk
    Subject: blastn
    Date: Sat, 5 Aug 2000 05:05:11 -0400
    0 replies    0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

From thomas at cbs.dtu.dk Sat Aug 5 05:05:11 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:51 2005
Subject: blastn
Message-ID: <200008050905.FAA24959@pw600a.bioperl.org>

Full_Name: thomas sichertiz-ponten
Module: Blast/NCBIStandalone
Version:
OS: linux, IRIX
Submission from: molev106.ebc.uu.se (130.238.82.106)

Problem: cannot parse a multiple blastnresult because of ?hardcoded? amount of
whitespaces ?

#script .....
import sys, os
sys.path.insert(0, os.path.expanduser('~thomas/cbs/python/biopython'))
from Bio.Blast import NCBIStandalone
from Bio.Data import IUPACData

file = 'blasttest.blastn'
parser = NCBIStandalone.BlastParser()
iter = NCBIStandalone.Iterator(handle = open(file), parser = parser)
while 1:
    res = iter.next()

---- SNIP ----- SNIP ------
# result
Traceback (innermost last):
  File "", line 1, in ?
  File "/usr/tmp/python-Oq3ztf", line 18, in ?
    res = iter.next()
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 1199, in next
    return self._parser.parse(File.StringHandle(data))
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 463, in parse
    self._scanner.feed(handle, self._consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 68, in feed
    self._scan_rounds(uhandle, consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 121, in _scan_rounds
    self._scan_alignments(uhandle, consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 226, in _scan_alignments
    self._scan_pairwise_alignments(uhandle, consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 236, in _scan_pairwise_alignments
    self._scan_one_pairwise_alignment(uhandle, consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 241, in _scan_one_pairwise_alignment
    self._scan_alignment_header(uhandle, consumer)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 267, in _scan_alignment_header
    read_and_call(uhandle, consumer.noevent, start=' ')
  File "/home/genome6/thomas/cbs/python/biopython/Bio/ParserSupport.py", line 140, in read_and_call
    raise SyntaxError, errmsg
SyntaxError: Line does not start with ' ':

--- SNIP --- SNIP -----
#blasttest.blastn
BLASTN 2.0.14 [Jun-29-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= M15353 (100 letters) Database: ensembl.cdna 37,720 sequences; 24,543,038 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value N ENST00000044731 Gene:ENSG00000041402 Clone:AC060233 Cont... 182 4e-46 1 ENST00000041234 Gene:ENSG00000038511 Clone:AC015993 Cont... 163 3e-40 1 >ENST00000044731 Gene:ENSG00000041402 Clone:AC060233 Contig:AC060233.00036 Length = 654 Score = 182 bits (92), Expect = 4e-46 Identities = 98/100 (98%) Strand = Plus / Plus Query: 1 atggcgactgtcgaaccggaaaccacccctactcctaatcccccgactacagaagaggag 60 |||||||| ||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct: 1 atggcgaccgtcgaaccggaaaccacccctactcctaatcccccgactacagaaaaggag 60 Query: 61 aaaacggaatctaatcaggaggttgctaacccagaacact 100 |||||||||||||||||||||||||||||||||||||||| Sbjct: 61 aaaacggaatctaatcaggaggttgctaacccagaacact 100 >ENST00000041234 Gene:ENSG00000038511 Clone:AC015993 Contig:AC015993.00011 Length = 361 Score = 163 bits (82), Expect = 3e-40 Identities = 82/82 (100%) Strand = Plus / Plus Query: 19 gaaaccacccctactcctaatcccccgactacagaagaggagaaaacggaatctaatcag 78 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1 gaaaccacccctactcctaatcccccgactacagaagaggagaaaacggaatctaatcag 60 Query: 79 gaggttgctaacccagaacact 100 |||||||||||||||||||||| Sbjct: 61 gaggttgctaacccagaacact 82 Database: ensembl.cdna Posted date: Aug 3, 2000 1:07 PM Number of letters in database: 24,543,038 Number of sequences in database: 37,720 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Number of Hits to DB: 2 Number of Sequences: 37720 Number of extensions: 2 Number of successful extensions: 2 Number of sequences better than 10.0: 2 length of query: 100 length of database: 24,543,038 effective HSP length: 16 effective length of query: 84 effective length of database: 23,939,518 effective search space: 2010919512 effective search space used: 2010919512 T: 0 A: 0 X1: 6 (11.9 bits) X2: 10 (19.8 bits) S1: 12 
(24.3 bits) S2: 14 (28.2 bits) BLASTN 2.0.14 [Jun-29-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= X76013 (100 letters) Database: ensembl.cdna 37,720 sequences; 24,543,038 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value N ENST00000040999 Gene:ENSG00000038136 Clone:AC016581 Cont... 34 0.20 1 >ENST00000040999 Gene:ENSG00000038136 Clone:AC016581 Contig:AC016581.00002 Length = 438 Score = 34.2 bits (17), Expect = 0.20 Identities = 17/17 (100%) Strand = Plus / Plus Query: 38 tcggcctgagcgagcag 54 ||||||||||||||||| Sbjct: 29 tcggcctgagcgagcag 45 Database: ensembl.cdna Posted date: Aug 3, 2000 1:07 PM Number of letters in database: 24,543,038 Number of sequences in database: 37,720 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Number of Hits to DB: 2 Number of Sequences: 37720 Number of extensions: 2 Number of successful extensions: 2 Number of sequences better than 10.0: 1 length of query: 100 length of database: 24,543,038 effective HSP length: 16 effective length of query: 84 effective length of database: 23,939,518 effective search space: 2010919512 effective search space used: 2010919512 T: 0 A: 0 X1: 6 (11.9 bits) X2: 10 (19.8 bits) S1: 12 (24.3 bits) S2: 14 (28.2 bits) BLASTN 2.0.14 [Jun-29-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. 
Query= U66617 (100 letters) Database: ensembl.cdna 37,720 sequences; 24,543,038 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value N ENST00000038861 Gene:ENSG00000036360 Clone:AC025361 Cont... 198 6e-51 1 ENST00000010117 Gene:ENSG00000007819 Clone:AL031228 Cont... 32 0.81 1 >ENST00000038861 Gene:ENSG00000036360 Clone:AC025361 Contig:AC025361.00005 Length = 605 Score = 198 bits (100), Expect = 6e-51 Identities = 100/100 (100%) Strand = Plus / Plus ====> MESSAGE TRUNCATED AT 8192 <====
From biopython-bug-admin at bioperl.org Mon Aug 7 15:50:25 2000 From: biopython-bug-admin at bioperl.org (biopython-bug-admin@bioperl.org) Date: Sat Mar 5 14:42:51 2005 Subject: [Biopython-dev] Notification: incoming/12 Message-ID: <200008071950.PAA07321@pw600a.bioperl.org> JitterBug notification jchang changed notes Message summary for PR#12 From: thomas@cbs.dtu.dk Subject: blastn Date: Sat, 5 Aug 2000 05:05:11 -0400 0 replies 0 followups Notes: [jchang] Yep, the parser's definitely broken. In the alignment header, the scanner was checking for a line containing ' '. However, in the latest version of BLAST it seems to have been changed to a real blank line. I've changed the scanner (NCBIStandalone.py, revision 1.22) to accept either one. In addition, I've added a BLASTN 2.0.14 test file into the regression tests as bt062.
From thomas at cbs.dtu.dk Wed Aug 9 13:10:07 2000 From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk) Date: Sat Mar 5 14:42:51 2005 Subject: [Biopython-dev] unambiguous DNA Message-ID: <14737.36975.379651.694881@genome.cbs.dtu.dk> Hi, I'd like to get 'X' instead of a '*' (stop signal) when there is no clear translation ...(when extracting all possible ORFs from raw - often pure - sequence data during e.g. complete genome seqeuncing projects) eg. the translation of 'NATGATTANAATNTATTCCATTATATTG' should result in XDXNXFHYI instead of *D*N*FHYI is this already possible ? -thomas -- Sicheritz Ponten Thomas E. CBS, Department of Biotechnology blippblopp@linux.nu The Technical University of Denmark CBS: +45 45 252485 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas/index.html De Chelonian Mobile ... The Turtle Moves ...
From dalke at acm.org Thu Aug 10 03:43:07 2000 From: dalke at acm.org (Andrew Dalke) Date: Sat Mar 5 14:42:51 2005 Subject: [Biopython-dev] unambiguous DNA Message-ID: <00fd01c0029e$a3840340$3f91cdcf@josiah> thomas@cbs.dtu.dk asked: >I'd like to get 'X' instead of a '*' (stop signal) when there is no >clear translation ...(when extracting all possible ORFs from raw - often >pure - sequence data during e.g. complete genome seqeuncing projects) >eg.
>the translation of 'NATGATTANAATNTATTCCATTATATTG'
>should result in
>      XDXNXFHYI
>instead of
>      *D*N*FHYI
>
>is this already possible ?

Yes. However, the default way (with utils.translate) won't do it. That's
there mostly as a bootstrap to make it really easy to do the simple things,
without needing to understand all I did with alphabet types.

First off, "NAT" doesn't map to a single residue - it can be a Y, or H or
N or ... In fact, it doesn't map to a stop codon, so the current biopython
code wouldn't try to create a '*' character. If you tried the following:

from Bio import Seq
from Bio.Alphabet import IUPAC
from Bio.Tools import Translate

seq = Seq.Seq('NATGATTANAATNTATTCCATTATATTG', IUPAC.ambiguous_dna)
protein = Translate.ambiguous_dna_by_id[1].translate(seq, "X")

you would get an exception - Bio.Data.CodonTable.TranslationError: NAT

So you need to define your problem more precisely. You have:
  1) input sequence
       uses the ambiguous DNA alphabet
  2) output sequence
       uses the ambiguous protein alphabet along with '*' for
       stop codon symbol and 'X' for untranslatable protein residues
  3) translation table
       the standard table

There isn't an alphabet in biopython which supports 2), so we need to
make a new one:

from Bio import Alphabet
from Bio.Data import IUPACData

class ProteinX(Alphabet.ProteinAlphabet):
    letters = IUPACData.extended_protein_letters + "X"

proteinX = ProteinX()

(To interoperate well with the other tools, you would also need to set up
the right encodings for ProteinX molecular weight table and tell the
PropertyManager about it.)

The existing translation tables raise an exception if there isn't a given
codon, so what we can do is write a wrapper class around an existing
ambiguous translation table which returns an "X" when that happens. The
Codon table only needs to implement the "get" function for "translate"
(it needs __getitem__ for "translate_to_stop").

from Bio.Data import CodonTable
from Bio.Alphabet import IUPAC

# Forward translation table, mapping codon to protein
class MissingTable:
    def __init__(self, table):
        self._table = table
    def get(self, codon, stop_symbol):
        try:
            return self._table.get(codon, stop_symbol)
        except CodonTable.TranslationError:
            return 'X'

# Make the codon table given an existing table
def makeTableX(table):
    assert table.protein_alphabet == IUPAC.extended_protein
    return CodonTable.CodonTable(table.nucleotide_alphabet, proteinX,
                                 MissingTable(table.forward_table),
                                 table.back_table, table.start_codons,
                                 table.stop_codons)

standard_table = makeTableX(CodonTable.ambiguous_dna_by_id[1])
# Could also convert the other translation tables ...

# Now make a translator object based on that codon table
from Bio.Tools import Translate
translator = Translate.Translator(standard_table)

from Bio import Seq

>>> translator.translate(Seq.Seq("NATGATTANAATNTATTCCATTATATTG", \
...                              IUPAC.ambiguous_dna))
Seq('XDXNXFHYI', ProteinX())
>>> translator.translate(Seq.Seq("NATGATTANAATNTATTCCATTATATTGTTTAR", \
...                              IUPAC.ambiguous_dna))
Seq('XDXNXFHYIV*', ProteinX())

= = = = = = = =

(This is the section you probably want to use.)

As you can see, this still translates stop codons to "*" if there isn't
any other option ("TAR" is either "TAA" or "TAG", which are both stop
codons). Suppose you didn't want that. Then you might try the following:

class MissingTable2:
    def __init__(self, table):
        self._table = table
    def __getitem__(self, codon):   # translate_to_stop uses [codon]
        try:                        # instead of get(codon, stop_symbol)
            return self._table.get(codon, 'X')   # Always uses "X"
        except CodonTable.TranslationError:
            return 'X'                           # for what would be stop codons

# Make the codon table given an existing table
def makeTableX2(table):
    assert table.protein_alphabet == IUPAC.extended_protein
    return CodonTable.CodonTable(table.nucleotide_alphabet, proteinX,
                                 MissingTable2(table.forward_table),
                                 table.back_table, table.start_codons,
                                 table.stop_codons)

standard_table2 = makeTableX2(CodonTable.ambiguous_dna_by_id[1])
translator2 = Translate.Translator(standard_table2)

>>> translator2.translate_to_stop(Seq.Seq("NATGATTANAATNTATTCCATTATATTG",
...                                       IUPAC.ambiguous_dna))
Seq('XDXNXFHYI', ProteinX())
>>> translator2.translate_to_stop(Seq.Seq("NATGATTANAATNTATTCCATTATATTGTTTAR",
...                                       IUPAC.ambiguous_dna))
Seq('XDXNXFHYIVX', ProteinX())

= = = = =

This section is here for edification only. It does not produce what you
want.

Another way you might try to do it is to define 'X' as an ambiguous
character for any standard protein symbol. The AmbiguousCodonTable takes a
regular Codon table and two tables of values for the nucleotide and protein
ambiguity codes, and produces a CodonTable which tries to find the minimal
ambiguity needed to cover the input ambiguity. Since 'X' codes for any
protein, it becomes the fail-safe. That is:

proteinX_values = IUPACData.extended_protein_values.copy()
proteinX_values['X'] = IUPACData.extended_protein_letters

def makeTableX3(table):
    assert table.protein_alphabet == IUPAC.protein
    return CodonTable.AmbiguousCodonTable(table,
                                          IUPAC.ambiguous_dna,
                                          IUPACData.ambiguous_dna_values,
                                          proteinX,
                                          proteinX_values)

std_table = makeTableX3(CodonTable.unambiguous_dna_by_id[1])
translator3 = Translate.Translator(std_table)

This works for some cases, like NATGATTAGAATNTATTCCATTATATTGTTTAR, which
generates 'XD*NXFHYIV*', but it does not work for "TAN" because that codes
for both stop codons and for amino acids. The AmbiguousCodonTable doesn't
allow that to occur since 'X' is only a protein symbol and '*' is only a
stop symbol - there isn't a symbol which covers both cases. I chose this
because I wanted strict typing in the alphabets, and I knew you could get
around it with a bit more work, as shown above.

= = = = = =

This may seem complicated. I would love to see a cleaner way to do it, but
I couldn't think of one which let me:
   use (ambiguous and unambiguous) (dna/rna and protein)
   distinguish when an ambiguous codon can be both a stop codon and
     code for an amino acid

So what I did was provide the simplest case (standard alphabets and common
translation tables) as well as tools to make it do exactly what you want it
to do.

                    Andrew

P.S.
  There is a bug in the existing Alphabet.AlphabetEncoder.__getattr__.
It shouldn't forward anything starting and ending with "__". The
"ProteinX()" shown after the call to translate() really should be a
'HasStopCodon(ProteinX(), "*")'. The object is correct, it's just using
the wrong __repr__ for printing.
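
The behaviour discussed in this reply (one residue when every reading of a codon agrees, "*" when every reading is a stop, "X" otherwise) can be sketched without the 2000-era Bio.Tools.Translate API. The following is a minimal, standalone Python 3 illustration and not Biopython code: the name translate_with_x, the way the standard table is built, and the decision to collapse every disagreement to "X" (rather than to a narrower ambiguity letter such as "B") are assumptions made for this example.

```python
from itertools import product

# Standard genetic code (table 1), laid out in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC ambiguity codes for DNA: each letter maps to the bases it may stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_with_x(dna, stop_symbol="*", unknown="X"):
    """Translate DNA, writing `unknown` for codons with no single reading.

    Hypothetical helper for illustration only. Trailing bases that do not
    fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        # Expand the ambiguous codon into every concrete codon it covers.
        readings = {
            CODON_TABLE["".join(bases)]
            for bases in product(*(IUPAC_DNA[b] for b in codon))
        }
        if len(readings) == 1:
            residue = readings.pop()
            protein.append(stop_symbol if residue == "*" else residue)
        else:
            # Mixed readings, including "stop or amino acid" such as TAN.
            protein.append(unknown)
    return "".join(protein)


print(translate_with_x("NATGATTANAATNTATTCCATTATATTG"))
print(translate_with_x("NATGATTANAATNTATTCCATTATATTGTTTAR"))
print(translate_with_x("NATGATTANAATNTATTCCATTATATTGTTTAR", stop_symbol="X"))
```

Passing stop_symbol="X" corresponds to the MissingTable2 variant above, where unambiguous stop codons are also written as "X".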
From dalke at acm.org Thu Aug 10 12:07:41 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] unambiguous DNA
Message-ID: <013201c002e5$1e424600$3f91cdcf@josiah>

thomas@cbs.dtu.dk:
>I'd like to get 'X' instead of a '*' (stop signal) when there is no
>clear translation ...(when extracting all possible ORFs from raw - often
>pure - sequence data during e.g. complete genome sequencing projects)

Is this a common enough need for standard code to support it? If so, I
can think of a couple different ways.

1) As I described in my reply, there could be a new alphabet encoding
containing the 'X' character as an ambiguous amino acid. If so, should
stop codons still be translated to "*"? That is, should
   NATGATTANAATNTATTCCATTATATTGTTTAR
be translated to
   XDXNXFHYIV* (with a stop encoded alphabet using "*")
or
   XDXNXFHYIVX (with just the straight "X" extended alphabet)
?

Or should there be two different classes of translator objects available,
one for each request? (I would rather not, and instead use a converter
object to strip out the StopEncoded part.)

2) The translator object could acquire a third forward translation method
(in addition to "translate" and "translate_to_stop") perhaps named
"translate_ignoring_stop". The code would be something like:

def translate_ignoring_stop(self, seq, ignore_symbol = "X"):
    assert seq.alphabet == self.table.nucleotide_alphabet, \
           "cannot translate from the given alphabet (%s)" % seq.alphabet
    s = seq.data
    letters = []
    append = letters.append
    table = self.table
    get = table.forward_table.get
    n = len(seq)
    for i in range(0, n-n%3, 3):
        try:                                         # Change
            append(get(s[i:i+3], ignore_symbol))     # Change
        except TranslationError:                     # Change
            append(ignore_symbol)                    # Change

    # return with the correct alphabet encoding (cache the encoding)
    try:
        alphabet = self._ignore_encoded[ignore_symbol]               # Change
    except KeyError:
        # UnknownEncoded doesn't currently exist, but easy to make
        alphabet = Alphabet.UnknownEncoded(table.protein_alphabet)   # Change
        self._ignore_encoded[ignore_symbol] = alphabet               # Change
    return Seq.Seq(string.join(letters, ""), alphabet)

Of course, the back_translate method would need to be told how to deal with
UnknownEncoded which is hard with the current code. 'X' isn't part of the
protein alphabet so it can't be passed to the codon table's reverse lookup,
which expects one of the alphabet letters or 'None' for a stop codon.

What could be done is to get the protein_alphabet from the codon table,
sort it, and append 'None' to the list. (The sort is to guarantee a
consistent order no matter the codon table implementation in the future.)
Then when 'X' is found, choose successive letters from the sorted list,
looping as needed. This would get you a better looking result, although
the statistics will be wrong.

What I don't like about it is that the back translation way allows the
codon table to return a statistically appropriate result, while what I
outlined above doesn't.

I like (2) because it's easy to understand, but it does have that
statistical problem, so I would go with (1) even though it may lead to a
proliferation of slightly different alphabets.

On the third hand, the codon table could be changed to have some way to
return the statistically appropriate result, like a new method. (It could
use the method I outlined above, except there would need to be some way to
reset the loop through the alphabet so successive calls to back_translate
the same sequence could always give the same results.)

Now that I think about it, I like that third option the best. To repeat:
codon tables will have a new method which returns a generator for randomly
picked back translations. This generator implements a method (codon() ?)
which returns a (possibly statistically appropriate) nucleotide codon. The
back translate code would look like:

def _back_translate_ignore(self, seq):
    s = seq.data
    letter = seq.alphabet.unknown_symbol
    letters = []
    append = letters.append
    table = self.table.back_table
    back_gen = self.table.back_generator()
    for c in s:
        if c == letter:
            append(back_gen.codon())
        else:
            append(table[c])
    return Seq.Seq(string.join(letters, ""),
                   self.table.nucleotide_alphabet)

                    Andrew

From thomas at cbs.dtu.dk Fri Aug 11 04:18:40 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] antiparallel ?
Message-ID: <14739.46816.622809.199392@genome.cbs.dtu.dk>

Hi,

How are people changing sequences to antiparallel with biopython ?
Currently I use

    def complement(self, seq):
        return string.join(map(lambda x:IUPACData.ambiguous_dna_complement[x], map(None,seq)),'')

    def reverse(self, seq):
        r = map(None, seq)
        r.reverse()
        return string.join(r,'')

    def antiparallel(self, seq):
        s = self.complement(seq)
        s = self.reverse(s)
        return s

Is there another - better - way to go ?

thx
-thomas

--
Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
blippblopp@linux.nu         The Technical University of Denmark
CBS: +45 45 252485          Building 208, DK-2800 Lyngby
Fax  +45 45 931585          http://www.cbs.dtu.dk/thomas/index.html

    De Chelonian Mobile ... The Turtle Moves ...
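
For comparison, the three operations in the question above (complement, reverse, antiparallel) can be written in present-day Python 3 with a translation table and a slice. This is only a sketch under stated assumptions: it works on plain strings rather than Seq objects, the function names are chosen for this example, and the complement pairs are the usual IUPAC ones typed out by hand rather than read from IUPACData.ambiguous_dna_complement.

```python
# IUPAC nucleotide codes and their complements, upper and lower case.
_CODES = "ACGTMRWSYKVHDBN"
_COMPS = "TGCAKYWSRMBDHVN"
_COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def complement(seq):
    """Return the complement of a DNA string, keeping ambiguity codes."""
    return seq.translate(_COMPLEMENT)


def reverse(seq):
    """Return the string read backwards."""
    return seq[::-1]


def reverse_complement(seq):
    """Complement and reverse in one step (the 'antiparallel' strand)."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
print(reverse_complement("NATGATTAN"))
```

str.translate does the per-character lookup in C, which avoids both the lambda call per base and the intermediate list.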
From thomas at cbs.dtu.dk Fri Aug 11 07:47:38 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] BUG: blastparser: expect(2)
Message-ID: <14739.59354.32196.801776@genome.cbs.dtu.dk>

Hi,

The blastparser fails while reading a blastall result with the "-g = F" option.
(-g Perfom gapped alignment (not available with tblastx) [T/F] default = T)

Expect(2) means that there are 2 alignments for the same Sbjct:

c ya
-thomas

example code
##############################################
from Bio.Blast import NCBIStandalone
from Bio.Data import IUPACData

file = 'test.blastn'
parser = NCBIStandalone.BlastParser()
iter = NCBIStandalone.Iterator(handle = open(file), parser = parser)
while 1:
    rec = iter.next()
    if not rec: break

############# results in:
##############################################
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 587, in _parse
    dh.score = _safe_int(dh.score)
  File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 1469, in _safe_int
    return long(str)
ValueError: invalid literal for long(): 5e-45

######### the blast file:
##############################################
BLASTN 2.0.14 [Jun-29-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= HUMAGCGB
         (100 letters)

Database: ./ensembl.cdna
           37,720 sequences; 24,543,038 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value  N

ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont...   153  5e-45  2
ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont...
28 13 1 >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001 Length = 2673 Score = 46.1 bits (23), Expect(2) = 5e-45 Identities = 23/23 (100%) Strand = Plus / Plus Query: 1 atggagaccgtggtttgcccaag 23 ||||||||||||||||||||||| Sbjct: 1742 atggagaccgtggtttgcccaag 1764 Score = 153 bits (77), Expect(2) = 5e-45 Identities = 77/77 (100%) Strand = Plus / Plus Query: 24 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823 Query: 84 ttcaccatatgaggaac 100 ||||||||||||||||| Sbjct: 1824 ttcaccatatgaggaac 1840 >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001 Length = 1530 Score = 28.2 bits (14), Expect = 13 Identities = 14/14 (100%) Strand = Plus / Plus Query: 26 cctgggaagagagg 39 |||||||||||||| Sbjct: 57 cctgggaagagagg 70 Database: ./ensembl.cdna Posted date: Aug 3, 2000 1:07 PM Number of letters in database: 24,543,038 Number of sequences in database: 37,720 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Number of Hits to DB: 3 Number of Sequences: 37720 Number of extensions: 3 Number of successful extensions: 3 Number of sequences better than 10.0: 2 length of query: 100 length of database: 24,543,038 effective HSP length: 16 effective length of query: 84 effective length of database: 23,939,518 effective search space: 2010919512 effective search space used: 2010919512 T: 0 A: 0 X1: 6 (11.9 bits) X2: 10 (19.8 bits) S1: 12 (24.3 bits) S2: 14 (28.2 bits) BLASTN 2.0.14 [Jun-29-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. 
Query= HUMAGCGB (100 letters) Database: ./ensembl.cdna 37,720 sequences; 24,543,038 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value N ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont... 153 5e-45 2 ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont... 28 13 1 >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001 Length = 2673 Score = 46.1 bits (23), Expect(2) = 5e-45 Identities = 23/23 (100%) Strand = Plus / Plus Query: 1 atggagaccgtggtttgcccaag 23 ||||||||||||||||||||||| Sbjct: 1742 atggagaccgtggtttgcccaag 1764 Score = 153 bits (77), Expect(2) = 5e-45 Identities = 77/77 (100%) Strand = Plus / Plus Query: 24 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823 Query: 84 ttcaccatatgaggaac 100 ||||||||||||||||| Sbjct: 1824 ttcaccatatgaggaac 1840 >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001 Length = 1530 Score = 28.2 bits (14), Expect = 13 Identities = 14/14 (100%) Strand = Plus / Plus Query: 26 cctgggaagagagg 39 |||||||||||||| Sbjct: 57 cctgggaagagagg 70 Database: ./ensembl.cdna Posted date: Aug 3, 2000 1:07 PM Number of letters in database: 24,543,038 Number of sequences in database: 37,720 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Number of Hits to DB: 3 Number of Sequences: 37720 Number of extensions: 3 Number of successful extensions: 3 Number of sequences better than 10.0: 2 length of query: 100 length of database: 24,543,038 effective HSP length: 16 effective length of query: 84 effective length of database: 23,939,518 effective search space: 2010919512 effective search space used: 2010919512 T: 0 A: 0 X1: 6 (11.9 bits) X2: 10 (19.8 bits) S1: 12 (24.3 bits) S2: 14 (28.2 bits) ######## -- Sicheritz Ponten Thomas E. 
CBS, Department of Biotechnology
blippblopp@linux.nu            The Technical University of Denmark
CBS: +45 45 252485             Building 208, DK-2800 Lyngby
Fax +45 45 931585              http://www.cbs.dtu.dk/thomas/index.html

De Chelonian Mobile ... The Turtle Moves ...

From dalke at acm.org Fri Aug 11 13:11:16 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] antiparallel ?
Message-ID: <001e01c003b7$2b61c4a0$afc98ad1@josiah>

thomas@cbs.dtu.dk
>How are people changing sequences to antiparallel with biopython ?
>Currently I use
>
>    def complement(self, seq):
>        return string.join(map(lambda x:IUPACData.ambiguous_dna_complement[x], map(None,seq)),'')

Two things here. First, I like working in Seq space rather than as
strings. Which means I just realized there's no way to get the
complement table for an alphabet. (Well, there is a way using the
PropertyManager and setting the values in IUPACEncodings. It's just
not being done.) If it were, then this would be a function in utils
(not a method) and work like:

    def complement(seq):
        alphabet = seq.alphabet
        table = default_manager.resolve(alphabet, "complement_table")
        new_data = []
        for c in seq.data:
            new_data.append(table[c])
        return Seq(string.join(new_data, ''), alphabet)

If I weren't trying to get things done for BOSC, I would fix things now :(

Second, there's no need to do the map(None, seq) since a string is a
sequence-like object. That is,

    def spam(c):
        print "Character", repr(c)
        return c
    map(spam, "Andrew")

prints

    Character 'A'
    Character 'n'
    Character 'd'
    Character 'r'
    Character 'e'
    Character 'w'
    ['A', 'n', 'd', 'r', 'e', 'w']

Also, doing the map(lambda x: IUPACData.ambiguous_dna_complement[x], ...)
is slower than

    x = []
    for c in seq:
        x.append(IUPACData.ambiguous_dna_complement[c])
    return string.join(x, '')

because the lambda introduces the function call overhead. Also, using
a loop is easier for most people to understand.
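
For readers on a current Python, the loop-based complement described
above might look like the minimal sketch below. It is only an
illustration: the small COMPLEMENT table is a hand-written stand-in for
IUPACData.ambiguous_dna_complement (it covers just A, C, G, T and N),
and the function name and the plain-string input are assumptions, not
Biopython's API.

```python
# Hypothetical stand-in for a complement table; a real table would also
# cover the ambiguous IUPAC codes.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def complement(seq, table=COMPLEMENT):
    """Return the complement of seq using one table lookup per character."""
    out = []
    for base in seq:  # a string is already iterable, no map(None, seq) needed
        out.append(table[base])
    return "".join(out)


if __name__ == "__main__":
    print(complement("ATGGAGACC"))
```

A character missing from the table raises KeyError here, which seems
preferable to silently passing unknown symbols through.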
> def reverse(self, seq):
>     r = map(None, seq)
>     r.reverse()
>     return string.join(r,'')

Instead of "r = map(None, seq)" try "r = list(seq)".

> def antiparallel(self, seq):
>     s = self.complement(seq)
>     s = self.reverse(s)
>     return s

If you are interested in performance, you could repeat the code for
complement, except adding a ".reverse()" before the string.join. This
would prevent the extra conversion from list -> string -> list.

Is it usually called "antiparallel"? I'm used to "rc" or
"reverse_complement". I believe bioperl calls it "rc", so for
consistency that is what I would lean towards - except that it's too
short a name for my preferences.

Andrew

From jchang at SMI.Stanford.EDU Fri Aug 11 15:08:29 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] BUG: blastparser: expect(2)
In-Reply-To: <14739.59354.32196.801776@genome.cbs.dtu.dk>
Message-ID:

Thanks for the bug report! It's nice to know what Expect(2) means now.
I think I've fixed the bug in the CVS tree. Here's my CVS log message
roughly describing the fixes:

fixes for bug found by Thomas Ponten Sicheritz. When doing blastall
with gapped alignment turned off (-g F), the descriptions will include
an extra term N that indicates the number of alignments for the same
subject.

                                                 Score    E
Sequences producing significant alignments:      (bits)  Value  N

This is also reported in the alignments as "Expect(??) = XXXX".

Thus, I have added a num_alignments member to the Record.HSP and
Record.Description classes. I made changes to the scanners in NCBIWWW
and NCBIStandalone and the appropriate consumers in NCBIStandalone to
parse this information. Plus, I added a new event called
"description_header" so that I will know whether to expect the "N"
term in the descriptions. I updated the docs and regression tests
accordingly.

Keep the bug reports coming in!
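
The "Expect(N)" form described in the log message can be recognized
with a small regular expression. The sketch below is not the
NCBIStandalone scanner/consumer code: the function name
parse_score_line and the dictionary keys are made up for illustration,
and defaulting num_alignments to 1 when no "(N)" is present is an
assumption (it matches the N column of the sample report in this
thread). The sample lines are taken from that report.

```python
import re

# Matches e.g. "Score = 46.1 bits (23), Expect(2) = 5e-45"
# and          "Score = 28.2 bits (14), Expect = 13".
SCORE_RE = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s+bits\s+\((?P<score>\d+)\),\s*"
    r"Expect(?:\((?P<n>\d+)\))?\s*=\s*(?P<expect>\S+)"
)


def parse_score_line(line):
    """Extract bits, raw score, alignment count and E-value from a score line."""
    m = SCORE_RE.search(line)
    if m is None:
        raise ValueError("not a score line: %r" % line)
    return {
        "bits": float(m.group("bits")),
        "score": int(m.group("score")),
        # Assumption: a bare "Expect" means a single alignment.
        "num_alignments": int(m.group("n")) if m.group("n") else 1,
        # The E-value is a float such as 5e-45, so it must not be parsed as an int.
        "expect": float(m.group("expect")),
    }


if __name__ == "__main__":
    print(parse_score_line("Score = 46.1 bits (23), Expect(2) = 5e-45"))
    print(parse_score_line("Score = 28.2 bits (14), Expect = 13"))
```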
Thanks,
Jeff

On Fri, 11 Aug 2000 thomas@cbs.dtu.dk wrote:

> Hi,
>
> The blastparser fails while reading a blastall result with the "-g = F" option.
> (-g Perform gapped alignment (not available with tblastx) [T/F] default = T)
>
> Expect(2) means that there are 2 alignments for the same Sbjct:
>
> c ya
> -thomas
> example code
> ##############################################
> from Bio.Blast import NCBIStandalone
> from Bio.Data import IUPACData
>
> file = 'test.blastn'
> parser = NCBIStandalone.BlastParser()
> iter = NCBIStandalone.Iterator(handle = open(file), parser = parser)
>
> while 1:
>     rec = iter.next()
>     if not rec: break
> #############
>
> results in:
> ##############################################
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 587, in _parse
>     dh.score = _safe_int(dh.score)
>   File "/home/genome6/thomas/cbs/python/biopython/Bio/Blast/NCBIStandalone.py", line 1469, in _safe_int
>     return long(str)
> ValueError: invalid literal for long(): 5e-45
> #########
>
> the blast file:
> ##############################################
> BLASTN 2.0.14 [Jun-29-2000]
>
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs", Nucleic Acids Res. 25:3389-3402.
>
> Query= HUMAGCGB
> (100 letters)
>
> Database: ./ensembl.cdna
> 37,720 sequences; 24,543,038 total letters
>
> Searching..................................................done
>
> Score E
> Sequences producing significant alignments: (bits) Value N
>
> ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont... 153 5e-45 2
> ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont...
28 13 1 > > >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001 > Length = 2673 > > Score = 46.1 bits (23), Expect(2) = 5e-45 > Identities = 23/23 (100%) > Strand = Plus / Plus > > > Query: 1 atggagaccgtggtttgcccaag 23 > ||||||||||||||||||||||| > Sbjct: 1742 atggagaccgtggtttgcccaag 1764 > > > Score = 153 bits (77), Expect(2) = 5e-45 > Identities = 77/77 (100%) > Strand = Plus / Plus > > > Query: 24 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823 > > > Query: 84 ttcaccatatgaggaac 100 > ||||||||||||||||| > Sbjct: 1824 ttcaccatatgaggaac 1840 > > > >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001 > Length = 1530 > > Score = 28.2 bits (14), Expect = 13 > Identities = 14/14 (100%) > Strand = Plus / Plus > > > Query: 26 cctgggaagagagg 39 > |||||||||||||| > Sbjct: 57 cctgggaagagagg 70 > > > Database: ./ensembl.cdna > Posted date: Aug 3, 2000 1:07 PM > Number of letters in database: 24,543,038 > Number of sequences in database: 37,720 > > Lambda K H > 1.37 0.711 1.31 > > > Matrix: blastn matrix:1 -3 > Number of Hits to DB: 3 > Number of Sequences: 37720 > Number of extensions: 3 > Number of successful extensions: 3 > Number of sequences better than 10.0: 2 > length of query: 100 > length of database: 24,543,038 > effective HSP length: 16 > effective length of query: 84 > effective length of database: 23,939,518 > effective search space: 2010919512 > effective search space used: 2010919512 > T: 0 > A: 0 > X1: 6 (11.9 bits) > X2: 10 (19.8 bits) > S1: 12 (24.3 bits) > S2: 14 (28.2 bits) > BLASTN 2.0.14 [Jun-29-2000] > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), > "Gapped BLAST and PSI-BLAST: a new generation of protein database search > programs", Nucleic Acids Res. 
25:3389-3402. > > Query= HUMAGCGB > (100 letters) > > Database: ./ensembl.cdna > 37,720 sequences; 24,543,038 total letters > > Searching..................................................done > > Score E > Sequences producing significant alignments: (bits) Value N > > ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Cont... 153 5e-45 2 > ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Cont... 28 13 1 > > >ENST00000022209 Gene:ENSG00000020685 Clone:AC012263 Contig:AC012263.00001 > Length = 2673 > > Score = 46.1 bits (23), Expect(2) = 5e-45 > Identities = 23/23 (100%) > Strand = Plus / Plus > > > Query: 1 atggagaccgtggtttgcccaag 23 > ||||||||||||||||||||||| > Sbjct: 1742 atggagaccgtggtttgcccaag 1764 > > > Score = 153 bits (77), Expect(2) = 5e-45 > Identities = 77/77 (100%) > Strand = Plus / Plus > > > Query: 24 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 83 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > Sbjct: 1764 gccctgggaagagaggcggaaacggagaagcctttccagtgaccgtgggaggacaaccca 1823 > > > Query: 84 ttcaccatatgaggaac 100 > ||||||||||||||||| > Sbjct: 1824 ttcaccatatgaggaac 1840 > > > >ENST00000008890 Gene:ENSG00000008430 Clone:AC007637 Contig:AC007637.00001 > Length = 1530 > > Score = 28.2 bits (14), Expect = 13 > Identities = 14/14 (100%) > Strand = Plus / Plus > > > Query: 26 cctgggaagagagg 39 > |||||||||||||| > Sbjct: 57 cctgggaagagagg 70 > > > Database: ./ensembl.cdna > Posted date: Aug 3, 2000 1:07 PM > Number of letters in database: 24,543,038 > Number of sequences in database: 37,720 > > Lambda K H > 1.37 0.711 1.31 > > > Matrix: blastn matrix:1 -3 > Number of Hits to DB: 3 > Number of Sequences: 37720 > Number of extensions: 3 > Number of successful extensions: 3 > Number of sequences better than 10.0: 2 > length of query: 100 > length of database: 24,543,038 > effective HSP length: 16 > effective length of query: 84 > effective length of database: 23,939,518 > effective search space: 2010919512 > effective search space 
used: 2010919512
> T: 0
> A: 0
> X1: 6 (11.9 bits)
> X2: 10 (19.8 bits)
> S1: 12 (24.3 bits)
> S2: 14 (28.2 bits)
> ########
>
>
> --
> Sicheritz Ponten Thomas E.     CBS, Department of Biotechnology
> blippblopp@linux.nu            The Technical University of Denmark
> CBS: +45 45 252485             Building 208, DK-2800 Lyngby
> Fax +45 45 931585              http://www.cbs.dtu.dk/thomas/index.html
>
> De Chelonian Mobile ... The Turtle Moves ...
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev@biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
>

From katel at worldpath.net Mon Aug 21 14:27:13 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Gobase
References: <002201bff9c4$6b8ef360$64dc85d0@g0fjl>
Message-ID: <002401c00b9d$6efe3b60$b89403cf@g0fjl>

I've been looking into Gobase, a mitochondrial database, and
wondering whether to use a line-oriented or a streaming approach.
The Gobase pages don't use as much formatting as Rebase, so the
ParserSupport routines would work. But streaming lets the utility
strip off all the HTML, so the user doesn't have to delete the
preamble. Streaming is also less brittle if the format should
change. On the other hand, it's more bug-prone because it removes
linefeeds before they can be used as delimiters.

Cayte

From jchang at SMI.Stanford.EDU Mon Aug 28 19:44:58 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: Gobase
In-Reply-To: <002401c00b9d$6efe3b60$b89403cf@g0fjl>
Message-ID:

At BOSC, Andrew Dalke gave a really nice presentation on Martel, his
parser generator. Basically, it takes a regular language description
of a file format and creates an event-oriented parser for the format.

We're currently looking into using Martel to create parsers for
biopython. While it's not certain that we'll adopt Martel, it
currently appears likely.
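
To make the idea of a format description driving an event-oriented
parser a little more concrete, here is a rough sketch using only
Python's re module. It is not Martel and does not use Martel's API:
the "NAME: value" line format, the handler class and its method names
(start, field, end) are all invented for illustration.

```python
import re

# A toy "format description": every non-blank line is "NAME: value".
LINE_RE = re.compile(r"^(?P<name>[A-Za-z_]+):\s*(?P<value>.*)$")


class CollectingHandler:
    """Hypothetical SAX-like handler: receives events and builds a dict."""

    def __init__(self):
        self.record = {}
        self.events = []

    def start(self):
        self.events.append("start")

    def field(self, name, value):
        self.events.append("field")
        self.record[name] = value

    def end(self):
        self.events.append("end")


def parse(lines, handler):
    """Match each line against the description and fire handler events."""
    handler.start()
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines carry no event
        m = LINE_RE.match(line)
        if m is None:
            # One possible answer to "errors in format": fail loudly with the line number.
            raise ValueError("line %d does not match format: %r" % (lineno, line))
        handler.field(m.group("name"), m.group("value"))
    handler.end()


if __name__ == "__main__":
    h = CollectingHandler()
    parse(["Query: HUMAGCGB", "Letters: 100"], h)
    print(h.events, h.record)
```

The parser knows nothing about what the caller does with the events,
which is the separation the scanner/consumer design also aims for.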
I think Martel has these advantages:
- more optimizable
- SAX-like parser more familiar than scanner/consumer
- syntax descriptions can be cross-language

Disadvantages:
- regular expressions hard to debug (alleviated by good help messages?)
- currently slower for swissprot tests (but that may change)
- unclear how to handle exceptional cases (e.g. errors in format)
- not yet stable

Unclear:
- which is easier to maintain?
- which is easier to create?

So to answer your question, you may want to try to create a Gobase parser
in Martel, and then let us know what you think. It would be a good test
case, and probably helpful to Andrew to know whether it can handle the
format.

Jeff

On Mon, 21 Aug 2000, Cayte wrote:

> I've been looking into Gobase, a mitochondrial database, and
> wondering whether to use a line-oriented or a streaming approach.
> The Gobase pages don't use as much formatting as Rebase, so the
> ParserSupport routines would work. But the streaming lets the utility
> strip off all the HTML, so the user doesn't have to delete the
> preamble. The streaming is also less brittle if the format should
> change. On the other hand, it's more bug-prone because it removes
> linefeeds before they can be used as delimiters.
>
>
> Cayte
>
>

From katel at worldpath.net Wed Aug 30 03:08:56 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: Gobase
References:
Message-ID: <002a01c01251$2a4cc8a0$010a0a0a@0q6vm>

> So to answer your question, you may want to try to create a Gobase parser
> in Martel, and then let us know what you think. It would be a good test
> case, and probably helpful to Andrew to know whether it can handle the
> format.
>
> Jeff
>
>
>
> On Mon, 21 Aug 2000, Cayte wrote:
>
> > I've been looking into Gobase, a mitochondrial database, and
> > wondering whether to use a line-oriented or a streaming approach.
> > The Gobase pages don't use as much formatting as Rebase, so the
> > ParserSupport routines would work.
But the streaming lets the utility
> > strip off all the HTML, so the user doesn't have to delete the
> > preamble. The streaming is also less brittle if the format should
> > change. On the other hand, it's more bug-prone because it removes
> > linefeeds before they can be used as delimiters.
> >
> >

Unfortunately, I already started Gobase. By generalizing, I was able
to scrunch the code I used for Rebase. Instead of things like:

    def _scan_methylation(self, text, consumer ):
        start = string.find( text, 'Base (Type of methylation):' )
        if( start != -1 ):
            end = string.find( text, 'REBASE enzyme #:' )
            next_item = text[ start:end ]
            consumer.methylation( next_item )

I coded:

    def _scan_field(self, text, field, next_field = None ):
        start = string.find( text, field )
        if( start == -1 ):
            return None
        if( next_field is None ):
            # no following field given: take a fixed 40-character window
            end = start + 40
        else:
            end = string.find( text, next_field )
            if( end == -1 ):
                return None
        next_item = text[ start:end ]
        return( next_item )

But XML is a-comin' and we'll need something like Martel. I plan to
get familiarized with it. I may use it for my next parser, when I get
a round tooit. Maybe www.methdb.de ( methylation database ).

Cayte

From jchang at SMI.Stanford.EDU Thu Aug 31 18:09:41 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] 0.90d03 to be released soon
Message-ID:

Hello everybody,

I'd like to make a 0.90d03 release tomorrow. We've accumulated enough
fixes in blast and other parsing code that are worth getting into
people's hands. Please check in any fixes by the end of today, or let
me know if I should hold off.

Thanks,
Jeff