From italo.maia at gmail.com Mon Jun 4 12:36:21 2007
From: italo.maia at gmail.com (Italo Maia)
Date: Mon, 4 Jun 2007 13:36:21 -0300
Subject: [BioPython] Problem with blastx output parsing =~
Message-ID: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com>

Well, I have a perfectly fine blastx output that throws an error when parsed
by biopython.
It gives me this output:

Traceback (most recent call last):
  File "", line 1, in
  File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 624, in parse
    self._scanner.feed(handle, self._consumer)
  File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 99, in feed
    self._scan_parameters(uhandle, consumer)
  File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 570, in _scan_parameters
    has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase"))
  File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, in read_and_call
    raise SyntaxError, errmsg
SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase':
Number of HSP's gapped: 136690

What can I do? I'm using Ubuntu Feisty here.

-- 
"Arrogance is the weapon of the weak."
===========================
Italo Moreira Campelo Maia
Computer Science - UECE
Web Developer
Java, Python Programmer
My blog ^^ http://eusouolobomal.blogspot.com/
===========================

From biopython at maubp.freeserve.co.uk Mon Jun 4 13:05:40 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 04 Jun 2007 18:05:40 +0100
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com>
Message-ID: <46644664.6080009@maubp.freeserve.co.uk>

Italo Maia wrote:
> Well, i have a perfectly fine blastx output that throws an error when parsed
> by biopython.
> It gives me this output: > > Traceback (most recent call last): > File "", line 1, in > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 624, in parse > self._scanner.feed(handle, self._consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 99, in feed > self._scan_parameters(uhandle, consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 570, in _scan_parameters > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, > in read_and_call > raise SyntaxError, errmsg > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > Number of HSP's gapped: 136690 > > What could i do??? I'm using ubuntu feisty here. It looks like you are using the plain text output from blast, so we would recommend you try the XML output instead. See section 3.4 of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html If you really want to use the plain text output, please file a bug (including Biopython version number) and then attach the plain text blast output which fails. But no promises - its an uphill battle to keep the parser up to date with each version of Blast! Peter From italo.maia at gmail.com Mon Jun 4 13:22:15 2007 From: italo.maia at gmail.com (Italo Maia) Date: Mon, 4 Jun 2007 14:22:15 -0300 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <46644664.6080009@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> Message-ID: <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> Well, i have 24 thousand of those, i think it would be very painfull to remake them...i'll fill the the bug, but, could there be a workaround? The file goes below: <<>> BLASTX 2.2.15 [Oct-15-2006] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. 
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= 26 (858 letters) Database: Leigo 4,535,438 sequences; 1,573,298,872 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus] 39 0.33 gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus] 38 0.57 gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus] 38 0.57 gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus] 38 0.75 >gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus] Length = 843 Score = 38.9 bits (89), Expect = 0.33 Identities = 24/89 (26%), Positives = 42/89 (47%), Gaps = 1/89 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IW+ HP + P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTTRQSFGVEPSGSRHIDNSASSTTSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHR 825 + ++ K +++GRA PS+ R Sbjct: 285 AYSHLSTSKRQSSSGRAVELHNIPPSSVR 313 >gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus] Length = 843 Score = 38.1 bits (87), Expect = 0.57 Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IWS HP + P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRSFGVEPSGSGHIDNSASSTSSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A P++ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVEFHNIPPNSARS 314 >gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus] Length = 843 Score = 38.1 bits (87), Expect = 0.57 Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IW+ HP + P G+ 
H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTSRRSFGVEPSGSGHIDNSASSASSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A PS+ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVELLNIPPSSARS 314 >gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus] Length = 843 Score = 37.7 bits (86), Expect = 0.75 Identities = 24/90 (26%), Positives = 41/90 (45%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IWS HP P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRPFGVEPSGSGHIDNTASSTSSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A PS+ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVELHNIPPSSARS 314 Database: Leigo Posted date: Jan 22, 2007 11:26 AM Number of letters in database: 1,573,298,872 Number of sequences in database: 4,535,438 Lambda K H 0.318 0.134 0.401 Gapped Lambda K H 0.267 0.0410 0.140 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Sequences: 4535438 Number of Hits to DB: 2,724,816,234 Number of extensions: 65999927 Number of successful extensions: 158184 Number of sequences better than 2.0: 4 Number of HSP's gapped: 158133 Number of HSP's successfully gapped: 4 Length of query: 286 Length of database: 1,573,298,872 Length adjustment: 130 Effective length of query: 156 Effective length of database: 983,691,932 Effective search space: 153455941392 Effective search space used: 153455941392 Neighboring words threshold: 12 Window for multiple hits: 40 X1: 16 ( 7.3 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 41 (21.7 bits) S2: 32 (16.9 bits) <<>> 2007/6/4, Peter : > > Italo Maia wrote: > > Well, i have a perfectly fine blastx output that throws an error when > parsed > > by biopython. 
> > It gives me this output: > > > > Traceback (most recent call last): > > File "", line 1, in > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 624, in parse > > self._scanner.feed(handle, self._consumer) > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 99, in feed > > self._scan_parameters(uhandle, consumer) > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 570, in _scan_parameters > > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line > 300, > > in read_and_call > > raise SyntaxError, errmsg > > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > > Number of HSP's gapped: 136690 > > > > What could i do??? I'm using ubuntu feisty here. > > It looks like you are using the plain text output from blast, so we > would recommend you try the XML output instead. > > See section 3.4 of the tutorial: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > > If you really want to use the plain text output, please file a bug > (including Biopython version number) and then attach the plain text > blast output which fails. But no promises - its an uphill battle to keep > the parser up to date with each version of Blast! > > Peter > > -- "A arrog?ncia ? a arma dos fracos." 
=========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB Programador Java, Python Meu blog ^^ http://eusouolobomal.blogspot.com/ =========================== From winter at biotec.tu-dresden.de Mon Jun 4 13:08:34 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Mon, 04 Jun 2007 19:08:34 +0200 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> Message-ID: <46644712.3030405@biotec.tu-dresden.de> Hi Maia: Could you post the begin and end of your blastx output? I think you can omit the query and hits in between... Cheers, Christof Italo Maia wrote: > Well, i have a perfectly fine blastx output that throws an error when parsed > by biopython. > It gives me this output: > > Traceback (most recent call last): > File "", line 1, in > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 624, in parse > self._scanner.feed(handle, self._consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 99, in feed > self._scan_parameters(uhandle, consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 570, in _scan_parameters > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, > in read_and_call > raise SyntaxError, errmsg > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > Number of HSP's gapped: 136690 > > What could i do??? I'm using ubuntu feisty here. 
> > From biopython at maubp.freeserve.co.uk Mon Jun 4 13:41:38 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 04 Jun 2007 18:41:38 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> Message-ID: <46644ED2.1080505@maubp.freeserve.co.uk> Italo Maia wrote: > Well, i have 24 thousand of those, i think it would be very painfull to > remake them... That is a good reason not to re-run blast! > i'll fill the the bug, but, could there be a workaround? If you haven't filled a new bug already, you could attach the file to bug 2090 which is similar: http://bugzilla.open-bio.org/show_bug.cgi?id=2090 You could try that patch - it might help. I would have tested your example on my setup, but the line wrapping had been messed up. Peter From cjfields at uiuc.edu Mon Jun 4 13:55:30 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 4 Jun 2007 12:55:30 -0500 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <46644664.6080009@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> Message-ID: On Jun 4, 2007, at 12:05 PM, Peter wrote: > ... > It looks like you are using the plain text output from blast, so we > would recommend you try the XML output instead. > > See section 3.4 of the tutorial: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > > If you really want to use the plain text output, please file a bug > (including Biopython version number) and then attach the plain text > blast output which fails. But no promises - its an uphill battle to > keep > the parser up to date with each version of Blast! 
>
> Peter

Same with the bioperl parser; we routinely recommend parsing XML or
tabular output as they are more stable. Here is NCBI's official
response (via Scott McGinnis) to problems with text BLAST output parsing:

http://bioperl.org/wiki/NCBI_Blast_email

chris

From winter at biotec.tu-dresden.de Tue Jun 5 04:24:07 2007
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Tue, 05 Jun 2007 10:24:07 +0200
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To:
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk>
Message-ID: <46651DA7.6020802@biotec.tu-dresden.de>

Chris Fields wrote:
> On Jun 4, 2007, at 12:05 PM, Peter wrote:
>
>> ...
>> It looks like you are using the plain text output from blast, so we
>> would recommend you try the XML output instead.
>>
>> See section 3.4 of the tutorial:
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>
>> If you really want to use the plain text output, please file a bug
>> (including Biopython version number) and then attach the plain text
>> blast output which fails. But no promises - its an uphill battle to
>> keep
>> the parser up to date with each version of Blast!
>>
>> Peter
>
> Same with the bioperl parser; we routinely recommend parsing XML or
> tabular output as they are more stable.

I fully agree with that! I think the reason people still use the flat
file format so much is that they want to be able to read it easily.
What these people are probably missing is an easy way to convert an XML
Blast output to a plain text Blast report. Would it then make sense to
include such a conversion script in BioPython? Maybe XSLT is the
easiest option?
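A rough sketch of the kind of conversion script discussed above, written in Python with only the standard library instead of XSLT. It assumes the element names used in NCBI BLAST XML output (`Hit_def`, `Hit_len`, `Hsp_bit-score`, `Hsp_evalue`). The `SAMPLE_XML` fragment and the `summarize_blast_xml` helper are made up for illustration and are not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed BLAST XML fragment, for illustration only.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>polymerase [Hepatitis B virus]</Hit_def>
          <Hit_len>843</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>38.9</Hsp_bit-score>
              <Hsp_evalue>0.33</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def summarize_blast_xml(xml_text):
    """Return one plain-text line per HSP: description, length, score, E-value."""
    root = ET.fromstring(xml_text)
    lines = []
    for hit in root.iter("Hit"):
        definition = hit.findtext("Hit_def", default="?")
        length = hit.findtext("Hit_len", default="?")
        for hsp in hit.iter("Hsp"):
            score = hsp.findtext("Hsp_bit-score", default="?")
            evalue = hsp.findtext("Hsp_evalue", default="?")
            lines.append("%s  Length = %s  Score = %s bits  Expect = %s"
                         % (definition, length, score, evalue))
    return lines


for line in summarize_blast_xml(SAMPLE_XML):
    print(line)
```

This only prints a per-hit summary, not the full pairwise alignment layout of a text report; a real converter would also need the query, hit and midline strings from each HSP.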
Cheers,
Christof

From biopython at maubp.freeserve.co.uk Tue Jun 5 06:53:02 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 05 Jun 2007 11:53:02 +0100
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <46644ED2.1080505@maubp.freeserve.co.uk>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk>
Message-ID: <4665408E.2090306@maubp.freeserve.co.uk>

Peter wrote:
> Italo Maia wrote:
>> Well, i have 24 thousand of those, i think it would be very painfull to
>> remake them...

Italo,

Please could you attach one or two of your plain text blast output files
to bug 2090, or just email them to me directly (not the list) as
attachments (not pasted into the body of the email). Thanks.

http://bugzilla.open-bio.org/show_bug.cgi?id=2090

Getting Biopython's plain text blast parser updated may not be too much
work... assuming your results are all in separate files, that is. If you
have run blast with multiple inputs, then recent NCBI changes have made
life harder.

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 08:21:36 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Tue, 05 Jun 2007 14:21:36 +0200
Subject: [BioPython] Cannot parse GenBank file
Message-ID: <46655550.70400@ribosome.natur.cuni.cz>

Hi,
I am trying to parse a GenBank file created by ApE plasmid editor (see
Google for details) with biopython-1.43 and I get:

>>> fhandle = open('/mnt/smartmedia/pim-1/pGL3R.gb')
>>> genbank entry = parser.parse(fhandle)
  File "", line 1
    genbank entry = parser.parse(fhandle)
                ^
SyntaxError: invalid syntax
>>> genbank_entry = parser.parse(fhandle)
Traceback (most recent call last):
  File "", line 1, in ?
File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line assert False, \ AssertionError: Did not recognise the LOCUS line layout: LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >>> Is the number of spaces wrong? Thanks for clues, Martin -------------- next part -------------- A non-text attachment was scrubbed... Name: pGL3R.gb.zip Type: application/zip Size: 3713 bytes Desc: not available Url : http://lists.open-bio.org/pipermail/biopython/attachments/20070605/6fa6d850/attachment.zip From ezequiel.panepucci at psi.ch Tue Jun 5 10:02:10 2007 From: ezequiel.panepucci at psi.ch (Ezequiel Panepucci) Date: Tue, 5 Jun 2007 16:02:10 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: <46655550.70400@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> Message-ID: > genbank entry = parser.parse(fhandle) there is a space character between "genbank" and "entry". It is a syntax error. I suppose you meant "genbank_entry" ? cheers, Zac From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 10:04:20 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Tue, 05 Jun 2007 16:04:20 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: References: <46655550.70400@ribosome.natur.cuni.cz> Message-ID: <46656D64.7010508@ribosome.natur.cuni.cz> Ezequiel Panepucci wrote: >> genbank entry = parser.parse(fhandle) > > there is a space character between "genbank" and "entry". > It is a syntax error. > I suppose you meant "genbank_entry" ? Yes, the next command was right and has shown the error. Sorry, I forgot to delete the first attempt. ;-) >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? 
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed
    self._feed_first_line(consumer, self.line)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line
    assert False, \
AssertionError: Did not recognise the LOCUS line layout:
LOCUS 6499 bp ds-DNA linear 02-AUG-2006
>>>

Martin

From biopython at maubp.freeserve.co.uk Tue Jun 5 10:46:23 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 05 Jun 2007 15:46:23 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <46655550.70400@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz>
Message-ID: <4665773F.2070108@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> Hi,
> I am trying to parse a GenBank file created by ApE plasmid editor (see
> Google for details) with biopython-1.43 and I get:
...
> AssertionError: Did not recognise the LOCUS line layout:
> LOCUS 6499 bp ds-DNA linear 02-AUG-2006
>
> Is the number of spaces wrong?

Yes - the fields don't line up with either of the GenBank variants
Biopython expects. I suspect their files don't follow the current NCBI
standard for the locus line...

Could you make a set of different files (for different sequences) and
check if the spacing changes or is preserved?

Thanks Martin,

Peter

From cjfields at uiuc.edu Tue Jun 5 11:28:24 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 5 Jun 2007 10:28:24 -0500
Subject: [BioPython] Cannot parse GenBank file
In-Reply-To: <46656D64.7010508@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz>
Message-ID: <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu>

Martin,

The example file you give in the bioperl bugzilla report has several
blank annotation lines which may lead to additional problems.
When the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM, DEFINITION, etc) then it expects there will also be relevant data (text descriptions) accompanying it; I assume the BioPython parser expects likewise though I may be wrong. AFAIK the inclusion of field names w/o text isn't GenBank/EMBL- compliant. GenBank records lacking text either have a '.' instead or are left out entirely: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html We could add a fix but you should probably contact the ApE developers and request that field names w/o text be left out or have '.' added. chris On Jun 5, 2007, at 9:04 AM, Martin MOKREJ? wrote: > Ezequiel Panepucci wrote: >>> genbank entry = parser.parse(fhandle) >> >> there is a space character between "genbank" and "entry". >> It is a syntax error. >> I suppose you meant "genbank_entry" ? > > Yes, the next command was right and has shown the error. Sorry, I > forgot > to delete the first attempt. ;-) > >>>> genbank_entry = parser.parse(fhandle) > Traceback (most recent call last): > File "", line 1, in ? > File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", > line 187, in parse > self._scanner.feed(handle, self._consumer) > File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line 360, in feed > self._feed_first_line(consumer, self.line) > File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line 835, in _feed_first_line > assert False, \ > AssertionError: Did not recognise the LOCUS line layout: > LOCUS 6499 bp ds-DNA linear 02-AUG-2006 > >>>> > > Martin > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Tue Jun 5 12:07:41 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 5 Jun 2007 11:07:41 -0500
Subject: [BioPython] Cannot parse GenBank file
In-Reply-To: <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu>
References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu>
Message-ID:

One thing I missed which explains the biopython error: the LOCUS line
is missing the locus identifier (see the NCBI example record link).
This doesn't choke the bioperl parser but it appears to stop the
biopython parser in its tracks (maybe a feature instead of a bug!).

You should try adding a unique identifier (maybe the name of the file
or record) to the LOCUS line to see if it works:

LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006

The bioperl parser in CVS writes out the correct alphabet when this is
added:

LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006

I'll try adding a warning to the bioperl parser for this.

chris

On Jun 5, 2007, at 10:28 AM, Chris Fields wrote:

> Martin,
>
> The example file you give in the bioperl bugzilla report has several
> blank annotation lines which may lead to additional problems. When
> the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM,
> DEFINITION, etc) then it expects there will also be relevant data
> (text descriptions) accompanying it; I assume the BioPython parser
> expects likewise though I may be wrong.
>
> AFAIK the inclusion of field names w/o text isn't GenBank/EMBL-
> compliant. GenBank records lacking text either have a '.' instead or
> are left out entirely:
>
> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
>
> We could add a fix but you should probably contact the ApE developers
> and request that field names w/o text be left out or have '.' added.
>
> chris
>
> On Jun 5, 2007, at 9:04 AM, Martin MOKREJŠ
wrote: > >> Ezequiel Panepucci wrote: >>>> genbank entry = parser.parse(fhandle) >>> >>> there is a space character between "genbank" and "entry". >>> It is a syntax error. >>> I suppose you meant "genbank_entry" ? >> >> Yes, the next command was right and has shown the error. Sorry, I >> forgot >> to delete the first attempt. ;-) >> >>>>> genbank_entry = parser.parse(fhandle) >> Traceback (most recent call last): >> File "", line 1, in ? >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", >> line 187, in parse >> self._scanner.feed(handle, self._consumer) >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >> line 360, in feed >> self._feed_first_line(consumer, self.line) >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >> line 835, in _feed_first_line >> assert False, \ >> AssertionError: Did not recognise the LOCUS line layout: >> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >> >>>>> >> >> Martin >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 11:24:05 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Tue, 05 Jun 2007 17:24:05 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665773F.2070108@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> Message-ID: <46658015.6030506@ribosome.natur.cuni.cz> Peter wrote: > Martin MOKREJ? wrote: >> Hi, >> I am trying to parse a GenBank file created by ApE plasmid editor >> (see Google for details) with biopython-1.43 and I get: > > ... > >> AssertionError: Did not recognise the LOCUS line layout: >> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >> >> Is the number of spaces wrong? > > Yes - fields don't line up with either of the GenBank variants Biopython > expects. I suspect their files doesn't follow the current NCBI standard > for the locus line... > > Could you make a set of different files (for different sequences) and > check if the spacing changes or is preserved? OK, two types of errors, the first case is caused by files generated by VectorNTI, the second type of error is caused by ApE editor-produced files: >>> fhandle = open('/mnt/smartmedia/utrophinA/p-cmvbGalCAT.gb','r') >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? 
File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed self._feed_header_lines(consumer, self.parse_header()) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 967, in _feed_header_lines getattr(consumer, consumer_dict[line_type])(data) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 409, in source if content[-1] == '.': IndexError: string index out of range >>> >>> fhandle = open('/mnt/smartmedia/nrf/ok/PBCRLucPFLuc.gb','r') >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line assert False, \ AssertionError: Did not recognise the LOCUS line layout: LOCUS 6988 bp ds-DNA linear 20-DEC-2006 >>> I would appreciate if you could tell me then what was exactly wrong with the generated files by ApE editor (author Cc:ed). Hope this helps, Martin -------------- next part -------------- A non-text attachment was scrubbed... 
Name: genbank-formatted-testcases.zip
Type: application/zip
Size: 32571 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20070605/d942bf78/attachment-0001.zip

From biopython at maubp.freeserve.co.uk Tue Jun 5 14:29:52 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 05 Jun 2007 19:29:52 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <46658015.6030506@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz>
Message-ID: <4665ABA0.3060500@maubp.freeserve.co.uk>

Hi Wayne & all the Biopython mailing list,

Martin has been trying to parse some GenBank files produced by ApE
plasmid editor, and Biopython (and BioPerl?) don't like them.

Hopefully between us we can sort this out :)

By the way - Is the current ApE plasmid editor webpage here, because it
times out for me?:

http://www.biology.utah.edu/jorgensen/wayned/ape/

Martin MOKREJŠ wrote:
> I would appreciate if you could tell me then what was exactly wrong with
> the generated files by ApE editor (author Cc:ed).

OK then, looking at file elh/pNEX3.gb which starts:

LOCUS 2981 bp ds-DNA linear 12-OCT-2006
DEFINITION
ACCESSION
VERSION
SOURCE
ORGANISM
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
misc_feature 225..257
/ApEinfo_label=pNEX3-compatibile
...

I think the location of the size (2981 bp), sequence type (ds-DNA,
linear) and date (12-OCT-2006) are not in the correct positions (i.e.
column numbers). Also the locus ID is missing, which is not ideal.
Trying to do examples in an email is tricky as the line wrapping spoils
the effect.

Interestingly all these files seem to have their LOCUS line fields in
the same place - perhaps the ApE plasmid editor is following an out of
date version of the GenBank file format which I haven't seen before? If
so, we (Biopython) should be able to deal with this too.
For the current version of the LOCUS line spec, see:
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

In particular:
> The detailed format for the LOCUS line format is as follows:
>
> Positions  Contents
> ---------  --------
> 01-05      'LOCUS'
> 06-12      spaces
> 13-28      Locus name
> 29-29      space
> 30-40      Length of sequence, right-justified
> 41-41      space
> 42-43      bp
> 44-44      space
> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>            ms- (mixed-stranded)
> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>            mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
>            snoRNA. Left justified.
> 54-55      space
> 56-63      'linear' followed by two spaces, or 'circular'
> 64-64      space
> 65-67      The division code (see Section 3.3)
> 68-68      space
> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

Note that the protein variant "GenPept" is slightly different.

The next six lines of that example file (elh/pNEX3.gb) have no values -
as Chris Fields pointed out on the Biopython mailing list, the NCBI
likes to use a dot/period as a place holder.

The spec does explicitly say that the KEYWORDS can be omitted, but seems
to assume the other lines are expected. Biopython should be happy if
these lines are just omitted.
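To make the column layout quoted above concrete, here is a minimal Python sketch of a column-based reader. It is not Biopython's actual scanner; `parse_locus_columns` and `build_locus_line` are hypothetical helpers, the locus name "pGL3R" and division "SYN" are made-up values, and the spacing of the ApE-style line is approximated because the archive has collapsed the original whitespace.

```python
def parse_locus_columns(line):
    """Slice a LOCUS line using the 1-based column ranges from the spec."""
    if line[0:5] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "name": line[12:28].strip(),       # columns 13-28
        "length": int(line[29:40]),        # columns 30-40, right-justified
        "strand": line[44:47].strip(),     # columns 45-47
        "molecule": line[47:53].strip(),   # columns 48-53
        "topology": line[55:63].strip(),   # columns 56-63
        "division": line[64:67].strip(),   # columns 65-67
        "date": line[68:79].strip(),       # columns 69-79
    }


def build_locus_line(name, length, strand, molecule, topology, division, date):
    """Assemble a LOCUS line field by field so that the columns line up."""
    return ("LOCUS" + " " * 7 + name.ljust(16) + " " + str(length).rjust(11)
            + " bp " + strand.ljust(3) + molecule.ljust(6) + "  "
            + topology.ljust(8) + " " + division.ljust(3) + " " + date)


# "pGL3R" and "SYN" are illustrative; the other values appear in the thread.
good = build_locus_line("pGL3R", 6499, "ds-", "DNA", "linear", "SYN",
                        "02-AUG-2006")
print(parse_locus_columns(good))

# An ApE-style line with no locus name (spacing approximated) does not fit
# the fixed columns, so the length field cannot be read as a number.
bad = "LOCUS 6499 bp ds-DNA linear 02-AUG-2006"
try:
    parse_locus_columns(bad)
except ValueError as err:
    print("rejected:", err)
```

A reader built this way is strict by design: any shift in spacing moves a field out of its slice, which is the kind of failure reported in this thread.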
See also:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

> Hope this helps,

You might have upset some people by emailing an attachment to the entire
Biopython mailing list, but it wasn't too big at least ;)

Regards,

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 14:57:14 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Tue, 05 Jun 2007 20:57:14 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <4665ABA0.3060500@maubp.freeserve.co.uk>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk>
Message-ID: <4665B20A.605@ribosome.natur.cuni.cz>

Hi Peter, Chris and others,
here I am passing the answer from Wayne back, sorry for the difficult
cross-communication. Chris, I hope you will update the bioperl bug I have
opened on this once it is clearer. I do not know whether Wayne will have
enough time to answer all your comments, on email lists and in bugzilla.
A few days ago he said they are organizing a meeting, so ...

Anyway, official answer:

Wayne Davis wrote:
> locus line I'm using is the old standard (some older parsers wanted it
> that way).
> I've updated to write the new standard, if your program isn't flexible
> enough to read the old style locus lines. We'll see if anyone is using
> the older parsers still.
> from the document laying out the new standard:
>
> We encourage software developers to switch to a token-based LOCUS parsing
> approach, rather than a column-specific approach. If this is done, then future
> changes to the LOCUS line that affect only the spacing of its data values will
> not require any modifications to software.
>
>
> I've made the default behavior to put "." in the empty fields. I left
> those fields there because there are other parsers that require them.
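As an illustration of the token-based approach recommended in the NCBI text quoted above, here is a minimal Python sketch. It is not how Biopython or BioPerl actually parse the line; `parse_locus_tokens` is a hypothetical helper, and treating a missing name as `None` is an assumption made for this example.

```python
def parse_locus_tokens(line):
    """Split a LOCUS line on whitespace instead of relying on fixed columns."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    if "bp" not in tokens:
        raise ValueError("no 'bp' unit found")
    bp_index = tokens.index("bp")
    if bp_index < 2:
        raise ValueError("no sequence length before 'bp'")
    rest = tokens[bp_index + 1:]
    if len(rest) < 2:
        raise ValueError("expected molecule type and date after 'bp'")
    # A token between 'LOCUS' and the length is taken as the locus name;
    # it is None when the name is missing, as in the ApE files.
    name = tokens[1] if bp_index > 2 else None
    return {
        "name": name,
        "length": int(tokens[bp_index - 1]),
        "molecule": rest[0],
        "extra": rest[1:-1],  # topology and/or division, if present
        "date": rest[-1],
    }


# Both LOCUS lines below appear in this thread.
print(parse_locus_tokens("LOCUS 6499 bp ds-DNA linear 02-AUG-2006"))
print(parse_locus_tokens("LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006"))
```

Because only the order of the tokens matters, changes in spacing do not affect the result; the trade-off is that a missing field has to be inferred from the token count rather than from an empty column.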
> In my new version you can change the default genbank record values by > adding a line to your preferences file like this: > empty_genbank_header{LOCUS } {} {DEFINITION } {.} > {ACCESSION } {.} {VERSION } {.} {SOURCE } {.} { ORGANISM } {.} > > or > empty_genbank_header{LOCUS } {} > > > My access to our web server is temporarily unavailable, but I'll post > the update as soon as I can. Martin Peter wrote: > Hi Wayne & all the Biopython mailing list, > > Martin has been trying to parse some GenBank files produced by ApE > plasmid editor, and Biopython (and BioPerl?) don't like them. > > Hopefully between us we can sort this out :) > > By the way - Is the current ApE plasmid editor webpage here, because it > times out for me?: > > http://www.biology.utah.edu/jorgensen/wayned/ape/ > > Martin MOKREJ? wrote: >> I would appreciate if you could tell me then what was exactly wrong >> with the generated files by ApE editor (author Cc:ed). > > OK then, looking at file elh/pNEX3.gb which starts: > > LOCUS 2981 bp ds-DNA linear 12-OCT-2006 > DEFINITION > ACCESSION > VERSION > SOURCE > ORGANISM > COMMENT > COMMENT ApEinfo:methylated:1 > FEATURES Location/Qualifiers > misc_feature 225..257 > /ApEinfo_label=pNEX3-compatibile > ... > > I think the location of the size (2981 bp), sequence type (ds-DNA, > linear) and date (12-OCT-2006) are not in the correct positions (i.e. > column numbers). Also the locus ID is missing, which is not ideal. > Trying to do examples in an email is tricky as the line wrapping spoils > the effect. > > Interestingly all these files seem to have their LOCUS line fields in > the same place - perhaps the ApE plasmid editor is following an out of > date version of the GenBank file format which I haven't seen before? If > so, we (Biopython) should be able to deal with this too. 
> > For the current version of the LOCUS line spec, see: > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > > In particular: >> The detailed format for the LOCUS line format is as follows: >> >> Positions Contents >> --------- -------- >> 01-05 'LOCUS' >> 06-12 spaces >> 13-28 Locus name >> 29-29 space >> 30-40 Length of sequence, right-justified >> 41-41 space >> 42-43 bp >> 44-44 space >> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or >> ms- (mixed-stranded) >> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), >> mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA, >> snoRNA. Left justified. >> 54-55 space >> 56-63 'linear' followed by two spaces, or 'circular' >> 64-64 space >> 65-67 The division code (see Section 3.3) >> 68-68 space >> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > Note that the proteins variant "GenPept" is slightly different. > > The next six lines of that example file (elh/pNEX3.gb) have no values - > as Chris Fields pointed out on the Biopython mailing list, the NCBI > likes to use a dot/period as a place holder. > > The spec does explicitly say that the KEYWORDS can be omitted, but seems > to assume the other lines are expected. Biopython should be happy if > these lines are just omitted. > > See also: > http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html > >> Hope this helps, > > You might have upset some people by emailing an attachment to the entire > Biopython mailing list, but it wasn't too big at least ;) > > Regards, > > Peter > > > -- Dr. Martin Mokrejs Dept. 
of Genetics and Microbiology Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From cjfields at uiuc.edu Tue Jun 5 15:55:29 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 14:55:29 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665B20A.605@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> Message-ID: <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> On Jun 5, 2007, at 1:57 PM, Martin MOKREJ? wrote: > Hi Peter, Chris and others, > here I am passing the answer from Wayne back, sorry for the > difficult > cross-communication. Chris, I hope you will update the bioperl bug > I have > opened on this once it is clearer. I do not know whether Wayne will > have > enough time to answer all your comments, on email lists and in > bugzilla. > Few days ago he said they do some organize a meeting, so ... Anyway, > official answer: > > Wayne Davis wrote: >> locus line I'm using is the old standard (some older parsers >> wanted it >> that way). >> I've updated to write the new standard, if your program isn't >> flexible >> enough to read the old style locus lines. We'll see if anyone is >> using >> the older parsers still. >> from the document laying out the new standard: >> >> We encourage software developers to switch to a token-based LOCUS >> parsing >> approach, rather than a column-specific approach. If this is done, >> then future >> changes to the LOCUS line that affect only the spacing of its data >> values will >> >> not require any modifications to software. >> >> >> >> >> I've made the default behavior to put "." in the empty fields. I left >> those fields there because there are other parsers that require them. 
>> In my new version you can change the default genbank record values by >> adding a line to your preferences file like this: >> empty_genbank_header{LOCUS } {} {DEFINITION } {.} >> {ACCESSION } {.} {VERSION } {.} {SOURCE } {.} >> { ORGANISM } {.} >> >> or >> empty_genbank_header{LOCUS } {} >> >> >> My access to our web server is temporarily unavailable, but I'll post >> the update as soon as I can. > > Martin The bioperl parser doesn't rely on the exact spacing and uses a tokenized approach. It does rely on the presence of the LOCUS line and a locus name in that line (which Martin's sequence record lacks). Acc. to the release notes the locus name is then followed by the sequence length, 'bp' or 'aa', and the rest. As might be guessed, the lack of a locus name is probably the major source of headaches here. Note that the presence of the locus name appears to be required according to the GenBank release notes. There is no optional designation for the LOCUS line (it is mandatory as stated in sec. 3.4.2), and the locus name appears in the line for all records (sec. 3.5.4). I could argue that errors encountered parsing a record lacking a locus name are actually features (albeit horribly documented ones). I have added a warning which catches less than six tokens on the line, but I don't see the point of going beyond that w/ o descending into tokenizing oblivion (is it an accession, if not is it the length, if not ....) when the initial source of the problem is a badly formatted line in a sequence record. 
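
For comparison, the token-based approach described above could look something like the following in Python. This is only a sketch under stated assumptions, not the BioPerl code: the function name is invented, and I am assuming the "fewer than six tokens" check counts the tokens after the LOCUS keyword.

```python
import warnings

UNITS = ("bp", "aa")


def parse_locus_by_tokens(line):
    """Split a LOCUS line on white space and pick fields by position.

    Assumed order after the keyword: locus name, length, 'bp' or 'aa',
    then a variable middle part, with the date as the last token.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    values = tokens[1:]
    if len(values) < 6:
        # Assumption: the threshold counts tokens after the keyword.
        warnings.warn(
            "LOCUS line has only %d tokens after the keyword; "
            "the locus name may be missing" % len(values)
        )
    if len(values) < 4 or not values[1].isdigit() or values[2] not in UNITS:
        raise ValueError("cannot find name, length and unit in: %r" % line)
    return {
        "name": values[0],
        "length": int(values[1]),
        "unit": values[2],
        "other": values[3:-1],  # strand/molecule, topology, division...
        "date": values[-1],
    }
```

With the name added, a line such as "LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006" parses without complaint, while the original ApE line triggers the warning and then fails, because the length ends up where the name should be.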
chris

From biopython at maubp.freeserve.co.uk Tue Jun 5 15:58:46 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 05 Jun 2007 20:58:46 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <4665B20A.605@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz>
 <4665773F.2070108@maubp.freeserve.co.uk>
 <46658015.6030506@ribosome.natur.cuni.cz>
 <4665ABA0.3060500@maubp.freeserve.co.uk>
 <4665B20A.605@ribosome.natur.cuni.cz>
Message-ID: <4665C076.20408@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> Hi Peter, Chris and others, here I am passing the answer from Wayne
> back, sorry for the difficult cross-communication.

Thank you both, Martin & Wayne.

Wayne Davis wrote:
> [the] locus line I'm using is the old standard (some older parsers
> wanted it that way).

That's worth knowing - thank you. Given that, maybe we (Biopython)
should try and parse these files (which aside from the missing
identifier in the LOCUS line should be fairly simple). On the other
hand, I doubt many people still use this particular old format.

Wayne Davis wrote:
>> I've updated to write the new standard, if your
>> program isn't flexible enough to read the old style locus lines.

That's good news. Martin - will this solve your problem, or do you
think we should also update Biopython to cope with these "old style"
LOCUS lines (which also lack identifiers)?

Wayne Davis wrote:
>> We encourage software developers to switch to a token-based LOCUS
>> parsing approach, rather than a column-specific approach. If this
>> is done, then future changes to the LOCUS line that affect only the
>> spacing of its data values will not require any modifications to
>> software.

Easier said than done, as some fields can also contain white space.
However, Howard Salis has some interesting code to tackle this attached
to Biopython bug 2294.
Peter wrote: >> The next six lines of that example file (elh/pNEX3.gb) have no >> values - as Chris Fields pointed out on the Biopython mailing list, >> the NCBI likes to use a dot/period as a place holder. >> >> The spec does explicitly say that the KEYWORDS can be omitted, but >> seems to assume the other lines are expected. Biopython should be >> happy if these lines are just omitted. Just to correct myself, many of those fields are described as mandatory single entries further up in the documentation - so using a dot/period (as Wayne has done for the ApE plasmid editor) does seem the best solution. Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > 3.4.2 Entry Organization > ... > The following is a brief description of each entry field. Detailed > information about each field may be found in Sections 3.4.4 to 3.4.15. > > LOCUS ... Mandatory keyword/exactly one record. > DEFINITION ... Mandatory keyword/one or more records. > ACCESSION ... Mandatory keyword/one or more records. > VERSION... Mandatory keyword/exactly one record. > ... KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated entries (so not mandatory in general). COMMENT is optional. Peter From cjfields at uiuc.edu Tue Jun 5 16:28:08 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 15:28:08 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> Message-ID: <87AC6A64-D329-4BCF-B868-7035AD3A2D6F@uiuc.edu> On Jun 5, 2007, at 2:58 PM, Peter wrote: ... > Easier said than done, as some fields can also contain white space. > However, Howard Salis has some interesting code to tackle this > attached > to Biopython bug 2294. 
The bioperl parser simply splits the data upon white space. The first
three tokens (not counting the LOCUS keyword) are always the locus name,
the seq length, and 'bp' or 'aa' (which we use to determine the
alphabet); that order seems to go back to GenBank release 100 (1997):

ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb100.release.notes

The next few fluctuate dep. on the release or sequence type, but the
division and date are always last. I don't think we require a division
code to be present, but I'm not sure.

> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
>
> Just to correct myself, many of those fields are described as
> mandatory single entries further up in the documentation - so using
> a dot/period (as Wayne has done for the ApE plasmid editor) does seem
> the best solution.
>
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2 Entry Organization
>> ...
>> The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to
>> 3.4.15.
>>
>> LOCUS ... Mandatory keyword/exactly one record.
>> DEFINITION ... Mandatory keyword/one or more records.
>> ACCESSION ... Mandatory keyword/one or more records.
>> VERSION... Mandatory keyword/exactly one record.
>> ...
>
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all
> annotated entries (so not mandatory in general). COMMENT is optional.
>
> Peter

Probably something we should look into and correct as well. We don't
require those fields for parsing, but they should be present in output
sequence records, strictly speaking.
chris From biopython at maubp.freeserve.co.uk Tue Jun 5 17:11:36 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 05 Jun 2007 22:11:36 +0100 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> Message-ID: <4665D188.9070202@maubp.freeserve.co.uk> Chris Fields wrote: > Note that the presence of the locus name appears to be required > according to the GenBank release notes. There is no optional > designation for the LOCUS line (it is mandatory as stated in sec. > 3.4.2), and the locus name appears in the line for all records (sec. > 3.5.4). I agree that valid GenBank files should indeed have a locus name in the LOCUS line. If it doesn't cause too many issues, then maybe we should allow such files as input. Having just gone over the Biopython code, if the locus name is missing but there is nothing else wrong with the LOCUS line, Biopython will give a slightly cryptic AssertionError, "Cannot parse the name and length in the LOCUS line" I could make the parser cope with missing locus names, but on reflection, that may just cause worse problems further downstream (e.g. trying to index the file). One option is to auto-generate an identifier... Lets wait and see what Wayne's new version of ApE plasmid editor outputs for "GenBank format" - maybe he will include some sort of locus name. 
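
To make the "auto-generate an identifier" option a little more concrete, here is a small Python sketch. It is purely hypothetical: nothing like `make_locus_name` exists in Biopython, and deriving the name from the file name is just one assumed policy. The 16-character limit is taken from columns 13-28 of the LOCUS layout in gbrel.txt.

```python
import os


def make_locus_name(path, seen, max_len=16):
    """Derive a locus name from a file name and keep it unique.

    `seen` is a set of names already handed out; it is updated in place.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    # Replace anything that could confuse a white-space tokenizer.
    base = "".join(
        ch if ch.isalnum() or ch in "_-." else "_" for ch in stem
    ) or "unnamed"
    candidate = base[:max_len]
    counter = 1
    while candidate in seen:
        # Append _2, _3, ... and trim the base so the result still fits.
        counter += 1
        suffix = "_%d" % counter
        candidate = base[:max_len - len(suffix)] + suffix
    seen.add(candidate)
    return candidate
```

Two different files both called pNEX3.gb would then come out as "pNEX3" and "pNEX3_2", which keeps an index happy at the cost of no longer matching the published plasmid name exactly.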
Peter From cjfields at uiuc.edu Tue Jun 5 17:46:07 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 16:46:07 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665D188.9070202@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> <4665D188.9070202@maubp.freeserve.co.uk> Message-ID: <803EBA98-4521-44DD-A2C4-173E552AC2E0@uiuc.edu> On Jun 5, 2007, at 4:11 PM, Peter wrote: > Chris Fields wrote: >> Note that the presence of the locus name appears to be required >> according to the GenBank release notes. There is no optional >> designation for the LOCUS line (it is mandatory as stated in sec. >> 3.4.2), and the locus name appears in the line for all records >> (sec. 3.5.4). > > I agree that valid GenBank files should indeed have a locus name in > the LOCUS line. If it doesn't cause too many issues, then maybe we > should allow such files as input. > > Having just gone over the Biopython code, if the locus name is > missing but there is nothing else wrong with the LOCUS line, > Biopython will give a slightly cryptic AssertionError, "Cannot > parse the name and length in the LOCUS line" > > I could make the parser cope with missing locus names, but on > reflection, that may just cause worse problems further downstream > (e.g. trying to index the file). One option is to auto-generate an > identifier... > > Lets wait and see what Wayne's new version of ApE plasmid editor > outputs for "GenBank format" - maybe he will include some sort of > locus name. > > Peter In BioPerl you can optionally pass in a custom generator (specifically a code reference) to generate the LOCUS, ACCESSION, VERSION, and KEYWORD lines if needed. 
You might be able to do something similar for your parser, though I'm not yet familiar with Python enough to work out how... chris From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 10:26:44 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:26:44 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> Message-ID: <466815A4.9060505@ribosome.natur.cuni.cz> Hi, Chris Fields wrote: > One thing I missed which explains the biopython error: the LOCUS line is > missing the locus identifier (see the NCBI example record link). This > doesn't choke the bioperl parser but it appears to stop the biopython > parser in it's tracks (maybe a feature instead of a bug!). > > You should try adding a unique identifier (maybe the name of the file or > record) to the LOCUS line to see if it works: > > LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 > > The bioperl parser in CVS writes out the correct alphabet when this is > added: > > LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 > > I'll try adding a warning to the bioperl parser for this. I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305 but let me emphasize the LOCUS line now contains LOCUS pRL 5428 bp ds-DNA linear 07-JUN-2007 which still does not comply with the line you have proposed. But it can be parsed by bioperl-live from cvs. Is it still wrong? Testcase as pRL.gb-new in the bugzilla record #2305. Martin > > chris > > On Jun 5, 2007, at 10:28 AM, Chris Fields wrote: > >> Martin, >> >> The example file you give in the bioperl bugzilla report has several >> blank annotation lines which may lead to additional problems.
When >> the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM, >> DEFINITION, etc) then it expects there will also be relevant data >> (text descriptions) accompanying it; I assume the BioPython parser >> expects likewise though I may be wrong. >> >> AFAIK the inclusion of field names w/o text isn't GenBank/EMBL- >> compliant. GenBank records lacking text either have a '.' instead or >> are left out entirely: >> >> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html >> >> We could add a fix but you should probably contact the ApE developers >> and request that field names w/o text be left out or have '.' added. >> >> chris >> >> On Jun 5, 2007, at 9:04 AM, Martin MOKREJ? wrote: >> >>> Ezequiel Panepucci wrote: >>>>> genbank entry = parser.parse(fhandle) >>>> >>>> there is a space character between "genbank" and "entry". >>>> It is a syntax error. >>>> I suppose you meant "genbank_entry" ? >>> >>> Yes, the next command was right and has shown the error. Sorry, I >>> forgot >>> to delete the first attempt. ;-) >>> >>>>>> genbank_entry = parser.parse(fhandle) >>> Traceback (most recent call last): >>> File "", line 1, in ? >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", >>> line 187, in parse >>> self._scanner.feed(handle, self._consumer) >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >>> line 360, in feed >>> self._feed_first_line(consumer, self.line) >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >>> line 835, in _feed_first_line >>> assert False, \ >>> AssertionError: Did not recognise the LOCUS line layout: >>> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >>> >>>>>> >>> >>> Martin >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. 
Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > > -- Dr. Martin Mokrejs Dept. of Genetics and Microbiology Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 10:44:17 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:44:17 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> Message-ID: <466819C1.9010203@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Martin MOKREJ? wrote: >> Hi Peter, Chris and others, here I am passing the answer from Wayne >> back, sorry for the difficult cross-communication. > > Thank you both, Martin & Wayne. > > Wayne Davis wrote: >> [the] locus line I'm using is the old standard (some older parsers > > wanted it that way). > > That's worth knowing - thank you. Give that, maybe we (Biopython) > should try and parse these files (which aside from the missing > identifier in the LOCUS line should be fairly simple). On the other > hand, I doubt many people still use this particular the old format. > > Wayne Davis wrote: >>> I've updated to write the new standard, if your >>> program isn't flexible enough to read the old style locus lines. > > That's good news. 
Martin - will this solve your problem, or do you
> think we should also update Biopython to cope with these "old style"
> LOCUS lines (which also lack identifiers)?

I think that if it was ever a valid format it should cope with it.

>
> Wayne Davis wrote:
>>> We encourage software developers to switch to a token-based LOCUS
>>> parsing approach, rather than a column-specific approach. If this
>>> is done, then future changes to the LOCUS line that affect only the
>>> spacing of its data values will not require any modifications to
>>> software.
>
> Easier said than done, as some fields can also contain white space.
> However, Howard Salis has some interesting code to tackle this attached
> to Biopython bug 2294.

Please follow the bug #2305 in bioperl on this as well and see what
competitors have done in this regard. ;)

>
> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
>
> Just to correct myself, many of those fields are described as mandatory
> single entries further up in the documentation - so using a dot/period
> (as Wayne has done for the ApE plasmid editor) does seem the best solution.

OK, bioperl can now survive the missing dots; I think biopython should
do the same. If one can fix the problem by having the parser supply a
default value internally, why not do it?

>
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2 Entry Organization
>> ...
>> The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS ... Mandatory keyword/exactly one record. DEFINITION ...
>> Mandatory keyword/one or more records.
ACCESSION ... Mandatory >> keyword/one or more records. VERSION... Mandatory keyword/exactly one >> record. ... > > KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated > entries (so not mandatory in general). COMMENT is optional. Martin From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 10:51:14 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:51:14 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665D188.9070202@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> <4665D188.9070202@maubp.freeserve.co.uk> Message-ID: <46681B62.2070403@ribosome.natur.cuni.cz> Peter wrote: > Chris Fields wrote: >> Note that the presence of the locus name appears to be required >> according to the GenBank release notes. There is no optional >> designation for the LOCUS line (it is mandatory as stated in sec. >> 3.4.2), and the locus name appears in the line for all records (sec. >> 3.5.4). > > I agree that valid GenBank files should indeed have a locus name in the > LOCUS line. If it doesn't cause too many issues, then maybe we should > allow such files as input. > > Having just gone over the Biopython code, if the locus name is missing > but there is nothing else wrong with the LOCUS line, Biopython will give > a slightly cryptic AssertionError, "Cannot parse the name and length in > the LOCUS line" > > I could make the parser cope with missing locus names, but on > reflection, that may just cause worse problems further downstream (e.g. > trying to index the file). One option is to auto-generate an identifier... I would vote for that. A number of things will break when the LOCUS is same for multiple records. 
But imagine: I have multiple files with the same LOCUS identifier (a
plasmid name), and it simply does happen that multiple plasmids of
different sequence have the same abbreviated name. I need to stick to
their original names as published by the authors in the literature, so
I really do have several files with the same LOCUS identifier in the
LOCUS line. So the internal indexing must kick in.

>
> Lets wait and see what Wayne's new version of ApE plasmid editor outputs
> for "GenBank format" - maybe he will include some sort of locus name.

It is being fixed now, still some polishing needed. But it will produce
GenBank formatted files according to the current standard.

Martin

BTW I have proposed that the ApE editor derive the LOCUS identifier
from the filename by stripping the file extension.

From cjfields at uiuc.edu Thu Jun 7 11:31:45 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 7 Jun 2007 10:31:45 -0500
Subject: [BioPython] Cannot parse GenBank file
In-Reply-To: <466815A4.9060505@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz>
 <46656D64.7010508@ribosome.natur.cuni.cz>
 <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu>
 <466815A4.9060505@ribosome.natur.cuni.cz>
Message-ID: <2A403865-F1E8-4D19-8D19-455C22E7C6D9@uiuc.edu>

On Jun 7, 2007, at 9:26 AM, Martin MOKREJŠ wrote:

> Hi,
>
> Chris Fields wrote:
>> One thing I missed which explains the biopython error: the LOCUS
>> line is missing the locus identifier (see the NCBI example record
>> link). This doesn't choke the bioperl parser but it appears to
>> stop the biopython parser in it's tracks (maybe a feature instead
>> of a bug!).
>> You should try adding a unique identifier (maybe the name of the >> file or record) to the LOCUS line to see if it works: >> LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 >> The bioperl parser in CVS writes out the correct alphabet when >> this is added: >> LOCUS testfile 6499 bp ds-DNA linear 02- >> AUG-2006 >> I'll try adding a warning to the bioperl parser for this. > > I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305 > but let me > emphasize the LOCUS line now contains > LOCUS pRL 5428 bp ds-DNA linear > 07-JUN-2007 > > > which still does not comply with the line you have proposed. But it > can be > parsed by bioperl-live from cvs. Is it still wrong? Testcase as > pRL.gb-new > in the bugzilla record #2305. > > Martin That should work. There isn't a strict uniqueness test (that would require caching and isn't worth the trouble IMHO), though it's required you add something unique for the accession/locus if you plan on indexing them in the future. Parsing GenBank data produced from third-party software is problematic at best; there seems to be no steadfast rule with GenBank output for some programs, even though the specification is plainly stated in the NCBI release notes. My take on that is to have a stricter (read:follows release notes) GenBank parser which passes off the data in the record to default handler methods. A user could then subjugate the defined handlers with their own by subclassing the default handler class and overloading the methods or adding their own code references directly. chris ... 
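
For what it is worth, the "strict parser plus overridable handlers" idea above might be sketched in Python roughly as follows. This is an assumed design for illustration only: the class and function names are invented, it is neither BioPerl's nor Biopython's API, and it only handles one-line header fields with the keyword in the first 12 columns.

```python
class DefaultHandlers:
    """Strict handlers: reject anything that does not look like the spec."""

    def handle_locus(self, text):
        tokens = text.split()
        if len(tokens) < 3 or not tokens[1].isdigit():
            raise ValueError("malformed LOCUS data: %r" % text)
        return {"name": tokens[0], "length": int(tokens[1])}

    def handle_default(self, keyword, text):
        # Any keyword without a dedicated handler just keeps its text.
        return text.strip()


def parse_header(lines, handlers):
    """Dispatch each header line to handle_<keyword>, if it exists."""
    record = {}
    for line in lines:
        keyword, text = line[:12].strip(), line[12:]
        method = getattr(handlers, "handle_" + keyword.lower(), None)
        if method is not None:
            record[keyword] = method(text)
        else:
            record[keyword] = handlers.handle_default(keyword, text)
    return record


class LenientHandlers(DefaultHandlers):
    """User subclass that tolerates a LOCUS line without a locus name."""

    def __init__(self, fallback_name):
        self.fallback_name = fallback_name

    def handle_locus(self, text):
        tokens = text.split()
        # Limitation: a genuine all-digit locus name would be misread here.
        if tokens and tokens[0].isdigit():
            text = self.fallback_name + " " + text
        return DefaultHandlers.handle_locus(self, text)
```

The default handlers stay strict and simply raise on the nameless ApE line, while someone who has to read such files passes in `LenientHandlers("pNEX3")` (or whatever name suits them) and gets a usable record without the core parser growing special cases.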
From cjfields at uiuc.edu Thu Jun 7 12:42:13 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 7 Jun 2007 11:42:13 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <466819C1.9010203@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> Message-ID: <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> On Jun 7, 2007, at 9:44 AM, Martin MOKREJ? wrote: > Hi Peter, >> ... >> That's good news. Martin - will this solve your problem, or do you >> think we should also update Biopython to cope with these "old style" >> LOCUS lines (which also lack identifiers)? > > I think that if it was ever a valid format it should cope with it. I think it's better to explicitly state that the parser is compliant with a particular GenBank release and can likely parse other similarly formatted GenBank records from third-party software. If the parser chokes on a bad record then you can point out the deficiency in the record and (if possible) try to make it more flexible w/o borking the parser later on. The release notes are there for a good reason! The LOCUS line format, however, has been relatively stable over time. 
Here are the release notes for a GenBank release from late 1992:

ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes

and the LOCUS line is:

Positions  Contents

1-12       LOCUS
13-22      Locus name
23-29      Length of sequence, right-justified
31-32      bp
34-36      Blank, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
37-40      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
           mRNA (messenger RNA), or uRNA (small nuclear RNA)
43-52      Blank (implies linear) or circular
53-55      The division code (see Section 3.3)
63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

The spacing is more explicitly laid out in later versions. The best part is the Entrez CD order form (clipped out by scissors to be snail-mailed) at the end of the file!

chris

From jlchang at broad.mit.edu Thu Jun 7 16:24:00 2007
From: jlchang at broad.mit.edu (Jean Chang)
Date: Thu, 7 Jun 2007 16:24:00 -0400
Subject: [BioPython] installation question
Message-ID: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu>

Hi,

I'm trying to install biopython-1.43.tar.gz
my mxtexttools is in a non-standard place - I'm guessing I need my PYTHONPATH specified as I'm getting:

*** mxTextTools *** is either not installed or out of date.

I tried installing egenix-mx-base-3.0.0.tar.gz but I'm getting the same error. Did I get _too recent_ a copy of mxtexttools?
(Previously I had egenix-mx-base-2.0.6.tar.gz)

Thanks,
jean

From biopython at maubp.freeserve.co.uk Thu Jun 7 17:28:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 07 Jun 2007 22:28:31 +0100
Subject: [BioPython] installation question - mxTextTools
In-Reply-To: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu>
References: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu>
Message-ID: <4668787F.50405@maubp.freeserve.co.uk>

Jean Chang wrote:
> Hi,
>
> I'm trying to install biopython-1.43.tar.gz
> my mxtexttools is in a non-standard place - I'm guessing I need my
> PYTHONPATH specified as I'm getting:
>
> *** mxTextTools *** is either not installed or out of date.
>
> I tried installing egenix-mx-base-3.0.0.tar.gz but I'm getting the
> same error. Did I get _too recent_ a copy of mxtexttools?
>
> (Previously I had egenix-mx-base-2.0.6.tar.gz)

I assume you are trying to install this on Linux?

Did things work using egenix-mx-base-2.0.6.tar.gz?

I use Ubuntu Dapper Drake, and used apt-get to install the package version 2.0.6 of the python-egenix-mxtexttools package.

Peter

From mmokrejs at ribosome.natur.cuni.cz Fri Jun 8 06:31:36 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Fri, 08 Jun 2007 12:31:36 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu>
Message-ID: <46693008.7040202@ribosome.natur.cuni.cz>

Chris Fields wrote:
>
> On Jun 7, 2007, at 9:44 AM, Martin MOKREJŠ wrote:
>
>> Hi Peter,
>>> ...
>>> That's good news.
>>> Martin - will this solve your problem, or do you
>>> think we should also update Biopython to cope with these "old style"
>>> LOCUS lines (which also lack identifiers)?
>>
>> I think that if it was ever a valid format it should cope with it.
>
> I think it's better to explicitly state that the parser is compliant
> with a particular GenBank release and can likely parse other similarly
> formatted GenBank records from third-party software. If the parser
> chokes on a bad record then you can point out the deficiency in the
> record and (if possible) try to make it more flexible w/o borking the
> parser later on. The release notes are there for a good reason!
>
> The LOCUS line format, however, has been relatively stable over time.
> Here are the release notes for a GenBank release from late 1992:
>
> ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes
>
> and the LOCUS line is:
>
> Positions  Contents
>
> 1-12       LOCUS
> 13-22      Locus name
> 23-29      Length of sequence, right-justified
> 31-32      bp
> 34-36      Blank, ss- (single-stranded), ds- (double-stranded), or
>            ms- (mixed-stranded)
> 37-40      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>            mRNA (messenger RNA), or uRNA (small nuclear RNA)
> 43-52      Blank (implies linear) or circular
> 53-55      The division code (see Section 3.3)
> 63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
>
> The spacing is more explicitly laid out in later versions. The best
> part is the Entrez CD order form (clipped out by scissors to be
> snail-mailed) at the end of the file!

In principle I do agree with you, but let me emphasize that I fully agree with Wayne, who wrote to me yesterday that the GenBank format is the way to write down your data, and that we often really do not need all the fields required for data submission into the GenBank database:

From the definition of the format (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), only DEFINITION, KEYWORDS, SOURCE and ORIGIN (if it contains data) lines end with a period.
The periods should be added to the ends of non-period containing lines for those fields only. That is where ApE doesn't conform to the file definition.

I put in the fields DEFINITION, ACCESSION, VERSION, SOURCE, and ORGANISM because those are listed as mandatory by the release notes. Looks like I missed that REFERENCES is also mandatory. The release notes do not say that the fields must contain data or that they must end with a period (except where noted above). I put them in, figuring that a parser that was working from the file definition might require those fields to be present. It seems like a well written parser could handle null data in the field better than handling the absence of an explicitly required field, since there is nothing in the standard that states what data, if any, must be present, but there is an explicit requirement for the field.

Ok, I'll add an option to take out blank fields (even though that will break compliance with the definition, as I understand it). One could interpret the file standard as only applying to files intended for use in the NCBI database, so the required fields are only an issue for entering into their database, not for file parsers. Working on ApE isn't what I really do, so I might not get around to it immediately.

Still, while I acknowledge that ApE has been writing files that do not comply completely with the standard (needing the required periods on the end of some of the mandatory fields), your parser should be able to handle null data lines and spaces in the locus name.

For parsing the locus info, here is the Tcl regexp that I use ($a is the full LOCUS line, x returns the full matched line):

    regexp {LOCUS (.*) ([0-9]*) bp ( |ss-|ds-|ms-)(NA |DNA |RNA |tRNA |rRNA |mRNA |uRNA |snRNA |snoRNA)[ ]*(linear |circular| )[ ]*([ A-Z]{3})[ ]*(..-...-....)} $a x name size stranded type circular div date

You have to do a trim on the name that you get out of this, since it is space padded, as per the file definition.
Let me know if you see an exception that is a valid LOCUS line but would break this.

Martin

From cjfields at uiuc.edu Fri Jun 8 08:57:55 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 8 Jun 2007 07:57:55 -0500
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <46693008.7040202@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> <46693008.7040202@ribosome.natur.cuni.cz>
Message-ID: <2949B571-B73E-4059-B6A6-45E9893DBD26@uiuc.edu>

On Jun 8, 2007, at 5:31 AM, Martin MOKREJŠ wrote:

...
>
> In principle I do agree with you but let me emphasize that I fully
> agree with Wayne who wrote me yesterday in the way that the GenBank
> format is the way to write down your data, and we often really do not
> need all the fields required for data submission into the Genbank
> database:
...

It does make sense to leave some of those fields out except in cases where they are needed (with the exception of the '.' fields like KEYWORDS), but it never made sense to me to have completely blank fields or leave out the locus name. My guess is that most format parsers don't look for empty fields (or complain when one is encountered) b/c empty fields haven't been encountered before; they were always left out completely.

What would work best for all would be optional validation warnings, or a separate validation module if one is worried about checking compliance with the GenBank format, something that hasn't happened yet (and that I don't have time to code!).

Wayne, I would say use Martin's advice for the locus name (file name w/o extension), and if the field allows '.'
then add it in, otherwise it's probably easier to leave the blank fields out completely, GenBank compliance or not. There are several questionably compliant files in the genbank test suite in BioPerl so this wouldn't be the first one, and if someone wants a validation system they can try building one until we have time to do it. chris From jlchang at broad.mit.edu Fri Jun 8 11:33:03 2007 From: jlchang at broad.mit.edu (Jean Chang) Date: Fri, 8 Jun 2007 11:33:03 -0400 Subject: [BioPython] installation question - mxTextTools In-Reply-To: <4668787F.50405@maubp.freeserve.co.uk> References: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu> <4668787F.50405@maubp.freeserve.co.uk> Message-ID: <3487DE4A-0A77-4CE6-906B-00A76B140D1F@broad.mit.edu> thanks to Peter and Ann. It turns out I had overspecified the installation --prefix so mxTextTools was buried deeper than the PYTHONPATH that I had set. Once I realized this and re-installed correctly, the install worked just fine. Regards, Jean From aloraine at gmail.com Sat Jun 9 10:16:41 2007 From: aloraine at gmail.com (Ann Loraine) Date: Sat, 9 Jun 2007 09:16:41 -0500 Subject: [BioPython] question regarding testing suites Message-ID: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com> Dear all, I have a question which I hope you might be able to advise me on: In your experience, which testing frameworks do you think work best for managing and testing python programs and modules? I've looked at the testing code in biopython, which seems to use unittest library -- would you recommend I use this, or do you think there are some other frameworks I should investigate, as well? 
Sincerely,

Ann Loraine

--
Ann Loraine
Assistant Professor
University of Alabama at Birmingham
http://www.transvar.org
205-996-4155

From dalke at dalkescientific.com Sat Jun 9 10:42:54 2007
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 9 Jun 2007 16:42:54 +0200
Subject: [BioPython] question regarding testing suites
In-Reply-To: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com>
References: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com>
Message-ID: <8E844BE0-2D63-48D9-9D0C-32DA25ED2A7A@dalkescientific.com>

Hi Ann! (And other BioPython folk)

On Jun 9, 2007, at 4:16 PM, Ann Loraine wrote:
> would you recommend I use [unittest], or do you think there are
> some other frameworks I should investigate, as well?

Use "nose" from http://somethingaboutorange.com/mrl/projects/nose/ which lets you develop tests using unittest *and* other ways.

There are two aspects you should be aware of: unit tests and unit test discovery. You gotta write the tests and you gotta run the tests.

unittest.py does both. Derive from TestCase, write methods starting with "test_" and unittest.main() can auto-discover everything.

The downside is that unittest's discovery doesn't support:
 - simple functions (when it's silly to make a class for two lines of code)
 - doctests
 - running all/a subset of your unittests across multiple files.

The nose system also has support for things like checking execution time and doing coverage tests.

Andrew
dalke at dalkescientific.com

From irene.farabella at gmail.com Sun Jun 10 15:44:00 2007
From: irene.farabella at gmail.com (irene farabella)
Date: Sun, 10 Jun 2007 21:44:00 +0200
Subject: [BioPython] info_dictionary
In-Reply-To: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com>
References: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com>
Message-ID: <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com>

Hi,
I am a beginner in Python.
From a DSSP file I made a dictionary like this:
dict2[AA_num_dssp,chain] = (structure,AA_name_dssp)
I need it in my small program.
I have never worked with a dictionary that has keys like that.
I have to sort the keys of the dictionary... how can I do it?
Usually when I have only the AA_num I can turn it into a list and then sort it...
but in this case... mmm
Thanks for helping me!!
And sorry for the English..

From dalloliogm at gmail.com Mon Jun 11 05:31:03 2007
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 11 Jun 2007 11:31:03 +0200
Subject: [BioPython] info_dictionary
In-Reply-To: <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com>
References: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com> <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com>
Message-ID: <5aa3b3570706110231y21a6f4del256432edd02484ca@mail.gmail.com>

Hi Irene!
If I understand your problem correctly, you have a dictionary in which the keys are created by concatenating many variables:
dict_key = AA + '_' + num + '_' + dssp + '....'

In my humble opinion, it's bad to create this kind of dictionary... it's better to use a more branched structure, e.g.:
dict_dssp = {dssp : {AA : {num : (structure, ...), ...}, ...}, ...}

which is easier to handle, so for example you can extract the keys with dict_dssp.keys() (to get a list of all the dssps) or dict_dssp[dssp1].items() (for all the items in a given dssp), and so on.

Anyway, if you can't change to this structure you can find some useful information here:
- http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52306

2007/6/10, irene farabella :
>
> hi
> i am a beginner in python.
> from DSSP file i made a dictionary like that:
> dict2[AA_num_dssp,chain] = (structure,AA_name_dssp)
> i need of it in my small programm.
> i never work with a dictionary that have the keys like that.
> i have to sort the key of the dictionay... how can i do?
> usualy when i have only the AA_num i can mutate it in a list an then sort
> it...
> but in this case...mmm
> thanks for help me!!
> and sorry for the english..
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com

From dalloliogm at gmail.com Mon Jun 11 10:05:32 2007
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 11 Jun 2007 16:05:32 +0200
Subject: [BioPython] GFF parser, other than Bio.GFF
Message-ID: <5aa3b3570706110705q2b854711pf44b398892aa64a9@mail.gmail.com>

Hi!
I'm currently working with some gene annotations in GFF format[1], and I've noticed there doesn't seem to be a GFF parser in the current biopython distribution. Am I wrong?

I've found the Bio.GFF module, but it doesn't actually do what I want (read [2]), as it gets its information from a MySQL database (?), while I'm looking for a module to parse a gff file and turn it into a dictionary, maybe like SeqIO.

Well, I wrote a few scripts to parse my GFF files.. I thought I could contribute some code if somebody can help me refine and adapt it (there is still a lot of work to do).

So why is there still no gff parser in biopython? Is this format too outdated, or maybe nobody is using it? Or maybe I'm missing it? Or is there some other problem?

Thanks!
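For illustration, a minimal sketch of the kind of parser described above, not the scripts mentioned in this message. It assumes the nine tab-separated GFF columns (seqname, source, feature, start, end, score, strand, frame, attributes) and leaves the attribute column as a raw string; `GffRecord`, `parse_gff` and `group_by_feature` are hypothetical names, not Biopython API.

```python
from collections import namedtuple

# One parsed GFF line; the field names follow the GFF column order.
GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff(lines):
    """Yield a GffRecord for each data line, skipping blanks and comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError("expected at least 8 tab-separated columns: %r" % line)
        # The attribute column is optional and is kept unparsed here.
        attributes = cols[8] if len(cols) > 8 else ""
        yield GffRecord(
            cols[0], cols[1], cols[2],
            int(cols[3]), int(cols[4]),
            cols[5], cols[6], cols[7],
            attributes,
        )


def group_by_feature(records):
    """Build a dictionary mapping feature type (e.g. 'exon') to its records."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.feature, []).append(record)
    return grouped
```

Because `parse_gff` accepts any iterable of lines, it can be fed an open file handle directly, e.g. `group_by_feature(parse_gff(open("genes.gff")))`, where the file name is only an example.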
Giovanni

[1] http://www.sanger.ac.uk/Software/formats/GFF/ (gff), http://mblab.wustl.edu/GTF22.html (gtf 2.2)
[2] http://portal.open-bio.org/pipermail/biopython/2004-May/002099.html

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com

From dalloliogm at gmail.com Tue Jun 12 07:07:20 2007
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 12 Jun 2007 13:07:20 +0200
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
Message-ID: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com>

Hi,
I'm a newbie to biopython and I'm trying to use it to represent a gene structure parsed from a gff file.

In principle, I would create a SeqRecord to represent an mRNA; then, I would like to annotate its exons and introns in the .feature field.

But I don't understand why SeqRecord.feature is a list; I think it could be easier to use as a dictionary.

For example, this is what I've created until now:

    mRNA1 = SeqRecord()
    mRNA1.seq = 'cacacacacgtatgcta..'
    mRNA1.id = '...'
    mRNA1.feature = [exon1_SeqFeatureObj, exon2_SeqFeatureObj,....]

Here the exons are annotated in a list; the problem is that this makes them difficult to retrieve: if, say, I want the information from the exon3 object, I have to loop over all the mRNA1.feature objects to look for it.

Wouldn't it be better to use a dictionary for SeqRecord.feature?

    mRNA1 = SeqRecord()
    mRNA1.seq = 'cacacacacgtatgcta..'
    mRNA1.id = '...'
    mRNA1.feature = {'exon1' : exon1_SeqFeatureObj, 'exon2' : exon2_SeqFeatureObj,....}

Alternatively, I've got the idea of using SeqRecord.annotations to keep track of the indexes in SeqRecord.feature:

    mRNA1 = SeqRecord()
    mRNA1.seq = 'cacacacacgtatgcta..'
    mRNA1.id = '...'
    mRNA1.feature = [exon1_SeqFeatureObj, exon2_SeqFeatureObj,....]
    mRNA1.annotations = {'exon1' : mRNA1.feature[0], 'exon2' : mRNA1.feature[1], ....}
--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com

From ezequiel.panepucci at psi.ch Tue Jun 12 07:44:59 2007
From: ezequiel.panepucci at psi.ch (Ezequiel Panepucci)
Date: Tue, 12 Jun 2007 13:44:59 +0200
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
In-Reply-To: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com>
References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com>
Message-ID:

> But I don't understand why SeqRecord.feature is a list, I think it
> could be easier to use as a dictionary.

The problem with dictionaries is that they are not ordered and lists are, so internally it is easier to organize lists than dicts.

I don't know how easy it would be to define which attribute/property of a feature should be used as a dict key.

Zac

From mcolosimo at mitre.org Tue Jun 12 08:09:45 2007
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 12 Jun 2007 08:09:45 -0400
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
In-Reply-To:
References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com>
Message-ID: <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org>

Additionally, for many formats you can have multiple features with the same name; e.g., CDS, gene, etc... in GenBank Records.

The same rationale doesn't fully apply to why the feature qualifiers are dictionaries of lists.

Marc

On Jun 12, 2007, at 7:44 AM, Ezequiel Panepucci wrote:

>> But I don't understand why SeqRecord.feature is a list, I think it
>> could be easier to use as a dictionary.
>
> The problem with a dictionaries is that they are not ordered
> and lists are, so internally it is easier to organize lists than
> dicts.
>
> I don't know how easy it would be to define which attribute/property
> of a feature should be used as a dict key.
>
> Zac
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Tue Jun 12 10:32:45 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jun 2007 15:32:45 +0100
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
In-Reply-To: <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org>
References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org>
Message-ID: <466EAE8D.2090609@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> Additionally, for many formats you can have multiple features with
> the same name; e.g., CDS, gene, etc... in GenBank Records.

Indeed - and as the SeqRecord/SeqFeature is most heavily used by the GenBank parser, that does explain things well.

The problem with using a dictionary is what to index on - you can't simply use the location string for example, as there are usually entries for genes and CDS features with the same location. You can't depend on any other information like an identifier or name to be present in a GenBank file for all feature types.

In general, the choice of index will depend on what you want to use it for - so the flippant answer is just index it yourself, for example like this:

http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

> The same rational doesn't fully apply to why the feature qualifiers
> are dictionaries of lists.

No it doesn't. The rationale seems to have been that feature qualifiers in GenBank files can occur with no values (e.g. /pseudo and others), a single value (e.g. translation) or multiple values (by repeated keys, e.g. database cross references). So using a list is a simple solution to cover all these cases - even if most entries only have a single value.

(There are some old posts on the mailing list archive discussing this.)
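A minimal sketch of the "index it yourself" approach described above; it is not the code from the linked page. To keep it runnable without Biopython, `Feature` is a stand-in exposing only a `type` string and a `qualifiers` dictionary of lists, which is how this message describes the qualifiers. `index_features` is a hypothetical helper name, and the choice of qualifier (here `locus_tag`) is an assumption that depends on the file.

```python
from collections import namedtuple

# Stand-in for a sequence feature: a type plus qualifiers stored as a
# dictionary of lists.
Feature = namedtuple("Feature", ["type", "qualifiers"])


def index_features(features, feature_type, qualifier):
    """Map each value of `qualifier` to the position of its feature in the list.

    Only features of `feature_type` are indexed, so a gene and its CDS that
    share a value do not collide. Features lacking the qualifier are skipped.
    """
    index = {}
    for position, feature in enumerate(features):
        if feature.type != feature_type:
            continue
        for value in feature.qualifiers.get(qualifier, []):
            if value in index:
                raise ValueError("duplicate %s value: %r" % (qualifier, value))
            index[value] = position
    return index


features = [
    Feature("gene", {"locus_tag": ["x001"]}),
    Feature("CDS", {"locus_tag": ["x001"], "db_xref": ["GI:1", "GeneID:2"]}),
    Feature("CDS", {"pseudo": [""]}),  # no locus_tag, so it is not indexed
]
cds_by_tag = index_features(features, "CDS", "locus_tag")
first_cds = features[cds_by_tag["x001"]]
```

The list stays the ordered source of truth, and the dictionary is a disposable lookup built for one purpose, which sidesteps the question of a single universal key.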
Peter

From richa at musc.edu Tue Jun 12 12:31:35 2007
From: richa at musc.edu (richa)
Date: Tue, 12 Jun 2007 12:31:35 -0400
Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse'
Message-ID: <466ECA67.2080506@musc.edu>

Hi all,

Just installed biopython on ubuntu feisty. Dependencies and package seemed to install without a problem. Many of the test files that come with it work fine, but the SeqIO object runs into a problem. For example, the following code causes an error saying that the 'module' object has no attribute 'parse'.

Is this a problem of syntax or is it an installation issue?

    from Bio import SeqIO
    handle = open("ls_orchid.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta") :
        print record.id

From idoerg at gmail.com Tue Jun 12 12:54:30 2007
From: idoerg at gmail.com (I. Friedberg)
Date: Tue, 12 Jun 2007 09:54:30 -0700
Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse'
In-Reply-To: <466ECA67.2080506@musc.edu>
References: <466ECA67.2080506@musc.edu>
Message-ID:

I just tried to install biopython in Ubuntu Feisty and the installation breaks:

    % sudo apt-get install python-biopython

    [... lots of apt install log messages...]

    Setting up python-biopython (1.42-2) ...
    Compiling /var/lib/python-support/python2.5/Bio/Wise/dnal.py ...
      File "/var/lib/python-support/python2.5/Bio/Wise/dnal.py", line 5
        from __future__ import division
    SyntaxError: from __future__ imports must occur at the beginning of the file

How come you managed to install?

On 6/12/07, richa wrote:
>
> Hi all,
>
> Just installed biopython on ubuntu feisty. Dependencies and package
> seemed to install without a problem. Many of the test files that come
> with it work fine, but the SeqIO object runs into a problem. For
> example, the following code causes an error saying that the 'module'
> object has no attribute 'parse'.
>
> Is this a problem of syntax or is it an installation issue?
> > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- I. Friedberg "The only problem with troubleshooting is that sometimes trouble shoots back." From biopython at maubp.freeserve.co.uk Tue Jun 12 13:09:17 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jun 2007 18:09:17 +0100 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ECA67.2080506@musc.edu> References: <466ECA67.2080506@musc.edu> Message-ID: <466ED33D.3070200@maubp.freeserve.co.uk> richa wrote: > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? > > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id That should work. Odd... What version of Python are you using? It might be helpful to see the full error message, it should include the path to a file .../Bio/SeqIO/__init__.py If you have a quick look in that file and double check there is a function "parse" (e.g. search for "def parse("), and it looks something like the latest version, which is online here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/__init__.py?rev=1.19&cvsroot=biopython&content-type=text/vnd.viewcvs-markup You said you were using Ubuntu Feisty - did you install Biopython from source code by downloading the tar ball, or using "apt-get"? 
If you used apt-get then you could try removing the package, and then installing from source instead (fairly simple). If that works then maybe there is a problem in the Feisty package. Peter From richa at musc.edu Tue Jun 12 13:13:17 2007 From: richa at musc.edu (richa) Date: Tue, 12 Jun 2007 13:13:17 -0400 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: References: <466ECA67.2080506@musc.edu> Message-ID: <466ED42D.3040406@musc.edu> It responded with the same error message using apt-get for me too. I used synaptic. I just uninstalled and reinstalled again and looking at the log I saw the same error message. However it did not quit installation and some of biopython's modules retain functionality. It appears then that this is an installation issue. Any suggestions for what is missing for a clean install? I. Friedberg wrote: > I just tried to install biopython in Ubuntu Feisty and the > installation breaks: > > % sudo apt-get install python-biopython > > [... lots of apt install log messages...] > > Setting up python-biopython (1.42-2) ... > Compiling /var/lib/python-support/python2.5/Bio/Wise/dnal.py ... > File "/var/lib/python-support/python2.5/Bio/Wise/dnal.py", line 5 > from __future__ import division > SyntaxError: from __future__ imports must occur at the beginning of > the file > > > How come you managed to install? > > > > > > > On 6/12/07, *richa* > wrote: > > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that > come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? 
> > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > > I. Friedberg > > "The only problem with troubleshooting is that > sometimes trouble shoots back." From winter at biotec.tu-dresden.de Tue Jun 12 13:27:40 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 12 Jun 2007 19:27:40 +0200 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ECA67.2080506@musc.edu> References: <466ECA67.2080506@musc.edu> Message-ID: <466ED78C.4040807@biotec.tu-dresden.de> richa wrote: > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? > > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id Dear richa: It is an installation issue. I got exactly the same error. I am running Debian Linux with the python-biopython package installed: $ dpkg -l | grep biopython ii python-biopython 1.42-2 After upgrading to the newest version with $ sudo apt-get install python-biopython $ dpkg -l | grep biopython ii python-biopython 1.43-1 the error is gone! By the way: this was the content of my old Bio/SeqIO/__init__.py: """Sequence input/output designed to look similar to the bioperl design. At present, these are all hand written. I would like to have autogenerated parsers in the futures, esp. 
with the ability to parse only subsets of the data, and to support event generated parsers. Note that once a parser is given an input string, it is free to read as much of the data as it wants to read, unless otherwise mentioned. """ Nothing more. Now it matches the newest version as posted by Peter. Cheers, Christof From richa at musc.edu Tue Jun 12 15:24:23 2007 From: richa at musc.edu (richa) Date: Tue, 12 Jun 2007 15:24:23 -0400 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ED78C.4040807@biotec.tu-dresden.de> References: <466ECA67.2080506@musc.edu> <466ED78C.4040807@biotec.tu-dresden.de> Message-ID: <466EF2E7.6080404@musc.edu> Compiled from source and the no more error messages. Thank you all for your help and advise. -Adam Richards (richa) Christof Winter wrote: > richa wrote: >> Hi all, >> >> Just installed biopython on ubuntu feisty. Dependencies and package >> seemed to install without a problem. Many of the test files that >> come with it work fine, but the SeqIO object runs into a problem. >> For example, the following code causes an error saying that the >> 'module' object has no attribute 'parse'. >> >> Is this a problem of syntax or is it an installation issue? >> >> from Bio import SeqIO >> handle = open("ls_orchid.fasta", "rU") >> for record in SeqIO.parse(handle, "fasta") : >> print record.id > > Dear richa: > > It is an installation issue. I got exactly the same error. I am > running Debian Linux with the python-biopython package installed: > > $ dpkg -l | grep biopython > ii python-biopython 1.42-2 > > After upgrading to the newest version with > $ sudo apt-get install python-biopython > > $ dpkg -l | grep biopython > ii python-biopython 1.43-1 > > the error is gone! > > By the way: this was the content of my old Bio/SeqIO/__init__.py: > > """Sequence input/output designed to look similar to the bioperl design. > > At present, these are all hand written. 
I would like to have > autogenerated parsers in the futures, esp. with the ability to parse > only subsets of the data, and to support event generated parsers. > > Note that once a parser is given an input string, it is free to read > as much of the data as it wants to read, unless otherwise mentioned. > """ > > Nothing more. Now it matches the newest version as posted by Peter. > > Cheers, > Christof > From idoerg at gmail.com Tue Jun 12 16:09:22 2007 From: idoerg at gmail.com (I. Friedberg) Date: Tue, 12 Jun 2007 13:09:22 -0700 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466EF2E7.6080404@musc.edu> References: <466ECA67.2080506@musc.edu> <466ED78C.4040807@biotec.tu-dresden.de> <466EF2E7.6080404@musc.edu> Message-ID: Anybody's Debian-fu strong enough to create a clean 1.43 package? Iddo On 6/12/07, richa wrote: > > Compiled from source and the no more error messages. Thank you all for > your help and advise. > > -Adam Richards (richa) > > Christof Winter wrote: > > richa wrote: > >> Hi all, > >> > >> Just installed biopython on ubuntu feisty. Dependencies and package > >> seemed to install without a problem. Many of the test files that > >> come with it work fine, but the SeqIO object runs into a problem. > >> For example, the following code causes an error saying that the > >> 'module' object has no attribute 'parse'. > >> > >> Is this a problem of syntax or is it an installation issue? > >> > >> from Bio import SeqIO > >> handle = open("ls_orchid.fasta", "rU") > >> for record in SeqIO.parse(handle, "fasta") : > >> print record.id > > > > Dear richa: > > > > It is an installation issue. I got exactly the same error. 
I am > > running Debian Linux with the python-biopython package installed: > > > > $ dpkg -l | grep biopython > > ii python-biopython 1.42-2 > > > > After upgrading to the newest version with > > $ sudo apt-get install python-biopython > > > > $ dpkg -l | grep biopython > > ii python-biopython 1.43-1 > > > > the error is gone! > > > > By the way: this was the content of my old Bio/SeqIO/__init__.py: > > > > """Sequence input/output designed to look similar to the bioperl design. > > > > At present, these are all hand written. I would like to have > > autogenerated parsers in the futures, esp. with the ability to parse > > only subsets of the data, and to support event generated parsers. > > > > Note that once a parser is given an input string, it is free to read > > as much of the data as it wants to read, unless otherwise mentioned. > > """ > > > > Nothing more. Now it matches the newest version as posted by Peter. > > > > Cheers, > > Christof > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- I. Friedberg "The only problem with troubleshooting is that sometimes trouble shoots back." From biopython at maubp.freeserve.co.uk Wed Jun 13 06:30:20 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 11:30:20 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <4665408E.2090306@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> Message-ID: <466FC73C.2000608@maubp.freeserve.co.uk> To update anyone not following bug 2090, I have updated the CVS copy of Bio/Blast/NCBIStandalone.py to do a better job of recent plain text Blast output. 
It can now parse the two BLASTX 2.2.15 files Italo sent me.

If anyone wants to try this code, you can either update your entire
Biopython installation to CVS, or simply update the file
Bio/Blast/NCBIStandalone.py in your python site-packages directory
(after making a backup). You can get the latest version here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython

You want the latest revision, 1.66 (the web interface is normally
updated within the hour).

Italo - please could we use those two files as test cases to include
with Biopython? And do let me know if any of your other 24,000 examples
fail.

Peter

P.S. Biopython can currently only cope with single query plain text
output from Blast. We recommend using the XML output.

From rwbarrette at gmail.com Wed Jun 13 09:17:19 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Wed, 13 Jun 2007 09:17:19 -0400
Subject: [BioPython] [Biopython] Blastall problem w/ restrict_gi
Message-ID: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com>

Hello, I'm new to the list, and relatively new at Python.

I need to run local blast using tblastx, but I have to limit my searches
to subsets of my local database. To do this I have gi lists (*.gid.txt
file) obtained from NCBI, to define my subsets. To run this blast, I'm
using the following command to run blastall in my script:

result_handle, error_info = NCBIStandalone.blastall(
    "/BLAST/blastall.exe", "tblastx", "/BLAST/DATAout/VirDBX",
    "/BLAST/sequencesXX.fasta", "7",
    restrict_gi="/BLAST/DATAout/10241.gid.txt")

When I include the restrict_gi keyword and option, I get no results back
when I run this through python.
I went into NCBIStandalone and modified it to print out the command that is supposed to be passed through the os.popen3() command, which is: /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt When I copy this string directly into the windows command line, I get results, and it works fine, but it doesn't work when called through python. It does work in Python , however, if I don't include the "restrict_gi" option. Can anyone suggest a modification to the Blastall function or how I call blast from my script that may fix this problem? T From biopython at maubp.freeserve.co.uk Wed Jun 13 13:13:33 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 18:13:33 +0100 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> Message-ID: <467025BD.1010907@maubp.freeserve.co.uk> Roger Barrette wrote: > Hello, I'm new to the list, and relatively new at Python. Hi Roger, and welcome to the list! > I need to run local blast using tblastx, but I have to limit my > searches to subsets of my local database. To do this I have gi lists > (*.gid.txt file) obtained from NCBI, to define my subsets. To run > this blast, I'm using the following command to run blastall in my > script: > > result_handle, error_info = > NCBIStandalone.blastall("/BLAST/blastall.exe", "tblastx", > "/BLAST/DATAout/VirDBX", "/BLAST/sequencesXX.fasta", "7", > restrict_gi="/BLAST/DATAout/10241.gid.txt") > > When I include the restrict_gi keyword and option, I get no results > back when I run this through python. Could you be a little more specific about what goes wrong? Also are you using Windows, what version of Biopython and what version of Python? Have you looked at the contents of both result_handle AND error_info? 
You say you get no results back (is result_handle blank?), so
checking error_info would be a good idea. Try something like this...

save_file = open("my_blast.xml", "w")
save_file.write(result_handle.read())
save_file.close()

save_file = open("my_blast.err", "w")
save_file.write(error_info.read())
save_file.close()

> I went into NCBIStandalone and modified it to print out the command
> that is supposed to be passed through the os.popen3() command, which
> is:
>
> /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i
> /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt
>
> When I copy this string directly into the windows command line, I get
> results, and it works fine, but it doesn't work when called through
> python. It does work in Python, however, if I don't include the
> "restrict_gi" option. Can anyone suggest a modification to the
> Blastall function or how I call blast from my script that may fix
> this problem?

Have you tried running this command at the command line, and redirecting
the output to a file (e.g. test.xml) and then getting Biopython to parse
that file?

This should tell us whether there is a problem parsing the XML output,
or a problem in calling standalone blast.
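
The advice to look at both the results and the error stream can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python 3's standard subprocess module rather than the os.popen3() call mentioned in the thread, a child Python process stands in for blastall, and the helper name run_and_capture and the error text are made up for the example.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (stdout_text, stderr_text, exit_code)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.stdout, proc.stderr, proc.returncode


# Stand-in for blastall (an assumption for this sketch): a child Python
# process that writes nothing to stdout and one message to stderr.
child = "import sys; sys.stderr.write('example error message\\n'); sys.exit(1)"
out, err, code = run_and_capture([sys.executable, "-c", child])

# An empty result with a non-empty error stream points at the call itself,
# not at the parser.
if not out.strip():
    print("No results on stdout; stderr says:", err.strip())
```

With a real program in place of the stand-in, an empty stdout plus a message on stderr suggests the problem is in how the program was called, so there is nothing for the XML parser to do yet.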
Peter From italo.maia at gmail.com Wed Jun 13 13:28:42 2007 From: italo.maia at gmail.com (Italo Maia) Date: Wed, 13 Jun 2007 14:28:42 -0300 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <466FC73C.2000608@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk> Message-ID: <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> Peter, i tried the patch but i received the following error : >>> from Bio.Blast import NCBIStandalone >>> parser = NCBIStandalone.BlastParser() >>> record = parser.parse(file('99.out','r')) Traceback (most recent call last): File "", line 1, in File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 746, in parse self._scanner.feed(handle, self._consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 101, in feed self._scan_parameters(uhandle, consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 712, in _scan_parameters attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring words threshold') File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 358, in attempt_read_and_call method(line) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 1333, in threshold line, (1,), ncols=2, expected={0:"T:"}) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 1888, in _get_cols (ncols, len(cols), line) SyntaxError: I expected 2 columns (got 4) in line Neighboring words threshold: 12 2007/6/13, Peter : > > To update anyone not following bug 2090, I have updated the CVS copy of > Bio/Blast/NCBIStandalone.py to do a better job of recent plain text > Blast output. 
It can now parse the two BLASTX 2.2.15 files Italo sent me.
>
> If anyone wants to try this code, you can either update your entire
> Biopython installation to CVS, or simply update the file
> Bio/Blast/NCBIStandalone.py in your python site-packages directory
> (after making a backup). You can get the latest version here:
>
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython
>
> You want the latest revision, 1.66 (the web interface is normally
> updated within the hour).
>
> Italo - please could we use those two files as test cases to include
> with Biopython? And do let me know if any of your other 24,000 examples
> fail.
>
> Peter
>
> P.S. Biopython can currently only cope with single query plain text
> output from Blast. We recommend using the XML output.
>
>

--
"A arrogância é a arma dos fracos."
===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB
Programador Java, Python
Meu blog ^^ http://eusouolobomal.blogspot.com/
===========================

From italo.maia at gmail.com Wed Jun 13 13:30:21 2007
From: italo.maia at gmail.com (Italo Maia)
Date: Wed, 13 Jun 2007 14:30:21 -0300
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <466FC73C.2000608@maubp.freeserve.co.uk>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk>
Message-ID: <800166920706131030n473c64e2o2b5184c0618ef65b@mail.gmail.com>

For the use of the files, i'll ask the girl that runs the lab here, but it
probably won't be a problem!
And i'll try with all my files as soon as it works with 99.out

2007/6/13, Peter :
>
> To update anyone not following bug 2090, I have updated the CVS copy of
> Bio/Blast/NCBIStandalone.py to do a better job of recent plain text
> Blast output. It can now parse the two BLASTX 2.2.15 files Italo sent me.
>
> If anyone wants to try this code, you can either update your entire
> Biopython installation to CVS, or simply update the file
> Bio/Blast/NCBIStandalone.py in your python site-packages directory
> (after making a backup). You can get the latest version here:
>
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython
>
> You want the latest revision, 1.66 (the web interface is normally
> updated within the hour).
>
> Italo - please could we use those two files as test cases to include
> with Biopython? And do let me know if any of your other 24,000 examples
> fail.
>
> Peter
>
> P.S. Biopython can currently only cope with single query plain text
> output from Blast. We recommend using the XML output.
>
>

--
"A arrogância é a arma dos fracos."
===========================
Italo Moreira Campelo Maia
Ciência da Computação - UECE
Desenvolvedor WEB
Programador Java, Python
Meu blog ^^ http://eusouolobomal.blogspot.com/
===========================

From rwbarrette at gmail.com Wed Jun 13 14:32:56 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Wed, 13 Jun 2007 14:32:56 -0400
Subject: [BioPython] Blastall problem w/ restrict_gi
In-Reply-To: <467025BD.1010907@maubp.freeserve.co.uk>
References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk>
Message-ID: <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com>

Hi Peter,

Thank you for the response. In regards to your questions, I am using Python
2.5 w/ biopython v1.43, on a Windows platform (XP).
The specific problem I am having occurs when I attempt to run the blastall command from Python through the NCBIStandalone module. If I run the blastall without the "restrict_gi" option, it gives me alignment results in the .xml file. However, when I include the "restrict_gi" option, I get an empty .xml result file. As per your suggestion however, the error file does list an error. The output of this .err file is: [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file .\/BLAST/DATAout/A10241.txt This is odd because when I run the command from the c:\ prompt: c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt ;it works fine, and I get the alignment results, and no error. Because I do not get any results in the xml file when I get this error, running the blast from the python script, there is nothing to parse, however, when I run the script without the "restrict_gi" option, from either the command prompt or my python script, I get results in the xml file, and they are able to be parsed. Any suggestions as to how to fix this problem would be greatly appreciated. Thanks -Roger > Roger Barrette wrote: > > Hello, I'm new to the list, and relatively new at Python. > > Hi Roger, and welcome to the list! > > > I need to run local blast using tblastx, but I have to limit my > > searches to subsets of my local database. To do this I have gi lists > > (*.gid.txt file) obtained from NCBI, to define my subsets. To run > > this blast, I'm using the following command to run blastall in my > > script: > > > > result_handle, error_info = > > NCBIStandalone.blastall("/BLAST/blastall.exe", "tblastx", > > "/BLAST/DATAout/VirDBX", "/BLAST/sequencesXX.fasta", "7", > > restrict_gi="/BLAST/DATAout/10241.gid.txt") > > > > When I include the restrict_gi keyword and option, I get no results > > back when I run this through python. > > Could you be a little more specific about what goes wrong? 
Also are you > using Windows, what version of Biopython and what version of Python? > > Have you looked at the contents of both result_handle AND error_info? > You say you get no results back (is result_handle is blank?), so > checking error_info would be a good idea. Try something like this... > > save_file = open("my_blast.xml", "w") > save_file.write(result_handle.read()) > save_file.close() > > save_file = open("my_blast.err", "w") > save_file.write(error_info.read()) > save_file.close() > > > I went into NCBIStandalone and modified it to print out the command > > that is supposed to be passed through the os.popen3() command, which > > is: > > > > /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i > > /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt > > > > When I copy this string directly into the windows command line, I get > > results, and it works fine, but it doesn't work when called through > > python. It does work in Python , however, if I don't include the > > "restrict_gi" option. Can anyone suggest a modification to the > > Blastall function or how I call blast from my script that may fix > > this problem? > > Have you tried running this command at the command line, and redirecting > the output to a file (e.g. test.xml) and then getting Biopython to parse > that file? > > i.e. This should tell us if there is a problem parsing the XML output, > or a problem in calling standalone blast. 
> > Peter > > From biopython at maubp.freeserve.co.uk Wed Jun 13 15:48:16 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 20:48:16 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk> <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> Message-ID: <46704A00.9010409@maubp.freeserve.co.uk> Italo Maia wrote: > Peter, i tried the patch but i received the following error : >>>> from Bio.Blast import NCBIStandalone >>>> parser = NCBIStandalone.BlastParser() >>>> record = parser.parse(file('99.out','r')) > ... > SyntaxError: I expected 2 columns (got 4) in line > Neighboring words threshold: 12 Oh. My fault. This was due to the "T: ..." and "A: ..." lines being replaced by "Neighboring words threshold: ..." and "Window for multiple hits: ...", and me not testing my changes enough. Could you try revision 1.67 please? http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython Thanks Peter From biopython at maubp.freeserve.co.uk Wed Jun 13 15:56:48 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 20:56:48 +0100 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk> <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> Message-ID: <46704C00.5000101@maubp.freeserve.co.uk> Roger Barrette wrote: > Hi Peter, > > Thank you for the response. 
In regards to your questions, I am using Python
> 2.5 w/ biopython v1.43, on a Windows platform (XP). The specific problem I
> am having occurs when I attempt to run the blastall command from Python
> through the NCBIStandalone module. If I run the blastall without the
> "restrict_gi" option, it gives me alignment results in the .xml
> file. However, when I include the "restrict_gi" option, I get an empty .xml
> result file. As per your suggestion however, the error file does list an
> error. The output of this .err file is:
>
> [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file
> .\/BLAST/DATAout/A10241.txt

That is interesting - it tells us that the argument is at least getting
passed to the blast program (and explains why there is no XML output).

That path is an odd mixture of Unix and Windows style paths, which I
wouldn't expect to work.

Could you try using \BLAST\DATAout\A10241.txt or
C:\BLAST\DATAout\A10241.txt instead (both from within Python and from
the command line).

Remember that backslashes are escape characters in python so use either
r"C:\BLAST\DATAout\A10241.txt" or "C:\\BLAST\\DATAout\\A10241.txt".

> This is odd because when I run the command from the c:\ prompt:
> c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i
> /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt
>
> ;it works fine, and I get the alignment results, and no error.

I'm a little surprised that does work to be honest.

I suspect this is something subtle to do with how command line programs
break up the arguments (which is complicated by quotes and slashes - at
least you have no spaces in the filenames!).

Note that some ways of getting python to make a system call pass the
arguments as one long string (as if typed by the user at the command
prompt), while others take them already broken down into the individual
terms.
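
The difference between one long command string and a list of separate arguments can be seen with a small experiment. This is a sketch under assumptions, not what NCBIStandalone does internally: it uses Python 3's subprocess module, and a child Python process that simply echoes its arguments stands in for blastall.

```python
import subprocess
import sys

# Child program (a stand-in for blastall) that prints each argument it
# received on its own line.
show_args = "import sys; print('\\n'.join(sys.argv[1:]))"

# A raw string keeps the backslashes in the Windows-style path intact.
gi_file = r"C:\BLAST\DATAout\A10241.txt"

# Passing a list means every item reaches the program as exactly one
# argument; no shell is involved in splitting or unquoting the command.
args = [sys.executable, "-c", show_args, "-p", "tblastx", "-l", gi_file]

result = subprocess.run(args, capture_output=True, text=True)
received = result.stdout.splitlines()
print(received)
```

When the command is given as a single string instead, the splitting into arguments is left to the shell or the operating system, which is where quotes, spaces and slashes can start to matter.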
> Because I do not get any results in the xml file when I get this error,
> running the blast from the python script, there is nothing to parse,
> however, when I run the script without the "restrict_gi" option, from either
> the command prompt or my python script, I get results in the xml file, and
> they are able to be parsed. Any suggestions as to how to fix this problem
> would be greatly appreciated. Thanks

Fingers crossed using Windows style absolute paths fixes this for you...

Peter

From kawaiichiko at gmail.com Sun Jun 17 14:34:05 2007
From: kawaiichiko at gmail.com (Jolanda Reek)
Date: Sun, 17 Jun 2007 20:34:05 +0200
Subject: [BioPython] What to do with BLAST XML syntax error?
Message-ID: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>

Hello,

I'm using BioPython to send some protein sequences to the NCBI WWWBlast
server and parse the output. Sometimes, like once every few minutes, instead
of giving the output, BLAST returns an XML syntax error. It states something
along the lines of 'SyntaxError: This XML doesn't start with...', and then
BioPython can't parse the output (duh). I've written a try/except statement
to resend the protein query when this problem occurs; however, sometimes the
problem occurs multiple times in a row, leaving me with no other option than
to nest try/except statements (= ugly code).

print "BLAST search: "+proteinLijst[x].id
try:
    acNr, evalueNr = blast(proteinLijst[x].sequentie)
    idList.append(acNr)
except SyntaxError:
    print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
    try:
        acNr, evalueNr = blast(proteinLijst[x].sequentie)
        idList.append(acNr)
    except SyntaxError:
        print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
        try:
            acNr, evalueNr = blast(proteinLijst[x].sequentie)
            idList.append(acNr)
        except SyntaxError:
            print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
            acNr, evalueNr = blast(proteinLijst[x].sequentie)
            idList.append(acNr)

<< Etc.
Yuck. It works, but it is still yuck. :)

Can anyone help me think up a solution? And what is causing those faulty XML
files? (Avoiding the problem altogether is better than fixing it.)

Thank you.

Chiko.

From lucks at fas.harvard.edu Sun Jun 17 14:52:03 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Sun, 17 Jun 2007 14:52:03 -0400
Subject: [BioPython] What to do with BLAST XML syntax error?
In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>
References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>
Message-ID: <34EE02FD-F9FE-46D1-963A-00392EB33DD1@fas.harvard.edu>

Why not wrap the try/except block in a while loop:

done = 0
while not done:
    #wait some time
    try:
        #your code
        done = 1
    except SyntaxError:
        #your code
        pass

Julius

---------------------------------------------------------------------------
http://www.openwetware.org/wiki/User:Julius_B._Lucks
---------------------------------------------------------------------------

On Jun 17, 2007, at 2:34 PM, Jolanda Reek wrote:

> Hello,
>
> I'm using BioPython to send some protein sequences to the NCBI WWWBlast
> server and parse the output. Sometimes, like once every few minutes,
> instead of giving the output, BLAST returns an XML syntax error. It states
> something along the lines of 'SyntaxError: This XML doesn't start
> with...', and then BioPython can't parse the output (duh). I've written a
> try/except statement to resend the protein query when this problem
> occurs; however, sometimes the problem occurs multiple times in a row,
> leaving me with no other option than to nest try/except statements
> (= ugly code).
>
> print "BLAST search: "+proteinLijst[x].id
> try:
> acNr, evalueNr = blast(proteinLijst[x].sequentie)
> idList.append(acNr)
> except SyntaxError:
> print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
> try: > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > except SyntaxError: > print "SyntaxError: BLAST server send incomplete results. > Resubmitting query..." > try: > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > except SyntaxError: > print "SyntaxError: BLAST server send incomplete results. > Resubmitting query..." > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > > << Etc. Yuck. It works, but it is still yuck. :) > > Can anyone help me think up a solution? And what is causing those > faulty XML > files? (Avoiding the problem altogether is better than fixing it.) > > Thank you. > > Chiko. > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Sun Jun 17 15:01:48 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 17 Jun 2007 14:01:48 -0500 Subject: [BioPython] What to do with BLAST XML syntax error? In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> Message-ID: <7ADA1F7F-C39A-4DC0-9548-DD7971BC79D9@uiuc.edu> Very likely what is returned is not XML at all, but HTML or a server error code (thus triggering the XML error since it isn't XML). IIRC you have to loop every x seconds to check the RID, then retrieve the results; every server check could theoretically have some unforeseen server error. With BioPerl this is checked using the response object returned from every RID request: if ($response->is_error) { # throw some relevant error } Hate to say but I'm not sure how this would be handled via Python. chris On Jun 17, 2007, at 1:34 PM, Jolanda Reek wrote: > ... > Can anyone help me think up a solution? And what is causing those > faulty XML > files? (Avoiding the problem altogether is better than fixing it.) > > Thank you. > > Chiko. 
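
The retry-in-a-loop idea from this thread can be sketched with an upper limit on the number of attempts, so that a server which keeps failing does not cause an endless loop. This is a minimal sketch in Python 3: the helper name retry, the stand-in function flaky_blast and the placeholder accession "P12345" are all made up for the example, and SyntaxError is used only because that is the exception reported in the thread.

```python
import time


def retry(func, attempts=5, delay=0.0, retry_on=(SyntaxError,)):
    """Call func() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as error:
            last_error = error
            print("Attempt %d failed: %s" % (attempt, error))
            time.sleep(delay)  # wait a little before resubmitting
    raise RuntimeError("gave up after %d attempts" % attempts) from last_error


# Hypothetical stand-in for the poster's blast() function: it fails twice
# with SyntaxError and then returns an (accession, e-value) pair.
calls = {"count": 0}


def flaky_blast():
    calls["count"] += 1
    if calls["count"] < 3:
        raise SyntaxError("incomplete results")
    return ("P12345", 1e-20)


ac_nr, evalue = retry(flaky_blast)
print(ac_nr, evalue, "after", calls["count"], "calls")
```

In a real script the delay would presumably be several seconds, and whatever is raised after the last attempt can be logged so that the failing query is skipped instead of stopping the whole run.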
From mdehoon at c2b2.columbia.edu Sun Jun 17 19:01:55 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 18 Jun 2007 08:01:55 +0900 Subject: [BioPython] What to do with BLAST XML syntax error? In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> Message-ID: <4675BD63.6090301@c2b2.columbia.edu> Jolanda Reek wrote: > acNr, evalueNr = blast(proteinLijst[x].sequentie) Since this line causes the SyntaxError, could you show us what is in the "blast" function that is being called here? --Michiel. From rwbarrette at gmail.com Tue Jun 19 07:42:41 2007 From: rwbarrette at gmail.com (Roger Barrette) Date: Tue, 19 Jun 2007 07:42:41 -0400 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <46704C00.5000101@maubp.freeserve.co.uk> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk> <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> <46704C00.5000101@maubp.freeserve.co.uk> Message-ID: <2af454d50706190442o211e3b88tec9650598cf3427@mail.gmail.com> Thanks for the suggestions Peter. Once I changed the format of the command to call the restrict_gi with the \\, it worked fine. Thanks. -Roger On 6/13/07, Peter wrote: > > Roger Barrette wrote: > > Hi Peter, > > > > Thank you for the response. In regards to your questions, I am using > Python > > 2.5 w/ biopython v1.43, on a Windows platform (XP). The specific > problem I > > am having occurs when I attempt to run the blastall command from Python > > through the NCBIStandalone module. If I run the blastall without the > > "restrict_gi" option, it gives me alignment results in the .xml > > file. However, when I include the "restrict_gi" option, I get an empty > .xml > > result file. As per your suggestion however, the error file does list > an > > error. 
The output of this .err file is: > > > > [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file > > .\/BLAST/DATAout/A10241.txt > > That is interesting - it tells us that the argument is at least getting > passed to the blast program (and explains why there is no XML output). > > That path is an odd mixture of Unix and Windows style paths, which I > wouldn't expect to work. > > Could you try using \BLAST\DATAout\A10241.txt or > C:\BLAST\DATAout\A10241.txt instead (both from within Python and from > the command line). > > Remember that slashes are escape characters in python so use either > r"C:\BLAST\DATAout\A10241.txt" or "C:\\BLAST\\DATAout\\A10241.txt". > > > This is odd because when I run the command from the c:\ prompt: > > c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i > > /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt > > > > ;it works fine, and I get the alignment results, and no error. > > I'm a little surprised that does work to be honest. > > I suspect this is something subtle to do with how command line programs > break up the arguments (which is complicated by quotes and slashes - at > least you have no spaces in the filenames!). > > Note some of ways of getting python to make a system call pass the > arguments as a long string (as if typed by the user at the command > prompt) while others are already broken down into the individual terms. > > > Because I do not get any results in the xml file when I get this error, > > running the blast from the python script, there is nothing to parse, > > however, when I run the script without the "restrict_gi" option, from > either > > the command prompt or my python script, I get results in the xml file, > and > > they are able to be parsed. Any suggestions as to how to fix this > problem > > would be greatly appreciated. Thanks > > Fingers crossed using Windows style absolute paths fixes this for you... 
>
> Peter
>
>

From rwbarrette at gmail.com Tue Jun 19 07:58:19 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Tue, 19 Jun 2007 07:58:19 -0400
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
Message-ID: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com>

I am trying to set up a script to automatically go into NCBI and retrieve
individual FASTA files based on a list of accession numbers (either gi or
NC). The code that I have written gets the sequences and saves the file,
but when I run a blast against the file, it doesn't work. Am I not using the
correct parser for preparing to save the file for blasting? I tried to set
the format to "fasta", but I was getting errors saying that gi_list[0]
doesn't contain the argument 'data.seq'. I also tried the argument
.sequence, and it gave me the same error. I realize I'm not currently
calling the file in as a FASTA, but this is the only way I've been able to
even automate the record retrieval process for the long series of Blasting
that I have to do. I have a separate function for calling the Blast,
but it works fine with manually downloaded FASTA files, so the
problem appears to be here. Any suggestions for a fix, or even a better way
to do this would be greatly appreciated. Thanks.
My code is:

def Get_FASTA_Seq(NC_ID):

    i = NC_ID

    ## Search for Viruses based on TXID

    from Bio import GenBank
    gi_list = GenBank.search_for(i)
    ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")

    fasta_file = open("c:\Current_Query.gbk", "w")

    ## Extract individual Sequence from NCBI based on gi# or NC# ##

    gb_record = ncbi_dict[gi_list[0]]
    record_parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser = record_parser)
    gb_seqrecord = ncbi_dict[gi_list[0]]

    SeqValue = ncbi_dict[gi_list[0]].seq.data
    NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
    Length = len(SeqValue)
    Seq5 = 0
    Seq3 = Seq5 + Length

    print NameValue
    print Length
    print SeqValue

    ## Write sequences into the FASTA file ##

    fasta_file.write(">" + i + " " + NameValue + "\n")
    for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
        fasta_file.write(SeqValue[Seq5:Seq3])
        fasta_file.write("\n")
    ## Close and Save the FASTA file ##
    fasta_file.close()

From biopython at maubp.freeserve.co.uk Tue Jun 19 09:10:17 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jun 2007 14:10:17 +0100
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
In-Reply-To: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com>
References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com>
Message-ID: <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com>

On 6/19/07, Roger Barrette wrote:
> I am trying to set up a script to automatically go into NCBI and retrieve
> individual FASTA file based on a list of accession numbers (either gi or
> NC). The code that I have written gets the sequences and saves the file,
> but when I run a blast against the file, it doesn't work, Am I not using the
> correct parser for preparing to save the file for blasting? I tried to set
> the format to "fasta", but I was getting errors saying that gi_list[0]
> doesn't contain the arguement 'data.seq'.
> I also tried the argument
> .sequence, and it gave me the same error. I realize I'm not currently
> calling the file in as a FASTA, but this is the only way I've been able to
> even automate the record retrieval process for the long series of Blasting
> that I have to do. I have a separate function for calling the Blast,
> but it works fine with manually downloaded FASTA files, so the
> problem appears to be here. Any suggestions for a fix, or even a better way
> to do this would be greatly appreciated. Thanks. My code is:

You seem to have tried a lot of things and it's difficult to follow exactly what you are trying to do. I think you want to:

(1) start with a list of gi numbers
(2) get the matching sequence data online from the NCBI
(3) save these as a fasta file
(4) call blast using this fasta file as the input query

Anyway - I've included a bit of code for (2) and (3) at the end of this email.

> def Get_FASTA_Seq(NC_ID):
>
>     i = NC_ID
>
>     ## Search for Viruses based on TXID
>
>     from Bio import GenBank
>     gi_list = GenBank.search_for(i)
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
>
>     fasta_file = open("c:\Current_Query.gbk", "w")

It's a bit odd to save a fasta file with a gbk extension; fasta is more common. And there may be a problem with the single unescaped backslash, try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".

Remember, in Python the backslash is used for things like \n (new line), \t (tab) etc.
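
A minimal sketch of the escaping point above. It is not part of the original message, and the file names are only illustrative. It shows that a doubled backslash and a raw string give the same path, while an unescaped "\t" or "\n" silently turns into a tab or a newline:

```python
# Sketch only: how Python string literals treat backslashes in Windows paths.
# The file names below are illustrative, not taken from a real system.

escaped = "c:\\Current_Query.fasta"   # doubled backslash
raw = r"c:\Current_Query.fasta"       # raw string keeps the backslash as typed

# Both spellings describe exactly the same path.
same_path = (escaped == raw)

# Without escaping, "\t" and "\n" become a tab and a newline,
# so this "path" no longer contains any backslash at all.
broken = "c:\temp\new_query.fasta"
has_tab = "\t" in broken
has_backslash = "\\" in broken

print(same_path, has_tab, has_backslash)
```

The path in the quoted code, "c:\Current_Query.gbk", happens to survive because \C is not a recognised escape sequence, but relying on that is fragile, so the raw-string or doubled-backslash form is the safer habit.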
>     ## Extract individual Sequence from NCBI based on gi# or NC# ##
>
>     gb_record = ncbi_dict[gi_list[0]]
>     record_parser = GenBank.FeatureParser()
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser =
> record_parser)
>     gb_seqrecord = ncbi_dict[gi_list[0]]
>
>     SeqValue = ncbi_dict[gi_list[0]].seq.data
>     NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
>     Length = len(SeqValue)
>     Seq5 = 0
>     Seq3 = Seq5 + Length
>
>     print NameValue
>     print Length
>     print SeqValue
>
>     ## Write sequences into the FASTA file ##
>
>     fasta_file.write(">" + i + " " + NameValue + "\n")
>     for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
>         fasta_file.write(SeqValue[Seq5:Seq3])
>         fasta_file.write("\n")
>     ## Close and Save the FASTA file ##
>     fasta_file.close()

That is all very complicated - why mess about with Seq5 and Seq3 when you seem to want the whole sequence anyway?

Have you opened the output file in a text editor to check it looks sensible?

If you can construct a list of SeqRecords, why not write the file using Bio.SeqIO (Biopython 1.43 or later) like this:

...
gb_records = [ncbi_dict[gi] for gi in gi_list]
from Bio import SeqIO
fasta_file = open("c:\\Current_Query.fasta","w")
SeqIO.write(gb_records, fasta_file, "fasta")
fasta_file.close()

Peter

From rwbarrette at gmail.com Tue Jun 19 09:50:08 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Tue, 19 Jun 2007 09:50:08 -0400
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
In-Reply-To: <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com>
References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com>
Message-ID: <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>

Hi again Peter,

You are correct in your assumptions as to what I'm trying to accomplish. I have a habit of pulling random code from different places when I'm at a loss for how to do something, when I can't find documentation or examples.
The Seq5 and Seq3 were residual from a previous script where I was pulling out overlapping 50mers. Regardless... I added your code to my script, and changed the format for the search parameter to "fasta", but I'm getting the following error:

Traceback (most recent call last):
  File "", line 1, in
    Get_FASTA_Seq("NC_001653")
  File "C:/Python25/FASTAtry2.py", line 17, in Get_FASTA_Seq
    SeqIO.write(gb_records, fasta_file, "fasta")
  File "C:\Python25\lib\site-packages\Bio\SeqIO\__init__.py", line 214, in write
    writer_class(handle).write_file(sequences)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 243, in write_file
    self.write_records(records)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 232, in write_records
    self.write_record(record)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 103, in write_record
    id = self.clean(record.id)
AttributeError: 'str' object has no attribute 'id'

Am I not supposed to use GenBank.search_for, or NCBIDictionary, or do I need to parse the raw output first? I did try to use the GenBank RecordParser, as well as the FeatureParser set to output as "fasta", but I get the same error.

My current (+your) modified code is:

def Get_FASTA_Seq(NC_ID):
    i = NC_ID

    ## Search for Viruses based on TXID, and count number of hits ##

    from Bio import GenBank
    from Bio import SeqIO

    gi_list = GenBank.search_for(i)
    ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")

    ## Extract individual Sequence from NCBI for each TXID ##

    gb_records = [ncbi_dict[gi] for gi in gi_list]
    fasta_file = open("c:\\Current_Query.fasta","w")
    SeqIO.write(gb_records, fasta_file, "fasta")
    fasta_file.close()

Thank you for your insights. Sorry to be such a noob.
-Roger

On 6/19/07, Peter wrote:
> If you can construct a list of SeqRecords, why not write the file using
> Bio.SeqIO (Biopython 1.43 or later) like this:
>
> ...
> gb_records = [ncbi_dict[gi] for gi in gi_list]
> from Bio import SeqIO
> fasta_file = open("c:\\Current_Query.fasta","w")
> SeqIO.write(gb_records, fasta_file, "fasta")
> fasta_file.close()
>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue Jun 19 12:46:20 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jun 2007 17:46:20 +0100
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
In-Reply-To: <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>
References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com> <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>
Message-ID: <4678085C.7060401@maubp.freeserve.co.uk>

Roger Barrette wrote:
> Hi again Peter,
>
> You are correct in your assumptions as to what I'm trying to accomplish. I
> have a habit of pulling random code from different places when I'm at a loss
> for how to do something, when I can't find documentation or examples.

If we start with some of your last attempt, you can see that this NCBI dictionary just returns raw fasta records as strings:

>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
>>> ncbi_dict["A0B5H8"]
'>gi|121693723|sp|A0B5H8|A0B5H8_9EURY TATA-box binding\nMESTINI...'
>>> print ncbi_dict["A0B5H8"]
>gi|121693723|sp|A0B5H8|A0B5H8_9EURY TATA-box binding
MESTINIENVVASTKLADEFDLVKIESELEGAEYNKEKFPGLVYRVKSPKAAFLIFTSGKVVCTGAKNVE
DVRTVITNMARTLKSIGFDNINLEPEIHVQNIVASADLKTDLNLNAIALGLGLENIEYEPEQFPGLVYRI
KQPKVVVLIFSSGKLVVTGGKSPEECEEGVRIVRQQLENLGLL

You can just write these directly to your file:

from Bio import GenBank
from Bio import SeqIO
acc_list = ["A0B5H8", "A0C5G2", "A0CM02", "A0CRU8"]
#Don't use any record parser, we just want the raw text
ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
fasta_file = open("c:\\Current_Query.fasta","w")
for acc in acc_list :
    fasta_file.write(ncbi_dict[acc])
fasta_file.close()

This is very simple as there is no conversion between file formats - you are asking the NCBI for fasta format records, and you save them to a file as is.

Another option (which I was suggesting in the previous email) is to have the NCBIDictionary parse the data into SeqRecord objects (rather than raw text) and then write those to your file, possibly using Bio.SeqIO.

Peter

From mmokrejs at ribosome.natur.cuni.cz Mon Jun 25 12:05:04 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Mon, 25 Jun 2007 18:05:04 +0200
Subject: [BioPython] [Bioperl-l] How to draw a plasmid map from a genbank-formatted file?
In-Reply-To:
References: <466938F6.7050903@ribosome.natur.cuni.cz> <56BAE06F-2FDF-4FA4-B6A0-96D89470AF4C@wustl.edu> <467178AE.5040905@ribosome.natur.cuni.cz> <46717990.6040509@ribosome.natur.cuni.cz> <46723F91.60501@ribosome.natur.cuni.cz>
Message-ID: <467FE7B0.3010904@ribosome.natur.cuni.cz>

Hi Chris,

Chris Fields wrote:
>
> On Jun 15, 2007, at 2:28 AM, Martin MOKREJŠ wrote:
>
>> Chris Fields wrote:
>>> Is 99.gb supposed to be a GenBank file? And you're loading it into
>>
>> Yes, it was attached to the email. ;)
>
> Sorry about that. I notice that '.' was added, but the spacing seemed
> off. I think bioperl catches that fine but it's something Wayne should
> consider.
Would you please tell me exactly what is wrong with the spacing?

>
>>> embl2picture (which I assume takes EMBL format files)? Without example
>>> code we can easily make the wrong assumptions (i.e. that this is user
>>> error and not a BioPerl problem).
>>
>> use constant USAGE =><> Usage: $0
>> Render a GenBank/EMBL entry into drawable form.
>> Return as a GIF or PNG image on standard output.
>>
>> File must be in embl, genbank, or another SeqIO-
>> recognized format. Only the first entry will be
>> rendered.
>>
>> Example to try:
>> embl2picture.pl factor7.embl | display -
>>
>> END
>
> Horribly named script (should be seq2picture, since it converts both
> gb/embl). The use of 'all_tags' makes me think the script version you
> are using is old, as those methods have long since been renamed. Dave
> has it working though, so maybe your version has been updated? The 'use
> of uninitialized data in' errors are probably from inclusion of mandatory
> fields with no data or '.'.

Well, I just copy&pasted the script from the bioperl webpages, I think from a tutorial or FAQ, don't remember anymore.

>
>>> Also, I don't believe the feature plotting scripts plot circular
>>> chromosomes/plasmids. If you want this functionality you'll have to
>>> code it for yourself.
>>
>> That's a pity it does not, but at least if someone could improve the
>> docs. ;)
>> Unfortunately I don't have the time to rewrite the code myself now,
>> I need a working, standalone, already available tool. :(
>> M.
>
> As I said, unless someone shows interest and codes it just won't get
> done. We have had very little interest in this, either b/c there are
> tools already out there to do this very thing (multitudes of plasmid
> drawing programs, some free like ApE) or that nobody's bothered to write
> it up.

Well, my search for such tools available on Unix to be used in a script, non-interactively, completely failed.
My last hope except getting improved ApE is to use the GenomeDiagram under biopython, but so far my .gb files cannot be parsed yet. :(

Martin

From cjfields at uiuc.edu Mon Jun 25 12:48:30 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 25 Jun 2007 11:48:30 -0500
Subject: [BioPython] [Bioperl-l] How to draw a plasmid map from a genbank-formatted file?
In-Reply-To: <467FE7B0.3010904@ribosome.natur.cuni.cz>
References: <466938F6.7050903@ribosome.natur.cuni.cz> <56BAE06F-2FDF-4FA4-B6A0-96D89470AF4C@wustl.edu> <467178AE.5040905@ribosome.natur.cuni.cz> <46717990.6040509@ribosome.natur.cuni.cz> <46723F91.60501@ribosome.natur.cuni.cz> <467FE7B0.3010904@ribosome.natur.cuni.cz>
Message-ID:

Martin,

Keep bioperl-related discussion on the bioperl mail list. The large majority of this isn't biopython-related, but maybe some devs there can add to this?

On Jun 25, 2007, at 11:05 AM, Martin MOKREJŠ wrote:
...
> Would you please tell me exactly what is wrong with the spacing?

Here's a section of the seq record attached to your previous email:

DEFINITION .
ACCESSION .
VERSION .
SOURCE .
ORGANISM .

Normally there is a fixed column width for any data present in a field, so it would look more like this:

DEFINITION  PYR4 (DIHYDROOROTASE, PYRIMIDIN 4, dihydroorotase); dihydroorotase
            [Arabidopsis thaliana].
ACCESSION   NP_194024
VERSION     NP_194024.1  GI:15235865
DBSOURCE    REFSEQ: accession NM_118422.3
KEYWORDS    .
SOURCE      Arabidopsis thaliana (thale cress)
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids II; Brassicales; Brassicaceae; Arabidopsis.

Here's the relevant bit in the latest release notes:

"The second part of each sequence entry record contains the information appropriate to its keyword, in positions 13 to 80 for keywords and positions 11 to 80 for the sequence."
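
As a rough illustration of the fixed-column rule quoted above (a sketch that is not part of the original message): the helper below reports the 1-based column at which a header line's value starts, so a well-formed line gives 13. The name `value_column` is hypothetical; it is not a BioPerl or Biopython function, and the sample lines are made up for the example.

```python
def value_column(line):
    """Return the 1-based column where the value of a GenBank header line
    starts, or None if the line has a keyword but no value.

    Hypothetical helper for illustration only.
    """
    stripped = line.lstrip(" ")
    if not stripped.strip():
        return None  # blank line
    indent = len(line) - len(stripped)
    keyword = stripped.split(None, 1)[0]
    rest = stripped[len(keyword):]
    value = rest.lstrip(" ")
    if not value.strip():
        return None  # keyword only, e.g. a bare COMMENT line
    padding = len(rest) - len(value)
    return indent + len(keyword) + padding + 1


# Illustrative lines: the first two follow the layout, the third has one
# leading space too many, which pushes the value to column 14.
for sample in ("DEFINITION  .", "  ORGANISM  .", "   ORGANISM  ."):
    print(repr(sample), value_column(sample))
```

A check like this only looks at where the value begins. It does not validate the keyword itself or the 80-column limit.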
The bioperl devs try to make our parsers as flexible as possible but others may not, so it's something in ApE that should probably be fixed. And as mentioned to you several times in the past on the mail list and on bugzilla, don't expect sequence records which stray from the standard (in this case, the release notes) to parse correctly in all cases. We can try supporting some that stray from that standard but only up to a point. If it causes additional bugs, headaches, or degrades performance it won't be supported.

> ...
> Well, I just copy&pasted the script from the bioperl webpages, I think
> from a tutorial or FAQ, don't remember anymore.

Well, can't help you if you can't point out where the code originated from. We would like to know so it can be corrected.

> ...
> Well, my search for such tools available on Unix to be used in a script,
> non-interactively, completely failed. My last hope except getting improved
> ApE is to use the GenomeDiagram under biopython, but so far my .gb files
> cannot be parsed yet. :(
> Martin

As mentioned previously you will likely have to code for it yourself (perl or python) or help debug the relevant biopython code to get it working. We can't/won't do this for you unless/until it's something we feel warrants implementation. Judging by the bug list, we also haven't the time nor inclination to code for it. Sorry but we have other priorities besides doing your work for you.
chris

From mmokrejs at ribosome.natur.cuni.cz Mon Jun 25 10:31:49 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Mon, 25 Jun 2007 16:31:49 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk>
Message-ID: <467FD1D5.9020503@ribosome.natur.cuni.cz>

Hi Peter,

I have re-tried the current CVS version of Biopython with a file regenerated by the fixed version of the ApE editor. Unfortunately, I got:

$ python generate_image_from_genbank.py
Traceback (most recent call last):
  File "generate_image_from_genbank.py", line 7, in ?
    genbank_entry = parser.parse(fhandle)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed
    self._feed_first_line(consumer, self.line)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 876, in _feed_first_line
    raise SyntaxError('Did not recognise the LOCUS line layout:\n' + line)
SyntaxError: Did not recognise the LOCUS line layout:
LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular 14-JUN-2007

What's wrong with the LOCUS line now? Bioperl from CVS can read it, and I thought it is already following the current specs. ;-)

Thanks for your help,
Martin

Peter wrote:
> Martin MOKREJŠ wrote:
>> Hi Peter, Chris and others, here I am passing the answer from Wayne
>> back, sorry for the difficult cross-communication.
>
> Thank you both, Martin & Wayne.
>
> Wayne Davis wrote:
>> [the] locus line I'm using is the old standard (some older parsers
>> wanted it that way).
>
> That's worth knowing - thank you.
> Given that, maybe we (Biopython)
> should try and parse these files (which aside from the missing
> identifier in the LOCUS line should be fairly simple). On the other
> hand, I doubt many people still use this particular old format.
>
> Wayne Davis wrote:
>>> I've updated to write the new standard, if your
>>> program isn't flexible enough to read the old style locus lines.
>
> That's good news. Martin - will this solve your problem, or do you
> think we should also update Biopython to cope with these "old style"
> LOCUS lines (which also lack identifiers)?
>
> Wayne Davis wrote:
>>> We encourage software developers to switch to a token-based LOCUS
>>> parsing approach, rather than a column-specific approach. If this
>>> is done, then future changes to the LOCUS line that affect only the
>>> spacing of its data values will not require any modifications to
>>> software.
>
> Easier said than done, as some fields can also contain white space.
> However, Howard Salis has some interesting code to tackle this attached
> to Biopython bug 2294.
>
> Peter wrote:
>>> The next six lines of that example file (elh/pNEX3.gb) have no
>>> values - as Chris Fields pointed out on the Biopython mailing list,
>>> the NCBI likes to use a dot/period as a place holder.
>>>
>>> The spec does explicitly say that the KEYWORDS can be omitted, but
>>> seems to assume the other lines are expected. Biopython should be
>>> happy if these lines are just omitted.
>
> Just to correct myself, many of those fields are described as mandatory
> single entries further up in the documentation - so using a dot/period
> (as Wayne has done for the ApE plasmid editor) does seem the best solution.
>
> Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
>> 3.4.2 Entry Organization
>> ...
>> The following is a brief description of each entry field. Detailed
>> information about each field may be found in Sections 3.4.4 to 3.4.15.
>>
>> LOCUS ... Mandatory keyword/exactly one record. DEFINITION ...
>> Mandatory keyword/one or more records. ACCESSION ... Mandatory
>> keyword/one or more records. VERSION... Mandatory keyword/exactly one
>> record. ...
>
> KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated
> entries (so not mandatory in general). COMMENT is optional.
>
> Peter
>

--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pGL3R.gb.gz
Type: application/x-tar
Size: 3117 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20070625/f6e08a5a/attachment-0001.tar

From mmokrejs at ribosome.natur.cuni.cz Wed Jun 27 10:27:54 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 27 Jun 2007 16:27:54 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <467FD1D5.9020503@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz>
Message-ID: <468273EA.90709@ribosome.natur.cuni.cz>

Hi,

Martin MOKREJŠ wrote:
> Hi Peter, I have re-tried the current CVS version of Biopython with a
> file regenerated by the fixed version of the ApE editor. Unfortunately, I
> got:
>
> $ python generate_image_from_genbank.py Traceback (most recent call
> last): File "generate_image_from_genbank.py", line 7, in ?
> genbank_entry = parser.parse(fhandle) File
> "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187,
> in parse self._scanner.feed(handle, self._consumer) File
> "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360,
> in feed self._feed_first_line(consumer, self.line) File
> "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 876,
> in _feed_first_line raise SyntaxError('Did not recognise the LOCUS
> line layout:\n' + line) SyntaxError: Did not recognise the LOCUS line
> layout: LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA
> circular 14-JUN-2007
>
> What's wrong with the LOCUS line now? Bioperl from CVS can read it,
> and I thought it is already following the current specs. ;-) Thanks
> for your help, Martin

OK, I found that the spacing problem with my LOCUS lines still persisted, and after some scripting I got the lines fixed. Now, with the current CVS version, I get:

$ python generate_image_from_genbank.py
Traceback (most recent call last):
  File "generate_image_from_genbank.py", line 7, in ?
    genbank_entry = parser.parse(fhandle)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 978, in _feed_header_lines
    consumer.taxonomy(data.strip())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 419, in taxonomy
    self.data.annotations['taxonomy'] = self._split_taxonomy(content)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 250, in _split_taxonomy
    if taxonomy_string[-1] == '.':
IndexError: string index out of range
$

The file starts with:

LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007
DEFINITION .
ACCESSION .
VERSION .
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:0
FEATURES Location/Qualifiers

From mmokrejs at ribosome.natur.cuni.cz Wed Jun 27 11:22:53 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 27 Jun 2007 17:22:53 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <46827B15.90301@maubp.freeserve.co.uk>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz> <46827B15.90301@maubp.freeserve.co.uk>
Message-ID: <468280CD.6010901@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> Martin MOKREJŠ wrote:
>> OK, I have found the spacing problem with my LOCUS lines still to
>> persist,
>> and after some scripting I got the lines fixed.
>
> Excellent. I've been away for a few days and haven't had a chance to
> look at this yet.

thanks! No problem, I was busy as well. ;-)

>
>> The file starts with:
>>
>> LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN
>> 14-JUN-2007
>> DEFINITION .
>> ACCESSION .
>> VERSION .
>> SOURCE .
>> ORGANISM .
>> COMMENT
>> COMMENT ApEinfo:methylated:0
>> FEATURES Location/Qualifiers
>>
>
> The ORGANISM line looks wrong (three leading spaces rather than two, so
> the dot is pushed one column to the right).
>
> There is a blank COMMENT line which is also odd.
>
> Some of this may just be an email formatting issue, but I would expect
> this instead:
>
> ...
> DEFINITION .
> ACCESSION .
> VERSION .
> SOURCE .
> ORGANISM .
> COMMENT ApEinfo:methylated:0
> FEATURES Location/Qualifiers
> ...

OK, I have removed the COMMENT lines altogether and have fixed the ORGANISM line. Still, I get:

python generate_image_from_genbank.py
Traceback (most recent call last):
  File "generate_image_from_genbank.py", line 7, in ?
    genbank_entry = parser.parse(fhandle)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 978, in _feed_header_lines
    consumer.taxonomy(data.strip())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 419, in taxonomy
    self.data.annotations['taxonomy'] = self._split_taxonomy(content)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 250, in _split_taxonomy
    if taxonomy_string[-1] == '.':
IndexError: string index out of range

LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007
DEFINITION .
ACCESSION .
VERSION .
SOURCE .
ORGANISM .

Thanks for your help,
M.

From biopython at maubp.freeserve.co.uk Wed Jun 27 10:58:29 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 27 Jun 2007 15:58:29 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <468273EA.90709@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz>
Message-ID: <46827B15.90301@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> OK, I have found the spacing problem with my LOCUS lines still to persist,
> and after some scripting I got the lines fixed.

Excellent. I've been away for a few days and haven't had a chance to look at this yet.

> The file starts with:
>
> LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007
> DEFINITION .
> ACCESSION .
> VERSION .
> SOURCE .
> ORGANISM .
> COMMENT
> COMMENT ApEinfo:methylated:0
> FEATURES Location/Qualifiers
>

The ORGANISM line looks wrong (three leading spaces rather than two, so the dot is pushed one column to the right).

There is a blank COMMENT line which is also odd.

Some of this may just be an email formatting issue, but I would expect this instead:

...
DEFINITION  .
ACCESSION   .
VERSION     .
SOURCE      .
  ORGANISM  .
COMMENT     ApEinfo:methylated:0
FEATURES             Location/Qualifiers
...

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 27 14:26:28 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 27 Jun 2007 19:26:28 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <468280CD.6010901@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz> <46827B15.90301@maubp.freeserve.co.uk> <468280CD.6010901@ribosome.natur.cuni.cz>
Message-ID: <4682ABD4.5080904@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> OK, I have removed the COMMENT lines altogether and have fixed the ORGANISM
> line. Still, I get ...

With hindsight I should have suggested something like this:

DEFINITION  .
ACCESSION   .
VERSION     .
SOURCE      .
  ORGANISM  .
            .
COMMENT     ApEinfo:methylated:0
FEATURES             Location/Qualifiers

There are usually line(s) after the ORGANISM line which hold the taxonomy, and Biopython was failing when this was missing. I have just updated CVS with a fix (see files Bio/GenBank/Scanner.py and __init__.py) for when these lines are missing or empty.

You said you had resolved the LOCUS line issue with ApE plasmid editor's output - so hopefully its files should work with Biopython now.
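
For readers following the traceback, here is a minimal sketch of the kind of guard that avoids the `taxonomy_string[-1]` IndexError on an empty taxonomy. It is an assumption-laden illustration, not the actual change committed to Biopython: `split_taxonomy` is a hypothetical stand-alone function, and treating a lone "." as "no lineage" is a choice made for this example.

```python
def split_taxonomy(taxonomy_string):
    """Split a GenBank taxonomy string into a list of lineage names.

    Hypothetical helper for illustration; not Biopython's own code.
    """
    text = taxonomy_string.strip()
    # An empty string or a lone "." placeholder carries no lineage.
    # Indexing text[-1] here would raise IndexError on the empty case.
    if not text or text == ".":
        return []
    # Drop the period that normally terminates the lineage.
    if text.endswith("."):
        text = text[:-1]
    # Names are separated by semicolons; ignore empty fragments.
    return [name.strip() for name in text.split(";") if name.strip()]


print(split_taxonomy(""))
print(split_taxonomy("Eukaryota; Viridiplantae; Streptophyta."))
```

Using `endswith` instead of indexing the last character is what makes the empty case safe; the rest is ordinary string splitting.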
Peter

From sdavis2 at mail.nih.gov Wed Jun 27 16:38:06 2007
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 27 Jun 2007 16:38:06 -0400
Subject: [BioPython] OBO format parser and genome databasing
Message-ID: <4682CAAE.6060000@mail.nih.gov>

I did a bit of googling but didn't find an answer, so I will ask again. Does anyone know of an OBO format parser for python?

As a more general question, is anyone using the chado database schema with python? Is there a similar project to chado/gmod but python-based? How about a microarray (MAGE-type) database system that is python-based?

Thanks,
Sean

From biopython at maubp.freeserve.co.uk Thu Jun 28 04:38:00 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 28 Jun 2007 09:38:00 +0100
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <4682CAAE.6060000@mail.nih.gov>
References: <4682CAAE.6060000@mail.nih.gov>
Message-ID: <46837368.2090302@maubp.freeserve.co.uk>

Sean Davis wrote:
> Does anyone know of an OBO format parser for python?

This is the replacement for the older GO flat file format, right?

> As a more general question, is anyone using the chado database schema
> with python?

I haven't.

> Is there a similar project to chado/gmod but python-based? How about
> a microarray (MAGE-type) database system that is python-based?

Sorry Sean, I don't know of such a project.

Peter

From lpritc at scri.ac.uk Thu Jun 28 05:33:46 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 28 Jun 2007 10:33:46 +0100
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <46837368.2090302@maubp.freeserve.co.uk>
References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk>
Message-ID: <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk>

Hi Sean,

On Thu, 2007-06-28 at 09:38 +0100, Peter wrote:
> Sean Davis wrote:
> > Does anyone know of an OBO format parser for python?
>
> This is the replacement for the older GO flat file format, right?
I have been using the BioPerl load_ontology.pl script to load .obo files into BioSQL. Unfortunately, there have been some issues with the GO .obo files and that script, and I've had to fall back on the flat files. The Sequence Ontology/SOFA .obo files seem to work with it OK, though. > > As a more general question, is anyone using the chado database schema > > with python? > > I haven't. Nor me - still sticking with BioSQL for now. L. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C
From sdavis2 at mail.nih.gov Thu Jun 28 06:47:55 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 28 Jun 2007 06:47:55 -0400 Subject: [BioPython] OBO format parser and genome databasing In-Reply-To: <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk> References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk> <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <468391DB.5040007@mail.nih.gov> Leighton Pritchard wrote: > Hi Sean, > > On Thu, 2007-06-28 at 09:38 +0100, Peter wrote: >> Sean Davis wrote: >>> Does anyone know of an OBO format parser for python? >> This is the replacement for the older GO flat file format, right? > > I have been using the BioPerl load_ontology.pl script to load .obo files > into BioSQL. Unfortunately, there have been some issues with the > GO .obo files and that script, and I've had to fall back on the flat > files. The Sequence Ontology/SOFA .obo files seem to work with it OK, > though. > >>> As a more general question, is anyone using the chado database schema >>> with python? >> I haven't. > > Nor me - still sticking with BioSQL for now. Thanks, Leighton. Sean From sdavis2 at mail.nih.gov Thu Jun 28 06:49:17 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 28 Jun 2007 06:49:17 -0400 Subject: [BioPython] OBO format parser and genome databasing In-Reply-To: <46837368.2090302@maubp.freeserve.co.uk> References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk> Message-ID: <4683922D.2090803@mail.nih.gov> Peter wrote: > Sean Davis wrote: >> Does anyone know of an OBO format parser for python? > > This is the replacement for the older GO flat file format, right? One of them, yes. http://www.geneontology.org/GO.format.obo-1_2.shtml >> As a more general question, is anyone using the chado database schema >> with python? > > I haven't. > >> Is there a similar project to chado/gmod but python-based? 
How about >> a microarray (MAGE-type) database system that is python-based? > > Sorry Sean, I don't know of such a project. Thanks, Peter. Sean From dalloliogm at gmail.com Thu Jun 28 09:45:28 2007 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 28 Jun 2007 15:45:28 +0200 Subject: [BioPython] I don't understand why SeqRecord.feature is a list In-Reply-To: <466EAE8D.2090609@maubp.freeserve.co.uk> References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> <466EAE8D.2090609@maubp.freeserve.co.uk> Message-ID: <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com> Hi! In principle, when I can't decide which keys to use for a dictionary, I just take simple numerical integers as keys, and it works quite well. It simplifies testing/debugging/organization a lot and I can decide the meaning of every key later (so it's better for dictionaries which have to contain very heterogeneous data). I'm not sure I have understood the example you gave me on http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features , but it seems to work in a way similar to what I was saying before: it saves all the features in a list (or is it a dictionary?) and access them later by their positions. Not to be silly but... how do you represent a gene with its transcripts/exons/introns structure with biopython? With SeqRecord and SeqFeature objects? I still don't get it :( Cheers! 2007/6/12, Peter : > Marc Colosimo wrote: > > Additionally, for many formats you can have multiple features with > > the same name; e.g., CDS, gene, etc... in GenBank Records. > > Indeed - and as the SeqRecord/SeqFeature is most heavily used by the > GenBank parser, that does explain things well. > > The problem with using a dictionary is what to index on - you can't > simply use the location string for example, as there usually entries for > genes and CDS features with the same location. 
> > You can't depend on any other information like an identifier or name to > be present in a GenBank file for all feature types. > > In general, the choice of index will depend on what you want to use it > for - so the flippant answer is just index it yourself, for example like > this: > > http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features > > > The same rational doesn't fully apply to why the feature qualifiers > > are dictionaries of lists. > > No it doesn't. The rational seems to have been that feature qualifiers > in GenBank files can occur with no values (e.g. /pseudo and others), a > single value (e.g. translation) or multiple values (by repeated keys, > e.g. database cross references). So using a list is a simple solution > to cover all these cases - even if most entries only have a single > entry. (There are some old posts on the mailing list archive discussing > this.) > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com From biopython at maubp.freeserve.co.uk Thu Jun 28 11:11:28 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 28 Jun 2007 16:11:28 +0100 Subject: [BioPython] I don't understand why SeqRecord.feature is a list In-Reply-To: <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com> References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> <466EAE8D.2090609@maubp.freeserve.co.uk> <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com> Message-ID: <4683CFA0.1050905@maubp.freeserve.co.uk> Giovanni Marco Dall'Olio wrote: > Hi! > In principle, when I can't decide which keys to use for a dictionary, > I just take simple numerical integers as keys, and it works quite > well. 
> It simplifies testing/debugging/organization a lot and I can decide
> the meaning of every key later (so it's better for dictionaries which
> have to contain very heterogeneous data).

It sounds like you don't need/want a dictionary at all. If you are assigning increasing integers as "keys", then why not just use the list of features directly? e.g. assuming record is a SeqRecord object:

    first_feature = record.features[0]
    second_feature = record.features[1]
    third_feature = record.features[2]
    etc

> I'm not sure I have understood the example you gave me on
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
> , but it seems to work in a way similar to what I was saying before:
> it saves all the features in a list (or is it a dictionary?) and
> access them later by their positions.

That example stored integers (indices in the features list) in a dictionary keyed by the locus tag, GI number or GeneID (e.g. keys like "NEQ010", "GI:41614806" or "GeneID:2654552").

The point is that if you know in advance you want to find individual features on the basis of their locus tag (for example), rather than their order in the file, then I would map the locus tag strings to positions in the list. e.g.

    locus_tag_cds_index = \
        index_genbank_features(gb_record, "CDS", "locus_tag")
    my_feature = gb_record.features[locus_tag_cds_index["NEQ010"]]

You could also build a dictionary which maps from the locus tag directly to the associated SeqFeature objects themselves.

> Not to be silly but... how do you represent a gene with its
> transcripts/exons/introns structure with biopython? With SeqRecord and
> SeqFeature objects?

If you load a GenBank or EMBL file using SeqIO you get one SeqRecord object (assuming there is only one LOCUS line in the file) which contains a list of SeqFeature objects, which in turn may contain sub-features. I work with bacteria so I don't have much experience of dealing with sub-features in a SeqFeature object.
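
[Editorial sketch] The locus tag index described in this message could be built roughly as follows. This is only a minimal sketch under stated assumptions: `Feature` is a stand-in written for this example (so it runs without Biopython), mirroring only the `.type` string and the `.qualifiers` dictionary of lists that SeqFeature objects carry; `index_features` is a hypothetical helper, not the `index_genbank_features` function from the web page linked above; and the pairing of qualifier values in the sample data is illustrative.

```python
from collections import namedtuple

# Stand-in for a SeqFeature (assumption for this sketch): a .type string
# and a .qualifiers dict whose values are lists of strings.
Feature = namedtuple("Feature", ["type", "qualifiers"])


def index_features(features, feature_type, qualifier):
    """Map each value of `qualifier` to the list position of its feature."""
    index = {}
    for position, feature in enumerate(features):
        if feature.type != feature_type:
            continue
        # Qualifier values are lists; a missing qualifier contributes nothing.
        for value in feature.qualifiers.get(qualifier, []):
            if value in index:
                raise ValueError("Duplicate %s value: %s" % (qualifier, value))
            index[value] = position
    return index


# Illustrative data only: a gene and a CDS sharing a locus tag, plus a CDS
# that has no locus_tag qualifier at all.
features = [
    Feature("gene", {"locus_tag": ["NEQ010"]}),
    Feature("CDS", {"locus_tag": ["NEQ010"],
                    "db_xref": ["GI:41614806", "GeneID:2654552"]}),
    Feature("CDS", {"pseudo": [""]}),
]

locus_tag_cds_index = index_features(features, "CDS", "locus_tag")
my_feature = features[locus_tag_cds_index["NEQ010"]]
print(my_feature.type, my_feature.qualifiers["db_xref"])
```

Restricting the index to one feature type sidesteps the problem mentioned earlier in the thread, where a gene and its CDS share the same locus tag or location. Features lacking the chosen qualifier are simply left out of the index.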
Peter From italo.maia at gmail.com Mon Jun 4 16:36:21 2007 From: italo.maia at gmail.com (Italo Maia) Date: Mon, 4 Jun 2007 13:36:21 -0300 Subject: [BioPython] Problem with blastx output parsing =~ Message-ID: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> Well, i have a perfectly fine blastx output that throws an error when parsed by biopython. It gives me this output: Traceback (most recent call last): File "", line 1, in File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 624, in parse self._scanner.feed(handle, self._consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 99, in feed self._scan_parameters(uhandle, consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 570, in _scan_parameters has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, in read_and_call raise SyntaxError, errmsg SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': Number of HSP's gapped: 136690 What could i do??? I'm using ubuntu feisty here. -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB Programador Java, Python Meu blog ^^ http://eusouolobomal.blogspot.com/ =========================== From biopython at maubp.freeserve.co.uk Mon Jun 4 17:05:40 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 04 Jun 2007 18:05:40 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> Message-ID: <46644664.6080009@maubp.freeserve.co.uk> Italo Maia wrote: > Well, i have a perfectly fine blastx output that throws an error when parsed > by biopython. 
> It gives me this output: > > Traceback (most recent call last): > File "", line 1, in > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 624, in parse > self._scanner.feed(handle, self._consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 99, in feed > self._scan_parameters(uhandle, consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 570, in _scan_parameters > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, > in read_and_call > raise SyntaxError, errmsg > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > Number of HSP's gapped: 136690 > > What could i do??? I'm using ubuntu feisty here. It looks like you are using the plain text output from blast, so we would recommend you try the XML output instead. See section 3.4 of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html If you really want to use the plain text output, please file a bug (including Biopython version number) and then attach the plain text blast output which fails. But no promises - its an uphill battle to keep the parser up to date with each version of Blast! Peter From italo.maia at gmail.com Mon Jun 4 17:22:15 2007 From: italo.maia at gmail.com (Italo Maia) Date: Mon, 4 Jun 2007 14:22:15 -0300 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <46644664.6080009@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> Message-ID: <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> Well, i have 24 thousand of those, i think it would be very painfull to remake them...i'll fill the the bug, but, could there be a workaround? The file goes below: <<>> BLASTX 2.2.15 [Oct-15-2006] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. 
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= 26 (858 letters) Database: Leigo 4,535,438 sequences; 1,573,298,872 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus] 39 0.33 gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus] 38 0.57 gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus] 38 0.57 gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus] 38 0.75 >gi|15778340|gb|AAL07392.1|AF411412_4 polymerase [Hepatitis B virus] Length = 843 Score = 38.9 bits (89), Expect = 0.33 Identities = 24/89 (26%), Positives = 42/89 (47%), Gaps = 1/89 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IW+ HP + P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTTRQSFGVEPSGSRHIDNSASSTTSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHR 825 + ++ K +++GRA PS+ R Sbjct: 285 AYSHLSTSKRQSSSGRAVELHNIPPSSVR 313 >gi|12060441|dbj|BAB20611.1| DNA polymerase [Hepatitis B virus] Length = 843 Score = 38.1 bits (87), Expect = 0.57 Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IWS HP + P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRSFGVEPSGSGHIDNSASSTSSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A P++ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVEFHNIPPNSARS 314 >gi|84095095|dbj|BAE66661.1| P protein [Hepatitis B virus] Length = 843 Score = 38.1 bits (87), Expect = 0.57 Identities = 23/90 (25%), Positives = 42/90 (46%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IW+ HP + P G+ 
H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWARVHPTSRRSFGVEPSGSGHIDNSASSASSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A PS+ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVELLNIPPSSARS 314 >gi|57021117|ref|NP_647604.2| Polymerase [Hepatitis B virus] Length = 843 Score = 37.7 bits (86), Expect = 0.75 Identities = 24/90 (26%), Positives = 41/90 (45%), Gaps = 1/90 (1%) Frame = +1 Query: 562 VSPLLGAMTRGKRRKPGRIWSISHPLPITNLWQHPDGAWHANNRPTSVLAAAN*KE-RKF 738 + P G++ RGK + G IWS HP P G+ H +N +S + + RK Sbjct: 225 LQPQQGSLARGKSGRSGSIWSRVHPTTRRPFGVEPSGSGHIDNTASSTSSCLHQSAVRKT 284 Query: 739 FFYKQTSCKAANNTGRATPDAQWTPSTHRA 828 + ++ K +++G A PS+ R+ Sbjct: 285 AYSHLSTSKRQSSSGHAVELHNIPPSSARS 314 Database: Leigo Posted date: Jan 22, 2007 11:26 AM Number of letters in database: 1,573,298,872 Number of sequences in database: 4,535,438 Lambda K H 0.318 0.134 0.401 Gapped Lambda K H 0.267 0.0410 0.140 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Sequences: 4535438 Number of Hits to DB: 2,724,816,234 Number of extensions: 65999927 Number of successful extensions: 158184 Number of sequences better than 2.0: 4 Number of HSP's gapped: 158133 Number of HSP's successfully gapped: 4 Length of query: 286 Length of database: 1,573,298,872 Length adjustment: 130 Effective length of query: 156 Effective length of database: 983,691,932 Effective search space: 153455941392 Effective search space used: 153455941392 Neighboring words threshold: 12 Window for multiple hits: 40 X1: 16 ( 7.3 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 41 (21.7 bits) S2: 32 (16.9 bits) <<>> 2007/6/4, Peter : > > Italo Maia wrote: > > Well, i have a perfectly fine blastx output that throws an error when > parsed > > by biopython. 
> > It gives me this output: > > > > Traceback (most recent call last): > > File "", line 1, in > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 624, in parse > > self._scanner.feed(handle, self._consumer) > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 99, in feed > > self._scan_parameters(uhandle, consumer) > > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", > line > > 570, in _scan_parameters > > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line > 300, > > in read_and_call > > raise SyntaxError, errmsg > > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > > Number of HSP's gapped: 136690 > > > > What could i do??? I'm using ubuntu feisty here. > > It looks like you are using the plain text output from blast, so we > would recommend you try the XML output instead. > > See section 3.4 of the tutorial: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > > If you really want to use the plain text output, please file a bug > (including Biopython version number) and then attach the plain text > blast output which fails. But no promises - its an uphill battle to keep > the parser up to date with each version of Blast! > > Peter > > -- "A arrog?ncia ? a arma dos fracos." 
=========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB Programador Java, Python Meu blog ^^ http://eusouolobomal.blogspot.com/ =========================== From winter at biotec.tu-dresden.de Mon Jun 4 17:08:34 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Mon, 04 Jun 2007 19:08:34 +0200 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> Message-ID: <46644712.3030405@biotec.tu-dresden.de> Hi Maia: Could you post the begin and end of your blastx output? I think you can omit the query and hits in between... Cheers, Christof Italo Maia wrote: > Well, i have a perfectly fine blastx output that throws an error when parsed > by biopython. > It gives me this output: > > Traceback (most recent call last): > File "", line 1, in > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 624, in parse > self._scanner.feed(handle, self._consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 99, in feed > self._scan_parameters(uhandle, consumer) > File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line > 570, in _scan_parameters > has_re=re.compile(r"[Ll]ength of \s*[Dd]atabase")) > File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 300, > in read_and_call > raise SyntaxError, errmsg > SyntaxError: Line does not match regex '[Ll]ength of \s*[Dd]atabase': > Number of HSP's gapped: 136690 > > What could i do??? I'm using ubuntu feisty here. 
>
>

From biopython at maubp.freeserve.co.uk Mon Jun 4 17:41:38 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 04 Jun 2007 18:41:38 +0100
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com>
Message-ID: <46644ED2.1080505@maubp.freeserve.co.uk>

Italo Maia wrote:
> Well, i have 24 thousand of those, i think it would be very painfull to
> remake them...

That is a good reason not to re-run blast!

> i'll fill the the bug, but, could there be a workaround?

If you haven't filed a new bug already, you could attach the file to bug 2090, which is similar:

http://bugzilla.open-bio.org/show_bug.cgi?id=2090

You could try that patch - it might help. I would have tested your example on my setup, but the line wrapping had been messed up.

Peter

From cjfields at uiuc.edu Mon Jun 4 17:55:30 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 4 Jun 2007 12:55:30 -0500
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <46644664.6080009@maubp.freeserve.co.uk>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk>
Message-ID:

On Jun 4, 2007, at 12:05 PM, Peter wrote:
> ...
> It looks like you are using the plain text output from blast, so we
> would recommend you try the XML output instead.
>
> See section 3.4 of the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>
> If you really want to use the plain text output, please file a bug
> (including Biopython version number) and then attach the plain text
> blast output which fails. But no promises - its an uphill battle to
> keep
> the parser up to date with each version of Blast!
>
> Peter

Same with the bioperl parser; we routinely recommend parsing XML or tabular output as they are more stable. Here is NCBI's official response (via Scott McGinnis) to problems with text BLAST output parsing:

http://bioperl.org/wiki/NCBI_Blast_email

chris

From winter at biotec.tu-dresden.de Tue Jun 5 08:24:07 2007
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Tue, 05 Jun 2007 10:24:07 +0200
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To:
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk>
Message-ID: <46651DA7.6020802@biotec.tu-dresden.de>

Chris Fields wrote:
> On Jun 4, 2007, at 12:05 PM, Peter wrote:
>
>> ...
>> It looks like you are using the plain text output from blast, so we
>> would recommend you try the XML output instead.
>>
>> See section 3.4 of the tutorial:
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>
>> If you really want to use the plain text output, please file a bug
>> (including Biopython version number) and then attach the plain text
>> blast output which fails. But no promises - its an uphill battle to
>> keep
>> the parser up to date with each version of Blast!
>>
>> Peter
>
> Same with the bioperl parser; we routinely recommend parsing XML or
> tabular output as they are more stable.

I fully agree with that! I think the reason people still use the flat file format so much is that they want to be able to read it easily. What these people are probably missing is an easy way to convert XML Blast output to a plain text Blast report. Would it make sense to include such a conversion script in BioPython? Maybe XSLT is the easiest option?
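
[Editorial sketch] To illustrate why the XML output recommended in this thread is easier to parse and to turn back into a readable summary, here is a minimal sketch using only Python's standard library. Several things are assumptions: the XML fragment is hand-written for this example (it uses element names from NCBI's BLAST XML format, with values copied from the first hit of the plain text report posted earlier in the thread), and `summarize_hits` is a hypothetical helper. It does not stand in for Biopython's own XML parser, which remains the documented route in the tutorial.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; real BLAST XML output
# contains many more elements than are shown here.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>polymerase [Hepatitis B virus]</Hit_def>
          <Hit_len>843</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>38.9</Hsp_bit-score>
              <Hsp_evalue>0.33</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def summarize_hits(xml_text, max_evalue=1.0):
    """Return (definition, length, bit score, e-value) per HSP at or below max_evalue."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        definition = hit.findtext("Hit_def")
        length = int(hit.findtext("Hit_len"))
        for hsp in hit.iter("Hsp"):
            evalue = float(hsp.findtext("Hsp_evalue"))
            if evalue <= max_evalue:
                bits = float(hsp.findtext("Hsp_bit-score"))
                rows.append((definition, length, bits, evalue))
    return rows


# Print a short, human-readable line per reported HSP.
for row in summarize_hits(SAMPLE_XML):
    print("%s\tlength=%d\tbits=%.1f\tE=%g" % row)
```

Because every value sits in its own named element, nothing here depends on line wrapping or on the exact wording of the statistics footer, which is what tripped up the plain text parser in the traceback at the start of the thread.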
Cheers,
Christof

From biopython at maubp.freeserve.co.uk Tue Jun 5 10:53:02 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 05 Jun 2007 11:53:02 +0100
Subject: [BioPython] Problem with blastx output parsing =~
In-Reply-To: <46644ED2.1080505@maubp.freeserve.co.uk>
References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk>
Message-ID: <4665408E.2090306@maubp.freeserve.co.uk>

Peter wrote:
> Italo Maia wrote:
>> Well, i have 24 thousand of those, i think it would be very painfull to
>> remake them...

Italo,

Please could you attach one or two of your plain text blast output files to bug 2090, or just email them to me directly (not the list) as attachments (not pasted into the body of the email). Thanks.

http://bugzilla.open-bio.org/show_bug.cgi?id=2090

Getting Biopython's plain text blast parser updated may not be too much work... assuming your results are all in separate files, that is. If you have run blast with multiple inputs, then recent NCBI changes have made life harder.

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 12:21:36 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Tue, 05 Jun 2007 14:21:36 +0200
Subject: [BioPython] Cannot parse GenBank file
Message-ID: <46655550.70400@ribosome.natur.cuni.cz>

Hi,
I am trying to parse a GenBank file created by ApE plasmid editor (see Google for details) with biopython-1.43 and I get:

>>> fhandle = open('/mnt/smartmedia/pim-1/pGL3R.gb')
>>> genbank entry = parser.parse(fhandle)
  File "<stdin>", line 1
    genbank entry = parser.parse(fhandle)
                ^
SyntaxError: invalid syntax
>>> genbank_entry = parser.parse(fhandle)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line assert False, \ AssertionError: Did not recognise the LOCUS line layout: LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >>> Is the number of spaces wrong? Thanks for clues, Martin -------------- next part -------------- A non-text attachment was scrubbed... Name: pGL3R.gb.zip Type: application/zip Size: 3713 bytes Desc: not available URL: From ezequiel.panepucci at psi.ch Tue Jun 5 14:02:10 2007 From: ezequiel.panepucci at psi.ch (Ezequiel Panepucci) Date: Tue, 5 Jun 2007 16:02:10 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: <46655550.70400@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> Message-ID: > genbank entry = parser.parse(fhandle) there is a space character between "genbank" and "entry". It is a syntax error. I suppose you meant "genbank_entry" ? cheers, Zac From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 14:04:20 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Tue, 05 Jun 2007 16:04:20 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: References: <46655550.70400@ribosome.natur.cuni.cz> Message-ID: <46656D64.7010508@ribosome.natur.cuni.cz> Ezequiel Panepucci wrote: >> genbank entry = parser.parse(fhandle) > > there is a space character between "genbank" and "entry". > It is a syntax error. > I suppose you meant "genbank_entry" ? Yes, the next command was right and has shown the error. Sorry, I forgot to delete the first attempt. ;-) >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? 
File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line assert False, \ AssertionError: Did not recognise the LOCUS line layout: LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >>> Martin From biopython at maubp.freeserve.co.uk Tue Jun 5 14:46:23 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 05 Jun 2007 15:46:23 +0100 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <46655550.70400@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> Message-ID: <4665773F.2070108@maubp.freeserve.co.uk> Martin MOKREJ? wrote: > Hi, > I am trying to parse a GenBank file created by ApE plasmid editor (see > Google for details) with biopython-1.43 and I get: ... > AssertionError: Did not recognise the LOCUS line layout: > LOCUS 6499 bp ds-DNA linear 02-AUG-2006 > > Is the number of spaces wrong? Yes - fields don't line up with either of the GenBank variants Biopython expects. I suspect their files doesn't follow the current NCBI standard for the locus line... Could you make a set of different files (for different sequences) and check if the spacing changes or is preserved? Thanks Martin, Peter From cjfields at uiuc.edu Tue Jun 5 15:28:24 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 10:28:24 -0500 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: <46656D64.7010508@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> Message-ID: <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> Martin, The example file you give in the bioperl bugzilla report has several blank annotation lines which may lead to additional problems. 
When the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM, DEFINITION, etc) then it expects there will also be relevant data (text descriptions) accompanying it; I assume the BioPython parser expects likewise though I may be wrong. AFAIK the inclusion of field names w/o text isn't GenBank/EMBL- compliant. GenBank records lacking text either have a '.' instead or are left out entirely: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html We could add a fix but you should probably contact the ApE developers and request that field names w/o text be left out or have '.' added. chris On Jun 5, 2007, at 9:04 AM, Martin MOKREJ? wrote: > Ezequiel Panepucci wrote: >>> genbank entry = parser.parse(fhandle) >> >> there is a space character between "genbank" and "entry". >> It is a syntax error. >> I suppose you meant "genbank_entry" ? > > Yes, the next command was right and has shown the error. Sorry, I > forgot > to delete the first attempt. ;-) > >>>> genbank_entry = parser.parse(fhandle) > Traceback (most recent call last): > File "", line 1, in ? > File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", > line 187, in parse > self._scanner.feed(handle, self._consumer) > File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line 360, in feed > self._feed_first_line(consumer, self.line) > File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line 835, in _feed_first_line > assert False, \ > AssertionError: Did not recognise the LOCUS line layout: > LOCUS 6499 bp ds-DNA linear 02-AUG-2006 > >>>> > > Martin > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jun 5 16:07:41 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 11:07:41 -0500 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> Message-ID: One thing I missed which explains the biopython error: the LOCUS line is missing the locus identifier (see the NCBI example record link). This doesn't choke the bioperl parser but it appears to stop the biopython parser in it's tracks (maybe a feature instead of a bug!). You should try adding a unique identifier (maybe the name of the file or record) to the LOCUS line to see if it works: LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 The bioperl parser in CVS writes out the correct alphabet when this is added: LOCUS testfile 6499 bp ds-DNA linear 02- AUG-2006 I'll try adding a warning to the bioperl parser for this. chris On Jun 5, 2007, at 10:28 AM, Chris Fields wrote: > Martin, > > The example file you give in the bioperl bugzilla report has several > blank annotation lines which may lead to additional problems. When > the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM, > DEFINITION, etc) then it expects there will also be relevant data > (text descriptions) accompanying it; I assume the BioPython parser > expects likewise though I may be wrong. > > AFAIK the inclusion of field names w/o text isn't GenBank/EMBL- > compliant. GenBank records lacking text either have a '.' instead or > are left out entirely: > > http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html > > We could add a fix but you should probably contact the ApE developers > and request that field names w/o text be left out or have '.' added. > > chris > > On Jun 5, 2007, at 9:04 AM, Martin MOKREJ? 
wrote: > >> Ezequiel Panepucci wrote: >>>> genbank entry = parser.parse(fhandle) >>> >>> there is a space character between "genbank" and "entry". >>> It is a syntax error. >>> I suppose you meant "genbank_entry" ? >> >> Yes, the next command was right and has shown the error. Sorry, I >> forgot >> to delete the first attempt. ;-) >> >>>>> genbank_entry = parser.parse(fhandle) >> Traceback (most recent call last): >> File "", line 1, in ? >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", >> line 187, in parse >> self._scanner.feed(handle, self._consumer) >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >> line 360, in feed >> self._feed_first_line(consumer, self.line) >> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >> line 835, in _feed_first_line >> assert False, \ >> AssertionError: Did not recognise the LOCUS line layout: >> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >> >>>>> >> >> Martin >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 15:24:05 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Tue, 05 Jun 2007 17:24:05 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665773F.2070108@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> Message-ID: <46658015.6030506@ribosome.natur.cuni.cz> Peter wrote: > Martin MOKREJ? wrote: >> Hi, >> I am trying to parse a GenBank file created by ApE plasmid editor >> (see Google for details) with biopython-1.43 and I get: > > ... > >> AssertionError: Did not recognise the LOCUS line layout: >> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >> >> Is the number of spaces wrong? > > Yes - fields don't line up with either of the GenBank variants Biopython > expects. I suspect their files doesn't follow the current NCBI standard > for the locus line... > > Could you make a set of different files (for different sequences) and > check if the spacing changes or is preserved? OK, two types of errors, the first case is caused by files generated by VectorNTI, the second type of error is caused by ApE editor-produced files: >>> fhandle = open('/mnt/smartmedia/utrophinA/p-cmvbGalCAT.gb','r') >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? 
File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed self._feed_header_lines(consumer, self.parse_header()) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 967, in _feed_header_lines getattr(consumer, consumer_dict[line_type])(data) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 409, in source if content[-1] == '.': IndexError: string index out of range >>> >>> fhandle = open('/mnt/smartmedia/nrf/ok/PBCRLucPFLuc.gb','r') >>> genbank_entry = parser.parse(fhandle) Traceback (most recent call last): File "", line 1, in ? File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line assert False, \ AssertionError: Did not recognise the LOCUS line layout: LOCUS 6988 bp ds-DNA linear 20-DEC-2006 >>> I would appreciate if you could tell me then what was exactly wrong with the generated files by ApE editor (author Cc:ed). Hope this helps, Martin -------------- next part -------------- A non-text attachment was scrubbed... 
Name: genbank-formatted-testcases.zip Type: application/zip Size: 32571 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Tue Jun 5 18:29:52 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 05 Jun 2007 19:29:52 +0100 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <46658015.6030506@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> Message-ID: <4665ABA0.3060500@maubp.freeserve.co.uk> Hi Wayne & all the Biopython mailing list, Martin has been trying to parse some GenBank files produced by ApE plasmid editor, and Biopython (and BioPerl?) don't like them. Hopefully between us we can sort this out :) By the way - Is the current ApE plasmid editor webpage here, because it times out for me?: http://www.biology.utah.edu/jorgensen/wayned/ape/ Martin MOKREJ? wrote: > I would appreciate if you could tell me then what was exactly wrong with > the generated files by ApE editor (author Cc:ed). OK then, looking at file elh/pNEX3.gb which starts: LOCUS 2981 bp ds-DNA linear 12-OCT-2006 DEFINITION ACCESSION VERSION SOURCE ORGANISM COMMENT COMMENT ApEinfo:methylated:1 FEATURES Location/Qualifiers misc_feature 225..257 /ApEinfo_label=pNEX3-compatibile ... I think the location of the size (2981 bp), sequence type (ds-DNA, linear) and date (12-OCT-2006) are not in the correct positions (i.e. column numbers). Also the locus ID is missing, which is not ideal. Trying to do examples in an email is tricky as the line wrapping spoils the effect. Interestingly all these files seem to have their LOCUS line fields in the same place - perhaps the ApE plasmid editor is following an out of date version of the GenBank file format which I haven't seen before? If so, we (Biopython) should be able to deal with this too. 
For the current version of the LOCUS line spec, see: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt In particular: > The detailed format for the LOCUS line format is as follows: > > Positions Contents > --------- -------- > 01-05 'LOCUS' > 06-12 spaces > 13-28 Locus name > 29-29 space > 30-40 Length of sequence, right-justified > 41-41 space > 42-43 bp > 44-44 space > 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or > ms- (mixed-stranded) > 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), > mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA, > snoRNA. Left justified. > 54-55 space > 56-63 'linear' followed by two spaces, or 'circular' > 64-64 space > 65-67 The division code (see Section 3.3) > 68-68 space > 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) Note that the proteins variant "GenPept" is slightly different. The next six lines of that example file (elh/pNEX3.gb) have no values - as Chris Fields pointed out on the Biopython mailing list, the NCBI likes to use a dot/period as a place holder. The spec does explicitly say that the KEYWORDS can be omitted, but seems to assume the other lines are expected. Biopython should be happy if these lines are just omitted. 
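The column layout quoted above from gbrel.txt can be sketched as a small fixed-width reader. This is only an illustration of the column-specific approach, not Biopython's actual Scanner code; the names `LOCUS_COLUMNS`, `parse_locus_by_column` and `format_locus`, and the division code used in the usage lines, are assumptions made for the example.

```python
# Sketch of column-specific LOCUS parsing, using the positions quoted
# from gbrel.txt (1-based, inclusive). Hypothetical helper names; this is
# not the code in Bio/GenBank/Scanner.py.
LOCUS_COLUMNS = {
    "name": (13, 28),
    "length": (30, 40),
    "units": (42, 43),
    "strand": (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date": (69, 79),
}


def parse_locus_by_column(line):
    """Slice a LOCUS line at the fixed columns and return a dict of fields."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line: %r" % line)
    # Pad short lines so every slice is well defined.
    padded = line.rstrip("\n").ljust(79)
    fields = {}
    for key, (start, end) in LOCUS_COLUMNS.items():
        fields[key] = padded[start - 1:end].strip()
    # A line whose fields are shifted (old layout, missing locus name)
    # shows up here as a non-numeric length.
    if not fields["length"].isdigit():
        raise ValueError("did not recognise the LOCUS line layout: %r" % line)
    fields["length"] = int(fields["length"])
    return fields


def format_locus(name, length, strand, molecule, topology, division, date):
    """Build a LOCUS line whose fields sit in the columns listed above."""
    return "LOCUS" + " " * 7 + "%-16s %11d bp %-3s%-6s  %-8s %-3s %s" % (
        name, length, strand, molecule, topology, division, date)


# Usage: round-trip a well-formed line (the division code is illustrative).
line = format_locus("testfile", 6499, "ds-", "DNA", "linear", "SYN", "02-AUG-2006")
print(parse_locus_by_column(line))
```

A line with the fields in other columns, such as the ApE one shown earlier, fails the length check here, which is the kind of brittleness the token-based suggestion is meant to avoid.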
See also: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html > Hope this helps, You might have upset some people by emailing an attachment to the entire Biopython mailing list, but it wasn't too big at least ;) Regards, Peter From mmokrejs at ribosome.natur.cuni.cz Tue Jun 5 18:57:14 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Tue, 05 Jun 2007 20:57:14 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665ABA0.3060500@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> Message-ID: <4665B20A.605@ribosome.natur.cuni.cz> Hi Peter, Chris and others, here I am passing the answer from Wayne back, sorry for the difficult cross-communication. Chris, I hope you will update the bioperl bug I have opened on this once it is clearer. I do not know whether Wayne will have enough time to answer all your comments, on email lists and in bugzilla. Few days ago he said they do some organize a meeting, so ... Anyway, official answer: Wayne Davis wrote: > locus line I'm using is the old standard (some older parsers wanted it > that way). > I've updated to write the new standard, if your program isn't flexible > enough to read the old style locus lines. We'll see if anyone is using > the older parsers still. > from the document laying out the new standard: > > We encourage software developers to switch to a token-based LOCUS parsing > approach, rather than a column-specific approach. If this is done, then future > changes to the LOCUS line that affect only the spacing of its data values will > > not require any modifications to software. > > > > > I've made the default behavior to put "." in the empty fields. I left > those fields there because there are other parsers that require them. 
> In my new version you can change the default genbank record values by > adding a line to your preferences file like this: > empty_genbank_header{LOCUS } {} {DEFINITION } {.} > {ACCESSION } {.} {VERSION } {.} {SOURCE } {.} { ORGANISM } {.} > > or > empty_genbank_header{LOCUS } {} > > > My access to our web server is temporarily unavailable, but I'll post > the update as soon as I can. Martin Peter wrote: > Hi Wayne & all the Biopython mailing list, > > Martin has been trying to parse some GenBank files produced by ApE > plasmid editor, and Biopython (and BioPerl?) don't like them. > > Hopefully between us we can sort this out :) > > By the way - Is the current ApE plasmid editor webpage here, because it > times out for me?: > > http://www.biology.utah.edu/jorgensen/wayned/ape/ > > Martin MOKREJ? wrote: >> I would appreciate if you could tell me then what was exactly wrong >> with the generated files by ApE editor (author Cc:ed). > > OK then, looking at file elh/pNEX3.gb which starts: > > LOCUS 2981 bp ds-DNA linear 12-OCT-2006 > DEFINITION > ACCESSION > VERSION > SOURCE > ORGANISM > COMMENT > COMMENT ApEinfo:methylated:1 > FEATURES Location/Qualifiers > misc_feature 225..257 > /ApEinfo_label=pNEX3-compatibile > ... > > I think the location of the size (2981 bp), sequence type (ds-DNA, > linear) and date (12-OCT-2006) are not in the correct positions (i.e. > column numbers). Also the locus ID is missing, which is not ideal. > Trying to do examples in an email is tricky as the line wrapping spoils > the effect. > > Interestingly all these files seem to have their LOCUS line fields in > the same place - perhaps the ApE plasmid editor is following an out of > date version of the GenBank file format which I haven't seen before? If > so, we (Biopython) should be able to deal with this too. 
> > For the current version of the LOCUS line spec, see: > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > > In particular: >> The detailed format for the LOCUS line format is as follows: >> >> Positions Contents >> --------- -------- >> 01-05 'LOCUS' >> 06-12 spaces >> 13-28 Locus name >> 29-29 space >> 30-40 Length of sequence, right-justified >> 41-41 space >> 42-43 bp >> 44-44 space >> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or >> ms- (mixed-stranded) >> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), >> mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA, >> snoRNA. Left justified. >> 54-55 space >> 56-63 'linear' followed by two spaces, or 'circular' >> 64-64 space >> 65-67 The division code (see Section 3.3) >> 68-68 space >> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > Note that the proteins variant "GenPept" is slightly different. > > The next six lines of that example file (elh/pNEX3.gb) have no values - > as Chris Fields pointed out on the Biopython mailing list, the NCBI > likes to use a dot/period as a place holder. > > The spec does explicitly say that the KEYWORDS can be omitted, but seems > to assume the other lines are expected. Biopython should be happy if > these lines are just omitted. > > See also: > http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html > >> Hope this helps, > > You might have upset some people by emailing an attachment to the entire > Biopython mailing list, but it wasn't too big at least ;) > > Regards, > > Peter > > > -- Dr. Martin Mokrejs Dept. 
of Genetics and Microbiology Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From cjfields at uiuc.edu Tue Jun 5 19:55:29 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 14:55:29 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665B20A.605@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> Message-ID: <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> On Jun 5, 2007, at 1:57 PM, Martin MOKREJ? wrote: > Hi Peter, Chris and others, > here I am passing the answer from Wayne back, sorry for the > difficult > cross-communication. Chris, I hope you will update the bioperl bug > I have > opened on this once it is clearer. I do not know whether Wayne will > have > enough time to answer all your comments, on email lists and in > bugzilla. > Few days ago he said they do some organize a meeting, so ... Anyway, > official answer: > > Wayne Davis wrote: >> locus line I'm using is the old standard (some older parsers >> wanted it >> that way). >> I've updated to write the new standard, if your program isn't >> flexible >> enough to read the old style locus lines. We'll see if anyone is >> using >> the older parsers still. >> from the document laying out the new standard: >> >> We encourage software developers to switch to a token-based LOCUS >> parsing >> approach, rather than a column-specific approach. If this is done, >> then future >> changes to the LOCUS line that affect only the spacing of its data >> values will >> >> not require any modifications to software. >> >> >> >> >> I've made the default behavior to put "." in the empty fields. I left >> those fields there because there are other parsers that require them. 
>> In my new version you can change the default genbank record values by >> adding a line to your preferences file like this: >> empty_genbank_header{LOCUS } {} {DEFINITION } {.} >> {ACCESSION } {.} {VERSION } {.} {SOURCE } {.} >> { ORGANISM } {.} >> >> or >> empty_genbank_header{LOCUS } {} >> >> >> My access to our web server is temporarily unavailable, but I'll post >> the update as soon as I can. > > Martin The bioperl parser doesn't rely on the exact spacing and uses a tokenized approach. It does rely on the presence of the LOCUS line and a locus name in that line (which Martin's sequence record lacks). Acc. to the release notes the locus name is then followed by the sequence length, 'bp' or 'aa', and the rest. As might be guessed, the lack of a locus name is probably the major source of headaches here. Note that the presence of the locus name appears to be required according to the GenBank release notes. There is no optional designation for the LOCUS line (it is mandatory as stated in sec. 3.4.2), and the locus name appears in the line for all records (sec. 3.5.4). I could argue that errors encountered parsing a record lacking a locus name are actually features (albeit horribly documented ones). I have added a warning which catches less than six tokens on the line, but I don't see the point of going beyond that w/ o descending into tokenizing oblivion (is it an accession, if not is it the length, if not ....) when the initial source of the problem is a badly formatted line in a sequence record. 
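The token-based approach described above might look roughly like this in Python. It is a sketch under assumptions, not the BioPerl implementation: the function name `parse_locus_tokens` is hypothetical, the "fewer than six fields" warning is counted after the LOCUS keyword, and the check for a missing locus name is one small heuristic of the kind the message above cautions against taking too far.

```python
import warnings


def parse_locus_tokens(line):
    """Split a LOCUS line on whitespace instead of relying on columns.

    Hypothetical helper: expects 'LOCUS name length bp|aa ... date' and
    tolerates a missing name by checking where the length token sits.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    rest = tokens[1:]
    if len(rest) < 6:
        warnings.warn("LOCUS line has only %d fields; the locus name "
                      "may be missing" % len(rest))
    if len(rest) >= 2 and rest[0].isdigit() and rest[1] in ("bp", "aa"):
        # No locus name: the line starts directly with the length.
        name, length, units, remainder = None, rest[0], rest[1], rest[2:]
    elif len(rest) >= 3 and rest[1].isdigit() and rest[2] in ("bp", "aa"):
        name, length, units, remainder = rest[0], rest[1], rest[2], rest[3:]
    else:
        raise ValueError("cannot find name and length in: %r" % line)
    return {
        "name": name,
        "length": int(length),
        "units": units,
        "other": remainder[:-1],  # strand/molecule, topology, division...
        "date": remainder[-1] if remainder else None,
    }


# Usage: a line with a locus name, in any spacing.
print(parse_locus_tokens("LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006"))
```

Fields that may themselves contain white space are not handled here; the sketch only covers the simple case discussed in this thread.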
chris From biopython at maubp.freeserve.co.uk Tue Jun 5 19:58:46 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 05 Jun 2007 20:58:46 +0100 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665B20A.605@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> Message-ID: <4665C076.20408@maubp.freeserve.co.uk> Martin MOKREJŠ wrote: > Hi Peter, Chris and others, here I am passing the answer from Wayne > back, sorry for the difficult cross-communication. Thank you both, Martin & Wayne. Wayne Davis wrote: > [the] locus line I'm using is the old standard (some older parsers > wanted it that way). That's worth knowing - thank you. Given that, maybe we (Biopython) should try to parse these files (which, aside from the missing identifier in the LOCUS line, should be fairly simple). On the other hand, I doubt many people still use this particular old format. Wayne Davis wrote: >> I've updated to write the new standard, if your >> program isn't flexible enough to read the old style locus lines. That's good news. Martin - will this solve your problem, or do you think we should also update Biopython to cope with these "old style" LOCUS lines (which also lack identifiers)? Wayne Davis wrote: >> We encourage software developers to switch to a token-based LOCUS >> parsing approach, rather than a column-specific approach. If this >> is done, then future changes to the LOCUS line that affect only the >> spacing of its data values will not require any modifications to >> software. Easier said than done, as some fields can also contain white space. However, Howard Salis has some interesting code to tackle this attached to Biopython bug 2294.
Peter wrote: >> The next six lines of that example file (elh/pNEX3.gb) have no >> values - as Chris Fields pointed out on the Biopython mailing list, >> the NCBI likes to use a dot/period as a place holder. >> >> The spec does explicitly say that the KEYWORDS can be omitted, but >> seems to assume the other lines are expected. Biopython should be >> happy if these lines are just omitted. Just to correct myself, many of those fields are described as mandatory single entries further up in the documentation - so using a dot/period (as Wayne has done for the ApE plasmid editor) does seem the best solution. Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > 3.4.2 Entry Organization > ... > The following is a brief description of each entry field. Detailed > information about each field may be found in Sections 3.4.4 to 3.4.15. > > LOCUS ... Mandatory keyword/exactly one record. > DEFINITION ... Mandatory keyword/one or more records. > ACCESSION ... Mandatory keyword/one or more records. > VERSION... Mandatory keyword/exactly one record. > ... KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated entries (so not mandatory in general). COMMENT is optional. Peter From cjfields at uiuc.edu Tue Jun 5 20:28:08 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 15:28:08 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> Message-ID: <87AC6A64-D329-4BCF-B868-7035AD3A2D6F@uiuc.edu> On Jun 5, 2007, at 2:58 PM, Peter wrote: ... > Easier said than done, as some fields can also contain white space. > However, Howard Salis has some interesting code to tackle this > attached > to Biopython bug 2294. 
The bioperl parser simply splits the data on white space. The first three tokens (not counting the LOCUS keyword) are always the locus name, the seq length, and 'bp' or 'aa' (which we use to determine the alphabet); that order seems to go back to GenBank release 100 (1997): ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb100.release.notes The next few fluctuate depending on the release or sequence type, but the division and date are always last. I don't think we require a division code to be present, but I'm not sure. > Peter wrote: >>> The next six lines of that example file (elh/pNEX3.gb) have no >>> values - as Chris Fields pointed out on the Biopython mailing list, >>> the NCBI likes to use a dot/period as a place holder. >>> >>> The spec does explicitly say that the KEYWORDS can be omitted, but >>> seems to assume the other lines are expected. Biopython should be >>> happy if these lines are just omitted. > > Just to correct myself, many of those fields are described as > mandatory > single entries further up in the documentation - so using a dot/period > (as Wayne has done for the ApE plasmid editor) does seem the best > solution. > > Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt >> 3.4.2 Entry Organization >> ... >> The following is a brief description of each entry field. Detailed >> information about each field may be found in Sections 3.4.4 to >> 3.4.15. >> >> LOCUS ... Mandatory keyword/exactly one record. >> DEFINITION ... Mandatory keyword/one or more records. >> ACCESSION ... Mandatory keyword/one or more records. >> VERSION... Mandatory keyword/exactly one record. >> ... > > KEYWORDS, SOURCE and ORGANISM are described as mandatory in all > annotated > entries (so not mandatory in general). COMMENT is optional. > > Peter Probably something we should look into and correct as well. We don't require those fields for parsing, but they should be present in output sequence records, strictly speaking.
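As a workaround on the reading side, the empty header lines discussed above could be given the '.' placeholder before a file is handed to a parser. The following is a minimal sketch, assuming the usual 12-column keyword field of the GenBank flat file; `fill_empty_header_fields` and `HEADER_KEYWORDS` are hypothetical names and this is not part of Biopython or BioPerl.

```python
# Keywords that ApE wrote without any text in the files discussed here,
# plus KEYWORDS. Hypothetical pre-processing step, not library code.
HEADER_KEYWORDS = ("DEFINITION", "ACCESSION", "VERSION",
                   "KEYWORDS", "SOURCE", "ORGANISM")


def fill_empty_header_fields(lines):
    """Return the lines with '.' added to header keywords that have no text.

    Only lines before FEATURES/ORIGIN are touched; everything else is
    passed through unchanged (minus the trailing newline).
    """
    out = []
    in_header = True
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("FEATURES") or stripped.startswith("ORIGIN"):
            in_header = False
        if in_header and stripped in HEADER_KEYWORDS:
            # Keep the indentation (ORGANISM is indented under SOURCE) and
            # pad the keyword so the '.' lands in column 13.
            indent = line[:len(line) - len(line.lstrip())]
            line = "%s%s." % (indent, stripped.ljust(12 - len(indent)))
        out.append(line.rstrip("\n"))
    return out


# Usage: header modelled on the empty fields reported in this thread.
record = [
    "DEFINITION",
    "ACCESSION",
    "VERSION",
    "SOURCE",
    "  ORGANISM",
    "COMMENT",
    "FEATURES             Location/Qualifiers",
]
for fixed in fill_empty_header_fields(record):
    print(fixed)
```

The optional COMMENT line is deliberately left alone, since the thread only describes the other keywords as mandatory.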
chris From biopython at maubp.freeserve.co.uk Tue Jun 5 21:11:36 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 05 Jun 2007 22:11:36 +0100 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> Message-ID: <4665D188.9070202@maubp.freeserve.co.uk> Chris Fields wrote: > Note that the presence of the locus name appears to be required > according to the GenBank release notes. There is no optional > designation for the LOCUS line (it is mandatory as stated in sec. > 3.4.2), and the locus name appears in the line for all records (sec. > 3.5.4). I agree that valid GenBank files should indeed have a locus name in the LOCUS line. If it doesn't cause too many issues, then maybe we should allow such files as input. Having just gone over the Biopython code, if the locus name is missing but there is nothing else wrong with the LOCUS line, Biopython will give a slightly cryptic AssertionError, "Cannot parse the name and length in the LOCUS line" I could make the parser cope with missing locus names, but on reflection, that may just cause worse problems further downstream (e.g. trying to index the file). One option is to auto-generate an identifier... Lets wait and see what Wayne's new version of ApE plasmid editor outputs for "GenBank format" - maybe he will include some sort of locus name. 
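The option of auto-generating an identifier could be sketched as follows. This is an assumption about how it might be done, not existing Biopython behaviour: `make_locus_name` is a hypothetical helper that derives a name from the file name, trims it to the 16 columns the LOCUS spec allows for the locus name, and adds a numeric suffix so that several records do not end up with the same name.

```python
import os
import re


def make_locus_name(filename, used, max_len=16):
    """Derive a unique locus name from a file name.

    'used' is a set of names already handed out; it is updated in place.
    max_len=16 follows the width of the locus-name field (columns 13-28).
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    # Replace anything that is not a letter, digit or underscore, so the
    # name stays a single whitespace-free token.
    stem = re.sub(r"\W", "_", stem)[:max_len] or "unnamed"
    candidate, n = stem, 1
    while candidate in used:
        suffix = "_%d" % n
        candidate = stem[:max_len - len(suffix)] + suffix
        n += 1
    used.add(candidate)
    return candidate


# Usage: names for two of the files mentioned in this thread.
seen = set()
for path in ("/mnt/smartmedia/nrf/ok/PBCRLucPFLuc.gb",
             "/mnt/smartmedia/utrophinA/p-cmvbGalCAT.gb"):
    print(make_locus_name(path, seen))
```

Using the file name keeps the generated identifier meaningful to the user, while the suffix guards against clashes when several records come from files with the same stem.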
Peter From cjfields at uiuc.edu Tue Jun 5 21:46:07 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Jun 2007 16:46:07 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665D188.9070202@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> <4665D188.9070202@maubp.freeserve.co.uk> Message-ID: <803EBA98-4521-44DD-A2C4-173E552AC2E0@uiuc.edu> On Jun 5, 2007, at 4:11 PM, Peter wrote: > Chris Fields wrote: >> Note that the presence of the locus name appears to be required >> according to the GenBank release notes. There is no optional >> designation for the LOCUS line (it is mandatory as stated in sec. >> 3.4.2), and the locus name appears in the line for all records >> (sec. 3.5.4). > > I agree that valid GenBank files should indeed have a locus name in > the LOCUS line. If it doesn't cause too many issues, then maybe we > should allow such files as input. > > Having just gone over the Biopython code, if the locus name is > missing but there is nothing else wrong with the LOCUS line, > Biopython will give a slightly cryptic AssertionError, "Cannot > parse the name and length in the LOCUS line" > > I could make the parser cope with missing locus names, but on > reflection, that may just cause worse problems further downstream > (e.g. trying to index the file). One option is to auto-generate an > identifier... > > Lets wait and see what Wayne's new version of ApE plasmid editor > outputs for "GenBank format" - maybe he will include some sort of > locus name. > > Peter In BioPerl you can optionally pass in a custom generator (specifically a code reference) to generate the LOCUS, ACCESSION, VERSION, and KEYWORD lines if needed. 
You might be able to do something similar for your parser, though I'm not yet familiar with Python enough to work out how... chris From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 14:26:44 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:26:44 +0200 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> Message-ID: <466815A4.9060505@ribosome.natur.cuni.cz> Hi, Chris Fields wrote: > One thing I missed which explains the biopython error: the LOCUS line is > missing the locus identifier (see the NCBI example record link). This > doesn't choke the bioperl parser but it appears to stop the biopython > parser in it's tracks (maybe a feature instead of a bug!). > > You should try adding a unique identifier (maybe the name of the file or > record) to the LOCUS line to see if it works: > > LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 > > The bioperl parser in CVS writes out the correct alphabet when this is > added: > > LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 > > I'll try adding a warning to the bioperl parser for this. I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305 but let me emphasize the LOCUS line now contains LOCUS pRL 5428 bp ds-DNA linear 07-JUN-2007 which still does not comply with the line you have proposed. But it can be parsed by bioperl-live from cvs. Is it still wrong? Testcase as pRL.gb-new in the bugzilla record #2305. Martin > > chris > > On Jun 5, 2007, at 10:28 AM, Chris Fields wrote: > >> Martin, >> >> The example file you give in the bioperl bugzilla report has several >> blank annotation lines which may lead to additional problems.
When >> the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM, >> DEFINITION, etc) then it expects there will also be relevant data >> (text descriptions) accompanying it; I assume the BioPython parser >> expects likewise though I may be wrong. >> >> AFAIK the inclusion of field names w/o text isn't GenBank/EMBL- >> compliant. GenBank records lacking text either have a '.' instead or >> are left out entirely: >> >> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html >> >> We could add a fix but you should probably contact the ApE developers >> and request that field names w/o text be left out or have '.' added. >> >> chris >> >> On Jun 5, 2007, at 9:04 AM, Martin MOKREJ? wrote: >> >>> Ezequiel Panepucci wrote: >>>>> genbank entry = parser.parse(fhandle) >>>> >>>> there is a space character between "genbank" and "entry". >>>> It is a syntax error. >>>> I suppose you meant "genbank_entry" ? >>> >>> Yes, the next command was right and has shown the error. Sorry, I >>> forgot >>> to delete the first attempt. ;-) >>> >>>>>> genbank_entry = parser.parse(fhandle) >>> Traceback (most recent call last): >>> File "", line 1, in ? >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", >>> line 187, in parse >>> self._scanner.feed(handle, self._consumer) >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >>> line 360, in feed >>> self._feed_first_line(consumer, self.line) >>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", >>> line 835, in _feed_first_line >>> assert False, \ >>> AssertionError: Did not recognise the LOCUS line layout: >>> LOCUS 6499 bp ds-DNA linear 02-AUG-2006 >>> >>>>>> >>> >>> Martin >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. 
Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > > -- Dr. Martin Mokrejs Dept. of Genetics and Microbiology Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 14:44:17 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:44:17 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> Message-ID: <466819C1.9010203@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Martin MOKREJ? wrote: >> Hi Peter, Chris and others, here I am passing the answer from Wayne >> back, sorry for the difficult cross-communication. > > Thank you both, Martin & Wayne. > > Wayne Davis wrote: >> [the] locus line I'm using is the old standard (some older parsers > > wanted it that way). > > That's worth knowing - thank you. Give that, maybe we (Biopython) > should try and parse these files (which aside from the missing > identifier in the LOCUS line should be fairly simple). On the other > hand, I doubt many people still use this particular the old format. > > Wayne Davis wrote: >>> I've updated to write the new standard, if your >>> program isn't flexible enough to read the old style locus lines. > > That's good news. 
Martin - will this solve your problem, or do you > think we should also update Biopython to cope with these "old style" > LOCUS lines (which also lack identifiers)? I think that if it was ever a valid format it should cope with it. > > Wayne Davis wrote: >>> We encourage software developers to switch to a token-based LOCUS >>> parsing approach, rather than a column-specific approach. If this >>> is done, then future changes to the LOCUS line that affect only the >>> spacing of its data values will not require any modifications to > >> software. > > Easier said than done, as some fields can also contain white space. > However, Howard Salis has some interesting code to tackle this attached > to Biopython bug 2294. Please follow bug #2305 in bioperl on this as well and see what competitors have done in this regard. ;) > > Peter wrote: >>> The next six lines of that example file (elh/pNEX3.gb) have no >>> values - as Chris Fields pointed out on the Biopython mailing list, >>> the NCBI likes to use a dot/period as a place holder. >>> >>> The spec does explicitly say that the KEYWORDS can be omitted, but >>> seems to assume the other lines are expected. Biopython should be >>> happy if these lines are just omitted. > > Just to correct myself, many of those fields are described as mandatory > single entries further up in the documentation - so using a dot/period > (as Wayne has done for the ApE plasmid editor) does seem the best solution. OK, bioperl can now survive the missing dots, and I think biopython should do the same. If one can fix the problem by adding a default value internally in the parser, why not do it? > > Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt >> 3.4.2 Entry Organization >> ... >> The following is a brief description of each entry field. Detailed >> information about each field may be found in Sections 3.4.4 to 3.4.15. >> >> LOCUS ... Mandatory keyword/exactly one record. DEFINITION ... >> Mandatory keyword/one or more records. 
ACCESSION ... Mandatory >> keyword/one or more records. VERSION... Mandatory keyword/exactly one >> record. ... > > KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated > entries (so not mandatory in general). COMMENT is optional. Martin From mmokrejs at ribosome.natur.cuni.cz Thu Jun 7 14:51:14 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Thu, 07 Jun 2007 16:51:14 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665D188.9070202@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4DA4B9E1-6E8C-49B3-91C5-0B336F8685BC@uiuc.edu> <4665D188.9070202@maubp.freeserve.co.uk> Message-ID: <46681B62.2070403@ribosome.natur.cuni.cz> Peter wrote: > Chris Fields wrote: >> Note that the presence of the locus name appears to be required >> according to the GenBank release notes. There is no optional >> designation for the LOCUS line (it is mandatory as stated in sec. >> 3.4.2), and the locus name appears in the line for all records (sec. >> 3.5.4). > > I agree that valid GenBank files should indeed have a locus name in the > LOCUS line. If it doesn't cause too many issues, then maybe we should > allow such files as input. > > Having just gone over the Biopython code, if the locus name is missing > but there is nothing else wrong with the LOCUS line, Biopython will give > a slightly cryptic AssertionError, "Cannot parse the name and length in > the LOCUS line" > > I could make the parser cope with missing locus names, but on > reflection, that may just cause worse problems further downstream (e.g. > trying to index the file). One option is to auto-generate an identifier... I would vote for that. A number of things will break when the LOCUS is same for multiple records. 
But imagine that I have multiple files with the same LOCUS identifier (a plasmid name); it simply does happen that multiple plasmids of different sequence have the same abbreviated name. I need to stick to their original names as published by the authors in the literature, so I really do have several files with the same LOCUS identifier in the LOCUS line. So, the internal indexing stuff must kick in. > > Let's wait and see what Wayne's new version of ApE plasmid editor outputs > for "GenBank format" - maybe he will include some sort of locus name. It is being fixed now, still some polishing needed. But it will produce GenBank formatted files according to the current standard. Martin BTW I have proposed that the ApE editor derive the LOCUS identifier from the filename by stripping the file extension. From cjfields at uiuc.edu Thu Jun 7 15:31:45 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 7 Jun 2007 10:31:45 -0500 Subject: [BioPython] Cannot parse GenBank file In-Reply-To: <466815A4.9060505@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <46656D64.7010508@ribosome.natur.cuni.cz> <24065CBD-BBF6-4CA3-9523-AD50C524DAE5@uiuc.edu> <466815A4.9060505@ribosome.natur.cuni.cz> Message-ID: <2A403865-F1E8-4D19-8D19-455C22E7C6D9@uiuc.edu> On Jun 7, 2007, at 9:26 AM, Martin MOKREJŠ wrote: > Hi, > > Chris Fields wrote: >> One thing I missed which explains the biopython error: the LOCUS >> line is missing the locus identifier (see the NCBI example record >> link). This doesn't choke the bioperl parser but it appears to >> stop the biopython parser in its tracks (maybe a feature instead >> of a bug!). 
>> You should try adding a unique identifier (maybe the name of the >> file or record) to the LOCUS line to see if it works: >> LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006 >> The bioperl parser in CVS writes out the correct alphabet when >> this is added: >> LOCUS testfile 6499 bp ds-DNA linear 02- >> AUG-2006 >> I'll try adding a warning to the bioperl parser for this. > > I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305 > but let me > emphasize the LOCUS line now contains > LOCUS pRL 5428 bp ds-DNA linear > 07-JUN-2007 > > > which still does not comply with the line you have proposed. But it > can be > parsed by bioperl-live from cvs. Is it still wrong? Testcase as > pRL.gb-new > in the bugzilla record #2305. > > Martin That should work. There isn't a strict uniqueness test (that would require caching and isn't worth the trouble IMHO), though it's required you add something unique for the accession/locus if you plan on indexing them in the future. Parsing GenBank data produced from third-party software is problematic at best; there seems to be no steadfast rule with GenBank output for some programs, even though the specification is plainly stated in the NCBI release notes. My take on that is to have a stricter (read:follows release notes) GenBank parser which passes off the data in the record to default handler methods. A user could then subjugate the defined handlers with their own by subclassing the default handler class and overloading the methods or adding their own code references directly. chris ... 
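A minimal Python sketch of the handler idea described above, for readers who want to see its shape. It assumes a strict default handler class that rejects empty fields, and a user subclass that overrides one method to fill in the '.' placeholder instead. The names StrictHandlers, LenientHandlers and parse_record are hypothetical; they are not part of BioPerl or Biopython.

```python
class StrictHandlers:
    """Hypothetical default handlers: one method per GenBank keyword."""

    def handle_field(self, keyword, text):
        # Strict behaviour (an assumption of this sketch): a keyword
        # that carries no text is treated as an error.
        value = text.strip()
        if not value:
            raise ValueError("%s field has no text" % keyword)
        return value

    def handle_definition(self, text):
        return self.handle_field("DEFINITION", text)

    def handle_source(self, text):
        return self.handle_field("SOURCE", text)


class LenientHandlers(StrictHandlers):
    """User subclass: overrides a single handler, inherits the rest."""

    def handle_definition(self, text):
        # Substitute the '.' placeholder instead of failing on a blank field.
        return text.strip() or "."


def parse_record(lines, handlers):
    """Dispatch each 'KEYWORD text' line to handlers.handle_<keyword>."""
    record = {}
    for line in lines:
        keyword, _, text = line.partition(" ")
        method = getattr(handlers, "handle_" + keyword.lower(), None)
        if method is not None:  # keywords without a handler are skipped
            record[keyword] = method(text)
    return record
```

The parsing loop stays the same in both cases; only the handler object passed in decides how tolerant the parse is.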
From cjfields at uiuc.edu Thu Jun 7 16:42:13 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 7 Jun 2007 11:42:13 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <466819C1.9010203@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> Message-ID: <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> On Jun 7, 2007, at 9:44 AM, Martin MOKREJ? wrote: > Hi Peter, >> ... >> That's good news. Martin - will this solve your problem, or do you >> think we should also update Biopython to cope with these "old style" >> LOCUS lines (which also lack identifiers)? > > I think that if it was ever a valid format it should cope with it. I think it's better to explicitly state that the parser is compliant with a particular GenBank release and can likely parse other similarly formatted GenBank records from third-party software. If the parser chokes on a bad record then you can point out the deficiency in the record and (if possible) try to make it more flexible w/o borking the parser later on. The release notes are there for a good reason! The LOCUS line format, however, has been relatively stable over time. 
Here are the release notes for a GenBank release from late 1992:

ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes

and the LOCUS line is:

Positions  Contents
1-12       LOCUS
13-22      Locus name
23-29      Length of sequence, right-justified
31-32      bp
34-36      Blank, ss- (single-stranded), ds- (double-stranded), or ms- (mixed-stranded)
37-40      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), mRNA (messenger RNA), or uRNA (small nuclear RNA)
43-52      Blank (implies linear) or circular
53-55      The division code (see Section 3.3)
63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

The spacing is more explicitly laid out in later versions. The best part is the Entrez CD order form (clipped out by scissors to be snail-mailed) at the end of the file!

chris

From jlchang at broad.mit.edu Thu Jun 7 20:24:00 2007
From: jlchang at broad.mit.edu (Jean Chang)
Date: Thu, 7 Jun 2007 16:24:00 -0400
Subject: [BioPython] installation question
Message-ID: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu>

Hi,

I'm trying to install biopython-1.43.tar.gz. My mxtexttools is in a non-standard place - I'm guessing I need my PYTHONPATH specified as I'm getting:

*** mxTextTools *** is either not installed or out of date.

I tried installing egenix-mx-base-3.0.0.tar.gz but I'm getting the same error. Did I get _too recent_ a copy of mxtexttools?
(Previously I had egenix-mx-base-2.0.6.tar.gz) Thanks, jean From biopython at maubp.freeserve.co.uk Thu Jun 7 21:28:31 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 07 Jun 2007 22:28:31 +0100 Subject: [BioPython] installation question - mxTextTools In-Reply-To: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu> References: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu> Message-ID: <4668787F.50405@maubp.freeserve.co.uk> Jean Chang wrote: > Hi, > > I'm trying to install biopython-1.43.tar.gz > my mxtexttools is in a non-standard place - I'm guessing I need my > PYTHONPATH specified as I'm getting: > > *** mxTextTools *** is either not installed or out of date. > > I tried installing egenix-mx-base-3.0.0.tar.gz but I'm getting the > same error. Did I get _too recent_ a copy of mxtexttools? > > (Previously I had egenix-mx-base-2.0.6.tar.gz) I assume you are trying to install this on Linux? Did things work using egenix-mx-base-2.0.6.tar.gz? I use Ubuntu Dapper Drake, and used apt-get to install the package version 2.0.6 of the python-egenix-mxtexttools package. Peter From mmokrejs at ribosome.natur.cuni.cz Fri Jun 8 10:31:36 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 08 Jun 2007 12:31:36 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> Message-ID: <46693008.7040202@ribosome.natur.cuni.cz> Chris Fields wrote: > > On Jun 7, 2007, at 9:44 AM, Martin MOKREJ? wrote: > >> Hi Peter, >>> ... >>> That's good news. 
Martin - will this solve your problem, or do you >>> think we should also update Biopython to cope with these "old style" >>> LOCUS lines (which also lack identifiers)? >> >> I think that if it was ever a valid format it should cope with it. > > I think it's better to explicitly state that the parser is compliant > with a particular GenBank release and can likely parse other similarly > formatted GenBank records from third-party software. If the parser > chokes on a bad record then you can point out the deficiency in the > record and (if possible) try to make it more flexible w/o borking the > parser later on. The release notes are there for a good reason! > > The LOCUS line format, however, has been relatively stable over time. > Here are the release notes for a GenBank release from late 1992: > > ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb74.release.notes > > and the LOCUS line is: > > Positions Contents > > 1-12 LOCUS > 13-22 Locus name > 23-29 Length of sequence, right-justified > 31-32 bp > 34-36 Blank, ss- (single-stranded), ds- (double-stranded), or > ms- (mixed-stranded) > 37-40 Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), > mRNA (messenger RNA), or uRNA (small nuclear RNA) > 43-52 Blank (implies linear) or circular > 53-55 The division code (see Section 3.3) > 63-73 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > The spacing is more explicitly laid out in later versions. The best > part is the Entrez CD order form (clipped out by scissors to be > snail-mailed) at the end of the file! In principle I do agree with you, but let me emphasize that I fully agree with Wayne, who wrote to me yesterday that the GenBank format is the way to write down your data, and that we often really do not need all the fields required for data submission into the GenBank database: From the definition of the format (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt), only DEFINITION, KEYWORDS, SOURCE and ORIGIN (if it contains data) lines end with a period. 
The periods should be added to the ends of non-period containing lines for those fields only. That is where ApE doesn't conform to the file definition. I put in the fields DEFINITION, ACCESSION, VERSION, SOURCE, and ORGANISM because those are listed as mandatory by the release notes. Looks like I missed that REFERENCES is also mandatory.

The release notes do not say that the fields must contain data or that they must end with a period (except where noted above). I put them in, figuring that a parser that was working from the file definition might require those fields to be present. It seems like a well written parser could handle null data in the field better than handling the absence of an explicitly required field, since there is nothing in the standard that states what data, if any, must be present, but there is an explicit requirement for the field.

Ok, I'll add an option to take out blank fields (even though that will break compliance with the definition, as I understand it). One could interpret the file standard as only applying to files intended for use in the NCBI database, so the required fields are only an issue for entering into their database, not for file parsers. Working on ApE isn't what I really do, so I might not get around to it immediately.

Still, while I acknowledge that ApE has been writing files that do not comply completely with the standard (needing the required periods on the end of some of the mandatory fields), your parser should be able to handle null data lines and spaces in the locus name.

For parsing the locus info here is the tcl regexp that I use ($a is the full LOCUS line, x returns the full matched line):

regexp {LOCUS (.*) ([0-9]*) bp ( |ss-|ds-|ms-)(NA |DNA |RNA |tRNA |rRNA |mRNA |uRNA |snRNA |snoRNA)[ ]*(linear |circular| )[ ]*([ A-Z]{3})[ ]*(..-...-....)} $a x name size stranded type circular div date

You have to do a trim on the name that you get out of this, since it is space padded, as per the file definition.
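For Python readers, a loose adaptation of the Tcl pattern above might look like the sketch below. It is not a faithful port: the archived copy of the pattern has lost its original runs of spaces, so whitespace is matched with \s+ and \s* here, and the optional fields are made optional groups. The function name parse_locus and the returned dictionary keys are assumptions of this sketch.

```python
import re

# Loose adaptation of the Tcl pattern: field order and alternatives follow
# the original, but the whitespace handling (\s+, \s*) is an assumption.
LOCUS_RE = re.compile(
    r"^LOCUS\s+(?P<name>.*?)\s*(?P<size>[0-9]+) bp\s+"
    r"(?P<stranded>ss-|ds-|ms-)?"
    r"(?P<type>snoRNA|snRNA|tRNA|rRNA|mRNA|uRNA|DNA|RNA|NA)?\s*"
    r"(?P<topology>linear|circular)?\s*"
    r"(?P<div>[A-Z]{3})?\s*"
    r"(?P<date>..-...-....)\s*$"
)


def parse_locus(line):
    """Return the LOCUS fields as a dict, or None if the line does not match.

    A missing locus name comes back as an empty string, so the caller can
    decide whether to reject the record or generate an identifier.
    """
    match = LOCUS_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["name"] = fields["name"].strip()  # the name may be space padded
    fields["size"] = int(fields["size"])
    return fields
```

Because the name group is matched lazily and may be empty, this accepts both a name containing spaces and an old-style line with no name at all.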
Let me know if you see an exception that is a valid LOCUS line but would break this. Martin From cjfields at uiuc.edu Fri Jun 8 12:57:55 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 8 Jun 2007 07:57:55 -0500 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <46693008.7040202@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <466819C1.9010203@ribosome.natur.cuni.cz> <88B5307E-6666-4646-BD0B-B0704D493CA5@uiuc.edu> <46693008.7040202@ribosome.natur.cuni.cz> Message-ID: <2949B571-B73E-4059-B6A6-45E9893DBD26@uiuc.edu> On Jun 8, 2007, at 5:31 AM, Martin MOKREJ? wrote: ... > > In principle I do agree with you but let me emphasize that I fully > agree with Wayne > who wrote me yesterday in the way that the GenBank format is he way > to write down > your data, and we often really do not need all the fields required > for data syubmission > into the Genbank database: ... It does make sense to leave some of those fields out except in cases where they are needed (with the exception of the '.' fields like KEYWORDS), but it never made sense to me to have completely blank fields or leave out the locus name. My guess is that most format parsers don't look for empty fields (or complain when one is encountered) b/c empty fields haven't been encountered before; they were always left out completely. What would work best for all would be optional validation warnings or a separate validation module if one worried about checking compliance issues with GenBank format, something that hasn't happened yet (and I don't have time to code for!). Wayne, I would say use Martin's advice for the locus name (file name w/o extension), and if the field allows '.' 
then add it in, otherwise it's probably easier to leave the blank fields out completely, GenBank compliance or not. There are several questionably compliant files in the genbank test suite in BioPerl so this wouldn't be the first one, and if someone wants a validation system they can try building one until we have time to do it. chris From jlchang at broad.mit.edu Fri Jun 8 15:33:03 2007 From: jlchang at broad.mit.edu (Jean Chang) Date: Fri, 8 Jun 2007 11:33:03 -0400 Subject: [BioPython] installation question - mxTextTools In-Reply-To: <4668787F.50405@maubp.freeserve.co.uk> References: <224B35CB-5B36-48DE-AD9D-B839680AA71A@broad.mit.edu> <4668787F.50405@maubp.freeserve.co.uk> Message-ID: <3487DE4A-0A77-4CE6-906B-00A76B140D1F@broad.mit.edu> thanks to Peter and Ann. It turns out I had overspecified the installation --prefix so mxTextTools was buried deeper than the PYTHONPATH that I had set. Once I realized this and re-installed correctly, the install worked just fine. Regards, Jean From aloraine at gmail.com Sat Jun 9 14:16:41 2007 From: aloraine at gmail.com (Ann Loraine) Date: Sat, 9 Jun 2007 09:16:41 -0500 Subject: [BioPython] question regarding testing suites Message-ID: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com> Dear all, I have a question which I hope you might be able to advise me on: In your experience, which testing frameworks do you think work best for managing and testing python programs and modules? I've looked at the testing code in biopython, which seems to use unittest library -- would you recommend I use this, or do you think there are some other frameworks I should investigate, as well? 
Sincerely, Ann Loraine -- Ann Loraine Assistant Professor University of Alabama at Birmingham http://www.transvar.org 205-996-4155 From dalke at dalkescientific.com Sat Jun 9 14:42:54 2007 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 9 Jun 2007 16:42:54 +0200 Subject: [BioPython] question regarding testing suites In-Reply-To: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com> References: <83722dde0706090716x190e6250o3d440e8c613cff71@mail.gmail.com> Message-ID: <8E844BE0-2D63-48D9-9D0C-32DA25ED2A7A@dalkescientific.com> Hi Ann! (And other BioPython folk) On Jun 9, 2007, at 4:16 PM, Ann Loraine wrote: > would you recommend I use [unittest], or do you think there are > some other frameworks I should investigate, as well? Use "nose" from http://somethingaboutorange.com/mrl/projects/nose/ which lets you develop tests using unittest *and* other ways. There are two aspects you should be aware of: unit tests and and unit test discovery. You gotta write the tests and you gotta run the tests. unittest.py does both. Derive from TestCase, write methods starting with "test_" and the unittest.main() can auto-discover everything. The downsides are "unittest"'s discovery doesn't support: - simple functions (when it's silly to make a class for two lines of code) - doctests - running all/a subset of your unittests across multiple files. The nose system also has support for things like checking execution time and doing coverage tests. Andrew dalke at dalkescientific.com From irene.farabella at gmail.com Sun Jun 10 19:44:00 2007 From: irene.farabella at gmail.com (irene farabella) Date: Sun, 10 Jun 2007 21:44:00 +0200 Subject: [BioPython] info_dictionary In-Reply-To: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com> References: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com> Message-ID: <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com> hi i am a beginner in python. 
from a DSSP file i made a dictionary like this: dict2[AA_num_dssp,chain] = (structure,AA_name_dssp) i need it in my small program. i have never worked with a dictionary that has keys like that. i have to sort the keys of the dictionary... how can i do it? usually when i have only the AA_num i can turn it into a list and then sort it... but in this case...mmm thanks for helping me!! and sorry for the english.. From dalloliogm at gmail.com Mon Jun 11 09:31:03 2007 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 11 Jun 2007 11:31:03 +0200 Subject: [BioPython] info_dictionary In-Reply-To: <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com> References: <709fef6a0706101224g5a1a2a4fg92d23dbb2552b626@mail.gmail.com> <709fef6a0706101244u30aad990u54d70b295ae35355@mail.gmail.com> Message-ID: <5aa3b3570706110231y21a6f4del256432edd02484ca@mail.gmail.com> Hi Irene! If I understand your problem correctly, you have a dictionary in which the keys are created by concatenating many variables: dict_key = AA + '_' + num + '_' + dssp + '....' In my humble opinion, it's bad to create this kind of dictionary... it's better to use a more branched structure, e.g.: dict_dssp = {dssp : {AA : {num : (structure, ...), ...}, ...}, ...} which is easier to handle, so for example you can extract the keys with dict_dssp.keys() (to get a list of all the dssps) or dict_dssp[dssp1].items() (for all the items in a given dssp), and so on. Anyway, if you can't change to this structure you can find some useful information here: - http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52306 2007/6/10, irene farabella : > > hi > i am a beginner in python. > from DSSP file i made a dictionary like that: > dict2[AA_num_dssp,chain] = (structure,AA_name_dssp) > i need of it in my small programm. > i never work with a dictionary that have the keys like that. > i have to sort the key of the dictionay... how can i do? 
> usualy when i have only the AA_num i can mutate it in a list an then sort > it... > but in this case...mmm > thanks for help me!! > and sorry for the english.. > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com From dalloliogm at gmail.com Mon Jun 11 14:05:32 2007 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 11 Jun 2007 16:05:32 +0200 Subject: [BioPython] GFF parser, other than Bio.GFF Message-ID: <5aa3b3570706110705q2b854711pf44b398892aa64a9@mail.gmail.com> Hi! I'm currently working with some gene annotations in GFF format[1], and I've noticed there is not any GFF parser in the current biopython distribution. Am I wrong? I've found the Bio.GFF module, but it doesn't actually do what I want to do (read [2]), as it gets informations from a MySQL database (?), while I'm searching for a module to parse a gff file and transform it to dictionary, maybe like SeqIO. Well, I wrote a few scripts to parse my GFF files.. I thought I can contribute with some code if somebody can help me in refining and adapting them (there is still a lot of work to do). So why there is still not any gff parser in biopython? Is this format too outdated, or maybe nobody is using it? Or maybe I'm missing it? Or there is some other problem? Thanks! 
Giovanni [1] http://www.sanger.ac.uk/Software/formats/GFF/ (gff), http://mblab.wustl.edu/GTF22.html (gtf 2.2) [2] http://portal.open-bio.org/pipermail/biopython/2004-May/002099.html -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com From dalloliogm at gmail.com Tue Jun 12 11:07:20 2007 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 12 Jun 2007 13:07:20 +0200 Subject: [BioPython] I don't understand why SeqRecord.feature is a list Message-ID: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> Hi, I'm a newbie to biopython and I'm trying to use it to represent a gene structure parsed from a gff file. In principle, I would create a SeqRecord to represent an mRNA; then, I would like to annotate its exons and introns in the .feature field. But I don't understand why SeqRecord.feature is a list, I think it could be easier to use as a dictionary. For example, this is what I've created until now: mRNA1 = SeqRecord() mRNA1.seq = 'cacacacacgtatgcta..' mRNA1.id = '...' mRNA1.feature = [exon1_SeqFeatureObj, exon2_SeqFeatureObj,....] here the exons are annotated in a list; the problem is that in this way it's difficult to retrieve them, since if let's say I want to retrieve the informations from the exon3 object, I have to cycle in all the mRNA1.feature objects to look for it. Wouldn't it be better to use a dictionary for SeqRecord.feature? mRNA1 = SeqRecord() mRNA1.seq = 'cacacacacgtatgcta..' mRNA1.id = '...' mRNA1.feature = {'exon1' : exon1_SeqFeatureObj, 'exon2' : exon2_SeqFeatureObj,....} Alternatively, I've got the idea of using SeqRecord.annotations to keep track of the indexes in SeqRecord.feature: mRNA1 = SeqRecord() mRNA1.seq = 'cacacacacgtatgcta..' mRNA1.id = '...' mRNA.annotations = {'exon1' : mRNA.feature[0], 'exon2' : mRNA.feature[1], ....} mRNA1.feature = [exon1_SeqFeatureObj, exon2_SeqFeatureObj,....] 
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com From ezequiel.panepucci at psi.ch Tue Jun 12 11:44:59 2007 From: ezequiel.panepucci at psi.ch (Ezequiel Panepucci) Date: Tue, 12 Jun 2007 13:44:59 +0200 Subject: [BioPython] I don't understand why SeqRecord.feature is a list In-Reply-To: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> Message-ID: > But I don't understand why SeqRecord.feature is a list, I think it > could be easier to use as a dictionary. The problem with dictionaries is that they are not ordered and lists are, so internally it is easier to organize lists than dicts. I don't know how easy it would be to define which attribute/property of a feature should be used as a dict key. Zac From mcolosimo at mitre.org Tue Jun 12 12:09:45 2007 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 12 Jun 2007 08:09:45 -0400 Subject: [BioPython] I don't understand why SeqRecord.feature is a list In-Reply-To: References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> Message-ID: <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> Additionally, for many formats you can have multiple features with the same name; e.g., CDS, gene, etc... in GenBank Records. The same rationale doesn't fully apply to why the feature qualifiers are dictionaries of lists. Marc On Jun 12, 2007, at 7:44 AM, Ezequiel Panepucci wrote: >> But I don't understand why SeqRecord.feature is a list, I think it >> could be easier to use as a dictionary. > > The problem with a dictionaries is that they are not ordered > and lists are, so internally it is easier to organize lists than > dicts. > > I don't know how easy it would be to define which attribute/property > of a feature should be used as a dict key. 
> > Zac > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Tue Jun 12 14:32:45 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jun 2007 15:32:45 +0100 Subject: [BioPython] I don't understand why SeqRecord.feature is a list In-Reply-To: <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> Message-ID: <466EAE8D.2090609@maubp.freeserve.co.uk> Marc Colosimo wrote: > Additionally, for many formats you can have multiple features with > the same name; e.g., CDS, gene, etc... in GenBank Records. Indeed - and as the SeqRecord/SeqFeature is most heavily used by the GenBank parser, that does explain things well. The problem with using a dictionary is what to index on - you can't simply use the location string for example, as there are usually entries for genes and CDS features with the same location. You can't depend on any other information like an identifier or name to be present in a GenBank file for all feature types. In general, the choice of index will depend on what you want to use it for - so the flippant answer is just index it yourself, for example like this: http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features > The same rational doesn't fully apply to why the feature qualifiers > are dictionaries of lists. No it doesn't. The rationale seems to have been that feature qualifiers in GenBank files can occur with no values (e.g. /pseudo and others), a single value (e.g. translation) or multiple values (by repeated keys, e.g. database cross references). So using a list is a simple solution to cover all these cases - even if most entries only have a single entry. (There are some old posts on the mailing list archive discussing this.) 
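The "index it yourself" suggestion can be sketched roughly as follows. To keep the example runnable without Biopython installed, it uses a small stand-in Feature tuple rather than the real SeqFeature class; the Feature type, the sample data, the use of an empty list for a value-less qualifier, and the index_features helper are all assumptions made for illustration only.

```python
from collections import defaultdict, namedtuple

# Stand-in for a feature object: a type, a location string, and qualifiers
# stored as a dictionary of lists (hypothetical, not Biopython's class).
Feature = namedtuple("Feature", "type location qualifiers")

features = [
    Feature("gene", "1..300", {"gene": ["abcA"]}),
    # A gene and its CDS share a location, so location alone is not a key.
    Feature("CDS", "1..300", {"gene": ["abcA"],
                              "db_xref": ["GI:1", "GeneID:2"]}),
    # A qualifier with no value is shown here as an empty list (assumed).
    Feature("gene", "400..900", {"gene": ["abcB"], "pseudo": []}),
]


def index_features(features, qualifier):
    """Map (feature type, qualifier value) to positions in the list."""
    index = defaultdict(list)
    for position, feature in enumerate(features):
        for value in feature.qualifiers.get(qualifier, []):
            index[(feature.type, value)].append(position)
    return dict(index)


gene_index = index_features(features, "gene")
cds = features[gene_index[("CDS", "abcA")][0]]
```

The features stay in an ordered list, and each index is built for whichever qualifier suits the task at hand, which is the trade-off described above.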
Peter From richa at musc.edu Tue Jun 12 16:31:35 2007 From: richa at musc.edu (richa) Date: Tue, 12 Jun 2007 12:31:35 -0400 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' Message-ID: <466ECA67.2080506@musc.edu> Hi all, Just installed biopython on ubuntu feisty. Dependencies and package seemed to install without a problem. Many of the test files that come with it work fine, but the SeqIO object runs into a problem. For example, the following code causes an error saying that the 'module' object has no attribute 'parse'. Is this a problem of syntax or is it an installation issue? from Bio import SeqIO handle = open("ls_orchid.fasta", "rU") for record in SeqIO.parse(handle, "fasta") : print record.id From idoerg at gmail.com Tue Jun 12 16:54:30 2007 From: idoerg at gmail.com (I. Friedberg) Date: Tue, 12 Jun 2007 09:54:30 -0700 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ECA67.2080506@musc.edu> References: <466ECA67.2080506@musc.edu> Message-ID: I just tried to install biopython in Ubuntu Feisty and the installation breaks: % sudo apt-get install python-biopython [... lots of apt install log messages...] Setting up python-biopython (1.42-2) ... Compiling /var/lib/python-support/python2.5/Bio/Wise/dnal.py ... File "/var/lib/python-support/python2.5/Bio/Wise/dnal.py", line 5 from __future__ import division SyntaxError: from __future__ imports must occur at the beginning of the file How come you managed to install? On 6/12/07, richa wrote: > > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? 
> > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- I. Friedberg "The only problem with troubleshooting is that sometimes trouble shoots back." From biopython at maubp.freeserve.co.uk Tue Jun 12 17:09:17 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jun 2007 18:09:17 +0100 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ECA67.2080506@musc.edu> References: <466ECA67.2080506@musc.edu> Message-ID: <466ED33D.3070200@maubp.freeserve.co.uk> richa wrote: > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? > > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id That should work. Odd... What version of Python are you using? It might be helpful to see the full error message, it should include the path to a file .../Bio/SeqIO/__init__.py If you have a quick look in that file and double check there is a function "parse" (e.g. search for "def parse("), and it looks something like the latest version, which is online here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/__init__.py?rev=1.19&cvsroot=biopython&content-type=text/vnd.viewcvs-markup You said you were using Ubuntu Feisty - did you install Biopython from source code by downloading the tar ball, or using "apt-get"? 
If you used apt-get then you could try removing the package, and then installing from source instead (fairly simple). If that works then maybe there is a problem in the Feisty package. Peter From richa at musc.edu Tue Jun 12 17:13:17 2007 From: richa at musc.edu (richa) Date: Tue, 12 Jun 2007 13:13:17 -0400 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: References: <466ECA67.2080506@musc.edu> Message-ID: <466ED42D.3040406@musc.edu> It responded with the same error message using apt-get for me too. I used synaptic. I just uninstalled and reinstalled again and looking at the log I saw the same error message. However it did not quit installation and some of biopython's modules retain functionality. It appears then that this is an installation issue. Any suggestions for what is missing for a clean install? I. Friedberg wrote: > I just tried to install biopython in Ubuntu Feisty and the > installation breaks: > > % sudo apt-get install python-biopython > > [... lots of apt install log messages...] > > Setting up python-biopython (1.42-2) ... > Compiling /var/lib/python-support/python2.5/Bio/Wise/dnal.py ... > File "/var/lib/python-support/python2.5/Bio/Wise/dnal.py", line 5 > from __future__ import division > SyntaxError: from __future__ imports must occur at the beginning of > the file > > > How come you managed to install? > > > > > > > On 6/12/07, *richa* > wrote: > > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that > come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? 
> > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > > I. Friedberg > > "The only problem with troubleshooting is that > sometimes trouble shoots back." From winter at biotec.tu-dresden.de Tue Jun 12 17:27:40 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 12 Jun 2007 19:27:40 +0200 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466ECA67.2080506@musc.edu> References: <466ECA67.2080506@musc.edu> Message-ID: <466ED78C.4040807@biotec.tu-dresden.de> richa wrote: > Hi all, > > Just installed biopython on ubuntu feisty. Dependencies and package > seemed to install without a problem. Many of the test files that come > with it work fine, but the SeqIO object runs into a problem. For > example, the following code causes an error saying that the 'module' > object has no attribute 'parse'. > > Is this a problem of syntax or is it an installation issue? > > from Bio import SeqIO > handle = open("ls_orchid.fasta", "rU") > for record in SeqIO.parse(handle, "fasta") : > print record.id Dear richa: It is an installation issue. I got exactly the same error. I am running Debian Linux with the python-biopython package installed: $ dpkg -l | grep biopython ii python-biopython 1.42-2 After upgrading to the newest version with $ sudo apt-get install python-biopython $ dpkg -l | grep biopython ii python-biopython 1.43-1 the error is gone! By the way: this was the content of my old Bio/SeqIO/__init__.py: """Sequence input/output designed to look similar to the bioperl design. At present, these are all hand written. I would like to have autogenerated parsers in the futures, esp. 
with the ability to parse only subsets of the data, and to support event generated parsers.

Note that once a parser is given an input string, it is free to read as much of the data as it wants to read, unless otherwise mentioned.
"""

Nothing more. Now it matches the newest version as posted by Peter.

Cheers,
Christof

From richa at musc.edu Tue Jun 12 19:24:23 2007
From: richa at musc.edu (richa)
Date: Tue, 12 Jun 2007 15:24:23 -0400
Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse'
In-Reply-To: <466ED78C.4040807@biotec.tu-dresden.de>
References: <466ECA67.2080506@musc.edu> <466ED78C.4040807@biotec.tu-dresden.de>
Message-ID: <466EF2E7.6080404@musc.edu>

Compiled from source and there are no more error messages. Thank you all for your help and advice.

-Adam Richards (richa)

Christof Winter wrote:
> richa wrote:
>> Hi all,
>>
>> Just installed biopython on ubuntu feisty. Dependencies and package
>> seemed to install without a problem. Many of the test files that
>> come with it work fine, but the SeqIO object runs into a problem.
>> For example, the following code causes an error saying that the
>> 'module' object has no attribute 'parse'.
>>
>> Is this a problem of syntax or is it an installation issue?
>>
>> from Bio import SeqIO
>> handle = open("ls_orchid.fasta", "rU")
>> for record in SeqIO.parse(handle, "fasta") :
>>     print record.id
>
> Dear richa:
>
> It is an installation issue. I got exactly the same error. I am
> running Debian Linux with the python-biopython package installed:
>
> $ dpkg -l | grep biopython
> ii python-biopython 1.42-2
>
> After upgrading to the newest version with
> $ sudo apt-get install python-biopython
>
> $ dpkg -l | grep biopython
> ii python-biopython 1.43-1
>
> the error is gone!
>
> By the way: this was the content of my old Bio/SeqIO/__init__.py:
>
> """Sequence input/output designed to look similar to the bioperl design.
>
> At present, these are all hand written.
I would like to have > autogenerated parsers in the futures, esp. with the ability to parse > only subsets of the data, and to support event generated parsers. > > Note that once a parser is given an input string, it is free to read > as much of the data as it wants to read, unless otherwise mentioned. > """ > > Nothing more. Now it matches the newest version as posted by Peter. > > Cheers, > Christof > From idoerg at gmail.com Tue Jun 12 20:09:22 2007 From: idoerg at gmail.com (I. Friedberg) Date: Tue, 12 Jun 2007 13:09:22 -0700 Subject: [BioPython] AttributeError: 'module' object has no attribute 'parse' In-Reply-To: <466EF2E7.6080404@musc.edu> References: <466ECA67.2080506@musc.edu> <466ED78C.4040807@biotec.tu-dresden.de> <466EF2E7.6080404@musc.edu> Message-ID: Anybody's Debian-fu strong enough to create a clean 1.43 package? Iddo On 6/12/07, richa wrote: > > Compiled from source and the no more error messages. Thank you all for > your help and advise. > > -Adam Richards (richa) > > Christof Winter wrote: > > richa wrote: > >> Hi all, > >> > >> Just installed biopython on ubuntu feisty. Dependencies and package > >> seemed to install without a problem. Many of the test files that > >> come with it work fine, but the SeqIO object runs into a problem. > >> For example, the following code causes an error saying that the > >> 'module' object has no attribute 'parse'. > >> > >> Is this a problem of syntax or is it an installation issue? > >> > >> from Bio import SeqIO > >> handle = open("ls_orchid.fasta", "rU") > >> for record in SeqIO.parse(handle, "fasta") : > >> print record.id > > > > Dear richa: > > > > It is an installation issue. I got exactly the same error. 
I am > > running Debian Linux with the python-biopython package installed: > > > > $ dpkg -l | grep biopython > > ii python-biopython 1.42-2 > > > > After upgrading to the newest version with > > $ sudo apt-get install python-biopython > > > > $ dpkg -l | grep biopython > > ii python-biopython 1.43-1 > > > > the error is gone! > > > > By the way: this was the content of my old Bio/SeqIO/__init__.py: > > > > """Sequence input/output designed to look similar to the bioperl design. > > > > At present, these are all hand written. I would like to have > > autogenerated parsers in the futures, esp. with the ability to parse > > only subsets of the data, and to support event generated parsers. > > > > Note that once a parser is given an input string, it is free to read > > as much of the data as it wants to read, unless otherwise mentioned. > > """ > > > > Nothing more. Now it matches the newest version as posted by Peter. > > > > Cheers, > > Christof > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- I. Friedberg "The only problem with troubleshooting is that sometimes trouble shoots back." From biopython at maubp.freeserve.co.uk Wed Jun 13 10:30:20 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 11:30:20 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <4665408E.2090306@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> Message-ID: <466FC73C.2000608@maubp.freeserve.co.uk> To update anyone not following bug 2090, I have updated the CVS copy of Bio/Blast/NCBIStandalone.py to do a better job of recent plain text Blast output. 
It can now parse the two BLASTX 2.2.15 files Italo sent me.

If anyone wants to try this code, you can either update your entire Biopython installation to CVS, or simply update the file Bio/Blast/NCBIStandalone.py in your python site-packages directory (after making a backup). You can get the latest version here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython

You want the latest revision, 1.66 (the web interface is normally updated within the hour).

Italo - please could we use those two files as test cases to include with Biopython? And do let me know if any of your other 24,000 examples fail.

Peter

P.S. Biopython can currently only cope with single query plain text output from Blast. We recommend using the XML output.

From rwbarrette at gmail.com Wed Jun 13 13:17:19 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Wed, 13 Jun 2007 09:17:19 -0400
Subject: [BioPython] [Biopython] Blastall problem w/ restrict_gi
Message-ID: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com>

Hello,

I'm new to the list, and relatively new at Python.

I need to run local blast using tblastx, but I have to limit my searches to subsets of my local database. To do this I have gi lists (*.gid.txt files) obtained from NCBI to define my subsets. To run this blast, I'm using the following command to run blastall in my script:

result_handle, error_info = NCBIStandalone.blastall(
    "/BLAST/blastall.exe", "tblastx", "/BLAST/DATAout/VirDBX",
    "/BLAST/sequencesXX.fasta", "7",
    restrict_gi="/BLAST/DATAout/10241.gid.txt")

When I include the restrict_gi keyword and option, I get no results back when I run this through python.
I went into NCBIStandalone and modified it to print out the command that is supposed to be passed through the os.popen3() command, which is: /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt When I copy this string directly into the windows command line, I get results, and it works fine, but it doesn't work when called through python. It does work in Python , however, if I don't include the "restrict_gi" option. Can anyone suggest a modification to the Blastall function or how I call blast from my script that may fix this problem? T From biopython at maubp.freeserve.co.uk Wed Jun 13 17:13:33 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 18:13:33 +0100 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> Message-ID: <467025BD.1010907@maubp.freeserve.co.uk> Roger Barrette wrote: > Hello, I'm new to the list, and relatively new at Python. Hi Roger, and welcome to the list! > I need to run local blast using tblastx, but I have to limit my > searches to subsets of my local database. To do this I have gi lists > (*.gid.txt file) obtained from NCBI, to define my subsets. To run > this blast, I'm using the following command to run blastall in my > script: > > result_handle, error_info = > NCBIStandalone.blastall("/BLAST/blastall.exe", "tblastx", > "/BLAST/DATAout/VirDBX", "/BLAST/sequencesXX.fasta", "7", > restrict_gi="/BLAST/DATAout/10241.gid.txt") > > When I include the restrict_gi keyword and option, I get no results > back when I run this through python. Could you be a little more specific about what goes wrong? Also are you using Windows, what version of Biopython and what version of Python? Have you looked at the contents of both result_handle AND error_info? 
You say you get no results back (is result_handle blank?), so checking error_info would be a good idea. Try something like this...

    # Save whatever blastall wrote as its results (the XML output)
    save_file = open("my_blast.xml", "w")
    save_file.write(result_handle.read())
    save_file.close()

    # Save whatever blastall wrote as error messages
    save_file = open("my_blast.err", "w")
    save_file.write(error_info.read())
    save_file.close()

> I went into NCBIStandalone and modified it to print out the command
> that is supposed to be passed through the os.popen3() command, which
> is:
>
> /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i
> /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt
>
> When I copy this string directly into the windows command line, I get
> results, and it works fine, but it doesn't work when called through
> python. It does work in Python , however, if I don't include the
> "restrict_gi" option. Can anyone suggest a modification to the
> Blastall function or how I call blast from my script that may fix
> this problem?

Have you tried running this command at the command line, redirecting the output to a file (e.g. test.xml), and then getting Biopython to parse that file? This should tell us whether there is a problem parsing the XML output or a problem in calling standalone blast.
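
The same "look at both streams" check can be sketched for a current Python using the standard subprocess module rather than os.popen3. This is only an illustration under stated assumptions: the child command below is a stand-in that prints an error and no results, not a real blastall call, and run_and_capture is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command given as an argument list; return (stdout, stderr, exit code)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.stdout, completed.stderr, completed.returncode


# Stand-in for the external program: a child Python process that writes
# nothing to stdout and one message to stderr, mimicking the empty-results case.
child_code = "import sys; sys.stderr.write('example error\\n'); sys.exit(1)"
out, err, code = run_and_capture([sys.executable, "-c", child_code])

if not out.strip():
    # Empty results: the explanation, if any, is on the error stream.
    print("no results on stdout; stderr says:", err.strip())
```

An empty result stream together with a non-empty error stream points at the call to the external program rather than at the parser.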
Peter From italo.maia at gmail.com Wed Jun 13 17:28:42 2007 From: italo.maia at gmail.com (Italo Maia) Date: Wed, 13 Jun 2007 14:28:42 -0300 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <466FC73C.2000608@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk> Message-ID: <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> Peter, i tried the patch but i received the following error : >>> from Bio.Blast import NCBIStandalone >>> parser = NCBIStandalone.BlastParser() >>> record = parser.parse(file('99.out','r')) Traceback (most recent call last): File "", line 1, in File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 746, in parse self._scanner.feed(handle, self._consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 101, in feed self._scan_parameters(uhandle, consumer) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 712, in _scan_parameters attempt_read_and_call(uhandle, consumer.threshold, start='Neighboring words threshold') File "/var/lib/python-support/python2.5/Bio/ParserSupport.py", line 358, in attempt_read_and_call method(line) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 1333, in threshold line, (1,), ncols=2, expected={0:"T:"}) File "/var/lib/python-support/python2.5/Bio/Blast/NCBIStandalone.py", line 1888, in _get_cols (ncols, len(cols), line) SyntaxError: I expected 2 columns (got 4) in line Neighboring words threshold: 12 2007/6/13, Peter : > > To update anyone not following bug 2090, I have updated the CVS copy of > Bio/Blast/NCBIStandalone.py to do a better job of recent plain text > Blast output. 
It can now parse the two BLASTX 2.2.15 files Italo sent me. > > If anyone want to try this code, you can either update your entire > Biopython installation to CVS, or simply update the file > Bio/Blast/NCBIStandalone.py in your python site-packages directory > (after making a backup). You can get the latest version here: > > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython > > You want the latest revision, 1.66 (the web interface is normally > updated within the hour). > > Italo - please could we use those two files as test cases to include > with Biopython? And do let me know if any of your other 24,000 examples > fails. > > Peter > > P.S. Biopython can currently only cope with single query plain text > output from Blast. We recommend using the XML output. > > -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB Programador Java, Python Meu blog ^^ http://eusouolobomal.blogspot.com/ =========================== From italo.maia at gmail.com Wed Jun 13 17:30:21 2007 From: italo.maia at gmail.com (Italo Maia) Date: Wed, 13 Jun 2007 14:30:21 -0300 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <466FC73C.2000608@maubp.freeserve.co.uk> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk> Message-ID: <800166920706131030n473c64e2o2b5184c0618ef65b@mail.gmail.com> For the use of the files, i'll ask the girl that runs the lab here, but it probably won't be a problem! 
And i'll try with all my files as soon as it works with 99.out 2007/6/13, Peter : > > To update anyone not following bug 2090, I have updated the CVS copy of > Bio/Blast/NCBIStandalone.py to do a better job of recent plain text > Blast output. It can now parse the two BLASTX 2.2.15 files Italo sent me. > > If anyone want to try this code, you can either update your entire > Biopython installation to CVS, or simply update the file > Bio/Blast/NCBIStandalone.py in your python site-packages directory > (after making a backup). You can get the latest version here: > > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython > > You want the latest revision, 1.66 (the web interface is normally > updated within the hour). > > Italo - please could we use those two files as test cases to include > with Biopython? And do let me know if any of your other 24,000 examples > fails. > > Peter > > P.S. Biopython can currently only cope with single query plain text > output from Blast. We recommend using the XML output. > > -- "A arrog?ncia ? a arma dos fracos." =========================== Italo Moreira Campelo Maia Ci?ncia da Computa??o - UECE Desenvolvedor WEB Programador Java, Python Meu blog ^^ http://eusouolobomal.blogspot.com/ =========================== From rwbarrette at gmail.com Wed Jun 13 18:32:56 2007 From: rwbarrette at gmail.com (Roger Barrette) Date: Wed, 13 Jun 2007 14:32:56 -0400 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <467025BD.1010907@maubp.freeserve.co.uk> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk> Message-ID: <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> Hi Peter, Thank you for the response. In regards to your questions, I am using Python 2.5 w/ biopython v1.43, on a Windows platform (XP). 
The specific problem I am having occurs when I attempt to run the blastall command from Python through the NCBIStandalone module. If I run the blastall without the "restrict_gi" option, it gives me alignment results in the .xml file. However, when I include the "restrict_gi" option, I get an empty .xml result file. As per your suggestion however, the error file does list an error. The output of this .err file is: [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file .\/BLAST/DATAout/A10241.txt This is odd because when I run the command from the c:\ prompt: c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt ;it works fine, and I get the alignment results, and no error. Because I do not get any results in the xml file when I get this error, running the blast from the python script, there is nothing to parse, however, when I run the script without the "restrict_gi" option, from either the command prompt or my python script, I get results in the xml file, and they are able to be parsed. Any suggestions as to how to fix this problem would be greatly appreciated. Thanks -Roger > Roger Barrette wrote: > > Hello, I'm new to the list, and relatively new at Python. > > Hi Roger, and welcome to the list! > > > I need to run local blast using tblastx, but I have to limit my > > searches to subsets of my local database. To do this I have gi lists > > (*.gid.txt file) obtained from NCBI, to define my subsets. To run > > this blast, I'm using the following command to run blastall in my > > script: > > > > result_handle, error_info = > > NCBIStandalone.blastall("/BLAST/blastall.exe", "tblastx", > > "/BLAST/DATAout/VirDBX", "/BLAST/sequencesXX.fasta", "7", > > restrict_gi="/BLAST/DATAout/10241.gid.txt") > > > > When I include the restrict_gi keyword and option, I get no results > > back when I run this through python. > > Could you be a little more specific about what goes wrong? 
Also are you > using Windows, what version of Biopython and what version of Python? > > Have you looked at the contents of both result_handle AND error_info? > You say you get no results back (is result_handle is blank?), so > checking error_info would be a good idea. Try something like this... > > save_file = open("my_blast.xml", "w") > save_file.write(result_handle.read()) > save_file.close() > > save_file = open("my_blast.err", "w") > save_file.write(error_info.read()) > save_file.close() > > > I went into NCBIStandalone and modified it to print out the command > > that is supposed to be passed through the os.popen3() command, which > > is: > > > > /BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i > > /BLAST/sequencesXX.fasta -m 7 -l /BLAST/DATAout/10241.gid.txt > > > > When I copy this string directly into the windows command line, I get > > results, and it works fine, but it doesn't work when called through > > python. It does work in Python , however, if I don't include the > > "restrict_gi" option. Can anyone suggest a modification to the > > Blastall function or how I call blast from my script that may fix > > this problem? > > Have you tried running this command at the command line, and redirecting > the output to a file (e.g. test.xml) and then getting Biopython to parse > that file? > > i.e. This should tell us if there is a problem parsing the XML output, > or a problem in calling standalone blast. 
> > Peter > > From biopython at maubp.freeserve.co.uk Wed Jun 13 19:48:16 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 20:48:16 +0100 Subject: [BioPython] Problem with blastx output parsing =~ In-Reply-To: <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> References: <800166920706040936w4de744acn8cefe445a6284f72@mail.gmail.com> <46644664.6080009@maubp.freeserve.co.uk> <800166920706041022u5fafc308h71bdcaa11acfade1@mail.gmail.com> <46644ED2.1080505@maubp.freeserve.co.uk> <4665408E.2090306@maubp.freeserve.co.uk> <466FC73C.2000608@maubp.freeserve.co.uk> <800166920706131028o4eb5ea6eqa92e3f0634ea7748@mail.gmail.com> Message-ID: <46704A00.9010409@maubp.freeserve.co.uk> Italo Maia wrote: > Peter, i tried the patch but i received the following error : >>>> from Bio.Blast import NCBIStandalone >>>> parser = NCBIStandalone.BlastParser() >>>> record = parser.parse(file('99.out','r')) > ... > SyntaxError: I expected 2 columns (got 4) in line > Neighboring words threshold: 12 Oh. My fault. This was due to the "T: ..." and "A: ..." lines being replaced by "Neighboring words threshold: ..." and "Window for multiple hits: ...", and me not testing my changes enough. Could you try revision 1.67 please? http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIStandalone.py?cvsroot=biopython Thanks Peter From biopython at maubp.freeserve.co.uk Wed Jun 13 19:56:48 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Jun 2007 20:56:48 +0100 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk> <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> Message-ID: <46704C00.5000101@maubp.freeserve.co.uk> Roger Barrette wrote: > Hi Peter, > > Thank you for the response. 
In regards to your questions, I am using Python
> 2.5 w/ biopython v1.43, on a Windows platform (XP). The specific problem I
> am having occurs when I attempt to run the blastall command from Python
> through the NCBIStandalone module. If I run the blastall without the
> "restrict_gi" option, it gives me alignment results in the .xml
> file. However, when I include the "restrict_gi" option, I get an empty .xml
> result file. As per your suggestion however, the error file does list an
> error. The output of this .err file is:
>
> [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file
> .\/BLAST/DATAout/A10241.txt

That is interesting - it tells us that the argument is at least getting passed to the blast program (and explains why there is no XML output).

That path is an odd mixture of Unix and Windows style paths, which I wouldn't expect to work. Could you try using \BLAST\DATAout\A10241.txt or C:\BLAST\DATAout\A10241.txt instead (both from within Python and from the command line)? Remember that backslashes are escape characters in Python string literals, so use either r"C:\BLAST\DATAout\A10241.txt" or "C:\\BLAST\\DATAout\\A10241.txt".

> This is odd because when I run the command from the c:\ prompt:
> c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i
> /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt
>
> ;it works fine, and I get the alignment results, and no error.

I'm a little surprised that does work, to be honest. I suspect this is something subtle to do with how command line programs break up the arguments (which is complicated by quotes and slashes - at least you have no spaces in the filenames!). Note that some ways of getting Python to make a system call pass the arguments as one long string (as if typed by the user at the command prompt), while others take them already broken down into the individual terms.
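
A short sketch of the two points above: how backslashes behave in Python string literals, and what "already broken down into the individual terms" looks like. The paths and blastall flags are the ones quoted in this thread; blastall_args is a hypothetical helper, and nothing is executed here.

```python
# A raw string and a doubled-backslash string spell the same Windows path.
raw_path = r"C:\BLAST\DATAout\A10241.txt"
escaped_path = "C:\\BLAST\\DATAout\\A10241.txt"
same_path = raw_path == escaped_path  # True

# Without either precaution, "\t" is read as a single tab character.
mangled = "C:\temp"
has_tab = "\t" in mangled  # True


def blastall_args(gi_list_path):
    """Build the command as a list, one element per argument."""
    return [
        "/BLAST/blastall.exe",
        "-p", "tblastx",
        "-d", "/BLAST/DATAout/VirDBX",
        "-i", "/BLAST/sequencesXX.fasta",
        "-m", "7",
        "-l", gi_list_path,
    ]


args = blastall_args(raw_path)
```

Passing a list like this to a call that accepts one (for example subprocess.run in current Python) avoids a second round of splitting and quoting by a shell.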
> Because I do not get any results in the xml file when I get this error,
> running the blast from the python script, there is nothing to parse,
> however, when I run the script without the "restrict_gi" option, from either
> the command prompt or my python script, I get results in the xml file, and
> they are able to be parsed. Any suggestions as to how to fix this problem
> would be greatly appreciated. Thanks

Fingers crossed using Windows style absolute paths fixes this for you...

Peter

From kawaiichiko at gmail.com Sun Jun 17 18:34:05 2007
From: kawaiichiko at gmail.com (Jolanda Reek)
Date: Sun, 17 Jun 2007 20:34:05 +0200
Subject: [BioPython] What to do with BLAST XML syntax error?
Message-ID: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>

Hello,

I'm using BioPython to send some protein sequences to the NCBI WWWBlast server and parse the output. Sometimes, about once every few minutes, instead of giving the output, BLAST returns an XML syntax error. It states something along the lines of 'SyntaxError: This XML doesn't start with...', and then BioPython can't parse the output (duh). I've written a try/except statement to resend the protein query when this problem occurs; however, sometimes the problem occurs multiple times in a row, leaving me with no other option than to nest try/except statements (= ugly code).

print "BLAST search: "+proteinLijst[x].id
try:
    acNr, evalueNr = blast(proteinLijst[x].sequentie)
    idList.append(acNr)
except SyntaxError:
    print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
    try:
        acNr, evalueNr = blast(proteinLijst[x].sequentie)
        idList.append(acNr)
    except SyntaxError:
        print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
        try:
            acNr, evalueNr = blast(proteinLijst[x].sequentie)
            idList.append(acNr)
        except SyntaxError:
            print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
            acNr, evalueNr = blast(proteinLijst[x].sequentie)
            idList.append(acNr)

<< Etc.
Yuck. It works, but it is still yuck. :)

Can anyone help me think up a solution? And what is causing those faulty XML files? (Avoiding the problem altogether is better than fixing it.)

Thank you.

Chiko.

From lucks at fas.harvard.edu Sun Jun 17 18:52:03 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Sun, 17 Jun 2007 14:52:03 -0400
Subject: [BioPython] What to do with BLAST XML syntax error?
In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>
References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com>
Message-ID: <34EE02FD-F9FE-46D1-963A-00392EB33DD1@fas.harvard.edu>

Why not wrap the try/except block in a while loop:

done = 0
while not done:
    #wait some time
    try:
        #your code
        done = 1
    except SyntaxError:
        #your code

Julius

---------------------------------------------------------------------------
http://www.openwetware.org/wiki/User:Julius_B._Lucks
---------------------------------------------------------------------------

On Jun 17, 2007, at 2:34 PM, Jolanda Reek wrote:

> Hello,
>
> I'm using BioPython to send some protein sequences to the NCBI WWWBlast
> server and parse the output. Sometimes, like once every few minutes, instead
> of giving the output, BLAST returns a XML syntax error. It states something
> along the lines of 'SyntaxError: This XML doesn't start with...'. and then
> BioPython can't parse the output (duh). I've written a try/except statement
> to resend the protein query when this problem occurs,however, sometimes the
> problem occurs multiple times in a row, leaving me with no other option then
> to nest try/except statements (= ugly code).
>
> print "BLAST search: "+proteinLijst[x].id
> try:
>     acNr, evalueNr = blast(proteinLijst[x].sequentie)
>     idList.append(acNr)
> except SyntaxError:
>     print "SyntaxError: BLAST server send incomplete results. Resubmitting query..."
> try: > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > except SyntaxError: > print "SyntaxError: BLAST server send incomplete results. > Resubmitting query..." > try: > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > except SyntaxError: > print "SyntaxError: BLAST server send incomplete results. > Resubmitting query..." > acNr, evalueNr = blast(proteinLijst[x].sequentie) > idList.append(acNr) > > << Etc. Yuck. It works, but it is still yuck. :) > > Can anyone help me think up a solution? And what is causing those > faulty XML > files? (Avoiding the problem altogether is better than fixing it.) > > Thank you. > > Chiko. > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From cjfields at uiuc.edu Sun Jun 17 19:01:48 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 17 Jun 2007 14:01:48 -0500 Subject: [BioPython] What to do with BLAST XML syntax error? In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> Message-ID: <7ADA1F7F-C39A-4DC0-9548-DD7971BC79D9@uiuc.edu> Very likely what is returned is not XML at all, but HTML or a server error code (thus triggering the XML error since it isn't XML). IIRC you have to loop every x seconds to check the RID, then retrieve the results; every server check could theoretically have some unforeseen server error. With BioPerl this is checked using the response object returned from every RID request: if ($response->is_error) { # throw some relevant error } Hate to say but I'm not sure how this would be handled via Python. chris On Jun 17, 2007, at 1:34 PM, Jolanda Reek wrote: > ... > Can anyone help me think up a solution? And what is causing those > faulty XML > files? (Avoiding the problem altogether is better than fixing it.) > > Thank you. > > Chiko. 
From mdehoon at c2b2.columbia.edu Sun Jun 17 23:01:55 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Mon, 18 Jun 2007 08:01:55 +0900 Subject: [BioPython] What to do with BLAST XML syntax error? In-Reply-To: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> References: <8d8cfd390706171134m74027696j49354e96f9c6e2ed@mail.gmail.com> Message-ID: <4675BD63.6090301@c2b2.columbia.edu> Jolanda Reek wrote: > acNr, evalueNr = blast(proteinLijst[x].sequentie) Since this line causes the SyntaxError, could you show us what is in the "blast" function that is being called here? --Michiel. From rwbarrette at gmail.com Tue Jun 19 11:42:41 2007 From: rwbarrette at gmail.com (Roger Barrette) Date: Tue, 19 Jun 2007 07:42:41 -0400 Subject: [BioPython] Blastall problem w/ restrict_gi In-Reply-To: <46704C00.5000101@maubp.freeserve.co.uk> References: <2af454d50706130617t46b7b6cdt2e75b47912f4a2b5@mail.gmail.com> <467025BD.1010907@maubp.freeserve.co.uk> <2af454d50706131132s1df127f2ve00b30cb6b2643e1@mail.gmail.com> <46704C00.5000101@maubp.freeserve.co.uk> Message-ID: <2af454d50706190442o211e3b88tec9650598cf3427@mail.gmail.com> Thanks for the suggestions Peter. Once I changed the format of the command to call the restrict_gi with the \\, it worked fine. Thanks. -Roger On 6/13/07, Peter wrote: > > Roger Barrette wrote: > > Hi Peter, > > > > Thank you for the response. In regards to your questions, I am using > Python > > 2.5 w/ biopython v1.43, on a Windows platform (XP). The specific > problem I > > am having occurs when I attempt to run the blastall command from Python > > through the NCBIStandalone module. If I run the blastall without the > > "restrict_gi" option, it gives me alignment results in the .xml > > file. However, when I include the "restrict_gi" option, I get an empty > .xml > > result file. As per your suggestion however, the error file does list > an > > error. 
The output of this .err file is: > > > > [NULL_Caption] ERROR: gi|90968860|gb|DQ443515.1|: Unable to open file > > .\/BLAST/DATAout/A10241.txt > > That is interesting - it tells us that the argument is at least getting > passed to the blast program (and explains why there is no XML output). > > That path is an odd mixture of Unix and Windows style paths, which I > wouldn't expect to work. > > Could you try using \BLAST\DATAout\A10241.txt or > C:\BLAST\DATAout\A10241.txt instead (both from within Python and from > the command line). > > Remember that slashes are escape characters in python so use either > r"C:\BLAST\DATAout\A10241.txt" or "C:\\BLAST\\DATAout\\A10241.txt". > > > This is odd because when I run the command from the c:\ prompt: > > c:\>/BLAST/blastall.exe -p tblastx -d /BLAST/DATAout/VirDBX -i > > /BLAST/sequencesXX.fasta -m 7 -l .\/BLAST/DATAout/A10241.txt > > > > ;it works fine, and I get the alignment results, and no error. > > I'm a little surprised that does work to be honest. > > I suspect this is something subtle to do with how command line programs > break up the arguments (which is complicated by quotes and slashes - at > least you have no spaces in the filenames!). > > Note some of ways of getting python to make a system call pass the > arguments as a long string (as if typed by the user at the command > prompt) while others are already broken down into the individual terms. > > > Because I do not get any results in the xml file when I get this error, > > running the blast from the python script, there is nothing to parse, > > however, when I run the script without the "restrict_gi" option, from > either > > the command prompt or my python script, I get results in the xml file, > and > > they are able to be parsed. Any suggestions as to how to fix this > problem > > would be greatly appreciated. Thanks > > Fingers crossed using Windows style absolute paths fixes this for you... 
> > Peter > > From rwbarrette at gmail.com Tue Jun 19 11:58:19 2007 From: rwbarrette at gmail.com (Roger Barrette) Date: Tue, 19 Jun 2007 07:58:19 -0400 Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST Message-ID: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> I am trying to set up a script to automatically go into NCBI and retrieve individual FASTA file based on a list of accession numbers (either gi or NC). The code that I have written gets the sequences and saves the file, but when I run a blast against the file, it doesn't work, Am I not using the correct parser for preparing to save the file for blasting? I tried to set the format to "fasta", but I was getting errors saying that gi_list[0] doesn't contain the arguement 'data.seq'. I also tried the arguement .sequence, and it gave me the same errror. I realize I'm not currently calling the file in as a FASTA, but this is the only way I've been able to even automate the record retrieval process for the long series of Blasting that I have to do. I have a separate function for calling the Blast, but it works fine with manually downloaded FASTA files, so the problem appears to be here. Any suggestions for a fix, or even a better way to do this would be greatly appreciated. Thanks. 
My code is: def Get_FASTA_Seq(NC_ID): i = NC_ID ## Search for Viruses based on TXID from Bio import GenBank gi_list = GenBank.search_for(i) ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank") fasta_file = open("c:\Current_Query.gbk", "w") ## Extract individual Sequence from NCBI based on gi# or NC# ## gb_record = ncbi_dict[gi_list[0]] record_parser = GenBank.FeatureParser() ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser = record_parser) gb_seqrecord = ncbi_dict[gi_list[0]] SeqValue = ncbi_dict[gi_list[0]].seq.data NameValue = ncbi_dict[gi_list[0]].annotations["organism"] Length = len(SeqValue) Seq5 = 0 Seq3 = Seq5 + Length print NameValue print Length print SeqValue ## Write sequences into the FASTA file ## fasta_file.write(">" + i + " " + NameValue + "\n") for j in range(0, len(SeqValue[Seq5:Seq3]), Length): fasta_file.write(SeqValue[Seq5:Seq3]) fasta_file.write("\n") ## Close and Save the FASTA file ## fasta_file.close() From biopython at maubp.freeserve.co.uk Tue Jun 19 13:10:17 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jun 2007 14:10:17 +0100 Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST In-Reply-To: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> Message-ID: <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com> On 6/19/07, Roger Barrette wrote: > I am trying to set up a script to automatically go into NCBI and retrieve > individual FASTA file based on a list of accession numbers (either gi or > NC). The code that I have written gets the sequences and saves the file, > but when I run a blast against the file, it doesn't work, Am I not using the > correct parser for preparing to save the file for blasting? I tried to set > the format to "fasta", but I was getting errors saying that gi_list[0] > doesn't contain the arguement 'data.seq'. 
I also tried the arguement
> .sequence, and it gave me the same errror. I realize I'm not currently
> calling the file in as a FASTA, but this is the only way I've been able to
> even automate the record retrieval process for the long series of Blasting
> that I have to do. I have a separate function for calling the Blast,
> but it works fine with manually downloaded FASTA files, so the
> problem appears to be here. Any suggestions for a fix, or even a better way
> to do this would be greatly appreciated. Thanks. My code is:

You seem to have tried a lot of things and it's difficult to follow
exactly what you are trying to do. I think you want to:

(1) start with a list of gi numbers
(2) get the matching sequence data online from the NCBI
(3) save these as a fasta file
(4) call blast using this fasta file as the input query

Anyway - I've included a bit of code for (2) and (3) at the end of this email.

> def Get_FASTA_Seq(NC_ID):
>
>     i = NC_ID
>
>     ## Search for Viruses based on TXID
>
>     from Bio import GenBank
>     gi_list = GenBank.search_for(i)
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
>
>     fasta_file = open("c:\Current_Query.gbk", "w")

It's a bit odd to save a fasta file with a gbk extension; .fasta is more
common. And there may be a problem with the single unescaped backslash, try
r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".

Remember, in Python the backslash is used for things like \n (new line),
\t (tab) etc.
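To make the escaping point concrete, here is a small sketch. The path c:\temp\new.txt is made up for illustration (it is not from the original script); it was chosen because it happens to contain both \t and \n, so the unescaped form is silently mangled.

```python
# Hypothetical Windows path, chosen because it contains "\t" and "\n".
raw_path = r"c:\temp\new.txt"        # raw string: backslashes kept as-is
escaped_path = "c:\\temp\\new.txt"   # each backslash doubled
mangled_path = "c:\temp\new.txt"     # \t becomes a tab, \n a newline

print(raw_path == escaped_path)          # the two safe spellings agree
print(repr(mangled_path))                # shows the embedded tab and newline
print(len(raw_path), len(mangled_path))  # the mangled string is shorter
```

A path such as "c:\Current_Query.gbk" may happen to work because \C is not a recognised escape sequence, but relying on that is fragile; the raw-string or doubled-backslash form is the safer habit.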
>     ## Extract individual Sequence from NCBI based on gi# or NC# ##
>
>     gb_record = ncbi_dict[gi_list[0]]
>     record_parser = GenBank.FeatureParser()
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser = record_parser)
>     gb_seqrecord = ncbi_dict[gi_list[0]]
>
>     SeqValue = ncbi_dict[gi_list[0]].seq.data
>     NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
>     Length = len(SeqValue)
>     Seq5 = 0
>     Seq3 = Seq5 + Length
>
>     print NameValue
>     print Length
>     print SeqValue
>
>     ## Write sequences into the FASTA file ##
>
>     fasta_file.write(">" + i + " " + NameValue + "\n")
>     for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
>         fasta_file.write(SeqValue[Seq5:Seq3])
>         fasta_file.write("\n")
>     ## Close and Save the FASTA file ##
>     fasta_file.close()

That is all very complicated - why mess about with Seq5 and Seq3 when
you seem to want the whole sequence anyway?

Have you opened the output file in a text editor to check it looks sensible?

If you can construct a list of SeqRecords, why not write the file using
Bio.SeqIO (Biopython 1.43 or later) like this:

...
gb_records = [ncbi_dict[gi] for gi in gi_list]
from Bio import SeqIO
fasta_file = open("c:\\Current_Query.fasta","w")
SeqIO.write(gb_records, fasta_file, "fasta")
fasta_file.close()

Peter

From rwbarrette at gmail.com Tue Jun 19 13:50:08 2007
From: rwbarrette at gmail.com (Roger Barrette)
Date: Tue, 19 Jun 2007 09:50:08 -0400
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
In-Reply-To: <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com>
References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com>
Message-ID: <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>

Hi again Peter,

You are correct in your assumptions as to what I'm trying to accomplish. I
have a habit of pulling random code from different places when I'm at a loss
for how to do something, when I can't find documentation or examples.
The Seq5 and Seq3 were residual from a previous script where I was pulling
out overlapping 50mers. Regardless... I added your code to my script, and
changed the format for the search parameter to "fasta", but I'm getting the
following error:

Traceback (most recent call last):
  File "", line 1, in
    Get_FASTA_Seq("NC_001653")
  File "C:/Python25/FASTAtry2.py", line 17, in Get_FASTA_Seq
    SeqIO.write(gb_records, fasta_file, "fasta")
  File "C:\Python25\lib\site-packages\Bio\SeqIO\__init__.py", line 214, in write
    writer_class(handle).write_file(sequences)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 243, in write_file
    self.write_records(records)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 232, in write_records
    self.write_record(record)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 103, in write_record
    id = self.clean(record.id)
AttributeError: 'str' object has no attribute 'id'

Am I not supposed to use GenBank.search_for, or NCBIDictionary, or do I need
to parse the raw output first? I did try to use the GenBank RecordParser, as
well as the Feature Parser set to output as "fasta", but I get the same error.

My current (+your) modified code is:

def Get_FASTA_Seq(NC_ID):
    i = NC_ID

    ## Search for Viruses based on TXID, and count number of hits ##

    from Bio import GenBank
    from Bio import SeqIO

    gi_list = GenBank.search_for(i)
    ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")

    ## Extract individual Sequence from NCBI for each TXID ##

    gb_records = [ncbi_dict[gi] for gi in gi_list]
    fasta_file = open("c:\\Current_Query.fasta","w")
    SeqIO.write(gb_records, fasta_file, "fasta")
    fasta_file.close()

Thank you for your insights. Sorry to be such a noob.
-Roger On 6/19/07, Peter wrote: > > On 6/19/07, Roger Barrette wrote: > > I am trying to set up a script to automatically go into NCBI and > retrieve > > individual FASTA file based on a list of accession numbers (either gi or > > NC). The code that I have written gets the sequences and saves the > file, > > but when I run a blast against the file, it doesn't work, Am I not using > the > > correct parser for preparing to save the file for blasting? I tried to > set > > the format to "fasta", but I was getting errors saying that gi_list[0] > > doesn't contain the arguement 'data.seq'. I also tried the arguement > > .sequence, and it gave me the same errror. I realize I'm not currently > > calling the file in as a FASTA, but this is the only way I've been able > to > > even automate the record retrieval process for the long series of > Blasting > > that I have to do. I have a separate function for calling the Blast, > > but it works fine with manually downloaded FASTA files, so the > > problem appears to be here. Any suggestions for a fix, or even a better > way > > to do this would be greatly appreciated. Thanks. My code is: > > You seem to have tried a lot of things and its difficult to follow > exactly what you are trying to do. I think you want to: > > (1) start with a list of gi numbers > (2) get the matching sequence data online from the NCBI > (3) save these as a fasta file > (4) call blast using this fasta file as the input query > > Anyway - I've included a bit of code for (2) and (3) at the end of this > email. > > > def Get_FASTA_Seq(NC_ID): > > > > i = NC_ID > > > > ## Search for Viruses based on TXID > > > > from Bio import GenBank > > gi_list = GenBank.search_for(i) > > ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank") > > > > fasta_file = open("c:\Current_Query.gbk", "w") > > Its a bit odd to save a fasta file with a gbk extension, fasta is more > common. 
And there may be a problem with the single unescaped slash, > try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk". > > Remember, in python the slash is used for things like \n (new line), > \t (tab) etc. > > > ## Extract individual Sequence from NCBI based on gi# or NC# ## > > > > gb_record = ncbi_dict[gi_list[0]] > > record_parser = GenBank.FeatureParser() > > ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser = > > record_parser) > > gb_seqrecord = ncbi_dict[gi_list[0]] > > > > SeqValue = ncbi_dict[gi_list[0]].seq.data > > NameValue = ncbi_dict[gi_list[0]].annotations["organism"] > > Length = len(SeqValue) > > Seq5 = 0 > > Seq3 = Seq5 + Length > > > > print NameValue > > print Length > > print SeqValue > > > > ## Write sequences into the FASTA file ## > > > > fasta_file.write(">" + i + " " + NameValue + "\n") > > for j in range(0, len(SeqValue[Seq5:Seq3]), Length): > > fasta_file.write(SeqValue[Seq5:Seq3]) > > fasta_file.write("\n") > > ## Close and Save the FASTA file ## > > fasta_file.close() > > That is all very complicated - why mess about with Seq5 and Seq4 when > you seem to want the whole sequence anyway? > > Have you opened the output file in a text editor to check it looks > sensible? > > If you can construct a list SeqRecords, why not write the file using > Bio.SeqIO (Biopython 1.43 or later) like this: > > ... 
> gb_records = [ncbi_dict[gi] for gi in gi_list]
> from Bio import SeqIO
> fasta_file = open("c:\\Current_Query.fasta","w")
> SeqIO.write(gb_records, fasta_file, "fasta")
> fasta_file.close()
>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue Jun 19 16:46:20 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jun 2007 17:46:20 +0100
Subject: [BioPython] Can't download a FASTA file from NCBI to BLAST
In-Reply-To: <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>
References: <2af454d50706190458k3ed9cdd2jf7625aa2f28dfc60@mail.gmail.com> <320fb6e00706190610wd422693mad16f6c3bcdf57d2@mail.gmail.com> <2af454d50706190650y3d41d3bcjdc13ebbf48c7bc9c@mail.gmail.com>
Message-ID: <4678085C.7060401@maubp.freeserve.co.uk>

Roger Barrette wrote:
> Hi again Peter,
>
> You are correct in your assumptions as to what I'm trying to accomplish. I
> have a habit of pulling random code from different places when I'm at a loss
> for how to do something, when I can't find documentation or examples.

If we start with some of your last attempt, you can see that this NCBI
dictionary just returns raw fasta records as strings:

>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
>>> ncbi_dict["A0B5H8"]
'>gi|121693723|sp|A0B5H8|A0B5H8_9EURY TATA-box binding\nMESTINI...'
>>> print ncbi_dict["A0B5H8"]
>gi|121693723|sp|A0B5H8|A0B5H8_9EURY TATA-box binding
MESTINIENVVASTKLADEFDLVKIESELEGAEYNKEKFPGLVYRVKSPKAAFLIFTSGKVVCTGAKNVE
DVRTVITNMARTLKSIGFDNINLEPEIHVQNIVASADLKTDLNLNAIALGLGLENIEYEPEQFPGLVYRI
KQPKVVVLIFSSGKLVVTGGKSPEECEEGVRIVRQQLENLGLL

You can just write these directly to your file:

from Bio import GenBank
from Bio import SeqIO
acc_list = ["A0B5H8", "A0C5G2", "A0CM02", "A0CRU8"]
#Don't use any record parser, we just want the raw text
ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
fasta_file = open("c:\\Current_Query.fasta","w")
for acc in acc_list :
    fasta_file.write(ncbi_dict[acc])
fasta_file.close()

This is very simple as there is no conversion between file formats - you
are asking the NCBI for fasta format records, and you save them to a
file as is.

Another option (which I was suggesting in the previous email) is to have
the NCBIDictionary parse the data into SeqRecord objects (rather than
raw text) and then write those to your file, possibly using Bio.SeqIO

Peter

From mmokrejs at ribosome.natur.cuni.cz Mon Jun 25 16:05:04 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Mon, 25 Jun 2007 18:05:04 +0200
Subject: [BioPython] [Bioperl-l] How to draw a plasmid map from a genbank-formatted file?
In-Reply-To:
References: <466938F6.7050903@ribosome.natur.cuni.cz> <56BAE06F-2FDF-4FA4-B6A0-96D89470AF4C@wustl.edu> <467178AE.5040905@ribosome.natur.cuni.cz> <46717990.6040509@ribosome.natur.cuni.cz> <46723F91.60501@ribosome.natur.cuni.cz>
Message-ID: <467FE7B0.3010904@ribosome.natur.cuni.cz>

Hi Chris,

Chris Fields wrote:
>
> On Jun 15, 2007, at 2:28 AM, Martin MOKREJŠ wrote:
>
>> Chris Fields wrote:
>>> Is 99.gb supposed to be a GenBank file? And you're loading it into
>>
>> Yes, it was attached to the email. ;)
>
>
>
> Sorry about that. I notice that '.' was added, but the spacing seemed
> off. I think bioperl catches that fine but it's something Wayne should
> consider.
Would you please tell me exactly what is wrong with the spacing? > >>> embl2picture (which I assume takes EMBL format files)? Without example >>> code we can easily make the wrong assumptions (i.e. that this is user >>> error and not a BioPerl problem). >> >> use constant USAGE =><> Usage: $0 >> Render a GenBank/EMBL entry into drawable form. >> Return as a GIF or PNG image on standard output. >> >> File must be in embl, genbank, or another SeqIO- >> recognized format. Only the first entry will be >> rendered. >> >> Example to try: >> embl2picture.pl factor7.embl | display - >> >> END > > Horribly named script (should be seq2picture, since it converts both > gb/embl). The use of 'all_tags' makes me think the script version you > are using is old, as those methods have long since been renamed. Dave > has it working though, so maybe your version has been updated? The 'use > of initialized data in' errors are probably from inclusion of mandatory > fields with no data or '.'. Well, I just copy&pasted the script from the bioperl webpages, I think from a tutorial or FAQ, don't remember anymore. > >>> Also, I don't believe the feature plotting scripts plot circular >>> chromosomes/plasmids. If you want this functionality you'll have to >>> code it for yourself. >> >> That's a pitty it does not, but at least if someone could improve the >> docs. ;) >> Unfortunately I don't have the time to rewrite the code myself now, >> I need a working, standalone, already available tool. :( >> M. > > As I said, unless someone shows interest and codes it just won't get > done. We have had very little interest in this, either b/c there are > tools already out there to do this very thing (multitudes of plasmid > drawing programs, some free like ApE) or that nobody's bothered to write > it up. Well, my search for such tools available on Unix to be used in a script, non-interactively, completely failed. 
My last hope except getting improved ApE is to use the GenomeDiagram under biopython, but so far my .gb files cannot be parsed yet. :( Martin From cjfields at uiuc.edu Mon Jun 25 16:48:30 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 25 Jun 2007 11:48:30 -0500 Subject: [BioPython] [Bioperl-l] How to draw a plasmid map from a genbank-formatted file? In-Reply-To: <467FE7B0.3010904@ribosome.natur.cuni.cz> References: <466938F6.7050903@ribosome.natur.cuni.cz> <56BAE06F-2FDF-4FA4-B6A0-96D89470AF4C@wustl.edu> <467178AE.5040905@ribosome.natur.cuni.cz> <46717990.6040509@ribosome.natur.cuni.cz> <46723F91.60501@ribosome.natur.cuni.cz> <467FE7B0.3010904@ribosome.natur.cuni.cz> Message-ID: Martin, Keep bioperl-related discussion on the bioperl mail list. The large majority of this isn't biopython-related, but maybe some devs there can add to this? On Jun 25, 2007, at 11:05 AM, Martin MOKREJ? wrote: ... > Would you please tell me exactly what is wrong with the spacing? Here's a section of the seq record attached to your previous email: DEFINITION . ACCESSION . VERSION . SOURCE . ORGANISM . Normally there is a fixed column width for any data present in a field, so it would look more like this: DEFINITION PYR4 (DIHYDROOROTASE, PYRIMIDIN 4, dihydroorotase); dihydroorotase [Arabidopsis thaliana]. ACCESSION NP_194024 VERSION NP_194024.1 GI:15235865 DBSOURCE REFSEQ: accession NM_118422.3 KEYWORDS . SOURCE Arabidopsis thaliana (thale cress) ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; eurosids II; Brassicales; Brassicaceae; Arabidopsis. Here's the relevant bit in the latest release notes: "The second part of each sequence entry record contains the information appropriate to its keyword, in positions 13 to 80 for keywords and positions 11 to 80 for the sequence." 
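The column rule quoted above can be sketched as a small formatter. This is an illustration only: `format_header_line` is a hypothetical helper, not part of Biopython or BioPerl, and it assumes the layout described in this thread (keyword field occupying columns 1 to 12, data starting in column 13, and a subkeyword such as ORGANISM indented by two spaces).

```python
KEYWORD_WIDTH = 12  # keyword field; data starts in column 13 (1-based)


def format_header_line(keyword, value=".", sub=False):
    """Return one GenBank-style header line with the value at column 13.

    sub=True indents the keyword by two spaces, as for ORGANISM.
    """
    label = ("  " if sub else "") + keyword
    if len(label) >= KEYWORD_WIDTH:
        raise ValueError("keyword too long for the 12-column field")
    # Pad the keyword field with spaces so the value always starts
    # in the same column, whatever the keyword length.
    return label.ljust(KEYWORD_WIDTH) + value


# Placeholder header of the kind written for a record with no annotation.
header = [
    format_header_line("DEFINITION"),
    format_header_line("ACCESSION"),
    format_header_line("VERSION"),
    format_header_line("SOURCE"),
    format_header_line("ORGANISM", sub=True),
]
print("\n".join(header))
```

Padding to a fixed width, rather than separating keyword and value with a single space, is what keeps column-based parsers happy; a token-based parser would not care either way.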
The bioperl devs try to make our parsers as flexible as possible but others may not, so it's something in ApE that should probably be fixed. And as mentioned to you several times in the past on the mail list and on bugzilla, don't expect sequence records which sway from the standard (in this case, the release notes) to parse correctly in all cases. We can try supporting some that sway from that standard but only up to a point. If it causes additional bugs, headaches, or degrades performance it won't be supported. > ... > Well, I just copy&pasted the script from the bioperl webpages, I think > from a tutorial or FAQ, don't remember anymore. Well, can't help you if you can't point out where the code originated from. We would like to know so it can be corrected. > ... > Well, my search for such tools available on Unix to be used in a > script, > non-interactively, completely failed. My last hope except getting > improved > ApE is to use the GenomeDiagram under biopython, but so far my .gb > files > cannot be parsed yet. :( > Martin As mentioned previously you will likely have to code for it yourself (perl or python) or help debug the relevant biopython code to get it working. We can't/won't do this for you unless/until it's something we feel warrants implementation. Judging by the bug list, we also haven't the time nor inclination to code for it. Sorry but we have other priorities besides doing your work for you. 
chris From mmokrejs at ribosome.natur.cuni.cz Mon Jun 25 14:31:49 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 25 Jun 2007 16:31:49 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <4665C076.20408@maubp.freeserve.co.uk> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> Message-ID: <467FD1D5.9020503@ribosome.natur.cuni.cz> Hi Peter, I have re-tried current CVS version of biopyhton with a file regenerated by fixed version of ApE editor. Unfortunately, I got: $ python generate_image_from_genbank.py Traceback (most recent call last): File "generate_image_from_genbank.py", line 7, in ? genbank_entry = parser.parse(fhandle) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed self._feed_first_line(consumer, self.line) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 876, in _feed_first_line raise SyntaxError('Did not recognise the LOCUS line layout:\n' + line) SyntaxError: Did not recognise the LOCUS line layout: LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular 14-JUN-2007 What's wrong with the LOCUS line now? Bioperl from CVS can read it, and I thought it is already following the current specs. ;-) Thanks for your help, Martin Peter wrote: > Martin MOKREJ? wrote: >> Hi Peter, Chris and others, here I am passing the answer from Wayne >> back, sorry for the difficult cross-communication. > > Thank you both, Martin & Wayne. > > Wayne Davis wrote: >> [the] locus line I'm using is the old standard (some older parsers > > wanted it that way). > > That's worth knowing - thank you. 
Give that, maybe we (Biopython) > should try and parse these files (which aside from the missing > identifier in the LOCUS line should be fairly simple). On the other > hand, I doubt many people still use this particular the old format. > > Wayne Davis wrote: >>> I've updated to write the new standard, if your >>> program isn't flexible enough to read the old style locus lines. > > That's good news. Martin - will this solve your problem, or do you > think we should also update Biopython to cope with these "old style" > LOCUS lines (which also lack identifiers)? > > Wayne Davis wrote: >>> We encourage software developers to switch to a token-based LOCUS >>> parsing approach, rather than a column-specific approach. If this >>> is done, then future changes to the LOCUS line that affect only the >>> spacing of its data values will not require any modifications to > >> software. > > Easier said than done, as some fields can also contain white space. > However, Howard Salis has some interesting code to tackle this attached > to Biopython bug 2294. > > Peter wrote: >>> The next six lines of that example file (elh/pNEX3.gb) have no >>> values - as Chris Fields pointed out on the Biopython mailing list, >>> the NCBI likes to use a dot/period as a place holder. >>> >>> The spec does explicitly say that the KEYWORDS can be omitted, but >>> seems to assume the other lines are expected. Biopython should be >>> happy if these lines are just omitted. > > Just to correct myself, many of those fields are described as mandatory > single entries further up in the documentation - so using a dot/period > (as Wayne has done for the ApE plasmid editor) does seem the best solution. > > Quoting: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt >> 3.4.2 Entry Organization >> ... >> The following is a brief description of each entry field. Detailed >> information about each field may be found in Sections 3.4.4 to 3.4.15. >> >> LOCUS ... Mandatory keyword/exactly one record. DEFINITION ... 
>> Mandatory keyword/one or more records. ACCESSION ... Mandatory >> keyword/one or more records. VERSION... Mandatory keyword/exactly one >> record. ... > > KEYWORDS, SOURCE and ORGANISM are described as mandatory in all annotated > entries (so not mandatory in general). COMMENT is optional. > > Peter > > > -- Dr. Martin Mokrejs Dept. of Genetics and Microbiology Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs -------------- next part -------------- A non-text attachment was scrubbed... Name: pGL3R.gb.gz Type: application/x-tar Size: 3117 bytes Desc: not available URL: From mmokrejs at ribosome.natur.cuni.cz Wed Jun 27 14:27:54 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 27 Jun 2007 16:27:54 +0200 Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file In-Reply-To: <467FD1D5.9020503@ribosome.natur.cuni.cz> References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> Message-ID: <468273EA.90709@ribosome.natur.cuni.cz> Hi, Martin MOKREJ? wrote: > Hi Peter, I have re-tried current CVS version of biopyhton with a > file regenerated by fixed version of ApE editor. Unfortunately, I > got: > > $ python generate_image_from_genbank.py Traceback (most recent call > last): File "generate_image_from_genbank.py", line 7, in ? 
> genbank_entry = parser.parse(fhandle) File > "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, > in parse self._scanner.feed(handle, self._consumer) File > "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, > in feed self._feed_first_line(consumer, self.line) File > "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 876, > in _feed_first_line raise SyntaxError('Did not recognise the LOCUS > line layout:\n' + line) SyntaxError: Did not recognise the LOCUS line > layout: LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA > circular 14-JUN-2007 > > What's wrong with the LOCUS line now? Bioperl from CVS can read it, > and I thought it is already following the current specs. ;-) Thanks > for your help, Martin OK, I have found the spacing problem with my LOCUS lines still to persist, and after some scripting I got the lines fixed. Now I get with current CVS version: $ python generate_image_from_genbank.py Traceback (most recent call last): File "generate_image_from_genbank.py", line 7, in ? genbank_entry = parser.parse(fhandle) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed self._feed_header_lines(consumer, self.parse_header()) File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 978, in _feed_header_lines consumer.taxonomy(data.strip()) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 419, in taxonomy self.data.annotations['taxonomy'] = self._split_taxonomy(content) File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 250, in _split_taxonomy if taxonomy_string[-1] == '.': IndexError: string index out of range $ The file starts with: LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007 DEFINITION . ACCESSION . VERSION . SOURCE . ORGANISM . 
COMMENT
COMMENT     ApEinfo:methylated:0
FEATURES             Location/Qualifiers

From mmokrejs at ribosome.natur.cuni.cz Wed Jun 27 15:22:53 2007
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 27 Jun 2007 17:22:53 +0200
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <46827B15.90301@maubp.freeserve.co.uk>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz> <46827B15.90301@maubp.freeserve.co.uk>
Message-ID: <468280CD.6010901@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> Martin MOKREJŠ wrote:
>> OK, I have found the spacing problem with my LOCUS lines still to
>> persist,
>> and after some scripting I got the lines fixed.
>
> Excellent. I've been away for a few days and haven't had a chance to
> look at this yet.

thanks! No problem, I was busy as well. ;-)

>
>> The file starts with:
>>
>> LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN
>> 14-JUN-2007
>> DEFINITION  .
>> ACCESSION   .
>> VERSION     .
>> SOURCE      .
>>    ORGANISM  .
>> COMMENT
>> COMMENT     ApEinfo:methylated:0
>> FEATURES             Location/Qualifiers
>>
>
> The ORGANISM line looks wrong (three leading spaces rather than two, so
> the dot is pushed one column to the right).
>
> There is a blank COMMENT line which is also odd.
>
> Some of this may just be an email formatting issue, but I would expect
> this instead:
>
> ...
> DEFINITION  .
> ACCESSION   .
> VERSION     .
> SOURCE      .
>   ORGANISM  .
> COMMENT     ApEinfo:methylated:0
> FEATURES             Location/Qualifiers
> ...

OK, I have removed the COMMENT lines altogether and have fixed the ORGANISM
line. Still, I get:

python generate_image_from_genbank.py
Traceback (most recent call last):
  File "generate_image_from_genbank.py", line 7, in ?
    genbank_entry = parser.parse(fhandle)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 361, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 978, in _feed_header_lines
    consumer.taxonomy(data.strip())
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 419, in taxonomy
    self.data.annotations['taxonomy'] = self._split_taxonomy(content)
  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 250, in _split_taxonomy
    if taxonomy_string[-1] == '.':
IndexError: string index out of range

LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007
DEFINITION  .
ACCESSION   .
VERSION     .
SOURCE      .
  ORGANISM  .

Thanks for your help,
M.

From biopython at maubp.freeserve.co.uk Wed Jun 27 14:58:29 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 27 Jun 2007 15:58:29 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <468273EA.90709@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz>
Message-ID: <46827B15.90301@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> OK, I have found the spacing problem with my LOCUS lines still to persist,
> and after some scripting I got the lines fixed.

Excellent. I've been away for a few days and haven't had a chance to
look at this yet.

> The file starts with:
>
> LOCUS pBL-RLuc-GBB+3-III 5391 bp ds-DNA circular SYN 14-JUN-2007
> DEFINITION  .
> ACCESSION   .
> VERSION     .
> SOURCE      .
>    ORGANISM  .
> COMMENT
> COMMENT     ApEinfo:methylated:0
> FEATURES             Location/Qualifiers
>

The ORGANISM line looks wrong (three leading spaces rather than two, so
the dot is pushed one column to the right).

There is a blank COMMENT line which is also odd.

Some of this may just be an email formatting issue, but I would expect
this instead:

...
DEFINITION  .
ACCESSION   .
VERSION     .
SOURCE      .
  ORGANISM  .
COMMENT     ApEinfo:methylated:0
FEATURES             Location/Qualifiers
...

Peter

From biopython at maubp.freeserve.co.uk Wed Jun 27 18:26:28 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 27 Jun 2007 19:26:28 +0100
Subject: [BioPython] Cannot parse ApE plasmid editor GenBank file
In-Reply-To: <468280CD.6010901@ribosome.natur.cuni.cz>
References: <46655550.70400@ribosome.natur.cuni.cz> <4665773F.2070108@maubp.freeserve.co.uk> <46658015.6030506@ribosome.natur.cuni.cz> <4665ABA0.3060500@maubp.freeserve.co.uk> <4665B20A.605@ribosome.natur.cuni.cz> <4665C076.20408@maubp.freeserve.co.uk> <467FD1D5.9020503@ribosome.natur.cuni.cz> <468273EA.90709@ribosome.natur.cuni.cz> <46827B15.90301@maubp.freeserve.co.uk> <468280CD.6010901@ribosome.natur.cuni.cz>
Message-ID: <4682ABD4.5080904@maubp.freeserve.co.uk>

Martin MOKREJŠ wrote:
> OK, I have removed the COMMENT lines altogether and have fixed the ORGANISM
> line. Still, I get ...

With hindsight I should have suggested something like this:

DEFINITION  .
ACCESSION   .
VERSION     .
SOURCE      .
  ORGANISM  .
            .
COMMENT     ApEinfo:methylated:0
FEATURES             Location/Qualifiers

There are usually line(s) after the ORGANISM line which hold the taxonomy,
and Biopython was failing when this was missing. I have just updated CVS
with a fix (see files Bio/GenBank/Scanner.py and __init__.py) for when
these lines are missing or empty.

You said you had resolved the LOCUS line issue with ApE plasmid editor's
output - so hopefully its files should work with Biopython now.
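
For illustration only, the failure in the traceback comes down to indexing the last character of an empty string. The sketch below shows one way such a guard might look; `split_taxonomy` is a hypothetical stand-in written for this note, not the code committed to CVS, and treating a lone "." as "no taxonomy" is an assumption.

```python
def split_taxonomy(taxonomy_string):
    """Split a GenBank lineage string into a list of names.

    Hypothetical helper for illustration; not Biopython's actual fix.
    """
    # An empty string has no last character, so taxonomy_string[-1] would
    # raise "IndexError: string index out of range". Check for it first.
    # Assumption: a lone "." placeholder also means "no taxonomy given".
    if not taxonomy_string or taxonomy_string == ".":
        return []
    # Drop the full stop that terminates the lineage.
    if taxonomy_string[-1] == ".":
        taxonomy_string = taxonomy_string[:-1]
    # Lineage names are separated by semicolons; skip empty pieces.
    return [name.strip() for name in taxonomy_string.split(";") if name.strip()]


print(split_taxonomy(""))
print(split_taxonomy("Eukaryota; Metazoa; Chordata."))
```

With a guard of this kind, a record whose ORGANISM line is followed by no taxonomy lines would simply get an empty lineage rather than stopping the parse.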
Peter

From sdavis2 at mail.nih.gov Wed Jun 27 20:38:06 2007
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Wed, 27 Jun 2007 16:38:06 -0400
Subject: [BioPython] OBO format parser and genome databasing
Message-ID: <4682CAAE.6060000@mail.nih.gov>

I did a bit of googling but didn't find an answer, so I will ask again.
Does anyone know of an OBO format parser for python?

As a more general question, is anyone using the chado database schema with
python? Is there a similar project to chado/gmod but python-based? How
about a microarray (MAGE-type) database system that is python-based?

Thanks,
Sean

From biopython at maubp.freeserve.co.uk Thu Jun 28 08:38:00 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 28 Jun 2007 09:38:00 +0100
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <4682CAAE.6060000@mail.nih.gov>
References: <4682CAAE.6060000@mail.nih.gov>
Message-ID: <46837368.2090302@maubp.freeserve.co.uk>

Sean Davis wrote:
> Does anyone know of an OBO format parser for python?

This is the replacement for the older GO flat file format, right?

> As a more general question, is anyone using the chado database schema
> with python?

I haven't.

> Is there a similar project to chado/gmod but python-based? How about
> a microarray (MAGE-type) database system that is python-based?

Sorry Sean, I don't know of such a project.

Peter

From lpritc at scri.ac.uk Thu Jun 28 09:33:46 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 28 Jun 2007 10:33:46 +0100
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <46837368.2090302@maubp.freeserve.co.uk>
References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk>
Message-ID: <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk>

Hi Sean,

On Thu, 2007-06-28 at 09:38 +0100, Peter wrote:
> Sean Davis wrote:
> > Does anyone know of an OBO format parser for python?
>
> This is the replacement for the older GO flat file format, right?
I have been using the BioPerl load_ontology.pl script to load .obo files
into BioSQL. Unfortunately, there have been some issues with the
GO .obo files and that script, and I've had to fall back on the flat
files. The Sequence Ontology/SOFA .obo files seem to work with it OK,
though.

> > As a more general question, is anyone using the chado database schema
> > with python?
>
> I haven't.

Nor me - still sticking with BioSQL for now.

L.

--
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C

From sdavis2 at mail.nih.gov Thu Jun 28 10:47:55 2007
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 28 Jun 2007 06:47:55 -0400
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk>
References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk> <1183023226.25322.175.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <468391DB.5040007@mail.nih.gov>

Leighton Pritchard wrote:
> Hi Sean,
>
> On Thu, 2007-06-28 at 09:38 +0100, Peter wrote:
>> Sean Davis wrote:
>>> Does anyone know of an OBO format parser for python?
>> This is the replacement for the older GO flat file format, right?
>
> I have been using the BioPerl load_ontology.pl script to load .obo files
> into BioSQL. Unfortunately, there have been some issues with the
> GO .obo files and that script, and I've had to fall back on the flat
> files. The Sequence Ontology/SOFA .obo files seem to work with it OK,
> though.
>
>>> As a more general question, is anyone using the chado database schema
>>> with python?
>> I haven't.
>
> Nor me - still sticking with BioSQL for now.

Thanks, Leighton.

Sean

From sdavis2 at mail.nih.gov Thu Jun 28 10:49:17 2007
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 28 Jun 2007 06:49:17 -0400
Subject: [BioPython] OBO format parser and genome databasing
In-Reply-To: <46837368.2090302@maubp.freeserve.co.uk>
References: <4682CAAE.6060000@mail.nih.gov> <46837368.2090302@maubp.freeserve.co.uk>
Message-ID: <4683922D.2090803@mail.nih.gov>

Peter wrote:
> Sean Davis wrote:
>> Does anyone know of an OBO format parser for python?
>
> This is the replacement for the older GO flat file format, right?

One of them, yes.

http://www.geneontology.org/GO.format.obo-1_2.shtml

>> As a more general question, is anyone using the chado database schema
>> with python?
>
> I haven't.
>
>> Is there a similar project to chado/gmod but python-based?
How about
>> a microarray (MAGE-type) database system that is python-based?
>
> Sorry Sean, I don't know of such a project.

Thanks, Peter.

Sean

From dalloliogm at gmail.com Thu Jun 28 13:45:28 2007
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 28 Jun 2007 15:45:28 +0200
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
In-Reply-To: <466EAE8D.2090609@maubp.freeserve.co.uk>
References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> <466EAE8D.2090609@maubp.freeserve.co.uk>
Message-ID: <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com>

Hi!
In principle, when I can't decide which keys to use for a dictionary,
I just take simple numerical integers as keys, and it works quite
well.
It simplifies testing/debugging/organization a lot and I can decide
the meaning of every key later (so it's better for dictionaries which
have to contain very heterogeneous data).

I'm not sure I have understood the example you gave me on
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
, but it seems to work in a way similar to what I was saying before:
it saves all the features in a list (or is it a dictionary?) and
accesses them later by their positions.

Not to be silly but... how do you represent a gene with its
transcripts/exons/introns structure with biopython? With SeqRecord and
SeqFeature objects?
I still don't get it :(

Cheers!

2007/6/12, Peter :
> Marc Colosimo wrote:
> > Additionally, for many formats you can have multiple features with
> > the same name; e.g., CDS, gene, etc... in GenBank Records.
>
> Indeed - and as the SeqRecord/SeqFeature is most heavily used by the
> GenBank parser, that does explain things well.
>
> The problem with using a dictionary is what to index on - you can't
> simply use the location string for example, as there are usually entries for
> genes and CDS features with the same location.
>
> You can't depend on any other information like an identifier or name to
> be present in a GenBank file for all feature types.
>
> In general, the choice of index will depend on what you want to use it
> for - so the flippant answer is just index it yourself, for example like
> this:
>
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
>
> > The same rationale doesn't fully apply to why the feature qualifiers
> > are dictionaries of lists.
>
> No it doesn't. The rationale seems to have been that feature qualifiers
> in GenBank files can occur with no values (e.g. /pseudo and others), a
> single value (e.g. translation) or multiple values (by repeated keys,
> e.g. database cross references). So using a list is a simple solution
> to cover all these cases - even if most entries only have a single
> entry. (There are some old posts on the mailing list archive discussing
> this.)
>
> Peter
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com

From biopython at maubp.freeserve.co.uk Thu Jun 28 15:11:28 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 28 Jun 2007 16:11:28 +0100
Subject: [BioPython] I don't understand why SeqRecord.feature is a list
In-Reply-To: <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com>
References: <5aa3b3570706120407x7bc29550j26bd8c7a5f4ae02b@mail.gmail.com> <920D9BCD-ADC3-4704-AA97-2AE8089F02CE@mitre.org> <466EAE8D.2090609@maubp.freeserve.co.uk> <5aa3b3570706280645s6744b6fdn2cce34abb6883155@mail.gmail.com>
Message-ID: <4683CFA0.1050905@maubp.freeserve.co.uk>

Giovanni Marco Dall'Olio wrote:
> Hi!
> In principle, when I can't decide which keys to use for a dictionary,
> I just take simple numerical integers as keys, and it works quite
> well.
> It simplifies testing/debugging/organization a lot and I can decide
> the meaning of every key later (so it's better for dictionaries which
> have to contain very heterogeneous data).

It sounds like you don't need/want a dictionary at all. If you are
assigning increasing integers as "keys", then why not just use the list
of features directly? e.g. assuming record is a SeqRecord object:

first_feature = record.features[0]
second_feature = record.features[1]
third_feature = record.features[2]
etc

> I'm not sure I have understood the example you gave me on
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
> , but it seems to work in a way similar to what I was saying before:
> it saves all the features in a list (or is it a dictionary?) and
> access them later by their positions.

That example stored integers (indices in the features list) in a
dictionary using either the Locus tag, GI numbers or GeneID (e.g. keys
like "NEQ010", "GI:41614806" or "GeneID:2654552").

The point being if you know in advance you want to find individual
features on the basis of their locus tag (for example), rather than the
order in the file, then I would map the locus tag strings to positions in
the list. e.g.

locus_tag_cds_index = \
    index_genbank_features(gb_record,"CDS","locus_tag")
my_feature = gb_record.features[locus_tag_cds_index["NEQ010"]]

You could also build a dictionary which maps from the locus tag directly
to the associated SeqFeature objects themselves.

> Not to be silly but... how do you represent a gene with its
> transcripts/exons/introns structure with biopython? With SeqRecord and
> SeqFeature objects?

If you loaded a GenBank or EMBL file using SeqIO you get one SeqRecord
object (assuming there is only one LOCUS line in the file) which contains
a list of SeqFeature objects which in turn may contain sub-features.

I work with bacteria so I don't have much experience with dealing with
sub-features in a SeqFeature object.

Peter
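
As a rough sketch of the indexing idea described in the message above (mapping a qualifier value such as a locus tag to a position in the feature list), something along these lines might do. Everything here is an assumption made so that the example runs without Biopython installed: the `Feature` class is a hypothetical stand-in exposing only `.type` and `.qualifiers`, `index_features` is not the `index_genbank_features` function from the linked page, and the pairing of the example identifiers on one feature is made up.

```python
class Feature:
    """Hypothetical stand-in for a SeqFeature: a type plus a qualifiers dict."""

    def __init__(self, feature_type, qualifiers):
        self.type = feature_type
        # Qualifiers map a key to a LIST of values, so that no value,
        # one value, and repeated keys are all handled the same way.
        self.qualifiers = qualifiers


def index_features(features, feature_type, qualifier):
    """Map each value of `qualifier` to the list position of its feature."""
    index = {}
    for position, feature in enumerate(features):
        if feature.type != feature_type:
            continue
        # A feature lacking the qualifier contributes nothing to the index.
        for value in feature.qualifiers.get(qualifier, []):
            # If a value is repeated, keep the first position seen.
            index.setdefault(value, position)
    return index


# Illustrative data only; the identifier pairing is an assumption.
features = [
    Feature("gene", {"locus_tag": ["NEQ010"]}),
    Feature("CDS", {"locus_tag": ["NEQ010"],
                    "db_xref": ["GI:41614806", "GeneID:2654552"]}),
    Feature("CDS", {"pseudo": [""]}),
]

locus_tag_cds_index = index_features(features, "CDS", "locus_tag")
my_feature = features[locus_tag_cds_index["NEQ010"]]
print(my_feature.type, my_feature.qualifiers["db_xref"])
```

Storing positions rather than the objects themselves keeps the list as the single source of truth; mapping straight to the feature objects would work equally well if the positions are never needed.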