From p.j.a.cock at googlemail.com Tue Jan 3 09:44:03 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Jan 2012 14:44:03 +0000
Subject: [Biopython-dev] Installing Bio.Geography
In-Reply-To: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com>
References: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com>
Message-ID:

On Tue, Jan 3, 2012 at 2:23 PM, JOSE SERGIO HLEAP wrote:
> Thanks for your quick answer Peter... I already went through
> http://biopython.org/wiki/BioGeography at Nick Matzke's github. I
> downloaded the entire branch in a zip, then build, test and install, using
> the setup... but still does not work. When in a python console I type:
>
>>>> from Bio import Geography
>
> Give me the following traceback:
> ...
> ImportError: cannot import name Geography
>
> ...
>
> am I doing something wrong??
>
> Thanks again for any help you can provide me!
>
> Sergio

Hi again.

My guess is you got the (default) master branch from
Nick's repository, not the Geography branch:
https://github.com/nmatzke/biopython/branches

Try something like this before installing,

git clone https://github.com/nmatzke/biopython.git
cd biopython
git checkout Geography

Peter

From eric.talevich at gmail.com Tue Jan 3 11:42:51 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 3 Jan 2012 11:42:51 -0500
Subject: [Biopython-dev] Installing Bio.Geography
In-Reply-To:
References: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com>
Message-ID:

On Tue, Jan 3, 2012 at 9:44 AM, Peter Cock wrote:
> On Tue, Jan 3, 2012 at 2:23 PM, JOSE SERGIO HLEAP
> wrote:
> > Thanks for your quick answer Peter... I already went through
> > http://biopython.org/wiki/BioGeography at Nick Matzke's github. I
> > downloaded the entire branch in a zip, then build, test and install, using
> > the setup... but still does not work. When in a python console I type:
> >
> >>>> from Bio import Geography
> >
> > Give me the following traceback:
> > ...
> > ImportError: cannot import name Geography
> >
> > ...
> >
> > am I doing something wrong??
> >
> > Thanks again for any help you can provide me!
> >
> > Sergio
>
> Hi again.
>
> My guess is you got the (default) master branch from
> Nick's repository, not the Geography branch:
> https://github.com/nmatzke/biopython/branches
>
> Try something like this before installing,
>
> git clone https://github.com/nmatzke/biopython.git
> cd biopython
> git checkout Geography
>
> Peter
>

Also, for the sub-package to be installed via setup.py, you'll need to add
Bio.Geography to the list of installed packages. In setup.py, around line
207, you'll see a global variable PACKAGES and a list of module names as
strings. Try adding "Bio.Geography" to this list.

-Eric

From redmine at redmine.open-bio.org Thu Jan 5 19:53:33 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 6 Jan 2012 00:53:33 +0000
Subject: [Biopython-dev] [Biopython - Bug #3317] (New) Parsing Stockholm files fails with some sequence IDs
Message-ID:

Issue #3317 has been reported by Connor McCoy.

----------------------------------------
Bug #3317: Parsing Stockholm files fails with some sequence IDs
https://redmine.open-bio.org/issues/3317

Author: Connor McCoy
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g.

363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00

This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer.
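[Editorial sketch, not part of the original report.] The failure mode can be illustrated in a few lines. The helper below is hypothetical and is not the AlignIO code; it assumes IDs may carry a "name/start-end" suffix and shows one way to tolerate suffixes that are not integer coordinates, by catching the failed conversion and returning a default.

```python
def split_identifier(identifier):
    """Split 'name/start-end' into (name, start, end).

    Hypothetical helper: falls back to (identifier, None, None) when the
    suffix is not a pair of integers, instead of letting int() raise.
    """
    if "/" in identifier:
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            start_text, end_text = start_end.split("-")
            try:
                return name, int(start_text), int(end_text)
            except ValueError:
                pass  # e.g. "MN1" is not an integer
    return identifier, None, None


# Made-up ID with a genuine coordinate suffix.
with_coords = split_identifier("seq1/10-25")
# Tail of the ID from the report; "MN1" cannot be converted.
without_coords = split_identifier("awsonia_intraceuaris_PHE/MN1-00")
```

With this fallback the second ID is kept whole and simply has no coordinates attached.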
I wrote a quick patch, which just returns the default value if the type conversion fails:

https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From carlcrott at gmail.com Mon Jan 9 09:36:00 2012
From: carlcrott at gmail.com (carl crott)
Date: Mon, 9 Jan 2012 09:36:00 -0500
Subject: [Biopython-dev] GFF file parsing and error handling
Message-ID:

Hey all,

I'm posting here because I know there has been talk about GFF file parsing
and I'd love to code a bit as soon as I comprehend what's going on with
these files.

I've got this GFF file ( placed in a spreadsheet for readability )

https://docs.google.com/spreadsheet/ccc?key=0AtOqyz8P_fJ0dGVOMzNSM29qUVdjZmZ4emdIQ3U2OUE&hl=en_US#gid=0

line 178 + 179 are the problematic lines

what is going on here?

I know that these genes are listed in reverse order and that a sequence of:

stop_codon
CDS
CDS
start_codon

the above is a normal gene arrangement.

BUT my guesses as to what's happening ( between 178 and 179 ):
1) the gene stretches from the end of one chromosome to another?
2) simply a stop_codon with no attached CDS or start_codon ?

I've successfully managed to parse out the gene intervals and now I'm
working on the error handling.

Thanks,
Carl

--
Carl Crott
Web Applications Engineer
www.black-glass.com
412-610-0600

From chapmanb at 50mail.com Tue Jan 10 11:21:17 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 10 Jan 2012 11:21:17 -0500
Subject: [Biopython-dev] GFF file parsing and error handling
In-Reply-To:
References:
Message-ID: <87boqb5y0y.fsf@fastmail.fm>

Carl;

> I've got this GFF file ( placed in a spreadsheet for readability )
>
> https://docs.google.com/spreadsheet/ccc?key=0AtOqyz8P_fJ0dGVOMzNSM29qUVdjZmZ4emdIQ3U2OUE&hl=en_US#gid=0
>
> line 178 + 179 are the problematic lines
>
> what is going on here?

I'm not sure I understand exactly what is giving you problems. There are
two different genes:

- gene_id 000014 on segment human.ENr12. This has two exons and an
  annotated start and stop codon.
- gene_id 000001 on segment human.ENr13.
  This has an exon and a start codon, but no annotated stop codon.

Does that help?
Brad

From redmine at redmine.open-bio.org Mon Jan 16 04:30:37 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 16 Jan 2012 09:30:37 +0000
Subject: [Biopython-dev] [Biopython - Bug #3317] Parsing Stockholm files fails with some sequence IDs
References:
Message-ID:

Issue #3317 has been updated by Peter Cock.

Patch cherry-picked and applied to master,
https://github.com/biopython/biopython/commit/ca86f1a27e7af20b4c692cbbe9d54f08e0c65906

Do you fancy preparing a little unit test as well? Thanks.

----------------------------------------
Bug #3317: Parsing Stockholm files fails with some sequence IDs
https://redmine.open-bio.org/issues/3317

Author: Connor McCoy
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g.

363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00

This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer.

I wrote a quick patch, which just returns the default value if the type conversion fails:

https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Wed Jan 18 17:32:35 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 18 Jan 2012 22:32:35 +0000
Subject: [Biopython-dev] [Biopython - Bug #3317] Parsing Stockholm files fails with some sequence IDs
References:
Message-ID:

Issue #3317 has been updated by Connor McCoy.
Sure: I added an example of the previously failing sequence to the existing `Tests/Stockholm/funny.sth` file.

https://github.com/cmccoy/biopython/commit/7e9f697c38f0132952b50482fddfea77ce800536

If there's a better place, please let me know. All tests pass for me.

----------------------------------------
Bug #3317: Parsing Stockholm files fails with some sequence IDs
https://redmine.open-bio.org/issues/3317

Author: Connor McCoy
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g.

363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00

This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer.

I wrote a quick patch, which just returns the default value if the type conversion fails:

https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Thu Jan 19 13:11:10 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 19 Jan 2012 18:11:10 +0000
Subject: [Biopython-dev] [Biopython - Bug #3318] (New) Bio.GenBank.LocationParserError on RefSeq files
Message-ID:

Issue #3318 has been reported by John Eppley.

----------------------------------------
Bug #3318: Bio.GenBank.LocationParserError on RefSeq files
https://redmine.open-bio.org/issues/3318

Author: John Eppley
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL:

I get the following error trying to parse a file from NCBI RefSeq:
Traceback (most recent call last):
  File "/Users/jmeppley/work/delong/projects/scripts/getSequencesFromGbk.py", line 62, in main
    translateStream(instream,options.formatIn,outstream, options.formatOut, options.cds, options.translate)
  File "/Users/jmeppley/work/delong/projects/scripts/getSequencesFromGbk.py", line 87, in translateStream
    for record in records:
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/SeqIO/__init__.py", line 532, in parse
    for r in i:
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 440, in parse_records
    record = self.parse(handle, do_features)
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 423, in parse
    if self.feed(handle, consumer, do_features):
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 347, in _feed_feature_table
    consumer.location(location_string)
  File "/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/__init__.py", line 975, in location
    raise LocationParserError(location_line)
LocationParserError: join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)
The file can be found here (release 51):

The relevant section is:
     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
I think the underscore in the sequence name is confounding the regular expressions in Bio/GenBank/__init__.py. I was able to stop the error by making the following change (first line is mine, second is original):
99c99
< _complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
---
> _complex_location = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
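[Editorial sketch, not part of the original report.] The effect of the one-character change can be checked in isolation. Only the optional accession-prefix group from the two patterns above is reproduced here; the location sub-patterns behind the `%s` placeholders are not shown in the report and are left out, so this is an illustration of the prefix matching only, not of the full location parser.

```python
import re

# Accession-prefix part of the pattern, copied from the diff above.
ORIGINAL_PREFIX = r"[a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:"
# Same prefix with "_" allowed after the first character (the proposed change).
PATCHED_PREFIX = r"[a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:"

# Cross-record reference taken from the failing location string.
reference = "NC_016402.1:"

original_match = re.match(ORIGINAL_PREFIX, reference)
patched_match = re.match(PATCHED_PREFIX, reference)

print(original_match)          # no match: "_" stops the accession early
print(patched_match.group(0))  # the whole "NC_016402.1:" prefix
```

The original prefix stops at the underscore and never reaches the colon, which is consistent with the guess in the report that the underscore in the sequence name is what trips the regular expression.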
I haven't really tested it to see if the output is correct, though. It does pass the doctests.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Fri Jan 20 05:46:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 20 Jan 2012 10:46:18 +0000
Subject: [Biopython-dev] NCBI adoption of AGP v2.0 and new qualifiers in GenBank/EMBL
Message-ID:

Dear all,

I just spotted this via the @NCBI twitter feed,
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_spec_change.shtml

In addition to the NCBI switch from AGP v1.1 to v2.0, the INSDC
have recently added a new feature type called "assembly_gap",
and the associated qualifiers "gap_type" and "linkage_evidence"
to the INSDC Feature Table Definitions.

Quoting from version 10.0, dated Dec 2011
http://www.insdc.org/documents/feature_table.html#7.2

> Feature Key           assembly_gap
>
>
> Definition            gap between two components of a CON record that is
>                       part of a genome assembly;
>
> Mandatory qualifiers  /estimated_length=unknown or
>                       /gap_type="TYPE"
>                       /linkage_evidence="TYPE" (Note: Mandatory only if the
>                       /gap_type is "within scaffold" or "repeat within
>                       scaffold". If there are multiple types of linkage_evidence
>                       they will appear as multiple /linkage_evidence="TYPE"
>                       qualifiers. For all other types of assembly_gap
>                       features, use of the /linkage_evidence qualifier is
>                       invalid.)
>
> Comment               the location span of the assembly_gap feature for an
>                       unknown gap is 100 bp, with the 100 bp indicated as
>                       100 "n"'s in sequence.
>

i.e. DDBJ, ENA & GenBank flat-files will start to use the
"assembly_gap" features to display information derived from
version 2.0 AGP files from 10th Feb 2012.
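[Editorial sketch, not part of the original message.] The qualifier rule in the quoted definition can be written down as a small check. Everything here is an assumption for illustration: the function name, the dict-of-lists representation of qualifiers, and the example values are made up, and none of this is a Biopython API.

```python
# Gap types for which /linkage_evidence is mandatory, per the quoted note.
EVIDENCE_REQUIRED = {"within scaffold", "repeat within scaffold"}


def check_assembly_gap(qualifiers):
    """Return a list of problems for one assembly_gap feature.

    `qualifiers` is assumed to map qualifier names to lists of values.
    """
    problems = []
    if "estimated_length" not in qualifiers:
        problems.append("missing /estimated_length")
    gap_types = qualifiers.get("gap_type", [])
    if not gap_types:
        problems.append("missing /gap_type")
        return problems
    has_evidence = bool(qualifiers.get("linkage_evidence"))
    needs_evidence = gap_types[0] in EVIDENCE_REQUIRED
    if needs_evidence and not has_evidence:
        problems.append("missing /linkage_evidence")
    elif has_evidence and not needs_evidence:
        problems.append("/linkage_evidence not allowed for this /gap_type")
    return problems


# Illustrative values only; "paired-ends" and "between scaffolds" are
# example strings chosen for this sketch.
ok = check_assembly_gap({
    "estimated_length": ["unknown"],
    "gap_type": ["within scaffold"],
    "linkage_evidence": ["paired-ends"],
})
missing = check_assembly_gap({
    "estimated_length": ["unknown"],
    "gap_type": ["within scaffold"],
})
```

Multiple `/linkage_evidence` values fit the same representation, since each qualifier already maps to a list.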
Probably this will affect the XML variants as well.

Unless any of the parsers/writers for GenBank or EMBL flat files
use a white list approach, the new feature key and qualifiers
shouldn't cause a problem.

Peter

From chapmanb at 50mail.com Tue Jan 24 21:14:58 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 24 Jan 2012 21:14:58 -0500
Subject: [Biopython-dev] Bug in Bio.Blast.NCBIWWW ?
In-Reply-To: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de>
References: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de>
Message-ID: <87vco0wmsd.fsf@fastmail.fm>

Michael;
Thanks for the e-mail and problem report. For Biopython issues, the best
approach is to e-mail the development list (cc'ed here). Since it's a
community project, there are a wide variety of contributors that can
provide help and expertise.

In this particular case, I am not especially familiar with python 3 as
I'm still a python 2 user myself. Peter, Eric and other folks have been
doing a lot of work on this, so hopefully they can confirm or provide
additional details.

Thanks again,
Brad

> Dear Mr. Chapman,
>
> I perhaps found a bug in the Bio.Blast.NCBIWWW module.
> But since I'm using Python 3.2.2 and didn't test this with Python2.x, I don't know if this is a Python2/Python3 issue.
>
> When I try to execute the following line of code (where contig is a string):
>
> resultHandle = NCBIWWW.qblast('blastn', 'nr', contig)
>
> I get the following error message:
>
> Traceback (most recent call last):
>   File "/home/michael/python/gb/sheet11/ex80_all-contigs.py", line 575, in
>     blastContig(contigs[i])
>   File "/home/michael/python/gb/sheet11/ex80_all-contigs.py", line 441, in blastContig
>     resultHandle = NCBIWWW.qblast('blastn', 'nr', contig, hitlist_size=1)
>   File "/usr/local/lib/python3.2/dist-packages/Bio/Blast/NCBIWWW.py", line 122, in qblast
>     handle = urllib.request.urlopen(request)
>   File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
>     return opener.open(url, data, timeout)
>   File "/usr/lib/python3.2/urllib/request.py", line 367, in open
>     req = meth(req)
>   File "/usr/lib/python3.2/urllib/request.py", line 1066, in do_request_
>     raise TypeError("POST data should be bytes"
> TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
>
>
> After trying different things, I found that this can be fixed by replacing
>
> message = urllib.parse.urlencode(query)
>
> in lines 113 and 145 in the NCBIWWW.py script by
>
> message = urllib.parse.urlencode(query, encoding='ascii')
> message = bytes(message, 'ascii')
>
> If this is not a Python2/Python3 thing, you perhaps want to use this fix for the next version of Biopython.
>
> Yours sincerely,
> Michael Grauvogl
>
> Non-text part: text/html

From p.j.a.cock at googlemail.com Wed Jan 25 08:13:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 25 Jan 2012 13:13:59 +0000
Subject: [Biopython-dev] Bug in Bio.Blast.NCBIWWW ?
In-Reply-To: <87vco0wmsd.fsf@fastmail.fm>
References: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de> <87vco0wmsd.fsf@fastmail.fm>
Message-ID:

On Wed, Jan 25, 2012 at 2:14 AM, Brad Chapman wrote:
>
> Michael;
> Thanks for the e-mail and problem report.
> For Biopython issues, the best
> approach is to e-mail the development list (cc'ed here). Since it's a
> community project, there are a wide variety of contributors that can
> provide help and expertise.
>
> In this particular case, I am not especially familiar with python 3 as
> I'm still a python 2 user myself. Peter, Eric and other folks have been
> doing a lot of work on this, so hopefully they can confirm or provide
> additional details.
>
> Thanks again,
> Brad

Hi all,

Yes, as you guessed Michael, this is a Python 3 problem.
This should work now - doing something similar to your
suggestion:

https://github.com/biopython/biopython/commit/7bd519e68974185c7b489e2415c9ee3b6f1d7ac4

Thanks for the reminder - there are still a few things to
fix in Biopython under Python 3.

Peter

From redmine at redmine.open-bio.org Mon Jan 30 10:59:43 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 30 Jan 2012 15:59:43 +0000
Subject: [Biopython-dev] [Biopython - Bug #3317] (Resolved) Parsing Stockholm files fails with some sequence IDs
References:
Message-ID:

Issue #3317 has been updated by Peter Cock.

Status changed from New to Resolved
% Done changed from 0 to 100

Thanks,
https://github.com/biopython/biopython/commit/d439ba5c6a5a501ba859cce02c746f7d75b2914e

----------------------------------------
Bug #3317: Parsing Stockholm files fails with some sequence IDs
https://redmine.open-bio.org/issues/3317

Author: Connor McCoy
Status: Resolved
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g.

363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00

This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer.
I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3 -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Tue Jan 3 14:44:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Jan 2012 14:44:03 +0000 Subject: [Biopython-dev] Installing Bio.Geography In-Reply-To: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com> References: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com> Message-ID: On Tue, Jan 3, 2012 at 2:23 PM, JOSE SERGIO HLEAP wrote: > Thanks for your quick answer Peter... I already went > trough?http://biopython.org/wiki/BioGeography?at the?Nick Matzke's github. I > dowloaded the enrire branch in a zip, then build, test and install, using > the setup... but still does not work. When in a python console I type: > >>>> from Bio import Geography > > Give me the following taceback: > ... > ImportError: cannot import name Geography > > ... > > am I doing something wrong?? > > Thanks again for any help you can provide me! > > Sergio Hi again. 
My guess is you got the (default) master branch from Nick's repository, not the Geography branch: https://github.com/nmatzke/biopython/branches Try something like this before installing, git clone https://github.com/nmatzke/biopython.git cd biopython git checkout Geography Peter From eric.talevich at gmail.com Tue Jan 3 16:42:51 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 3 Jan 2012 11:42:51 -0500 Subject: [Biopython-dev] Installing Bio.Geography In-Reply-To: References: <1325600614.87653.YahooMailNeo@web160904.mail.bf1.yahoo.com> Message-ID: On Tue, Jan 3, 2012 at 9:44 AM, Peter Cock wrote: > On Tue, Jan 3, 2012 at 2:23 PM, JOSE SERGIO HLEAP > wrote: > > Thanks for your quick answer Peter... I already went > > trough http://biopython.org/wiki/BioGeography at the Nick Matzke's > github. I > > dowloaded the enrire branch in a zip, then build, test and install, using > > the setup... but still does not work. When in a python console I type: > > > >>>> from Bio import Geography > > > > Give me the following taceback: > > ... > > ImportError: cannot import name Geography > > > > ... > > > > am I doing something wrong?? > > > > Thanks again for any help you can provide me! > > > > Sergio > > Hi again. > > My guess is you got the (default) master branch from > Nick's repository, not the Geography branch: > https://github.com/nmatzke/biopython/branches > > Try something like this before installing, > > git clone https://github.com/nmatzke/biopython.git > cd biopython > git checkout Geography > > Peter > > Also, for the sub-package to be installed via setup.py, you'll need to add Bio.Geography to the list of installed packages. In setup.py, around line 207, you'll see a global variable PACKAGES and a list of module names as strings. Try adding "Bio.Geography" to this list. 
-Eric From redmine at redmine.open-bio.org Fri Jan 6 00:53:33 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 Jan 2012 00:53:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3317] (New) Parsing Stockholm files fails with some sequence IDs Message-ID: Issue #3317 has been reported by Connor McCoy. ---------------------------------------- Bug #3317: Parsing Stockholm files fails with some sequence IDs https://redmine.open-bio.org/issues/3317 Author: Connor McCoy Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g. 363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00 This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer. I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3 ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri Jan 6 00:53:34 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 6 Jan 2012 00:53:34 +0000 Subject: [Biopython-dev] [Biopython - Bug #3317] (New) Parsing Stockholm files fails with some sequence IDs Message-ID: Issue #3317 has been reported by Connor McCoy. 
---------------------------------------- Bug #3317: Parsing Stockholm files fails with some sequence IDs https://redmine.open-bio.org/issues/3317 Author: Connor McCoy Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g. 363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00 This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer. I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3 -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From carlcrott at gmail.com Mon Jan 9 14:36:00 2012 From: carlcrott at gmail.com (carl crott) Date: Mon, 9 Jan 2012 09:36:00 -0500 Subject: [Biopython-dev] GFF file parsing and error handling Message-ID: Hey all, I'm posting here because I know there has been talk about GFF file parsing and I'd love to code a bit as soon as I comprehend whats going on with these files. I've got this GFF file ( placed in a spreadsheet for readability ) https://docs.google.com/spreadsheet/ccc?key=0AtOqyz8P_fJ0dGVOMzNSM29qUVdjZmZ4emdIQ3U2OUE&hl=en_US#gid=0 line 178 + 179 are the problematic lines what is going on here? I know that these genes are listed in reverse order and that a sequence of: stop_codon CDS CDS start_codon the above is a normal gene arrangement. BUT my guesses as to what happening ( between 178 and 179 ): 1) the gene stretches from the end of one chromosome to another? 2) simply a stop_codon with no attached CDS or start_codon ? 
I've successfully managed to parse out the gene intervals and now I'm working on the error handling. Thanks, Carl -- Carl Crott Web Applications Engineer www.black-glass.com 412-610-0600 From chapmanb at 50mail.com Tue Jan 10 16:21:17 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 10 Jan 2012 11:21:17 -0500 Subject: [Biopython-dev] GFF file parsing and error handling In-Reply-To: References: Message-ID: <87boqb5y0y.fsf@fastmail.fm> Carl; > I've got this GFF file ( placed in a spreadsheet for readability ) > > https://docs.google.com/spreadsheet/ccc?key=0AtOqyz8P_fJ0dGVOMzNSM29qUVdjZmZ4emdIQ3U2OUE&hl=en_US#gid=0 > > line 178 + 179 are the problematic lines > > what is going on here? I'm not sure I understand exactly what is giving you problems. There are two different genes: - gene_id 000014 on segment human.ENr12. This has two exons and an annotated start and stop codon. - gene_id 000001 on segment human.ENr13. This has an exon and a start codon, but no annotated stop codon. Does that help? Brad From redmine at redmine.open-bio.org Mon Jan 16 09:30:37 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 16 Jan 2012 09:30:37 +0000 Subject: [Biopython-dev] [Biopython - Bug #3317] Parsing Stockholm files fails with some sequence IDs References: Message-ID: Issue #3317 has been updated by Peter Cock. Patch cherry-picked and applied to master, https://github.com/biopython/biopython/commit/ca86f1a27e7af20b4c692cbbe9d54f08e0c65906 Do you fancy preparing a little unit test as well? Thanks. ---------------------------------------- Bug #3317: Parsing Stockholm files fails with some sequence IDs https://redmine.open-bio.org/issues/3317 Author: Connor McCoy Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g. 
363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00 This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer. I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3 -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Jan 18 22:32:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 18 Jan 2012 22:32:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3317] Parsing Stockholm files fails with some sequence IDs References: Message-ID: Issue #3317 has been updated by Connor McCoy. Sure: I added an example of the previously failing sequence to the existing `Tests/Stockholm/funny.sth` file. https://github.com/cmccoy/biopython/commit/7e9f697c38f0132952b50482fddfea77ce800536 If there's a better place, please let me know. All tests pass for me. ---------------------------------------- Bug #3317: Parsing Stockholm files fails with some sequence IDs https://redmine.open-bio.org/issues/3317 Author: Connor McCoy Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g. 363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00 This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer. 
I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3 -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Thu Jan 19 18:11:10 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 19 Jan 2012 18:11:10 +0000 Subject: [Biopython-dev] [Biopython - Bug #3318] (New) Bio.GenBank.LocationParserError on RefSeq files Message-ID: Issue #3318 has been reported by John Eppley. ---------------------------------------- Bug #3318: Bio.GenBank.LocationParserError on RefSeq files https://redmine.open-bio.org/issues/3318 Author: John Eppley Status: New Priority: Normal Assignee: Category: Target version: URL: I get the following error trying to parse a file from NCBI RefSeq:
Traceback (most recent call last):
  File "/Users/jmeppley/work/delong/projects/scripts/getSequencesFromGbk.py", line 62, in main
    translateStream(instream,options.formatIn,outstream, options.formatOut, options.cds, options.translate)
  File "/Users/jmeppley/work/delong/projects/scripts/getSequencesFromGbk.py", line 87, in translateStream
    for record in records:
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/SeqIO/__init__.py", line 532, in parse
    for r in i:
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 440, in parse_records
    record = self.parse(handle, do_features)
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 423, in parse
    if self.feed(handle, consumer, do_features):
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/nfs/Isilon/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/Scanner.py", line 347, in _feed_feature_table
    consumer.location(location_string)
  File "/common/lib/python/2.6/biopython-1.57-py2.6-macosx-10.6-universal.egg/Bio/GenBank/__init__.py", line 975, in location
    raise LocationParserError(location_line)
LocationParserError: join(complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905)
The file can be found here (release 51):

The relevant section is:
     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
I think the underscore in the sequence name is confounding the regular expressions in Bio/GenBank/__init__.py. I was able to stop the error by making the following change (first line is mine, second is original):
99c99
< _complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
---
> _complex_location = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
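The effect of the one-character change can be checked in isolation. What follows is only a minimal sketch: the two reference prefixes are copied from the diff lines above, while `simple_span` and `matches` are hypothetical stand-ins for the location alternatives that `_complex_location` interpolates, not the actual Bio.GenBank code.

```python
import re

# Reference prefixes copied from the two diff lines above. The location
# alternatives that follow them in _complex_location are left out.
original_prefix = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?"
patched_prefix = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?"

# Hypothetical stand-in for a plain "start..end" location.
simple_span = r"\d+\.\.\d+"


def matches(prefix, text):
    """Return True if text is an optional reference prefix plus a span."""
    return re.match("^" + prefix + simple_span + "$", text) is not None


location = "NC_016402.1:6618..6676"
print(matches(original_prefix, location))  # underscore not allowed in name
print(matches(patched_prefix, location))   # underscore allowed in name
```

This only exercises the accession prefix of a single join() member, not the full location grammar.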
I haven't really tested it to see if the output is correct, though. It does pass the doctests. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin From p.j.a.cock at googlemail.com Fri Jan 20 10:46:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Jan 2012 10:46:18 +0000 Subject: [Biopython-dev] NCBI adoption of AGP v2.0 and new qualifiers in GenBank/EMBL Message-ID: Dear all, I just spotted this via the @NCBI twitter feed, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_spec_change.shtml In addition to the NCBI switch from AGP v1.1 to v2.0, the INSDC have recently added a new feature type called "assembly_gap", and the associated qualifiers "gap_type" and "linkage_evidence" to the INSDC Feature Table Definitions. Quoting from version 10.0, dated Dec 2011 http://www.insdc.org/documents/feature_table.html#7.2 > Feature Key assembly_gap > > > Definition gap between two components of a CON record that is > part of a genome assembly; > > Mandatory qualifiers /estimated_length=unknown or > /gap_type="TYPE" > /linkage_evidence="TYPE" (Note: Mandatory only if the > /gap_type is "within scaffold" or "repeat within > scaffold". If there are multiple types of linkage_evidence > they will appear as multiple /linkage_evidence="TYPE" > qualifiers. For all other types of assembly_gap > features, use of the /linkage_evidence qualifier is > invalid.) > > Comment the location span of the assembly_gap feature for an > unknown gap is 100 bp, with the 100 bp indicated as > 100 "n"'s in sequence. > i.e. DDBJ, ENA & GenBank flat-files will start to use the "assembly_gap" features to display information derived from version 2.0 AGP files from 10th Feb 2012.
Probably this will affect the XML variants as well. Unless any of the parsers/writers for GenBank or EMBL flat files use a white list approach, the new feature key and qualifiers shouldn't cause a problem. Peter From chapmanb at 50mail.com Wed Jan 25 02:14:58 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 24 Jan 2012 21:14:58 -0500 Subject: [Biopython-dev] Bug in Bio.Blast.NCBIWWW ? In-Reply-To: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de> References: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de> Message-ID: <87vco0wmsd.fsf@fastmail.fm> Micheal; Thanks for the e-mail and problem report. For Biopython issues, the best approach is to e-mail the development list (cc'ed here). Since it's a community project, there are a wide variety of contributors that can provide help and expertise. In this particular case, I am not especially familiar with python 3 as I'm still a python 2 user myself. Peter, Eric and other folks have been doing a lot of work on this, so hopefully they can confirm or provide additional details. Thanks again, Brad > Dear Mr. Chapman, > > I perhaps found a bug in the Bio.Blast.NCBIWWW module. > But since I'm using Python 3.2.2 and didn't test this with Python2.x, I don't know if this is a Python2/Python3 issue. 
> > When I try to execute the following line of code (where contig is a string): > > resultHandle = NCBIWWW.qblast('blastn', 'nr', contig) > > I get the following error message: > > Traceback (most recent call last): > File "/home/michael/python/gb/sheet11/ex80_all-contigs.py", line 575, in > blastContig(contigs[i]) > File "/home/michael/python/gb/sheet11/ex80_all-contigs.py", line 441, in blastContig > resultHandle = NCBIWWW.qblast('blastn', 'nr', contig, hitlist_size=1) > File "/usr/local/lib/python3.2/dist-packages/Bio/Blast/NCBIWWW.py", line 122, in qblast > handle = urllib.request.urlopen(request) > File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen > return opener.open(url, data, timeout) > File "/usr/lib/python3.2/urllib/request.py", line 367, in open > req = meth(req) > File "/usr/lib/python3.2/urllib/request.py", line 1066, in do_request_ > raise TypeError("POST data should be bytes" > TypeError: POST data should be bytes or an iterable of bytes. It cannot be str. > > > After trying different things, I found that this can be fixed by replacing > > message = urllib.parse.urlencode(query) > > in lines 113 and 145 in the NCBIWWW.py script by > > message = urllib.parse.urlencode(query, encoding='ascii') > message = bytes(message, 'ascii') > > If this is not a Python2/Python3 thing, you perhaps want to use this fix for the next version of Biopython. > > Yours sincerely, > Michael Grauvogl > > Non-text part: text/html From p.j.a.cock at googlemail.com Wed Jan 25 13:13:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 25 Jan 2012 13:13:59 +0000 Subject: [Biopython-dev] Bug in Bio.Blast.NCBIWWW ? In-Reply-To: <87vco0wmsd.fsf@fastmail.fm> References: <4F1F0E020200009500004335@gwsmtp2.uni-regensburg.de> <87vco0wmsd.fsf@fastmail.fm> Message-ID: On Wed, Jan 25, 2012 at 2:14 AM, Brad Chapman wrote: > > Micheal; > Thanks for the e-mail and problem report. 
For Biopython issues, the best > approach is to e-mail the development list (cc'ed here). Since it's a > community project, there are a wide variety of contributors that can > provide help and expertise. > > In this particular case, I am not especially familiar with python 3 as > I'm still a python 2 user myself. Peter, Eric and other folks have been > doing a lot of work on this, so hopefully they can confirm or provide > additional details. > > Thanks again, > Brad Hi all, Yes, as you guessed, Michael, this is a Python 3 problem. This should work now - doing something similar to your suggestion: https://github.com/biopython/biopython/commit/7bd519e68974185c7b489e2415c9ee3b6f1d7ac4 Thanks for the reminder - there are still a few things to fix in Biopython under Python 3. Peter From redmine at redmine.open-bio.org Mon Jan 30 15:59:43 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 30 Jan 2012 15:59:43 +0000 Subject: [Biopython-dev] [Biopython - Bug #3317] (Resolved) Parsing Stockholm files fails with some sequence IDs References: Message-ID: Issue #3317 has been updated by Peter Cock. Status changed from New to Resolved % Done changed from 0 to 100 Thanks, https://github.com/biopython/biopython/commit/d439ba5c6a5a501ba859cce02c746f7d75b2914e ---------------------------------------- Bug #3317: Parsing Stockholm files fails with some sequence IDs https://redmine.open-bio.org/issues/3317 Author: Connor McCoy Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: I ended up with a stockholm file where some sequence IDs end with "/MN1-00", e.g. 363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00 This can't be parsed by the BioPython AlignIO module - it fails attempting to convert "MN1" to an integer.
I wrote a quick patch, which just returns the default value if the type conversion fails: https://github.com/cmccoy/biopython/commit/77f97b5d7184a077ca6e2e90d90ca0110b5766c3
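
For readers without the commit to hand, the fallback described in the report (return the default when the type conversion fails) might look roughly like the sketch below. The function name `split_identifier` and the `(name, start, end)` return shape are assumptions made for illustration. This is not a copy of the Biopython source or of the linked patch.

```python
def split_identifier(identifier):
    """Split 'name/start-end' into (name, start, end).

    Hypothetical helper: if the suffix after the last '/' is not a pair
    of integers, the whole identifier is kept as the name and the
    coordinates default to None.
    """
    if "/" in identifier:
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            try:
                start, end = (int(x) for x in start_end.split("-"))
                return name, start, end
            except ValueError:
                # e.g. "MN1-00": "MN1" is not an integer, so fall through
                # to the default instead of raising.
                pass
    return identifier, None, None


odd_id = ("363253|refseq_protein.50.proto_past_mitoc_micro_vira|gi|94986659"
          "|ref|YP_594592.1|awsonia_intraceuaris_PHE/MN1-00")
print(split_identifier("seq1/10-20"))
print(split_identifier(odd_id))
```

With this shape, an identifier such as the one in the report keeps its full text as the name rather than aborting the parse.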