From p.j.a.cock at googlemail.com Sun Dec 2 18:41:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 2 Dec 2012 23:41:49 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 4:46 PM, Peter Cock wrote: > > Done, > https://github.com/biopython/biopython/commit/9f6e810cc68dd1e353d899772fda3053d9f49513 > >>> Once that's done there is some housekeeping to do, like >>> the indexing code duplication with Bio.SeqIO, and tackling >>> indexing BGZF compressed files with Bio.SearchIO which >>> I will have a go at. >> >> Yes. > > Started, it seems the two _index.py files have diverged a > little more than I'd expected: > https://github.com/biopython/biopython/commit/ad1786b99afd2a50248246d877ff00a53949546b I've just refactored the code in order to avoid most of the index duplication (including SQLite backend) between the SeqIO and new SearchIO index and index_db functions. In the short term at least, the common code is now part of Bio/File.py (but remains as private classes). That seemed neater than introducing a new private module. Fingers crossed everything is fine on the buildslaves, TravisCI seems happy. Bow, if you find I've broken anything then we need more unit tests ;) Regards, Peter From w.arindrarto at gmail.com Mon Dec 3 06:22:07 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 3 Dec 2012 12:22:07 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter, >>>> Once that's done there is some housekeeping to do, like >>>> the indexing code duplication with Bio.SeqIO, and tackling >>>> indexing BGZF compressed files with Bio.SearchIO which >>>> I will have a go at. >>> >>> Yes. 
>> >> Started, it seems the two _index.py files have diverged a >> little more than I'd expected: >> https://github.com/biopython/biopython/commit/ad1786b99afd2a50248246d877ff00a53949546b > > I've just refactored the code in order to avoid most of the > index duplication (including SQLite backend) between the > SeqIO and new SearchIO index and index_db functions. Thanks :). I remember I did change some of the variable names. Other than this, the biggest change is probably related to the Indexer classes lazy loading in SearchIO. But it seems to have been handled as well :). > In the short term at least, the common code is now part > of Bio/File.py (but remains as private classes). That > seemed neater than introducing a new private module. Looks like a good place for now, Bio.File as the location for common file-handling code. > Fingers crossed everything is fine on the buildslaves, > TravisCI seems happy. Bow, if you find I've broken > anything then we need more unit tests ;) Will keep that in mind :). regards, Bow From p.j.a.cock at googlemail.com Mon Dec 3 06:36:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 11:36:16 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 11:22 AM, Wibowo Arindrarto wrote: > Hi Peter, > >> I've just refactored the code in order to avoid most of the >> index duplication (including SQLite backend) between the >> SeqIO and new SearchIO index and index_db functions. > > Thanks :). I remember I did change some of the variable names. Basically I moved the core SeqIO indexing code into Bio.File, generalised it enough to work for SearchIO as well, then removed the SearchIO indexing code. > Other than this, the biggest change is probably related to the > Indexer classes lazy loading in SearchIO. But it seems to have > been handled as well :). 
Yes, the SearchIO indexing is still calling your lazy loading function to get the parser objects. >> In the short term at least, the common code is now part >> of Bio/File.py (but remains as private classes). That >> seemed neater than introducing a new private module. > > Looks like a good place for now, Bio.File as the location for > common file-handling code. That was my thinking too. >> Fingers crossed everything is fine on the buildslaves, >> TravisCI seems happy. Bow, if you find I've broken >> anything then we need more unit tests ;) > > Will keep that in mind :). *Grin* I've just done a base class for the random access proxy classes, potentially a little more refactoring to follow here (or renaming): https://github.com/biopython/biopython/commit/9721cd00b5662309456c3dc573642cbb88e4e0a1 Peter From christian at brueffer.de Mon Dec 3 07:46:23 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 03 Dec 2012 20:46:23 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup Message-ID: <50BC9F1F.4090904@brueffer.de> Hi, I just submitted pull request #102 which fixes several types of PEP8 warnings (found using the awesome pep8 tool). 
Here's what's left after those fixes:

$ pep8 --statistics -qq repos/biopython
  789 E111 indentation is not a multiple of four
  673 E121 continuation line indentation is not a multiple of four
  693 E122 continuation line missing indentation or outdented
  171 E123 closing bracket does not match indentation of opening bracket's line
   86 E124 closing bracket does not match visual indentation
   49 E125 continuation line does not distinguish itself from next logical line
  197 E126 continuation line over-indented for hanging indent
  575 E127 continuation line over-indented for visual indent
 1092 E128 continuation line under-indented for visual indent
  773 E201 whitespace after '('
  540 E202 whitespace before ')'
23543 E203 whitespace before ':'
   55 E211 whitespace before '('
  180 E221 multiple spaces before operator
   59 E222 multiple spaces after operator
 5848 E225 missing whitespace around operator
 6517 E231 missing whitespace after ','
 2544 E251 no spaces around keyword / parameter equals
  644 E261 at least two spaces before inline comment
  346 E262 inline comment should start with '# '
  156 E301 expected 1 blank line, found 0
 1838 E302 expected 2 blank lines, found 1
  364 E303 too many blank lines (2)
15553 E501 line too long (82 > 79 characters)
  857 E502 the backslash is redundant between brackets
  291 E701 multiple statements on one line (colon)
  122 E711 comparison to None should be 'if cond is None:'
 3707 W291 trailing whitespace
 1913 W293 blank line contains whitespace

I'm not sure where to go from here with regard to what's worth fixing and what would be considered repo churn (or gratuitous changes that make merging of existing patches harder). I'd especially like to clean up E301, E302, E701, E711, W291 and W293. Other items like E251 are more dubious, as some developers seem to prefer the current style. What do you think?
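A listing like the one above is easy to triage mechanically. As a rough illustration (the `parse_pep8_statistics` helper below is invented for this sketch; it is not part of the pep8 tool or of Biopython), the warning classes can be ranked by count so the noisiest ones surface first:

```python
import re

def parse_pep8_statistics(text):
    """Turn `pep8 --statistics -qq` output lines into
    (count, code, description) tuples, most frequent first."""
    stats = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+([EW]\d{3})\s+(.*)", line)
        if match:
            stats.append((int(match.group(1)),
                          match.group(2),
                          match.group(3)))
    return sorted(stats, reverse=True)

# A few lines copied from the statistics listing above.
sample = """\
  789 E111 indentation is not a multiple of four
15553 E501 line too long (82 > 79 characters)
  122 E711 comparison to None should be 'if cond is None:'
"""
for count, code, description in parse_pep8_statistics(sample):
    print(count, code, description)
```

The same ranking could of course be done with `sort -rn` on the raw output; the helper only becomes useful once you start filtering against a whitelist of codes the project has agreed to ignore.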
Chris From p.j.a.cock at googlemail.com Mon Dec 3 08:34:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 13:34:52 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BC9F1F.4090904@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer wrote: > Hi, Hi Christian, Thanks for all the pull requests sorting out issues like this, in terms of lines of code you'll probably be one of the top contributors to the next release ;) This sort of work isn't as high profile as new features or bug fixes, but has a more subtle role in the long term of the project - making our code easier to follow etc. So we do appreciate these contributions. > I just submitted pull request #102 which fixes several types of PEP8 > warnings (found using the awesome pep8 tool). 101 not 102? https://github.com/biopython/biopython/pull/101 > Here's what's left after those fixes: > > $ pep8 --statistics -qq repos/biopython > 789 E111 indentation is not a multiple of four That's nasty - although I think we've got rid of all the tabbed indentation already which was also very annoying. > 673 E121 continuation line indentation is not a multiple of four I suspect many of those are a style judgement and done that way to line up parentheses etc. 
> 693 E122 continuation line missing indentation or outdented > 171 E123 closing bracket does not match indentation of opening bracket's > line > 86 E124 closing bracket does not match visual indentation > 49 E125 continuation line does not distinguish itself from next logical > line > 197 E126 continuation line over-indented for hanging indent > 575 E127 continuation line over-indented for visual indent > 1092 E128 continuation line under-indented for visual indent > 773 E201 whitespace after '(' > 540 E202 whitespace before ')' > 23543 E203 whitespace before ':' > 55 E211 whitespace before '(' I'd like to see E201, E202, and E211 fixed (whitespace next to parentheses). The count for E203 is surprisingly high - I suspect that could include some large dictionaries? Note some of the dictionaries are auto-generated so the code to do that would also need fixing. > 180 E221 multiple spaces before operator > 59 E222 multiple spaces after operator > 5848 E225 missing whitespace around operator > 6517 E231 missing whitespace after ',' > 2544 E251 no spaces around keyword / parameter equals > 644 E261 at least two spaces before inline comment > 346 E262 inline comment should start with '# ' > 156 E301 expected 1 blank line, found 0 > 1838 E302 expected 2 blank lines, found 1 > 364 E303 too many blank lines (2) > 15553 E501 line too long (82 > 79 characters) > 857 E502 the backslash is redundant between brackets Fixing E502 seems a good idea, I suspect many of these are purely accidental due to not realising when they are redundant. > 291 E701 multiple statements on one line (colon) > 122 E711 comparison to None should be 'if cond is None:' > 3707 W291 trailing whitespace > 1913 W293 blank line contains whitespace > > I'm not sure where to go from here with regard to what's worth fixing and > what would be considered repo churn (or gratuitous changes that make > merging of existing patches harder). 
> > I'd especially like to clean up E301, E302, E301 and E302 presumably are about the recommended spacing between function, class and method names? If you want to fix them next that seems low risk in terms of complicating merges. > ... E701, E711, W291 and W293. Did you already fix most of those in today's pull request? https://github.com/biopython/biopython/pull/101 If there are more cases, then by all means fix them too. > Other items like E251 are more dubious, as some developers > seem to prefer the current style. > > What do you think? We have a range of styles in the current code base reflecting different authors - and also changes in the Python conventions as some of the code is now over ten years old. And if any of my personal coding style is flagged, I'm willing to adapt ;) (e.g. I've learnt not to put a space before if statement colons) As you point out, the "repo churn" from fixing minor things like spaces around operators does have a cost in making merges a little harder. Things like the exception style updates which you've already fixed (seems I missed some) are more urgent for Python 3 support, so worth doing anyway. You've got us a lot closer to PEP8 compliance - do you think subject to a short white list of known cases (like module names) where we don't follow PEP8 we could aim to run a pep8 tool automatically (e.g. as a unit test, or even a commit hook)? That is quite appealing as a way to spot any new code which breaks the style guidelines... Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 09:02:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 14:02:40 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names?
In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 1:49 PM, Peter Cock wrote: > > Once that's done there is some housekeeping to do, like > the indexing code duplication with Bio.SeqIO, and tackling > indexing BGZF compressed files with Bio.SearchIO which > I will have a go at. > I've started work on SearchIO indexing of BGZF files now, enabling it was quite simple (the same code as used for the SeqIO indexing): https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f Thus far I've only tested this with BLAST XML, but that did require a bit of reworking to avoid doing file offset arithmetic: https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 I will resume this work later this afternoon, going over all the SearchIO file formats one by one. Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 11:49:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 16:49:47 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 2:02 PM, Peter Cock wrote: > > I've started work on SearchIO indexing of BGZF files now, > enabling it was quite simple (the same code as used for > the SeqIO indexing): > https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f > > Thus far I've only tested this with BLAST XML, but that did > require a bit of reworking to avoid doing file offset arithmetic: > https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 > > I will resume this work later this afternoon, going over all > the SearchIO file formats one by one. I've refactored test_SearchIO_index.py to make adding additional get_raw tests easier. Proper testing of all the formats with BGZF will need some larger test files (over 64k before compression) which we probably don't want to include in the repository.
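For context on the offset arithmetic being avoided here: a BGZF "virtual offset" is a 64-bit value that packs the compressed block's start position in the file into the upper 48 bits, and the offset within that block once decompressed (BGZF blocks hold at most 64 KiB) into the lower 16 bits. Bio.bgzf provides make_virtual_offset and split_virtual_offset for this; the stdlib-only sketch below just illustrates the packing:

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a BGZF virtual offset: upper 48 bits give the start of the
    compressed BGZF block in the file, lower 16 bits give the position
    within that block once decompressed (blocks are at most 64 KiB)."""
    if not 0 <= within_block_offset < 65536:
        raise ValueError("Within-block offset must fit in 16 bits")
    return (block_start_offset << 16) | within_block_offset

def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF

voffset = make_virtual_offset(100000, 10)
assert split_virtual_offset(voffset) == (100000, 10)
```

Because the block start lives in the high bits, virtual offsets still sort in file order, which is what lets the index_db SQLite tables treat them like ordinary offsets.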
However, I also added code to additionally test Bio.SearchIO.index_db(...).get_raw(...) as well as your original testing of Bio.SearchIO.index(...).get_raw(...) alone. These should return the exact same string, and that is now working nicely for BLAST XML (and BGZF from limited testing), but not on all the formats. Could you look at the difference in get_raw and the record length found during indexing for: blast-tab (with comments), hmmscan3-domtab, hmmer3-tab, and hmmer3-text? i.e. Anything where test_SearchIO_index.py is now printing a WARNING line when run. Thanks, Peter From christian at brueffer.de Mon Dec 3 12:02:31 2012 From: christian at brueffer.de (Christian Brueffer) Date: Tue, 04 Dec 2012 01:02:31 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> Message-ID: <50BCDB27.7040402@brueffer.de> On 12/3/12 21:34 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer > wrote: >> Hi, > > Hi Christian, > > Thanks for all the pull requests sorting out issues like this, in > terms of lines of code you'll probably be one of the top > contributors to the next release ;) This sort of work isn't as > high profile as new features or bug fixes, but has a more > subtle role in the long term of the project - making our code > easier to follow etc. So we do appreciate these contributions. > >> I just submitted pull request #102 which fixes several types of PEP8 >> warnings (found using the awesome pep8 tool). > > 101 not 102? https://github.com/biopython/biopython/pull/101 > 102 and 103 (I actually meant 103). >> Here's what's left after those fixes: >> >> $ pep8 --statistics -qq repos/biopython >> 789 E111 indentation is not a multiple of four > > That's nasty - although I think we've got rid of all the tabbed > indentation already which was also very annoying. > Some code uses two spaces etc, definitely worth fixing.
>> 673 E121 continuation line indentation is not a multiple of four > > I suspect many of those are a style judgement and done that > way to line up parentheses etc. > I'll see about those and apply case by case judgement. >> 693 E122 continuation line missing indentation or outdented >> 171 E123 closing bracket does not match indentation of opening bracket's >> line >> 86 E124 closing bracket does not match visual indentation >> 49 E125 continuation line does not distinguish itself from next logical >> line >> 197 E126 continuation line over-indented for hanging indent >> 575 E127 continuation line over-indented for visual indent >> 1092 E128 continuation line under-indented for visual indent >> 773 E201 whitespace after '(' >> 540 E202 whitespace before ')' >> 23543 E203 whitespace before ':' >> 55 E211 whitespace before '(' > > I'd like to see E201, E202, and E211 fixed (whitespace next to > parentheses). > > The count for E203 is surprisingly high - I suspect that > could include some large dictionaries? Note some of the > dictionaries are auto-generated so the code to do that > would also need fixing. > >> 180 E221 multiple spaces before operator >> 59 E222 multiple spaces after operator >> 5848 E225 missing whitespace around operator >> 6517 E231 missing whitespace after ',' >> 2544 E251 no spaces around keyword / parameter equals >> 644 E261 at least two spaces before inline comment >> 346 E262 inline comment should start with '# ' >> 156 E301 expected 1 blank line, found 0 >> 1838 E302 expected 2 blank lines, found 1 >> 364 E303 too many blank lines (2) >> 15553 E501 line too long (82 > 79 characters) >> 857 E502 the backslash is redundant between brackets > > Fixing E502 seems a good idea, I suspect many of these are > purely accidental due to not realising when they are redundant. > Agreed. 
>> 291 E701 multiple statements on one line (colon) >> 122 E711 comparison to None should be 'if cond is None:' >> 3707 W291 trailing whitespace >> 1913 W293 blank line contains whitespace >> >> I'm not sure where to go from here with regard to what's worth fixing and >> what would be considered repo churn (or gratuitous changes that make >> merging of existing patches harder). >> >> I'd especially like to clean up E301, E302, > > E301 and E302 presumable are about the recommended spacing > between function, class and method names? If you want to fix > them next that seems low risk in terms of complicating merges. > That and spacing between functions or between a function and a new class. >> ... E701, E711, W291 and W293. > > Did you already fix most of those in today's pull request? > https://github.com/biopython/biopython/pull/101 > > If there are more cases, then by all means fix them too. > I fixed some in Nexus, that was before actually using the pep8 tool. >> Other items like E251 are more dubious, as some developers >> seem to prefer the current style. >> >> What do you think? > > We have a range of styles in the current code base reflecting > different authors - and also changes in the Python conventions > as some of the code is now over ten years old. And if any of > my personal coding style is flagged, I'm willing to adapt ;) > > (e.g. I've learnt not to put a space before if statement colons) > > As you point out, the "repo churn" from fixing minor things > like spaces around operators does have a cost in making > merges a little harder. Things like the exception style updates > which you've already fixed (seems I missed some) are more > urgent for Python 3 support, so worth doing anyway. > On the other hand, it's basically a one-time cost. However I want to fix the lowest-hanging fruit (read: the ones with the lowest counts ;-) first. 
> You've got us a lot closer to PEP8 compliance - do you think > subject to a short white list of known cases (like module > names) where we don't follow PEP8 we could aim to run a > a pep8 tool automatically (e.g. as a unit test, or even a commit > hook)? That is quite appealing as a way to spot any new code > which breaks the style guidelines... > Having a commit hook would be ideal (maybe with a possibility to override). This would be especially useful against the introduction of gratuitous whitespace. With some editors/IDEs you don't even notice it. Chris From w.arindrarto at gmail.com Tue Dec 4 08:33:32 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 4 Dec 2012 14:33:32 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter and everyone, >> I've started work on SearchIO indexing of BGZF files now, >> enabling it was quite simple (the same code as used for >> SeqIO the indexing): >> https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f >> >> Thus far I've only tested this with BLAST XML, but that did >> require a bit of reworking to avoid doing file offset arithmetic: >> https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 >> >> I will resume this work later this afternoon, going over all >> the SearchIO file formats one by one. Yes, the original one that I wrote did have some less straightforward arithmetic as I was trying to adhere to the strict XML definition (i.e. no matter the whitespace outside of the start and end elements, indexing will still work). But line-based indexing should work too (and is simpler) so long as BLAST XML keeps its style (and any user modification afterwards doesn't introduce any wacky whitespaces). > I've refactored test_SearchIO_index.py to make adding > additional get_raw tests easier. 
Proper testing of all the > formats with BGZF will need some larger test files (over 64k > before compression) which we probably don't want to > include in the repository. > > However, I also added code to additionally test > Bio.SearchIO.index_db(...).get_raw(...) as well as your > original testing of Bio.SearchIO.index(...).get_raw(...) > alone. These should return the exact same string, and > that is now working nicely for BLAST XML (and BGZF > from limited testing), but not on all the formats. > > Could you look at the difference in get_raw and the > record length found during indexing for: blast-tab > (with comments), hmmscan3-domtab, hmmer3-tab, > and hmmer3-text? > > i.e. Anything where test_SearchIO_index.py is now > printing a WARNING line when run. Sure :). Based on a quick initial look, it seems that these are due to filler texts (e.g. the BLAST tab format ending with lines like "# BLAST processed 3 queries"). These texts won't affect the calculation results and the values of our objects, but do add additional text length. regards, Bow From redmine at redmine.open-bio.org Tue Dec 4 18:01:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 4 Dec 2012 23:01:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3399] (New) SearchIO hmmer3-text parser fails to parse hits that have large gaps Message-ID: Issue #3399 has been reported by Kai Blin. ---------------------------------------- Bug #3399: SearchIO hmmer3-text parser fails to parse hits that have large gaps https://redmine.open-bio.org/issues/3399 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: While trying to parse a hit that has a really bad match to the profile, there might be alignment lines that don't contain query sequence characters at all. In that case the SearchIO hmmer3-text module currently throws a ValueError
>>> it = SearchIO.parse('../broken.hsr', 'hmmer3-text')
>>> i = it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SearchIO/__init__.py", line 313, in parse
    for qresult in generator:
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 60, in __iter__
    for qresult in self._parse_qresult():
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 145, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 188, in _parse_hit
    hit_list = self._create_hits(hit_list, qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 309, in _create_hits
    self._parse_aln_block(hid, hit.hsps)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 358, in _parse_aln_block
    frag.query = aliseq
  File "Bio/SearchIO/_model/hsp.py", line 816, in _query_set
    self._query = self._set_seq(value, 'query')
  File "Bio/SearchIO/_model/hsp.py", line 784, in _set_seq
    len(seq), seq_type))
ValueError: Sequence lengths do not match. Expected: 202 (hit); found: 131 (query).
See the attached file broken.hsr for a dataset that triggers the error. If you remove the esterase hit (including the domain annotation), this error does not happen (broken2.hsr). If you insert fake position information into the query sequence line (broken3.hsr), the parser is happy again. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Dec 5 01:46:20 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 07:46:20 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi everyone, >> However, I also added code to additionally test >> Bio.SearchIO.index_db(...).get_raw(...) as well as your >> original testing of Bio.SearchIO.index(...).get_raw(...) >> alone. These should return the exact same string, and >> that is now working nicely for BLAST XML (and BGZF >> from limited testing), but not on all the formats. >> >> Could you look at the difference in get_raw and the >> record length found during indexing for: blast-tab >> (with comments), hmmscan3-domtab, hmmer3-tab, >> and hmmer3-text? >> >> i.e. Anything where test_SearchIO_index.py is now >> printing a WARNING line when run. > > Sure :). Based on a quick initial look, it seems that these are due to > filler texts (e.g. the BLAST > tab format ending with lines like "# BLAST processed 3 queries"). > These texts won't affect the calculation results and the values of our > objects, but does add additional text length. I've looked into this and submitted a pull request to fix the issues here: https://github.com/biopython/biopython/pull/111. The details on the errors are also there. 
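The invariant behind those WARNING lines — get_raw must return exactly the byte span measured while indexing — can be sketched with a toy line-based indexer over an in-memory file. Nothing below is Biopython code; the "Query:" format and the trailing comment line are invented to mimic the blast-tab case where filler text inflates the last record's length:

```python
from io import BytesIO

def index_queries(handle):
    """Toy indexer: map query id -> (offset, length), where a line
    starting with b'Query:' opens a new record.  Trailing comment
    lines get counted into the final record, which is exactly the
    sort of length bookkeeping discussed above."""
    offsets = {}
    key, start, pos = None, 0, 0
    for line in iter(handle.readline, b""):
        if line.startswith(b"Query:"):
            if key is not None:
                offsets[key] = (start, pos - start)
            key, start = line.split()[1].decode(), pos
        pos += len(line)
    if key is not None:
        offsets[key] = (start, pos - start)
    return offsets

def get_raw(handle, offset, length):
    """Return the raw bytes of one indexed record."""
    handle.seek(offset)
    return handle.read(length)

data = BytesIO(b"Query: q1\nhit A\nQuery: q2\nhit B\n# processed 2 queries\n")
index = index_queries(data)
assert get_raw(data, *index["q2"]) == b"Query: q2\nhit B\n# processed 2 queries\n"
```

Whether the final comment line should belong to the last record or be skipped is precisely the judgement call the pull request had to make per format.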
regards, Bow From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 02:24:14 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Wed, 05 Dec 2012 17:24:14 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. Message-ID: <50BEF69E.2000806@biotech.uni-tuebingen.de> Hi folks, I'm trying to finally get my hmmer2-text parser in, but I'm failing one unit test. The code is a bit too smart for me, it seems. So in the file I'm parsing, I only ever get the description of the hit in the hit table, like this (apologies if my mail client breaks this):

Model         Description                            Score   E-value   N
--------      -----------                            -----   -------  ---
Glu_synthase  Conserved region in glutamate synthas  858.6  3.6e-255    2

But of course I can't create a hit object when parsing the hit table, as I first need to have HSPFragments to create the hit object with. Anyway, I create a placeholder hit object that I'll later convert into a real Hit object. In that placeholder object, I set a description. Now I'm parsing the HSP table, looking like this:

Model     Domain  seq-f  seq-t     hmm-f  hmm-t     score   E-value
--------  ------  -----  -----     -----  -----     -----   -------
GATase_2  1/1        34    404 ..      1    385 []  731.8  3.9e-226

The HSP table is in a different order than the hit table, so never mind the different model name. Now, I need to create an HSPFragment with the same description as the Hit object, or querying for the Hit object's description will cascade through the HSPs and HSPFragments, and return multiple values for the description. However, no matter what I do, I seem to get an <unknown description> tossed in there somehow. The parser is at https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py the test code is at https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py and the test file that's failing is the hmmpfam2.3 file at https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out Any pointers would be appreciated.
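One pattern that avoids the mismatch Kai describes — kept deliberately independent of SearchIO's actual classes, which are not reproduced here — is to hold the description only on the placeholder and stamp it onto every fragment just before the real Hit is built, so a cascading attribute query can only ever see one value. A toy sketch with made-up class names:

```python
class FragmentSketch:
    """Stand-in for an HSP fragment; starts out with the kind of
    placeholder description Kai is seeing."""
    def __init__(self, hit_id):
        self.hit_id = hit_id
        self.hit_description = "<unknown description>"

class HitPlaceholder:
    """Collects fragments during parsing; the description is already
    known from the hit table before any fragment exists."""
    def __init__(self, hit_id, description):
        self.hit_id = hit_id
        self.description = description
        self.fragments = []

    def finalize(self):
        # Stamp the description onto every fragment, so a cascading
        # lookup returns a single value rather than a mixture.
        for frag in self.fragments:
            frag.hit_description = self.description
        return self.fragments

placeholder = HitPlaceholder("Glu_synthase",
                             "Conserved region in glutamate synthas")
placeholder.fragments.append(FragmentSketch("Glu_synthase"))
fragments = placeholder.finalize()
assert all(f.hit_description == "Conserved region in glutamate synthas"
           for f in fragments)
```

The point is only the ordering: defaults are overwritten in one place, after parsing and before construction, instead of at each fragment-creation site.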
The code is working fine in my current development work in general, and I'd love to get it upstream to get rid of an extra patch step during installation. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Wed Dec 5 06:41:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 11:41:05 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto wrote: > Hi everyone, > > I've done some digging around to see how to deal with these issues. > Here's what I found: > >> The BuildBot flagged two new issues overnight, >> http://testing.open-bio.org/biopython/tgrid >> >> Python 2.5 on Windows - doctests are failing due to floating point decimal place >> differences in the exponent (down to C library differences, something fixed in >> later Python releases). Perhaps a Python 2.5 hack is the way to go here? >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio > > I've submitted a pull request to fix this here: > https://github.com/biopython/biopython/pull/98 The Windows detection wasn't quite right, it should now match how we look for Windows elsewhere in Biopython: https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 >> There is a separate cross-platform issue on Python 3.1, "TypeError: >> invalid event tuple" again with XML parsing. Curiously this had started >> a few days back in the UniprotIO tests on one machine, pre-dating the >> SearchIO merge. I'm not sure what triggered it.
>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio > > As for this one, it seems that it's caused by a bug in Python3.1 > (http://bugs.python.org/issue9257) due to the way > `xml.etree.cElementTree.iterparse` accepts the `events` argument. Ah - I remember that bug now, we have a hack in place elsewhere to try and avoid that - seems it won't be fixed in Python 3.1.x now so I've relaxed the version check here: https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e Hopefully that will bring the buildbot back to all green tonight. (TravisCI has now dropped their Python 3.1 support, but they should have Python 3.3 with NumPy working soon). Peter From p.j.a.cock at googlemail.com Wed Dec 5 09:16:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 14:16:43 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BCDB27.7040402@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer wrote: >> As you point out, the "repo churn" from fixing minor things >> like spaces around operators does have a cost in making >> merges a little harder. Things like the exception style updates >> which you've already fixed (seems I missed some) are more >> urgent for Python 3 support, so worth doing anyway. >> > > On the other hand, it's basically a one-time cost. However I > want to fix the lowest-hanging fruit (read: the ones with the > lowest counts ;-) first. The sheer number of files touched in these PEP8 fixes would probably deserve to be called "repository churn" now - wow!
Although we have good test coverage, it isn't complete (anyone fancy trying some test coverage measuring tools like figleaf?) so there is a small but real risk we've accidentally broken something. I'm wondering if therefore a 'beta' release would be prudent, or if I am just worrying about things too much? >> You've got us a lot closer to PEP8 compliance - do you think >> subject to a short white list of known cases (like module >> names) where we don't follow PEP8 we could aim to run a >> pep8 tool automatically (e.g. as a unit test, or even a commit >> hook)? That is quite appealing as a way to spot any new code >> which breaks the style guidelines... > > Having a commit hook would be ideal (maybe with a possibility to > override). This would be especially useful against the introduction of > gratuitous whitespace. With some editors/IDEs you don't even notice it. Would you be interested in looking into how to set that up? Presumably a client-side git hook would be best, but we'd need to explore cross platform issues (e.g. developing and testing on Windows) and making sure it allowed an override on demand (where the developer wants/needs to ignore a style warning). Thanks, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 08:50:21 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 13:50:21 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer Message-ID: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following:

label_position: start|middle|end as per LinearDrawer
label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour
label_orientation: upright|circular which determines the orientation of the label.
upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse. This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? ..d The University of Dundee is a registered Scottish Charity, No: SC015096 From christian at brueffer.de Wed Dec 5 10:28:19 2012 From: christian at brueffer.de (Christian Brueffer) Date: Wed, 05 Dec 2012 23:28:19 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: <50BF6813.4070102@brueffer.de> On 12/5/12 22:16 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer > wrote: >>> As you point out, the "repo churn" from fixing minor things >>> like spaces around operators does have a cost in making >>> merges a little harder. Things like the exception style updates >>> which you've already fixed (seems I missed some) are more >>> urgent for Python 3 support, so worth doing anyway. >>> >> >> On the other hand, it's basically a one-time cost. However I >> want to fix the lowest-hanging fruit (read: the ones with the >> lowest counts ;-) first. > > The sheer number of files touched in these PEP8 fixes would > probably deserve to be called "repository churn" now - wow! > I wonder whether there's a file left I haven't touched yet (except the data files in Tests)... > Although we have good test coverage, it isn't complete (anyone > fancy trying some test coverage measuring tools like figleaf?) > so there is a small but real risk we've accidentally broken > something. I'm wondering if therefore a 'beta' release would > be prudent, or if I am just worrying about things too much? > It certainly can't hurt to advise users to have an extra eye on possible regressions and strange behaviours in existing code.
I think the only risky changes were the ones concerning indentation (f68d334b1edfd743fe8a7bb4654046295f0ff939); I was extra careful about those. So, I'm pretty confident I haven't screwed things up but it's good to be careful. FYI, here's the "pep8 --statistics -qq" output as of commit df4f12965a2ad3b6ed31bbf9d201bd5c716bd4ee:

680 E121 continuation line indentation is not a multiple of four
691 E122 continuation line missing indentation or outdented
171 E123 closing bracket does not match indentation of opening bracket's line
86 E124 closing bracket does not match visual indentation
197 E126 continuation line over-indented for hanging indent
601 E127 continuation line over-indented for visual indent
1072 E128 continuation line under-indented for visual indent
772 E201 whitespace after '('
536 E202 whitespace before ')'
23444 E203 whitespace before ':'
94 E221 multiple spaces before operator
11 E222 multiple spaces after operator
5763 E225 missing whitespace around operator
6519 E231 missing whitespace after ','
2542 E251 no spaces around keyword / parameter equals
622 E261 at least two spaces before inline comment
347 E262 inline comment should start with '# '
1044 E302 expected 2 blank lines, found 1
1 E303 too many blank lines (2)
15526 E501 line too long (82 > 79 characters)
3 E711 comparison to None should be 'if cond is None:'
75 W291 trailing whitespace
12 W293 blank line contains whitespace
5 W601 .has_key() is deprecated, use 'in'

E203 looks scary, but 9900 of those are in Bio/SubsMat/MatrixInfo.py alone. >>> You've got us a lot closer to PEP8 compliance - do you think >>> subject to a short white list of known cases (like module >>> names) where we don't follow PEP8 we could aim to run >>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>> hook)? That is quite appealing as a way to spot any new code >>> which breaks the style guidelines... >> >> Having a commit hook would be ideal (maybe with a possibility to >> override).
This would be especially useful against the introduction of >> gratuitous whitespace. With some editors/IDEs you don't even notice it. > > Would you be interested in looking into how to set that up? > Presumably a client-side git hook would be best, but we'd > need to explore cross platform issues (e.g. developing and > testing on Windows) and making sure it allowed an override > on demand (where the developer wants/needs to ignore a > style warning). > Yes, It's fairly high on my TODO list. Chris From p.j.a.cock at googlemail.com Wed Dec 5 10:57:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 15:57:44 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 1:50 PM, David Martin wrote: > Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. > > I'd like to modify the CircularDrawer feature drawing to allow the following: > > label_position: start|middle|end as per LinearDrawer I would find it natural if we treated start/middle/end from the point of view of the feature (and its strand) as in the LinearDrawer. However the current circular drawer tries to position things at the vertical bottom of the feature (it cares about the left and right halves of the circle) which is rather different. I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. 
> label_placement: inside|outside|overlap where inside and outside are > anchored just inside and just outside the feature but do not overlap it, > and overlap is the current behaviour If I have understood your intended meaning, that won't work nicely with stranded features. I would suggest two options: outside (i.e. outside the feature's bounding box, either outside the track circle for forward strand or strand-less, or inside the track circle for reverse strand) matching the current linear code, or inside matching the current circular code. I.e. this would essentially toggle the text element's anchoring between start/end, maintaining the convention that labels above/outside the track are for the forward strand (and strand-less) features, while labels below/inside the track are for reverse strand features. > label_orientation: upright|circular which determines the orientation of > the label. upright is the current behaviour. Circular would be oriented > to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). > This will cause some issues with track widths (how can you specify a > track width for a feature track?) Do you mean how to allocate more white space between the tracks to ensure the labels have a clear background if printed outside the features? The quick and dirty solution is a spacer track (you can allocate track numbers to leave a gap). > Any thoughts/suggestions? > Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week).
Regards, Peter From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 11:28:26 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:28:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On 5 Dec 2012, at Wednesday, December 5, 15:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 1:50 PM, David Martin > wrote: label_position: start|middle|end as per LinearDrawer I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. Yep - I agree label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). Good point - the automatic reorientation on either side of the circle (to respect the viewer's local gravity) could effectively be handled through a working label_angle for circular diagrams. And more adventurous manual reorientation would also be possible ;) One issue there is what the angle is defined with respect to: a 'vertical' reference on the page, or a tangent/normal to some point on the feature. The first is straightforward, and might be what we want - the second will likely result in some odd - or attractive - patterns. Comments in-line, if need be we could meet up to hash some of this out in person (although I not be in the Dundee area next week). Friday's good for me. L. 
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From ben at benfulton.net Wed Dec 5 11:28:52 2012 From: ben at benfulton.net (Ben Fulton) Date: Wed, 5 Dec 2012 11:28:52 -0500 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: I've been studying this a bit and have a preference for Ned Batchelder's Coverage tool. But I plan on putting some more work into it this week and next. 
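[Editorial note: the idea behind the coverage tools mentioned above can be sketched without any third-party package. A line-coverage tool records which lines execute while the tests run; the toy below does this with the standard library's sys.settrace. The gc_fraction function and the single "test" call are invented examples, not Biopython code.]

```python
import sys

def gc_fraction(seq):
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)

executed = set()
first = gc_fraction.__code__.co_firstlineno

def tracer(frame, event, arg):
    # Record 'line' events fired inside gc_fraction only.
    if event == "line" and frame.f_code.co_name == "gc_fraction":
        executed.add(frame.f_lineno - first)
    return tracer

sys.settrace(tracer)
gc_fraction("ACGT")  # our only "test" never exercises the empty-input branch
sys.settrace(None)

# Offsets of executed lines relative to the "def" line; the untested
# "return 0.0" branch (offset 2) is missing from the report.
print(sorted(executed))
```

A real tool like coverage.py does the same bookkeeping for every file, then reports the untested lines - here it would flag the empty-sequence branch.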
On Wed, Dec 5, 2012 at 9:16 AM, Peter Cock wrote >Although we have good test coverage, it isn't complete (anyone >fancy trying some test coverage measuring tools like figleaf?) From w.arindrarto at gmail.com Wed Dec 5 11:39:13 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 17:39:13 +0100 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: <50BEF69E.2000806@biotech.uni-tuebingen.de> References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: Hi Kai and everyone, Very happy to see the parser near completion (with tests too!). The issue you're facing is unfortunately the consequence of trying to keep attribute values in sync across the object hierarchy. It is a bit troublesome for now, but not without solution. > However, no matter what I do, I seem to get an > tossed in there somehow. > > The parser is at > https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py > the test code is at > https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py > and the test file that's failing is the hmmpfam2.3 file at > https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out '' is the default value for any description attribute (be it in the QueryResult object, or in the HSPFragment.hit_description). The error you're seeing is because the hit description is being accessed through the hit object (hit.description) and the cascading property getter checks first whether all HSP contains the same `hit_description` attribute value. It'll only return the value if all HSPFragment.hit_description values are equal. Otherwise, it'll raise the error you're seeing here. In your case, there are two values: 'Conserved region in glutamate synthas' and '', while there should only be one (the first one). 
After prodding here and there, it seems that this is caused by the if clause here: https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py#L191 The 'else' clause in that block adds the HSP to the hit object, but does not do any cascading attribute assignment (query_description and hit_description). Here, the simple fix would be to force a description assignment to the HSP. For example, you could have the `else` block like so:

    ...
    else:
        hit = unordered_hits[id_]
        hsp.hit_description = hit.description
        hit.append(hsp)

Other fixes are of course possible, but this is the simplest I can imagine (though it seems a bit crude). Also, I would like to note that the query description assignment of the parser may break the cascade as well. If you try to access `qresult.description` (qresult being the QueryResult object), you'd get the true query description. But if you try to access it from `qresult[0].query_description` (the query description stored in the hit object), you'd get ''. The fix here would be to assign the description at the last moment before the QueryResult object is yielded. That way, the cascading setter works properly and all Hit, HSP, and HSPFragment inside the QueryResult object will contain the same value. I realize that this approach is not without flaws (and I'm always open to suggestions), but at the moment this seems to be the most sensible way to keep the attribute values in-sync while keeping the objects more user-friendly (i.e. making the parser slightly more complex to write, but with the result of consistent attribute values to the users). Hope this helps!
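[Editorial note: the failure mode Bow describes can be made concrete with a toy model of the cascading getter. These classes and the sample description are invented for illustration - Bio.SearchIO's real implementation is richer - but they show the invariant the parser's `else` branch was breaking: the hit-level description only resolves when every child HSP agrees.]

```python
class HSP:
    """Toy stand-in for a SearchIO HSP; '' is the default description."""
    def __init__(self, hit_description=''):
        self.hit_description = hit_description

class Hit:
    """Toy stand-in for a SearchIO Hit with a cascading description."""
    def __init__(self, hsps):
        self._hsps = list(hsps)

    @property
    def description(self):
        # Cascading getter: every child HSP must agree, else raise.
        values = set(hsp.hit_description for hsp in self._hsps)
        if len(values) > 1:
            raise ValueError("inconsistent hit_description values: %r" % values)
        return values.pop()

    def append(self, hsp):
        # The fix from the thread: sync the description before appending.
        hsp.hit_description = self.description
        self._hsps.append(hsp)

hit = Hit([HSP('glutamate synthase region')])
hit.append(HSP())  # without the sync in append(), this would desynchronise
print(hit.description)  # prints 'glutamate synthase region'
```

Dropping the `hsp.hit_description = self.description` line reproduces the ValueError Kai was seeing, since the appended HSP keeps its default '' description.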
Bow From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 11:21:06 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:21:06 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? 
If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? [cid:4EA13CE3-20E7-41D8-870F-CBBAA9DD06B0 at scri.sari.ac.uk] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png Type: image/png Size: 22969 bytes Desc: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png URL: From d.m.a.martin at dundee.ac.uk Wed Dec 5 11:29:14 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 16:29:14 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Just got my head out of hacking at this. The options I have now are: label_position: start|middle|end with reference to the feature. So the end is always the pointy bit. label_orientation: circular|upright Sometimes it is nice to have a proper circular plot label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. It even works. Angles and so on are not so relevant with circular plots though I would prefer a label_angle: radial|tangent|[degrees] Should I attach an example?
..d From: Leighton Pritchard [mailto:Leighton.Pritchard at hutton.ac.uk] Sent: 05 December 2012 16:21 To: David Martin Cc: BioPython-Dev; Peter Cock Subject: Re: [Biopython-dev] Modifications to CircularDrawer Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? 
[cid:image001.png at 01CDD305.AA06C500] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22969 bytes Desc: image001.png URL: From p.j.a.cock at googlemail.com Wed Dec 5 11:57:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 16:57:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: > Just got my head out of hacking at this. The options I have now are: > > label_position: start|middle|end with reference to the feature. So the end is > always the pointy bit. Sounds good and uncontentious. > label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). > label_placement: inside|outside|overlap|strand which maintains overlap as > default, inside is all inside, outside is all outside, strand is forward outside > and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? Note the current circular behaviour which overlaps is strand aware, so those may not be the best names... See also my earlier email with an alternative suggestion. > It even works.
Angles and so on are not so relevant with circular plots > though I would prefer a label_angle: radial|tangent|[degrees] > > Should I attach an example? You can try if the files are not overly large (moderation delays will still occur), posting a link would be easier although probably less lasting. Are you OK with github? A natural option would be to show us your proposals on a branch (separate commits if possible, otherwise I can try and break out each bit if needed). Ta, Peter From p.j.a.cock at googlemail.com Wed Dec 5 12:24:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 17:24:08 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains overlap as >> default, inside is all inside, outside is all outside, strand is forward outside >> and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be done > to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well.
Regards, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 12:30:26 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 17:30:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5EE5@AMSPRD0410MB351.eurprd04.prod.outlook.com> -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: 05 December 2012 17:24 To: David Martin Cc: Leighton Pritchard; BioPython-Dev Subject: Re: [Biopython-dev] Modifications to CircularDrawer On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains >> overlap as default, inside is all inside, outside is all outside, >> strand is forward outside and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be > done to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well. Linear and Circular are similar but not identical. No problem with having a above|below|strand or a more complex anchoring scheme but I don't need it right now so I'm just playing with the circular one. I've attached a PDF to this mail - it might get through and I'll try to fork/clone/push git. ..d The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: plasmid_circular_nice.pdf Type: application/pdf Size: 148125 bytes Desc: plasmid_circular_nice.pdf URL: From p.j.a.cock at googlemail.com Wed Dec 5 13:41:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 18:41:59 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi David, I've been experimenting with your pull request, thank you: https://github.com/biopython/biopython/pull/116 On Wed, Dec 5, 2012 at 5:22 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 5:10 PM, David Martin wrote: >> In the mean-time here is a plot (that doesn't show all layouts) > > Nice. Looking at that now I'm pretty sure I hacked the label anchor > once before as a quick job in order to get the labels outside like that... > certainly worth making this change. Found it, that change made it to a branch I'd forgotten about: https://github.com/peterjc/biopython/commit/d4764dfe929f135ec55b83ad14a9cd34e2d14bba This is bringing back memories... I think I'd concluded last time that attempting to offer anything other than radial label orientation was probably a mistake, and that if we restrict that we can safely offset the vertical position of the text midline (since right now it is positioned according to the bottom line of the font). Without that, positioning labels at the top (as you look at the page) of a circular feature gave non-ideal placement. This is likely one reason for the current hard-coded placement of the feature labels at the bottom (as you look at the circle). Hmm.
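[Editorial note: the geometry being debated here can be sketched in a few lines. This is a standalone illustration with invented parameter names, not GenomeDiagram's actual code: given a feature's angle on the circle, place the label just inside or outside the track radius, rotate it radially, and flip it on the left half so the text never reads upside down.]

```python
import math

def radial_label(center, radius, angle_deg, placement="outside", pad=5.0):
    """Return (x, y, rotation_deg) for a radial feature label.

    angle_deg is measured clockwise from 12 o'clock, as on a plasmid map.
    """
    r = radius + pad if placement == "outside" else radius - pad
    theta = math.radians(90.0 - angle_deg)   # convert to the maths convention
    x = center[0] + r * math.cos(theta)
    y = center[1] + r * math.sin(theta)
    rotation = 90.0 - angle_deg              # text runs along the radius
    if 180.0 < angle_deg % 360.0 < 360.0:
        rotation += 180.0                    # flip labels on the left half
    return x, y, rotation

# A feature at 3 o'clock, labelled just outside a track of radius 100:
print(radial_label((0.0, 0.0), 100.0, 90.0))  # -> (105.0, 0.0, 0.0)
```

The flip on the left half is the "local gravity" point from earlier in the thread: without it, labels between 6 and 12 o'clock would render upside down.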
I think I have a compromise forming that would allow figures like your motivating example :) Peter From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 20:44:40 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Thu, 06 Dec 2012 11:44:40 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: <50BFF888.50300@biotech.uni-tuebingen.de> On 2012-12-06 02:39, Wibowo Arindrarto wrote: Hi Bow, everyone, > Very happy to see the parser near completion (with tests too!). The > issue you're facing is unfortunately the consequence of trying to keep > attribute values in sync across the object hierarchy. It is a bit > troublesome for now, but not without solution. ... > Here, the simple fix would be to force a description assignment to the > HSP. For example, you could have the `else` block like so: > > ... > else: > hit = unordered_hits[id_] > hsp.hit_description = hit.description > hit.append(hsp) Thanks for the tip, that was the last speedbump I had. I just sent off the pull request for the hmmer2 parser. Thanks again for the help, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From christian at brueffer.de Wed Dec 5 23:04:37 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 12:04:37 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BF6813.4070102@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> Message-ID: <50C01955.8060505@brueffer.de> On 12/05/2012 11:28 PM, Christian Brueffer wrote: > On 12/5/12 22:16 , Peter Cock wrote: [...]
> >>>> You've got us a lot closer to PEP8 compliance - do you think >>>> subject to a short white list of known cases (like module >>>> names) where we don't follow PEP8 we could aim to run >>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>> hook)? That is quite appealing as a way to spot any new code >>>> which breaks the style guidelines... >>> >>> Having a commit hook would be ideal (maybe with a possibility to >>> override). This would be especially useful against the introduction of >>> gratuitous whitespace. With some editors/IDEs you don't even notice it. >> >> Would you be interested in looking into how to set that up? >> Presumably a client-side git hook would be best, but we'd >> need to explore cross platform issues (e.g. developing and >> testing on Windows) and making sure it allowed an override >> on demand (where the developer wants/needs to ignore a >> style warning). >> > > Yes, it's fairly high on my TODO list. > I just had a look at this. Turns out some people have had this idea before :-) Here's a first version: https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit Basically you just save this as biopython/.git/hooks/pre-commit and mark it executable. You also need to install pep8 (pip install pep8). The checks can be bypassed with git commit --no-verify. Currently it ignores E124 (which I think should remain that way). Any other errors or files it should ignore? I'd be grateful if someone could give this a try on Windows.
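For reference, the core logic of such a hook is small. A hypothetical minimal sketch in Python (the linked pre-commit script is the real, more thorough version; `pep8` here is the command-line tool installed via pip, and the helper names are made up for illustration):

```python
import subprocess


def staged_python_files():
    """Return staged .py files (added/copied/modified) from git's index."""
    try:
        out = subprocess.check_output(
            ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"])
    except (OSError, subprocess.CalledProcessError):
        # Not in a git repository (or git missing) - nothing to check.
        return []
    return [f for f in out.decode().splitlines() if f.endswith(".py")]


def run_style_check():
    """Run pep8 over the staged files; nonzero return aborts the commit."""
    files = staged_python_files()
    if not files:
        return 0
    try:
        # E124 (closing bracket indentation) is deliberately ignored,
        # as discussed above.
        return subprocess.call(["pep8", "--ignore=E124"] + files)
    except OSError:
        # pep8 itself is not installed; don't block the commit.
        return 0
```

Saved as `.git/hooks/pre-commit` with a shebang and an exit based on `run_style_check()`, it behaves as described above, including the `git commit --no-verify` escape hatch.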
Chris From christian at brueffer.de Thu Dec 6 01:22:24 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 14:22:24 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50C01955.8060505@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> <50C01955.8060505@brueffer.de> Message-ID: <50C039A0.8040208@brueffer.de> On 12/06/2012 12:04 PM, Christian Brueffer wrote: > On 12/05/2012 11:28 PM, Christian Brueffer wrote: >> On 12/5/12 22:16 , Peter Cock wrote: > [...] >> >>>>> You've got us a lot closer to PEP8 compliance - do you think >>>>> subject to a short white list of known cases (like module >>>>> names) where we don't follow PEP8 we could aim to run a >>>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>>> hook)? That is quite appealing as a way to spot any new code >>>>> which breaks the style guidelines... >>>> >>>> Having a commit hook would be ideal (maybe with a possibility to >>>> override). This would be especially useful against the introduction of >>>> gratuitous whitespace. With some editors/IDEs you don't even notice >>>> it. >>> >>> Would you be interested in looking into how to set that up? >>> Presumably a client-side git hook would be best, but we'd >>> need to explore cross platform issues (e.g. developing and >>> testing on Windows) and making sure it allowed an override >>> on demand (where the developer wants/needs to ignore a >>> style warning). >>> >> >> Yes, It's fairly high on my TODO list. >> > > I just had a look at this. Turns out some people have had this idea > before :-) > > Here's a first version: > > https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit > > Basically you just save this as biopython/.git/hooks/pre-commit and mark > it executable. You also need to install pep8 (pip install pep8). The > checks can be bypassed with git commit --no-verify. 
> > Currently it ignores E124 (which I think should remain that way). Any > other errors or files it should ignore? > > I'd be grateful if someone could give this a try on Windows. > Thinking about it, I think it would make sense to ignore the following:

E121 continuation line indentation is not a multiple of four
E122 continuation line missing indentation or outdented
E123 closing bracket does not match indentation of opening bracket's line
E124 closing bracket does not match visual indentation
E126 continuation line over-indented for hanging indent
E127 continuation line over-indented for visual indent
E128 continuation line under-indented for visual indent

They all deal with indentation, but are not always beneficial to readability. E125, which is a useful one, is missing from that list. Chris From p.j.a.cock at googlemail.com Thu Dec 6 05:07:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:07:55 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Wed, Dec 5, 2012 at 11:41 AM, Peter Cock wrote: > On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I've done some digging around to see how to deal with these issues. >> Here's what I found: >> >>> The BuildBot flagged two new issues overnight, >>> http://testing.open-bio.org/biopython/tgrid >>> >>> Python 2.5 on Windows - doctests are failing due to floating point decimal place >>> differences in the exponent (down to C library differences, something fixed in >>> later Python releases). Perhaps a Python 2.5 hack is the way to go here?
http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio >> >> I've submitted a pull request to fix this here: >> https://github.com/biopython/biopython/pull/98 > > The Windows detection wasn't quite right, it should now match > how we look for Windows elsewhere in Biopython: > https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 > >>> There is a separate cross-platform issue on Python 3.1, "TypeError: >>> invalid event tuple" again with XML parsing. Curiously this had started >>> a few days back in the UniprotIO tests on one machine, pre-dating the >>> SearchIO merge. I'm not sure what triggered it. >>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >>> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio >> >> As for this one, it seems that it's caused by a bug in Python 3.1 >> (http://bugs.python.org/issue9257) due to the way >> `xml.etree.cElementTree.iterparse` accepts the `events` argument. > > Ah - I remember that bug now, we have a hack in place elsewhere > to try and avoid that - seems it won't be fixed in Python 3.1.x now > so I've relaxed the version check here: > https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e > > Hopefully that will bring the buildbot back to all green tonight. > (TravisCI has now dropped their Python 3.1 support, but they > should have Python 3.3 with NumPy working soon). > > Peter OK, the buildbot looks happy now from the SearchIO work. There is one issue under Python 3.1.5 on a 64 bit Linux server, which I suspect is down to the Python version (this buildslave used to run an older version, Python 3.1.3; separate email to follow).
Regards, Peter From p.j.a.cock at googlemail.com Thu Dec 6 05:24:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:24:47 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? Message-ID: On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: > > OK, the buildbot looks happy now from the SearchIO work. > > There is one issue under Python 3.1.5 on a 64 bit Linux server, > which I suspect is down to the Python version (this buildslave > used to run an older version - Python 3.1.3 (separate email > to follow). There are 18 test failures like this - all to do with handles and stdout, which have been happening for a while now but I've not found time to look into it. Example: ====================================================================== ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) needle with asis trick, output piped to stdout. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 74, in __next__ line = self._header AttributeError: 'EmbossIterator' object has no attribute '_header' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", line 571, in test_needle_piped align = AlignIO.read(child.stdout, "emboss") File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 418, in read first = next(iterator) File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 366, in parse for a in i: File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 77, in __next__ line = 
handle.readline() AttributeError: '_io.FileIO' object has no attribute 'read1' Last working build, Python 3.1.3, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 Next build (after a couple of weeks offline while this server was being rebuilt), Python 3.1.5, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 The timing does suggest an issue introduced in the rebuild, and the obvious difference is the version of Python jumped from 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). There were some security fixes only in Python 3.1.5, none of which sound relevant here: http://www.python.org/download/releases/3.1.5/ The change log for Python 3.1.4 is longer, and does mention stdout/stderr issues so this is perhaps the cause: hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS See also http://bugs.python.org/issue4996 as possibly related. The whole Python 3 text vs binary handle issue is important with stdout/stderr. What I am doing now is testing those two commits (with Python 3.1.5) to confirm they both fail, and thus rule out a Biopython code change in those two weeks being to blame. Peter From p.j.a.cock at googlemail.com Thu Dec 6 05:45:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:45:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: On Thu, Dec 6, 2012 at 10:24 AM, Peter Cock wrote: > On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: >> >> OK, the buildbot looks happy now from the SearchIO work.
>> >> There is one issue under Python 3.1.5 on a 64 bit Linux server, >> which I suspect is down to the Python version (this buildslave >> used to run an older version - Python 3.1.3 (separate email >> to follow). > > There are 18 test failures like this - all to do with handles and stdout, > which have been happening for a while now but I've not found time > to look into it. Example: > > ====================================================================== > ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) > needle with asis trick, output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 74, in __next__ > line = self._header > AttributeError: 'EmbossIterator' object has no attribute '_header' > > During handling of the above exception, another exception occurred: > > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", > line 571, in test_needle_piped > align = AlignIO.read(child.stdout, "emboss") > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 418, in read > first = next(iterator) > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 366, in parse > for a in i: > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 77, in __next__ > line = handle.readline() > AttributeError: '_io.FileIO' object has no attribute 'read1' > > Lasting working build, Python 3.1.3, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio > 
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 > > Next build (after a couple of weeks offline while this server was > being rebuilt), Python 3.1.5, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio > https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 > > The timing does suggest an issue introduced in the rebuild, and > the obvious difference is the version of Python jumped from > 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). > > There were some security fixes only in Python 3.1.5, none of > which sound relevant here: > http://www.python.org/download/releases/3.1.5/ > > The change log for Python 3.1.4 is longer, and does mention > stdout/stderr issues so this is perhaps the cause: > hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS > > See also http://bugs.python.org/issue4996 as possibly > related. The whole Python 3 text vs binary handle issue > is important with stdout/stderr. > > What I am doing now is testing those two commits (with > Python 3.1.5) to confirm they both fail, and thus rule out > a Biopython code change in those two weeks being to > blame. > > Peter Confirmed, using test_Emboss.py and Python 3.1.5 on this machine (running as the buildslave user using the same Python 3.1.5 installation), using the current tip 5092e0e9f2326da582158fd22090f31547679160 and the two commits mentioned above, that is e90db11f4a1d983bc2bfe12bec30edbdbb200634 and 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - all three builds show the same failure. i.e. The failure is not due to a change in Biopython between those commits, but is in some way caused by a change to the buildslave environment. My first suggestion that this is due to Python 3.1.3 -> 3.1.5 remains my prime suspect. I could try downgrading Python 3.1 on this machine to confirm that I suppose... or updating Python 3.1 on another machine? 
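A workaround sketch, based only on the traceback above (this is an assumption about the cause, not a confirmed fix): the parser appears to receive a raw `_io.FileIO` object, and `read1()` is a method that only buffered streams provide, so wrapping raw handles before parsing would restore it. The helper name is hypothetical:

```python
import io


def ensure_buffered(handle):
    """Wrap a raw binary handle (e.g. a subprocess stdout pipe) in a
    BufferedReader so it offers read1(), which raw _io.FileIO lacks.

    Buffered and text-mode handles are returned unchanged.
    """
    if isinstance(handle, io.RawIOBase):
        return io.BufferedReader(handle)
    return handle
```

Something like `AlignIO.read(ensure_buffered(child.stdout), "emboss")` would then sidestep the missing `read1()` regardless of how the interpreter set up the pipe.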
The other recent Python 3.1 buildbot runs were both using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). Can anyone else reproduce this, or have an idea what the fix might be? Regards, Peter From Leighton.Pritchard at hutton.ac.uk Thu Dec 6 07:28:39 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 6 Dec 2012 12:28:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, I'm starting to remember why I left circular labelling options alone ;) On 5 Dec 2012, at Wednesday, December 5, 16:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 4:29 PM, David Martin > wrote: label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). I still don't like 'upright' - but that's a naming issue, rather than one of functionality. label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? 'Below' and 'above' are context- (and viewer!) dependent: on a circular diagram 'above' on a feature at 12 o'clock is on the opposite side of the feature when it's 'above' at 6 o'clock. It's not clear what either would mean for a feature at 3 o'clock or 9 o'clock. 'Inside' and 'outside' are stably relative to the circular track for a feature at any position on the circle, so I prefer them as settings. 
I'm not keen on 'overlap' or 'strand', as I'm not clear what kind of label orientation they refer to: for example, what is being 'overlapped'? Looking at the .pdf, it seems like you've anchored the green labels to the track, rather than to the feature, which I think looks good there - but I'd like to have the option of track vs feature anchoring available via an argument like 'label_anchor', which could be distinguished from 'label_text_anchor'. Including this choice, my preferred arguments would be something like: label_direction='clockwise'|'anticlockwise' - 'clockwise': The text looks like it's progressing clockwise (like the green text in the .pdf); 'anticlockwise' like the blue text. By choosing 'clockwise' or 'anticlockwise' for the appropriate group of features, we achieve part of what I think you might mean by 'upright' (i.e. clockwise from pi/2 to 3pi/2, anticlockwise elsewhere). That could be handled with an 'auto' option. This argument essentially dictates label_angle for each feature: more of which later. It would be nice to have synonyms of 'counterclockwise', 'anticlockwise' and 'widdershins' ;) label_anchor='track'|'feature' Describes what element the text bounding box will be anchored to. label_text_anchor='start'|'end' Which part of the text bounding box (relative to the text) gets anchored. I think it's a good idea to have this wrap a lower-level setting that has label_text_anchor=float, as a relative location on the feature, where start=0, center=0.5, end=1, and values beyond that offer a label separation, relative to the label size - though I can't imagine why I'd use it over the option below - since spacing would depend on bounding box size - the flexibility could be useful, and you'd have to do that calculation anyway ;) label_placement='inner'|'outer' Do we anchor on the track/feature towards the circle centre (inner) or on the other side (outer)? 
I think it's a good idea to have this wrap a lower-level representation that has label_placement=float, as a relative location on the feature, where inner=-1,outer=1 as a proportion of track/feature height, and other values place the anchor relative to the feature/track boundary - this again offers a choice of label separation, but one that's uniform for all features. label_position='start'|'end'|'center' Where, relative to the feature, do we anchor? I think it's a good idea to have this wrap a lower-level representation that has label_position=[0,1], as a relative location on the feature, where start=0, center=0.5, end=1. That gives more flexibility for those who want it (and you have to do the calculation, anyway). label_orientation='radial'|'horizontal' Fairly obviously, 'radial' = as it is now, and 'horizontal' is reading like regular text. But this one's a tricky one, which is why all the labels are radial at the moment ;) I think that this choice has to either live with ('radial') or override ('horizontal') the label_direction argument. As with label_direction, this essentially dictates label_angle for each individual feature, which has its own issues (what do we measure the angle relative to? If it's relative to a common reference, then for a constant angle you get some funny-looking label patterns, and it doesn't look good in bulk. Relative to a feature-local reference, we can choose the tangent or the normal - but at what point of the feature? Really, we want that to be the tangent or normal at the anchor point of the text, so that the same angle looks consistent across all features (45deg to the normal at the start of a long feature is different to 45deg to the normal at the centre of that feature, relative to the bottom of the page: this looks weird)). 
A complicating issue here with text anchoring is what part of the text box gets anchored: depending on the font, and the string, choosing the top or bottom of the bounding box (which will include ascender and descender spaces) can look weird, so it's probably best to anchor on the midline of the text box. This avoids a problem with 'anticlockwise' vs 'clockwise' when implemented as a rotation, in that anchoring to the lower left of text, then rotating 180deg around the centre of the text box gives a different final positioning (and anchoring) than anchoring to the midline of the text box, then performing the same rotation. By appropriate choices of these settings, we can obtain pretty much any labelling style. We need to keep in mind, though, that the arguments won't be interpreted properly until the Diagram gets passed to the renderer, so 'auto' settings to achieve a particular effect with complicated combinations of arguments dependent on feature location might be better passed with draw(). As specific examples: 1) Let's say the effect we're looking for is for horizontal text, anchored to the outside of the track. Here we'd need to consider two halves of the diagram. On the left hand side we need to set label_text_anchor='end', and on the right we set label_text_anchor='start'. On both sides we set label_orientation='horizontal', label_anchor='track', label_placement='outer'. However, we need to take care with features towards the top and bottom of the image, as horizontal labels will run into each other, here. 2) Dropping the requirement for horizontal text, we can set label_orientation='radial', label_anchor='track', label_placement='outer' on both sides (maybe this should be the default?), but set label_direction='clockwise', label_text_anchor='end' on the left, and label_direction='counterclockwise', label_text_anchor='start' on the right. 
3) If we wanted to label features directly, on the appropriate side of their track, we could set label_anchor='feature' for all features, with label_placement='inner' for reverse-strand, and label_placement='outer' for forward-strand features. These are some fairly obvious standard settings which could be made available as presets in the calls to draw(), so that the fiddly details are hidden. Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. 
SC041796 From w.arindrarto at gmail.com Thu Dec 6 22:32:06 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 7 Dec 2012 04:32:06 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: > Confirmed, using test_Emboss.py and Python 3.1.5 on > this machine (running as the buildslave user using the > same Python 3.1.5 installation), using the current tip > 5092e0e9f2326da582158fd22090f31547679160 and > the two commits mentioned above, that is > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > all three builds show the same failure. > > i.e. The failure is not due to a change in Biopython > between those commits, but is in some way caused > by a change to the buildslave environment. My first > suggestion that this is due to Python 3.1.3 -> 3.1.5 > remains my prime suspect. > > I could try downgrading Python 3.1 on this machine > to confirm that I suppose... or updating Python 3.1 on > another machine? > > The other recent Python 3.1 buildbot runs were both > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > Can anyone else reproduce this, or have an idea what > the fix might be? It's reproducible on my machine: Arch Linux 64 bit running Python 3.1.5. Haven't figured out a fix yet, but trying to see if I can. By the way, I was wondering, what's our deprecation policy for Python 3.x? I saw that 3.1 was released in 2009, and there doesn't seem to be any major updates coming soon. How long should we keep supporting Python <3.2? regards, Bow From p.j.a.cock at googlemail.com Fri Dec 7 05:06:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Dec 2012 10:06:57 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout?
In-Reply-To: References: Message-ID: On Fri, Dec 7, 2012 at 3:32 AM, Wibowo Arindrarto wrote: > > > Confirmed, using test_Emboss.py and Python 3.1.5 on > > this machine (running as the buildslave user using the > > same Python 3.1.5 installation), using the current tip > > 5092e0e9f2326da582158fd22090f31547679160 and > > the two commits mentioned above, that is > > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > > all three builds show the same failure. > > > > i.e. The failure is not due to a change in Biopython > > between those commits, but is in some way caused > > by a change to the buildslave environment. My first > > suggestion that this is due to Python 3.1.3 -> 3.1.5 > > remains my prime suspect. > > > > I could try downgrading Python 3.1 on this machine > > to confirm that I suppose... or updating Python 3.1 on > > another machine? > > > > The other recent Python 3.1 buildbot runs were both > > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > > > Can anyone else reproduce this, or have an idea what > > the fix might be? > > It's reproducible in my machine: Arch Linux 64 bit running > Python3.1.5. Haven't figured out a fix yet, but trying to see if I > can. Great. We haven't really proved this is down to a change in either Python 3.1.4 or 3.1.5 but it does look likely. > > By the way, I was wondering, what's our deprecation policy for > Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't > seem to be any major updates coming soon. How long should we keep > supporting Python <3.2? As long as it doesn't cost us much effort? If we can't solve this issue easily that might be enough to drop Python 3.1? 
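If the baseline does move to Python 3.2+, the check itself is a one-liner. A purely illustrative sketch (nothing like this exists in Biopython's setup.py; it merely shows the shape of such a guard):

```python
import sys

# Illustrative only: enforce a hypothetical "Python 3 means 3.2 or later"
# policy, while leaving Python 2.x installs untouched.
if sys.version_info[0] == 3 and sys.version_info[:2] < (3, 2):
    raise RuntimeError("Python 3.0 and 3.1 are not supported; "
                       "please use Python 3.2+ (or Python 2.x).")
```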
My impression is that Python 3.0 is dead, and the only sizeable group stuck with Python 3.1 will those on Ubuntu lucid (LTS is supported through 2013 on desktops and 2015 on servers), but as with life under Python 2.x it is fairly straightforward to have a local/additional Python without disturbing the system installation. On a related note, TravisCI currently still supports Python 3.1 unofficially (we're not using this with Biopython but I've tried it with other projects), but this will be dropped soon - once they have Python 3.3 working. Since we don't yet officially support Python 3 (but we probably should soon) we have the flexibility to recommend either Python 3.2 or 3.3 as a baseline. Peter From redmine at redmine.open-bio.org Sat Dec 8 23:11:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 04:11:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. 
Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Dec 9 04:53:30 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 09:53:30 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Micha? Nowotka. That just means that bug is in save() not in load() function. But of course I will provide data file, although I can't guarantee it will be minimal. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Dec 9 07:13:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 12:13:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Dec 10 12:39:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Dec 2012 17:39:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Micha? Nowotka. 
File minimal_data.pkl added This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Tue Dec 11 00:32:02 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 05:32:02 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. Hi Michal, Unfortunately I cannot load your minimal_data.pkl file. At list = pickle.load(f) I get ImportError: No module named django.db.models.query Can you check which item in list is actually causing the problem?
Just reduce the list until you find the item that is causing the trie.load(f) to fail. From MatatTHC at gmx.de Tue Dec 11 03:11:48 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 11 Dec 2012 09:11:48 +0100 Subject: [Biopython-dev] genetic code Message-ID: Dear biopython developers, there is a new genetic code table (24) in the NCBI resources (see NC_015649). Maybe you can update this with the next release. Would it be an idea to distribute the genetic code file from NCBI with Biopython and create the code tables on import or during installation? Then Biopython would be automatically up-to-date. Regards, Matthias From redmine at redmine.open-bio.org Tue Dec 11 04:15:22 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 09:15:22 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Hello, As I said, this is a minimal test case. That means there is no single key that causes a problem.
If you remove any of the items from the list, it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems running it, I can provide the model classes as well. From arklenna at gmail.com Tue Dec 11 11:00:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 11 Dec 2012 11:00:33 -0500 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: Hi Matthias, In a similar case, we have a file in the Scripts/ directory to download and parse the file. The generated file (and not the source file) is committed, but the script is available in the source for end users who wish to update it: https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py I think a similar situation would be appropriate here. Does Biopython currently include alternate codon tables? Cheers, Lenna On Tuesday, December 11, 2012, Matthias Bernt wrote: > Dear biopython developers, > > there is a new genetic code table (24) in the NCBI resources (see > NC_015649).
Maybe you can update this with the next release. > > Would it be an idea to distribute the genetic code file from ncbi with > biopython and create the code tables on import or during installation? Then > biopython would be automatically up-to-date. > > Regards, > Matthias > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Dec 11 13:42:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Dec 2012 18:42:13 +0000 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: On Tuesday, December 11, 2012, Lenna Peterson wrote: > Hi Matthias, > > In a similar case, we have a file in the Scripts/ directory to download and > parse the file. The generated file (and not the source file) is committed, > but the script is available in the source for end users who wish to update > it: > > > https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py > > I think a similar situation would be appropriate here. Does Biopython > currently include alternate codon tables? > > Cheers, > > Lenna Yes, see https://github.com/biopython/biopython/blob/master/Bio/Data/CodonTable.py and the parser therein. On Tuesday, December 11, 2012, Matthias Bernt wrote: > > > Dear biopython developers, > > > > there is a new genetic code table (24) in the NCBI resources (see > > NC_015649). Maybe you can update this with the next release. That seems like a good idea :) > > Would it be an idea to distribute the genetic code file from ncbi with > > biopython and create the code tables on import or during installation? > Then > > biopython would be automatically up-to-date. > > > > Regards, > > Matthias > That would just make installation more complex (and it is already complicated). I would prefer to keep setup.py as normal as possible.
The NCBI tables rarely change, so this works OK overall. Peter From redmine at redmine.open-bio.org Tue Dec 11 23:16:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 04:16:27 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django. From redmine at redmine.open-bio.org Wed Dec 12 02:56:52 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 07:56:52 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Sure, today I'll strip all django dependencies and resubmit the data set and loading code.
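The reduction strategy suggested in this thread (drop items until the failure disappears) can be sketched generically in plain Python. The `fails` predicate here is a hypothetical stand-in for the real trie save/load round trip, and all names are illustrative:

```python
def shrink(items, fails):
    """Drop items one at a time, keeping each removal only if the
    remaining list still triggers the failure."""
    i = 0
    while i < len(items):
        candidate = items[:i] + items[i + 1:]
        if fails(candidate):
            items = candidate  # still fails without this item: drop it
        else:
            i += 1  # this item is needed to reproduce the failure
    return items

# Toy failure that needs two items together, mimicking a bug where
# no single key is problematic on its own:
print(shrink(list(range(10)), lambda xs: 3 in xs and 7 in xs))  # [3, 7]
```

This greedy pass is not guaranteed to find a globally minimal set, but it usually shrinks a failing input enough to see what the remaining items have in common.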
From redmine at redmine.open-bio.org Wed Dec 12 05:04:28 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 10:04:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. File minimal_data.pkl added Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()
From redmine at redmine.open-bio.org Wed Dec 12 07:29:19 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 12:29:19 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. The problem was indeed that one of the chunks had a size of 2000. I've uploaded a fix to github; could you please give it a try? See https://github.com/biopython/biopython/commit/6e09a4a67b7dec1910b13e3d730e3a1f5c2261c9 In particular, please make sure that new_trie is identical to trie.
From redmine at redmine.open-bio.org Wed Dec 12 16:44:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 21:44:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3400] (New) Hmmer3-text parser crashes when parsing hmmscan --cut_tc files Message-ID: Issue #3400 has been reported by Kai Blin. ---------------------------------------- Bug #3400: Hmmer3-text parser crashes when parsing hmmscan --cut_tc files https://redmine.open-bio.org/issues/3400 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: I'm currently struggling with a crash in the hmmer3-text parser when dealing with files generated by hmmscan --cut_tc. I'm not quite sure what happens yet, but I have the feeling that some part of the hit parsing logic is reading into the next query without yielding a result. The backtrace is
Traceback (most recent call last):
  File "t.py", line 4, in <module>
    i = it.next()
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 317, in parse
    yield qresult
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/data/uni/biopython/Bio/File.py", line 84, in as_handle
    yield fp
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 133, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 176, in _parse_hit
    hit_list = self._create_hits(hit_attr_list, qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 239, in _create_hits
    hit_attr = hit_attrs.pop(0)
IndexError: pop from empty list
Line numbers might be a bit off as I added debug output to understand what's happening already. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From bow at bow.web.id Wed Dec 12 23:15:01 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 05:15:01 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, Thanks for the report. AB-BLAST wasn't included in the BLAST XML parser's test suite so I'm glad you spotted this :). You're proposing a bug fix, so yes, this should be included in our code. You could submit a pull request on our github page: https://github.com/biopython/biopython/pulls, or I can submit it on your behalf if you prefer not to submit it yourself. If you're not familiar with GitHub, we have a quick guide on how to use it to develop Biopython here: http://biopython.org/wiki/GitUsage. GitHub's help on how to submit pull requests is a useful read too: https://help.github.com/articles/using-pull-requests Along with the patch, a unit test on the AB-BLAST output would also be very welcome. As for the actual regex change, I was wondering, is that the only possible pattern of the BlastOutput_version tag in AB-BLAST? Do you have examples of any other version output from AB-BLAST? cheers, Bow P.S. CC-ed to the Biopython-dev mailing list On Thu, Dec 13, 2012 at 4:41 AM, Colin Archer wrote: > Hi Bow, > I have been using your implementation of the biopython BLAST > output parser but for AB-BLAST input and it has been working OK so far, > although I haven't thoroughly had a look at the speed yet.
I initially found > that the version tag (BlastOutput_version) for AB-BLAST results was slightly > different from NCBI BLAST and changed the regex you implemented to cover > both versions. The difference between them was: > > BLASTN 2.2.27+ > 3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 > 2009-11-17T18:52:53] > > > and the regex I ended up using was: > r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?' > > and here is the tested output: >>>> _RE_VERSION1 = re.compile(r'\d+\.\d+\.\d+\+?') >>>> _RE_VERSION2 = re.compile(r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?') >>>> version1 > 'BLASTN 2.2.27+' >>>> version2 > '3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 2009-11-17T18:52:53]' >>>> re.search(_RE_VERSION1, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION2, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION1, version2).group(0) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'NoneType' object has no attribute 'group' >>>> re.search(_RE_VERSION2, version2).group(0) > '3.0PE-AB' > > Would there be any chance of including this in a future release of > BioPython? > > Thanks > Colin > > From bow at bow.web.id Thu Dec 13 11:14:27 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 17:14:27 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, > From what I have seen, the version value is formatted > differently based on the edition of AB-BLAST being used: personal, > commercial etc. As I only use the personal edition, I'm not sure if the > other versions are different but I imagine that they conform to the same > format, with the version followed by the edition (for example, 3.0PE-AB for > personal edition). The regex I sent you will keep the edition so I imagine > it will work on other versions of AB-BLAST as long as the edition is > represented by "words-words". Ok then. The regex looks good.
You can probably make it more reader-friendly by separating the regex for NCBI and AB BLAST (e.g. r'(?:ncbi_blast_regex)|(?:ab_blast_regex)'). But even without this, it seems to work ok. > I'll submit a pull request as well and submit the revised regex. If you are > interested, there are a couple other differences in the XML output between > AB-BLAST and NCBI-BLAST. I can send you an example output if you would like > to have a look at it. Presently, SearchIO can't parse AB-BLAST XML output > for multiple queries as the AB-BLAST output is just a concatenation of > multiple single queries. Each query contains the section > at the beginning and causes ElementTree to error during iteration. To get > around this I have been piping the AB-BLAST output and parsing it into a > more NCBI-BLAST form. Hmm... it is a problem if AB-BLAST concatenates outputs like that. It makes the XML invalid, though, so I'm not sure if we should change the parser to tolerate this. What are the other differences? As for the example files, they would indeed be useful for unit testing (as long as they're not that big ~ less than 50K?). You can send them to me. If you're feeling it, you can also write your own unit tests using them :). Looking forward to the pull request :), Bow From p.j.a.cock at googlemail.com Thu Dec 13 12:09:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:09:59 +0000 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:14 PM, Wibowo Arindrarto wrote: >> Presently, SearchIO can't parse AB-BLAST XML output >> for multiple queries as the AB-BLAST output is just a concatenation of >> multiple single queries. Each query contains the section >> at the beginning and causes ElementTree to error during iteration. To get >> around this I have been piping the AB-BLAST output and parsing it into a >> more NCBI-BLAST form.
> > Hmm... it is a problem if AB-BLAST concatenates outputs like that. It > makes the XML invalid, though, so I'm not sure if we should change > the parser to tolerate this. What are the other differences? The older NCBI BLAST tools had this bug as well - and as a result our NCBIXML has a hack to cope with it. It might be worth applying the same kind of fix to the SearchIO BLAST XML parser as well if it would help with both AB-BLAST and any older NCBI XML files. Peter From lucas.sinclair at me.com Thu Dec 13 11:29:19 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Thu, 13 Dec 2012 17:29:19 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator Message-ID: Hi ! I'm working a lot with fasta files. They can be large (>50GB) and contain lots of sequences (>40,000,000). Often I need to get one sequence from the file. With a flat FASTA file this requires parsing, on average, half of the file before finding it. I would like to write something that solves this problem, and rather than making a new repository, I thought I could contribute to biopython. As I just wrote, the iterator nature of parsing sequence files has its limits. I was thinking of something that is indexed. And not some hack like I see sometimes where a second ".fai" file is added next to the ".fa" file. The natural thing to do is to put these entries in a SQLite file. The appraisal of such solutions is well made here: http://defindit.com/readme_files/sqlite_for_data.html Now I looked into the biopython source code, and it seems everything is based on returning a generator object which essentially has only one method: next() giving SeqRecords. For what I want to do, I would also need the get(id) method. Plus any other methods that could now be added to query the DB in a useful fashion (e.g. SELECT entry where length > 5).
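The SQLite-backed lookup described here can be sketched with the standard library alone. This is a toy illustration of the idea, not an existing Biopython API: the table name, columns, and helper functions are all made up for the sketch, and a real implementation would likely store file offsets rather than the sequences themselves:

```python
import sqlite3

def build_index(fasta_lines, db_path=':memory:'):
    """Parse FASTA lines and store id -> sequence rows in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE sequences (id TEXT PRIMARY KEY, seq TEXT)')
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                con.execute('INSERT INTO sequences VALUES (?, ?)',
                            (name, ''.join(chunks)))
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        con.execute('INSERT INTO sequences VALUES (?, ?)',
                    (name, ''.join(chunks)))
    con.commit()
    return con

def get(con, seq_id):
    """Random access by id, without re-reading the whole file."""
    row = con.execute('SELECT seq FROM sequences WHERE id = ?',
                      (seq_id,)).fetchone()
    return row[0] if row else None

con = build_index(['>a some description', 'ACGT', 'TT', '>b', 'GGG'])
print(get(con, 'a'))  # ACGTTT
```

Once built on disk, such a database supports arbitrary SQL queries (the "SELECT entry where length > 5" style of lookup) without re-parsing the FASTA file.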
I see there is a class called InterlacedSequenceIterator(SequenceIterator) that contains a __getitem__(i) method, but it's unclear how I should go about implementing that. Any help/example on how to add such a format to SeqIO ? Thanks ! Lucas Sinclair, PhD student Ecology and Genetics Uppsala University From p.j.a.cock at googlemail.com Thu Dec 13 12:40:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:40:46 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: > Hi ! > > I'm working a lot with fasta files. They can be large (>50GB) and contain > lots of sequences (>40,000,000). Often I need to get one sequence from the > file. With a flat FASTA file this requires parsing, on average, half of the > file before finding it. I would like to write something that solves this > problem, and rather than making a new repository, I thought I could > contribute to biopython. > > As I just wrote, the iterator nature of parsing sequence files has its > limits. I was thinking of something that is indexed. And not some hack like > I see sometimes where a second ".fai" file is added next to the ".fa" file. > The natural thing to do is to put these entries in a SQLite file. The > appraisal of such solutions is well made here: > http://defindit.com/readme_files/sqlite_for_data.html > > Now I looked into the biopython source code, and it seems everything is > based on returning a generator object which essentially has only one method: > next() giving SeqRecords. For what I want to do, I would also need the > get(id) method. Plus any other methods that could now be added to query the > DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is > a class called InterlacedSequenceIterator(SequenceIterator) that contains a > __getitem__(i) method, but it's unclear how I should go about > implementing that.
Any help/example on how to add such a format to SeqIO ? > > Thanks ! Have you looked at Bio.SeqIO.index (index held in memory) and Bio.SeqIO.index_db (index held in an SQLite3 database), and do they solve your needs? Note these only index the location of records - unlike tabix/fai indexes which also look at the line length to be able to pull out subsequences. This means the Bio.SeqIO indexing isn't ideal for dealing with large records where you are only interested in small subsequences. Peter From p.j.a.cock at googlemail.com Thu Dec 13 12:51:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:51:40 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> >> I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. >> Hmm - I think that entire class is obsolete and could be removed. Peter From p.j.a.cock at googlemail.com Thu Dec 13 13:54:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 18:54:04 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:51 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: >> On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >>> >>> I see there is >>> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >>> __getitem__(i) method, but it's unclear how I should go about >>> implementing that. >>> > > Hmm - I think that entire class is obsolete and could be removed. I've marked it as deprecated, but since it doesn't really have any executable code a deprecation warning doesn't seem relevant. We can probably remove this after the next release.
https://github.com/biopython/biopython/commit/316c42aad05b9de3d3b3004ec295670691ae1804 Thanks for flagging up this bit of the code, Lucas. Going further, the SequenceIterator isn't used either, and perhaps could be dropped too? We do use a similar class in AlignIO... Regards, Peter From ben at benfulton.net Thu Dec 13 21:25:47 2012 From: ben at benfulton.net (Ben Fulton) Date: Thu, 13 Dec 2012 21:25:47 -0500 Subject: [Biopython-dev] Code coverage reporting Message-ID: On my Biopython fork, I've extended the test run on Travis to create and upload a code coverage report to GitHub. I'd like to submit a pull request to put this in the main code base, but in order to do so, I need a token generated to allow uploading the file to the biopython GitHub account. Can someone work with me on that? You can view the coverage report at http://cloud.github.com/downloads/benfulton/biopython/coverage.txt Thanks! Ben Fulton From p.j.a.cock at googlemail.com Fri Dec 14 05:58:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Dec 2012 10:58:49 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Fri, Dec 14, 2012 at 10:07 AM, Lucas Sinclair wrote: > Hello, > > Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes > an index, but it is held in memory. So it must be recomputed every > time the interpreter is reloaded. Yes, that is right. > This step is wasting enough time for me that I would like to compute > the index on my 50GB file once, and then be done with it. SQLite > really is the technology of choice for such a problem... Yes, which is why Bio.SeqIO.index_db() stores the index in SQLite. The SeqIO chapter in the Tutorial does try to explain this and the advantages compared to Bio.SeqIO.index(). Have you tried this yet? > I suppose you agree storing all this sequence information in flat > ascii files is not practical.
It may not be optimal, but it is very practical (although at the scale of next generation sequencing data less so). Peter From lucas.sinclair at me.com Fri Dec 14 05:07:55 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Fri, 14 Dec 2012 11:07:55 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: Hello, Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes an index, but it is held in memory. So it must be recomputed every time the interpreter is reloaded. This step is wasting enough time for me that I would like to compute the index on my 50GB file once, and then be done with it. SQLite really is the technology of choice for such a problem... I suppose you agree storing all this sequence information in flat ascii files is not practical. Actually, I found a reasonable workaround to achieve this result with these two commands:

$ formatdb -i reads -p T -o T -n reads
$ blastdbcmd -db reads -dbtype prot -entry "105107064179" -outfmt %f -out test.fasta

But then I need to have calls to subprocess... Since I thought my first small contribution to Biopython (https://github.com/biopython/biopython/commit/1c72a63b35db70d11c628b83a0269d1a9c6443a4) was fun to do, I may still feel like writing a proper solution. Would such a thing be a welcome addition to Bio.SeqIO ? If so, where would I place it ? The schema would be a SQLite file with a single table named "sequences". This table would have columns corresponding to the attributes of a SeqRecord. But you would need to get a different type of object back from parse than a generator, you would need an object that has a __getitem__ method. Sincerely, Lucas Sinclair, PhD student Ecology and Genetics Uppsala University On 13 déc. 2012, at 18:40, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> Hi ! >> >> I'm working a lot with fasta files. They can be large (>50GB) and contain >> lots of sequences (>40,000,000).
Often I need to get one sequence from the >> file. With a flat FASTA file this requires parsing, on average, half of the >> file before finding it. I would like to write something that solves this >> problem, and rather than making a new repository, I thought I could >> contribute to biopython. >> >> As I just wrote, the iterator nature of parsing sequence files has its >> limits. I was thinking of something that is indexed. And not some hack like >> I see sometimes where a second ".fai" file is added next to the ".fa" file. >> The natural thing to do is to put these entries in a SQLite file. The >> appraisal of such solutions is well made here: >> http://defindit.com/readme_files/sqlite_for_data.html >> >> Now I looked into the biopython source code, and it seems everything is >> based on returning a generator object which essentially has only one method: >> next() giving SeqRecords. For what I want to do, I would also need the >> get(id) method. Plus any other methods that could now be added to query the >> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. Any help/example on how to add such a format to SeqIO ? >> >> Thanks ! > > Have you looked at Bio.SeqIO.index (index held in memory) and > Bio.SeqIO.index_db (index held in an SQLite3 database), and do > they solve your needs? > > Note these only index the location of records - unlike tabix/fai indexes > which also look at the line length to be able to pull out subsequences. > This means the Bio.SeqIO indexing isn't ideal for dealing with large > records where you are only interested in small subsequences.
> > Peter From w.arindrarto at gmail.com Fri Dec 14 07:48:12 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 14 Dec 2012 13:48:12 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: Hi everyone, >> It's reproducible on my machine: Arch Linux 64 bit running >> Python 3.1.5. Haven't figured out a fix yet, but trying to see if I >> can. > > Great. We haven't really proved this is down to a change in > either Python 3.1.4 or 3.1.5 but it does look likely. It's reproduced in my local 3.1.4 installation. Seems like an unfixed bug that went through to 3.1.5. >> By the way, I was wondering, what's our deprecation policy for >> Python 3.x? I saw that 3.1.5 was released in 2009, and there doesn't >> seem to be any major updates coming soon. How long should we keep >> supporting Python <3.2? > > As long as it doesn't cost us much effort? If we can't solve this > issue easily that might be enough to drop Python 3.1? Fixing this seems difficult (has anyone else tried a fix?). The _io module is built-in and compiled when Python is installed, so fixing it (I imagine) may require tweaking the C code (which requires fiddling with the actual Python installation). > My impression is that Python 3.0 is dead, and the only sizeable > group stuck with Python 3.1 will be those on Ubuntu lucid (LTS is > supported through 2013 on desktops and 2015 on servers), > but as with life under Python 2.x it is fairly straightforward > to have a local/additional Python without disturbing the system > installation. > > > Since we don't yet officially support Python 3 (but we probably > should soon) we have the flexibility to recommend > either Python 3.2 or 3.3 as a baseline. Yes. I think it may be easier and better for us to officially start supporting from Python 3.2 or 3.3 onwards.
regards, Bow From christian at brueffer.de Mon Dec 17 06:05:04 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 17 Dec 2012 19:05:04 +0800 Subject: [Biopython-dev] Biopython AlignAce Wrapper In-Reply-To: References: <50CAC1C2.9090705@brueffer.de> <50CEE193.2010003@brueffer.de> Message-ID: <50CEFC60.8020400@brueffer.de> (CC'ing biopython-dev) Thanks for the feedback. I'd propose the following plan for the AlignAce wrapper then: 1. Submit the cleanup patches I have, to give the wrapper at least a fighting chance at actually working 2. Add a BiopythonDeprecationWarning 3. Remove the wrapper after 1.61 is released (unless the situation changes, of course) Does that sound acceptable? Chris On 12/17/2012 05:25 PM, Bartek Wilczynski wrote: > Well, > > sounds like a good plan. I think the situation is hopeless: if we had > the source of AlignAce with an appropriate license we could think of > supporting it ourselves, but in this situation I guess we can only > deprecate the module and phase it out... > > best > Bartek > > On Mon, Dec 17, 2012 at 10:10 AM, Christian Brueffer > wrote: >> Hi Bartek, >> >> thanks for checking. The thing is, the "new" version is actually an >> ancient version: >> >> AlignACE version 2.3 October 27, 1998 >> >> I made it work by installing Fedora Core 3 in a VM and using >> elfstatifier to bind AlignAce and all libraries into one executable. >> It works, but I doubt it's of any use these days. >> >> I wonder whether it's better to remove the wrapper. The AlignAce >> developers are unresponsive, none of the Biopython people has a >> version, and from what I can see the current wrapper cannot possibly >> work. >> >> What do you think? >> >> Chris >> >> >> On 12/17/2012 05:01 PM, Bartek Wilczynski wrote: >>> >>> Hi, >>> >>> I've looked around and it seems I don't have it. We probably need to >>> "update" the parser to work with the current version of AlignACE >>> available from Harvard. Were you able to run it?
On my system, it >>> cannot find the libraries it needs... >>> >>> best >>> Bartek >>> >>> On Fri, Dec 14, 2012 at 7:05 AM, Christian Brueffer >>> wrote: >>>> >>>> Hi Bartek, >>>> >>>> I am currently cleaning up the Biopython AlignAce wrapper. Unfortunately >>>> I've been unable to obtain the latest AlignAce version since the >>>> download page disappeared and the Church lab is unresponsive. >>>> >>>> Do you happen to have a version of AlignAce 4.0 for Linux lying around >>>> that you could send me? >>>> >>>> Thanks a lot, >>>> >>>> Chris >>> From redmine at redmine.open-bio.org Mon Dec 17 08:49:33 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 17 Dec 2012 13:49:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3401] (New) is_terminal bug in newick trees Message-ID: Issue #3401 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3401: is_terminal bug in newick trees https://redmine.open-bio.org/issues/3401 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: Consider this weird Newick tree (((B,C),D))A; Here 'A' is both a root node and a terminal node (since it has only one child: ((B,C),D);). However, is_terminal for 'A' is False:
from Bio import Phylo
import cStringIO

bad_tree = '(((B,C),D))A'

t = Phylo.read(cStringIO.StringIO(bad_tree), 'newick')

for c in t.find_clades(terminal=True):
    print c,
Gives @B C D@ ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Tue Dec 18 07:40:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 18 Dec 2012 13:40:35 +0100 Subject: [Biopython-dev] Location Parser Message-ID: Dear list, I have some problems with the GenBank parser in version 1.60. It's again nested location strings like: order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) as found in NC_003048. What happens is that the parser stalls. It seems as if it takes forever to parse _re_complex_compound and never gets to the if statement that checks if order and join appear in the location string. I suggest moving the if statement before the regular expressions are tested. I remember that I posted something like this before, but I cannot remember how and if this was solved. Regards, Matthias From k.d.murray.91 at gmail.com Tue Dec 18 08:46:06 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Wed, 19 Dec 2012 00:46:06 +1100 Subject: [Biopython-dev] [biopython] TAIR (Arabidopsis) sequence retrieval module (#132) In-Reply-To: References: Message-ID: Hi Peter, Chris and the mailing list, Thanks very much for the feedback! > Query: It isn't clear to me (from a first read) what MultipartPostHandler is needed for. The arabidopsis.org server form requires the content-type to be a multipart form, not a urlencoded form, which the standard urllib2 does not handle.
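For readers following along: urllib2 only urlencodes POST data, so a multipart/form-data body has to be assembled by hand, which is the job MultipartPostHandler does. A rough sketch of the manual assembly — the field names in any real request would be whatever the arabidopsis.org form actually expects, not anything shown here.

```python
import uuid

def encode_multipart(fields):
    # Assemble a multipart/form-data body by hand; each field becomes a
    # boundary-delimited part with its own Content-Disposition header.
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append('--' + boundary)
        parts.append('Content-Disposition: form-data; name="%s"' % name)
        parts.append('')
        parts.append(value)
    parts.append('--' + boundary + '--')
    parts.append('')
    body = '\r\n'.join(parts)
    content_type = 'multipart/form-data; boundary=%s' % boundary
    return body, content_type
```

The (body, content_type) pair would then go into a Request with an explicit Content-Type header, instead of the default urlencoded form.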
I could write a custom handler; however, when writing the module I found MultipartPostHandler, and figured I should use that. I may be wrong, but couldn't figure out any other way of doing it. >Minor: The module's docstring should start with a one line summary then a blank line (see PEP8 style guide). >Note: Since your unit test requires internet access, it should include these lines to work nicely in our testing framework (which allows the tests needing network access to be skipped) I'll fix the module docstring and requires_internet check tomorrow. >Why does the NCBI code exist given it is such a thin wrapper round the Bio.Entrez code - the module would be a lot simpler if it was just a wrapper for www.arabidopsis.org alone. The NCBI functions exist to get genbank files for AGIs, as TAIR's sequence retrieval only gives fasta files, so if users need/want the extra metadata a genbank file gives, they can use this module. As you've said, this is a *very* thin wrapper, so would it be better to just provide the mapping dicts in Bio.TAIR._ncbi for people to use however they see fit? >Query: Why do your methods return SeqRecord objects? Is this because the handle might return FASTA with a non-FASTA header which must be stripped off? SeqRecord objects were returned for two reasons, the first being, as you said, that the raw return text is not always a valid fasta file, despite my efforts to trim extraneous text. The latter is simply that that is what I required when writing it, and I could not think of a better way of returning it. (And I thought that the return of a SeqRecord allowed "pythonic" processing of results, a la the test suite.) Again, happy for any suggestions. >Why do the classes TAIRDirect and TAIRNCBI exist? Wouldn't module level functions be simpler (or at least, consistent with other modules like Bio.Entrez) >Style: Why introduce the mode argument and two magic values NCBI_RNA and NCBI_PROTEIN? The honest answer to both of these is personal choice.
If consistency is an issue I will reimplement as module-level functions and textual arguments respectively. Regarding the placement of modules, I'm happy for it to go wherever. I would imagine that there are other niche web interface "getters" such as this, and think your suggestion sounds great, although I can't think of what we could call it. Perhaps Bio.Web.TAIR? Regards Kevin Murray On 18 December 2012 10:34, Peter Cock wrote: > Hi Kevin, > > Thanks for your code submission. I've not had a chance to play with it, > but I do have some comments/queries - some of which are perhaps just style > issues. > > Note: Since your unit test requires internet access, it should include > these lines to work nicely in our testing framework (which allows the tests > needing network access to be skipped): > > import requires_internet > requires_internet.check() > > Query: It isn't clear to me (from a first read) what MultipartPostHandler > is needed for. > > Minor: The module's docstring should start with a one line summary then a > blank line (see PEP8 style guide). > > Query: Why do the classes TAIRDirect and TAIRNCBI exist? Wouldn't module > level functions be simpler (or at least, consistent with other modules like > Bio.Entrez)? > > Query: Why do your methods return SeqRecord objects? Is this because the > handle might return FASTA with a non-FASTA header which must be stripped > off? > > Style: Why introduce the mode argument and two magic values NCBI_RNA and > NCBI_PROTEIN? > > In fact I would go further and ask why the NCBI code exists given it > is such a thin wrapper round the Bio.Entrez code - the module would be a > lot simpler if it was just a wrapper for www.arabidopsis.org alone.
> > I'm also not sure about the namespace Bio.TAIR, the old Bio.www namespace > might have been better but that was deprecated a while back, and the other > semi-natural fit under Biopython's old OBDA effort is also defunct > (attempting to catalogue a collection of sequence resources, see > http://obda.open-bio.org for background if curious). The namespace issue > at least would be worth bringing up on the dev mailing list... especially > if you can think of many other examples like this for specialised resources. > > Regards, > > Peter > From kjwu at ucsd.edu Tue Dec 18 23:25:35 2012 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 18 Dec 2012 20:25:35 -0800 Subject: [Biopython-dev] KEGG API Wrapper In-Reply-To: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi All, Sorry for the delay in updating this KEGG code. Michiel, I've addressed your suggestions regarding the querying code and the documentation and have committed changes that reflect this. ( https://github.com/kevinwuhoo/biopython/) There's a namespace collision created by the KEGG.list function, so I use KEGG.list_ instead. However, I'm sure there's a more elegant solution than this. Regarding the parsers, there should be a way to unify all parsers and writers for KEGG objects, as they list fields for all their objects here: http://www.kegg.jp/kegg/rest/dbentry.html. Each class should extend from a parent while specifying its valid fields. Parsing all files should be generalized, but there should be field-specific code to handle the different fields so that fields like genes are handled correctly and ubiquitously. After solidifying discussion on these, I'll move the tests over to unittest too. Thanks! Kevin On Thu, Oct 25, 2012 at 7:52 PM, Michiel de Hoon wrote: > Hi Kevin, > > Thanks for the documentation! That makes everything a lot clearer.
> Overall I like the querying code and I think we should add it to Biopython. > > I have a bunch of comments on the KEGG module, some on the existing code > and some on the new querying code, see below. Most of these are trivial; > some may need some further discussion. Perhaps you could let us know which > of these comments you can address, and which ones you want to skip for now? > > Once we have converged with regard to the querying code and the documentation, > I think we can import your version of the KEGG module into the main > Biopython repository and add your chapter on KEGG to the main > documentation, and continue from there on the parsers and the unit tests. > > Many thanks! > -Michiel. > > > About the querying code: > ---------------------------------- > > I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv", > KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions > KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get. > > For list, find, conv, link, and info, instead of going through > KEGG.generic_parser, I would return the result directly as a Python list. > In contrast, KEGG.get should return the handle to the results, not the > data itself. So the _q function, instead of > ... > resp = urllib2.urlopen(req) > data = resp.read() > return query_url, data > should have > ... > resp = urllib2.urlopen(req) > return resp > Then the user can decide whether to parse the data on the fly with > Bio.KEGG, or read the data line by line and pick up what they are > interested in, or to get all data from the handle and save it in a file. > Note that resp will have a .url attribute that contains the url, so you > won't need the ret_url keyword. > > About the parsers: > ------------------------ > > I think that we should drop generic_parser. For list, find, conv, link, > and info, parsing is trivial and can be done by the respective functions > directly.
For get, we already have an appropriate parser for some databases > (compound, map, and enzyme), but it's easy to add parsers for the other > databases. > > For all parsers in Biopython, there is the question whether the record > should store information in attributes (as is currently done in Bio.KEGG), > or alternatively if the record should inherit from a dictionary and store > information in keys in the dictionary. Personally I have a preference for a > dictionary, since that allows us to use the exact same keys in the > dictionary as is used in the file (e.g., we can use "CLASS" as a key, while > we cannot use .class as an attribute since it is a reserved word, so we use > .classname instead). But other Biopython developers may not agree with me, > and to some extent it depends on personal preference. > > The parsers miss some key words. The ones I noticed are ALL_REAC, > REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit > tests. > > Remove the ';' at the end of each term in record.classname. > > Convert record.genes to a dictionary for each organism. So instead of > [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON', > ['100190836', '100438793']), ('MCC', ['100424648', '699401']... > have > {'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON': > ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ... > > Also for record.dblinks, record.disease, record.structures, use a > dictionary. > > In record.pathway, all entries start with 'PATH'. Perhaps we should check > with KEGG if there could be anything else than 'PATH' there, otherwise I > don't see the reason why it's there. Assuming that there could be something > different there, I would also use a dictionary with 'PATH' as the key. > > In record.reaction, some chemical names can be very long and extend over > multiple lines. In such cases, the continuation line starts with a '$'. The > parser should remove the '$' and join the two lines. 
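The '$' continuation rule Michiel describes could be handled by a small preprocessing pass along these lines — a sketch of the suggested behaviour, not the actual Bio.KEGG parser code; whether a space belongs at the join would need checking against real records.

```python
def join_continuations(lines):
    # Merge KEGG REACTION lines: a continuation line starts with '$'
    # and belongs to the (long) chemical name on the previous line.
    joined = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('$') and joined:
            # Drop the '$' marker and append to the previous line.
            joined[-1] += stripped[1:]
        else:
            joined.append(stripped)
    return joined
```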
> > About the tests: > -------------------- > > We should update the data files in Tests/KEGG. This will fix some "bugs" > in these data files. > > We should switch test_KEGG.py to the unit test framework. > > We should do some more extensive testing to make sure we are not missing > some key words. > > About the documentation: > --------------------------------- > It's great that we now have some documentation. > > On page 233, I would suggest replacing "id_" by "accession" or > something else, since the underscore in "id_" may look funky to new users. > > Also it may be better not to reuse variable names (e.g. "pathway" is used > in three different ways in the example). It's OK of course in general, but > for this example it may be clearer to distinguish the different usages > of this variable from each other. > > For repair_genes, you can use a set instead of a list throughout. > > --- On *Wed, 10/24/12, Kevin Wu * wrote: > > > From: Kevin Wu > Subject: Re: [Biopython-dev] KEGG API Wrapper > To: "Peter Cock" , "Zachary Charlop-Powers" < > zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" > Cc: Biopython-dev at lists.open-bio.org > Date: Wednesday, October 24, 2012, 6:38 PM > > > Hi All, > > Thanks for the comments, I've written a bit of documentation on the entire > KEGG module and have attached the relevant pages to the email. There > didn't seem to be an appropriate place for examples, so I just added a new > chapter. I've also committed the updated file to github. > > I did leave out the parsers due to the fact that the current parsers only > cover a small portion of possible responses from the api. Also, I'm not > confident that some of the parsers correctly retrieve all the fields. > However, I've written a really general parser that does a rough job of > retrieving fields if it's a database format returned, since I find myself > reusing the code for all database formats.
It's possible to modify this to > correctly account for the different fields, but it would probably take a bit > of work to manually figure each field out. Otherwise it also parses the > tsv/flat file returned. > > Also, @zach, thanks for checking it out and testing it! > > Thanks All! > Kevin > > On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock > > wrote: > > On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers > > > wrote: > > Kevin, > > Michiel, > > > > I just tested Kevin's code for a few simple queries and it worked great. I > > have always liked KEGG's organization of data and really appreciate this > > RESTful interface to their data; in some ways I think it is easier to use the > > web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of > > metabolic networks is awesome. I found the examples in Kevin's test script > > to be fairly self-explanatory, but a simple, spelled-out example in the > > Tutorial would be nice. > > > > One thought, though, is that you can retrieve MANY different types of data > > from the KEGG Rest API - which means that the user will probably have to > > parse the data his/herself. Data retrieved with "list" can return lists of > > genes or compounds or organisms, and after a cursory look these are each > > formatted differently. Also true with the 'find' command. So I think you > > were right to leave out parsers because I think they will be a moving target > > highly dependent on the query. > > > > Thank You Kevin, > > zach cp > > Good point about decoupling the web API wrapper and the parsers - > how the Bio.Entrez module and Bio.TogoWS handle this is to return > handles for web results, which you can then parse with an appropriate > parser (e.g. SeqIO for GenBank files, Medline parser, etc). > > Note that this is a little more fiddly under Python 3 due to the text > mode distinction between unicode and binary... just something to > keep in the back of your mind.
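A tiny illustration of the Python 3 text/binary handle distinction Peter mentions: a binary handle yields bytes, while wrapping it gives the decoded text lines that line-oriented parsers expect. The KEGG-ish sample data here is invented for the sketch.

```python
from io import BytesIO, TextIOWrapper

# A web response handle is binary; reading it directly gives bytes.
raw = BytesIO(b"ENTRY       C00001\nNAME        H2O\n")

# Wrapping the binary handle yields decoded (unicode) text lines.
text = TextIOWrapper(raw, encoding="ascii")
for line in text:
    print(line.rstrip())
```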
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From gokcen.eraslan at gmail.com Thu Dec 20 19:12:43 2012 From: gokcen.eraslan at gmail.com (Gökçen Eraslan) Date: Fri, 21 Dec 2012 01:12:43 +0100 Subject: [Biopython-dev] numpy/matlab style index arrays for Seq objects Message-ID: <50D3A97B.60108@gmail.com> Hello, During the development of a project, I have come across an issue that I want to share. As far as I know, a Bio.Seq.Seq object can only be indexed using an int or a slice object, just as regular strings: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) However, it would be really nice to be able to index Seq objects using index arrays as in numpy.array, like >>> my_indices = [0, 3, 7] >>> my_seq[my_indices] Seq('GCG', IUPACUnambiguousDNA()) (Since I'm not really familiar with the BioPython API and codebase, please ignore/forgive me if such a thing already exists.) For example, in my project I'm trying to eliminate noisy columns of an MSA fasta file. Let's assume that I have a list of non-noisy column indices; then this would solve my problem: In [1]: from Bio import AlignIO In [2]: msa = AlignIO.read("s001.fasta", "fasta") In [3]: print msa[:, [0, 3, 4]] SingleLetterAlphabet() alignment with 5 rows and 3 columns KPG sp2 TPG sp11 SPG sp7 KPP sp6 SPG sp10 I have attached a tiny patch (~4 lines) implementing this. At first, I thought of keeping the sequence string as numpy.array(list()) to be able to use the indexing mechanism of numpy, but that would be over-engineering, so I have just used a simple list comprehension trick. Regards. -------------- next part -------------- A non-text attachment was scrubbed...
Name: biopython-index-array-for-seq.diff Type: text/x-patch Size: 3845 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 08:09:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 13:09:47 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > Dear list, > > I have some problems with the GenBank parser in version 1.60. It's again > nested location strings like: > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > as found in NC_003048. Do you have a URL for that? This looks OK to me: http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 Perhaps the entry came from the FTP site? e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > What happens is that the parser stalls. It seems as if it takes forever to > parse _re_complex_compound and never gets to the if statement that > checks if order and join appear in the location string. > > I suggest to move the if statement before the regular expressions are > tested. > > I remember that I posted something like this before. But I can not remember > how and if this was solved. > > Regards, > Matthias Where similar odd locations have come up, in some cases they did seem to be NCBI bugs - could you raise a query with the NCBI for this case please? If this is valid (which I doubt), then our object model doesn't cope. If this is invalid, then Biopython should give a warning and skip this location. Right now I can't find the file to test this (see query above about where it came from).
Regards, Peter From MatatTHC at gmx.de Fri Dec 21 10:18:45 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Fri, 21 Dec 2012 16:18:45 +0100 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: Dear Peter, you are right, the current RefSeq record is valid and can be parsed. In order to reproduce old results I keep old refseq versions (of mitochondrial genomes) on hard disk. So probably this is an old refseq bug. According to the documentation ( http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.4): """ Note : location operator "complement" can be used in combination with either "join" or "order" within the same location; combinations of "join" and "order" within the same location (nested operators) are illegal. """ Since this was urgent, I fixed the files manually by removing the nested operators. I was not able to find a file in other RefSeq versions that can reproduce the bug (i.e. the parser seemingly takes forever [>5min] and does not raise an exception). You may still reproduce the bug by pasting the location line into another GenBank file. I agree that the desired behaviour would be a warning and skipping of the feature. Regards, Matthias 2012/12/21 Peter Cock > On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > > Dear list, > > > > I have some problems with the GenBank parser in version 1.60. It's again > > nested location strings like: > > > > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > > as found in NC_003048. > > Do you have a URL for that? This looks OK to me: > http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 > > Perhaps the entry came from the FTP site? > e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > > > What happens is that the parser stalls.
It seems as if it takes forever > to > > parse _re_complex_compound in and never gets to the if statement that > > checks if order and join appears in the location string. > > > > I suggest to move the if statement before the regular expressions are > > tested. > > > > I remember that I posted something like this before. But I can not > remember > > how and if this was solved. > > > > Regards, > > Matthaas > > Were similar odd locations have come up in some cases they did > seem to be NCBI bugs - could you raise a query with the NCBI > for this case please? > > If this is valid (which I doubt), then our object model doesn't cope. > > If this is invalid, then Biopython should give a warning and skip > this location. Right now I can't find the file to test this (see > query above about where it came from). > > Regards, > > Peter > -------------- next part -------------- A non-text attachment was scrubbed... Name: NC_001326.gb Type: application/octet-stream Size: 65527 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 10:34:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 15:34:48 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:18 PM, Matthias Bernt wrote: > Dear Peter, > > you are right the current RefSeq record is valid and can be parsed. In order > to reproduce old results I keep old refseq versions (of mitochondrial > genomes) on hard disk. So probably this is an old refseq bug. ... Could you email me (not the list) the old NC_003048.gb file please? Was there a similar issue in the NC_001326.gb file you just sent? It seems to load OK for me... 
Thanks, Peter From p.j.a.cock at googlemail.com Fri Dec 21 11:13:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:13:40 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt wrote: > Dear Peter, > > it's attached (from RefSeq39). For me parsing does not finish for this file > (biopython 1.6, python 2.7.3). > > Regards, > Matthias Got it, thanks. It seems to get stuck for me too - there is a bug here :( See also: https://redmine.open-bio.org/issues/3197 Peter From p.j.a.cock at googlemail.com Fri Dec 21 11:54:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:54:38 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 4:13 PM, Peter Cock wrote: > On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt > wrote: >> Dear Peter, >> >> it's attached (from RefSeq39). For me parsing does not finish for this file >> (biopython 1.6, python 2.7.3). >> >> Regards, >> Matthias > > Got it, thanks. It seems to get stuck for me too - there is a bug here :( > > See also: https://redmine.open-bio.org/issues/3197 The problem seems to be the regular expression search itself getting stuck: $ python Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.GenBank import _re_complex_compound >>> _re_complex_compound.match("order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403)") ^CTraceback (most recent call last): File "", line 1, in KeyboardInterrupt Odd.
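One way to implement Matthias's suggestion from earlier in the thread — detect the illegal nested join/order before handing the string to the regular expression, which is where the runaway backtracking happens. This is only a sketch, not the actual Bio.GenBank fix; it leans on the feature-table rule that join and order may not nest inside each other, while complement(join(...)) remains legal.

```python
import re

_JOIN_ORDER = re.compile(r'(?:join|order)\(')

def has_nested_join_order(location):
    # complement(join(...)) is allowed, but join/order may not nest
    # inside each other; since a location string has at most one outer
    # compound operator, two or more join/order tokens imply the
    # illegal nesting that makes the full regex blow up.
    return len(_JOIN_ORDER.findall(location)) >= 2
```

Running this cheap check first lets the parser warn and skip the feature instead of disappearing into the regular expression.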
Peter From ben at bendmorris.com Mon Dec 24 11:58:19 2012 From: ben at bendmorris.com (Ben Morris) Date: Mon, 24 Dec 2012 11:58:19 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo Message-ID: Hi all, I've implemented support for two new phylogenetic tree formats: NeXML and RDF (conforming to the Comparative Data Analysis Ontology). I noticed that NeXML support was planned, but I didn't see anyone working on it on GitHub and the feature request hadn't been updated in about a year, so I went ahead and implemented a simple version. At first I tried the generateDS.py approach, but the generated writer doesn't give very much control over the output, so I ended up writing my own parser/writer using ElementTree. As for the RDF/CDAO format, AFAIK this is not a format that's supported by any other phylogenetic libraries, so I'm not sure how useful this is to everyone else. It provides a simple, standards-compliant format that can be imported to a triple store and supports annotation. We'll be using it at NESCent so I wanted to make it available to everyone else as well. The parser and writer require the Redlands Python bindings. The code is available in my fork of Biopython, https://github.com/bendmorris/biopython under branches "cdao" and "nexml." I'd love to get everyone's thoughts and see if these contributions would be a good fit for the Biopython project. 
~Ben Morris PhD student, Department of Biology University of North Carolina at Chapel Hill and the National Evolutionary Synthesis Center ben at bendmorris.com From p.j.a.cock at googlemail.com Mon Dec 24 13:05:29 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Dec 2012 18:05:29 +0000 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 4:58 PM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. Sounds good - and the librdf Redlands Python bindings do seem to be a safe choice for RDF under Python. I guess we need Eric to take a look... and some tests would be needed too. 
Thanks, Peter From eric.talevich at gmail.com Tue Dec 25 02:18:40 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 24 Dec 2012 23:18:40 -0800 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. 
I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Best, Eric From ben at bendmorris.com Fri Dec 28 10:50:02 2012 From: ben at bendmorris.com (Ben Morris) Date: Fri, 28 Dec 2012 10:50:02 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich wrote: > > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: >> >> Hi all, >> >> I've implemented support for two new phylogenetic tree formats: NeXML and >> RDF (conforming to the Comparative Data Analysis Ontology). >> >> I noticed that NeXML support was planned, but I didn't see anyone working >> on it on GitHub and the feature request hadn't been updated in about a >> year, so I went ahead and implemented a simple version. At first I tried >> the generateDS.py approach, but the generated writer doesn't give very much >> control over the output, so I ended up writing my own parser/writer using >> ElementTree. >> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by >> any other phylogenetic libraries, so I'm not sure how useful this is to >> everyone else. It provides a simple, standards-compliant format that can be >> imported to a triple store and supports annotation. We'll be using it at >> NESCent so I wanted to make it available to everyone else as well. The >> parser and writer require the Redlands Python bindings. >> >> The code is available in my fork of Biopython, >> >> https://github.com/bendmorris/biopython >> >> under branches "cdao" and "nexml." 
I'd love to get everyone's thoughts and >> see if these contributions would be a good fit for the Biopython project. > > > > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: > > - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? Great point. I rewrote it to use iterparse instead. > - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) Went ahead and did this as well. > - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Not that I'm aware of, but I'm not sure. I searched http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything. I'm going to ask some people who know more about this than I do. ~Ben From diego_zea at yahoo.com.ar Fri Dec 28 18:33:35 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:33:35 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB Message-ID: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> One of the PDB files (I have a very large dataset of PDB files and there are a lot of them generating this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb And the error output is: /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895.   PDBConstructionWarning) /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216.
PDBConstructionWarning) Traceback (most recent call last):   File "AsignarPDBaMIfile.py", line 45, in     cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1)   File "funciones_pdb.py", line 15, in contactos_CB     cadena = model[cad]   File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__     return self.child_dict[id] KeyError: 'A' How can this be fixed? P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think that the TER line can be the cause of the problem but I'm not sure): 2893   ATOM   2455  N   PHE I   8      38.110 -15.236   4.503  0.89  0.76           N 2894   TER    2456      PHE I   8 2895   HETATM 2457  O   HOH E 327      10.873  -3.134  11.448  0.89  0.01           O if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } From diego_zea at yahoo.com.ar Fri Dec 28 18:59:28 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:59:28 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB In-Reply-To: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> References: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> Message-ID: <1356739168.13594.YahooMailNeo@web140606.mail.bf1.yahoo.com> Excuse me, there is no error. Only a warning on a lot of PDBs. I confused the chain in my example :/ if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } >________________________________ > From: Diego Zea >To: "biopython-dev at biopython.org" >Sent: Friday, 28 December 2012 20:33 >Subject: [Biopython-dev] Error on Bio.PDB > >One of the PDB files (I have a very large dataset of PDB files and there are a lot of them generating this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb > >And the error output is: >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. >
PDBConstructionWarning) >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. > PDBConstructionWarning) >Traceback (most recent call last): >  File "AsignarPDBaMIfile.py", line 45, in >    cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) >  File "funciones_pdb.py", line 15, in contactos_CB >    cadena = model[cad] >  File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ >    return self.child_dict[id] >KeyError: 'A' > >How can this be fixed? > >P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think that the TER line can be the cause of the problem but I'm not sure): > >2893   ATOM   2455  N   PHE I   8      38.110 -15.236   4.503  0.89  0.76           N >2894   TER    2456      PHE I   8 >2895   HETATM 2457  O   HOH E 327      10.873  -3.134  11.448  0.89  0.01           O > >if ((dx*dp)>=(h/(2*pi))) >{ >printf("Diego Javier Zea\n"); >} >_______________________________________________ >Biopython-dev mailing list >Biopython-dev at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Sun Dec 30 07:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 30 Dec 2012 12:46:35 +0000 Subject: [Biopython-dev] [Biopython - Feature #3388] add annotation and letter_annotations attributed for Bio.Align.MultipleSeqAlignment. object References: Message-ID: Issue #3388 has been updated by Peter Cock.
Support for a generic annotation dictionary done, https://github.com/biopython/biopython/commit/793f9210696e0acc9606faeca3d6ca47a9d97813 Started work on per-column annotation as well - currently on this branch: https://github.com/peterjc/biopython/tree/per-column-annotation ---------------------------------------- Feature #3388: add annotation and letter_annotations attributed for Bio.Align.MultipleSeqAlignment. object https://redmine.open-bio.org/issues/3388 Author: saverio vicario Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: At the moment I could not add annotation at the alignment level. Annotation could be useful for tracking info linked to the loci (i.e. name of domain), while letter annotation could be useful to track the quality score of the alignment or whether the sites belong to a given character set. In particular, when two alignments are merged it would be useful that the boundary of the merge is tracked; for example, in the letter annotation of the merge of an alignment a with 10 sites and b with 5 sites the letter_annotations would be as follows: {locus1:'111111111100000',locus2:'000000000011111'} This could also be useful to annotate the 3 positions of codons: {pos1:'1001001001',pos2:'0100100100', pos3:'0010010010'} If this letter_annotation were supported, the annotation could be kept across merging and splitting of the alignment -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Mon Dec 3 11:36:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 11:36:16 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 11:22 AM, Wibowo Arindrarto wrote: > Hi Peter, > >> I've just refactored the code in order to avoid most of the >> index duplication (including SQLite backend) between the >> SeqIO and new SearchIO index and index_db functions. > > Thanks :). I remember I did change some of the variable names. Basically I moved the core SeqIO indexing code into Bio.File, generalised it enough to work for SearchIO as well, then removed the SearchIO indexing code. > Other than this, the biggest change is probably related to the > Indexer classes lazy loading in SearchIO. But it seems to have > been handled as well :). Yes, the SearchIO indexing is still calling your lazy loading function to get the parser objects. >> In the short term at least, the common code is now part >> of Bio/File.py (but remains as private classes). That >> seemed neater than introducing a new private module. > > Looks like a good place for now, Bio.File as the location for > common file-handling code. That was my thinking too. >> Fingers crossed everything is fine on the buildslaves, >> TravisCI seems happy.
Bow, if you find I've broken >> anything then we need more unit tests ;) > > Will keep that in mind :). *Grin* I've just done a base class for the random access proxy classes, potentially a little more refactoring to follow here (or renaming): https://github.com/biopython/biopython/commit/9721cd00b5662309456c3dc573642cbb88e4e0a1 Peter From christian at brueffer.de Mon Dec 3 12:46:23 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 03 Dec 2012 20:46:23 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup Message-ID: <50BC9F1F.4090904@brueffer.de> Hi, I just submitted pull request #102 which fixes several types of PEP8 warnings (found using the awesome pep8 tool). Here's what's left after those fixes: $ pep8 --statistics -qq repos/biopython 789 E111 indentation is not a multiple of four 673 E121 continuation line indentation is not a multiple of four 693 E122 continuation line missing indentation or outdented 171 E123 closing bracket does not match indentation of opening bracket's line 86 E124 closing bracket does not match visual indentation 49 E125 continuation line does not distinguish itself from next logical line 197 E126 continuation line over-indented for hanging indent 575 E127 continuation line over-indented for visual indent 1092 E128 continuation line under-indented for visual indent 773 E201 whitespace after '(' 540 E202 whitespace before ')' 23543 E203 whitespace before ':' 55 E211 whitespace before '(' 180 E221 multiple spaces before operator 59 E222 multiple spaces after operator 5848 E225 missing whitespace around operator 6517 E231 missing whitespace after ',' 2544 E251 no spaces around keyword / parameter equals 644 E261 at least two spaces before inline comment 346 E262 inline comment should start with '# ' 156 E301 expected 1 blank line, found 0 1838 E302 expected 2 blank lines, found 1 364 E303 too many blank lines (2) 15553 E501 line too long (82 > 79 characters) 857 E502 the backslash is redundant between brackets 291 E701 
multiple statements on one line (colon) 122 E711 comparison to None should be 'if cond is None:' 3707 W291 trailing whitespace 1913 W293 blank line contains whitespace I'm not sure where to go from here with regard to what's worth fixing and what would be considered repo churn (or gratuitous changes that make merging of existing patches harder). I'd especially like to clean up E301, E302, E701, E711, W291 and W293. Other items like E251 are more dubious, as some developers seem to prefer the current style. What do you think? Chris From p.j.a.cock at googlemail.com Mon Dec 3 13:34:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 13:34:52 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BC9F1F.4090904@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer wrote: > Hi, Hi Christian, Thanks for all the pull requests sorting out issues like this, in terms of lines of code you'll probably be one of the top contributors to the next release ;) This sort of work isn't as high profile as new features or bug fixes, but has a more subtle role in the long term of the project - making our code easier to follow etc. So we do appreciate these contributions. > I just submitted pull request #102 which fixes several types of PEP8 > warnings (found using the awesome pep8 tool). 101 not 102? https://github.com/biopython/biopython/pull/101 > Here's what's left after those fixes: > > $ pep8 --statistics -qq repos/biopython > 789 E111 indentation is not a multiple of four That's nasty - although I think we've got rid of all the tabbed indentation already which was also very annoying. > 673 E121 continuation line indentation is not a multiple of four I suspect many of those are a style judgement and done that way to line up parentheses etc. 
> 693 E122 continuation line missing indentation or outdented > 171 E123 closing bracket does not match indentation of opening bracket's > line > 86 E124 closing bracket does not match visual indentation > 49 E125 continuation line does not distinguish itself from next logical > line > 197 E126 continuation line over-indented for hanging indent > 575 E127 continuation line over-indented for visual indent > 1092 E128 continuation line under-indented for visual indent > 773 E201 whitespace after '(' > 540 E202 whitespace before ')' > 23543 E203 whitespace before ':' > 55 E211 whitespace before '(' I'd like to see E201, E202, and E211 fixed (whitespace next to parentheses). The count for E203 is surprisingly high - I suspect that could include some large dictionaries? Note some of the dictionaries are auto-generated so the code to do that would also need fixing. > 180 E221 multiple spaces before operator > 59 E222 multiple spaces after operator > 5848 E225 missing whitespace around operator > 6517 E231 missing whitespace after ',' > 2544 E251 no spaces around keyword / parameter equals > 644 E261 at least two spaces before inline comment > 346 E262 inline comment should start with '# ' > 156 E301 expected 1 blank line, found 0 > 1838 E302 expected 2 blank lines, found 1 > 364 E303 too many blank lines (2) > 15553 E501 line too long (82 > 79 characters) > 857 E502 the backslash is redundant between brackets Fixing E502 seems a good idea, I suspect many of these are purely accidental due to not realising when they are redundant. > 291 E701 multiple statements on one line (colon) > 122 E711 comparison to None should be 'if cond is None:' > 3707 W291 trailing whitespace > 1913 W293 blank line contains whitespace > > I'm not sure where to go from here with regard to what's worth fixing and > what would be considered repo churn (or gratuitous changes that make > merging of existing patches harder). 
> > I'd especially like to clean up E301, E302, E301 and E302 presumable are about the recommended spacing between function, class and method names? If you want to fix them next that seems low risk in terms of complicating merges. > ... E701, E711, W291 and W293. Did you already fix most of those in today's pull request? https://github.com/biopython/biopython/pull/101 If there are more cases, then by all means fix them too. > Other items like E251 are more dubious, as some developers > seem to prefer the current style. > > What do you think? We have a range of styles in the current code base reflecting different authors - and also changes in the Python conventions as some of the code is now over ten years old. And if any of my personal coding style is flagged, I'm willing to adapt ;) (e.g. I've learnt not to put a space before if statement colons) As you point out, the "repo churn" from fixing minor things like spaces around operators does have a cost in making merges a little harder. Things like the exception style updates which you've already fixed (seems I missed some) are more urgent for Python 3 support, so worth doing anyway. You've got us a lot closer to PEP8 compliance - do you think subject to a short white list of known cases (like module names) where we don't follow PEP8 we could aim to run a a pep8 tool automatically (e.g. as a unit test, or even a commit hook)? That is quite appealing as a way to spot any new code which breaks the style guidelines... Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 14:02:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 14:02:40 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? 
In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 1:49 PM, Peter Cock wrote: > > Once that's done there is some housekeeping to do, like > the indexing code duplication with Bio.SeqIO, and tackling > indexing BGZF compressed files with Bio.SearchIO which > I will have a go at. > I've started work on SearchIO indexing of BGZF files now, enabling it was quite simple (the same code as used for the SeqIO indexing): https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f Thus far I've only tested this with BLAST XML, but that did require a bit of reworking to avoid doing file offset arithmetic: https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 I will resume this work later this afternoon, going over all the SearchIO file formats one by one. Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 16:49:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 16:49:47 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 2:02 PM, Peter Cock wrote: > > I've started work on SearchIO indexing of BGZF files now, > enabling it was quite simple (the same code as used for > the SeqIO indexing): > https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f > > Thus far I've only tested this with BLAST XML, but that did > require a bit of reworking to avoid doing file offset arithmetic: > https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 > > I will resume this work later this afternoon, going over all > the SearchIO file formats one by one. I've refactored test_SearchIO_index.py to make adding additional get_raw tests easier. Proper testing of all the formats with BGZF will need some larger test files (over 64k before compression) which we probably don't want to include in the repository.
However, I also added code to additionally test Bio.SearchIO.index_db(...).get_raw(...) as well as your original testing of Bio.SearchIO.index(...).get_raw(...) alone. These should return the exact same string, and that is now working nicely for BLAST XML (and BGZF from limited testing), but not on all the formats. Could you look at the difference in get_raw and the record length found during indexing for: blast-tab (with comments), hmmscan3-domtab, hmmer3-tab, and hmmer3-text? i.e. Anything where test_SearchIO_index.py is now printing a WARNING line when run. Thanks, Peter From christian at brueffer.de Mon Dec 3 17:02:31 2012 From: christian at brueffer.de (Christian Brueffer) Date: Tue, 04 Dec 2012 01:02:31 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> Message-ID: <50BCDB27.7040402@brueffer.de> On 12/3/12 21:34, Peter Cock wrote: > On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer > wrote: >> Hi, > > Hi Christian, > > Thanks for all the pull requests sorting out issues like this, in > terms of lines of code you'll probably be one of the top > contributors to the next release ;) This sort of work isn't as > high profile as new features or bug fixes, but has a more > subtle role in the long term of the project - making our code > easier to follow etc. So we do appreciate these contributions. > >> I just submitted pull request #102 which fixes several types of PEP8 >> warnings (found using the awesome pep8 tool). > > 101 not 102? https://github.com/biopython/biopython/pull/101 > 102 and 103 (I actually meant 103). >> Here's what's left after those fixes: >> >> $ pep8 --statistics -qq repos/biopython >> 789 E111 indentation is not a multiple of four > > That's nasty - although I think we've got rid of all the tabbed > indentation already which was also very annoying. > Some code uses two spaces etc, definitely worth fixing.
>> 673 E121 continuation line indentation is not a multiple of four > > I suspect many of those are a style judgement and done that > way to line up parentheses etc. > I'll see about those and apply case by case judgement. >> 693 E122 continuation line missing indentation or outdented >> 171 E123 closing bracket does not match indentation of opening bracket's >> line >> 86 E124 closing bracket does not match visual indentation >> 49 E125 continuation line does not distinguish itself from next logical >> line >> 197 E126 continuation line over-indented for hanging indent >> 575 E127 continuation line over-indented for visual indent >> 1092 E128 continuation line under-indented for visual indent >> 773 E201 whitespace after '(' >> 540 E202 whitespace before ')' >> 23543 E203 whitespace before ':' >> 55 E211 whitespace before '(' > > I'd like to see E201, E202, and E211 fixed (whitespace next to > parentheses). > > The count for E203 is surprisingly high - I suspect that > could include some large dictionaries? Note some of the > dictionaries are auto-generated so the code to do that > would also need fixing. > >> 180 E221 multiple spaces before operator >> 59 E222 multiple spaces after operator >> 5848 E225 missing whitespace around operator >> 6517 E231 missing whitespace after ',' >> 2544 E251 no spaces around keyword / parameter equals >> 644 E261 at least two spaces before inline comment >> 346 E262 inline comment should start with '# ' >> 156 E301 expected 1 blank line, found 0 >> 1838 E302 expected 2 blank lines, found 1 >> 364 E303 too many blank lines (2) >> 15553 E501 line too long (82 > 79 characters) >> 857 E502 the backslash is redundant between brackets > > Fixing E502 seems a good idea, I suspect many of these are > purely accidental due to not realising when they are redundant. > Agreed. 
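As a side note on what E502 flags, here is a toy example of mine (not taken from the Biopython code base):

```python
# E502: inside brackets Python already continues the line, so a trailing
# backslash is redundant (though still legal).
total = (1 + 2 + \
         3)          # flagged as E502 by the pep8 tool

total = (1 + 2 +
         3)          # same value, preferred style

print(total)  # -> 6
```

Removing the backslash changes nothing about how the expression parses, which is why these fixes are low risk.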
>> 291 E701 multiple statements on one line (colon) >> 122 E711 comparison to None should be 'if cond is None:' >> 3707 W291 trailing whitespace >> 1913 W293 blank line contains whitespace >> >> I'm not sure where to go from here with regard to what's worth fixing and >> what would be considered repo churn (or gratuitous changes that make >> merging of existing patches harder). >> >> I'd especially like to clean up E301, E302, > > E301 and E302 presumable are about the recommended spacing > between function, class and method names? If you want to fix > them next that seems low risk in terms of complicating merges. > That and spacing between functions or between a function and a new class. >> ... E701, E711, W291 and W293. > > Did you already fix most of those in today's pull request? > https://github.com/biopython/biopython/pull/101 > > If there are more cases, then by all means fix them too. > I fixed some in Nexus, that was before actually using the pep8 tool. >> Other items like E251 are more dubious, as some developers >> seem to prefer the current style. >> >> What do you think? > > We have a range of styles in the current code base reflecting > different authors - and also changes in the Python conventions > as some of the code is now over ten years old. And if any of > my personal coding style is flagged, I'm willing to adapt ;) > > (e.g. I've learnt not to put a space before if statement colons) > > As you point out, the "repo churn" from fixing minor things > like spaces around operators does have a cost in making > merges a little harder. Things like the exception style updates > which you've already fixed (seems I missed some) are more > urgent for Python 3 support, so worth doing anyway. > On the other hand, it's basically a one-time cost. However I want to fix the lowest-hanging fruit (read: the ones with the lowest counts ;-) first. 
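Peter's earlier idea of running a style check automatically (as a unit test or commit hook) could be sketched with just the standard library; this is a hedged toy version of mine covering only two of the warnings listed above, standing in for the real pep8 tool:

```python
# Minimal sketch of "style check as a unit test": scan source lines for
# two easy PEP8 violations (the real pep8 tool checks far more codes).
import re

CHECKS = {
    "W291 trailing whitespace": re.compile(r"\S[ \t]+$"),
    "W293 blank line contains whitespace": re.compile(r"^[ \t]+$"),
}

def style_violations(source):
    """Yield (lineno, code) tuples for each violating line."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for code, pattern in CHECKS.items():
            if pattern.search(line):
                yield lineno, code

code = "def f():  \n    \n    return 1\n"
print(sorted(style_violations(code)))
# -> [(1, 'W291 trailing whitespace'), (2, 'W293 blank line contains whitespace')]
```

Wrapped in a unit test that asserts an empty result over the source tree, a check like this (or the pep8 tool itself, with a whitelist of ignored codes) would flag new style regressions automatically.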
> You've got us a lot closer to PEP8 compliance - do you think > subject to a short white list of known cases (like module > names) where we don't follow PEP8 we could aim to run a > a pep8 tool automatically (e.g. as a unit test, or even a commit > hook)? That is quite appealing as a way to spot any new code > which breaks the style guidelines... > Having a commit hook would be ideal (maybe with a possibility to override). This would be especially useful against the introduction of gratuitous whitespace. With some editors/IDEs you don't even notice it. Chris From w.arindrarto at gmail.com Tue Dec 4 13:33:32 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 4 Dec 2012 14:33:32 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter and everyone, >> I've started work on SearchIO indexing of BGZF files now, >> enabling it was quite simple (the same code as used for >> SeqIO the indexing): >> https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f >> >> Thus far I've only tested this with BLAST XML, but that did >> require a bit of reworking to avoid doing file offset arithmetic: >> https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 >> >> I will resume this work later this afternoon, going over all >> the SearchIO file formats one by one. Yes, the original one that I wrote did have some less straightforward arithmetic as I was trying to adhere to the strict XML definition (i.e. no matter the whitespace outside of the start and end elements, indexing will still work). But line-based indexing should work too (and is simpler) so long as BLAST XML keeps its style (and any user modification afterwards doesn't introduce any wacky whitespaces). > I've refactored test_SearchIO_index.py to make adding > additional get_raw tests easier. 
Proper testing of all the > formats with BGZF will need some larger test files (over 64k > before compression) which we probably don't want to > include in the repository. > > However, I also added code to additionally test > Bio.SearchIO.index_db(...).get_raw(...) as well as your > original testing of Bio.SearchIO.index(...).get_raw(...) > alone. These should return the exact same string, and > that is now working nicely for BLAST XML (and BGZF > from limited testing), but not on all the formats. > > Could you look at the difference in get_raw and the > record length found during indexing for: blast-tab > (with comments), hmmscan3-domtab, hmmer3-tab, > and hmmer3-text? > > i.e. Anything where test_SearchIO_index.py is now > printing a WARNING line when run. Sure :). Based on a quick initial look, it seems that these are due to filler texts (e.g. the BLAST tab format ending with lines like "# BLAST processed 3 queries"). These texts won't affect the calculation results and the values of our objects, but do add additional text length. 
>>> it = SearchIO.parse('../broken.hsr', 'hmmer3-text')
>>> i = it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SearchIO/__init__.py", line 313, in parse
    for qresult in generator:
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 60, in __iter__
    for qresult in self._parse_qresult():
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 145, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 188, in _parse_hit
    hit_list = self._create_hits(hit_list, qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 309, in _create_hits
    self._parse_aln_block(hid, hit.hsps)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 358, in _parse_aln_block
    frag.query = aliseq
  File "Bio/SearchIO/_model/hsp.py", line 816, in _query_set
    self._query = self._set_seq(value, 'query')
  File "Bio/SearchIO/_model/hsp.py", line 784, in _set_seq
    len(seq), seq_type))
ValueError: Sequence lengths do not match. Expected: 202 (hit); found: 131 (query).
See the attached file broken.hsr for a dataset that triggers the error. If you remove the esterase hit (including the domain annotation), this error does not happen (broken2.hsr). If you insert fake position information into the query sequence line (broken3.hsr), the parser is happy again. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Dec 5 06:46:20 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 07:46:20 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi everyone, >> However, I also added code to additionally test >> Bio.SearchIO.index_db(...).get_raw(...) as well as your >> original testing of Bio.SearchIO.index(...).get_raw(...) >> alone. These should return the exact same string, and >> that is now working nicely for BLAST XML (and BGZF >> from limited testing), but not on all the formats. >> >> Could you look at the difference in get_raw and the >> record length found during indexing for: blast-tab >> (with comments), hmmscan3-domtab, hmmer3-tab, >> and hmmer3-text? >> >> i.e. Anything where test_SearchIO_index.py is now >> printing a WARNING line when run. > > Sure :). Based on a quick initial look, it seems that these are due to > filler texts (e.g. the BLAST > tab format ending with lines like "# BLAST processed 3 queries"). > These texts won't affect the calculation results and the values of our > objects, but does add additional text length. I've looked into this and submitted a pull request to fix the issues here: https://github.com/biopython/biopython/pull/111. The details on the errors are also there. 
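To make the length mismatch concrete, here is a toy sketch of splitting a commented blast-tab stream into per-query raw blocks; the function name and sample data are invented for illustration, not the actual Bio.SearchIO indexing code. The point is that the trailing "# BLAST processed ..." summary line has to be attached to the final record if get_raw is to round-trip the file exactly:

```python
SAMPLE = (
    "# BLASTN 2.2.26+\n"
    "# Query: q1\n"
    "q1\ts1\t100.00\n"
    "# BLASTN 2.2.26+\n"
    "# Query: q2\n"
    "q2\ts2\t99.00\n"
    "# BLAST processed 2 queries\n"
)


def raw_blocks(text):
    """Split commented blast-tab text into per-query raw strings.

    A new block starts at each '# BLAST<program>' header line; the trailing
    '# BLAST processed ...' summary is kept with the last block, so the
    blocks concatenate back to the original text.
    """
    blocks, current = [], []
    for line in text.splitlines(True):  # True keeps the line endings
        if line.startswith("# BLAST") and "processed" not in line and current:
            blocks.append("".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("".join(current))
    return blocks
```

With this convention the per-record lengths sum exactly to the file length, which is the invariant the WARNING lines in test_SearchIO_index.py were flagging.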
regards, Bow From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 07:24:14 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Wed, 05 Dec 2012 17:24:14 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. Message-ID: <50BEF69E.2000806@biotech.uni-tuebingen.de> Hi folks, I'm trying to finally get my hmmer2-text parser in, but I'm failing one unit test. The code is a bit too smart for me, it seems. So in the file I'm parsing, I only ever get the description of the hit in the hit table, like this (apologies if my mail client breaks this):

Model         Description                            Score    E-value  N
--------      -----------                            -----    ------- ---
Glu_synthase  Conserved region in glutamate synthas  858.6   3.6e-255  2

But of course I can't create a hit object when parsing the hit table, as I first need to have HSPFragments to create the hit object with. Anyway, I create a placeholder hit object that I'll later convert into a real Hit object. In that placeholder object, I set a description. Now I'm parsing the HSP table, looking like this:

Model     Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
--------  ------- ----- -----    ----- -----      -----  -------
GATase_2    1/1      34   404 ..     1   385 []   731.8 3.9e-226

The HSP table is in a different order than the hit table, so never mind the different model name. Now, I need to create an HSPFragment with the same description as the Hit object, or querying for the Hit object's description will cascade through the HSPs and HSPFragments, and return multiple values for the description. However, no matter what I do, I seem to get an empty description tossed in there somehow. The parser is at https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py the test code is at https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py and the test file that's failing is the hmmpfam2.3 file at https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out Any pointers would be appreciated. 
The code is working fine in my current development work in general, and I'd love to get it upstream to get rid of an extra patch step during installation. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Wed Dec 5 11:41:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 11:41:05 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto wrote: > Hi everyone, > > I've done some digging around to see how to deal with these issues. Here's what I found: > >> The BuildBot flagged two new issues overnight, >> http://testing.open-bio.org/biopython/tgrid >> >> Python 2.5 on Windows - doctests are failing due to floating point decimal place >> differences in the exponent (down to C library differences, something fixed in >> later Python releases). Perhaps a Python 2.5 hack is the way to go here? >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio > > I've submitted a pull request to fix this here: > https://github.com/biopython/biopython/pull/98 The Windows detection wasn't quite right, it should now match how we look for Windows elsewhere in Biopython: https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 >> There is a separate cross-platform issue on Python 3.1, "TypeError: >> invalid event tuple" again with XML parsing. Curiously this had started >> a few days back in the UniprotIO tests on one machine, pre-dating the >> SearchIO merge. I'm not sure what triggered it. 
>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio > > As for this one, it seems that it's caused by a bug in Python 3.1 > (http://bugs.python.org/issue9257) due to the way > `xml.etree.cElementTree.iterparse` accepts the `events` argument. Ah - I remember that bug now, we have a hack in place elsewhere to try and avoid that - seems it won't be fixed in Python 3.1.x now so I've relaxed the version check here: https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e Hopefully that will bring the buildbot back to all green tonight. (TravisCI has now dropped their Python 3.1 support, but they should have Python 3.3 with NumPy working soon). Peter From p.j.a.cock at googlemail.com Wed Dec 5 14:16:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 14:16:43 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BCDB27.7040402@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer wrote: >> As you point out, the "repo churn" from fixing minor things >> like spaces around operators does have a cost in making >> merges a little harder. Things like the exception style updates >> which you've already fixed (seems I missed some) are more >> urgent for Python 3 support, so worth doing anyway. >> > > On the other hand, it's basically a one-time cost. However I > want to fix the lowest-hanging fruit (read: the ones with the > lowest counts ;-) first. The sheer number of files touched in these PEP8 fixes would probably deserve to be called "repository churn" now - wow! 
Although we have good test coverage, it isn't complete (anyone fancy trying some test coverage measuring tools like figleaf?) so there is a small but real risk we've accidentally broken something. I'm wondering if therefore a 'beta' release would be prudent, or if I am just worrying about things too much? >> You've got us a lot closer to PEP8 compliance - do you think >> subject to a short white list of known cases (like module >> names) where we don't follow PEP8 we could aim to run a >> pep8 tool automatically (e.g. as a unit test, or even a commit >> hook)? That is quite appealing as a way to spot any new code >> which breaks the style guidelines... > > Having a commit hook would be ideal (maybe with a possibility to > override). This would be especially useful against the introduction of > gratuitous whitespace. With some editors/IDEs you don't even notice it. Would you be interested in looking into how to set that up? Presumably a client-side git hook would be best, but we'd need to explore cross platform issues (e.g. developing and testing on Windows) and making sure it allowed an override on demand (where the developer wants/needs to ignore a style warning). Thanks, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 13:50:21 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 13:50:21 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer Message-ID: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. 
upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse. This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? ..d The University of Dundee is a registered Scottish Charity, No: SC015096 From christian at brueffer.de Wed Dec 5 15:28:19 2012 From: christian at brueffer.de (Christian Brueffer) Date: Wed, 05 Dec 2012 23:28:19 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: <50BF6813.4070102@brueffer.de> On 12/5/12 22:16 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer > wrote: >>> As you point out, the "repo churn" from fixing minor things >>> like spaces around operators does have a cost in making >>> merges a little harder. Things like the exception style updates >>> which you've already fixed (seems I missed some) are more >>> urgent for Python 3 support, so worth doing anyway. >>> >> >> On the other hand, it's basically a one-time cost. However I >> want to fix the lowest-hanging fruit (read: the ones with the >> lowest counts ;-) first. > > The sheer number of files touched in these PEP8 fixes would > probably deserve to be called "repository churn" now - wow! > I wonder whether there's a file left I haven't touched yet (except the data files in Tests)... > Although we have good test coverage, it isn't complete (anyone > fancy trying some test coverage measuring tools like figleaf?) > so there is a small but real risk we've accidentally broken > something. I'm wondering if therefore a 'beta' release would > be prudent, or if I am just worrying about things too much? > It certainly can't hurt to advise users to have an extra eye on possible regressions and strange behaviours in existing code. 
I think the only risky changes were the ones concerning indentation (f68d334b1edfd743fe8a7bb4654046295f0ff939); I was extra careful about those. So, I'm pretty confident I haven't screwed things up, but it's good to be careful. FYI, here's the "pep8 --statistics -qq" output as of commit df4f12965a2ad3b6ed31bbf9d201bd5c716bd4ee:

  680 E121 continuation line indentation is not a multiple of four
  691 E122 continuation line missing indentation or outdented
  171 E123 closing bracket does not match indentation of opening bracket's line
   86 E124 closing bracket does not match visual indentation
  197 E126 continuation line over-indented for hanging indent
  601 E127 continuation line over-indented for visual indent
 1072 E128 continuation line under-indented for visual indent
  772 E201 whitespace after '('
  536 E202 whitespace before ')'
23444 E203 whitespace before ':'
   94 E221 multiple spaces before operator
   11 E222 multiple spaces after operator
 5763 E225 missing whitespace around operator
 6519 E231 missing whitespace after ','
 2542 E251 no spaces around keyword / parameter equals
  622 E261 at least two spaces before inline comment
  347 E262 inline comment should start with '# '
 1044 E302 expected 2 blank lines, found 1
    1 E303 too many blank lines (2)
15526 E501 line too long (82 > 79 characters)
    3 E711 comparison to None should be 'if cond is None:'
   75 W291 trailing whitespace
   12 W293 blank line contains whitespace
    5 W601 .has_key() is deprecated, use 'in'

E203 looks scary, but 9900 of those are in Bio/SubsMat/MatrixInfo.py alone. >>> You've got us a lot closer to PEP8 compliance - do you think >>> subject to a short white list of known cases (like module >>> names) where we don't follow PEP8 we could aim to run a >>> pep8 tool automatically (e.g. as a unit test, or even a commit >>> hook)? That is quite appealing as a way to spot any new code >>> which breaks the style guidelines... >> >> Having a commit hook would be ideal (maybe with a possibility to >> override). 
This would be especially useful against the introduction of >> gratuitous whitespace. With some editors/IDEs you don't even notice it. > Would you be interested in looking into how to set that up? > Presumably a client-side git hook would be best, but we'd > need to explore cross platform issues (e.g. developing and > testing on Windows) and making sure it allowed an override > on demand (where the developer wants/needs to ignore a > style warning). > Yes, it's fairly high on my TODO list. Chris From p.j.a.cock at googlemail.com Wed Dec 5 15:57:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 15:57:44 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 1:50 PM, David Martin wrote: > Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. > > I'd like to modify the CircularDrawer feature drawing to allow the following: > > label_position: start|middle|end as per LinearDrawer I would find it natural if we treated start/middle/end from the point of view of the feature (and its strand) as in the LinearDrawer. However the current circular drawer tries to position things at the vertical bottom of the feature (it cares about the left and right halves of the circle) which is rather different. 
> label_placement: inside|outside|overlap where inside and outside are > anchored just inside and just outside the feature but do not overlap it, > and overlap is the current behaviour If I have understood your intended meaning, that won't work nicely with stranded features. I would suggest two options: outside (i.e. outside the feature's bounding box, either outside the track circle for forward strand or strand-less, or inside the track circle for reverse strand) matching the current linear code, or inside matching the current circular code. i.e. This would essentially toggle the text element's anchoring between start/end. i.e. Maintain the convention that labels above/outside the track are for the forward strand (and strand-less) features, while labels below/inside the track are for reverse strand features. > label_orientation: upright|circular which determines the orientation of > the label. upright is the current behaviour. Circular would be oriented > to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). > This will cause some issues with track widths (how can you specify a > track width for a feature track?) Do you mean how to allocate more white space between the tracks to ensure the labels have a clear background if printed outside the features? The quick and dirty solution is a spacer track (you can allocate track numbers to leave a gap). > Any thoughts/suggestions? > Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week). 
Regards, Peter From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 16:28:26 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:28:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On 5 Dec 2012, at Wednesday, December 5, 15:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 1:50 PM, David Martin > wrote: label_position: start|middle|end as per LinearDrawer I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. Yep - I agree label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). Good point - the automatic reorientation on either side of the circle (to respect the viewer's local gravity) could effectively be handled through a working label_angle for circular diagrams. And more adventurous manual reorientation would also be possible ;) One issue there is what the angle is defined with respect to: a 'vertical' reference on the page, or a tangent/normal to some point on the feature. The first is straightforward, and might be what we want - the second will likely result in some odd - or attractive - patterns. Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week). Friday's good for me. L. 
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From ben at benfulton.net Wed Dec 5 16:28:52 2012 From: ben at benfulton.net (Ben Fulton) Date: Wed, 5 Dec 2012 11:28:52 -0500 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: I've been studying this a bit and have a preference for Ned Batchelder's Coverage tool. But I plan on putting some more work into it this week and next. 
On Wed, Dec 5, 2012 at 9:16 AM, Peter Cock wrote >Although we have good test coverage, it isn't complete (anyone >fancy trying some test coverage measuring tools like figleaf?) From w.arindrarto at gmail.com Wed Dec 5 16:39:13 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 17:39:13 +0100 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: <50BEF69E.2000806@biotech.uni-tuebingen.de> References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: Hi Kai and everyone, Very happy to see the parser near completion (with tests too!). The issue you're facing is unfortunately the consequence of trying to keep attribute values in sync across the object hierarchy. It is a bit troublesome for now, but not without solution. > However, no matter what I do, I seem to get an empty description > tossed in there somehow. > > The parser is at > https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py > the test code is at > https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py > and the test file that's failing is the hmmpfam2.3 file at > https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out '' is the default value for any description attribute (be it in the QueryResult object, or in the HSPFragment.hit_description). The error you're seeing is because the hit description is being accessed through the hit object (hit.description) and the cascading property getter first checks whether all HSPs contain the same `hit_description` attribute value. It'll only return the value if all HSPFragment.hit_description values are equal. Otherwise, it'll raise the error you're seeing here. In your case, there are two values: 'Conserved region in glutamate synthas' and '', while there should only be one (the first one). 
After prodding here and there, it seems that this is caused by the if clause here: https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py#L191 The 'else' clause in that block adds the HSP to the hit object, but does not do any cascading attribute assignment (query_description and hit_description). Here, the simple fix would be to force a description assignment to the HSP. For example, you could have the `else` block like so:

    ...
    else:
        hit = unordered_hits[id_]
        hsp.hit_description = hit.description
        hit.append(hsp)

Other fixes are of course possible, but this is the simplest I can imagine (though it seems a bit crude). Also, I would like to note that the query description assignment of the parser may break the cascade as well. If you try to access `qresult.description` (qresult being the QueryResult object), you'd get the true query description. But if you try to access it from `qresult[0].query_description` (the query description stored in the hit object), you'd get ''. The fix here would be to assign the description at the last moment before the QueryResult object is yielded. That way, the cascading setter works properly and all Hit, HSP, and HSPFragment inside the QueryResult object will contain the same value. I realize that this approach is not without flaws (and I'm always open to suggestions), but at the moment this seems to be the most sensible way to keep the attribute values in-sync while keeping the objects more user-friendly (i.e. making the parser slightly more complex to write, but with the result of consistent attribute values for the users). Hope this helps! 
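The cascading getter behaviour described above can be sketched with a toy pair of classes (invented names, not the real Bio.SearchIO._model classes): the container only reports a description when every fragment agrees, and raises a ValueError otherwise, which is the failure mode Kai hit:

```python
class Fragment:
    """Stand-in for an HSPFragment carrying a cascaded attribute."""
    def __init__(self, hit_description=""):
        self.hit_description = hit_description


class Hit:
    """Stand-in for a Hit whose description cascades from its fragments."""
    def __init__(self, fragments):
        self.fragments = fragments

    @property
    def description(self):
        # Only return a value when all fragments agree on it.
        values = set(frag.hit_description for frag in self.fragments)
        if len(values) > 1:
            # Mirrors the "multiple values" error seen in the thread.
            raise ValueError("inconsistent hit descriptions: %r" % sorted(values))
        return values.pop()
```

Forgetting to set hit_description on one fragment leaves it at the default '', which is why two values ('Conserved region in glutamate synthas' and '') show up.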
Bow From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 16:21:06 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:21:06 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? 
If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? [cid:4EA13CE3-20E7-41D8-870F-CBBAA9DD06B0 at scri.sari.ac.uk] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png Type: image/png Size: 22969 bytes Desc: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png URL: From d.m.a.martin at dundee.ac.uk Wed Dec 5 16:29:14 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 16:29:14 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Just got my head out of hacking at this. The options I have now are:

label_position: start|middle|end with reference to the feature. So the end is always the pointy bit.
label_orientation: circular|upright Sometimes it is nice to have a proper circular plot
label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside.

It even works. Angles and so on are not so relevant with circular plots though I would prefer a label_angle: radial|tangent|[degrees] Should I attach an example? 
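For the proposed 'circular' label orientation, the geometry is just a tangent rotation that flips by 180 degrees on the reverse strand. A standalone sketch (the helper name and conventions are invented here; this is not GenomeDiagram code), with position measured clockwise from 12 o'clock as a fraction of the circle:

```python
import math


def label_transform(fraction, radius, strand=1):
    """Return (x, y, rotation_degrees) for a label anchored at `fraction`
    (0..1, clockwise from 12 o'clock) on a circular track of `radius`.

    Forward-strand labels read clockwise along the tangent; reverse-strand
    labels are flipped 180 degrees so they read anticlockwise.
    """
    theta = math.pi / 2 - 2 * math.pi * fraction  # clockwise from the top
    x = radius * math.cos(theta)
    y = radius * math.sin(theta)
    rotation = math.degrees(theta) - 90  # tangent, pointing clockwise
    if strand < 0:
        rotation += 180
    return x, y, rotation
```

At the top of the circle a forward-strand label comes out unrotated, and at 3 o'clock it is rotated -90 degrees, which matches the "face clockwise" behaviour described above.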
..d From: Leighton Pritchard [mailto:Leighton.Pritchard at hutton.ac.uk] Sent: 05 December 2012 16:21 To: David Martin Cc: BioPython-Dev; Peter Cock Subject: Re: [Biopython-dev] Modifications to CircularDrawer Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? 
label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827
The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22969 bytes Desc: image001.png URL: From p.j.a.cock at googlemail.com Wed Dec 5 16:57:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 16:57:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: > Just got my head out of hacking at this. The options I have now are: > > label_position: start|middle|end with reference to the feature. So the end is > always the pointy bit. Sounds good and uncontentious. > label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). > label_placement: inside|outside|overlap|strand which maintains overlap as > default, inside is all inside, outside is all outside, strand is forward outside > and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? Note the current circular behaviour which overlaps is strand aware, so those may not be the best names... See also my earlier email with an alternative suggestion. > It even works.
Angles and so on are not so relevant with circular plots > though I would prefer a label_angle: radial|tangent|[degrees] > > Should I attach an example? You can try if the files are not overly large (moderation delays will still occur), posting a link would be easier although probably less lasting. Are you OK with github? A natural option would be to show us your proposals on a branch (separate commits if possible, otherwise I can try and break out each bit if needed). Ta, Peter From p.j.a.cock at googlemail.com Wed Dec 5 17:24:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 17:24:08 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains overlap as >> default, inside is all inside, outside is all outside, strand is forward outside >> and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be done > to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well.
Regards, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 17:30:26 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 17:30:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5EE5@AMSPRD0410MB351.eurprd04.prod.outlook.com> -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: 05 December 2012 17:24 To: David Martin Cc: Leighton Pritchard; BioPython-Dev Subject: Re: [Biopython-dev] Modifications to CircularDrawer On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains >> overlap as default, inside is all inside, outside is all outside, >> strand is forward outside and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be > done to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well. Linear and Circular are similar but not identical. No problem with having a above|below|strand or a more complex anchoring scheme but I don't need it right now so I'm just playing with the circular one. I've attached a PDF to this mail - it might get through and I'll try to fork/clone/push git. ..d The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: plasmid_circular_nice.pdf Type: application/pdf Size: 148125 bytes Desc: plasmid_circular_nice.pdf URL: From p.j.a.cock at googlemail.com Wed Dec 5 18:41:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 18:41:59 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi David, I've been experimenting with your pull request, thank you: https://github.com/biopython/biopython/pull/116 On Wed, Dec 5, 2012 at 5:22 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 5:10 PM, David Martin wrote: >> In the mean-time here is a plot (that doesn't show all layouts) > > Nice. Looking at that now I'm pretty sure I hacked the label anchor > once before of a quick job in order to get the labels outside like that... > certainly worth making this change. Found it, that change made it to a branch I'd forgotten about: https://github.com/peterjc/biopython/commit/d4764dfe929f135ec55b83ad14a9cd34e2d14bba This is bringing back memories... I think I'd concluded last time that attempting to offer anything other than radial label orientation was probably a mistake, and that if we restrict that we can safely offset the vertical position of the text midline (since right now it is positioned according to the bottom line of the font). Without that, positioning labels at the top (as you look at the page) of a circular feature gave non-ideal placement. This is likely one reason for the current hard-coded placement of the feature labels at the bottom (as you look at the circle). Hmm. 
I think I have a compromise forming that would allow figures like your motivating example :) Peter From kai.blin at biotech.uni-tuebingen.de Thu Dec 6 01:44:40 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Thu, 06 Dec 2012 11:44:40 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: <50BFF888.50300@biotech.uni-tuebingen.de> On 2012-12-06 02:39, Wibowo Arindrarto wrote: Hi Bow, everyone, > Very happy to see the parser near completion (with tests too!). The > issue you're facing is unfortunately the consequence of trying to keep > attribute values in sync across the object hierarchy. It is a bit > troublesome for now, but not without solution. ... > Here, the simple fix would be to force a description assignment to the > HSP. For example, you could have the `else` block like so: > > ... > else: > hit = unordered_hits[id_] > hsp.hit_description = hit.description > hit.append(hsp) Thanks for the tip, that was the last speedbump I had. I just sent off the pull request for the hmmer2 parser. Thanks again for the help, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From christian at brueffer.de Thu Dec 6 04:04:37 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 12:04:37 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BF6813.4070102@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> Message-ID: <50C01955.8060505@brueffer.de> On 12/05/2012 11:28 PM, Christian Brueffer wrote: > On 12/5/12 22:16 , Peter Cock wrote: [...] 
> >>>> You've got us a lot closer to PEP8 compliance - do you think >>>> subject to a short white list of known cases (like module >>>> names) where we don't follow PEP8 we could aim to run a >>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>> hook)? That is quite appealing as a way to spot any new code >>>> which breaks the style guidelines... >>> >>> Having a commit hook would be ideal (maybe with a possibility to >>> override). This would be especially useful against the introduction of >>> gratuitous whitespace. With some editors/IDEs you don't even notice it. >> >> Would you be interested in looking into how to set that up? >> Presumably a client-side git hook would be best, but we'd >> need to explore cross platform issues (e.g. developing and >> testing on Windows) and making sure it allowed an override >> on demand (where the developer wants/needs to ignore a >> style warning). >> > > Yes, It's fairly high on my TODO list. > I just had a look at this. Turns out some people have had this idea before :-) Here's a first version: https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit Basically you just save this as biopython/.git/hooks/pre-commit and mark it executable. You also need to install pep8 (pip install pep8). The checks can be bypassed with git commit --no-verify. Currently it ignores E124 (which I think should remain that way). Any other errors or files it should ignore? I'd be grateful if someone could give this a try on Windows. 
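The ignore-list behaviour the hook needs can be sketched independently of git and pep8 itself; this hypothetical helper (not the actual pre-commit script) drops report lines whose error code is on an ignore set, as the E124 exclusion above does:

```python
# Codes to skip, per the exclusions discussed in this thread (E124, plus
# the further E12x candidates proposed later); E125 is deliberately kept.
IGNORED_CODES = {"E121", "E122", "E123", "E124", "E126", "E127", "E128"}

def filter_pep8_report(lines, ignored=IGNORED_CODES):
    """Drop pep8 report lines whose error code is in the ignore set.

    pep8 report lines look like: 'Bio/File.py:10:5: E128 continuation ...'
    """
    kept = []
    for line in lines:
        parts = line.split(":", 3)  # path, row, col, ' CODE message'
        code = parts[3].split()[0] if len(parts) == 4 else ""
        if code not in ignored:
            kept.append(line)
    return kept

report = [
    "Bio/File.py:10:5: E128 continuation line under-indented for visual indent",
    "Bio/File.py:42:9: E125 continuation line with same indent as next logical line",
]
print(filter_pep8_report(report))
```

A hook built this way could exit non-zero only when the filtered report is non-empty, leaving git commit --no-verify as the escape hatch.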
Chris From christian at brueffer.de Thu Dec 6 06:22:24 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 14:22:24 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50C01955.8060505@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> <50C01955.8060505@brueffer.de> Message-ID: <50C039A0.8040208@brueffer.de> On 12/06/2012 12:04 PM, Christian Brueffer wrote: > On 12/05/2012 11:28 PM, Christian Brueffer wrote: >> On 12/5/12 22:16 , Peter Cock wrote: > [...] >> >>>>> You've got us a lot closer to PEP8 compliance - do you think >>>>> subject to a short white list of known cases (like module >>>>> names) where we don't follow PEP8 we could aim to run a >>>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>>> hook)? That is quite appealing as a way to spot any new code >>>>> which breaks the style guidelines... >>>> >>>> Having a commit hook would be ideal (maybe with a possibility to >>>> override). This would be especially useful against the introduction of >>>> gratuitous whitespace. With some editors/IDEs you don't even notice >>>> it. >>> >>> Would you be interested in looking into how to set that up? >>> Presumably a client-side git hook would be best, but we'd >>> need to explore cross platform issues (e.g. developing and >>> testing on Windows) and making sure it allowed an override >>> on demand (where the developer wants/needs to ignore a >>> style warning). >>> >> >> Yes, It's fairly high on my TODO list. >> > > I just had a look at this. Turns out some people have had this idea > before :-) > > Here's a first version: > > https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit > > Basically you just save this as biopython/.git/hooks/pre-commit and mark > it executable. You also need to install pep8 (pip install pep8). The > checks can be bypassed with git commit --no-verify. 
> > Currently it ignores E124 (which I think should remain that way). Any > other errors or files it should ignore? > > I'd be grateful if someone could give this a try on Windows. > Thinking about it, I think it would make sense to ignore the following: E121 continuation line indentation is not a multiple of four E122 continuation line missing indentation or outdented E123 closing bracket does not match indentation of opening bracket's line E124 closing bracket does not match visual indentation E126 continuation line over-indented for hanging indent E127 continuation line over-indented for visual indent E128 continuation line under-indented for visual indent They all deal with indentation, but are not always beneficial to readability. E125 is missing from that list, which is a useful one. Chris From p.j.a.cock at googlemail.com Thu Dec 6 10:07:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:07:55 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Wed, Dec 5, 2012 at 11:41 AM, Peter Cock wrote: > On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I've done some digging around to see how to deal with these issues. >> Here's what I found: >> >>> The BuildBot flagged two new issues overnight, >>> http://testing.open-bio.org/biopython/tgrid >>> >>> Python 2.5 on Windows - doctests are failing due to floating point decimal place >>> differences in the exponent (down to C library differences, something fixed in >>> later Python releases). Perhaps a Python 2.5 hack is the way to go here? 
>>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio >> >> I've submitted a pull request to fix this here: >> https://github.com/biopython/biopython/pull/98 > > The Windows detection wasn't quite right, it should now match > how we look for Windows elsewhere in Biopython: > https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 > >>> There is a separate cross-platform issue on Python 3.1, "TypeError: >>> invalid event tuple" again with XML parsing. Curiously this had started >>> a few days back in the UniprotIO tests on one machine, pre-dating the >>> SearchIO merge. I'm not sure what triggered it. >>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >>> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio >> >> As for this one, it seems that it's caused by a bug in Python3.1 >> (http://bugs.python.org/issue9257) due to the way >> `xml.etree.cElemenTree.iterparse` accepts the `event` argument. > > Ah - I remember that bug now, we have a hack in place elsewhere > to try and avoid that - seems it won't be fixed in Python 3.1.x now > so I've relaxed the version check here: > https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e > > Hopefully that will bring the buildbot back to all green tonight. > (TravisCI has now dropped their Python 3.1 support, but they > should have Python 3.3 with NumPy working soon). > > Peter OK, the buildbot looks happy now from the SearchIO work. There is one issue under Python 3.1.5 on a 64 bit Linux server, which I suspect is down to the Python version (this buildslave used to run an older version - Python 3.1.3 (separate email to follow). 
Regards, Peter From p.j.a.cock at googlemail.com Thu Dec 6 10:24:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:24:47 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? Message-ID: On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: > > OK, the buildbot looks happy now from the SearchIO work. > > There is one issue under Python 3.1.5 on a 64 bit Linux server, > which I suspect is down to the Python version (this buildslave > used to run an older version - Python 3.1.3 (separate email > to follow). There are 18 test failures like this - all to do with handles and stdout, which have been happening for a while now but I've not found time to look into it. Example: ====================================================================== ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) needle with asis trick, output piped to stdout. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 74, in __next__ line = self._header AttributeError: 'EmbossIterator' object has no attribute '_header' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", line 571, in test_needle_piped align = AlignIO.read(child.stdout, "emboss") File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 418, in read first = next(iterator) File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 366, in parse for a in i: File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 77, in __next__ line = 
handle.readline() AttributeError: '_io.FileIO' object has no attribute 'read1' Last working build, Python 3.1.3, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 Next build (after a couple of weeks offline while this server was being rebuilt), Python 3.1.5, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 The timing does suggest an issue introduced in the rebuild, and the obvious difference is the version of Python jumped from 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). There were some security fixes only in Python 3.1.5, none of which sound relevant here: http://www.python.org/download/releases/3.1.5/ The change log for Python 3.1.4 is longer, and does mention stdout/stderr issues so this is perhaps the cause: hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS See also http://bugs.python.org/issue4996 as possibly related. The whole Python 3 text vs binary handle issue is important with stdout/stderr. What I am doing now is testing those two commits (with Python 3.1.5) to confirm they both fail, and thus rule out a Biopython code change in those two weeks being to blame. Peter From p.j.a.cock at googlemail.com Thu Dec 6 10:45:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:45:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: On Thu, Dec 6, 2012 at 10:24 AM, Peter Cock wrote: > On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: >> >> OK, the buildbot looks happy now from the SearchIO work.
>> >> There is one issue under Python 3.1.5 on a 64 bit Linux server, >> which I suspect is down to the Python version (this buildslave >> used to run an older version - Python 3.1.3 (separate email >> to follow). > > There are 18 test failures like this - all to do with handles and stdout, > which have been happening for a while now but I've not found time > to look into it. Example: > > ====================================================================== > ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) > needle with asis trick, output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 74, in __next__ > line = self._header > AttributeError: 'EmbossIterator' object has no attribute '_header' > > During handling of the above exception, another exception occurred: > > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", > line 571, in test_needle_piped > align = AlignIO.read(child.stdout, "emboss") > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 418, in read > first = next(iterator) > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 366, in parse > for a in i: > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 77, in __next__ > line = handle.readline() > AttributeError: '_io.FileIO' object has no attribute 'read1' > > Lasting working build, Python 3.1.3, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio > 
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 > > Next build (after a couple of weeks offline while this server was > being rebuilt), Python 3.1.5, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio > https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 > > The timing does suggest an issue introduced in the rebuild, and > the obvious difference is the version of Python jumped from > 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). > > There were some security fixes only in Python 3.1.5, none of > which sound relevant here: > http://www.python.org/download/releases/3.1.5/ > > The change log for Python 3.1.4 is longer, and does mention > stdout/stderr issues so this is perhaps the cause: > hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS > > See also http://bugs.python.org/issue4996 as possibly > related. The whole Python 3 text vs binary handle issue > is important with stdout/stderr. > > What I am doing now is testing those two commits (with > Python 3.1.5) to confirm they both fail, and thus rule out > a Biopython code change in those two weeks being to > blame. > > Peter Confirmed, using test_Emboss.py and Python 3.1.5 on this machine (running as the buildslave user using the same Python 3.1.5 installation), using the current tip 5092e0e9f2326da582158fd22090f31547679160 and the two commits mentioned above, that is e90db11f4a1d983bc2bfe12bec30edbdbb200634 and 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - all three builds show the same failure. i.e. The failure is not due to a change in Biopython between those commits, but is in some way caused by a change to the buildslave environment. My first suggestion that this is due to Python 3.1.3 -> 3.1.5 remains my prime suspect. I could try downgrading Python 3.1 on this machine to confirm that I suppose... or updating Python 3.1 on another machine? 
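For what it's worth, the 'read1' failure above is characteristic of wrapping a raw (unbuffered) file object directly in io.TextIOWrapper - text wrappers need the read1() method that only buffered readers provide. Here is a self-contained sketch of that workaround, simulating the subprocess pipe with os.pipe; whether this is exactly what changed between 3.1.3 and 3.1.5 is speculation:

```python
import io
import os

# Simulate a subprocess-style pipe whose read end is a raw FileIO object.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"needle output\n")
os.close(write_fd)

raw = io.FileIO(read_fd, "rb")
# io.TextIOWrapper(raw) can fail with "'_io.FileIO' object has no
# attribute 'read1'"; inserting a BufferedReader supplies read1().
handle = io.TextIOWrapper(io.BufferedReader(raw))
line = handle.readline()
handle.close()
print(repr(line))
```

If the regression is in how child.stdout gets exposed, an explicit buffered/text wrapping like this in the test (or parser) would sidestep it.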
The other recent Python 3.1 buildbot runs were both using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). Can anyone else reproduce this, or have an idea what the fix might be? Regards, Peter From Leighton.Pritchard at hutton.ac.uk Thu Dec 6 12:28:39 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 6 Dec 2012 12:28:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, I'm starting to remember why I left circular labelling options alone ;) On 5 Dec 2012, at Wednesday, December 5, 16:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 4:29 PM, David Martin > wrote: label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). I still don't like 'upright' - but that's a naming issue, rather than one of functionality. label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? 'Below' and 'above' are context- (and viewer!) dependent: on a circular diagram 'above' on a feature at 12 o'clock is on the opposite side of the feature when it's 'above' at 6 o'clock. It's not clear what either would mean for a feature at 3 o'clock or 9 o'clock. 'Inside' and 'outside' are stably relative to the circular track for a feature at any position on the circle, so I prefer them as settings. 
I'm not keen on 'overlap' or 'strand', as I'm not clear what kind of label orientation they refer to: for example, what is being 'overlapped'? Looking at the .pdf, it seems like you've anchored the green labels to the track, rather than to the feature, which I think looks good there - but I'd like to have the option of track vs feature anchoring available via an argument like 'label_anchor', which could be distinguished from 'label_text_anchor'. Including this choice, my preferred arguments would be something like: label_direction='clockwise'|'anticlockwise' - 'clockwise': The text looks like it's progressing clockwise (like the green text in the .pdf); 'anticlockwise' like the blue text. By choosing 'clockwise' or 'anticlockwise' for the appropriate group of features, we achieve part of what I think you might mean by 'upright' (i.e. clockwise from pi/2 to 3pi/2, anticlockwise elsewhere). That could be handled with an 'auto' option. This argument essentially dictates label_angle for each feature: more of which later. It would be nice to have synonyms of 'counterclockwise', 'anticlockwise' and 'widdershins' ;) label_anchor='track'|'feature' Describes what element the text bounding box will be anchored to. label_text_anchor='start'|'end' Which part of the text bounding box (relative to the text) gets anchored. I think it's a good idea to have this wrap a lower-level setting that has label_text_anchor=float, as a relative location on the feature, where start=0, center=0.5, end=1, and values beyond that offer a label separation, relative to the label size - though I can't imagine why I'd use it over the option below - since spacing would depend on bounding box size - the flexibility could be useful, and you'd have to do that calculation anyway ;) label_placement='inner'|'outer' Do we anchor on the track/feature towards the circle centre (inner) or on the other side (outer)? 
I think it's a good idea to have this wrap a lower-level representation that has label_placement=float, as a relative location on the feature, where inner=-1,outer=1 as a proportion of track/feature height, and other values place the anchor relative to the feature/track boundary - this again offers a choice of label separation, but one that's uniform for all features. label_position='start'|'end'|'center' Where, relative to the feature, do we anchor? I think it's a good idea to have this wrap a lower-level representation that has label_position=[0,1], as a relative location on the feature, where start=0, center=0.5, end=1. That gives more flexibility for those who want it (and you have to do the calculation, anyway). label_orientation='radial'|'horizontal' Fairly obviously, 'radial' = as it is now, and 'horizontal' is reading like regular text. But this one's a tricky one, which is why all the labels are radial at the moment ;) I think that this choice has to either live with ('radial') or override ('horizontal') the label_direction argument. As with label_direction, this essentially dictates label_angle for each individual feature, which has its own issues (what do we measure the angle relative to? If it's relative to a common reference, then for a constant angle you get some funny-looking label patterns, and it doesn't look good in bulk. Relative to a feature-local reference, we can choose the tangent or the normal - but at what point of the feature? Really, we want that to be the tangent or normal at the anchor point of the text, so that the same angle looks consistent across all features (45deg to the normal at the start of a long feature is different to 45deg to the normal at the centre of that feature, relative to the bottom of the page: this looks weird)). 
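As an illustration of the geometry being discussed, here is a minimal standalone sketch (all names and conventions hypothetical, not GenomeDiagram code) computing an anchor point and a tangent-based angle for a label, using the float forms proposed above (label_position in [0, 1] along the feature, label_placement in [-1, 1] across the track) plus an 'auto' direction rule (clockwise from pi/2 to 3pi/2, anticlockwise elsewhere):

```python
import math

def label_anchor(start_angle, end_angle, track_radius, track_height,
                 label_position=0.5, label_placement=1.0):
    """Anchor point (x, y) and local tangent angle for a circular label.

    Angles are in radians, measured anticlockwise from 3 o'clock, with
    the origin at the circle centre. label_position: 0=start, 0.5=centre,
    1=end of the feature. label_placement: -1=inner edge, +1=outer edge.
    """
    theta = start_angle + label_position * (end_angle - start_angle)
    radius = track_radius + label_placement * (track_height / 2.0)
    x = radius * math.cos(theta)
    y = radius * math.sin(theta)
    tangent = theta + math.pi / 2.0  # direction of increasing angle
    return x, y, tangent

def auto_direction(theta):
    """'clockwise' in the left half of the circle (pi/2 .. 3*pi/2)."""
    theta %= 2 * math.pi
    if math.pi / 2 <= theta < 3 * math.pi / 2:
        return "clockwise"
    return "anticlockwise"

# A feature spanning 0..pi/4 on a track of radius 100 and height 10,
# anchoring at the feature start on the outer edge of the track:
x, y, ang = label_anchor(0.0, math.pi / 4, 100.0, 10.0,
                         label_position=0.0, label_placement=1.0)
print(round(x, 1), round(y, 1))  # -> 105.0 0.0
```

The tangent returned here is the angle at the anchor point itself, which is the feature-local reference argued for above; a renderer would still need to flip text by pi depending on the chosen direction.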
A complicating issue here with text anchoring is what part of the text box gets anchored: depending on the font, and the string, choosing the top or bottom of the bounding box (which will include ascender and descender spaces) can look weird, so it's probably best to anchor on the midline of the text box. This avoids a problem with 'anticlockwise' vs 'clockwise' when implemented as a rotation, in that anchoring to the lower left of text, then rotating 180deg around the centre of the text box gives a different final positioning (and anchoring) than anchoring to the midline of the text box, then performing the same rotation.

By appropriate choices of these settings, we can obtain pretty much any labelling style. We need to keep in mind, though, that the arguments won't be interpreted properly until the Diagram gets passed to the renderer, so 'auto' settings to achieve a particular effect with complicated combinations of arguments dependent on feature location might be better passed with draw(). As specific examples:

1) Let's say the effect we're looking for is horizontal text, anchored to the outside of the track. Here we'd need to consider two halves of the diagram. On the left hand side we need to set label_text_anchor='end', and on the right we set label_text_anchor='start'. On both sides we set label_orientation='horizontal', label_anchor='track', label_placement='outer'. However, we need to take care with features towards the top and bottom of the image, as horizontal labels will run into each other here.

2) Dropping the requirement for horizontal text, we can set label_orientation='radial', label_anchor='track', label_placement='outer' on both sides (maybe this should be the default?), but set label_direction='clockwise', label_text_anchor='end' on the left, and label_direction='counterclockwise', label_text_anchor='start' on the right.
3) If we wanted to label features directly, on the appropriate side of their track, we could set label_anchor='feature' for all features, with label_placement='inner' for reverse-strand, and label_placement='outer' for forward-strand features.

These are some fairly obvious standard settings which could be made available as presets in the calls to draw(), so that the fiddly details are hidden.

Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

From w.arindrarto at gmail.com Fri Dec 7 03:32:06 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 7 Dec 2012 04:32:06 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: > Confirmed, using test_Emboss.py and Python 3.1.5 on > this machine (running as the buildslave user using the > same Python 3.1.5 installation), using the current tip > 5092e0e9f2326da582158fd22090f31547679160 and > the two commits mentioned above, that is > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > all three builds show the same failure. > > i.e. The failure is not due to a change in Biopython > between those commits, but is in some way caused > by a change to the buildslave environment. My first > suggestion that this is due to Python 3.1.3 -> 3.1.5 > remains my prime suspect. > > I could try downgrading Python 3.1 on this machine > to confirm that I suppose... or updating Python 3.1 on > another machine? > > The other recent Python 3.1 buildbot runs were both > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > Can anyone else reproduce this, or have an idea what > the fix might be? It's reproducible in my machine: Arch Linux 64 bit running Python3.1.5. Haven't figured out a fix yet, but trying to see if I can. By the way, I was wondering, what's our deprecation policy for Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't seem to be any major updates coming soon. How long should we keep supporting Python <3.2? regards, Bow From p.j.a.cock at googlemail.com Fri Dec 7 10:06:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Dec 2012 10:06:57 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout?
In-Reply-To: References: Message-ID: On Fri, Dec 7, 2012 at 3:32 AM, Wibowo Arindrarto wrote: > > > Confirmed, using test_Emboss.py and Python 3.1.5 on > > this machine (running as the buildslave user using the > > same Python 3.1.5 installation), using the current tip > > 5092e0e9f2326da582158fd22090f31547679160 and > > the two commits mentioned above, that is > > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > > all three builds show the same failure. > > > > i.e. The failure is not due to a change in Biopython > > between those commits, but is in some way caused > > by a change to the buildslave environment. My first > > suggestion that this is due to Python 3.1.3 -> 3.1.5 > > remains my prime suspect. > > > > I could try downgrading Python 3.1 on this machine > > to confirm that I suppose... or updating Python 3.1 on > > another machine? > > > > The other recent Python 3.1 buildbot runs were both > > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > > > Can anyone else reproduce this, or have an idea what > > the fix might be? > > It's reproducible in my machine: Arch Linux 64 bit running > Python3.1.5. Haven't figured out a fix yet, but trying to see if I > can. Great. We haven't really proved this is down to a change in either Python 3.1.4 or 3.1.5 but it does look likely. > > By the way, I was wondering, what's our deprecation policy for > Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't > seem to be any major updates coming soon. How long should we keep > supporting Python <3.2? As long as it doesn't cost us much effort? If we can't solve this issue easily that might be enough to drop Python 3.1? 
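For what it's worth, a cut-off like this is usually enforced with a trivial interpreter check at install or import time. A sketch, under the assumption the policy becomes "2.x as currently supported, else Python 3.2+" - the constant and function name here are made up for illustration:

```python
import sys

MIN_PY3 = (3, 2)  # hypothetical minimum Python 3 under the proposed policy

def python3_supported(version_info=sys.version_info):
    """Return False for Python 3 interpreters older than MIN_PY3.

    Python 2.x is deliberately left alone here; it is covered by the
    existing 2.x support checks.
    """
    if version_info[0] == 3 and tuple(version_info[:2]) < MIN_PY3:
        return False
    return True
```

Something like this in setup.py would turn "unsupported Python 3.1" into an explicit message rather than an obscure test failure.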
My impression is that Python 3.0 is dead, and the only sizeable group stuck with Python 3.1 will be those on Ubuntu lucid (LTS is supported through 2013 on desktops and 2015 on servers), but as with life under Python 2.x it is fairly straightforward to have a local/additional Python without disturbing the system installation. On a related note, TravisCI currently still supports Python 3.1 unofficially (we're not using this with Biopython but I've tried it with other projects), but this will be dropped soon - once they have Python 3.3 working. Since we don't yet officially support Python 3 (but we probably should soon) we have the flexibility to recommend either Python 3.2 or 3.3 as a baseline. Peter From redmine at redmine.open-bio.org Sun Dec 9 04:11:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 04:11:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Michał Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie:

from Bio import trie
import gzip
f = gzip.open('/tmp/trie.dat.gz', 'w')
tr = trie.trie()
# fill in the trie
trie.save(f, trie)

Now /tmp/trie.dat.gz is about 50MB.
Let's try to read it:

from Bio import trie
import gzip
f = gzip.open('/tmp/trie.dat.gz', 'r')
tr = trie.load(f)

Unfortunately I'm getting a meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Sun Dec 9 09:53:30 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 09:53:30 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. That just means that the bug is in the save() and not the load() function. But of course I will provide a data file, although I can't guarantee it will be minimal.

From redmine at redmine.open-bio.org Sun Dec 9 12:13:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 12:13:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file.

From redmine at redmine.open-bio.org Mon Dec 10 17:39:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Dec 2012 17:39:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka.
File minimal_data.pkl added This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Tue Dec 11 05:32:02 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 05:32:02 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. Hi Michal, Unfortunately I cannot load your minimal_data.pkl file. At list = pickle.load(f) I get ImportError: No module named django.db.models.query Can you check which item in list is actually causing the problem?
Just reduce the list until you find the item that is causing the trie.load(f) to fail.

From MatatTHC at gmx.de Tue Dec 11 08:11:48 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 11 Dec 2012 09:11:48 +0100 Subject: [Biopython-dev] genetic code Message-ID: Dear biopython developers, there is a new genetic code table (24) in the NCBI resources (see NC_015649). Maybe you can update this with the next release. Would it be an idea to distribute the genetic code file from NCBI with Biopython and create the code tables on import or during installation? Then Biopython would be automatically up-to-date. Regards, Matthias

From redmine at redmine.open-bio.org Tue Dec 11 09:15:22 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 09:15:22 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Hello, As I said, this is a minimal test case. That means there is no single key that causes a problem.
If you remove any of the items from the list it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems with running it I can provide the model classes as well.

From arklenna at gmail.com Tue Dec 11 16:00:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 11 Dec 2012 11:00:33 -0500 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: Hi Matthias, In a similar case, we have a file in the Scripts/ directory to download and parse the file. The generated file (and not the source file) is committed, but the script is available in the source for end users who wish to update it: https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py I think a similar situation would be appropriate here. Does Biopython currently include alternate codon tables? Cheers, Lenna On Tuesday, December 11, 2012, Matthias Bernt wrote: > Dear biopython developers, > > there is a new genetic code table (24) in the NCBI resources (see > NC_015649).
Maybe you can update this with the next release. > > Would it be an idea to distribute the genetic code file from ncbi with > biopython and create the code tables on import or during installation? Then > biopython would be automatically up-to-date. > > Regards, > Matthias > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Dec 11 18:42:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Dec 2012 18:42:13 +0000 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: On Tuesday, December 11, 2012, Lenna Peterson wrote: > Hi Matthias, > > In a similar case, we have a file in the Scripts/ directory to download and > parse the file. The generated file (and not the source file) is committed, > but the script is available in the source for end users who wish to update > it: > > > https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py > > I think a similar situation would be appropriate here. Does Biopython > currently include alternate codon tables? > > Cheers, > > Lenna Yes, see https://github.com/biopython/biopython/blob/master/Bio/Data/CodonTable.py and the parser therein. On Tuesday, December 11, 2012, Matthias Bernt wrote: > > > Dear biopython developers, > > > > there is a new genetic code table (24) in the NCBI resources (see > > NC_015649). Maybe you can update this with the next release. That seems like a Good idea :) > > Would it be an idea to distribute the genetic code file from ncbi with > > biopython and create the code tables on import or during installation? > Then > > biopython would be automatically up-to-date. > > > > Regards, > > Matthias > That would just make installation more complex (and it is already complicated). I would prefer to keep setup.py as normal as possible.
The NCBI tables rarely change, so this works OK overall. Peter

From redmine at redmine.open-bio.org Wed Dec 12 04:16:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 04:16:27 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django.

From redmine at redmine.open-bio.org Wed Dec 12 07:56:52 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 07:56:52 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Sure, today I'll strip all django dependencies and resubmit the data set and loading code.
From redmine at redmine.open-bio.org Wed Dec 12 10:04:28 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 10:04:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. File minimal_data.pkl added Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Wed Dec 12 12:29:19 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 12:29:19 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. The problem was indeed that one of the chunks had a size of 2000. I've uploaded a fix to github; could you please give it a try? See https://github.com/biopython/biopython/commit/6e09a4a67b7dec1910b13e3d730e3a1f5c2261c9 In particular, please make sure that new_trie is identical to trie.

From redmine at redmine.open-bio.org Wed Dec 12 21:44:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 21:44:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3400] (New) Hmmer3-text parser crashes when parsing hmmscan --cut_tc files Message-ID: Issue #3400 has been reported by Kai Blin. ---------------------------------------- Bug #3400: Hmmer3-text parser crashes when parsing hmmscan --cut_tc files https://redmine.open-bio.org/issues/3400 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: I'm currently struggling with a crash in the hmmer3-text parser when dealing with files generated by hmmscan --cut_tc. I'm not quite sure what happens yet, but I have the feeling that some part of the hit parsing logic is reading into the next query without yielding a result. The backtrace is
Traceback (most recent call last):
  File "t.py", line 4, in <module>
    i = it.next()
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 317, in parse
    yield qresult
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/data/uni/biopython/Bio/File.py", line 84, in as_handle
    yield fp
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 133, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 176, in _parse_hit
    hit_list = self._create_hits(hit_attr_list, qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 239, in _create_hits
    hit_attr = hit_attrs.pop(0)
IndexError: pop from empty list
Line numbers might be a bit off as I added debug output to understand what's happening already. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin From bow at bow.web.id Thu Dec 13 04:15:01 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 05:15:01 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, Thanks for the report. AB-BLAST wasn't included in the BLAST XML parser's test suite so I'm glad you spotted this :). You're proposing a bug fix, so yes, this should be included in our code. You could submit a pull request on our github page: https://github.com/biopython/biopython/pulls, or I can submit it on your behalf if you prefer not to submit it yourself. If you're not familiar with GitHub, we have a quick guide on how to use it to develop Biopython here: http://biopython.org/wiki/GitUsage. GitHub's help on how to submit pull requests is a useful read too: https://help.github.com/articles/using-pull-requests Along with the patch, a unit test on the AB-BLAST output would also be very welcome. As for the actual regex change, I was wondering, is that the only possible pattern of the BlastOutput_version tag in AB-BLAST? Do you have examples of any other version output from AB-BLAST? cheers, Bow P.S. CC-ed to the Biopython-dev mailing list On Thu, Dec 13, 2012 at 4:41 AM, Colin Archer wrote: > Hi Bow, > I have been using your implementation of the biopython BLAST > output parser but for AB-BLAST input and it has been working OK so far, > although I haven't thoroughly had a look at the speed yet.
I initially found > that the version tag (BlastOutput_version) for AB-BLAST results were slightly > different from NCBI BLAST and changed the regex you implemented to cover > both versions. The difference between them was: > > BLASTN 2.2.27+ > 3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 > 2009-11-17T18:52:53] > > > and the regex I ended up using was: > r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?' > > and here is the tested output: >>>> _RE_VERSION1 = re.compile(r'\d+\.\d+\.\d+\+?') >>>> _RE_VERSION2 = re.compile(r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?') >>>> version1 > 'BLASTN 2.2.27+' >>>> version2 > '3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 2009-11-17T18:52:53]' >>>> re.search(_RE_VERSION1, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION2, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION1, version2).group(0) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'NoneType' object has no attribute 'group' >>>> re.search(_RE_VERSION2, version2).group(0) > '3.0PE-AB' > > Would there be any chance of including this in a future release of > BioPython? > > Thanks > Colin > > From bow at bow.web.id Thu Dec 13 16:14:27 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 17:14:27 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, > From what I have seen, the version value is formatted > differently based on the edition of AB-BLAST being used: personal, > commercial etc. As I only use the personal edition, I'm not sure if the > other versions are different but I imagine that they conform to the same > format, with the version followed by the edition (for example, 3.0PE-AB for > personal edition). The regex I sent you will keep the edition so I imagine > it will work on other versions of AB-BLAST as long as the edition is > represented by "words-words" Ok then. The regex looks good.
You can probably make it more reader-friendly by separating the regex for NCBI and AB BLAST (e.g. r'(?:ncbi_blast_regex)|(?:ab_blast_regex)'). But even without this, it seems to work ok. > I'll submit a pull request as well and submit the revised regex. If you are > interested, there are a couple of other differences in the XML output between > AB-BLAST and NCBI-BLAST. I can send you an example output if you would like > to have a look at it. Presently, SearchIO can't parse AB-BLAST XML output > for multiple queries as the AB-BLAST output is just a concatenation of > multiple single queries. Each query contains the section > at the beginning and causes ElementTree to error during iteration. To get > around this I have been piping the AB-BLAST output and parsing it into a > more NCBI-BLAST form. Hmm... it is a problem if AB-BLAST concatenates outputs like that. It makes the XML invalid, though, so I'm not sure if we should change the parser to tolerate this. What are the other differences? As for the example files, they would indeed be useful for unit testing (as long as they're not that big ~ less than 50K?). You can send them to me. If you're feeling it, you can also write your own unit tests using them :). Looking forward to the pull request :), Bow From p.j.a.cock at googlemail.com Thu Dec 13 17:09:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:09:59 +0000 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:14 PM, Wibowo Arindrarto wrote: >> Presently, SearchIO can't parse AB-BLAST XML output >> for multiple queries as the AB-BLAST output is just a concatenation of >> multiple single queries. Each query contains the section >> at the beginning and causes ElementTree to error during iteration. To get >> around this I have been piping the AB-BLAST output and parsing it into a >> more NCBI-BLAST form.
> > Hmm... it is a problem if AB-BLAST concatenates outputs like that. It > makes the XML invalid, though, so I'm not sure if we should change > the parser to tolerate this. What are the other differences? The older NCBI BLAST tools had this bug as well - and as a result our NCBIXML has a hack to cope with it. It might be worth applying the same kind of fix to the SearchIO BLAST XML parser as well if it would help with both AB-BLAST and any older NCBI XML files. Peter From lucas.sinclair at me.com Thu Dec 13 16:29:19 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Thu, 13 Dec 2012 17:29:19 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator Message-ID: Hi ! I'm working a lot with fasta files. They can be large (>50GB) and contain lots of sequences (>40,000,000). Often I need to get one sequence from the file. With a flat FASTA file this requires parsing, on average, half of the file before finding it. I would like to write something that solves this problem, and rather than making a new repository, I thought I could contribute to biopython. As I just wrote, the iterator nature of parsing sequence files has its limits. I was thinking of something that is indexed. And not some hack like I see sometimes where a second ".fai" file is added next to the ".fa" file. The natural thing to do is to put these entries in a SQLite file. The appraisal of such solutions is well made here: http://defindit.com/readme_files/sqlite_for_data.html Now I looked into the biopython source code, and it seems everything is based on returning a generator object which essentially has only one method: next() giving SeqRecords. For what I want to do, I would also need the get(id) method. Plus any other methods that could now be added to query the DB in a useful fashion (e.g. SELECT entry where length > 5).
I see there is a class called InterlacedSequenceIterator(SequenceIterator) that contains a __getitem__(i) method, but it's unclear how I should go about implementing that. Any help/example on how to add such a format to SeqIO ? Thanks ! Lucas Sinclair, PhD student Ecology and Genetics Uppsala University From p.j.a.cock at googlemail.com Thu Dec 13 17:40:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:40:46 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: > Hi ! > > I'm working a lot with fasta files. They can be large (>50GB) and contain > lots of sequences (>40,000,000). Often I need to get one sequence from the > file. With a flat FASTA file this requires parsing, on average, half of the > file before finding it. I would like to write something that solves this > problem, and rather than making a new repository, I thought I could > contribute to biopython. > > As I just wrote, the iterator nature of parsing sequence files has its > limits. I was thinking of something that is indexed. And not some hack like > I see sometimes where a second ".fai" file is added next to the ".fa" file. > The natural thing to do is to put these entries in a SQLite file. The > appraisal of such solutions is well made here: > http://defindit.com/readme_files/sqlite_for_data.html > > Now I looked into the biopython source code, and it seems everything is > based on returning a generator object which essentially has only one method: > next() giving SeqRecords. For what I want to do, I would also need the > get(id) method. Plus any other methods that could now be added to query the > DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is > a class called InterlacedSequenceIterator(SequenceIterator) that contains a > __getitem__(i) method, but it's unclear how I should go about > implementing that.
Any help/example on how to add such a format to SeqIO ? > > Thanks ! Have you looked at Bio.SeqIO.index (index held in memory) and Bio.SeqIO.index_db (index held in an SQLite3 database), and do they solve your needs? Note these only index the location of records - unlike tabix/fai indexes which also look at the line length to be able to pull out subsequences. This means the Bio.SeqIO indexing isn't ideal for dealing with large records where you are only interested in small subsequences. Peter From p.j.a.cock at googlemail.com Thu Dec 13 17:51:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:51:40 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> >> I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. >> Hmm - I think that entire class is obsolete and could be removed. Peter From p.j.a.cock at googlemail.com Thu Dec 13 18:54:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 18:54:04 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:51 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: >> On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >>> >>> I see there is >>> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >>> __getitem__(i) method, but it's unclear how I should go about >>> implementing that. >>> > > Hmm - I think that entire class is obsolete and could be removed. I've marked it as deprecated, but since it doesn't really have any executable code a deprecation warning doesn't seem relevant. We can probably remove this after the next release.
https://github.com/biopython/biopython/commit/316c42aad05b9de3d3b3004ec295670691ae1804 Thanks for flagging up this bit of the code, Lucas. Going further, the SequenceIterator isn't used either, and perhaps could be dropped too? We do use the similar class in AlignIO... Regards, Peter From ben at benfulton.net Fri Dec 14 02:25:47 2012 From: ben at benfulton.net (Ben Fulton) Date: Thu, 13 Dec 2012 21:25:47 -0500 Subject: [Biopython-dev] Code coverage reporting Message-ID: On my Biopython fork, I've extended the test run on Travis to create and upload a code coverage report to GitHub. I'd like to submit a pull request to put this in the main code base, but in order to do so, I need a token generated to allow uploading the file to the biopython GitHub account. Can someone work with me on that? You can view the coverage report at http://cloud.github.com/downloads/benfulton/biopython/coverage.txt Thanks! Ben Fulton From p.j.a.cock at googlemail.com Fri Dec 14 10:58:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Dec 2012 10:58:49 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Fri, Dec 14, 2012 at 10:07 AM, Lucas Sinclair wrote: > Hello, > > Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes > an index, but it is held in memory. So it must be recomputed every > time the interpreter is reloaded. Yes, that is right. > This step is wasting enough time for me that I would like to compute > the index on my 50GB file once, and then be done with it. SQLite > really is the technology of choice for such a problem... Yes, which is why Bio.SeqIO.index_db() stores the index in SQLite. The SeqIO chapter in the Tutorial does try to explain this and the advantages compared to Bio.SeqIO.index(). Have you tried this yet? > I suppose you agree storing all this sequence information in flat > ascii files is not practical.
It may not be optimal, but it is very practical (although at the scale of next generation sequencing data less so). Peter From lucas.sinclair at me.com Fri Dec 14 10:07:55 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Fri, 14 Dec 2012 11:07:55 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: Hello, Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes an index, but it is held in memory. So it must be recomputed every time the interpreter is reloaded. This step is wasting enough time for me that I would like to compute the index on my 50GB file once, and then be done with it. SQLite really is the technology of choice for such a problem... I suppose you agree storing all this sequence information in flat ascii files is not practical. Actually, I found a reasonable workaround for achieving this result with these two commands: $ formatdb -i reads -p T -o T -n reads $ blastdbcmd -db reads -dbtype prot -entry "105107064179" -outfmt %f -out test.fasta But then I need to have calls to subprocess... Since I thought my first small contribution to Biopython was fun to do (https://github.com/biopython/biopython/commit/1c72a63b35db70d11c628b83a0269d1a9c6443a4), I may still feel like writing a proper solution. Would such a thing be a welcome addition to Bio.SeqIO ? If so, where would I place it ? The schema would be a SQLite file with a single table named "sequences". This table would have columns corresponding to the attributes of a SeqRecord. But you would need to get a different type of object back when calling parse than a generator, you would need an object that has a __getitem__ method. Sincerely, Lucas Sinclair, PhD student Ecology and Genetics Uppsala University On 13 déc. 2012, at 18:40, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> Hi ! >> >> I'm working a lot with fasta files. They can be large (>50GB) and contain >> lots of sequences (>40,000,000).
Often I need to get one sequence from the >> file. With a flat FASTA file this requires parsing, on average, half of the >> file before finding it. I would like to write something that solves this >> problem, and rather than making a new repository, I thought I could >> contribute to biopython. >> >> As I just wrote, the iterator nature of parsing sequence files has its >> limits. I was thinking of something that is indexed. And not some hack like >> I see sometimes where a second ".fai" file is added next to the ".fa" file. >> The natural thing to do is to put these entries in a SQLite file. The >> appraisal of such solutions is well made here: >> http://defindit.com/readme_files/sqlite_for_data.html >> >> Now I looked into the biopython source code, and it seems everything is >> based on returning a generator object which essentially has only one method: >> next() giving SeqRecords. For what I want to do, I would also need the >> get(id) method. Plus any other methods that could now be added to query the >> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. Any help/example on how to add such a format to SeqIO ? >> >> Thanks ! > > Have you looked at Bio.SeqIO.index (index held in memory) and > Bio.SeqIO.index_db (index held in an SQLite3 database), and do > they solve your needs? > > Note these only index the location of records - unlike tabix/fai indexes > which also look at the line length to be able to pull out subsequences. > This means the Bio.SeqIO indexing isn't ideal for dealing with large > records where you are only interested in small subsequences.
> > Peter From w.arindrarto at gmail.com Fri Dec 14 12:48:12 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 14 Dec 2012 13:48:12 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: Hi everyone, >> It's reproducible in my machine: Arch Linux 64 bit running >> Python3.1.5. Haven't figured out a fix yet, but trying to see if I >> can. > > Great. We haven't really proved this is down to a change in > either Python 3.1.4 or 3.1.5 but it does look likely. It's reproduced in my local 3.1.4 installation. Seems like an unfixed bug that went through to 3.1.5. >> By the way, I was wondering, what's our deprecation policy for >> Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't >> seem to be any major updates coming soon. How long should we keep >> supporting Python <3.2? > > As long as it doesn't cost us much effort? If we can't solve this > issue easily that might be enough to drop Python 3.1? Fixing this seems difficult (has anyone else tried a fix?). The _io module is built-in and compiled when Python is installed, so fixing it (I imagine) may require tweaking the C-code (which requires fiddling with the actual Python installation). > My impression is that Python 3.0 is dead, and the only sizeable > group stuck with Python 3.1 will those on Ubuntu lucid (LTS is > supported through 2013 on desktops and 2015 on servers), > but as with life under Python 2.x it is fairly straightforward > to have a local/additional Python without disturbing the system > installation. > > > Since we don't yet officially support Python 3 (but we probably > should soon) we have the flexibility to recommend > either Python 3.2 or 3.3 as a baseline. Yes. I think it may be easier and better for us to officially start supporting from Python3.2 or 3.3 onwards. 
regards, Bow From christian at brueffer.de Mon Dec 17 11:05:04 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 17 Dec 2012 19:05:04 +0800 Subject: [Biopython-dev] Biopython AlignAce Wrapper In-Reply-To: References: <50CAC1C2.9090705@brueffer.de> <50CEE193.2010003@brueffer.de> Message-ID: <50CEFC60.8020400@brueffer.de> (CC'ing biopython-dev) Thanks for the feedback. I'd propose the following plan for the AlignAce wrapper then: 1. Submit the cleanup patches I have to give the wrapper at least a fighting chance at actually working 2. Add a BiopythonDeprecationWarning 3. Remove the wrapper after 1.61 is released (unless the situation changes, of course) Does that sound acceptable? Chris On 12/17/2012 05:25 PM, Bartek Wilczynski wrote: > Well, > > sounds like a good plan. I think the situation is hopeless: If we had > the source of AlignAce with appropriate license we could think of > supporting it ourselves, but in this situation I guess we can only > deprecate the module and phase it out... > > best > Bartek > > On Mon, Dec 17, 2012 at 10:10 AM, Christian Brueffer > wrote: >> Hi Bartek, >> >> thanks for checking. The thing is, the "new" version is actually an >> ancient version: >> >> AlignACE version 2.3 October 27, 1998 >> >> I made it work by installing Fedora Core 3 in a VM and using >> elfstatifier to bind AlignAce and all libraries into one executable. >> It works, but I doubt it's of any use these days. >> >> I wonder whether it's better to remove the wrapper. The AlignAce >> developers are unresponsive, none of the Biopython people has a >> version and from what I can see the current wrapper cannot possibly >> work. >> >> What do you think? >> >> Chris >> >> >> On 12/17/2012 05:01 PM, Bartek Wilczynski wrote: >>> >>> Hi, >>> >>> I've looked around and it seems I don't have it. We probably need to >>> "update" the parser to work with the current version of AlignACE >>> available from Harvard. Were you able to run it?
On my system, it >>> cannot find the libraries it needs... >>> >>> best >>> Bartek >>> >>> On Fri, Dec 14, 2012 at 7:05 AM, Christian Brueffer >>> wrote: >>>> >>>> Hi Bartek, >>>> >>>> I am currently cleaning up the Biopython AlignAce wrapper. Unfortunately >>>> I've been unable to obtain the latest AlignAce version since the >>>> download page disappeared and the Church lab is unresponsive. >>>> >>>> Do you happen to have a version of AlignAce 4.0 for Linux lying around, >>>> that you could send me? >>>> >>>> Thanks a lot, >>>> >>>> Chris >>> From redmine at redmine.open-bio.org Mon Dec 17 13:49:33 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 17 Dec 2012 13:49:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3401] (New) is_terminal bug in newick trees Message-ID: Issue #3401 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3401: is_terminal bug in newick trees https://redmine.open-bio.org/issues/3401 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: Consider this weird Newick tree (((B,C),D))A; Here 'A' is both a root node and a terminal node (since it has only one child: ((B,C),D);). However, is_terminal for 'A' is False:
from Bio import Phylo
import cStringIO

bad_tree = '(((B,C),D))A'

t = Phylo.read(cStringIO.StringIO(bad_tree), 'newick')

for c in t.find_clades(terminal=True):
    print c,
Gives: B C D ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Tue Dec 18 12:40:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 18 Dec 2012 13:40:35 +0100 Subject: [Biopython-dev] Location Parser Message-ID: Dear list, I have some problems with the GenBank parser in version 1.60. It's again nested location strings like: order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) as found in NC_003048. What happens is that the parser stalls. It seems as if it takes forever matching _re_complex_compound and never gets to the if statement that checks if order and join appear in the location string. I suggest moving the if statement before the regular expressions are tested. I remember that I posted something like this before. But I cannot remember how and if this was solved. Regards, Matthias From k.d.murray.91 at gmail.com Tue Dec 18 13:46:06 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Wed, 19 Dec 2012 00:46:06 +1100 Subject: [Biopython-dev] [biopython] TAIR (Arabidopsis) sequence retrieval module (#132) In-Reply-To: References: Message-ID: Hi Peter, Chris and the mailing list, Thanks very much for the feedback! > Query: It isn't clear to me (from a first read) what MultipartPostHandler is needed for. The arabidopsis.org server form requires the content-type to be a multipart form, not a urlencoded form, which the standard urllib2 does not handle.
I could write a custom handler, however when writing the module I found MultipartPostHandler, and figured I should use that. I may be wrong, but couldn't figure out any other way of doing it. >Minor: The module's docstring should start with a one line summary then a blank line (see PEP8 style guide). >Note: Since your unit test requires internet access, it should include these lines to work nicely in our testing framework (which allows the tests needing network access to be skipped) I'll fix the module docstring and requires_internet check tomorrow. >Why does the NCBI code exist given it is such a thin wrapper round the Bio.Entrez code - the module would be a lot simpler if it was just a wrapper for www.arabidopsis.org alone. The NCBI functions exist to get genbank files for AGIs, as TAIR's sequence retrieval only gives fasta files, so if users need/want the extra metadata a genbank file gives, they can use this module. As you've said, this is a *very* thin wrapper, so would it be better to just provide the mapping dicts in Bio.TAIR._ncbi for people to use however they see fit? >Query: Why do your methods return SeqRecord objects? Is this because the handle might return FASTA with a non-FASTA header which must be stripped off? SeqRecord handles were returned for two reasons, the first being as you said that the raw return text is not always a valid fasta file, despite my efforts to trim extraneous text. The latter is simply that it is what I required when writing it, and I could not think of a better way of returning it. (and I thought that the return of a SeqRecord allowed "pythonic" processing of results, a la the test suite). Again, happy for any suggestions. >Why do classes TAIRDirect and TAIRNCBI exist? Wouldn't module level functions be simpler (or at least, consistent with other modules like Bio.Entrez) >Style: Why introduce the mode argument and two magic values NCBI_RNA and NCBI_PROTEIN? The honest answer to both of these is personal choice.
If consistency is an issue I will reimplement as module-level functions and textual arguments respectively. Regarding the placement of modules, I'm happy for it to go wherever. I would imagine that there are other niche web interface "getters" such as this, and think your suggestion sounds great, although I can't think what we could call it. Perhaps Bio.Web.TAIR? Regards Kevin Murray On 18 December 2012 10:34, Peter Cock wrote: > Hi Kevin, > > Thanks for your code submission. I've not had a chance to play with it, > but I do have some comments/queries - some of which are perhaps just style > issues. > > Note: Since your unit test requires internet access, it should include > these lines to work nicely in our testing framework (which allows the tests > needing network access to be skipped): > > import requires_internet > requires_internet.check() > > Query: It isn't clear to me (from a first read) what MultipartPostHandler > is needed for. > > Minor: The module's docstring should start with a one line summary then a > blank line (see PEP8 style guide). > > Query: Why do classes TAIRDirect and TAIRNCBI exist? Wouldn't module > level functions be simpler (or at least, consistent with other modules like > Bio.Entrez)? > > Query: Why do your methods return SeqRecord objects? Is this because the > handle might return FASTA with a non-FASTA header which must be stripped > off? > > Style: Why introduce the mode argument and two magic values NCBI_RNA and > NCBI_PROTEIN? > > In fact I would go further and ask why does the NCBI code exist given it > is such a thin wrapper round the Bio.Entrez code - the module would be a > lot simpler if it was just a wrapper for www.arabidopsis.org alone.
> > I'm also not sure about the namespace Bio.TAIR, the old Bio.www namespace > might have been better but that was deprecated a while back, and the other > semi-natural fit under Biopython's old OBDA effort is also defunct > (attempting to catalogue a collection of sequence resources, see > http://obda.open-bio.org for background if curious). The namespace issue > at least would be worth bringing up on the dev mailing list... especially > if you can think of many other examples like this for specialised resources. > > Regards, > > Peter > > > Reply to this email directly or view it on GitHub. > > From kjwu at ucsd.edu Wed Dec 19 04:25:35 2012 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 18 Dec 2012 20:25:35 -0800 Subject: [Biopython-dev] KEGG API Wrapper In-Reply-To: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi All, Sorry for the delay in updating this KEGG code. Michiel, I've addressed your suggestions regarding the querying code and the documentation and have committed changes that reflect this. ( https://github.com/kevinwuhoo/biopython/) There's a namespace collision created by the KEGG.list function, so I use KEGG.list_ instead. However, I'm sure there's a more elegant solution than this. Regarding the parsers, there should be a way to unify all parsers and writers for KEGG objects as they list fields for all their objects here: http://www.kegg.jp/kegg/rest/dbentry.html. Each class should extend from a parent while specifying their valid fields. Parsing all files should be generalized, but there should be field-specific code to handle the different fields so that fields like genes are handled correctly and ubiquitously. After solidifying discussion on these, I'll move the tests over to unittest too. Thanks! Kevin On Thu, Oct 25, 2012 at 7:52 PM, Michiel de Hoon wrote: > Hi Kevin, > > Thanks for the documentation! That makes everything a lot clearer.
Overall I like the querying code and I think we should add it to Biopython. > > I have a bunch of comments on the KEGG module, some on the existing code > and some on the new querying code, see below. Most of these are trivial; > some may need some further discussion. Perhaps you could let us know which > of these comments you can address, and which ones you want to skip for now? > > Once we have converged with regard to the querying code and the documentation, > I think we can import your version of the KEGG module into the main > Biopython repository and add your chapter on KEGG to the main > documentation, and continue from there on the parsers and the unit tests. > > Many thanks! > -Michiel. > > > About the querying code: > ---------------------------------- > > I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv", > KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions > KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get. > > For list, find, conv, link, and info, instead of going through > KEGG.generic_parser, I would return the result directly as a Python list. > In contrast, KEGG.get should return the handle to the results, not the > data itself. So the _q function, instead of > ... > resp = urllib2.urlopen(req) > data = resp.read() > return query_url, data > have > ... > resp = urllib2.urlopen(req) > return resp > Then the user can decide whether to parse the data on the fly with > Bio.KEGG, or read the data line by line and pick up what they are > interested in, or to get all data from the handle and save it in a file. > Note that resp will have a .url attribute that contains the url, so you > won't need the ret_url keyword. > > About the parsers: > ------------------------ > > I think that we should drop generic_parser. For link, find, conv, link, > and info, parsing is trivial and can be done by the respective functions > directly.
For get, we already have an appropriate parser for some databases > (compound, map, and enzyme), but it's easy to add parsers for the other > databases. > > For all parsers in Biopython, there is the question whether the record > should store information in attributes (as is currently done in Bio.KEGG), > or alternatively if the record should inherit from a dictionary and store > information in keys in the dictionary. Personally I have a preference for a > dictionary, since that allows us to use the exact same keys in the > dictionary as is used in the file (e.g., we can use "CLASS" as a key, while > we cannot use .class as an attribute since it is a reserved word, so we use > .classname instead). But other Biopython developers may not agree with me, > and to some extent it depends on personal preference. > > The parsers miss some key words. The ones I noticed are ALL_REAC, > REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit > tests. > > Remove the ';' at the end of each term in record.classname. > > Convert record.genes to a dictionary for each organism. So instead of > [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON', > ['100190836', '100438793']), ('MCC', ['100424648', '699401']... > have > {'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON': > ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ... > > Also for record.dblinks, record.disease, record.structures, use a > dictionary. > > In record.pathway, all entries start with 'PATH'. Perhaps we should check > with KEGG if there could be anything else than 'PATH' there, otherwise I > don't see the reason why it's there. Assuming that there could be something > different there, I would also use a dictionary with 'PATH' as the key. > > In record.reaction, some chemical names can be very long and extend over > multiple lines. In such cases, the continuation line starts with a '$'. The > parser should remove the '$' and join the two lines. 
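As an aside, the record.genes conversion suggested above is a one-liner, since dict() accepts any iterable of key/value pairs. A sketch using the two complete entries quoted in the message (variable names are invented for illustration):

```python
# Current representation: a list of (organism, gene_ids) tuples,
# using the first two complete entries quoted above.
genes_as_list = [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162'])]

# Suggested representation: a dictionary keyed by organism code.
genes_as_dict = dict(genes_as_list)

assert genes_as_dict['HSA'] == ['5236', '55276']
assert genes_as_dict['PTR'] == ['456908', '461162']
```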
> > About the tests: > -------------------- > > We should update the data files in Tests/KEGG. This will fix some "bugs" > in these data files. > > We should switch test_KEGG.py to the unit test framework. > > We should do some more extensive testing to make sure we are not missing > some key words. > > About the documentation: > --------------------------------- > It's great that we now have some documentation. > > On page 233, I would suggest to replace the "id_" by "accession" or > something else, since the underscore in "id_" may look funky to new users. > > Also it may be better not to reuse variable names (e.g. "pathway" is used > in three different ways in the example). It's OK of course in general, but > for this example it may be more clear to distinguish the different usages > of this variable from each other. > > For repair_genes, you can use a set instead of a list throughout. > > > > > --- On *Wed, 10/24/12, Kevin Wu * wrote: > > > From: Kevin Wu > Subject: Re: [Biopython-dev] KEGG API Wrapper > To: "Peter Cock" , "Zachary Charlop-Powers" < > zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" > Cc: Biopython-dev at lists.open-bio.org > Date: Wednesday, October 24, 2012, 6:38 PM > > > Hi All, > > Thanks for the comments, I've written a bit of documentation on the entire > KEGG module and have attached those relevant pages to the email. There > didn't seem like an appropriate place for examples, so I just added a new > chapter. I've also committed the updated file to GitHub. > > I did leave out the parsers due to the fact that the current parsers only > cover a small portion of possible responses from the API. Also, I'm not > confident that some of the parsers correctly retrieve all the fields. > However, I've written a really general parser that does a rough job of > retrieving fields if it's a database format returned since I find myself > reusing the code for all database formats.
It's possible to modify this to > correctly account for the different fields, but it would probably take a bit > of work to manually figure each field out. Otherwise it also parses the > tsv/flat file returned. > > Also, @zach, thanks for checking it out and testing it! > > Thanks All! > Kevin > > On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock > > wrote: > > On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers > > > wrote: > > Kevin, > > Michiel, > > > > I just tested Kevin's code for a few simple queries and it worked great. > I > > have always liked KEGG's organization of data and really appreciate this > > RESTful interface to their data; in some ways I think it is easier to use > the > > web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of > > metabolic networks is awesome. I found the examples in Kevin's test > script > > to be fairly self-explanatory, but a simple spelled-out example in the > > Tutorial would be nice. > > > > One thought, though, is that you can retrieve MANY different types of > data > > from the KEGG REST API - which means that the user will probably have to > > parse the data his/herself. Data retrieved with "list" can return lists > of > > genes or compounds or organisms, and after a cursory look these are each > > formatted differently. Also true with the 'find' command. So I think you > > were right to leave out parsers because I think they will be a moving > target > > highly dependent on the query. > > > > Thank You Kevin, > > zach cp > > Good point about decoupling the web API wrapper and the parsers - > how the Bio.Entrez module and Bio.TogoWS handle this is to return > handles for web results, which you can then parse with an appropriate > parser (e.g. SeqIO for GenBank files, Medline parser, etc). > > Note that this is a little more fiddly under Python 3 due to the text > mode distinction between unicode and binary... just something to > keep in the back of your mind. 
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From gokcen.eraslan at gmail.com Fri Dec 21 00:12:43 2012 From: gokcen.eraslan at gmail.com (Gökçen Eraslan) Date: Fri, 21 Dec 2012 01:12:43 +0100 Subject: [Biopython-dev] numpy/matlab style index arrays for Seq objects Message-ID: <50D3A97B.60108@gmail.com> Hello, During the development of a project, I have come across an issue that I want to share. As far as I know, the Bio.Seq.Seq object can only be indexed using an int or a slice object, just as regular strings: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) However, it would be really nice to be able to index Seq objects using index arrays as in numpy.array, like >>> my_indices = [0, 3, 7] >>> my_seq[my_indices] Seq('GCG', IUPACUnambiguousDNA()) (Since I'm not really familiar with the Biopython API and codebase, please ignore/forgive me if such a thing already exists.) For example in my project, I'm trying to eliminate noisy columns of an MSA FASTA file. Let's assume that I have a list of non-noisy column indices; then this would solve my problem: In [1]: from Bio import AlignIO In [2]: msa = AlignIO.read("s001.fasta", "fasta") In [3]: print msa[:, [0, 3, 4]] SingleLetterAlphabet() alignment with 5 rows and 3 columns KPG sp2 TPG sp11 SPG sp7 KPP sp6 SPG sp10 I have attached a tiny patch (~4 lines) implementing this stuff. At first, I thought of keeping the sequence string as numpy.array(list()) to be able to use the indexing mechanism of numpy, but that would be over-engineering, so I have just used a simple list comprehension trick. Regards. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: biopython-index-array-for-seq.diff Type: text/x-patch Size: 3845 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 13:09:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 13:09:47 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > Dear list, > > I have some problems with the GenBank parser in version 1.60. It's again > nested location strings like: > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > as found in NC_003048. Do you have a URL for that? This looks OK to me: http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 Perhaps the entry came from the FTP site? e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > What happens is that the parser stalls. It seems as if it takes forever to > parse _re_complex_compound and never gets to the if statement that > checks if order and join appear in the location string. > > I suggest moving the if statement before the regular expressions are > tested. > > I remember that I posted something like this before. But I cannot remember > how and whether this was solved. > > Regards, > Matthias Where similar odd locations have come up, in some cases they did seem to be NCBI bugs - could you raise a query with the NCBI for this case please? If this is valid (which I doubt), then our object model doesn't cope. If this is invalid, then Biopython should give a warning and skip this location. Right now I can't find the file to test this (see query above about where it came from). 
Regards, Peter From MatatTHC at gmx.de Fri Dec 21 15:18:45 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Fri, 21 Dec 2012 16:18:45 +0100 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: Dear Peter, you are right the current RefSeq record is valid and can be parsed. In order to reproduce old results I keep old refseq versions (of mitochondrial genomes) on hard disk. So probably this is an old refseq bug. According to the documentation ( http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.4): """ Note : location operator "complement" can be used in combination with either " join" or "order" within the same location; combinations of "join" and "order" within the same location (nested operators) are illegal. """ Since this was urgent I fixed the files manually by removing the nested files. I was not able to find a file in other RefSeq versions that can reproduce the bug (i.e. the parser seemingly takes forever [>5min] and does not raise an exception). You may still reproduce the bug by pasting the location line in another GenBank file. I agree that the desired behaviour would be a warning and skip of the feature. Regards, Matthias 2012/12/21 Peter Cock > On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > > Dear list, > > > > I have some problems with the GenBank parser in version 1.60. Its again > > nested location strings like: > > > > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > > as found in NC_003048. > > Do you have a URL for that? This looks OK to me: > http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 > > Perhaps the entry came from the FTP site? > e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > > > What happens is that the parser stalls. 
It seems as if it takes forever > to > > parse _re_complex_compound in and never gets to the if statement that > > checks if order and join appears in the location string. > > > > I suggest to move the if statement before the regular expressions are > > tested. > > > > I remember that I posted something like this before. But I can not > remember > > how and if this was solved. > > > > Regards, > > Matthaas > > Were similar odd locations have come up in some cases they did > seem to be NCBI bugs - could you raise a query with the NCBI > for this case please? > > If this is valid (which I doubt), then our object model doesn't cope. > > If this is invalid, then Biopython should give a warning and skip > this location. Right now I can't find the file to test this (see > query above about where it came from). > > Regards, > > Peter > -------------- next part -------------- A non-text attachment was scrubbed... Name: NC_001326.gb Type: application/octet-stream Size: 65527 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 15:34:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 15:34:48 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:18 PM, Matthias Bernt wrote: > Dear Peter, > > you are right the current RefSeq record is valid and can be parsed. In order > to reproduce old results I keep old refseq versions (of mitochondrial > genomes) on hard disk. So probably this is an old refseq bug. ... Could you email me (not the list) the old NC_003048.gb file please? Was there a similar issue in the NC_001326.gb file you just sent? It seems to load OK for me... 
Thanks, Peter From p.j.a.cock at googlemail.com Fri Dec 21 16:13:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:13:40 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt wrote: > Dear Peter, > > it's attached (from RefSeq39). For me parsing does not finish for this file > (biopython 1.6, python 2.7.3). > > Regards, > Matthias Got it, thanks. It also seems to get stuck for me too - there is a bug here :( See also: https://redmine.open-bio.org/issues/3197 Peter From p.j.a.cock at googlemail.com Fri Dec 21 16:54:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:54:38 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 4:13 PM, Peter Cock wrote: > On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt > wrote: >> Dear Peter, >> >> it's attached (from RefSeq39). For me parsing does not finish for this file >> (biopython 1.6, python 2.7.3). >> >> Regards, >> Matthias > > Got it, thanks. It also seems to get stuck for me too - there is a bug here :( > > See also: https://redmine.open-bio.org/issues/3197 The problem seems to be in the regular expression search itself getting stuck: $ python Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.GenBank import _re_complex_compound >>> _re_complex_compound.match("order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403)") ^CTraceback (most recent call last): File "<stdin>", line 1, in <module> KeyboardInterrupt Odd. 
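[Editor's note: Matthias's suggested fix - checking for the illegal nested operator before attempting the expensive regular expression match - can be sketched in a few lines. `has_nested_operator` is a hypothetical helper, not the actual Bio.GenBank code.]

```python
def has_nested_operator(location):
    # Per the INSDC feature table definition, nesting "join" inside
    # "order" (or vice versa) is illegal, while "complement" may be
    # combined with either. Detect the illegal case cheaply before
    # running the complex-compound regex, which can backtrack badly.
    inner = location
    for op in ("order(", "join("):
        if inner.startswith(op):
            inner = inner[len(op):]
            break
    return "join(" in inner or "order(" in inner

loc = ("order(6867..6872,6882..6890,"
       "join(7224..7229,8194..8208),8401..8403)")
print(has_nested_operator(loc))                            # True
print(has_nested_operator("join(1..5,complement(10..20))"))  # False
```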
Peter From ben at bendmorris.com Mon Dec 24 16:58:19 2012 From: ben at bendmorris.com (Ben Morris) Date: Mon, 24 Dec 2012 11:58:19 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo Message-ID: Hi all, I've implemented support for two new phylogenetic tree formats: NeXML and RDF (conforming to the Comparative Data Analysis Ontology). I noticed that NeXML support was planned, but I didn't see anyone working on it on GitHub and the feature request hadn't been updated in about a year, so I went ahead and implemented a simple version. At first I tried the generateDS.py approach, but the generated writer doesn't give very much control over the output, so I ended up writing my own parser/writer using ElementTree. As for the RDF/CDAO format, AFAIK this is not a format that's supported by any other phylogenetic libraries, so I'm not sure how useful this is to everyone else. It provides a simple, standards-compliant format that can be imported to a triple store and supports annotation. We'll be using it at NESCent so I wanted to make it available to everyone else as well. The parser and writer require the Redlands Python bindings. The code is available in my fork of Biopython, https://github.com/bendmorris/biopython under branches "cdao" and "nexml." I'd love to get everyone's thoughts and see if these contributions would be a good fit for the Biopython project. 
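[Editor's note: for reference, a minimal sketch - not Ben's actual implementation - of pulling nodes and edges out of a NeXML-like document with ElementTree. The element and attribute names here are illustrative only.]

```python
import io
import xml.etree.ElementTree as ET

# A tiny NeXML-like document; real NeXML uses namespaces and
# richer attributes, omitted here for brevity.
nexml = io.BytesIO(b"""<nexml><trees><tree id="t1">
<node id="n1"/><node id="n2"/><edge source="n1" target="n2"/>
</tree></trees></nexml>""")

nodes, edges = [], []
for event, elem in ET.iterparse(nexml, events=("end",)):
    if elem.tag == "node":
        nodes.append(elem.get("id"))
    elif elem.tag == "edge":
        edges.append((elem.get("source"), elem.get("target")))
        elem.clear()  # free memory for already-processed elements

print(nodes, edges)  # ['n1', 'n2'] [('n1', 'n2')]
```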
~Ben Morris PhD student, Department of Biology University of North Carolina at Chapel Hill and the National Evolutionary Synthesis Center ben at bendmorris.com From p.j.a.cock at googlemail.com Mon Dec 24 18:05:29 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Dec 2012 18:05:29 +0000 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 4:58 PM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. Sounds good - and the librdf Redlands Python bindings do seem to be a safe choice for RDF under Python. I guess we need Eric to take a look... and some tests would be needed too. 
Thanks, Peter From eric.talevich at gmail.com Tue Dec 25 07:18:40 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 24 Dec 2012 23:18:40 -0800 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. 
I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Best, Eric From ben at bendmorris.com Fri Dec 28 15:50:02 2012 From: ben at bendmorris.com (Ben Morris) Date: Fri, 28 Dec 2012 10:50:02 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich wrote: > > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: >> >> Hi all, >> >> I've implemented support for two new phylogenetic tree formats: NeXML and >> RDF (conforming to the Comparative Data Analysis Ontology). >> >> I noticed that NeXML support was planned, but I didn't see anyone working >> on it on GitHub and the feature request hadn't been updated in about a >> year, so I went ahead and implemented a simple version. At first I tried >> the generateDS.py approach, but the generated writer doesn't give very much >> control over the output, so I ended up writing my own parser/writer using >> ElementTree. >> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by >> any other phylogenetic libraries, so I'm not sure how useful this is to >> everyone else. It provides a simple, standards-compliant format that can be >> imported to a triple store and supports annotation. We'll be using it at >> NESCent so I wanted to make it available to everyone else as well. The >> parser and writer require the Redlands Python bindings. >> >> The code is available in my fork of Biopython, >> >> https://github.com/bendmorris/biopython >> >> under branches "cdao" and "nexml." 
I'd love to get everyone's thoughts and >> see if these contributions would be a good fit for the Biopython project. > > > > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: > > - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? Great point. I rewrote it to use iterparse instead. > - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) Went ahead and did this as well. > - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Not that I'm aware of, but I'm not sure. I searched http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything. I'm going to ask some people who know more about this than I do. ~Ben From diego_zea at yahoo.com.ar Fri Dec 28 23:33:35 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:33:35 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB Message-ID: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> One of the PDBs (I have a very large dataset of PDBs and a lot of them generate this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb And the error output is: /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. PDBConstructionWarning) /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. 
PDBConstructionWarning) Traceback (most recent call last): File "AsignarPDBaMIfile.py", line 45, in <module> cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) File "funciones_pdb.py", line 15, in contactos_CB cadena = model[cad] File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ return self.child_dict[id] KeyError: 'A' How can this be fixed? P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think the TER line may be the cause of the problem but I'm not sure): 2893 ATOM 2455 N PHE I 8 38.110 -15.236 4.503 0.89 0.76 N 2894 TER 2456 PHE I 8 2895 HETATM 2457 O HOH E 327 10.873 -3.134 11.448 0.89 0.01 O if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } From diego_zea at yahoo.com.ar Fri Dec 28 23:59:28 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:59:28 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB In-Reply-To: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> References: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> Message-ID: <1356739168.13594.YahooMailNeo@web140606.mail.bf1.yahoo.com> Excuse me, there is no error, only a warning on a lot of PDBs. I confused the chain in my example :/ if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } >________________________________ > From: Diego Zea >To: "biopython-dev at biopython.org" >Sent: Friday, 28 December 2012, 20:33 >Subject: [Biopython-dev] Error on Bio.PDB > >One of the PDBs (I have a very large dataset of PDBs and a lot of them generate this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb > >And the error output is: >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. > 
PDBConstructionWarning) >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. > PDBConstructionWarning) >Traceback (most recent call last): > File "AsignarPDBaMIfile.py", line 45, in <module> > cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) > File "funciones_pdb.py", line 15, in contactos_CB > cadena = model[cad] > File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ > return self.child_dict[id] >KeyError: 'A' > >How can this be fixed? > >P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think the TER line may be the cause of the problem but I'm not sure): > >2893 ATOM 2455 N PHE I 8 38.110 -15.236 4.503 0.89 0.76 N >2894 TER 2456 PHE I 8 >2895 HETATM 2457 O HOH E 327 10.873 -3.134 11.448 0.89 0.01 O > >if ((dx*dp)>=(h/(2*pi))) >{ >printf("Diego Javier Zea\n"); >} >_______________________________________________ >Biopython-dev mailing list >Biopython-dev at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From redmine at redmine.open-bio.org Sun Dec 30 12:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 30 Dec 2012 12:46:35 +0000 Subject: [Biopython-dev] [Biopython - Feature #3388] add annotation and letter_annotations attributes for the Bio.Align.MultipleSeqAlignment object References: Message-ID: Issue #3388 has been updated by Peter Cock. 
Support for a generic annotation dictionary done, https://github.com/biopython/biopython/commit/793f9210696e0acc9606faeca3d6ca47a9d97813 Started work on per-column annotation as well - currently on this branch: https://github.com/peterjc/biopython/tree/per-column-annotation ---------------------------------------- Feature #3388: add annotation and letter_annotations attributes for the Bio.Align.MultipleSeqAlignment object https://redmine.open-bio.org/issues/3388 Author: saverio vicario Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: At the moment I cannot add annotations at the alignment level. An annotation could be useful for tracking info linked to the loci (i.e. the name of a domain), while a letter annotation could be useful for tracking the quality score of the alignment, or whether the sites belong to a given character set. In particular, when two alignments are merged it would be useful for the boundary of the merge to be tracked. For example, in the merge of an alignment a with 10 sites and an alignment b with 5 sites, the letter_annotations would be as follows: {locus1:'111111111100000',locus2:'000000000011111'} This could also be useful for annotating the 3 positions of codons: {pos1:'1001001001',pos2:'0100100100', pos3:'0010010010'} If this letter_annotation were supported, the annotation could be kept across merging and splitting of the alignment -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
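[Editor's note: the per-column bookkeeping the ticket asks for can be sketched with plain dicts of 0/1 strings; no Biopython API is assumed, and `merged_column_masks` is a hypothetical helper.]

```python
def merged_column_masks(widths):
    # Given the widths of the alignments being concatenated, build one
    # 0/1 mask string per locus marking which columns it contributed.
    total = sum(widths)
    masks, offset = {}, 0
    for i, width in enumerate(widths, start=1):
        masks[f"locus{i}"] = "0" * offset + "1" * width + "0" * (total - offset - width)
        offset += width
    return masks

# The ticket's example: alignment a with 10 sites merged with b with 5 sites.
print(merged_column_masks([10, 5]))
# {'locus1': '111111111100000', 'locus2': '000000000011111'}
```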