From p.j.a.cock at googlemail.com Sun Dec 2 18:41:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 2 Dec 2012 23:41:49 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 4:46 PM, Peter Cock wrote: > > Done, > https://github.com/biopython/biopython/commit/9f6e810cc68dd1e353d899772fda3053d9f49513 > >>> Once that's done there is some housekeeping to do, like >>> the indexing code duplication with Bio.SeqIO, and tackling >>> indexing BGZF compressed files with Bio.SearchIO which >>> I will have a go at. >> >> Yes. > > Started, it seems the two _index.py files have diverged a > little more than I'd expected: > https://github.com/biopython/biopython/commit/ad1786b99afd2a50248246d877ff00a53949546b I've just refactored the code in order to avoid most of the index duplication (including SQLite backend) between the SeqIO and new SearchIO index and index_db functions. In the short term at least, the common code is now part of Bio/File.py (but remains as private classes). That seemed neater than introducing a new private module. Fingers crossed everything is fine on the buildslaves, TravisCI seems happy. Bow, if you find I've broken anything then we need more unit tests ;) Regards, Peter From w.arindrarto at gmail.com Mon Dec 3 06:22:07 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 3 Dec 2012 12:22:07 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter, >>>> Once that's done there is some housekeeping to do, like >>>> the indexing code duplication with Bio.SeqIO, and tackling >>>> indexing BGZF compressed files with Bio.SearchIO which >>>> I will have a go at. >>> >>> Yes. 
>> >> Started, it seems the two _index.py files have diverged a >> little more than I'd expected: >> https://github.com/biopython/biopython/commit/ad1786b99afd2a50248246d877ff00a53949546b > > I've just refactored the code in order to avoid most of the > index duplication (including SQLite backend) between the > SeqIO and new SearchIO index and index_db functions. Thanks :). I remember I did change some of the variable names. Other than this, the biggest change is probably related to the Indexer classes lazy loading in SearchIO. But it seems to have been handled as well :). > In the short term at least, the common code is now part > of Bio/File.py (but remains as private classes). That > seemed neater than introducing a new private module. Looks like a good place for now, Bio.File as the location for common file-handling code. > Fingers crossed everything is fine on the buildslaves, > TravisCI seems happy. Bow, if you find I've broken > anything then we need more unit tests ;) Will keep that in mind :). regards, Bow From p.j.a.cock at googlemail.com Mon Dec 3 06:36:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 11:36:16 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 11:22 AM, Wibowo Arindrarto wrote: > Hi Peter, > >> I've just refactored the code in order to avoid most of the >> index duplication (including SQLite backend) between the >> SeqIO and new SearchIO index and index_db functions. > > Thanks :). I remember I did change some of the variable names. Basically I moved the core SeqIO indexing code into Bio.File, generalised it enough to work for SearchIO as well, then removed the SearchIO indexing code. > Other than this, the biggest change is probably related to the > Indexer classes lazy loading in SearchIO. But it seems to have > been handled as well :). 
Yes, the SearchIO indexing is still calling your lazy loading function to get the parser objects. >> In the short term at least, the common code is now part >> of Bio/File.py (but remains as private classes). That >> seemed neater than introducing a new private module. > > Looks like a good place for now, Bio.File as the location for > common file-handling code. That was my thinking too. >> Fingers crossed everything is fine on the buildslaves, >> TravisCI seems happy. Bow, if you find I've broken >> anything then we need more unit tests ;) > > Will keep that in mind :). *Grin* I've just done a base class for the random access proxy classes, potentially a little more refactoring to follow here (or renaming): https://github.com/biopython/biopython/commit/9721cd00b5662309456c3dc573642cbb88e4e0a1 Peter From christian at brueffer.de Mon Dec 3 07:46:23 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 03 Dec 2012 20:46:23 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup Message-ID: <50BC9F1F.4090904@brueffer.de> Hi, I just submitted pull request #102 which fixes several types of PEP8 warnings (found using the awesome pep8 tool). 
Here's what's left after those fixes:

$ pep8 --statistics -qq repos/biopython
  789 E111 indentation is not a multiple of four
  673 E121 continuation line indentation is not a multiple of four
  693 E122 continuation line missing indentation or outdented
  171 E123 closing bracket does not match indentation of opening bracket's line
   86 E124 closing bracket does not match visual indentation
   49 E125 continuation line does not distinguish itself from next logical line
  197 E126 continuation line over-indented for hanging indent
  575 E127 continuation line over-indented for visual indent
 1092 E128 continuation line under-indented for visual indent
  773 E201 whitespace after '('
  540 E202 whitespace before ')'
23543 E203 whitespace before ':'
   55 E211 whitespace before '('
  180 E221 multiple spaces before operator
   59 E222 multiple spaces after operator
 5848 E225 missing whitespace around operator
 6517 E231 missing whitespace after ','
 2544 E251 no spaces around keyword / parameter equals
  644 E261 at least two spaces before inline comment
  346 E262 inline comment should start with '# '
  156 E301 expected 1 blank line, found 0
 1838 E302 expected 2 blank lines, found 1
  364 E303 too many blank lines (2)
15553 E501 line too long (82 > 79 characters)
  857 E502 the backslash is redundant between brackets
  291 E701 multiple statements on one line (colon)
  122 E711 comparison to None should be 'if cond is None:'
 3707 W291 trailing whitespace
 1913 W293 blank line contains whitespace

I'm not sure where to go from here with regard to what's worth fixing and what would be considered repo churn (or gratuitous changes that make merging of existing patches harder). I'd especially like to clean up E301, E302, E701, E711, W291 and W293. Other items like E251 are more dubious, as some developers seem to prefer the current style. What do you think?
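A listing like the one above is easy to triage mechanically. As a rough illustration (the `parse_pep8_statistics` helper below is invented for this sketch; it is not part of the pep8 tool or of Biopython), the warning classes can be ranked by count so the noisiest ones surface first:

```python
import re

def parse_pep8_statistics(text):
    """Turn `pep8 --statistics -qq` output lines into
    (count, code, description) tuples, most frequent first."""
    stats = []
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+([EW]\d{3})\s+(.*)", line)
        if match:
            stats.append((int(match.group(1)),
                          match.group(2),
                          match.group(3)))
    return sorted(stats, reverse=True)

# A few lines copied from the statistics listing above.
sample = """\
  789 E111 indentation is not a multiple of four
15553 E501 line too long (82 > 79 characters)
  122 E711 comparison to None should be 'if cond is None:'
"""
for count, code, description in parse_pep8_statistics(sample):
    print(count, code, description)
```

The same ranking could of course be done with `sort -rn` on the raw output; the helper only becomes useful once you start filtering against a whitelist of codes the project has agreed to ignore.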
Chris From p.j.a.cock at googlemail.com Mon Dec 3 08:34:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 13:34:52 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BC9F1F.4090904@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer wrote: > Hi, Hi Christian, Thanks for all the pull requests sorting out issues like this, in terms of lines of code you'll probably be one of the top contributors to the next release ;) This sort of work isn't as high profile as new features or bug fixes, but has a more subtle role in the long term of the project - making our code easier to follow etc. So we do appreciate these contributions. > I just submitted pull request #102 which fixes several types of PEP8 > warnings (found using the awesome pep8 tool). 101 not 102? https://github.com/biopython/biopython/pull/101 > Here's what's left after those fixes: > > $ pep8 --statistics -qq repos/biopython > 789 E111 indentation is not a multiple of four That's nasty - although I think we've got rid of all the tabbed indentation already which was also very annoying. > 673 E121 continuation line indentation is not a multiple of four I suspect many of those are a style judgement and done that way to line up parentheses etc. 
> 693 E122 continuation line missing indentation or outdented > 171 E123 closing bracket does not match indentation of opening bracket's > line > 86 E124 closing bracket does not match visual indentation > 49 E125 continuation line does not distinguish itself from next logical > line > 197 E126 continuation line over-indented for hanging indent > 575 E127 continuation line over-indented for visual indent > 1092 E128 continuation line under-indented for visual indent > 773 E201 whitespace after '(' > 540 E202 whitespace before ')' > 23543 E203 whitespace before ':' > 55 E211 whitespace before '(' I'd like to see E201, E202, and E211 fixed (whitespace next to parentheses). The count for E203 is surprisingly high - I suspect that could include some large dictionaries? Note some of the dictionaries are auto-generated so the code to do that would also need fixing. > 180 E221 multiple spaces before operator > 59 E222 multiple spaces after operator > 5848 E225 missing whitespace around operator > 6517 E231 missing whitespace after ',' > 2544 E251 no spaces around keyword / parameter equals > 644 E261 at least two spaces before inline comment > 346 E262 inline comment should start with '# ' > 156 E301 expected 1 blank line, found 0 > 1838 E302 expected 2 blank lines, found 1 > 364 E303 too many blank lines (2) > 15553 E501 line too long (82 > 79 characters) > 857 E502 the backslash is redundant between brackets Fixing E502 seems a good idea, I suspect many of these are purely accidental due to not realising when they are redundant. > 291 E701 multiple statements on one line (colon) > 122 E711 comparison to None should be 'if cond is None:' > 3707 W291 trailing whitespace > 1913 W293 blank line contains whitespace > > I'm not sure where to go from here with regard to what's worth fixing and > what would be considered repo churn (or gratuitous changes that make > merging of existing patches harder). 
> > I'd especially like to clean up E301, E302, E301 and E302 presumably are about the recommended spacing between function, class and method names? If you want to fix them next that seems low risk in terms of complicating merges. > ... E701, E711, W291 and W293. Did you already fix most of those in today's pull request? https://github.com/biopython/biopython/pull/101 If there are more cases, then by all means fix them too. > Other items like E251 are more dubious, as some developers > seem to prefer the current style. > > What do you think? We have a range of styles in the current code base reflecting different authors - and also changes in the Python conventions as some of the code is now over ten years old. And if any of my personal coding style is flagged, I'm willing to adapt ;) (e.g. I've learnt not to put a space before if statement colons) As you point out, the "repo churn" from fixing minor things like spaces around operators does have a cost in making merges a little harder. Things like the exception style updates which you've already fixed (seems I missed some) are more urgent for Python 3 support, so worth doing anyway. You've got us a lot closer to PEP8 compliance - do you think subject to a short white list of known cases (like module names) where we don't follow PEP8 we could aim to run a pep8 tool automatically (e.g. as a unit test, or even a commit hook)? That is quite appealing as a way to spot any new code which breaks the style guidelines... Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 09:02:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 14:02:40 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names?
In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 1:49 PM, Peter Cock wrote: > > Once that's done there is some housekeeping to do, like > the indexing code duplication with Bio.SeqIO, and tackling > indexing BGZF compressed files with Bio.SearchIO which > I will have a go at. > I've started work on SearchIO indexing of BGZF files now, enabling it was quite simple (the same code as used for the SeqIO indexing): https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f Thus far I've only tested this with BLAST XML, but that did require a bit of reworking to avoid doing file offset arithmetic: https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 I will resume this work later this afternoon, going over all the SearchIO file formats one by one. Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 11:49:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 16:49:47 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 2:02 PM, Peter Cock wrote: > > I've started work on SearchIO indexing of BGZF files now, > enabling it was quite simple (the same code as used for > the SeqIO indexing): > https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f > > Thus far I've only tested this with BLAST XML, but that did > require a bit of reworking to avoid doing file offset arithmetic: > https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 > > I will resume this work later this afternoon, going over all > the SearchIO file formats one by one. I've refactored test_SearchIO_index.py to make adding additional get_raw tests easier. Proper testing of all the formats with BGZF will need some larger test files (over 64k before compression) which we probably don't want to include in the repository.
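For context on the offset arithmetic being avoided here: a BGZF "virtual offset" is a 64-bit value that packs the compressed block's start position in the file into the upper 48 bits, and the offset within that block once decompressed (BGZF blocks hold at most 64 KiB) into the lower 16 bits. Bio.bgzf provides make_virtual_offset and split_virtual_offset for this; the stdlib-only sketch below just illustrates the packing:

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a BGZF virtual offset: upper 48 bits give the start of the
    compressed BGZF block in the file, lower 16 bits give the position
    within that block once decompressed (blocks are at most 64 KiB)."""
    if not 0 <= within_block_offset < 65536:
        raise ValueError("Within-block offset must fit in 16 bits")
    return (block_start_offset << 16) | within_block_offset

def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF

voffset = make_virtual_offset(100000, 10)
assert split_virtual_offset(voffset) == (100000, 10)
```

Because the block start lives in the high bits, virtual offsets still sort in file order, which is what lets the index_db SQLite tables treat them like ordinary offsets.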
However, I also added code to additionally test Bio.SearchIO.index_db(...).get_raw(...) as well as your original testing of Bio.SearchIO.index(...).get_raw(...) alone. These should return the exact same string, and that is now working nicely for BLAST XML (and BGZF from limited testing), but not on all the formats. Could you look at the difference in get_raw and the record length found during indexing for: blast-tab (with comments), hmmscan3-domtab, hmmer3-tab, and hmmer3-text? i.e. Anything where test_SearchIO_index.py is now printing a WARNING line when run. Thanks, Peter From christian at brueffer.de Mon Dec 3 12:02:31 2012 From: christian at brueffer.de (Christian Brueffer) Date: Tue, 04 Dec 2012 01:02:31 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> Message-ID: <50BCDB27.7040402@brueffer.de> On 12/3/12 21:34 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer > wrote: >> Hi, > > Hi Christian, > > Thanks for all the pull requests sorting out issues like this, in > terms of lines of code you'll probably be one of the top > contributors to the next release ;) This sort of work isn't as > high profile as new features or bug fixes, but has a more > subtle role in the long term of the project - making our code > easier to follow etc. So we do appreciate these contributions. > >> I just submitted pull request #102 which fixes several types of PEP8 >> warnings (found using the awesome pep8 tool). > > 101 not 102? https://github.com/biopython/biopython/pull/101 > 102 and 103 (I actually meant 103). >> Here's what's left after those fixes: >> >> $ pep8 --statistics -qq repos/biopython >> 789 E111 indentation is not a multiple of four > > That's nasty - although I think we've got rid of all the tabbed > indentation already which was also very annoying. > Some code uses two spaces etc, definitely worth fixing.
>> 673 E121 continuation line indentation is not a multiple of four > > I suspect many of those are a style judgement and done that > way to line up parentheses etc. > I'll see about those and apply case by case judgement. >> 693 E122 continuation line missing indentation or outdented >> 171 E123 closing bracket does not match indentation of opening bracket's >> line >> 86 E124 closing bracket does not match visual indentation >> 49 E125 continuation line does not distinguish itself from next logical >> line >> 197 E126 continuation line over-indented for hanging indent >> 575 E127 continuation line over-indented for visual indent >> 1092 E128 continuation line under-indented for visual indent >> 773 E201 whitespace after '(' >> 540 E202 whitespace before ')' >> 23543 E203 whitespace before ':' >> 55 E211 whitespace before '(' > > I'd like to see E201, E202, and E211 fixed (whitespace next to > parentheses). > > The count for E203 is surprisingly high - I suspect that > could include some large dictionaries? Note some of the > dictionaries are auto-generated so the code to do that > would also need fixing. > >> 180 E221 multiple spaces before operator >> 59 E222 multiple spaces after operator >> 5848 E225 missing whitespace around operator >> 6517 E231 missing whitespace after ',' >> 2544 E251 no spaces around keyword / parameter equals >> 644 E261 at least two spaces before inline comment >> 346 E262 inline comment should start with '# ' >> 156 E301 expected 1 blank line, found 0 >> 1838 E302 expected 2 blank lines, found 1 >> 364 E303 too many blank lines (2) >> 15553 E501 line too long (82 > 79 characters) >> 857 E502 the backslash is redundant between brackets > > Fixing E502 seems a good idea, I suspect many of these are > purely accidental due to not realising when they are redundant. > Agreed. 
>> 291 E701 multiple statements on one line (colon) >> 122 E711 comparison to None should be 'if cond is None:' >> 3707 W291 trailing whitespace >> 1913 W293 blank line contains whitespace >> >> I'm not sure where to go from here with regard to what's worth fixing and >> what would be considered repo churn (or gratuitous changes that make >> merging of existing patches harder). >> >> I'd especially like to clean up E301, E302, > > E301 and E302 presumable are about the recommended spacing > between function, class and method names? If you want to fix > them next that seems low risk in terms of complicating merges. > That and spacing between functions or between a function and a new class. >> ... E701, E711, W291 and W293. > > Did you already fix most of those in today's pull request? > https://github.com/biopython/biopython/pull/101 > > If there are more cases, then by all means fix them too. > I fixed some in Nexus, that was before actually using the pep8 tool. >> Other items like E251 are more dubious, as some developers >> seem to prefer the current style. >> >> What do you think? > > We have a range of styles in the current code base reflecting > different authors - and also changes in the Python conventions > as some of the code is now over ten years old. And if any of > my personal coding style is flagged, I'm willing to adapt ;) > > (e.g. I've learnt not to put a space before if statement colons) > > As you point out, the "repo churn" from fixing minor things > like spaces around operators does have a cost in making > merges a little harder. Things like the exception style updates > which you've already fixed (seems I missed some) are more > urgent for Python 3 support, so worth doing anyway. > On the other hand, it's basically a one-time cost. However I want to fix the lowest-hanging fruit (read: the ones with the lowest counts ;-) first. 
> You've got us a lot closer to PEP8 compliance - do you think > subject to a short white list of known cases (like module > names) where we don't follow PEP8 we could aim to run a > a pep8 tool automatically (e.g. as a unit test, or even a commit > hook)? That is quite appealing as a way to spot any new code > which breaks the style guidelines... > Having a commit hook would be ideal (maybe with a possibility to override). This would be especially useful against the introduction of gratuitous whitespace. With some editors/IDEs you don't even notice it. Chris From w.arindrarto at gmail.com Tue Dec 4 08:33:32 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 4 Dec 2012 14:33:32 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter and everyone, >> I've started work on SearchIO indexing of BGZF files now, >> enabling it was quite simple (the same code as used for >> SeqIO the indexing): >> https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f >> >> Thus far I've only tested this with BLAST XML, but that did >> require a bit of reworking to avoid doing file offset arithmetic: >> https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 >> >> I will resume this work later this afternoon, going over all >> the SearchIO file formats one by one. Yes, the original one that I wrote did have some less straightforward arithmetic as I was trying to adhere to the strict XML definition (i.e. no matter the whitespace outside of the start and end elements, indexing will still work). But line-based indexing should work too (and is simpler) so long as BLAST XML keeps its style (and any user modification afterwards doesn't introduce any wacky whitespaces). > I've refactored test_SearchIO_index.py to make adding > additional get_raw tests easier. 
Proper testing of all the > formats with BGZF will need some larger test files (over 64k > before compression) which we probably don't want to > include in the repository. > > However, I also added code to additionally test > Bio.SearchIO.index_db(...).get_raw(...) as well as your > original testing of Bio.SearchIO.index(...).get_raw(...) > alone. These should return the exact same string, and > that is now working nicely for BLAST XML (and BGZF > from limited testing), but not on all the formats. > > Could you look at the difference in get_raw and the > record length found during indexing for: blast-tab > (with comments), hmmscan3-domtab, hmmer3-tab, > and hmmer3-text? > > i.e. Anything where test_SearchIO_index.py is now > printing a WARNING line when run. Sure :). Based on a quick initial look, it seems that these are due to filler texts (e.g. the BLAST tab format ending with lines like "# BLAST processed 3 queries"). These texts won't affect the calculation results and the values of our objects, but do add additional text length. regards, Bow From redmine at redmine.open-bio.org Tue Dec 4 18:01:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 4 Dec 2012 23:01:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3399] (New) SearchIO hmmer3-text parser fails to parse hits that have large gaps Message-ID: Issue #3399 has been reported by Kai Blin. ---------------------------------------- Bug #3399: SearchIO hmmer3-text parser fails to parse hits that have large gaps https://redmine.open-bio.org/issues/3399 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: While trying to parse a hit that has a really bad match to the profile, there might be alignment lines that don't contain query sequence characters at all. In that case the SearchIO hmmer3-text module currently throws a ValueError
>>> it = SearchIO.parse('../broken.hsr', 'hmmer3-text')
>>> i = it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SearchIO/__init__.py", line 313, in parse
    for qresult in generator:
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 60, in __iter__
    for qresult in self._parse_qresult():
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 145, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 188, in _parse_hit
    hit_list = self._create_hits(hit_list, qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 309, in _create_hits
    self._parse_aln_block(hid, hit.hsps)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 358, in _parse_aln_block
    frag.query = aliseq
  File "Bio/SearchIO/_model/hsp.py", line 816, in _query_set
    self._query = self._set_seq(value, 'query')
  File "Bio/SearchIO/_model/hsp.py", line 784, in _set_seq
    len(seq), seq_type))
ValueError: Sequence lengths do not match. Expected: 202 (hit); found: 131 (query).
See the attached file broken.hsr for a dataset that triggers the error. If you remove the esterase hit (including the domain annotation), this error does not happen (broken2.hsr). If you insert fake position information into the query sequence line (broken3.hsr), the parser is happy again. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Dec 5 01:46:20 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 07:46:20 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi everyone, >> However, I also added code to additionally test >> Bio.SearchIO.index_db(...).get_raw(...) as well as your >> original testing of Bio.SearchIO.index(...).get_raw(...) >> alone. These should return the exact same string, and >> that is now working nicely for BLAST XML (and BGZF >> from limited testing), but not on all the formats. >> >> Could you look at the difference in get_raw and the >> record length found during indexing for: blast-tab >> (with comments), hmmscan3-domtab, hmmer3-tab, >> and hmmer3-text? >> >> i.e. Anything where test_SearchIO_index.py is now >> printing a WARNING line when run. > > Sure :). Based on a quick initial look, it seems that these are due to > filler texts (e.g. the BLAST > tab format ending with lines like "# BLAST processed 3 queries"). > These texts won't affect the calculation results and the values of our > objects, but does add additional text length. I've looked into this and submitted a pull request to fix the issues here: https://github.com/biopython/biopython/pull/111. The details on the errors are also there. 
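The invariant behind those WARNING lines — get_raw must return exactly the byte span measured while indexing — can be sketched with a toy line-based indexer over an in-memory file. Nothing below is Biopython code; the "Query:" format and the trailing comment line are invented to mimic the blast-tab case where filler text inflates the last record's length:

```python
from io import BytesIO

def index_queries(handle):
    """Toy indexer: map query id -> (offset, length), where a line
    starting with b'Query:' opens a new record.  Trailing comment
    lines get counted into the final record, which is exactly the
    sort of length bookkeeping discussed above."""
    offsets = {}
    key, start, pos = None, 0, 0
    for line in iter(handle.readline, b""):
        if line.startswith(b"Query:"):
            if key is not None:
                offsets[key] = (start, pos - start)
            key, start = line.split()[1].decode(), pos
        pos += len(line)
    if key is not None:
        offsets[key] = (start, pos - start)
    return offsets

def get_raw(handle, offset, length):
    """Return the raw bytes of one indexed record."""
    handle.seek(offset)
    return handle.read(length)

data = BytesIO(b"Query: q1\nhit A\nQuery: q2\nhit B\n# processed 2 queries\n")
index = index_queries(data)
assert get_raw(data, *index["q2"]) == b"Query: q2\nhit B\n# processed 2 queries\n"
```

Whether the final comment line should belong to the last record or be skipped is precisely the judgement call the pull request had to make per format.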
regards, Bow From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 02:24:14 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Wed, 05 Dec 2012 17:24:14 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. Message-ID: <50BEF69E.2000806@biotech.uni-tuebingen.de> Hi folks, I'm trying to finally get my hmmer2-text parser in, but I'm failing one unit test. The code is a bit too smart for me, it seems. So in the file I'm parsing, I only ever get the description of the hit in the hit table, like this (apologies if my mail client breaks this):

Model         Description                            Score   E-value   N
--------      -----------                            -----   -------  ---
Glu_synthase  Conserved region in glutamate synthas  858.6  3.6e-255    2

But of course I can't create a hit object when parsing the hit table, as I first need to have HSPFragments to create the hit object with. Anyway, I create a placeholder hit object that I'll later convert into a real Hit object. In that placeholder object, I set a description. Now I'm parsing the HSP table, looking like this:

Model     Domain  seq-f  seq-t     hmm-f  hmm-t     score   E-value
--------  ------  -----  -----     -----  -----     -----   -------
GATase_2  1/1        34    404 ..      1    385 []  731.8  3.9e-226

The HSP table is in a different order than the hit table, so never mind the different model name. Now, I need to create an HSPFragment with the same description as the Hit object, or querying for the Hit object's description will cascade through the HSPs and HSPFragments, and return multiple values for the description. However, no matter what I do, I seem to get an <unknown description> tossed in there somehow. The parser is at https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py the test code is at https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py and the test file that's failing is the hmmpfam2.3 file at https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out Any pointers would be appreciated.
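One pattern that avoids the mismatch Kai describes — kept deliberately independent of SearchIO's actual classes, which are not reproduced here — is to hold the description only on the placeholder and stamp it onto every fragment just before the real Hit is built, so a cascading attribute query can only ever see one value. A toy sketch with made-up class names:

```python
class FragmentSketch:
    """Stand-in for an HSP fragment; starts out with the kind of
    placeholder description Kai is seeing."""
    def __init__(self, hit_id):
        self.hit_id = hit_id
        self.hit_description = "<unknown description>"

class HitPlaceholder:
    """Collects fragments during parsing; the description is already
    known from the hit table before any fragment exists."""
    def __init__(self, hit_id, description):
        self.hit_id = hit_id
        self.description = description
        self.fragments = []

    def finalize(self):
        # Stamp the description onto every fragment, so a cascading
        # lookup returns a single value rather than a mixture.
        for frag in self.fragments:
            frag.hit_description = self.description
        return self.fragments

placeholder = HitPlaceholder("Glu_synthase",
                             "Conserved region in glutamate synthas")
placeholder.fragments.append(FragmentSketch("Glu_synthase"))
fragments = placeholder.finalize()
assert all(f.hit_description == "Conserved region in glutamate synthas"
           for f in fragments)
```

The point is only the ordering: defaults are overwritten in one place, after parsing and before construction, instead of at each fragment-creation site.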
The code is working fine in my current development work in general, and I'd love to get it upstream to get rid of an extra patch step during installation. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Wed Dec 5 06:41:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 11:41:05 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto wrote: > Hi everyone, > > I've done some digging around to see how to deal with these issues. > Here's what I found: > >> The BuildBot flagged two new issues overnight, >> http://testing.open-bio.org/biopython/tgrid >> >> Python 2.5 on Windows - doctests are failing due to floating point decimal place >> differences in the exponent (down to C library differences, something fixed in >> later Python releases). Perhaps a Python 2.5 hack is the way to go here? >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio > > I've submitted a pull request to fix this here: > https://github.com/biopython/biopython/pull/98 The Windows detection wasn't quite right, it should now match how we look for Windows elsewhere in Biopython: https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 >> There is a separate cross-platform issue on Python 3.1, "TypeError: >> invalid event tuple" again with XML parsing. Curiously this had started >> a few days back in the UniprotIO tests on one machine, pre-dating the >> SearchIO merge. I'm not sure what triggered it.
>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio > > As for this one, it seems that it's caused by a bug in Python3.1 > (http://bugs.python.org/issue9257) due to the way > `xml.etree.cElementTree.iterparse` accepts the `events` argument. Ah - I remember that bug now, we have a hack in place elsewhere to try and avoid that - seems it won't be fixed in Python 3.1.x now so I've relaxed the version check here: https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e Hopefully that will bring the buildbot back to all green tonight. (TravisCI has now dropped their Python 3.1 support, but they should have Python 3.3 with NumPy working soon). Peter From p.j.a.cock at googlemail.com Wed Dec 5 09:16:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 14:16:43 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BCDB27.7040402@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer wrote: >> As you point out, the "repo churn" from fixing minor things >> like spaces around operators does have a cost in making >> merges a little harder. Things like the exception style updates >> which you've already fixed (seems I missed some) are more >> urgent for Python 3 support, so worth doing anyway. >> > > On the other hand, it's basically a one-time cost. However I > want to fix the lowest-hanging fruit (read: the ones with the > lowest counts ;-) first. The sheer number of files touched in these PEP8 fixes would probably deserve to be called "repository churn" now - wow!
Although we have good test coverage, it isn't complete (anyone fancy trying some test coverage measuring tools like figleaf?) so there is a small but real risk we've accidentally broken something. I'm wondering if therefore a 'beta' release would be prudent, or if I am just worrying about things too much? >> You've got us a lot closer to PEP8 compliance - do you think >> subject to a short white list of known cases (like module >> names) where we don't follow PEP8 we could aim to run a >> pep8 tool automatically (e.g. as a unit test, or even a commit >> hook)? That is quite appealing as a way to spot any new code >> which breaks the style guidelines... > > Having a commit hook would be ideal (maybe with a possibility to > override). This would be especially useful against the introduction of > gratuitous whitespace. With some editors/IDEs you don't even notice it. Would you be interested in looking into how to set that up? Presumably a client-side git hook would be best, but we'd need to explore cross platform issues (e.g. developing and testing on Windows) and making sure it allowed an override on demand (where the developer wants/needs to ignore a style warning). Thanks, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 08:50:21 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 13:50:21 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer Message-ID: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following:

label_position: start|middle|end as per LinearDrawer
label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour
label_orientation: upright|circular which determines the orientation of the label.
upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse. This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? ..d The University of Dundee is a registered Scottish Charity, No: SC015096 From christian at brueffer.de Wed Dec 5 10:28:19 2012 From: christian at brueffer.de (Christian Brueffer) Date: Wed, 05 Dec 2012 23:28:19 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: <50BF6813.4070102@brueffer.de> On 12/5/12 22:16 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer > wrote: >>> As you point out, the "repo churn" from fixing minor things >>> like spaces around operators does have a cost in making >>> merges a little harder. Things like the exception style updates >>> which you've already fixed (seems I missed some) are more >>> urgent for Python 3 support, so worth doing anyway. >>> >> >> On the other hand, it's basically a one-time cost. However I >> want to fix the lowest-hanging fruit (read: the ones with the >> lowest counts ;-) first. > > The sheer number of files touched in these PEP8 fixes would > probably deserve to be called "repository churn" now - wow! > I wonder whether there's a file left I haven't touched yet (except the data files in Tests)... > Although we have good test coverage, it isn't complete (anyone > fancy trying some test coverage measuring tools like figleaf?) > so there is a small but real risk we've accidentally broken > something. I'm wondering if therefore a 'beta' release would > be prudent, or if I am just worrying about things too much? > It certainly can't hurt to advise users to have an extra eye on possible regressions and strange behaviours in existing code.
I think the only risky changes were the ones concerning indentation (f68d334b1edfd743fe8a7bb4654046295f0ff939); I was extra careful about those. So, I'm pretty confident I haven't screwed things up but it's good to be careful. FYI, here's the "pep8 --statistics -qq" output as of commit df4f12965a2ad3b6ed31bbf9d201bd5c716bd4ee:

680 E121 continuation line indentation is not a multiple of four
691 E122 continuation line missing indentation or outdented
171 E123 closing bracket does not match indentation of opening bracket's line
86 E124 closing bracket does not match visual indentation
197 E126 continuation line over-indented for hanging indent
601 E127 continuation line over-indented for visual indent
1072 E128 continuation line under-indented for visual indent
772 E201 whitespace after '('
536 E202 whitespace before ')'
23444 E203 whitespace before ':'
94 E221 multiple spaces before operator
11 E222 multiple spaces after operator
5763 E225 missing whitespace around operator
6519 E231 missing whitespace after ','
2542 E251 no spaces around keyword / parameter equals
622 E261 at least two spaces before inline comment
347 E262 inline comment should start with '# '
1044 E302 expected 2 blank lines, found 1
1 E303 too many blank lines (2)
15526 E501 line too long (82 > 79 characters)
3 E711 comparison to None should be 'if cond is None:'
75 W291 trailing whitespace
12 W293 blank line contains whitespace
5 W601 .has_key() is deprecated, use 'in'

E203 looks scary, but 9900 of those are in Bio/SubsMat/MatrixInfo.py alone. >>> You've got us a lot closer to PEP8 compliance - do you think >>> subject to a short white list of known cases (like module >>> names) where we don't follow PEP8 we could aim to run >>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>> hook)? That is quite appealing as a way to spot any new code >>> which breaks the style guidelines... >> >> Having a commit hook would be ideal (maybe with a possibility to >> override).
This would be especially useful against the introduction of >> gratuitous whitespace. With some editors/IDEs you don't even notice it. > > Would you be interested in looking into how to set that up? > Presumably a client-side git hook would be best, but we'd > need to explore cross platform issues (e.g. developing and > testing on Windows) and making sure it allowed an override > on demand (where the developer wants/needs to ignore a > style warning). > Yes, It's fairly high on my TODO list. Chris From p.j.a.cock at googlemail.com Wed Dec 5 10:57:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 15:57:44 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 1:50 PM, David Martin wrote: > Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. > > I'd like to modify the CircularDrawer feature drawing to allow the following: > > label_position: start|middle|end as per LinearDrawer I would find it natural if we treated start/middle/end from the point of view of the feature (and its strand) as in the LinearDrawer. However the current circular drawer tries to position things at the vertical bottom of the feature (it cares about the left and right halves of the circle) which is rather different. I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. 
> label_placement: inside|outside|overlap where inside and outside are > anchored just inside and just outside the feature but do not overlap it, > and overlap is the current behaviour If I have understood your intended meaning, that won't work nicely with stranded features. I would suggest two options: outside (i.e. outside the feature's bounding box, either outside the track circle for forward strand or strand-less, or inside the track circle for reverse strand) matching the current linear code, or inside matching the current circular code. I.e. this would essentially toggle the text element's anchoring between start/end, maintaining the convention that labels above/outside the track are for the forward strand (and strand-less) features, while labels below/inside the track are for reverse strand features. > label_orientation: upright|circular which determines the orientation of > the label. upright is the current behaviour. Circular would be oriented > to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). > This will cause some issues with track widths (how can you specify a > track width for a feature track?) Do you mean how to allocate more white space between the tracks to ensure the labels have a clear background if printed outside the features? The quick and dirty solution is a spacer track (you can allocate track numbers to leave a gap). > Any thoughts/suggestions? > Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week).
Regards, Peter From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 11:28:26 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:28:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On 5 Dec 2012, at Wednesday, December 5, 15:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 1:50 PM, David Martin > wrote: label_position: start|middle|end as per LinearDrawer I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. Yep - I agree label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). Good point - the automatic reorientation on either side of the circle (to respect the viewer's local gravity) could effectively be handled through a working label_angle for circular diagrams. And more adventurous manual reorientation would also be possible ;) One issue there is what the angle is defined with respect to: a 'vertical' reference on the page, or a tangent/normal to some point on the feature. The first is straightforward, and might be what we want - the second will likely result in some odd - or attractive - patterns. Comments in-line, if need be we could meet up to hash some of this out in person (although I not be in the Dundee area next week). Friday's good for me. L. 
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From ben at benfulton.net Wed Dec 5 11:28:52 2012 From: ben at benfulton.net (Ben Fulton) Date: Wed, 5 Dec 2012 11:28:52 -0500 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: I've been studying this a bit and have a preference for Ned Batchelder's Coverage tool. But I plan on putting some more work into it this week and next. 
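[Editorial note: the idea behind the coverage tools mentioned above can be sketched without any third-party package. A line-coverage tool records which lines execute while the tests run; the toy below does this with the standard library's sys.settrace. The gc_fraction function and the single "test" call are invented examples, not Biopython code.]

```python
import sys

def gc_fraction(seq):
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)

executed = set()
first = gc_fraction.__code__.co_firstlineno

def tracer(frame, event, arg):
    # Record 'line' events fired inside gc_fraction only.
    if event == "line" and frame.f_code.co_name == "gc_fraction":
        executed.add(frame.f_lineno - first)
    return tracer

sys.settrace(tracer)
gc_fraction("ACGT")  # our only "test" never exercises the empty-input branch
sys.settrace(None)

# Offsets of executed lines relative to the "def" line; the untested
# "return 0.0" branch (offset 2) is missing from the report.
print(sorted(executed))
```

A real tool like coverage.py does the same bookkeeping for every file, then reports the untested lines - here it would flag the empty-sequence branch.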
On Wed, Dec 5, 2012 at 9:16 AM, Peter Cock wrote >Although we have good test coverage, it isn't complete (anyone >fancy trying some test coverage measuring tools like figleaf?) From w.arindrarto at gmail.com Wed Dec 5 11:39:13 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 17:39:13 +0100 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: <50BEF69E.2000806@biotech.uni-tuebingen.de> References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: Hi Kai and everyone, Very happy to see the parser near completion (with tests too!). The issue you're facing is unfortunately the consequence of trying to keep attribute values in sync across the object hierarchy. It is a bit troublesome for now, but not without solution. > However, no matter what I do, I seem to get an > tossed in there somehow. > > The parser is at > https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py > the test code is at > https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py > and the test file that's failing is the hmmpfam2.3 file at > https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out '' is the default value for any description attribute (be it in the QueryResult object, or in the HSPFragment.hit_description). The error you're seeing is because the hit description is being accessed through the hit object (hit.description) and the cascading property getter checks first whether all HSP contains the same `hit_description` attribute value. It'll only return the value if all HSPFragment.hit_description values are equal. Otherwise, it'll raise the error you're seeing here. In your case, there are two values: 'Conserved region in glutamate synthas' and '', while there should only be one (the first one). 
After prodding here and there, it seems that this is caused by the if clause here: https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py#L191 The 'else' clause in that block adds the HSP to the hit object, but does not do any cascading attribute assignment (query_description and hit_description). Here, the simple fix would be to force a description assignment to the HSP. For example, you could have the `else` block like so:

    ...
    else:
        hit = unordered_hits[id_]
        hsp.hit_description = hit.description
        hit.append(hsp)

Other fixes are of course possible, but this is the simplest I can imagine (though it seems a bit crude). Also, I would like to note that the query description assignment of the parser may break the cascade as well. If you try to access `qresult.description` (qresult being the QueryResult object), you'd get the true query description. But if you try to access it from `qresult[0].query_description` (the query description stored in the hit object), you'd get ''. The fix here would be to assign the description at the last moment before the QueryResult object is yielded. That way, the cascading setter works properly and all Hit, HSP, and HSPFragment inside the QueryResult object will contain the same value. I realize that this approach is not without flaws (and I'm always open to suggestions), but at the moment this seems to be the most sensible way to keep the attribute values in-sync while keeping the objects more user-friendly (i.e. making the parser slightly more complex to write, but with the result of consistent attribute values to the users). Hope this helps!
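[Editorial note: the failure mode Bow describes can be made concrete with a toy model of the cascading getter. These classes and the sample description are invented for illustration - Bio.SearchIO's real implementation is richer - but they show the invariant the parser's `else` branch was breaking: the hit-level description only resolves when every child HSP agrees.]

```python
class HSP:
    """Toy stand-in for a SearchIO HSP; '' is the default description."""
    def __init__(self, hit_description=''):
        self.hit_description = hit_description

class Hit:
    """Toy stand-in for a SearchIO Hit with a cascading description."""
    def __init__(self, hsps):
        self._hsps = list(hsps)

    @property
    def description(self):
        # Cascading getter: every child HSP must agree, else raise.
        values = set(hsp.hit_description for hsp in self._hsps)
        if len(values) > 1:
            raise ValueError("inconsistent hit_description values: %r" % values)
        return values.pop()

    def append(self, hsp):
        # The fix from the thread: sync the description before appending.
        hsp.hit_description = self.description
        self._hsps.append(hsp)

hit = Hit([HSP('glutamate synthase region')])
hit.append(HSP())  # without the sync in append(), this would desynchronise
print(hit.description)  # prints 'glutamate synthase region'
```

Dropping the `hsp.hit_description = self.description` line reproduces the ValueError Kai was seeing, since the appended HSP keeps its default '' description.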
Bow From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 11:21:06 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:21:06 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? 
If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? [cid:4EA13CE3-20E7-41D8-870F-CBBAA9DD06B0 at scri.sari.ac.uk] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png Type: image/png Size: 22969 bytes Desc: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png URL: From d.m.a.martin at dundee.ac.uk Wed Dec 5 11:29:14 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 16:29:14 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Just got my head out of hacking at this. The options I have now are: label_position: start|middle|end with reference to the feature. So the end is always the pointy bit. label_orientation: circular|upright Sometimes it is nice to have a proper circular plot label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. It even works. Angles and so on are not so relevant with circular plots though I would prefer a label_angle: radial|tangent|[degrees] Should I attach an example?
..d From: Leighton Pritchard [mailto:Leighton.Pritchard at hutton.ac.uk] Sent: 05 December 2012 16:21 To: David Martin Cc: BioPython-Dev; Peter Cock Subject: Re: [Biopython-dev] Modifications to CircularDrawer Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? 
[cid:image001.png at 01CDD305.AA06C500] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22969 bytes Desc: image001.png URL: From p.j.a.cock at googlemail.com Wed Dec 5 11:57:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 16:57:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: > Just got my head out of hacking at this. The options I have now are: > > label_position: start|middle|end with reference to the feature. So the end is > always the pointy bit. Sounds good and uncontentious. > label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). > label_placement: inside|outside|overlap|strand which maintains overlap as > default, inside is all inside, outside is all outside, strand is forward outside > and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? Note the current circular behaviour which overlaps is strand aware, so those may not be the best names... See also my earlier email with an alternative suggestion. > It even works.
Angles and so on are not so relevant with circular plots > though I would prefer a label_angle: radial|tangent|[degrees] > > Should I attach an example? You can try if the files are not overly large (moderation delays will still occur), posting a link would be easier although probably less lasting. Are you OK with github? A natural option would be to show us your proposals on a branch (separate commits if possible, otherwise I can try and break out each bit if needed). Ta, Peter From p.j.a.cock at googlemail.com Wed Dec 5 12:24:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 17:24:08 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains overlap as >> default, inside is all inside, outside is all outside, strand is forward outside >> and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be done > to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well.
Regards, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 12:30:26 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 17:30:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5EE5@AMSPRD0410MB351.eurprd04.prod.outlook.com> -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: 05 December 2012 17:24 To: David Martin Cc: Leighton Pritchard; BioPython-Dev Subject: Re: [Biopython-dev] Modifications to CircularDrawer On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains >> overlap as default, inside is all inside, outside is all outside, >> strand is forward outside and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be > done to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well. Linear and Circular are similar but not identical. No problem with having a above|below|strand or a more complex anchoring scheme but I don't need it right now so I'm just playing with the circular one. I've attached a PDF to this mail - it might get through and I'll try to fork/clone/push git. ..d The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: plasmid_circular_nice.pdf Type: application/pdf Size: 148125 bytes Desc: plasmid_circular_nice.pdf URL: From p.j.a.cock at googlemail.com Wed Dec 5 13:41:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 18:41:59 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi David, I've been experimenting with your pull request, thank you: https://github.com/biopython/biopython/pull/116 On Wed, Dec 5, 2012 at 5:22 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 5:10 PM, David Martin wrote: >> In the mean-time here is a plot (that doesn't show all layouts) > > Nice. Looking at that now I'm pretty sure I hacked the label anchor > once before as a quick job in order to get the labels outside like that... > certainly worth making this change. Found it, that change made it to a branch I'd forgotten about: https://github.com/peterjc/biopython/commit/d4764dfe929f135ec55b83ad14a9cd34e2d14bba This is bringing back memories... I think I'd concluded last time that attempting to offer anything other than radial label orientation was probably a mistake, and that if we restrict that we can safely offset the vertical position of the text midline (since right now it is positioned according to the bottom line of the font). Without that, positioning labels at the top (as you look at the page) of a circular feature gave non-ideal placement. This is likely one reason for the current hard-coded placement of the feature labels at the bottom (as you look at the circle). Hmm.
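[Editorial note: the geometry being debated here can be sketched in a few lines. This is a standalone illustration with invented parameter names, not GenomeDiagram's actual code: given a feature's angle on the circle, place the label just inside or outside the track radius, rotate it radially, and flip it on the left half so the text never reads upside down.]

```python
import math

def radial_label(center, radius, angle_deg, placement="outside", pad=5.0):
    """Return (x, y, rotation_deg) for a radial feature label.

    angle_deg is measured clockwise from 12 o'clock, as on a plasmid map.
    """
    r = radius + pad if placement == "outside" else radius - pad
    theta = math.radians(90.0 - angle_deg)   # convert to the maths convention
    x = center[0] + r * math.cos(theta)
    y = center[1] + r * math.sin(theta)
    rotation = 90.0 - angle_deg              # text runs along the radius
    if 180.0 < angle_deg % 360.0 < 360.0:
        rotation += 180.0                    # flip labels on the left half
    return x, y, rotation

# A feature at 3 o'clock, labelled just outside a track of radius 100:
print(radial_label((0.0, 0.0), 100.0, 90.0))  # -> (105.0, 0.0, 0.0)
```

The flip on the left half is the "local gravity" point from earlier in the thread: without it, labels between 6 and 12 o'clock would render upside down.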
I think I have a compromise forming that would allow figures like your motivating example :) Peter From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 20:44:40 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Thu, 06 Dec 2012 11:44:40 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: <50BFF888.50300@biotech.uni-tuebingen.de> On 2012-12-06 02:39, Wibowo Arindrarto wrote: Hi Bow, everyone, > Very happy to see the parser near completion (with tests too!). The > issue you're facing is unfortunately the consequence of trying to keep > attribute values in sync across the object hierarchy. It is a bit > troublesome for now, but not without solution. ... > Here, the simple fix would be to force a description assignment to the > HSP. For example, you could have the `else` block like so: > > ... > else: > hit = unordered_hits[id_] > hsp.hit_description = hit.description > hit.append(hsp) Thanks for the tip, that was the last speedbump I had. I just sent off the pull request for the hmmer2 parser. Thanks again for the help, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From christian at brueffer.de Wed Dec 5 23:04:37 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 12:04:37 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BF6813.4070102@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> Message-ID: <50C01955.8060505@brueffer.de> On 12/05/2012 11:28 PM, Christian Brueffer wrote: > On 12/5/12 22:16 , Peter Cock wrote: [...]
> >>>> You've got us a lot closer to PEP8 compliance - do you think >>>> subject to a short white list of known cases (like module >>>> names) where we don't follow PEP8 we could aim to run >>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>> hook)? That is quite appealing as a way to spot any new code >>>> which breaks the style guidelines... >>> >>> Having a commit hook would be ideal (maybe with a possibility to >>> override). This would be especially useful against the introduction of >>> gratuitous whitespace. With some editors/IDEs you don't even notice it. >> >> Would you be interested in looking into how to set that up? >> Presumably a client-side git hook would be best, but we'd >> need to explore cross platform issues (e.g. developing and >> testing on Windows) and making sure it allowed an override >> on demand (where the developer wants/needs to ignore a >> style warning). >> > > Yes, it's fairly high on my TODO list. > I just had a look at this. Turns out some people have had this idea before :-) Here's a first version: https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit Basically you just save this as biopython/.git/hooks/pre-commit and mark it executable. You also need to install pep8 (pip install pep8). The checks can be bypassed with git commit --no-verify. Currently it ignores E124 (which I think should remain that way). Any other errors or files it should ignore? I'd be grateful if someone could give this a try on Windows.
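For reference, the core logic of such a hook is small. A hypothetical minimal sketch in Python (the linked pre-commit script is the real, more thorough version; `pep8` here is the command-line tool installed via pip, and the helper names are made up for illustration):

```python
import subprocess


def staged_python_files():
    """Return staged .py files (added/copied/modified) from git's index."""
    try:
        out = subprocess.check_output(
            ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"])
    except (OSError, subprocess.CalledProcessError):
        # Not in a git repository (or git missing) - nothing to check.
        return []
    return [f for f in out.decode().splitlines() if f.endswith(".py")]


def run_style_check():
    """Run pep8 over the staged files; nonzero return aborts the commit."""
    files = staged_python_files()
    if not files:
        return 0
    try:
        # E124 (closing bracket indentation) is deliberately ignored,
        # as discussed above.
        return subprocess.call(["pep8", "--ignore=E124"] + files)
    except OSError:
        # pep8 itself is not installed; don't block the commit.
        return 0
```

Saved as `.git/hooks/pre-commit` with a shebang and an exit based on `run_style_check()`, it behaves as described above, including the `git commit --no-verify` escape hatch.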
Chris From christian at brueffer.de Thu Dec 6 01:22:24 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 14:22:24 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50C01955.8060505@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> <50C01955.8060505@brueffer.de> Message-ID: <50C039A0.8040208@brueffer.de> On 12/06/2012 12:04 PM, Christian Brueffer wrote: > On 12/05/2012 11:28 PM, Christian Brueffer wrote: >> On 12/5/12 22:16 , Peter Cock wrote: > [...] >> >>>>> You've got us a lot closer to PEP8 compliance - do you think >>>>> subject to a short white list of known cases (like module >>>>> names) where we don't follow PEP8 we could aim to run a >>>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>>> hook)? That is quite appealing as a way to spot any new code >>>>> which breaks the style guidelines... >>>> >>>> Having a commit hook would be ideal (maybe with a possibility to >>>> override). This would be especially useful against the introduction of >>>> gratuitous whitespace. With some editors/IDEs you don't even notice >>>> it. >>> >>> Would you be interested in looking into how to set that up? >>> Presumably a client-side git hook would be best, but we'd >>> need to explore cross platform issues (e.g. developing and >>> testing on Windows) and making sure it allowed an override >>> on demand (where the developer wants/needs to ignore a >>> style warning). >>> >> >> Yes, It's fairly high on my TODO list. >> > > I just had a look at this. Turns out some people have had this idea > before :-) > > Here's a first version: > > https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit > > Basically you just save this as biopython/.git/hooks/pre-commit and mark > it executable. You also need to install pep8 (pip install pep8). The > checks can be bypassed with git commit --no-verify. 
> > Currently it ignores E124 (which I think should remain that way). Any > other errors or files it should ignore? > > I'd be grateful if someone could give this a try on Windows. > Thinking about it, I think it would make sense to ignore the following:

E121 continuation line indentation is not a multiple of four
E122 continuation line missing indentation or outdented
E123 closing bracket does not match indentation of opening bracket's line
E124 closing bracket does not match visual indentation
E126 continuation line over-indented for hanging indent
E127 continuation line over-indented for visual indent
E128 continuation line under-indented for visual indent

They all deal with indentation, but are not always beneficial to readability. E125, which is a useful one, is missing from that list. Chris From p.j.a.cock at googlemail.com Thu Dec 6 05:07:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:07:55 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Wed, Dec 5, 2012 at 11:41 AM, Peter Cock wrote: > On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I've done some digging around to see how to deal with these issues. >> Here's what I found: >> >>> The BuildBot flagged two new issues overnight, >>> http://testing.open-bio.org/biopython/tgrid >>> >>> Python 2.5 on Windows - doctests are failing due to floating point decimal place >>> differences in the exponent (down to C library differences, something fixed in >>> later Python releases). Perhaps a Python 2.5 hack is the way to go here?
http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio >> >> I've submitted a pull request to fix this here: >> https://github.com/biopython/biopython/pull/98 > > The Windows detection wasn't quite right, it should now match > how we look for Windows elsewhere in Biopython: > https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 > >>> There is a separate cross-platform issue on Python 3.1, "TypeError: >>> invalid event tuple" again with XML parsing. Curiously this had started >>> a few days back in the UniprotIO tests on one machine, pre-dating the >>> SearchIO merge. I'm not sure what triggered it. >>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >>> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio >> >> As for this one, it seems that it's caused by a bug in Python 3.1 >> (http://bugs.python.org/issue9257) due to the way >> `xml.etree.cElementTree.iterparse` accepts the `events` argument. > > Ah - I remember that bug now, we have a hack in place elsewhere > to try and avoid that - seems it won't be fixed in Python 3.1.x now > so I've relaxed the version check here: > https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e > > Hopefully that will bring the buildbot back to all green tonight. > (TravisCI has now dropped their Python 3.1 support, but they > should have Python 3.3 with NumPy working soon). > > Peter OK, the buildbot looks happy now from the SearchIO work. There is one issue under Python 3.1.5 on a 64 bit Linux server, which I suspect is down to the Python version (this buildslave used to run an older version, Python 3.1.3; separate email to follow).
Regards, Peter From p.j.a.cock at googlemail.com Thu Dec 6 05:24:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:24:47 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? Message-ID: On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: > > OK, the buildbot looks happy now from the SearchIO work. > > There is one issue under Python 3.1.5 on a 64 bit Linux server, > which I suspect is down to the Python version (this buildslave > used to run an older version - Python 3.1.3 (separate email > to follow). There are 18 test failures like this - all to do with handles and stdout, which have been happening for a while now but I've not found time to look into it. Example: ====================================================================== ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) needle with asis trick, output piped to stdout. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 74, in __next__ line = self._header AttributeError: 'EmbossIterator' object has no attribute '_header' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", line 571, in test_needle_piped align = AlignIO.read(child.stdout, "emboss") File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 418, in read first = next(iterator) File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 366, in parse for a in i: File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 77, in __next__ line = 
handle.readline() AttributeError: '_io.FileIO' object has no attribute 'read1' Last working build, Python 3.1.3, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 Next build (after a couple of weeks offline while this server was being rebuilt), Python 3.1.5, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 The timing does suggest an issue introduced in the rebuild, and the obvious difference is the version of Python jumped from 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). There were some security fixes only in Python 3.1.5, none of which sound relevant here: http://www.python.org/download/releases/3.1.5/ The change log for Python 3.1.4 is longer, and does mention stdout/stderr issues so this is perhaps the cause: hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS See also http://bugs.python.org/issue4996 as possibly related. The whole Python 3 text vs binary handle issue is important with stdout/stderr. What I am doing now is testing those two commits (with Python 3.1.5) to confirm they both fail, and thus rule out a Biopython code change in those two weeks being to blame. Peter From p.j.a.cock at googlemail.com Thu Dec 6 05:45:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:45:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: On Thu, Dec 6, 2012 at 10:24 AM, Peter Cock wrote: > On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: >> >> OK, the buildbot looks happy now from the SearchIO work.
>> >> There is one issue under Python 3.1.5 on a 64 bit Linux server, >> which I suspect is down to the Python version (this buildslave >> used to run an older version - Python 3.1.3 (separate email >> to follow). > > There are 18 test failures like this - all to do with handles and stdout, > which have been happening for a while now but I've not found time > to look into it. Example: > > ====================================================================== > ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) > needle with asis trick, output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 74, in __next__ > line = self._header > AttributeError: 'EmbossIterator' object has no attribute '_header' > > During handling of the above exception, another exception occurred: > > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", > line 571, in test_needle_piped > align = AlignIO.read(child.stdout, "emboss") > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 418, in read > first = next(iterator) > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 366, in parse > for a in i: > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 77, in __next__ > line = handle.readline() > AttributeError: '_io.FileIO' object has no attribute 'read1' > > Lasting working build, Python 3.1.3, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio > 
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 > > Next build (after a couple of weeks offline while this server was > being rebuilt), Python 3.1.5, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio > https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 > > The timing does suggest an issue introduced in the rebuild, and > the obvious difference is the version of Python jumped from > 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). > > There were some security fixes only in Python 3.1.5, none of > which sound relevant here: > http://www.python.org/download/releases/3.1.5/ > > The change log for Python 3.1.4 is longer, and does mention > stdout/stderr issues so this is perhaps the cause: > hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS > > See also http://bugs.python.org/issue4996 as possibly > related. The whole Python 3 text vs binary handle issue > is important with stdout/stderr. > > What I am doing now is testing those two commits (with > Python 3.1.5) to confirm they both fail, and thus rule out > a Biopython code change in those two weeks being to > blame. > > Peter Confirmed, using test_Emboss.py and Python 3.1.5 on this machine (running as the buildslave user using the same Python 3.1.5 installation), using the current tip 5092e0e9f2326da582158fd22090f31547679160 and the two commits mentioned above, that is e90db11f4a1d983bc2bfe12bec30edbdbb200634 and 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - all three builds show the same failure. i.e. The failure is not due to a change in Biopython between those commits, but is in some way caused by a change to the buildslave environment. My first suggestion that this is due to Python 3.1.3 -> 3.1.5 remains my prime suspect. I could try downgrading Python 3.1 on this machine to confirm that I suppose... or updating Python 3.1 on another machine? 
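A workaround sketch, based only on the traceback above (this is an assumption about the cause, not a confirmed fix): the parser appears to receive a raw `_io.FileIO` object, and `read1()` is a method that only buffered streams provide, so wrapping raw handles before parsing would restore it. The helper name is hypothetical:

```python
import io


def ensure_buffered(handle):
    """Wrap a raw binary handle (e.g. a subprocess stdout pipe) in a
    BufferedReader so it offers read1(), which raw _io.FileIO lacks.

    Buffered and text-mode handles are returned unchanged.
    """
    if isinstance(handle, io.RawIOBase):
        return io.BufferedReader(handle)
    return handle
```

Something like `AlignIO.read(ensure_buffered(child.stdout), "emboss")` would then sidestep the missing `read1()` regardless of how the interpreter set up the pipe.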
The other recent Python 3.1 buildbot runs were both using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). Can anyone else reproduce this, or have an idea what the fix might be? Regards, Peter From Leighton.Pritchard at hutton.ac.uk Thu Dec 6 07:28:39 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 6 Dec 2012 12:28:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, I'm starting to remember why I left circular labelling options alone ;) On 5 Dec 2012, at Wednesday, December 5, 16:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 4:29 PM, David Martin > wrote: label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). I still don't like 'upright' - but that's a naming issue, rather than one of functionality. label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? 'Below' and 'above' are context- (and viewer!) dependent: on a circular diagram 'above' on a feature at 12 o'clock is on the opposite side of the feature when it's 'above' at 6 o'clock. It's not clear what either would mean for a feature at 3 o'clock or 9 o'clock. 'Inside' and 'outside' are stably relative to the circular track for a feature at any position on the circle, so I prefer them as settings. 
I'm not keen on 'overlap' or 'strand', as I'm not clear what kind of label orientation they refer to: for example, what is being 'overlapped'? Looking at the .pdf, it seems like you've anchored the green labels to the track, rather than to the feature, which I think looks good there - but I'd like to have the option of track vs feature anchoring available via an argument like 'label_anchor', which could be distinguished from 'label_text_anchor'. Including this choice, my preferred arguments would be something like: label_direction='clockwise'|'anticlockwise' - 'clockwise': The text looks like it's progressing clockwise (like the green text in the .pdf); 'anticlockwise' like the blue text. By choosing 'clockwise' or 'anticlockwise' for the appropriate group of features, we achieve part of what I think you might mean by 'upright' (i.e. clockwise from pi/2 to 3pi/2, anticlockwise elsewhere). That could be handled with an 'auto' option. This argument essentially dictates label_angle for each feature: more of which later. It would be nice to have synonyms of 'counterclockwise', 'anticlockwise' and 'widdershins' ;) label_anchor='track'|'feature' Describes what element the text bounding box will be anchored to. label_text_anchor='start'|'end' Which part of the text bounding box (relative to the text) gets anchored. I think it's a good idea to have this wrap a lower-level setting that has label_text_anchor=float, as a relative location on the feature, where start=0, center=0.5, end=1, and values beyond that offer a label separation, relative to the label size - though I can't imagine why I'd use it over the option below - since spacing would depend on bounding box size - the flexibility could be useful, and you'd have to do that calculation anyway ;) label_placement='inner'|'outer' Do we anchor on the track/feature towards the circle centre (inner) or on the other side (outer)? 
I think it's a good idea to have this wrap a lower-level representation that has label_placement=float, as a relative location on the feature, where inner=-1,outer=1 as a proportion of track/feature height, and other values place the anchor relative to the feature/track boundary - this again offers a choice of label separation, but one that's uniform for all features. label_position='start'|'end'|'center' Where, relative to the feature, do we anchor? I think it's a good idea to have this wrap a lower-level representation that has label_position=[0,1], as a relative location on the feature, where start=0, center=0.5, end=1. That gives more flexibility for those who want it (and you have to do the calculation, anyway). label_orientation='radial'|'horizontal' Fairly obviously, 'radial' = as it is now, and 'horizontal' is reading like regular text. But this one's a tricky one, which is why all the labels are radial at the moment ;) I think that this choice has to either live with ('radial') or override ('horizontal') the label_direction argument. As with label_direction, this essentially dictates label_angle for each individual feature, which has its own issues (what do we measure the angle relative to? If it's relative to a common reference, then for a constant angle you get some funny-looking label patterns, and it doesn't look good in bulk. Relative to a feature-local reference, we can choose the tangent or the normal - but at what point of the feature? Really, we want that to be the tangent or normal at the anchor point of the text, so that the same angle looks consistent across all features (45deg to the normal at the start of a long feature is different to 45deg to the normal at the centre of that feature, relative to the bottom of the page: this looks weird)). 
A complicating issue here with text anchoring is what part of the text box gets anchored: depending on the font, and the string, choosing the top or bottom of the bounding box (which will include ascender and descender spaces) can look weird, so it's probably best to anchor on the midline of the text box. This avoids a problem with 'anticlockwise' vs 'clockwise' when implemented as a rotation, in that anchoring to the lower left of text, then rotating 180deg around the centre of the text box gives a different final positioning (and anchoring) than anchoring to the midline of the text box, then performing the same rotation. By appropriate choices of these settings, we can obtain pretty much any labelling style. We need to keep in mind, though, that the arguments won't be interpreted properly until the Diagram gets passed to the renderer, so 'auto' settings to achieve a particular effect with complicated combinations of arguments dependent on feature location might be better passed with draw(). As specific examples: 1) Let's say the effect we're looking for is for horizontal text, anchored to the outside of the track. Here we'd need to consider two halves of the diagram. On the left hand side we need to set label_text_anchor='end', and on the right we set label_text_anchor='start'. On both sides we set label_orientation='horizontal', label_anchor='track', label_placement='outer'. However, we need to take care with features towards the top and bottom of the image, as horizontal labels will run into each other, here. 2) Dropping the requirement for horizontal text, we can set label_orientation='radial', label_anchor='track', label_placement='outer' on both sides (maybe this should be the default?), but set label_direction='clockwise', label_text_anchor='end' on the left, and label_direction='counterclockwise', label_text_anchor='start' on the right. 
3) If we wanted to label features directly, on the appropriate side of their track, we could set label_anchor='feature' for all features, with label_placement='inner' for reverse-strand, and label_placement='outer' for forward-strand features. These are some fairly obvious standard settings which could be made available as presets in the calls to draw(), so that the fiddly details are hidden. Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. 
SC041796 From w.arindrarto at gmail.com Thu Dec 6 22:32:06 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 7 Dec 2012 04:32:06 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: > Confirmed, using test_Emboss.py and Python 3.1.5 on > this machine (running as the buildslave user using the > same Python 3.1.5 installation), using the current tip > 5092e0e9f2326da582158fd22090f31547679160 and > the two commits mentioned above, that is > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > all three builds show the same failure. > > i.e. The failure is not due to a change in Biopython > between those commits, but is in some way caused > by a change to the buildslave environment. My first > suggestion that this is due to Python 3.1.3 -> 3.1.5 > remains my prime suspect. > > I could try downgrading Python 3.1 on this machine > to confirm that I suppose... or updating Python 3.1 on > another machine? > > The other recent Python 3.1 buildbot runs were both > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > Can anyone else reproduce this, or have an idea what > the fix might be? It's reproducible on my machine: Arch Linux 64 bit running Python 3.1.5. Haven't figured out a fix yet, but trying to see if I can. By the way, I was wondering, what's our deprecation policy for Python 3.x? I saw that 3.1 was released in 2009, and there doesn't seem to be any major updates coming soon. How long should we keep supporting Python <3.2? regards, Bow From p.j.a.cock at googlemail.com Fri Dec 7 05:06:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Dec 2012 10:06:57 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout?
In-Reply-To: References: Message-ID: On Fri, Dec 7, 2012 at 3:32 AM, Wibowo Arindrarto wrote: > > > Confirmed, using test_Emboss.py and Python 3.1.5 on > > this machine (running as the buildslave user using the > > same Python 3.1.5 installation), using the current tip > > 5092e0e9f2326da582158fd22090f31547679160 and > > the two commits mentioned above, that is > > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > > all three builds show the same failure. > > > > i.e. The failure is not due to a change in Biopython > > between those commits, but is in some way caused > > by a change to the buildslave environment. My first > > suggestion that this is due to Python 3.1.3 -> 3.1.5 > > remains my prime suspect. > > > > I could try downgrading Python 3.1 on this machine > > to confirm that I suppose... or updating Python 3.1 on > > another machine? > > > > The other recent Python 3.1 buildbot runs were both > > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > > > Can anyone else reproduce this, or have an idea what > > the fix might be? > > It's reproducible in my machine: Arch Linux 64 bit running > Python3.1.5. Haven't figured out a fix yet, but trying to see if I > can. Great. We haven't really proved this is down to a change in either Python 3.1.4 or 3.1.5 but it does look likely. > > By the way, I was wondering, what's our deprecation policy for > Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't > seem to be any major updates coming soon. How long should we keep > supporting Python <3.2? As long as it doesn't cost us much effort? If we can't solve this issue easily that might be enough to drop Python 3.1? 
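If the baseline does move to Python 3.2+, the check itself is a one-liner. A purely illustrative sketch (nothing like this exists in Biopython's setup.py; it merely shows the shape of such a guard):

```python
import sys

# Illustrative only: enforce a hypothetical "Python 3 means 3.2 or later"
# policy, while leaving Python 2.x installs untouched.
if sys.version_info[0] == 3 and sys.version_info[:2] < (3, 2):
    raise RuntimeError("Python 3.0 and 3.1 are not supported; "
                       "please use Python 3.2+ (or Python 2.x).")
```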
My impression is that Python 3.0 is dead, and the only sizeable group stuck with Python 3.1 will those on Ubuntu lucid (LTS is supported through 2013 on desktops and 2015 on servers), but as with life under Python 2.x it is fairly straightforward to have a local/additional Python without disturbing the system installation. On a related note, TravisCI currently still supports Python 3.1 unofficially (we're not using this with Biopython but I've tried it with other projects), but this will be dropped soon - once they have Python 3.3 working. Since we don't yet officially support Python 3 (but we probably should soon) we have the flexibility to recommend either Python 3.2 or 3.3 as a baseline. Peter From redmine at redmine.open-bio.org Sat Dec 8 23:11:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 04:11:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. 
Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Dec 9 04:53:30 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 09:53:30 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Micha? Nowotka. That just means that bug is in save() not in load() function. But of course I will provide data file, although I can't guarantee it will be minimal. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sun Dec 9 07:13:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 12:13:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Micha? Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'w') tr = trie.trie() #fill in the trie trie.save(f, trie) Now /tmp/trie.dat.gz is about 50MB. Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Dec 10 12:39:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Dec 2012 17:39:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Micha? Nowotka. 
File minimal_data.pkl added This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Tue Dec 11 00:32:02 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 05:32:02 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. Hi Michal, Unfortunately I cannot load your minimal_data.pkl file. At list = pickle.load(f) I get ImportError: No module named django.db.models.query Can you check which item in list is actually causing the problem?
Just reduce the list until you find the item that is causing the trie.load(f) to fail. From MatatTHC at gmx.de Tue Dec 11 03:11:48 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 11 Dec 2012 09:11:48 +0100 Subject: [Biopython-dev] genetic code Message-ID: Dear biopython developers, there is a new genetic code table (24) in the NCBI resources (see NC_015649). Maybe you can update this with the next release. Would it be an idea to distribute the genetic code file from NCBI with Biopython and create the code tables on import or during installation? Then Biopython would be automatically up-to-date. Regards, Matthias From redmine at redmine.open-bio.org Tue Dec 11 04:15:22 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 09:15:22 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Hello, As I said, this is a minimal test case. That means there is no single key that causes a problem.
If you remove any of the items from the list, it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems running it, I can provide the model classes as well. From arklenna at gmail.com Tue Dec 11 11:00:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 11 Dec 2012 11:00:33 -0500 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: Hi Matthias, In a similar case, we have a file in the Scripts/ directory to download and parse the file. The generated file (and not the source file) is committed, but the script is available in the source for end users who wish to update it: https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py I think a similar situation would be appropriate here. Does Biopython currently include alternate codon tables? Cheers, Lenna On Tuesday, December 11, 2012, Matthias Bernt wrote: > Dear biopython developers, > > there is a new genetic code table (24) in the NCBI resources (see > NC_015649).
Maybe you can update this with the next release. > > Would it be an idea to distribute the genetic code file from ncbi with > biopython and create the code tables on import or during installation? Then > biopython would be automatically up-to-date. > > Regards, > Matthias > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Dec 11 13:42:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Dec 2012 18:42:13 +0000 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: On Tuesday, December 11, 2012, Lenna Peterson wrote: > Hi Matthias, > > In a similar case, we have a file in the Scripts/ directory to download and > parse the file. The generated file (and not the source file) is committed, > but the script is available in the source for end users who wish to update > it: > > > https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py > > I think a similar situation would be appropriate here. Does Biopython > currently include alternate codon tables? > > Cheers, > > Lenna Yes, see https://github.com/biopython/biopython/blob/master/Bio/Data/CodonTable.py and the parser therein. On Tuesday, December 11, 2012, Matthias Bernt wrote: > > > Dear biopython developers, > > > > there is a new genetic code table (24) in the NCBI resources (see > > NC_015649). Maybe you can update this with the next release. That seems like a good idea :) > > Would it be an idea to distribute the genetic code file from ncbi with > > biopython and create the code tables on import or during installation? > Then > > biopython would be automatically up-to-date. > > > > Regards, > > Matthias > That would just make installation more complex (and it is already complicated). I would prefer to keep setup.py as normal as possible.
The NCBI tables rarely change, so this works OK overall. Peter From redmine at redmine.open-bio.org Tue Dec 11 23:16:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 04:16:27 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django. From redmine at redmine.open-bio.org Wed Dec 12 02:56:52 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 07:56:52 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Sure, today I'll strip all django dependencies and resubmit the data set and loading code.
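The reduction strategy suggested in this thread (drop items until the failure disappears) can be sketched generically in plain Python. The `fails` predicate here is a hypothetical stand-in for the real trie save/load round trip, and all names are illustrative:

```python
def shrink(items, fails):
    """Drop items one at a time, keeping each removal only if the
    remaining list still triggers the failure."""
    i = 0
    while i < len(items):
        candidate = items[:i] + items[i + 1:]
        if fails(candidate):
            items = candidate  # still fails without this item: drop it
        else:
            i += 1  # this item is needed to reproduce the failure
    return items

# Toy failure that needs two items together, mimicking a bug where
# no single key is problematic on its own:
print(shrink(list(range(10)), lambda xs: 3 in xs and 7 in xs))  # [3, 7]
```

This greedy pass is not guaranteed to find a globally minimal set, but it usually shrinks a failing input enough to see what the remaining items have in common.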
From redmine at redmine.open-bio.org Wed Dec 12 05:04:28 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 10:04:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. File minimal_data.pkl added Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()
From redmine at redmine.open-bio.org Wed Dec 12 07:29:19 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 12:29:19 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. The problem was indeed that one of the chunks had a size of 2000. I've uploaded a fix to github; could you please give it a try? See https://github.com/biopython/biopython/commit/6e09a4a67b7dec1910b13e3d730e3a1f5c2261c9 In particular, please make sure that new_trie is identical to trie.
From redmine at redmine.open-bio.org Wed Dec 12 16:44:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 21:44:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3400] (New) Hmmer3-text parser crashes when parsing hmmscan --cut_tc files Message-ID: Issue #3400 has been reported by Kai Blin. ---------------------------------------- Bug #3400: Hmmer3-text parser crashes when parsing hmmscan --cut_tc files https://redmine.open-bio.org/issues/3400 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: I'm currently struggling with a crash in the hmmer3-text parser when dealing with files generated by hmmscan --cut_tc. I'm not quite sure what happens yet, but I have the feeling that some part of the hit parsing logic is reading into the next query without yielding a result. The backtrace is
Traceback (most recent call last):
  File "t.py", line 4, in <module>
    i = it.next()
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 317, in parse
    yield qresult
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/data/uni/biopython/Bio/File.py", line 84, in as_handle
    yield fp
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 133, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 176, in _parse_hit
    hit_list = self._create_hits(hit_attr_list, qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 239, in _create_hits
    hit_attr = hit_attrs.pop(0)
IndexError: pop from empty list
Line numbers might be a bit off as I added debug output to understand what's happening already. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From bow at bow.web.id Wed Dec 12 23:15:01 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 05:15:01 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, Thanks for the report. AB-BLAST wasn't included in the BLAST XML parser's test suite so I'm glad you spotted this :). You're proposing a bug fix, so yes, this should be included in our code. You could submit a pull request on our github page: https://github.com/biopython/biopython/pulls, or I can submit it on your behalf if you prefer not to submit it yourself. If you're not familiar with GitHub, we have a quick guide on how to use it to develop Biopython here: http://biopython.org/wiki/GitUsage. GitHub's help on how to submit pull requests is a useful read too: https://help.github.com/articles/using-pull-requests Along with the patch, a unit test on the AB-BLAST output would also be very welcome. As for the actual regex change, I was wondering, is that the only possible pattern of the BlastOutput_version tag in AB-BLAST? Do you have examples of any other version output from AB-BLAST? cheers, Bow P.S. CC-ed to the Biopython-dev mailing list On Thu, Dec 13, 2012 at 4:41 AM, Colin Archer wrote: > Hi Bow, > I have been using your implementation of the biopython BLAST > output parser but for AB-BLAST input and it has been working OK so far, > although I haven't thoroughly had a look at the speed yet.
I initially found > that the version tag (BlastOutput_version) for AB-BLAST results was slightly > different from NCBI BLAST and changed the regex you implemented to cover > both versions. The difference between them was: > > BLASTN 2.2.27+ > 3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 > 2009-11-17T18:52:53] > > > and the regex I ended up using was: > r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?' > > and here is the tested output: >>>> _RE_VERSION1 = re.compile(r'\d+\.\d+\.\d+\+?') >>>> _RE_VERSION2 = re.compile(r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?') >>>> version1 > 'BLASTN 2.2.27+' >>>> version2 > '3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 2009-11-17T18:52:53]' >>>> re.search(_RE_VERSION1, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION2, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION1, version2).group(0) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'NoneType' object has no attribute 'group' >>>> re.search(_RE_VERSION2, version2).group(0) > '3.0PE-AB' > > Would there be any chance of including this in a future release of > BioPython? > > Thanks > Colin > > From bow at bow.web.id Thu Dec 13 11:14:27 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 17:14:27 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, > From what I have seen, the version value is formatted > differently based on the edition of AB-BLAST being used: personal, > commercial etc. As I only use the personal edition, I'm not sure if the > other versions are different but I imagine that they conform to the same > format, with the version followed by the edition (for example, 3.0PE-AB for > personal edition). The regex I sent you will keep the edition so I imagine > it will work on other versions of AB-BLAST as long as the edition is > represented by "words-words". Ok then. The regex looks good.
You can probably make it more reader-friendly by separating the regex for NCBI and AB BLAST (e.g. r'(?:ncbi_blast_regex)|(?:ab_blast_regex)'). But even without this, it seems to work ok. > I'll submit a pull request as well and submit the revised regex. If you are > interested, there are a couple other differences in the XML output between > AB-BLAST and NCBI-BLAST. I can send you an example output if you would like > to have a look at it. Presently, SearchIO can't parse AB-BLAST XML output > for multiple queries as the AB-BLAST output is just a concatenation of > multiple single queries. Each query contains the section > at the beginning and causes ElementTree to error during iteration. To get > around this I have been piping the AB-BLAST output and parsing it into a > more NCBI-BLAST form. Hmm... it is a problem if AB-BLAST concatenates outputs like that. It makes the XML invalid, though, so I'm not sure if we should change the parser to tolerate this. What are the other differences? As for the example files, they would indeed be useful for unit testing (as long as they're not that big ~ less than 50K?). You can send them to me. If you're feeling it, you can also write your own unit tests using them :). Looking forward to the pull request :), Bow From p.j.a.cock at googlemail.com Thu Dec 13 12:09:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:09:59 +0000 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:14 PM, Wibowo Arindrarto wrote: >> Presently, SearchIO can't parse AB-BLAST XML output >> for multiple queries as the AB-BLAST output is just a concatenation of >> multiple single queries. Each query contains the section >> at the beginning and causes ElementTree to error during iteration. To get >> around this I have been piping the AB-BLAST output and parsing it into a >> more NCBI-BLAST form.
> > Hmm... it is a problem if AB-BLAST concatenates outputs like that. It > makes the XML invalid, though, so I'm not sure if we should change > the parser to tolerate this. What are the other differences? The older NCBI BLAST tools had this bug as well - and as a result our NCBIXML has a hack to cope with it. It might be worth applying the same kind of fix to the SearchIO BLAST XML parser as well if it would help with both AB-BLAST and any older NCBI XML files. Peter From lucas.sinclair at me.com Thu Dec 13 11:29:19 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Thu, 13 Dec 2012 17:29:19 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator Message-ID: Hi ! I'm working a lot with fasta files. They can be large (>50GB) and contain lots of sequences (>40,000,000). Often I need to get one sequence from the file. With a flat FASTA file this requires parsing, on average, half of the file before finding it. I would like to write something that solves this problem, and rather than making a new repository, I thought I could contribute to biopython. As I just wrote, the iterator nature of parsing sequence files has its limits. I was thinking of something that is indexed. And not some hack like I see sometimes where a second ".fai" file is added next to the ".fa" file. The natural thing to do is to put these entries in a SQLite file. The appraisal of such solutions is well made here: http://defindit.com/readme_files/sqlite_for_data.html Now I looked into the biopython source code, and it seems everything is based on returning a generator object which essentially has only one method: next() giving SeqRecords. For what I want to do, I would also need the get(id) method. Plus any other methods that could now be added to query the DB in a useful fashion (e.g. SELECT entry where length > 5).
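The SQLite-backed lookup described here can be sketched with the standard library alone. This is a toy illustration of the idea, not an existing Biopython API: the table name, columns, and helper functions are all made up for the sketch, and a real implementation would likely store file offsets rather than the sequences themselves:

```python
import sqlite3

def build_index(fasta_lines, db_path=':memory:'):
    """Parse FASTA lines and store id -> sequence rows in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE sequences (id TEXT PRIMARY KEY, seq TEXT)')
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                con.execute('INSERT INTO sequences VALUES (?, ?)',
                            (name, ''.join(chunks)))
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        con.execute('INSERT INTO sequences VALUES (?, ?)',
                    (name, ''.join(chunks)))
    con.commit()
    return con

def get(con, seq_id):
    """Random access by id, without re-reading the whole file."""
    row = con.execute('SELECT seq FROM sequences WHERE id = ?',
                      (seq_id,)).fetchone()
    return row[0] if row else None

con = build_index(['>a some description', 'ACGT', 'TT', '>b', 'GGG'])
print(get(con, 'a'))  # ACGTTT
```

Once built on disk, such a database supports arbitrary SQL queries (the "SELECT entry where length > 5" style of lookup) without re-parsing the FASTA file.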
I see there is a class called InterlacedSequenceIterator(SequenceIterator) that contains a __getitem__(i) method, but it's unclear how I should go about implementing that. Any help/example on how to add such a format to SeqIO ? Thanks ! Lucas Sinclair, PhD student Ecology and Genetics Uppsala University From p.j.a.cock at googlemail.com Thu Dec 13 12:40:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:40:46 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: > Hi ! > > I'm working a lot with fasta files. They can be large (>50GB) and contain > lots of sequences (>40,000,000). Often I need to get one sequence from the > file. With a flat FASTA file this requires parsing, on average, half of the > file before finding it. I would like to write something that solves this > problem, and rather than making a new repository, I thought I could > contribute to biopython. > > As I just wrote, the iterator nature of parsing sequence files has its > limits. I was thinking of something that is indexed. And not some hack like > I see sometimes where a second ".fai" file is added next to the ".fa" file. > The natural thing to do is to put these entries in a SQLite file. The > appraisal of such solutions is well made here: > http://defindit.com/readme_files/sqlite_for_data.html > > Now I looked into the biopython source code, and it seems everything is > based on returning a generator object which essentially has only one method: > next() giving SeqRecords. For what I want to do, I would also need the > get(id) method. Plus any other methods that could now be added to query the > DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is > a class called InterlacedSequenceIterator(SequenceIterator) that contains a > __getitem__(i) method, but it's unclear how I should go about > implementing that.
Any help/example on how to add such a format to SeqIO ? > > Thanks ! Have you looked at Bio.SeqIO.index (index held in memory) and Bio.SeqIO.index_db (index held in an SQLite3 database), and do they solve your needs? Note these only index the location of records - unlike tabix/fai indexes which also look at the line length to be able to pull out subsequences. This means the Bio.SeqIO indexing isn't ideal for dealing with large records where you are only interested in small subsequences. Peter From p.j.a.cock at googlemail.com Thu Dec 13 12:51:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:51:40 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> >> I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. >> Hmm - I think that entire class is obsolete and could be removed. Peter From p.j.a.cock at googlemail.com Thu Dec 13 13:54:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 18:54:04 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:51 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: >> On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >>> >>> I see there is >>> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >>> __getitem__(i) method, but it's unclear how I should go about >>> implementing that. >>> > > Hmm - I think that entire class is obsolete and could be removed. I've marked it as deprecated, but since it doesn't really have any executable code a deprecation warning doesn't seem relevant. We can probably remove this after the next release.
https://github.com/biopython/biopython/commit/316c42aad05b9de3d3b3004ec295670691ae1804 Thanks for flagging up this bit of the code, Lucas. Going further, the SequenceIterator isn't used either, and perhaps could be dropped too? We do use a similar class in AlignIO... Regards, Peter From ben at benfulton.net Thu Dec 13 21:25:47 2012 From: ben at benfulton.net (Ben Fulton) Date: Thu, 13 Dec 2012 21:25:47 -0500 Subject: [Biopython-dev] Code coverage reporting Message-ID: On my Biopython fork, I've extended the test run on Travis to create and upload a code coverage report to GitHub. I'd like to submit a pull request to put this in the main code base, but in order to do so, I need a token generated to allow uploading the file to the biopython GitHub account. Can someone work with me on that? You can view the coverage report at http://cloud.github.com/downloads/benfulton/biopython/coverage.txt Thanks! Ben Fulton From p.j.a.cock at googlemail.com Fri Dec 14 05:58:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Dec 2012 10:58:49 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Fri, Dec 14, 2012 at 10:07 AM, Lucas Sinclair wrote: > Hello, > > Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes > an index, but it is held in memory. So it must be recomputed every > time the interpreter is reloaded. Yes, that is right. > This step is wasting enough time for me that I would like to compute > the index on my 50GB file once, and then be done with it. SQLite > really is the technology of choice for such a problem... Yes, which is why Bio.SeqIO.index_db() stores the index in SQLite. The SeqIO chapter in the Tutorial does try to explain this and the advantages compared to Bio.SeqIO.index(). Have you tried this yet? > I suppose you agree storing all this sequence information in flat > ascii files is not practical.
It may not be optimal, but it is very practical (although at the scale of next generation sequencing data less so). Peter From lucas.sinclair at me.com Fri Dec 14 05:07:55 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Fri, 14 Dec 2012 11:07:55 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: Hello, Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes an index, but it is held in memory. So it must be recomputed every time the interpreter is reloaded. This step is wasting enough time for me that I would like to compute the index on my 50GB file once, and then be done with it. SQLite really is the technology of choice for such a problem... I suppose you agree storing all this sequence information in flat ascii files is not practical. Actually, I found a reasonable workaround to achieve this result with these two commands:

$ formatdb -i reads -p T -o T -n reads
$ blastdbcmd -db reads -dbtype prot -entry "105107064179" -outfmt %f -out test.fasta

But then I need to have calls to subprocess... Since I thought my first small contribution to Biopython (https://github.com/biopython/biopython/commit/1c72a63b35db70d11c628b83a0269d1a9c6443a4) was fun to do, I may still feel like writing a proper solution. Would such a thing be a welcome addition to Bio.SeqIO ? If so, where would I place it ? The schema would be a SQLite file with a single table named "sequences". This table would have columns corresponding to the attributes of a SeqRecord. But you would need to get a different type of object back from parse than a generator, you would need an object that has a __getitem__ method. Sincerely, Lucas Sinclair, PhD student Ecology and Genetics Uppsala University On 13 déc. 2012, at 18:40, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> Hi ! >> >> I'm working a lot with fasta files. They can be large (>50GB) and contain >> lots of sequences (>40,000,000).
Often I need to get one sequence from the >> file. With a flat FASTA file this requires parsing, on average, half of the >> file before finding it. I would like to write something that solves this >> problem, and rather than making a new repository, I thought I could >> contribute to biopython. >> >> As I just wrote, the iterator nature of parsing sequence files has its >> limits. I was thinking of something that is indexed. And not some hack like >> I see sometimes where a second ".fai" file is added next to the ".fa" file. >> The natural thing to do is to put these entries in a SQLite file. The >> appraisal of such solutions is well made here: >> http://defindit.com/readme_files/sqlite_for_data.html >> >> Now I looked into the biopython source code, and it seems everything is >> based on returning a generator object which essentially has only one method: >> next() giving SeqRecords. For what I want to do, I would also need the >> get(id) method. Plus any other methods that could now be added to query the >> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. Any help/example on how to add such a format to SeqIO ? >> >> Thanks ! > > Have you looked at Bio.SeqIO.index (index held in memory) and > Bio.SeqIO.index_db (index held in an SQLite3 database), and do > they solve your needs? > > Note these only index the location of records - unlike tabix/fai indexes > which also look at the line length to be able to pull out subsequences. > This means the Bio.SeqIO indexing isn't ideal for dealing with large > records where you are only interested in small subsequences.
> > Peter From w.arindrarto at gmail.com Fri Dec 14 07:48:12 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 14 Dec 2012 13:48:12 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: Hi everyone, >> It's reproducible on my machine: Arch Linux 64 bit running >> Python 3.1.5. Haven't figured out a fix yet, but trying to see if I >> can. > > Great. We haven't really proved this is down to a change in > either Python 3.1.4 or 3.1.5 but it does look likely. It's reproduced in my local 3.1.4 installation. Seems like an unfixed bug that went through to 3.1.5. >> By the way, I was wondering, what's our deprecation policy for >> Python 3.x? I saw that 3.1.5 was released in 2009, and there doesn't >> seem to be any major updates coming soon. How long should we keep >> supporting Python <3.2? > > As long as it doesn't cost us much effort? If we can't solve this > issue easily that might be enough to drop Python 3.1? Fixing this seems difficult (has anyone else tried a fix?). The _io module is built-in and compiled when Python is installed, so fixing it (I imagine) may require tweaking the C code (which requires fiddling with the actual Python installation). > My impression is that Python 3.0 is dead, and the only sizeable > group stuck with Python 3.1 will be those on Ubuntu lucid (LTS is > supported through 2013 on desktops and 2015 on servers), > but as with life under Python 2.x it is fairly straightforward > to have a local/additional Python without disturbing the system > installation. > > > Since we don't yet officially support Python 3 (but we probably > should soon) we have the flexibility to recommend > either Python 3.2 or 3.3 as a baseline. Yes. I think it may be easier and better for us to officially start supporting from Python 3.2 or 3.3 onwards.
regards, Bow From christian at brueffer.de Mon Dec 17 06:05:04 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 17 Dec 2012 19:05:04 +0800 Subject: [Biopython-dev] Biopython AlignAce Wrapper In-Reply-To: References: <50CAC1C2.9090705@brueffer.de> <50CEE193.2010003@brueffer.de> Message-ID: <50CEFC60.8020400@brueffer.de> (CC'ing biopython-dev) Thanks for the feedback. I'd propose the following plan for the AlignAce wrapper then: 1. Submit the cleanup patches I have, to give the wrapper at least a fighting chance at actually working 2. Add a BiopythonDeprecationWarning 3. Remove the wrapper after 1.61 is released (unless the situation changes, of course) Does that sound acceptable? Chris On 12/17/2012 05:25 PM, Bartek Wilczynski wrote: > Well, > > sounds like a good plan. I think the situation is hopeless: if we had > the source of AlignAce with an appropriate license we could think of > supporting it ourselves, but in this situation I guess we can only > deprecate the module and phase it out... > > best > Bartek > > On Mon, Dec 17, 2012 at 10:10 AM, Christian Brueffer > wrote: >> Hi Bartek, >> >> thanks for checking. The thing is, the "new" version is actually an >> ancient version: >> >> AlignACE version 2.3 October 27, 1998 >> >> I made it work by installing Fedora Core 3 in a VM and using >> elfstatifier to bind AlignAce and all libraries into one executable. >> It works, but I doubt it's of any use these days. >> >> I wonder whether it's better to remove the wrapper. The AlignAce >> developers are unresponsive, none of the Biopython people has a >> version, and from what I can see the current wrapper cannot possibly >> work. >> >> What do you think? >> >> Chris >> >> >> On 12/17/2012 05:01 PM, Bartek Wilczynski wrote: >>> >>> Hi, >>> >>> I've looked around and it seems I don't have it. We probably need to >>> "update" the parser to work with the current version of AlignACE >>> available from Harvard. Were you able to run it?
On my system, it >>> cannot find the libraries it needs... >>> >>> best >>> Bartek >>> >>> On Fri, Dec 14, 2012 at 7:05 AM, Christian Brueffer >>> wrote: >>>> >>>> Hi Bartek, >>>> >>>> I am currently cleaning up the Biopython AlignAce wrapper. Unfortunately >>>> I've been unable to obtain the latest AlignAce version since the >>>> download page disappeared and the Church lab is unresponsive. >>>> >>>> Do you happen to have a version of AlignAce 4.0 for Linux lying around >>>> that you could send me? >>>> >>>> Thanks a lot, >>>> >>>> Chris >>> From redmine at redmine.open-bio.org Mon Dec 17 08:49:33 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 17 Dec 2012 13:49:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3401] (New) is_terminal bug in newick trees Message-ID: Issue #3401 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3401: is_terminal bug in newick trees https://redmine.open-bio.org/issues/3401 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: Consider this weird Newick tree (((B,C),D))A; Here 'A' is both a root node and a terminal node (since it has only one child: ((B,C),D);). However, is_terminal for 'A' is False:
from Bio import Phylo
import cStringIO

bad_tree = '(((B,C),D))A'

t = Phylo.read(cStringIO.StringIO(bad_tree), 'newick')

for c in t.find_clades(terminal=True):
    print c,
Gives @B C D@ ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Tue Dec 18 07:40:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 18 Dec 2012 13:40:35 +0100 Subject: [Biopython-dev] Location Parser Message-ID: Dear list, I have some problems with the GenBank parser in version 1.60. It's again nested location strings like: order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) as found in NC_003048. What happens is that the parser stalls. It seems as if it takes forever to parse _re_complex_compound and never gets to the if statement that checks if order and join appear in the location string. I suggest moving the if statement before the regular expressions are tested. I remember that I posted something like this before, but I cannot remember how and if this was solved. Regards, Matthias From k.d.murray.91 at gmail.com Tue Dec 18 08:46:06 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Wed, 19 Dec 2012 00:46:06 +1100 Subject: [Biopython-dev] [biopython] TAIR (Arabidopsis) sequence retrieval module (#132) In-Reply-To: References: Message-ID: Hi Peter, Chris and the mailing list, Thanks very much for the feedback! > Query: It isn't clear to me (from a first read) what MultipartPostHandler is needed for. The arabidopsis.org server form requires the content-type to be a multipart form, not a urlencoded form, which the standard urllib2 does not handle.
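For readers following along: urllib2 only urlencodes POST data, so a multipart/form-data body has to be assembled by hand, which is the job MultipartPostHandler does. A rough sketch of the manual assembly — the field names in any real request would be whatever the arabidopsis.org form actually expects, not anything shown here.

```python
import uuid

def encode_multipart(fields):
    # Assemble a multipart/form-data body by hand; each field becomes a
    # boundary-delimited part with its own Content-Disposition header.
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append('--' + boundary)
        parts.append('Content-Disposition: form-data; name="%s"' % name)
        parts.append('')
        parts.append(value)
    parts.append('--' + boundary + '--')
    parts.append('')
    body = '\r\n'.join(parts)
    content_type = 'multipart/form-data; boundary=%s' % boundary
    return body, content_type
```

The (body, content_type) pair would then go into a Request with an explicit Content-Type header, instead of the default urlencoded form.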
I could write a custom handler; however, when writing the module I found MultipartPostHandler, and figured I should use that. I may be wrong, but couldn't figure out any other way of doing it. >Minor: The module's docstring should start with a one line summary then a blank line (see PEP8 style guide). >Note: Since your unit test requires internet access, it should include these lines to work nicely in our testing framework (which allows the tests needing network access to be skipped) I'll fix the module docstring and requires_internet check tomorrow. >Why does the NCBI code exist given it is such a thin wrapper round the Bio.Entrez code - the module would be a lot simpler if it was just a wrapper for www.arabidopsis.org alone. The NCBI functions exist to get genbank files for AGIs, as TAIR's sequence retrieval only gives fasta files, so if users need/want the extra metadata a genbank file gives, they can use this module. As you've said, this is a *very* thin wrapper, so would it be better to just provide the mapping dicts in Bio.TAIR._ncbi for people to use however they see fit? >Query: Why do your methods return SeqRecord objects? Is this because the handle might return FASTA with a non-FASTA header which must be stripped off? SeqRecord objects were returned for two reasons, the first being, as you said, that the raw return text is not always a valid fasta file, despite my efforts to trim extraneous text. The latter is simply that that is what I required when writing it, and I could not think of a better way of returning it. (And I thought that the return of a SeqRecord allowed "pythonic" processing of results, a la the test suite.) Again, happy for any suggestions. >Why do the classes TAIRDirect and TAIRNCBI exist? Wouldn't module level functions be simpler (or at least, consistent with other modules like Bio.Entrez) >Style: Why introduce the mode argument and two magic values NCBI_RNA and NCBI_PROTEIN? The honest answer to both of these is personal choice.
If consistency is an issue I will reimplement as module-level functions and textual arguments respectively. Regarding the placement of modules, I'm happy for it to go wherever. I would imagine that there are other niche web interface "getters" such as this, and think your suggestion sounds great, although I can't think of what we could call it. Perhaps Bio.Web.TAIR? Regards Kevin Murray On 18 December 2012 10:34, Peter Cock wrote: > Hi Kevin, > > Thanks for your code submission. I've not had a chance to play with it, > but I do have some comments/queries - some of which are perhaps just style > issues. > > Note: Since your unit test requires internet access, it should include > these lines to work nicely in our testing framework (which allows the tests > needing network access to be skipped): > > import requires_internet > requires_internet.check() > > Query: It isn't clear to me (from a first read) what MultipartPostHandler > is needed for. > > Minor: The module's docstring should start with a one line summary then a > blank line (see PEP8 style guide). > > Query: Why do the classes TAIRDirect and TAIRNCBI exist? Wouldn't module > level functions be simpler (or at least, consistent with other modules like > Bio.Entrez)? > > Query: Why do your methods return SeqRecord objects? Is this because the > handle might return FASTA with a non-FASTA header which must be stripped > off? > > Style: Why introduce the mode argument and two magic values NCBI_RNA and > NCBI_PROTEIN? > > In fact I would go further and ask why the NCBI code exists given it > is such a thin wrapper round the Bio.Entrez code - the module would be a > lot simpler if it was just a wrapper for www.arabidopsis.org alone.
> > I'm also not sure about the namespace Bio.TAIR, the old Bio.www namespace > might have been better but that was deprecated a while back, and the other > semi-natural fit under Biopython's old OBDA effort is also defunct > (attempting to catalogue a collection of sequence resources, see > http://obda.open-bio.org for background if curious). The namespace issue > at least would be worth bringing up on the dev mailing list... especially > if you can think of many other examples like this for specialised resources. > > Regards, > > Peter > From kjwu at ucsd.edu Tue Dec 18 23:25:35 2012 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 18 Dec 2012 20:25:35 -0800 Subject: [Biopython-dev] KEGG API Wrapper In-Reply-To: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi All, Sorry for the delay in updating this KEGG code. Michiel, I've addressed your suggestions regarding the querying code and the documentation and have committed changes that reflect this. ( https://github.com/kevinwuhoo/biopython/) There's a namespace collision created by the KEGG.list function, so I use KEGG.list_ instead. However, I'm sure there's a more elegant solution than this. Regarding the parsers, there should be a way to unify all parsers and writers for KEGG objects, as they list fields for all their objects here: http://www.kegg.jp/kegg/rest/dbentry.html. Each class should extend from a parent while specifying its valid fields. Parsing all files should be generalized, but there should be field-specific code to handle the different fields so that fields like genes are handled correctly and ubiquitously. After solidifying discussion on these, I'll move the tests over to unittest too. Thanks! Kevin On Thu, Oct 25, 2012 at 7:52 PM, Michiel de Hoon wrote: > Hi Kevin, > > Thanks for the documentation! That makes everything a lot clearer.
> Overall I like the querying code and I think we should add it to Biopython. > > I have a bunch of comments on the KEGG module, some on the existing code > and some on the new querying code, see below. Most of these are trivial; > some may need some further discussion. Perhaps you could let us know which > of these comments you can address, and which ones you want to skip for now? > > Once we have converged with regard to the querying code and the documentation, > I think we can import your version of the KEGG module into the main > Biopython repository and add your chapter on KEGG to the main > documentation, and continue from there on the parsers and the unit tests. > > Many thanks! > -Michiel. > > > About the querying code: > ---------------------------------- > > I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv", > KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions > KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get. > > For list, find, conv, link, and info, instead of going through > KEGG.generic_parser, I would return the result directly as a Python list. > In contrast, KEGG.get should return the handle to the results, not the > data itself. So the _q function, instead of > ... > resp = urllib2.urlopen(req) > data = resp.read() > return query_url, data > should have > ... > resp = urllib2.urlopen(req) > return resp > Then the user can decide whether to parse the data on the fly with > Bio.KEGG, or read the data line by line and pick up what they are > interested in, or to get all data from the handle and save it in a file. > Note that resp will have a .url attribute that contains the url, so you > won't need the ret_url keyword. > > About the parsers: > ------------------------ > > I think that we should drop generic_parser. For list, find, conv, link, > and info, parsing is trivial and can be done by the respective functions > directly.
For get, we already have an appropriate parser for some databases > (compound, map, and enzyme), but it's easy to add parsers for the other > databases. > > For all parsers in Biopython, there is the question whether the record > should store information in attributes (as is currently done in Bio.KEGG), > or alternatively if the record should inherit from a dictionary and store > information in keys in the dictionary. Personally I have a preference for a > dictionary, since that allows us to use the exact same keys in the > dictionary as is used in the file (e.g., we can use "CLASS" as a key, while > we cannot use .class as an attribute since it is a reserved word, so we use > .classname instead). But other Biopython developers may not agree with me, > and to some extent it depends on personal preference. > > The parsers miss some key words. The ones I noticed are ALL_REAC, > REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit > tests. > > Remove the ';' at the end of each term in record.classname. > > Convert record.genes to a dictionary for each organism. So instead of > [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON', > ['100190836', '100438793']), ('MCC', ['100424648', '699401']... > have > {'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON': > ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ... > > Also for record.dblinks, record.disease, record.structures, use a > dictionary. > > In record.pathway, all entries start with 'PATH'. Perhaps we should check > with KEGG if there could be anything else than 'PATH' there, otherwise I > don't see the reason why it's there. Assuming that there could be something > different there, I would also use a dictionary with 'PATH' as the key. > > In record.reaction, some chemical names can be very long and extend over > multiple lines. In such cases, the continuation line starts with a '$'. The > parser should remove the '$' and join the two lines. 
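The '$' continuation rule Michiel describes could be handled by a small preprocessing pass along these lines — a sketch of the suggested behaviour, not the actual Bio.KEGG parser code; whether a space belongs at the join would need checking against real records.

```python
def join_continuations(lines):
    # Merge KEGG REACTION lines: a continuation line starts with '$'
    # and belongs to the (long) chemical name on the previous line.
    joined = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('$') and joined:
            # Drop the '$' marker and append to the previous line.
            joined[-1] += stripped[1:]
        else:
            joined.append(stripped)
    return joined
```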
> > About the tests: > -------------------- > > We should update the data files in Tests/KEGG. This will fix some "bugs" > in these data files. > > We should switch test_KEGG.py to the unit test framework. > > We should do some more extensive testing to make sure we are not missing > some key words. > > About the documentation: > --------------------------------- > It's great that we now have some documentation. > > On page 233, I would suggest replacing "id_" by "accession" or > something else, since the underscore in "id_" may look funky to new users. > > Also it may be better not to reuse variable names (e.g. "pathway" is used > in three different ways in the example). It's OK of course in general, but > for this example it may be clearer to distinguish the different usages > of this variable from each other. > > For repair_genes, you can use a set instead of a list throughout. > > --- On *Wed, 10/24/12, Kevin Wu * wrote: > > > From: Kevin Wu > Subject: Re: [Biopython-dev] KEGG API Wrapper > To: "Peter Cock" , "Zachary Charlop-Powers" < > zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" > Cc: Biopython-dev at lists.open-bio.org > Date: Wednesday, October 24, 2012, 6:38 PM > > > Hi All, > > Thanks for the comments, I've written a bit of documentation on the entire > KEGG module and have attached the relevant pages to the email. There > didn't seem to be an appropriate place for examples, so I just added a new > chapter. I've also committed the updated file to github. > > I did leave out the parsers due to the fact that the current parsers only > cover a small portion of possible responses from the api. Also, I'm not > confident that some of the parsers correctly retrieve all the fields. > However, I've written a really general parser that does a rough job of > retrieving fields if it's a database format returned, since I find myself > reusing the code for all database formats.
It's possible to modify this to > correctly account for the different fields, but it would probably take a bit > of work to manually figure each field out. Otherwise it also parses the > tsv/flat file returned. > > Also, @zach, thanks for checking it out and testing it! > > Thanks All! > Kevin > > On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock > > wrote: > > On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers > > > wrote: > > Kevin, > > Michiel, > > > > I just tested Kevin's code for a few simple queries and it worked great. I > > have always liked KEGG's organization of data and really appreciate this > > RESTful interface to their data; in some ways I think it is easier to use the > > web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of > > metabolic networks is awesome. I found the examples in Kevin's test script > > to be fairly self-explanatory, but a simple, spelled-out example in the > > Tutorial would be nice. > > > > One thought, though, is that you can retrieve MANY different types of data > > from the KEGG Rest API - which means that the user will probably have to > > parse the data his/herself. Data retrieved with "list" can return lists of > > genes or compounds or organisms, and after a cursory look these are each > > formatted differently. Also true with the 'find' command. So I think you > > were right to leave out parsers because I think they will be a moving target > > highly dependent on the query. > > > > Thank You Kevin, > > zach cp > > Good point about decoupling the web API wrapper and the parsers - > how the Bio.Entrez module and Bio.TogoWS handle this is to return > handles for web results, which you can then parse with an appropriate > parser (e.g. SeqIO for GenBank files, Medline parser, etc). > > Note that this is a little more fiddly under Python 3 due to the text > mode distinction between unicode and binary... just something to > keep in the back of your mind.
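A tiny illustration of the Python 3 text/binary handle distinction Peter mentions: a binary handle yields bytes, while wrapping it gives the decoded text lines that line-oriented parsers expect. The KEGG-ish sample data here is invented for the sketch.

```python
from io import BytesIO, TextIOWrapper

# A web response handle is binary; reading it directly gives bytes.
raw = BytesIO(b"ENTRY       C00001\nNAME        H2O\n")

# Wrapping the binary handle yields decoded (unicode) text lines.
text = TextIOWrapper(raw, encoding="ascii")
for line in text:
    print(line.rstrip())
```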
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From gokcen.eraslan at gmail.com Thu Dec 20 19:12:43 2012 From: gokcen.eraslan at gmail.com (Gökçen Eraslan) Date: Fri, 21 Dec 2012 01:12:43 +0100 Subject: [Biopython-dev] numpy/matlab style index arrays for Seq objects Message-ID: <50D3A97B.60108@gmail.com> Hello, During the development of a project, I have come across an issue that I want to share. As far as I know, a Bio.Seq.Seq object can only be indexed using an int or a slice object, just as regular strings: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) However, it would be really nice to be able to index Seq objects using index arrays as in numpy.array, like >>> my_indices = [0, 3, 7] >>> my_seq[my_indices] Seq('GCG', IUPACUnambiguousDNA()) (Since I'm not really familiar with the BioPython API and codebase, please ignore/forgive me if such a thing already exists.) For example, in my project I'm trying to eliminate noisy columns of an MSA fasta file. Let's assume that I have a list of non-noisy column indices; then this would solve my problem: In [1]: from Bio import AlignIO In [2]: msa = AlignIO.read("s001.fasta", "fasta") In [3]: print msa[:, [0, 3, 4]] SingleLetterAlphabet() alignment with 5 rows and 3 columns KPG sp2 TPG sp11 SPG sp7 KPP sp6 SPG sp10 I have attached a tiny patch (~4 lines) implementing this. At first, I thought of keeping the sequence string as numpy.array(list()) to be able to use the indexing mechanism of numpy, but that would be over-engineering, so I have just used a simple list comprehension trick. Regards. -------------- next part -------------- A non-text attachment was scrubbed...
Name: biopython-index-array-for-seq.diff Type: text/x-patch Size: 3845 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 08:09:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 13:09:47 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > Dear list, > > I have some problems with the GenBank parser in version 1.60. It's again > nested location strings like: > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > as found in NC_003048. Do you have a URL for that? This looks OK to me: http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 Perhaps the entry came from the FTP site? e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > What happens is that the parser stalls. It seems as if it takes forever to > parse _re_complex_compound and never gets to the if statement that > checks if order and join appear in the location string. > > I suggest to move the if statement before the regular expressions are > tested. > > I remember that I posted something like this before. But I can not remember > how and if this was solved. > > Regards, > Matthias Where similar odd locations have come up, in some cases they did seem to be NCBI bugs - could you raise a query with the NCBI for this case please? If this is valid (which I doubt), then our object model doesn't cope. If this is invalid, then Biopython should give a warning and skip this location. Right now I can't find the file to test this (see query above about where it came from).
Regards, Peter From MatatTHC at gmx.de Fri Dec 21 10:18:45 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Fri, 21 Dec 2012 16:18:45 +0100 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: Dear Peter, you are right, the current RefSeq record is valid and can be parsed. In order to reproduce old results I keep old refseq versions (of mitochondrial genomes) on hard disk. So probably this is an old refseq bug. According to the documentation ( http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.4): """ Note : location operator "complement" can be used in combination with either "join" or "order" within the same location; combinations of "join" and "order" within the same location (nested operators) are illegal. """ Since this was urgent, I fixed the files manually by removing the nested operators. I was not able to find a file in other RefSeq versions that can reproduce the bug (i.e. the parser seemingly takes forever [>5min] and does not raise an exception). You may still reproduce the bug by pasting the location line into another GenBank file. I agree that the desired behaviour would be a warning and skipping of the feature. Regards, Matthias 2012/12/21 Peter Cock > On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > > Dear list, > > > > I have some problems with the GenBank parser in version 1.60. It's again > > nested location strings like: > > > > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > > as found in NC_003048. > > Do you have a URL for that? This looks OK to me: > http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 > > Perhaps the entry came from the FTP site? > e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > > > What happens is that the parser stalls.
It seems as if it takes forever > to > > parse _re_complex_compound in and never gets to the if statement that > > checks if order and join appears in the location string. > > > > I suggest to move the if statement before the regular expressions are > > tested. > > > > I remember that I posted something like this before. But I can not > remember > > how and if this was solved. > > > > Regards, > > Matthaas > > Were similar odd locations have come up in some cases they did > seem to be NCBI bugs - could you raise a query with the NCBI > for this case please? > > If this is valid (which I doubt), then our object model doesn't cope. > > If this is invalid, then Biopython should give a warning and skip > this location. Right now I can't find the file to test this (see > query above about where it came from). > > Regards, > > Peter > -------------- next part -------------- A non-text attachment was scrubbed... Name: NC_001326.gb Type: application/octet-stream Size: 65527 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 10:34:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 15:34:48 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:18 PM, Matthias Bernt wrote: > Dear Peter, > > you are right the current RefSeq record is valid and can be parsed. In order > to reproduce old results I keep old refseq versions (of mitochondrial > genomes) on hard disk. So probably this is an old refseq bug. ... Could you email me (not the list) the old NC_003048.gb file please? Was there a similar issue in the NC_001326.gb file you just sent? It seems to load OK for me... 
Thanks, Peter From p.j.a.cock at googlemail.com Fri Dec 21 11:13:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:13:40 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt wrote: > Dear Peter, > > it's attached (from RefSeq39). For me parsing does not finish for this file > (biopython 1.6, python 2.7.3). > > Regards, > Matthias Got it, thanks. It seems to get stuck for me too - there is a bug here :( See also: https://redmine.open-bio.org/issues/3197 Peter From p.j.a.cock at googlemail.com Fri Dec 21 11:54:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:54:38 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 4:13 PM, Peter Cock wrote: > On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt > wrote: >> Dear Peter, >> >> it's attached (from RefSeq39). For me parsing does not finish for this file >> (biopython 1.6, python 2.7.3). >> >> Regards, >> Matthias > > Got it, thanks. It seems to get stuck for me too - there is a bug here :( > > See also: https://redmine.open-bio.org/issues/3197 The problem seems to be the regular expression search itself getting stuck: $ python Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.GenBank import _re_complex_compound >>> _re_complex_compound.match("order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403)") ^CTraceback (most recent call last): File "", line 1, in KeyboardInterrupt Odd.
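One way to implement Matthias's suggestion from earlier in the thread — detect the illegal nested join/order before handing the string to the regular expression, which is where the runaway backtracking happens. This is only a sketch, not the actual Bio.GenBank fix; it leans on the feature-table rule that join and order may not nest inside each other, while complement(join(...)) remains legal.

```python
import re

_JOIN_ORDER = re.compile(r'(?:join|order)\(')

def has_nested_join_order(location):
    # complement(join(...)) is allowed, but join/order may not nest
    # inside each other; since a location string has at most one outer
    # compound operator, two or more join/order tokens imply the
    # illegal nesting that makes the full regex blow up.
    return len(_JOIN_ORDER.findall(location)) >= 2
```

Running this cheap check first lets the parser warn and skip the feature instead of disappearing into the regular expression.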
Peter From ben at bendmorris.com Mon Dec 24 11:58:19 2012 From: ben at bendmorris.com (Ben Morris) Date: Mon, 24 Dec 2012 11:58:19 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo Message-ID: Hi all, I've implemented support for two new phylogenetic tree formats: NeXML and RDF (conforming to the Comparative Data Analysis Ontology). I noticed that NeXML support was planned, but I didn't see anyone working on it on GitHub and the feature request hadn't been updated in about a year, so I went ahead and implemented a simple version. At first I tried the generateDS.py approach, but the generated writer doesn't give very much control over the output, so I ended up writing my own parser/writer using ElementTree. As for the RDF/CDAO format, AFAIK this is not a format that's supported by any other phylogenetic libraries, so I'm not sure how useful this is to everyone else. It provides a simple, standards-compliant format that can be imported to a triple store and supports annotation. We'll be using it at NESCent so I wanted to make it available to everyone else as well. The parser and writer require the Redlands Python bindings. The code is available in my fork of Biopython, https://github.com/bendmorris/biopython under branches "cdao" and "nexml." I'd love to get everyone's thoughts and see if these contributions would be a good fit for the Biopython project. 
~Ben Morris PhD student, Department of Biology University of North Carolina at Chapel Hill and the National Evolutionary Synthesis Center ben at bendmorris.com From p.j.a.cock at googlemail.com Mon Dec 24 13:05:29 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Dec 2012 18:05:29 +0000 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 4:58 PM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. Sounds good - and the librdf Redlands Python bindings do seem to be a safe choice for RDF under Python. I guess we need Eric to take a look... and some tests would be needed too. 
Thanks, Peter From eric.talevich at gmail.com Tue Dec 25 02:18:40 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 24 Dec 2012 23:18:40 -0800 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. 
I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Best, Eric From ben at bendmorris.com Fri Dec 28 10:50:02 2012 From: ben at bendmorris.com (Ben Morris) Date: Fri, 28 Dec 2012 10:50:02 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich wrote: > > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: >> >> Hi all, >> >> I've implemented support for two new phylogenetic tree formats: NeXML and >> RDF (conforming to the Comparative Data Analysis Ontology). >> >> I noticed that NeXML support was planned, but I didn't see anyone working >> on it on GitHub and the feature request hadn't been updated in about a >> year, so I went ahead and implemented a simple version. At first I tried >> the generateDS.py approach, but the generated writer doesn't give very much >> control over the output, so I ended up writing my own parser/writer using >> ElementTree. >> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by >> any other phylogenetic libraries, so I'm not sure how useful this is to >> everyone else. It provides a simple, standards-compliant format that can be >> imported to a triple store and supports annotation. We'll be using it at >> NESCent so I wanted to make it available to everyone else as well. The >> parser and writer require the Redlands Python bindings. >> >> The code is available in my fork of Biopython, >> >> https://github.com/bendmorris/biopython >> >> under branches "cdao" and "nexml." 
I'd love to get everyone's thoughts and >> see if these contributions would be a good fit for the Biopython project. > > > > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: > > - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? Great point. I rewrote it to use iterparse instead. > - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) Went ahead and did this as well. > - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Not that I'm aware of, but I'm not sure. I searched http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything. I'm going to ask some people who know more about this than I do. ~Ben From diego_zea at yahoo.com.ar Fri Dec 28 18:33:35 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:33:35 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB Message-ID: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> One of the PDB files (I have a very large dataset of PDB files and there are a lot of them generating this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb And the error output is: /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895.   PDBConstructionWarning) /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216.
PDBConstructionWarning) Traceback (most recent call last):   File "AsignarPDBaMIfile.py", line 45, in     cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1)   File "funciones_pdb.py", line 15, in contactos_CB     cadena = model[cad]   File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__     return self.child_dict[id] KeyError: 'A' How can this be fixed? P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think that the TER line can be the cause of the problem but I'm not sure): 2893   ATOM   2455  N   PHE I   8      38.110 -15.236   4.503  0.89  0.76           N 2894   TER    2456      PHE I   8 2895   HETATM 2457  O   HOH E 327      10.873  -3.134  11.448  0.89  0.01           O if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } From diego_zea at yahoo.com.ar Fri Dec 28 18:59:28 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:59:28 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB In-Reply-To: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> References: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> Message-ID: <1356739168.13594.YahooMailNeo@web140606.mail.bf1.yahoo.com> Excuse me, there is no error. Only a warning on a lot of PDBs. I confused the chain in my example :/ if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } >________________________________ > From: Diego Zea >To: "biopython-dev at biopython.org" >Sent: Friday, 28 December 2012 20:33 >Subject: [Biopython-dev] Error on Bio.PDB > >One of the PDB files (I have a very large dataset of PDB files and there are a lot of them generating this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb > >And the error output is: >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. >
PDBConstructionWarning) >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. > PDBConstructionWarning) >Traceback (most recent call last): >  File "AsignarPDBaMIfile.py", line 45, in >    cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) >  File "funciones_pdb.py", line 15, in contactos_CB >    cadena = model[cad] >  File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ >    return self.child_dict[id] >KeyError: 'A' > >How can this be fixed? > >P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think that the TER line can be the cause of the problem but I'm not sure): > >2893   ATOM   2455  N   PHE I   8      38.110 -15.236   4.503  0.89  0.76           N >2894   TER    2456      PHE I   8 >2895   HETATM 2457  O   HOH E 327      10.873  -3.134  11.448  0.89  0.01           O > >if ((dx*dp)>=(h/(2*pi))) >{ >printf("Diego Javier Zea\n"); >} >_______________________________________________ >Biopython-dev mailing list >Biopython-dev at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Sun Dec 30 07:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 30 Dec 2012 12:46:35 +0000 Subject: [Biopython-dev] [Biopython - Feature #3388] add annotation and letter_annotations attributed for Bio.Align.MultipleSeqAlignment. object References: Message-ID: Issue #3388 has been updated by Peter Cock.
Support for a generic annotation dictionary done, https://github.com/biopython/biopython/commit/793f9210696e0acc9606faeca3d6ca47a9d97813 Started work on per-column annotation as well - currently on this branch: https://github.com/peterjc/biopython/tree/per-column-annotation ---------------------------------------- Feature #3388: add annotation and letter_annotations attributed for Bio.Align.MultipleSeqAlignment. object https://redmine.open-bio.org/issues/3388 Author: saverio vicario Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: At the moment I could not add annotation at the alignment level. Annotation could be useful for tracking info linked to the loci (i.e. name of domain), while letter annotation could be useful to track the quality score of the alignment or whether the sites belong to a given character set. In particular, when two alignments are merged it would be useful that the boundary of the merge is tracked; for example, in the letter annotation of the merge of an alignment a with 10 sites and b with 5 sites the letter_annotations would be as follows: {locus1:'111111111100000',locus2:'000000000011111'} This could also be useful to annotate the 3 positions of codons: {pos1:'1001001001',pos2:'0100100100', pos3:'0010010010'} If this letter_annotation were supported, the annotation could be kept across merging and splitting of the alignment -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Mon Dec 3 11:36:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 11:36:16 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 11:22 AM, Wibowo Arindrarto wrote: > Hi Peter, > >> I've just refactored the code in order to avoid most of the >> index duplication (including SQLite backend) between the >> SeqIO and new SearchIO index and index_db functions. > > Thanks :). I remember I did change some of the variable names. Basically I moved the core SeqIO indexing code into Bio.File, generalised it enough to work for SearchIO as well, then removed the SearchIO indexing code. > Other than this, the biggest change is probably related to the > Indexer classes lazy loading in SearchIO. But it seems to have > been handled as well :). Yes, the SearchIO indexing is still calling your lazy loading function to get the parser objects. >> In the short term at least, the common code is now part >> of Bio/File.py (but remains as private classes). That >> seemed neater than introducing a new private module. > > Looks like a good place for now, Bio.File as the location for > common file-handling code. That was my thinking too. >> Fingers crossed everything is fine on the buildslaves, >> TravisCI seems happy.
Bow, if you find I've broken >> anything then we need more unit tests ;) > > Will keep that in mind :). *Grin* I've just done a base class for the random access proxy classes, potentially a little more refactoring to follow here (or renaming): https://github.com/biopython/biopython/commit/9721cd00b5662309456c3dc573642cbb88e4e0a1 Peter From christian at brueffer.de Mon Dec 3 12:46:23 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 03 Dec 2012 20:46:23 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup Message-ID: <50BC9F1F.4090904@brueffer.de> Hi, I just submitted pull request #102 which fixes several types of PEP8 warnings (found using the awesome pep8 tool). Here's what's left after those fixes: $ pep8 --statistics -qq repos/biopython 789 E111 indentation is not a multiple of four 673 E121 continuation line indentation is not a multiple of four 693 E122 continuation line missing indentation or outdented 171 E123 closing bracket does not match indentation of opening bracket's line 86 E124 closing bracket does not match visual indentation 49 E125 continuation line does not distinguish itself from next logical line 197 E126 continuation line over-indented for hanging indent 575 E127 continuation line over-indented for visual indent 1092 E128 continuation line under-indented for visual indent 773 E201 whitespace after '(' 540 E202 whitespace before ')' 23543 E203 whitespace before ':' 55 E211 whitespace before '(' 180 E221 multiple spaces before operator 59 E222 multiple spaces after operator 5848 E225 missing whitespace around operator 6517 E231 missing whitespace after ',' 2544 E251 no spaces around keyword / parameter equals 644 E261 at least two spaces before inline comment 346 E262 inline comment should start with '# ' 156 E301 expected 1 blank line, found 0 1838 E302 expected 2 blank lines, found 1 364 E303 too many blank lines (2) 15553 E501 line too long (82 > 79 characters) 857 E502 the backslash is redundant between brackets 291 E701 
multiple statements on one line (colon) 122 E711 comparison to None should be 'if cond is None:' 3707 W291 trailing whitespace 1913 W293 blank line contains whitespace I'm not sure where to go from here with regard to what's worth fixing and what would be considered repo churn (or gratuitous changes that make merging of existing patches harder). I'd especially like to clean up E301, E302, E701, E711, W291 and W293. Other items like E251 are more dubious, as some developers seem to prefer the current style. What do you think? Chris From p.j.a.cock at googlemail.com Mon Dec 3 13:34:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 13:34:52 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BC9F1F.4090904@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer wrote: > Hi, Hi Christian, Thanks for all the pull requests sorting out issues like this, in terms of lines of code you'll probably be one of the top contributors to the next release ;) This sort of work isn't as high profile as new features or bug fixes, but has a more subtle role in the long term of the project - making our code easier to follow etc. So we do appreciate these contributions. > I just submitted pull request #102 which fixes several types of PEP8 > warnings (found using the awesome pep8 tool). 101 not 102? https://github.com/biopython/biopython/pull/101 > Here's what's left after those fixes: > > $ pep8 --statistics -qq repos/biopython > 789 E111 indentation is not a multiple of four That's nasty - although I think we've got rid of all the tabbed indentation already which was also very annoying. > 673 E121 continuation line indentation is not a multiple of four I suspect many of those are a style judgement and done that way to line up parentheses etc. 
> 693 E122 continuation line missing indentation or outdented > 171 E123 closing bracket does not match indentation of opening bracket's > line > 86 E124 closing bracket does not match visual indentation > 49 E125 continuation line does not distinguish itself from next logical > line > 197 E126 continuation line over-indented for hanging indent > 575 E127 continuation line over-indented for visual indent > 1092 E128 continuation line under-indented for visual indent > 773 E201 whitespace after '(' > 540 E202 whitespace before ')' > 23543 E203 whitespace before ':' > 55 E211 whitespace before '(' I'd like to see E201, E202, and E211 fixed (whitespace next to parentheses). The count for E203 is surprisingly high - I suspect that could include some large dictionaries? Note some of the dictionaries are auto-generated so the code to do that would also need fixing. > 180 E221 multiple spaces before operator > 59 E222 multiple spaces after operator > 5848 E225 missing whitespace around operator > 6517 E231 missing whitespace after ',' > 2544 E251 no spaces around keyword / parameter equals > 644 E261 at least two spaces before inline comment > 346 E262 inline comment should start with '# ' > 156 E301 expected 1 blank line, found 0 > 1838 E302 expected 2 blank lines, found 1 > 364 E303 too many blank lines (2) > 15553 E501 line too long (82 > 79 characters) > 857 E502 the backslash is redundant between brackets Fixing E502 seems a good idea, I suspect many of these are purely accidental due to not realising when they are redundant. > 291 E701 multiple statements on one line (colon) > 122 E711 comparison to None should be 'if cond is None:' > 3707 W291 trailing whitespace > 1913 W293 blank line contains whitespace > > I'm not sure where to go from here with regard to what's worth fixing and > what would be considered repo churn (or gratuitous changes that make > merging of existing patches harder). 
> > I'd especially like to clean up E301, E302, E301 and E302 presumable are about the recommended spacing between function, class and method names? If you want to fix them next that seems low risk in terms of complicating merges. > ... E701, E711, W291 and W293. Did you already fix most of those in today's pull request? https://github.com/biopython/biopython/pull/101 If there are more cases, then by all means fix them too. > Other items like E251 are more dubious, as some developers > seem to prefer the current style. > > What do you think? We have a range of styles in the current code base reflecting different authors - and also changes in the Python conventions as some of the code is now over ten years old. And if any of my personal coding style is flagged, I'm willing to adapt ;) (e.g. I've learnt not to put a space before if statement colons) As you point out, the "repo churn" from fixing minor things like spaces around operators does have a cost in making merges a little harder. Things like the exception style updates which you've already fixed (seems I missed some) are more urgent for Python 3 support, so worth doing anyway. You've got us a lot closer to PEP8 compliance - do you think subject to a short white list of known cases (like module names) where we don't follow PEP8 we could aim to run a a pep8 tool automatically (e.g. as a unit test, or even a commit hook)? That is quite appealing as a way to spot any new code which breaks the style guidelines... Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 14:02:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 14:02:40 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? 
In-Reply-To: References: Message-ID: On Mon, Nov 26, 2012 at 1:49 PM, Peter Cock wrote: > > Once that's done there is some housekeeping to do, like > the indexing code duplication with Bio.SeqIO, and tackling > indexing BGZF compressed files with Bio.SearchIO which > I will have a go at. > I've started work on SearchIO indexing of BGZF files now, enabling it was quite simple (the same code as used for the SeqIO indexing): https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f Thus far I've only tested this with BLAST XML, but that did require a bit of reworking to avoid doing file offset arithmetic: https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 I will resume this work later this afternoon, going over all the SearchIO file formats one by one. Regards, Peter From p.j.a.cock at googlemail.com Mon Dec 3 16:49:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Dec 2012 16:49:47 +0000 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: On Mon, Dec 3, 2012 at 2:02 PM, Peter Cock wrote: > > I've started work on SearchIO indexing of BGZF files now, > enabling it was quite simple (the same code as used for > the SeqIO indexing): > https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f > > Thus far I've only tested this with BLAST XML, but that did > require a bit of reworking to avoid doing file offset arithmetic: > https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 > > I will resume this work later this afternoon, going over all > the SearchIO file formats one by one. I've refactored test_SearchIO_index.py to make adding additional get_raw tests easier. Proper testing of all the formats with BGZF will need some larger test files (over 64k before compression) which we probably don't want to include in the repository.
However, I also added code to additionally test Bio.SearchIO.index_db(...).get_raw(...) as well as your original testing of Bio.SearchIO.index(...).get_raw(...) alone. These should return the exact same string, and that is now working nicely for BLAST XML (and BGZF from limited testing), but not on all the formats. Could you look at the difference in get_raw and the record length found during indexing for: blast-tab (with comments), hmmscan3-domtab, hmmer3-tab, and hmmer3-text? i.e. Anything where test_SearchIO_index.py is now printing a WARNING line when run. Thanks, Peter From christian at brueffer.de Mon Dec 3 17:02:31 2012 From: christian at brueffer.de (Christian Brueffer) Date: Tue, 04 Dec 2012 01:02:31 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> Message-ID: <50BCDB27.7040402@brueffer.de> On 12/3/12 21:34, Peter Cock wrote: > On Mon, Dec 3, 2012 at 12:46 PM, Christian Brueffer > wrote: >> Hi, > > Hi Christian, > > Thanks for all the pull requests sorting out issues like this, in > terms of lines of code you'll probably be one of the top > contributors to the next release ;) This sort of work isn't as > high profile as new features or bug fixes, but has a more > subtle role in the long term of the project - making our code > easier to follow etc. So we do appreciate these contributions. > >> I just submitted pull request #102 which fixes several types of PEP8 >> warnings (found using the awesome pep8 tool). > > 101 not 102? https://github.com/biopython/biopython/pull/101 > 102 and 103 (I actually meant 103). >> Here's what's left after those fixes: >> >> $ pep8 --statistics -qq repos/biopython >> 789 E111 indentation is not a multiple of four > > That's nasty - although I think we've got rid of all the tabbed > indentation already which was also very annoying. > Some code uses two spaces etc, definitely worth fixing.
>> 673 E121 continuation line indentation is not a multiple of four > > I suspect many of those are a style judgement and done that > way to line up parentheses etc. > I'll see about those and apply case by case judgement. >> 693 E122 continuation line missing indentation or outdented >> 171 E123 closing bracket does not match indentation of opening bracket's >> line >> 86 E124 closing bracket does not match visual indentation >> 49 E125 continuation line does not distinguish itself from next logical >> line >> 197 E126 continuation line over-indented for hanging indent >> 575 E127 continuation line over-indented for visual indent >> 1092 E128 continuation line under-indented for visual indent >> 773 E201 whitespace after '(' >> 540 E202 whitespace before ')' >> 23543 E203 whitespace before ':' >> 55 E211 whitespace before '(' > > I'd like to see E201, E202, and E211 fixed (whitespace next to > parentheses). > > The count for E203 is surprisingly high - I suspect that > could include some large dictionaries? Note some of the > dictionaries are auto-generated so the code to do that > would also need fixing. > >> 180 E221 multiple spaces before operator >> 59 E222 multiple spaces after operator >> 5848 E225 missing whitespace around operator >> 6517 E231 missing whitespace after ',' >> 2544 E251 no spaces around keyword / parameter equals >> 644 E261 at least two spaces before inline comment >> 346 E262 inline comment should start with '# ' >> 156 E301 expected 1 blank line, found 0 >> 1838 E302 expected 2 blank lines, found 1 >> 364 E303 too many blank lines (2) >> 15553 E501 line too long (82 > 79 characters) >> 857 E502 the backslash is redundant between brackets > > Fixing E502 seems a good idea, I suspect many of these are > purely accidental due to not realising when they are redundant. > Agreed. 
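As a side note on what E502 flags, here is a toy example of mine (not taken from the Biopython code base):

```python
# E502: inside brackets Python already continues the line, so a trailing
# backslash is redundant (though still legal).
total = (1 + 2 + \
         3)          # flagged as E502 by the pep8 tool

total = (1 + 2 +
         3)          # same value, preferred style

print(total)  # -> 6
```

Removing the backslash changes nothing about how the expression parses, which is why these fixes are low risk.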
>> 291 E701 multiple statements on one line (colon) >> 122 E711 comparison to None should be 'if cond is None:' >> 3707 W291 trailing whitespace >> 1913 W293 blank line contains whitespace >> >> I'm not sure where to go from here with regard to what's worth fixing and >> what would be considered repo churn (or gratuitous changes that make >> merging of existing patches harder). >> >> I'd especially like to clean up E301, E302, > > E301 and E302 presumable are about the recommended spacing > between function, class and method names? If you want to fix > them next that seems low risk in terms of complicating merges. > That and spacing between functions or between a function and a new class. >> ... E701, E711, W291 and W293. > > Did you already fix most of those in today's pull request? > https://github.com/biopython/biopython/pull/101 > > If there are more cases, then by all means fix them too. > I fixed some in Nexus, that was before actually using the pep8 tool. >> Other items like E251 are more dubious, as some developers >> seem to prefer the current style. >> >> What do you think? > > We have a range of styles in the current code base reflecting > different authors - and also changes in the Python conventions > as some of the code is now over ten years old. And if any of > my personal coding style is flagged, I'm willing to adapt ;) > > (e.g. I've learnt not to put a space before if statement colons) > > As you point out, the "repo churn" from fixing minor things > like spaces around operators does have a cost in making > merges a little harder. Things like the exception style updates > which you've already fixed (seems I missed some) are more > urgent for Python 3 support, so worth doing anyway. > On the other hand, it's basically a one-time cost. However I want to fix the lowest-hanging fruit (read: the ones with the lowest counts ;-) first. 
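Peter's earlier idea of running a style check automatically (as a unit test or commit hook) could be sketched with just the standard library; this is a hedged toy version of mine covering only two of the warnings listed above, standing in for the real pep8 tool:

```python
# Minimal sketch of "style check as a unit test": scan source lines for
# two easy PEP8 violations (the real pep8 tool checks far more codes).
import re

CHECKS = {
    "W291 trailing whitespace": re.compile(r"\S[ \t]+$"),
    "W293 blank line contains whitespace": re.compile(r"^[ \t]+$"),
}

def style_violations(source):
    """Yield (lineno, code) tuples for each violating line."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for code, pattern in CHECKS.items():
            if pattern.search(line):
                yield lineno, code

code = "def f():  \n    \n    return 1\n"
print(sorted(style_violations(code)))
# -> [(1, 'W291 trailing whitespace'), (2, 'W293 blank line contains whitespace')]
```

Wrapped in a unit test that asserts an empty result over the source tree, a check like this (or the pep8 tool itself, with a whitelist of ignored codes) would flag new style regressions automatically.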
> You've got us a lot closer to PEP8 compliance - do you think > subject to a short white list of known cases (like module > names) where we don't follow PEP8 we could aim to run a > a pep8 tool automatically (e.g. as a unit test, or even a commit > hook)? That is quite appealing as a way to spot any new code > which breaks the style guidelines... > Having a commit hook would be ideal (maybe with a possibility to override). This would be especially useful against the introduction of gratuitous whitespace. With some editors/IDEs you don't even notice it. Chris From w.arindrarto at gmail.com Tue Dec 4 13:33:32 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 4 Dec 2012 14:33:32 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi Peter and everyone, >> I've started work on SearchIO indexing of BGZF files now, >> enabling it was quite simple (the same code as used for >> SeqIO the indexing): >> https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f >> >> Thus far I've only tested this with BLAST XML, but that did >> require a bit of reworking to avoid doing file offset arithmetic: >> https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5 >> >> I will resume this work later this afternoon, going over all >> the SearchIO file formats one by one. Yes, the original one that I wrote did have some less straightforward arithmetic as I was trying to adhere to the strict XML definition (i.e. no matter the whitespace outside of the start and end elements, indexing will still work). But line-based indexing should work too (and is simpler) so long as BLAST XML keeps its style (and any user modification afterwards doesn't introduce any wacky whitespaces). > I've refactored test_SearchIO_index.py to make adding > additional get_raw tests easier. 
Proper testing of all the > formats with BGZF will need some larger test files (over 64k > before compression) which we probably don't want to > include in the repository. > > However, I also added code to additionally test > Bio.SearchIO.index_db(...).get_raw(...) as well as your > original testing of Bio.SearchIO.index(...).get_raw(...) > alone. These should return the exact same string, and > that is now working nicely for BLAST XML (and BGZF > from limited testing), but not on all the formats. > > Could you look at the difference in get_raw and the > record length found during indexing for: blast-tab > (with comments), hmmscan3-domtab, hmmer3-tab, > and hmmer3-text? > > i.e. Anything where test_SearchIO_index.py is now > printing a WARNING line when run. Sure :). Based on a quick initial look, it seems that these are due to filler texts (e.g. the BLAST tab format ending with lines like "# BLAST processed 3 queries"). These texts won't affect the calculation results and the values of our objects, but do add additional text length. 
>>> it = SearchIO.parse('../broken.hsr', 'hmmer3-text')
>>> i = it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SearchIO/__init__.py", line 313, in parse
    for qresult in generator:
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 60, in __iter__
    for qresult in self._parse_qresult():
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 145, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 188, in _parse_hit
    hit_list = self._create_hits(hit_list, qid)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 309, in _create_hits
    self._parse_aln_block(hid, hit.hsps)
  File "Bio/SearchIO/HmmerIO/hmmer3_text.py", line 358, in _parse_aln_block
    frag.query = aliseq
  File "Bio/SearchIO/_model/hsp.py", line 816, in _query_set
    self._query = self._set_seq(value, 'query')
  File "Bio/SearchIO/_model/hsp.py", line 784, in _set_seq
    len(seq), seq_type))
ValueError: Sequence lengths do not match. Expected: 202 (hit); found: 131 (query).
See the attached file broken.hsr for a dataset that triggers the error. If you remove the esterase hit (including the domain annotation), this error does not happen (broken2.hsr). If you insert fake position information into the query sequence line (broken3.hsr), the parser is happy again. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Dec 5 06:46:20 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 07:46:20 +0100 Subject: [Biopython-dev] SearchIO, was: PEP8 lower case module names? In-Reply-To: References: Message-ID: Hi everyone, >> However, I also added code to additionally test >> Bio.SearchIO.index_db(...).get_raw(...) as well as your >> original testing of Bio.SearchIO.index(...).get_raw(...) >> alone. These should return the exact same string, and >> that is now working nicely for BLAST XML (and BGZF >> from limited testing), but not on all the formats. >> >> Could you look at the difference in get_raw and the >> record length found during indexing for: blast-tab >> (with comments), hmmscan3-domtab, hmmer3-tab, >> and hmmer3-text? >> >> i.e. Anything where test_SearchIO_index.py is now >> printing a WARNING line when run. > > Sure :). Based on a quick initial look, it seems that these are due to > filler texts (e.g. the BLAST > tab format ending with lines like "# BLAST processed 3 queries"). > These texts won't affect the calculation results and the values of our > objects, but does add additional text length. I've looked into this and submitted a pull request to fix the issues here: https://github.com/biopython/biopython/pull/111. The details on the errors are also there. 
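To make the length mismatch concrete, here is a toy sketch of splitting a commented blast-tab stream into per-query raw blocks; the function name and sample data are invented for illustration, not the actual Bio.SearchIO indexing code. The point is that the trailing "# BLAST processed ..." summary line has to be attached to the final record if get_raw is to round-trip the file exactly:

```python
SAMPLE = (
    "# BLASTN 2.2.26+\n"
    "# Query: q1\n"
    "q1\ts1\t100.00\n"
    "# BLASTN 2.2.26+\n"
    "# Query: q2\n"
    "q2\ts2\t99.00\n"
    "# BLAST processed 2 queries\n"
)


def raw_blocks(text):
    """Split commented blast-tab text into per-query raw strings.

    A new block starts at each '# BLAST<program>' header line; the trailing
    '# BLAST processed ...' summary is kept with the last block, so the
    blocks concatenate back to the original text.
    """
    blocks, current = [], []
    for line in text.splitlines(True):  # True keeps the line endings
        if line.startswith("# BLAST") and "processed" not in line and current:
            blocks.append("".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("".join(current))
    return blocks
```

With this convention the per-record lengths sum exactly to the file length, which is the invariant the WARNING lines in test_SearchIO_index.py were flagging.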
regards, Bow From kai.blin at biotech.uni-tuebingen.de Wed Dec 5 07:24:14 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Wed, 05 Dec 2012 17:24:14 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. Message-ID: <50BEF69E.2000806@biotech.uni-tuebingen.de> Hi folks, I'm trying to finally get my hmmer2-text parser in, but I'm failing one unit test. The code is a bit too smart for me, it seems. So in the file I'm parsing, I only ever get the description of the hit in the hit table, like this (apologies if my mail client breaks this):

Model         Description                            Score    E-value  N
--------      -----------                            -----    ------- ---
Glu_synthase  Conserved region in glutamate synthas  858.6   3.6e-255  2

But of course I can't create a hit object when parsing the hit table, as I first need to have HSPFragments to create the hit object with. Anyway, I create a placeholder hit object that I'll later convert into a real Hit object. In that placeholder object, I set a description. Now I'm parsing the HSP table, looking like this:

Model     Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
--------  ------- ----- -----    ----- -----      -----  -------
GATase_2    1/1      34   404 ..     1   385 []   731.8 3.9e-226

The HSP table is in a different order than the hit table, so never mind the different model name. Now, I need to create an HSPFragment with the same description as the Hit object, or querying for the Hit object's description will cascade through the HSPs and HSPFragments, and return multiple values for the description. However, no matter what I do, I seem to get an empty description tossed in there somehow. The parser is at https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py the test code is at https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py and the test file that's failing is the hmmpfam2.3 file at https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out Any pointers would be appreciated. 
The code is working fine in my current development work in general, and I'd love to get it upstream to get rid of an extra patch step during installation. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Wed Dec 5 11:41:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 11:41:05 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto wrote: > Hi everyone, > > I've done some digging around to see how to deal with these issues. Here's what I found: > >> The BuildBot flagged two new issues overnight, >> http://testing.open-bio.org/biopython/tgrid >> >> Python 2.5 on Windows - doctests are failing due to floating point decimal place >> differences in the exponent (down to C library differences, something fixed in >> later Python releases). Perhaps a Python 2.5 hack is the way to go here? >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio > > I've submitted a pull request to fix this here: > https://github.com/biopython/biopython/pull/98 The Windows detection wasn't quite right, it should now match how we look for Windows elsewhere in Biopython: https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 >> There is a separate cross-platform issue on Python 3.1, "TypeError: >> invalid event tuple" again with XML parsing. Curiously this had started >> a few days back in the UniprotIO tests on one machine, pre-dating the >> SearchIO merge. I'm not sure what triggered it. 
>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio > > As for this one, it seems that it's caused by a bug in Python 3.1 > (http://bugs.python.org/issue9257) due to the way > `xml.etree.cElementTree.iterparse` accepts the `events` argument. Ah - I remember that bug now, we have a hack in place elsewhere to try and avoid that - seems it won't be fixed in Python 3.1.x now so I've relaxed the version check here: https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e Hopefully that will bring the buildbot back to all green tonight. (TravisCI has now dropped their Python 3.1 support, but they should have Python 3.3 with NumPy working soon). Peter From p.j.a.cock at googlemail.com Wed Dec 5 14:16:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 14:16:43 +0000 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BCDB27.7040402@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer wrote: >> As you point out, the "repo churn" from fixing minor things >> like spaces around operators does have a cost in making >> merges a little harder. Things like the exception style updates >> which you've already fixed (seems I missed some) are more >> urgent for Python 3 support, so worth doing anyway. >> > > On the other hand, it's basically a one-time cost. However I > want to fix the lowest-hanging fruit (read: the ones with the > lowest counts ;-) first. The sheer number of files touched in these PEP8 fixes would probably deserve to be called "repository churn" now - wow! 
Although we have good test coverage, it isn't complete (anyone fancy trying some test coverage measuring tools like figleaf?) so there is a small but real risk we've accidentally broken something. I'm wondering if therefore a 'beta' release would be prudent, or if I am just worrying about things too much? >> You've got us a lot closer to PEP8 compliance - do you think >> subject to a short white list of known cases (like module >> names) where we don't follow PEP8 we could aim to run a >> pep8 tool automatically (e.g. as a unit test, or even a commit >> hook)? That is quite appealing as a way to spot any new code >> which breaks the style guidelines... > > Having a commit hook would be ideal (maybe with a possibility to > override). This would be especially useful against the introduction of > gratuitous whitespace. With some editors/IDEs you don't even notice it. Would you be interested in looking into how to set that up? Presumably a client-side git hook would be best, but we'd need to explore cross platform issues (e.g. developing and testing on Windows) and making sure it allowed an override on demand (where the developer wants/needs to ignore a style warning). Thanks, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 13:50:21 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 13:50:21 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer Message-ID: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. 
upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse. This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? ..d The University of Dundee is a registered Scottish Charity, No: SC015096 From christian at brueffer.de Wed Dec 5 15:28:19 2012 From: christian at brueffer.de (Christian Brueffer) Date: Wed, 05 Dec 2012 23:28:19 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: <50BF6813.4070102@brueffer.de> On 12/5/12 22:16 , Peter Cock wrote: > On Mon, Dec 3, 2012 at 5:02 PM, Christian Brueffer > wrote: >>> As you point out, the "repo churn" from fixing minor things >>> like spaces around operators does have a cost in making >>> merges a little harder. Things like the exception style updates >>> which you've already fixed (seems I missed some) are more >>> urgent for Python 3 support, so worth doing anyway. >>> >> >> On the other hand, it's basically a one-time cost. However I >> want to fix the lowest-hanging fruit (read: the ones with the >> lowest counts ;-) first. > > The sheer number of files touched in these PEP8 fixes would > probably deserve to be called "repository churn" now - wow! > I wonder whether there's a file left I haven't touched yet (except the data files in Tests)... > Although we have good test coverage, it isn't complete (anyone > fancy trying some test coverage measuring tools like figleaf?) > so there is a small but real risk we've accidentally broken > something. I'm wondering if therefore a 'beta' release would > be prudent, or if I am just worrying about things too much? > It certainly can't hurt to advise users to have an extra eye on possible regressions and strange behaviours in existing code. 
I think the only risky changes were the ones concerning indentation (f68d334b1edfd743fe8a7bb4654046295f0ff939); I was extra careful about those. So, I'm pretty confident I haven't screwed things up, but it's good to be careful. FYI, here's the "pep8 --statistics -qq" output as of commit df4f12965a2ad3b6ed31bbf9d201bd5c716bd4ee:

  680 E121 continuation line indentation is not a multiple of four
  691 E122 continuation line missing indentation or outdented
  171 E123 closing bracket does not match indentation of opening bracket's line
   86 E124 closing bracket does not match visual indentation
  197 E126 continuation line over-indented for hanging indent
  601 E127 continuation line over-indented for visual indent
 1072 E128 continuation line under-indented for visual indent
  772 E201 whitespace after '('
  536 E202 whitespace before ')'
23444 E203 whitespace before ':'
   94 E221 multiple spaces before operator
   11 E222 multiple spaces after operator
 5763 E225 missing whitespace around operator
 6519 E231 missing whitespace after ','
 2542 E251 no spaces around keyword / parameter equals
  622 E261 at least two spaces before inline comment
  347 E262 inline comment should start with '# '
 1044 E302 expected 2 blank lines, found 1
    1 E303 too many blank lines (2)
15526 E501 line too long (82 > 79 characters)
    3 E711 comparison to None should be 'if cond is None:'
   75 W291 trailing whitespace
   12 W293 blank line contains whitespace
    5 W601 .has_key() is deprecated, use 'in'

E203 looks scary, but 9900 of those are in Bio/SubsMat/MatrixInfo.py alone. >>> You've got us a lot closer to PEP8 compliance - do you think >>> subject to a short white list of known cases (like module >>> names) where we don't follow PEP8 we could aim to run a >>> pep8 tool automatically (e.g. as a unit test, or even a commit >>> hook)? That is quite appealing as a way to spot any new code >>> which breaks the style guidelines... >> >> Having a commit hook would be ideal (maybe with a possibility to >> override). 
This would be especially useful against the introduction of >> gratuitous whitespace. With some editors/IDEs you don't even notice it. > Would you be interested in looking into how to set that up? > Presumably a client-side git hook would be best, but we'd > need to explore cross platform issues (e.g. developing and > testing on Windows) and making sure it allowed an override > on demand (where the developer wants/needs to ignore a > style warning). > Yes, it's fairly high on my TODO list. Chris From p.j.a.cock at googlemail.com Wed Dec 5 15:57:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 15:57:44 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 1:50 PM, David Martin wrote: > Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. > > I'd like to modify the CircularDrawer feature drawing to allow the following: > > label_position: start|middle|end as per LinearDrawer I would find it natural if we treated start/middle/end from the point of view of the feature (and its strand) as in the LinearDrawer. However the current circular drawer tries to position things at the vertical bottom of the feature (it cares about the left and right halves of the circle) which is rather different. 
> label_placement: inside|outside|overlap where inside and outside are > anchored just inside and just outside the feature but do not overlap it, > and overlap is the current behaviour If I have understood your intended meaning, that won't work nicely with stranded features. I would suggest two options: outside (i.e. outside the feature's bounding box, either outside the track circle for forward strand or strand-less, or inside the track circle for reverse strand) matching the current linear code, or inside matching the current circular code. i.e. This would essentially toggle the text element's anchoring between start/end. i.e. Maintain the convention that labels above/outside the track are for the forward strand (and strand-less) features, while labels below/inside the track are for reverse strand features. > label_orientation: upright|circular which determines the orientation of > the label. upright is the current behaviour. Circular would be oriented > to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). > This will cause some issues with track widths (how can you specify a > track width for a feature track?) Do you mean how to allocate more white space between the tracks to ensure the labels have a clear background if printed outside the features? The quick and dirty solution is a spacer track (you can allocate track numbers to leave a gap). > Any thoughts/suggestions? > Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week). 
Regards, Peter From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 16:28:26 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:28:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On 5 Dec 2012, at Wednesday, December 5, 15:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 1:50 PM, David Martin > wrote: label_position: start|middle|end as per LinearDrawer I am suggesting a break in backwards compatibility (old code would still run but put the labels in different places) but for large circular diagrams the difference should be minor - and I think it would be an overall improvement. Yep - I agree label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse I would prefer making the existing (linear) option label_angle work nicely on circular diagrams (which would make sense as part of reworking the code to obey label_placement). Good point - the automatic reorientation on either side of the circle (to respect the viewer's local gravity) could effectively be handled through a working label_angle for circular diagrams. And more adventurous manual reorientation would also be possible ;) One issue there is what the angle is defined with respect to: a 'vertical' reference on the page, or a tangent/normal to some point on the feature. The first is straightforward, and might be what we want - the second will likely result in some odd - or attractive - patterns. Comments in-line, if need be we could meet up to hash some of this out in person (although I will not be in the Dundee area next week). Friday's good for me. L. 
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From ben at benfulton.net Wed Dec 5 16:28:52 2012 From: ben at benfulton.net (Ben Fulton) Date: Wed, 5 Dec 2012 11:28:52 -0500 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> Message-ID: I've been studying this a bit and have a preference for Ned Batchelder's Coverage tool. But I plan on putting some more work into it this week and next. 
On Wed, Dec 5, 2012 at 9:16 AM, Peter Cock wrote >Although we have good test coverage, it isn't complete (anyone >fancy trying some test coverage measuring tools like figleaf?) From w.arindrarto at gmail.com Wed Dec 5 16:39:13 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Dec 2012 17:39:13 +0100 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: <50BEF69E.2000806@biotech.uni-tuebingen.de> References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: Hi Kai and everyone, Very happy to see the parser near completion (with tests too!). The issue you're facing is unfortunately the consequence of trying to keep attribute values in sync across the object hierarchy. It is a bit troublesome for now, but not without solution. > However, no matter what I do, I seem to get an empty description > tossed in there somehow. > > The parser is at > https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py > the test code is at > https://github.com/kblin/biopython/blob/antismash/Tests/test_SearchIO_hmmer2_text.py > and the test file that's failing is the hmmpfam2.3 file at > https://github.com/kblin/biopython/blob/antismash/Tests/Hmmer/text_23_hmmpfam_001.out '' is the default value for any description attribute (be it in the QueryResult object, or in the HSPFragment.hit_description). The error you're seeing is because the hit description is being accessed through the hit object (hit.description) and the cascading property getter first checks whether all HSPs contain the same `hit_description` attribute value. It'll only return the value if all HSPFragment.hit_description values are equal. Otherwise, it'll raise the error you're seeing here. In your case, there are two values: 'Conserved region in glutamate synthas' and '', while there should only be one (the first one). 
After prodding here and there, it seems that this is caused by the if clause here: https://github.com/kblin/biopython/blob/antismash/Bio/SearchIO/HmmerIO/hmmer2_text.py#L191 The 'else' clause in that block adds the HSP to the hit object, but does not do any cascading attribute assignment (query_description and hit_description). Here, the simple fix would be to force a description assignment to the HSP. For example, you could have the `else` block like so:

    ...
    else:
        hit = unordered_hits[id_]
        hsp.hit_description = hit.description
        hit.append(hsp)

Other fixes are of course possible, but this is the simplest I can imagine (though it seems a bit crude). Also, I would like to note that the query description assignment of the parser may break the cascade as well. If you try to access `qresult.description` (qresult being the QueryResult object), you'd get the true query description. But if you try to access it from `qresult[0].query_description` (the query description stored in the hit object), you'd get ''. The fix here would be to assign the description at the last moment before the QueryResult object is yielded. That way, the cascading setter works properly and all Hit, HSP, and HSPFragment inside the QueryResult object will contain the same value. I realize that this approach is not without flaws (and I'm always open to suggestions), but at the moment this seems to be the most sensible way to keep the attribute values in-sync while keeping the objects more user-friendly (i.e. making the parser slightly more complex to write, but with the result of consistent attribute values for the users). Hope this helps! 
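The cascading getter behaviour described above can be sketched with a toy pair of classes (invented names, not the real Bio.SearchIO._model classes): the container only reports a description when every fragment agrees, and raises a ValueError otherwise, which is the failure mode Kai hit:

```python
class Fragment:
    """Stand-in for an HSPFragment carrying a cascaded attribute."""
    def __init__(self, hit_description=""):
        self.hit_description = hit_description


class Hit:
    """Stand-in for a Hit whose description cascades from its fragments."""
    def __init__(self, fragments):
        self.fragments = fragments

    @property
    def description(self):
        # Only return a value when all fragments agree on it.
        values = set(frag.hit_description for frag in self.fragments)
        if len(values) > 1:
            # Mirrors the "multiple values" error seen in the thread.
            raise ValueError("inconsistent hit descriptions: %r" % sorted(values))
        return values.pop()
```

Forgetting to set hit_description on one fragment leaves it at the default '', which is why two values ('Conserved region in glutamate synthas' and '') show up.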
Bow From Leighton.Pritchard at hutton.ac.uk Wed Dec 5 16:21:06 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 5 Dec 2012 16:21:06 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? 
If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? [cid:4EA13CE3-20E7-41D8-870F-CBBAA9DD06B0 at scri.sari.ac.uk] label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png Type: image/png Size: 22969 bytes Desc: Screen Shot 2012-12-05 at Wednesday, December 5 16.06.12.png URL: From d.m.a.martin at dundee.ac.uk Wed Dec 5 16:29:14 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 16:29:14 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Just got my head out of hacking at this. The options I have now are:

label_position: start|middle|end with reference to the feature. So the end is always the pointy bit.
label_orientation: circular|upright Sometimes it is nice to have a proper circular plot
label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside.

It even works. Angles and so on are not so relevant with circular plots though I would prefer a label_angle: radial|tangent|[degrees] Should I attach an example? 
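For the proposed 'circular' label orientation, the geometry is just a tangent rotation that flips by 180 degrees on the reverse strand. A standalone sketch (the helper name and conventions are invented here; this is not GenomeDiagram code), with position measured clockwise from 12 o'clock as a fraction of the circle:

```python
import math


def label_transform(fraction, radius, strand=1):
    """Return (x, y, rotation_degrees) for a label anchored at `fraction`
    (0..1, clockwise from 12 o'clock) on a circular track of `radius`.

    Forward-strand labels read clockwise along the tangent; reverse-strand
    labels are flipped 180 degrees so they read anticlockwise.
    """
    theta = math.pi / 2 - 2 * math.pi * fraction  # clockwise from the top
    x = radius * math.cos(theta)
    y = radius * math.sin(theta)
    rotation = math.degrees(theta) - 90  # tangent, pointing clockwise
    if strand < 0:
        rotation += 180
    return x, y, rotation
```

At the top of the circle a forward-strand label comes out unrotated, and at 3 o'clock it is rotated -90 degrees, which matches the "face clockwise" behaviour described above.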
..d From: Leighton Pritchard [mailto:Leighton.Pritchard at hutton.ac.uk] Sent: 05 December 2012 16:21 To: David Martin Cc: BioPython-Dev; Peter Cock Subject: Re: [Biopython-dev] Modifications to CircularDrawer Hi all, On 5 Dec 2012, at Wednesday, December 5, 13:50, David Martin wrote: Peter, Leighton and I have had a brief discussion on Twitter re drawing circular genome diagrams. I'd like to modify the CircularDrawer feature drawing to allow the following: label_position: start|middle|end as per LinearDrawer label_placement: inside|outside|overlap where inside and outside are anchored just inside and just outside the feature but do not overlap it, and overlap is the current behaviour label_orientation: upright|circular which determines the orientation of the label. upright is the current behaviour. Circular would be oriented to face clockwise for the forward strand and anticlockwise for the reverse This will cause some issues with track widths (how can you specify a track width for a feature track?) Any thoughts/suggestions? I think that the proposed changes are all sensible, but circular labels are fiddly, so we'll need to think a wee bit about what to do exactly (probably another reason why I didn't do much on this originally ;) ). label_position makes perfect sense, as suggested. label_placement as a concept is fine, but I think we might need to be more precise about the intended behaviour of the arguments. I think that 'overlap' is possibly just a result of the choice of two anchoring parameters: 'inside'/'outside' and 'start'/'end' of the label string (see .png if the list accepts them). We might be able to cover all possibilities with just these two choices - does this image cover the range of intended positioning David? If so, then how about two arguments (I'm easy on the argument names): 'label_outer=True/False' and 'label_anchor=start/end/0/1'? 
label_orientation: What I think you're saying is that we want a distinction between text that orients to assume a static page (one you always view upright: e.g. a monitor), and text that doesn't. I prefer 'reorient_labels=True/False', or some other geometrically neutral argument name, to 'upright' (whose expected meaning could change depending on page location and local context) as a parameter, but it's a good idea. IIRC Track widths are relative, rather than absolute, and don't include label bounding boxes, so off-hand I don't think there ought to be any downstream issues. Famous last words, there! ;) Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827
The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22969 bytes Desc: image001.png URL: From p.j.a.cock at googlemail.com Wed Dec 5 16:57:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 16:57:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: > Just got my head out of hacking at this. The options I have now are: > > label_position: start|middle|end with reference to the feature. So the end is > always the pointy bit. Sounds good and uncontentious. > label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). > label_placement: inside|outside|overlap|strand which maintains overlap as > default, inside is all inside, outside is all outside, strand is forward outside > and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? Note the current circular behaviour which overlaps is strand aware, so those may not be the best names... See also my earlier email with an alternative suggestion. > It even works.
Angles and so on are not so relevant with circular plots > though I would prefer a label_angle: radial|tangent|[degrees] > > Should I attach an example? You can try if the files are not overly large (moderation delays will still occur), posting a link would be easier although probably less lasting. Are you OK with github? A natural option would be to show us your proposals on a branch (separate commits if possible, otherwise I can try and break out each bit if needed). Ta, Peter From p.j.a.cock at googlemail.com Wed Dec 5 17:24:08 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 17:24:08 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains overlap as >> default, inside is all inside, outside is all outside, strand is forward outside >> and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be done > to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well.
Regards, Peter From d.m.a.martin at dundee.ac.uk Wed Dec 5 17:30:26 2012 From: d.m.a.martin at dundee.ac.uk (David Martin) Date: Wed, 5 Dec 2012 17:30:26 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: <959CFF5060375249824CC633DDDF896F1C0C5EE5@AMSPRD0410MB351.eurprd04.prod.outlook.com> -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: 05 December 2012 17:24 To: David Martin Cc: Leighton Pritchard; BioPython-Dev Subject: Re: [Biopython-dev] Modifications to CircularDrawer On Wed, Dec 5, 2012 at 4:57 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 4:29 PM, David Martin wrote: >> label_placement: inside|outside|overlap|strand which maintains >> overlap as default, inside is all inside, outside is all outside, >> strand is forward outside and reverse inside. > > Perhaps below/above rather than inside/outside and then it could be > done to both the linear and circular drawers? Do you think this is useful then? Having seen your example (sent directly off list), I'm convinced about the usefulness of the inside and outside idea (when used on the inner-most or outer-most track). Still not sure about those names as I would also like to support this on the linear diagrams as well. Linear and Circular are similar but not identical. No problem with having a above|below|strand or a more complex anchoring scheme but I don't need it right now so I'm just playing with the circular one. I've attached a PDF to this mail - it might get through and I'll try to fork/clone/push git. ..d The University of Dundee is a registered Scottish Charity, No: SC015096 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: plasmid_circular_nice.pdf Type: application/pdf Size: 148125 bytes Desc: plasmid_circular_nice.pdf URL: From p.j.a.cock at googlemail.com Wed Dec 5 18:41:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Dec 2012 18:41:59 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi David, I've been experimenting with your pull request, thank you: https://github.com/biopython/biopython/pull/116 On Wed, Dec 5, 2012 at 5:22 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 5:10 PM, David Martin wrote: >> In the mean-time here is a plot (that doesn't show all layouts) > > Nice. Looking at that now I'm pretty sure I hacked the label anchor > once before of a quick job in order to get the labels outside like that... > certainly worth making this change. Found it, that change made it to a branch I'd forgotten about: https://github.com/peterjc/biopython/commit/d4764dfe929f135ec55b83ad14a9cd34e2d14bba This is bringing back memories... I think I'd concluded last time that attempting to offer anything other than radial label orientation was probably a mistake, and that if we restrict that we can safely offset the vertical position of the text midline (since right now it is positioned according to the bottom line of the font). Without that, positioning labels at the top (as you look at the page) of a circular feature gave non-ideal placement. This is likely one reason for the current hard-coded placement of the feature labels at the bottom (as you look at the circle). Hmm. 
I think I have a compromise forming that would allow figures like your motivating example :) Peter From kai.blin at biotech.uni-tuebingen.de Thu Dec 6 01:44:40 2012 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Thu, 06 Dec 2012 11:44:40 +1000 Subject: [Biopython-dev] Need some help with SearchIO HSPs cascading attributes. In-Reply-To: References: <50BEF69E.2000806@biotech.uni-tuebingen.de> Message-ID: <50BFF888.50300@biotech.uni-tuebingen.de> On 2012-12-06 02:39, Wibowo Arindrarto wrote: Hi Bow, everyone, > Very happy to see the parser near completion (with tests too!). The > issue you're facing is unfortunately the consequence of trying to keep > attribute values in sync across the object hierarchy. It is a bit > troublesome for now, but not without solution. ... > Here, the simple fix would be to force a description assignment to the > HSP. For example, you could have the `else` block like so: > > ... > else: > hit = unordered_hits[id_] > hsp.hit_description = hit.description > hit.append(hsp) Thanks for the tip, that was the last speedbump I had. I just sent off the pull request for the hmmer2 parser. Thanks again for the help, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From christian at brueffer.de Thu Dec 6 04:04:37 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 12:04:37 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50BF6813.4070102@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> Message-ID: <50C01955.8060505@brueffer.de> On 12/05/2012 11:28 PM, Christian Brueffer wrote: > On 12/5/12 22:16 , Peter Cock wrote: [...] 
> >>>> You've got us a lot closer to PEP8 compliance - do you think >>>> subject to a short white list of known cases (like module >>>> names) where we don't follow PEP8 we could aim to run a >>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>> hook)? That is quite appealing as a way to spot any new code >>>> which breaks the style guidelines... >>> >>> Having a commit hook would be ideal (maybe with a possibility to >>> override). This would be especially useful against the introduction of >>> gratuitous whitespace. With some editors/IDEs you don't even notice it. >> >> Would you be interested in looking into how to set that up? >> Presumably a client-side git hook would be best, but we'd >> need to explore cross platform issues (e.g. developing and >> testing on Windows) and making sure it allowed an override >> on demand (where the developer wants/needs to ignore a >> style warning). >> > > Yes, It's fairly high on my TODO list. > I just had a look at this. Turns out some people have had this idea before :-) Here's a first version: https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit Basically you just save this as biopython/.git/hooks/pre-commit and mark it executable. You also need to install pep8 (pip install pep8). The checks can be bypassed with git commit --no-verify. Currently it ignores E124 (which I think should remain that way). Any other errors or files it should ignore? I'd be grateful if someone could give this a try on Windows. 
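The ignore-list behaviour the hook needs can be sketched independently of git and pep8 itself; this hypothetical helper (not the actual pre-commit script) drops report lines whose error code is on an ignore set, as the E124 exclusion above does:

```python
# Codes to skip, per the exclusions discussed in this thread (E124, plus
# the further E12x candidates proposed later); E125 is deliberately kept.
IGNORED_CODES = {"E121", "E122", "E123", "E124", "E126", "E127", "E128"}

def filter_pep8_report(lines, ignored=IGNORED_CODES):
    """Drop pep8 report lines whose error code is in the ignore set.

    pep8 report lines look like: 'Bio/File.py:10:5: E128 continuation ...'
    """
    kept = []
    for line in lines:
        parts = line.split(":", 3)  # path, row, col, ' CODE message'
        code = parts[3].split()[0] if len(parts) == 4 else ""
        if code not in ignored:
            kept.append(line)
    return kept

report = [
    "Bio/File.py:10:5: E128 continuation line under-indented for visual indent",
    "Bio/File.py:42:9: E125 continuation line with same indent as next logical line",
]
print(filter_pep8_report(report))
```

A hook built this way could exit non-zero only when the filtered report is non-empty, leaving git commit --no-verify as the escape hatch.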
Chris From christian at brueffer.de Thu Dec 6 06:22:24 2012 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 06 Dec 2012 14:22:24 +0800 Subject: [Biopython-dev] Further PEP8 Cleanup In-Reply-To: <50C01955.8060505@brueffer.de> References: <50BC9F1F.4090904@brueffer.de> <50BCDB27.7040402@brueffer.de> <50BF6813.4070102@brueffer.de> <50C01955.8060505@brueffer.de> Message-ID: <50C039A0.8040208@brueffer.de> On 12/06/2012 12:04 PM, Christian Brueffer wrote: > On 12/05/2012 11:28 PM, Christian Brueffer wrote: >> On 12/5/12 22:16 , Peter Cock wrote: > [...] >> >>>>> You've got us a lot closer to PEP8 compliance - do you think >>>>> subject to a short white list of known cases (like module >>>>> names) where we don't follow PEP8 we could aim to run a >>>>> a pep8 tool automatically (e.g. as a unit test, or even a commit >>>>> hook)? That is quite appealing as a way to spot any new code >>>>> which breaks the style guidelines... >>>> >>>> Having a commit hook would be ideal (maybe with a possibility to >>>> override). This would be especially useful against the introduction of >>>> gratuitous whitespace. With some editors/IDEs you don't even notice >>>> it. >>> >>> Would you be interested in looking into how to set that up? >>> Presumably a client-side git hook would be best, but we'd >>> need to explore cross platform issues (e.g. developing and >>> testing on Windows) and making sure it allowed an override >>> on demand (where the developer wants/needs to ignore a >>> style warning). >>> >> >> Yes, It's fairly high on my TODO list. >> > > I just had a look at this. Turns out some people have had this idea > before :-) > > Here's a first version: > > https://github.com/cbrueffer/pep8-git-hook/blob/master/pre-commit > > Basically you just save this as biopython/.git/hooks/pre-commit and mark > it executable. You also need to install pep8 (pip install pep8). The > checks can be bypassed with git commit --no-verify. 
> > Currently it ignores E124 (which I think should remain that way). Any > other errors or files it should ignore? > > I'd be grateful if someone could give this a try on Windows. > Thinking about it, I think it would make sense to ignore the following: E121 continuation line indentation is not a multiple of four E122 continuation line missing indentation or outdented E123 closing bracket does not match indentation of opening bracket's line E124 closing bracket does not match visual indentation E126 continuation line over-indented for hanging indent E127 continuation line over-indented for visual indent E128 continuation line under-indented for visual indent They all deal with indentation, but are not always beneficial to readability. E125 is missing from that list, which is a useful one. Chris From p.j.a.cock at googlemail.com Thu Dec 6 10:07:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:07:55 +0000 Subject: [Biopython-dev] Minor buildbot issues from SearchIO In-Reply-To: References: Message-ID: On Wed, Dec 5, 2012 at 11:41 AM, Peter Cock wrote: > On Fri, Nov 30, 2012 at 2:35 AM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I've done some digging around to see how to deal with these issues. >> Here's what I found: >> >>> The BuildBot flagged two new issues overnight, >>> http://testing.open-bio.org/biopython/tgrid >>> >>> Python 2.5 on Windows - doctests are failing due to floating point decimal place >>> differences in the exponent (down to C library differences, something fixed in >>> later Python releases). Perhaps a Python 2.5 hack is the way to go here? 
>>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%202.5/builds/664/steps/shell/logs/stdio >> >> I've submitted a pull request to fix this here: >> https://github.com/biopython/biopython/pull/98 > > The Windows detection wasn't quite right, it should now match > how we look for Windows elsewhere in Biopython: > https://github.com/biopython/biopython/commit/fc24967b89eda56675e67824a4a57a6059650636 > >>> There is a separate cross-platform issue on Python 3.1, "TypeError: >>> invalid event tuple" again with XML parsing. Curiously this had started >>> a few days back in the UniprotIO tests on one machine, pre-dating the >>> SearchIO merge. I'm not sure what triggered it. >>> http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/767 >>> http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/766/steps/shell/logs/stdio >>> http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/648/steps/shell/logs/stdio >> >> As for this one, it seems that it's caused by a bug in Python3.1 >> (http://bugs.python.org/issue9257) due to the way >> `xml.etree.cElemenTree.iterparse` accepts the `event` argument. > > Ah - I remember that bug now, we have a hack in place elsewhere > to try and avoid that - seems it won't be fixed in Python 3.1.x now > so I've relaxed the version check here: > https://github.com/biopython/biopython/commit/52fdd0ed7fa576494005e635b6a6610daab2ab0e > > Hopefully that will bring the buildbot back to all green tonight. > (TravisCI has now dropped their Python 3.1 support, but they > should have Python 3.3 with NumPy working soon). > > Peter OK, the buildbot looks happy now from the SearchIO work. There is one issue under Python 3.1.5 on a 64 bit Linux server, which I suspect is down to the Python version (this buildslave used to run an older version - Python 3.1.3 (separate email to follow). 
Regards, Peter From p.j.a.cock at googlemail.com Thu Dec 6 10:24:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:24:47 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? Message-ID: On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: > > OK, the buildbot looks happy now from the SearchIO work. > > There is one issue under Python 3.1.5 on a 64 bit Linux server, > which I suspect is down to the Python version (this buildslave > used to run an older version - Python 3.1.3 (separate email > to follow). There are 18 test failures like this - all to do with handles and stdout, which have been happening for a while now but I've not found time to look into it. Example: ====================================================================== ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) needle with asis trick, output piped to stdout. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 74, in __next__ line = self._header AttributeError: 'EmbossIterator' object has no attribute '_header' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", line 571, in test_needle_piped align = AlignIO.read(child.stdout, "emboss") File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 418, in read first = next(iterator) File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", line 366, in parse for a in i: File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", line 77, in __next__ line = 
handle.readline() AttributeError: '_io.FileIO' object has no attribute 'read1' Last working build, Python 3.1.3, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 Next build (after a couple of weeks offline while this server was being rebuilt), Python 3.1.5, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 The timing does suggest an issue introduced in the rebuild, and the obvious difference is the version of Python jumped from 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). There were some security fixes only in Python 3.1.5, none of which sound relevant here: http://www.python.org/download/releases/3.1.5/ The change log for Python 3.1.4 is longer, and does mention stdout/stderr issues so this is perhaps the cause: hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS See also http://bugs.python.org/issue4996 as possibly related. The whole Python 3 text vs binary handle issue is important with stdout/stderr. What I am doing now is testing those two commits (with Python 3.1.5) to confirm they both fail, and thus rule out a Biopython code change in those two weeks being to blame. Peter From p.j.a.cock at googlemail.com Thu Dec 6 10:45:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Dec 2012 10:45:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: On Thu, Dec 6, 2012 at 10:24 AM, Peter Cock wrote: > On Thu, Dec 6, 2012 at 10:07 AM, Peter Cock wrote: >> >> OK, the buildbot looks happy now from the SearchIO work.
>> >> There is one issue under Python 3.1.5 on a 64 bit Linux server, >> which I suspect is down to the Python version (this buildslave >> used to run an older version - Python 3.1.3 (separate email >> to follow). > > There are 18 test failures like this - all to do with handles and stdout, > which have been happening for a while now but I've not found time > to look into it. Example: > > ====================================================================== > ERROR: test_needle_piped (test_Emboss.PairwiseAlignmentTests) > needle with asis trick, output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 74, in __next__ > line = self._header > AttributeError: 'EmbossIterator' object has no attribute '_header' > > During handling of the above exception, another exception occurred: > > Traceback (most recent call last): > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/Tests/test_Emboss.py", > line 571, in test_needle_piped > align = AlignIO.read(child.stdout, "emboss") > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 418, in read > first = next(iterator) > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/__init__.py", > line 366, in parse > for a in i: > File "/home_local/buildslave/BuildBot_Biopython/lin3164/build/build/py3.1/build/lib.linux-x86_64-3.1/Bio/AlignIO/EmbossIO.py", > line 77, in __next__ > line = handle.readline() > AttributeError: '_io.FileIO' object has no attribute 'read1' > > Lasting working build, Python 3.1.3, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/710/steps/shell/logs/stdio > 
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634 > > Next build (after a couple of weeks offline while this server was > being rebuilt), Python 3.1.5, > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.1/builds/722/steps/shell/logs/stdio > https://github.com/biopython/biopython/commit/3ea4ea58ed80d6e517699bcab8810398f9ce5957 > > The timing does suggest an issue introduced in the rebuild, and > the obvious difference is the version of Python jumped from > 3.1.3 to 3.1.5 (likely things like NumPy etc also changed). > > There were some security fixes only in Python 3.1.5, none of > which sound relevant here: > http://www.python.org/download/releases/3.1.5/ > > The change log for Python 3.1.4 is longer, and does mention > stdout/stderr issues so this is perhaps the cause: > hg.python.org/cpython/raw-file/feae9f9e9f30/Misc/NEWS > > See also http://bugs.python.org/issue4996 as possibly > related. The whole Python 3 text vs binary handle issue > is important with stdout/stderr. > > What I am doing now is testing those two commits (with > Python 3.1.5) to confirm they both fail, and thus rule out > a Biopython code change in those two weeks being to > blame. > > Peter Confirmed, using test_Emboss.py and Python 3.1.5 on this machine (running as the buildslave user using the same Python 3.1.5 installation), using the current tip 5092e0e9f2326da582158fd22090f31547679160 and the two commits mentioned above, that is e90db11f4a1d983bc2bfe12bec30edbdbb200634 and 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - all three builds show the same failure. i.e. The failure is not due to a change in Biopython between those commits, but is in some way caused by a change to the buildslave environment. My first suggestion that this is due to Python 3.1.3 -> 3.1.5 remains my prime suspect. I could try downgrading Python 3.1 on this machine to confirm that I suppose... or updating Python 3.1 on another machine? 
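For what it's worth, the 'read1' failure above is characteristic of wrapping a raw (unbuffered) file object directly in io.TextIOWrapper - text wrappers need the read1() method that only buffered readers provide. Here is a self-contained sketch of that workaround, simulating the subprocess pipe with os.pipe; whether this is exactly what changed between 3.1.3 and 3.1.5 is speculation:

```python
import io
import os

# Simulate a subprocess-style pipe whose read end is a raw FileIO object.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"needle output\n")
os.close(write_fd)

raw = io.FileIO(read_fd, "rb")
# io.TextIOWrapper(raw) can fail with "'_io.FileIO' object has no
# attribute 'read1'"; inserting a BufferedReader supplies read1().
handle = io.TextIOWrapper(io.BufferedReader(raw))
line = handle.readline()
handle.close()
print(repr(line))
```

If the regression is in how child.stdout gets exposed, an explicit buffered/text wrapping like this in the test (or parser) would sidestep it.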
The other recent Python 3.1 buildbot runs were both using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). Can anyone else reproduce this, or have an idea what the fix might be? Regards, Peter From Leighton.Pritchard at hutton.ac.uk Thu Dec 6 12:28:39 2012 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 6 Dec 2012 12:28:39 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: Hi all, I'm starting to remember why I left circular labelling options alone ;) On 5 Dec 2012, at Wednesday, December 5, 16:57, Peter Cock wrote: On Wed, Dec 5, 2012 at 4:29 PM, David Martin > wrote: label_orientation: circular|upright Sometimes it is nice to have a proper circular plot I'd have to see the code or an example (and it seems any image attachment will stall your emails for moderation - I'm a moderator but there is some time delay before this gets that far). I still don't like 'upright' - but that's a naming issue, rather than one of functionality. label_placement: inside|outside|overlap|strand which maintains overlap as default, inside is all inside, outside is all outside, strand is forward outside and reverse inside. Perhaps below/above rather than inside/outside and then it could be done to both the linear and circular drawers? Do you think this is useful then? 'Below' and 'above' are context- (and viewer!) dependent: on a circular diagram 'above' on a feature at 12 o'clock is on the opposite side of the feature when it's 'above' at 6 o'clock. It's not clear what either would mean for a feature at 3 o'clock or 9 o'clock. 'Inside' and 'outside' are stably relative to the circular track for a feature at any position on the circle, so I prefer them as settings. 
I'm not keen on 'overlap' or 'strand', as I'm not clear what kind of label orientation they refer to: for example, what is being 'overlapped'? Looking at the .pdf, it seems like you've anchored the green labels to the track, rather than to the feature, which I think looks good there - but I'd like to have the option of track vs feature anchoring available via an argument like 'label_anchor', which could be distinguished from 'label_text_anchor'. Including this choice, my preferred arguments would be something like: label_direction='clockwise'|'anticlockwise' - 'clockwise': The text looks like it's progressing clockwise (like the green text in the .pdf); 'anticlockwise' like the blue text. By choosing 'clockwise' or 'anticlockwise' for the appropriate group of features, we achieve part of what I think you might mean by 'upright' (i.e. clockwise from pi/2 to 3pi/2, anticlockwise elsewhere). That could be handled with an 'auto' option. This argument essentially dictates label_angle for each feature: more of which later. It would be nice to have synonyms of 'counterclockwise', 'anticlockwise' and 'widdershins' ;) label_anchor='track'|'feature' Describes what element the text bounding box will be anchored to. label_text_anchor='start'|'end' Which part of the text bounding box (relative to the text) gets anchored. I think it's a good idea to have this wrap a lower-level setting that has label_text_anchor=float, as a relative location on the feature, where start=0, center=0.5, end=1, and values beyond that offer a label separation, relative to the label size - though I can't imagine why I'd use it over the option below - since spacing would depend on bounding box size - the flexibility could be useful, and you'd have to do that calculation anyway ;) label_placement='inner'|'outer' Do we anchor on the track/feature towards the circle centre (inner) or on the other side (outer)? 
I think it's a good idea to have this wrap a lower-level representation that has label_placement=float, as a relative location on the feature, where inner=-1,outer=1 as a proportion of track/feature height, and other values place the anchor relative to the feature/track boundary - this again offers a choice of label separation, but one that's uniform for all features. label_position='start'|'end'|'center' Where, relative to the feature, do we anchor? I think it's a good idea to have this wrap a lower-level representation that has label_position=[0,1], as a relative location on the feature, where start=0, center=0.5, end=1. That gives more flexibility for those who want it (and you have to do the calculation, anyway). label_orientation='radial'|'horizontal' Fairly obviously, 'radial' = as it is now, and 'horizontal' is reading like regular text. But this one's a tricky one, which is why all the labels are radial at the moment ;) I think that this choice has to either live with ('radial') or override ('horizontal') the label_direction argument. As with label_direction, this essentially dictates label_angle for each individual feature, which has its own issues (what do we measure the angle relative to? If it's relative to a common reference, then for a constant angle you get some funny-looking label patterns, and it doesn't look good in bulk. Relative to a feature-local reference, we can choose the tangent or the normal - but at what point of the feature? Really, we want that to be the tangent or normal at the anchor point of the text, so that the same angle looks consistent across all features (45deg to the normal at the start of a long feature is different to 45deg to the normal at the centre of that feature, relative to the bottom of the page: this looks weird)). 
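As an illustration of the geometry being discussed, here is a minimal standalone sketch (all names and conventions hypothetical, not GenomeDiagram code) computing an anchor point and a tangent-based angle for a label, using the float forms proposed above (label_position in [0, 1] along the feature, label_placement in [-1, 1] across the track) plus an 'auto' direction rule (clockwise from pi/2 to 3pi/2, anticlockwise elsewhere):

```python
import math

def label_anchor(start_angle, end_angle, track_radius, track_height,
                 label_position=0.5, label_placement=1.0):
    """Anchor point (x, y) and local tangent angle for a circular label.

    Angles are in radians, measured anticlockwise from 3 o'clock, with
    the origin at the circle centre. label_position: 0=start, 0.5=centre,
    1=end of the feature. label_placement: -1=inner edge, +1=outer edge.
    """
    theta = start_angle + label_position * (end_angle - start_angle)
    radius = track_radius + label_placement * (track_height / 2.0)
    x = radius * math.cos(theta)
    y = radius * math.sin(theta)
    tangent = theta + math.pi / 2.0  # direction of increasing angle
    return x, y, tangent

def auto_direction(theta):
    """'clockwise' in the left half of the circle (pi/2 .. 3*pi/2)."""
    theta %= 2 * math.pi
    if math.pi / 2 <= theta < 3 * math.pi / 2:
        return "clockwise"
    return "anticlockwise"

# A feature spanning 0..pi/4 on a track of radius 100 and height 10,
# anchoring at the feature start on the outer edge of the track:
x, y, ang = label_anchor(0.0, math.pi / 4, 100.0, 10.0,
                         label_position=0.0, label_placement=1.0)
print(round(x, 1), round(y, 1))  # -> 105.0 0.0
```

The tangent returned here is the angle at the anchor point itself, which is the feature-local reference argued for above; a renderer would still need to flip text by pi depending on the chosen direction.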
A complicating issue here with text anchoring is what part of the text box gets anchored: depending on the font, and the string, choosing the top or bottom of the bounding box (which will include ascender and descender spaces) can look weird, so it's probably best to anchor on the midline of the text box. This avoids a problem with 'anticlockwise' vs 'clockwise' when implemented as a rotation, in that anchoring to the lower left of text, then rotating 180deg around the centre of the text box gives a different final positioning (and anchoring) than anchoring to the midline of the text box, then performing the same rotation.

By appropriate choices of these settings, we can obtain pretty much any labelling style. We need to keep in mind, though, that the arguments won't be interpreted properly until the Diagram gets passed to the renderer, so 'auto' settings to achieve a particular effect with complicated combinations of arguments dependent on feature location might be better passed with draw(). As specific examples:

1) Let's say the effect we're looking for is horizontal text, anchored to the outside of the track. Here we'd need to consider two halves of the diagram. On the left hand side we need to set label_text_anchor='end', and on the right we set label_text_anchor='start'. On both sides we set label_orientation='horizontal', label_anchor='track', label_placement='outer'. However, we need to take care with features towards the top and bottom of the image, as horizontal labels will run into each other here.

2) Dropping the requirement for horizontal text, we can set label_orientation='radial', label_anchor='track', label_placement='outer' on both sides (maybe this should be the default?), but set label_direction='clockwise', label_text_anchor='end' on the left, and label_direction='counterclockwise', label_text_anchor='start' on the right.
3) If we wanted to label features directly, on the appropriate side of their track, we could set label_anchor='feature' for all features, with label_placement='inner' for reverse-strand, and label_placement='outer' for forward-strand features.

These are some fairly obvious standard settings which could be made available as presets in the calls to draw(), so that the fiddly details are hidden.

Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

From w.arindrarto at gmail.com Fri Dec 7 03:32:06 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 7 Dec 2012 04:32:06 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: > Confirmed, using test_Emboss.py and Python 3.1.5 on > this machine (running as the buildslave user using the > same Python 3.1.5 installation), using the current tip > 5092e0e9f2326da582158fd22090f31547679160 and > the two commits mentioned above, that is > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > all three builds show the same failure. > > i.e. The failure is not due to a change in Biopython > between those commits, but is in some way caused > by a change to the buildslave environment. My first > suggestion that this is due to Python 3.1.3 -> 3.1.5 > remains my prime suspect. > > I could try downgrading Python 3.1 on this machine > to confirm that I suppose... or updating Python 3.1 on > another machine? > > The other recent Python 3.1 buildbot runs were both > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > Can anyone else reproduce this, or have an idea what > the fix might be? It's reproducible in my machine: Arch Linux 64 bit running Python3.1.5. Haven't figured out a fix yet, but trying to see if I can. By the way, I was wondering, what's our deprecation policy for Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't seem to be any major updates coming soon. How long should we keep supporting Python <3.2? regards, Bow From p.j.a.cock at googlemail.com Fri Dec 7 10:06:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Dec 2012 10:06:57 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout?
In-Reply-To: References: Message-ID: On Fri, Dec 7, 2012 at 3:32 AM, Wibowo Arindrarto wrote: > > > Confirmed, using test_Emboss.py and Python 3.1.5 on > > this machine (running as the buildslave user using the > > same Python 3.1.5 installation), using the current tip > > 5092e0e9f2326da582158fd22090f31547679160 and > > the two commits mentioned above, that is > > e90db11f4a1d983bc2bfe12bec30edbdbb200634 and > > 3ea4ea58ed80d6e517699bcab8810398f9ce5957 - > > all three builds show the same failure. > > > > i.e. The failure is not due to a change in Biopython > > between those commits, but is in some way caused > > by a change to the buildslave environment. My first > > suggestion that this is due to Python 3.1.3 -> 3.1.5 > > remains my prime suspect. > > > > I could try downgrading Python 3.1 on this machine > > to confirm that I suppose... or updating Python 3.1 on > > another machine? > > > > The other recent Python 3.1 buildbot runs were both > > using Python 3.1.2 (Windows XP 32bit and Linux 32 bit). > > > > Can anyone else reproduce this, or have an idea what > > the fix might be? > > It's reproducible in my machine: Arch Linux 64 bit running > Python3.1.5. Haven't figured out a fix yet, but trying to see if I > can. Great. We haven't really proved this is down to a change in either Python 3.1.4 or 3.1.5 but it does look likely. > > By the way, I was wondering, what's our deprecation policy for > Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't > seem to be any major updates coming soon. How long should we keep > supporting Python <3.2? As long as it doesn't cost us much effort? If we can't solve this issue easily that might be enough to drop Python 3.1? 
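For what it's worth, a cut-off like this is usually enforced with a trivial interpreter check at install or import time. A sketch, under the assumption the policy becomes "2.x as currently supported, else Python 3.2+" - the constant and function name here are made up for illustration:

```python
import sys

MIN_PY3 = (3, 2)  # hypothetical minimum Python 3 under the proposed policy

def python3_supported(version_info=sys.version_info):
    """Return False for Python 3 interpreters older than MIN_PY3.

    Python 2.x is deliberately left alone here; it is covered by the
    existing 2.x support checks.
    """
    if version_info[0] == 3 and tuple(version_info[:2]) < MIN_PY3:
        return False
    return True
```

Something like this in setup.py would turn "unsupported Python 3.1" into an explicit message rather than an obscure test failure.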
My impression is that Python 3.0 is dead, and the only sizeable group stuck with Python 3.1 will be those on Ubuntu lucid (LTS is supported through 2013 on desktops and 2015 on servers), but as with life under Python 2.x it is fairly straightforward to have a local/additional Python without disturbing the system installation. On a related note, TravisCI currently still supports Python 3.1 unofficially (we're not using this with Biopython but I've tried it with other projects), but this will be dropped soon - once they have Python 3.3 working. Since we don't yet officially support Python 3 (but we probably should soon) we have the flexibility to recommend either Python 3.2 or 3.3 as a baseline. Peter From redmine at redmine.open-bio.org Sun Dec 9 04:11:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 04:11:35 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. It looks like your data file is corrupted. In _read_value_from_handle, the length of the key it tries to read is 1490353651722. This does not seem correct. Can you create a minimal data file that shows the problem? Then, when you fill in the trie, you can identify which key causes the problem. ---------------------------------------- Bug #3395: Biopython trie implementation can't load large data sets https://redmine.open-bio.org/issues/3395 Author: Michał Nowotka Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Imagine I have Biopython trie:

from Bio import trie
import gzip
f = gzip.open('/tmp/trie.dat.gz', 'w')
tr = trie.trie()
# fill in the trie
trie.save(f, trie)

Now /tmp/trie.dat.gz is about 50MB.
Let's try to read it:

from Bio import trie
import gzip
f = gzip.open('/tmp/trie.dat.gz', 'r')
tr = trie.load(f)

Unfortunately I'm getting a meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Sun Dec 9 09:53:30 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 09:53:30 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. That just means that the bug is in the save() and not the load() function. But of course I will provide a data file, although I can't guarantee it will be minimal.

From redmine at redmine.open-bio.org Sun Dec 9 12:13:09 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 9 Dec 2012 12:13:09 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. You don't need to provide the data file to us. The idea is that you create the smallest trie.dat file that will cause the load() to fail. Then you know which item in the trie is problematic. Once you know that, we can try to figure out why the save() creates a corrupted file.

From redmine at redmine.open-bio.org Mon Dec 10 17:39:24 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 10 Dec 2012 17:39:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka.
File minimal_data.pkl added This is my minimal test case:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
index = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Tue Dec 11 05:32:02 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 05:32:02 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. Hi Michal, Unfortunately I cannot load your minimal_data.pkl file. At list = pickle.load(f) I get ImportError: No module named django.db.models.query Can you check which item in list is actually causing the problem?
Just reduce the list until you find the item that is causing the trie.load(f) to fail.

From MatatTHC at gmx.de Tue Dec 11 08:11:48 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 11 Dec 2012 09:11:48 +0100 Subject: [Biopython-dev] genetic code Message-ID: Dear biopython developers, there is a new genetic code table (24) in the NCBI resources (see NC_015649). Maybe you can update this with the next release. Would it be an idea to distribute the genetic code file from NCBI with Biopython and create the code tables on import or during installation? Then Biopython would be automatically up-to-date. Regards, Matthias

From redmine at redmine.open-bio.org Tue Dec 11 09:15:22 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 11 Dec 2012 09:15:22 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Hello, As I said, this is a minimal test case. That means there is no single key that causes a problem.
If you remove any of the items from the list it will work. You can try to run this example from the django shell (python manage.py shell). If there are any further problems with running it I can provide the model classes as well.

From arklenna at gmail.com Tue Dec 11 16:00:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 11 Dec 2012 11:00:33 -0500 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: Hi Matthias, In a similar case, we have a file in the Scripts/ directory to download and parse the file. The generated file (and not the source file) is committed, but the script is available in the source for end users who wish to update it: https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py I think a similar situation would be appropriate here. Does Biopython currently include alternate codon tables? Cheers, Lenna On Tuesday, December 11, 2012, Matthias Bernt wrote: > Dear biopython developers, > > there is a new genetic code table (24) in the NCBI resources (see > NC_015649).
Maybe you can update this with the next release. > > Would it be an idea to distribute the genetic code file from ncbi with > biopython and create the code tables on import or during installation? Then > biopython would be automatically up-to-date. > > Regards, > Matthias > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Dec 11 18:42:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Dec 2012 18:42:13 +0000 Subject: [Biopython-dev] genetic code In-Reply-To: References: Message-ID: On Tuesday, December 11, 2012, Lenna Peterson wrote: > Hi Matthias, > > In a similar case, we have a file in the Scripts/ directory to download and > parse the file. The generated file (and not the source file) is committed, > but the script is available in the source for end users who wish to update > it: > > > https://github.com/biopython/biopython/blob/master/Scripts/PDB/generate_three_to_one_dict.py > > I think a similar situation would be appropriate here. Does Biopython > currently include alternate codon tables? > > Cheers, > > Lenna Yes, see https://github.com/biopython/biopython/blob/master/Bio/Data/CodonTable.py and the parser therein. On Tuesday, December 11, 2012, Matthias Bernt wrote: > > > Dear biopython developers, > > > > there is a new genetic code table (24) in the NCBI resources (see > > NC_015649). Maybe you can update this with the next release. That seems like a Good idea :) > > Would it be an idea to distribute the genetic code file from ncbi with > > biopython and create the code tables on import or during installation? > Then > > biopython would be automatically up-to-date. > > > > Regards, > > Matthias > That would just make installation more complex (and it is already complicated). I would prefer to keep setup.py as normal as possible.
The NCBI tables rarely change, so this works OK overall. Peter

From redmine at redmine.open-bio.org Wed Dec 12 04:16:27 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 04:16:27 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. We need to isolate the bug further to be able to solve it. I would suggest finding a data set that fails to load but does not depend on django.

From redmine at redmine.open-bio.org Wed Dec 12 07:56:52 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 07:56:52 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. Sure, today I'll strip all django dependencies and resubmit the data set and loading code.
From redmine at redmine.open-bio.org Wed Dec 12 10:04:28 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 10:04:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michał Nowotka. File minimal_data.pkl added Minimal test case with stripped django dependencies, loading code below:

from Bio import trie
import pickle

f = open('minimal_data.pkl', 'r')
list = pickle.load(f)
f.close()

index = trie.trie()
for item in list:
    for chunk in item[0].split('/')[1:]:
        if len(chunk) > 2:
            if index.get(str(chunk)):
                index[str(chunk)].append(item[1])
            else:
                index[str(chunk)] = [item[1]]

f = open('trie.dat', 'w')
trie.save(f, index)
f.close()

f = open('trie.dat', 'r')
new_trie = trie.load(f)
f.close()

From redmine at redmine.open-bio.org Wed Dec 12 12:29:19 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 12:29:19 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. The problem was indeed that one of the chunks had a size of 2000. I've uploaded a fix to github; could you please give it a try? See https://github.com/biopython/biopython/commit/6e09a4a67b7dec1910b13e3d730e3a1f5c2261c9 In particular, please make sure that new_trie is identical to trie.

From redmine at redmine.open-bio.org Wed Dec 12 21:44:14 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 12 Dec 2012 21:44:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #3400] (New) Hmmer3-text parser crashes when parsing hmmscan --cut_tc files Message-ID: Issue #3400 has been reported by Kai Blin. ---------------------------------------- Bug #3400: Hmmer3-text parser crashes when parsing hmmscan --cut_tc files https://redmine.open-bio.org/issues/3400 Author: Kai Blin Status: New Priority: Normal Assignee: Category: Target version: URL: I'm currently struggling with a crash in the hmmer3-text parser when dealing with files generated by hmmscan --cut_tc. I'm not quite sure what happens yet, but I have the feeling that some part of the hit parsing logic is reading into the next query without yielding a result. The backtrace is
Traceback (most recent call last):
  File "t.py", line 4, in <module>
    i = it.next()
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 317, in parse
    yield qresult
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/data/uni/biopython/Bio/File.py", line 84, in as_handle
    yield fp
  File "/data/uni/biopython/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 47, in __iter__
    for qresult in self._parse_qresult():
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 133, in _parse_qresult
    hit_list = self._parse_hit(qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 176, in _parse_hit
    hit_list = self._create_hits(hit_attr_list, qid)
  File "/data/uni/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py", line 239, in _create_hits
    hit_attr = hit_attrs.pop(0)
IndexError: pop from empty list
Line numbers might be a bit off as I added debug output to understand what's happening already. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin From bow at bow.web.id Thu Dec 13 04:15:01 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 05:15:01 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, Thanks for the report. AB-BLAST wasn't included in the BLAST XML parser's test suite so I'm glad you spotted this :). You're proposing a bug fix, so yes, this should be included in our code. You could submit a pull request on our github page: https://github.com/biopython/biopython/pulls, or I can submit it on your behalf if you prefer not to submit it yourself. If you're not familiar with GitHub, we have a quick guide on how to use it to develop Biopython here: http://biopython.org/wiki/GitUsage. GitHub's help on how to submit pull requests is a useful read too: https://help.github.com/articles/using-pull-requests Along with the patch, a unit test on the AB-BLAST output would also be very welcome. As for the actual regex change, I was wondering, is that the only possible pattern of the BlastOutput_version tag in AB-BLAST? Do you have examples of any other version output from AB-BLAST? cheers, Bow P.S. CC-ed to the Biopython-dev mailing list On Thu, Dec 13, 2012 at 4:41 AM, Colin Archer wrote: > Hi Bow, > I have been using your implementation of the biopython BLAST > output parser but for AB-BLAST input and it has been working OK so far, > although I haven't thoroughly had a look at the speed yet.
I initially found > that the version tag (BlastOutput_version) for AB-BLAST results were slightly > different from NCBI BLAST and changed the regex you implemented to cover > both versions. The difference between them was: > > BLASTN 2.2.27+ > 3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 > 2009-11-17T18:52:53] > > > and the regex I ended up using was: > r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?' > > and here is the tested output: >>>> _RE_VERSION1 = re.compile(r'\d+\.\d+\.\d+\+?') >>>> _RE_VERSION2 = re.compile(r'(\d+\.(?:\d+\.)*\d+)(?:\w+-\w+|\+)?') >>>> version1 > 'BLASTN 2.2.27+' >>>> version2 > '3.0PE-AB [2009-10-30] [linux26-x64-I32LPF64 2009-11-17T18:52:53]' >>>> re.search(_RE_VERSION1, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION2, version1).group(0) > '2.2.27+' >>>> re.search(_RE_VERSION1, version2).group(0) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: 'NoneType' object has no attribute 'group' >>>> re.search(_RE_VERSION2, version2).group(0) > '3.0PE-AB' > > Would there be any chance of including this in a future release of > BioPython? > > Thanks > Colin > > From bow at bow.web.id Thu Dec 13 16:14:27 2012 From: bow at bow.web.id (Wibowo Arindrarto) Date: Thu, 13 Dec 2012 17:14:27 +0100 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: Hi Colin, > From what I have seen, the version value is formatted > differently based on the edition of AB-BLAST being used: personal, > commercial etc. As I only use the personal edition, I'm not sure if the > other versions are different but I imagine that they conform to the same > format, with the version followed by the edition (for example, 3.0PE-AB for > personal edition). The regex I sent you will keep the edition so I imagine > it will work on other versions of AB-BLAST as long as the edition is > represented by "words-words" Ok then. The regex looks good.
You can probably make it more reader-friendly by separating the regex for NCBI and AB BLAST (e.g. r'(?:ncbi_blast_regex)|(?:ab_blast_regex)'). But even without this, it seems to work ok. > I'll submit a pull request as well and submit the revised regex. If you are > interested, there are a couple of other differences in the XML output between > AB-BLAST and NCBI-BLAST. I can send you an example output if you would like > to have a look at it. Presently, SearchIO can't parse AB-BLAST XML output > for multiple queries as the AB-BLAST output is just a concatenation of > multiple single queries. Each query contains the section > at the beginning and causes ElementTree to error during iteration. To get > around this I have been piping the AB-BLAST output and parsing it into a > more NCBI-BLAST form. Hmm... it is a problem if AB-BLAST concatenates outputs like that. It makes the XML invalid, though, so I'm not sure if we should change the parser to tolerate this. What are the other differences? As for the example files, they would indeed be useful for unit testing (as long as they're not that big ~ less than 50K?). You can send them to me. If you're feeling it, you can also write your own unit tests using them :). Looking forward to the pull request :), Bow From p.j.a.cock at googlemail.com Thu Dec 13 17:09:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:09:59 +0000 Subject: [Biopython-dev] Slight modification to BlastXML parser for AB-BLAST input In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:14 PM, Wibowo Arindrarto wrote: >> Presently, SearchIO can't parse AB-BLAST XML output >> for multiple queries as the AB-BLAST output is just a concatenation of >> multiple single queries. Each query contains the section >> at the beginning and causes ElementTree to error during iteration. To get >> around this I have been piping the AB-BLAST output and parsing it into a >> more NCBI-BLAST form.
> > Hmm... it is a problem if AB-BLAST concatenates outputs like that. It > makes the XML invalid, though, so I'm not sure if we should change > the parser to tolerate this. What are the other differences? The older NCBI BLAST tools had this bug as well - and as a result our NCBIXML has a hack to cope with it. It might be worth applying the same kind of fix to the SearchIO BLAST XML parser as well if it would help with both AB-BLAST and any older NCBI XML files. Peter From lucas.sinclair at me.com Thu Dec 13 16:29:19 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Thu, 13 Dec 2012 17:29:19 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator Message-ID: Hi ! I'm working a lot with fasta files. They can be large (>50GB) and contain lots of sequences (>40,000,000). Often I need to get one sequence from the file. With a flat FASTA file this requires parsing, on average, half of the file before finding it. I would like to write something that solves this problem, and rather than making a new repository, I thought I could contribute to biopython. As I just wrote, the iterator nature of parsing sequence files has its limits. I was thinking of something that is indexed. And not some hack like I see sometimes where a second ".fai" file is added next to the ".fa" file. The natural thing to do is to put these entries in a SQLite file. The appraisal of such solutions is well made here: http://defindit.com/readme_files/sqlite_for_data.html Now I looked into the biopython source code, and it seems everything is based on returning a generator object which essentially has only one method: next() giving SeqRecords. For what I want to do, I would also need the get(id) method. Plus any other methods that could now be added to query the DB in a useful fashion (e.g. SELECT entry where length > 5).
I see there is a class called InterlacedSequenceIterator(SequenceIterator) that contains a __getitem__(i) method, but it's unclear how I should go about implementing that. Any help/example on how to add such a format to SeqIO ? Thanks ! Lucas Sinclair, PhD student Ecology and Genetics Uppsala University From p.j.a.cock at googlemail.com Thu Dec 13 17:40:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:40:46 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: > Hi ! > > I'm working a lot with fasta files. They can be large (>50GB) and contain > lots of sequences (>40,000,000). Often I need to get one sequence from the > file. With a flat FASTA file this requires parsing, on average, half of the > file before finding it. I would like to write something that solves this > problem, and rather than making a new repository, I thought I could > contribute to biopython. > > As I just wrote, the iterator nature of parsing sequence files has its > limits. I was thinking of something that is indexed. And not some hack like > I see sometimes where a second ".fai" file is added next to the ".fa" file. > The natural thing to do is to put these entries in a SQLite file. The > appraisal of such solutions is well made here: > http://defindit.com/readme_files/sqlite_for_data.html > > Now I looked into the biopython source code, and it seems everything is > based on returning a generator object which essentially has only one method: > next() giving SeqRecords. For what I want to do, I would also need the > get(id) method. Plus any other methods that could now be added to query the > DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is > a class called InterlacedSequenceIterator(SequenceIterator) that contains a > __getitem__(i) method, but it's unclear how I should go about > implementing that.
Any help/example on how to add such a format to SeqIO ? > > Thanks ! Have you looked at Bio.SeqIO.index (index held in memory) and Bio.SeqIO.index_db (index held in an SQLite3 database), and do they solve your needs? Note these only index the location of records - unlike tabix/fai indexes which also look at the line length to be able to pull out subsequences. This means the Bio.SeqIO indexing isn't ideal for dealing with large records where you are only interested in small subsequences. Peter From p.j.a.cock at googlemail.com Thu Dec 13 17:51:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 17:51:40 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> >> I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. >> Hmm - I think that entire class is obsolete and could be removed. Peter From p.j.a.cock at googlemail.com Thu Dec 13 18:54:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Dec 2012 18:54:04 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Thu, Dec 13, 2012 at 5:51 PM, Peter Cock wrote: > On Thu, Dec 13, 2012 at 5:40 PM, Peter Cock wrote: >> On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >>> >>> I see there is >>> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >>> __getitem__(i) method, but it's unclear how I should go about >>> implementing that. >>> > > Hmm - I think that entire class is obsolete and could be removed. I've marked it as deprecated, but since it doesn't really have any executable code a deprecation warning doesn't seem relevant. We can probably remove this after the next release.
https://github.com/biopython/biopython/commit/316c42aad05b9de3d3b3004ec295670691ae1804 Thanks for flagging up this bit of the code, Lucas. Going further, the SequenceIterator isn't used either, and perhaps could be dropped too? We do use the similar class in AlignIO... Regards, Peter From ben at benfulton.net Fri Dec 14 02:25:47 2012 From: ben at benfulton.net (Ben Fulton) Date: Thu, 13 Dec 2012 21:25:47 -0500 Subject: [Biopython-dev] Code coverage reporting Message-ID: On my Biopython fork, I've extended the test run on Travis to create and upload a code coverage report to GitHub. I'd like to submit a pull request to put this in the main code base, but in order to do so, I need a token generated to allow uploading the file to the biopython GitHub account. Can someone work with me on that? You can view the coverage report at http://cloud.github.com/downloads/benfulton/biopython/coverage.txt Thanks! Ben Fulton From p.j.a.cock at googlemail.com Fri Dec 14 10:58:49 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Dec 2012 10:58:49 +0000 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: On Fri, Dec 14, 2012 at 10:07 AM, Lucas Sinclair wrote: > Hello, > > Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes > an index, but it is held in memory. So it must be recomputed every > time the interpreter is reloaded. Yes, that is right. > This step is wasting enough time for me that I would like to compute > the index on my 50GB file once, and then be done with it. SQLite > really is the technology of choice for such a problem... Yes, which is why Bio.SeqIO.index_db() stores the index in SQLite. The SeqIO chapter in the Tutorial does try to explain this and the advantages compared to Bio.SeqIO.index(). Have you tried this yet? > I suppose you agree storing all this sequence information in flat > ascii files is not practical.
It may not be optimal, but it is very practical (although at the scale of next generation sequencing data less so). Peter From lucas.sinclair at me.com Fri Dec 14 10:07:55 2012 From: lucas.sinclair at me.com (Lucas Sinclair) Date: Fri, 14 Dec 2012 11:07:55 +0100 Subject: [Biopython-dev] SeqIO.InterlacedSequenceIterator In-Reply-To: References: Message-ID: Hello, Thanks for your response. Yes I looked at Bio.SeqIO.index, it makes an index, but it is held in memory. So it must be recomputed every time the interpreter is reloaded. This step is wasting enough time for me that I would like to compute the index on my 50GB file once, and then be done with it. SQLite really is the technology of choice for such a problem... I suppose you agree storing all this sequence information in flat ascii files is not practical. Actually, I found a reasonable workaround for achieving this result with these two commands: $ formatdb -i reads -p T -o T -n reads $ blastdbcmd -db reads -dbtype prot -entry "105107064179" -outfmt %f -out test.fasta But then I need to have calls to subprocess... Since I thought my first small contribution to Biopython was fun to do (https://github.com/biopython/biopython/commit/1c72a63b35db70d11c628b83a0269d1a9c6443a4), I may still feel like writing a proper solution. Would such a thing be a welcome addition to Bio.SeqIO ? If so, where would I place it ? The schema would be a SQLite file with a single table named "sequences". This table would have columns corresponding to the attributes of a SeqRecord. But you would need to get a different type of object back when calling parse than a generator, you would need an object that has a __getitem__ method. Sincerely, Lucas Sinclair, PhD student Ecology and Genetics Uppsala University On 13 déc. 2012, at 18:40, Peter Cock wrote: > On Thu, Dec 13, 2012 at 4:29 PM, Lucas Sinclair wrote: >> Hi ! >> >> I'm working a lot with fasta files. They can be large (>50GB) and contain >> lots of sequences (>40,000,000).
Often I need to get one sequence from the >> file. With a flat FASTA file this requires parsing, on average, half of the >> file before finding it. I would like to write something that solves this >> problem, and rather than making a new repository, I thought I could >> contribute to biopython. >> >> As I just wrote, the iterator nature of parsing sequence files has its >> limits. I was thinking of something that is indexed. And not some hack like >> I see sometimes where a second ".fai" file is added next to the ".fa" file. >> The natural thing to do is to put these entries in a SQLite file. The >> appraisal of such solutions is well made here: >> http://defindit.com/readme_files/sqlite_for_data.html >> >> Now I looked into the biopython source code, and it seems everything is >> based on returning a generator object which essentially has only one method: >> next() giving SeqRecords. For what I want to do, I would also need the >> get(id) method. Plus any other methods that could now be added to query the >> DB in a useful fashion (e.g. SELECT entry where length > 5). I see there is >> a class called InterlacedSequenceIterator(SequenceIterator) that contains a >> __getitem__(i) method, but it's unclear how I should go about >> implementing that. Any help/example on how to add such a format to SeqIO ? >> >> Thanks ! > > Have you looked at Bio.SeqIO.index (index held in memory) and > Bio.SeqIO.index_db (index held in an SQLite3 database), and do > they solve your needs? > > Note these only index the location of records - unlike tabix/fai indexes > which also look at the line length to be able to pull out subsequences. > This means the Bio.SeqIO indexing isn't ideal for dealing with large > records where you are only interested in small subsequences.
> > Peter From w.arindrarto at gmail.com Fri Dec 14 12:48:12 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 14 Dec 2012 13:48:12 +0100 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? In-Reply-To: References: Message-ID: Hi everyone, >> It's reproducible in my machine: Arch Linux 64 bit running >> Python3.1.5. Haven't figured out a fix yet, but trying to see if I >> can. > > Great. We haven't really proved this is down to a change in > either Python 3.1.4 or 3.1.5 but it does look likely. It's reproduced in my local 3.1.4 installation. Seems like an unfixed bug that went through to 3.1.5. >> By the way, I was wondering, what's our deprecation policy for >> Python3.x? I saw that 3.1.5 was released in 2009, and there doesn't >> seem to be any major updates coming soon. How long should we keep >> supporting Python <3.2? > > As long as it doesn't cost us much effort? If we can't solve this > issue easily that might be enough to drop Python 3.1? Fixing this seems difficult (has anyone else tried a fix?). The _io module is built-in and compiled when Python is installed, so fixing it (I imagine) may require tweaking the C-code (which requires fiddling with the actual Python installation). > My impression is that Python 3.0 is dead, and the only sizeable > group stuck with Python 3.1 will those on Ubuntu lucid (LTS is > supported through 2013 on desktops and 2015 on servers), > but as with life under Python 2.x it is fairly straightforward > to have a local/additional Python without disturbing the system > installation. > > > Since we don't yet officially support Python 3 (but we probably > should soon) we have the flexibility to recommend > either Python 3.2 or 3.3 as a baseline. Yes. I think it may be easier and better for us to officially start supporting from Python3.2 or 3.3 onwards. 
regards, Bow From christian at brueffer.de Mon Dec 17 11:05:04 2012 From: christian at brueffer.de (Christian Brueffer) Date: Mon, 17 Dec 2012 19:05:04 +0800 Subject: [Biopython-dev] Biopython AlignAce Wrapper In-Reply-To: References: <50CAC1C2.9090705@brueffer.de> <50CEE193.2010003@brueffer.de> Message-ID: <50CEFC60.8020400@brueffer.de> (CC'ing biopython-dev) Thanks for the feedback. I'd propose the following plan for the AlignAce wrapper then: 1. Submit the cleanup patches I have to give the wrapper at least a fighting chance at actually working 2. Add a BiopythonDeprecationWarning 3. Remove the wrapper after 1.61 is released (unless the situation changes, of course) Does that sound acceptable? Chris On 12/17/2012 05:25 PM, Bartek Wilczynski wrote: > Well, > > sounds like a good plan. I think the situation is hopeless: If we had > the source of AlignAce with appropriate license we could think of > supporting it ourselves, but in this situation I guess we can only > deprecate the module and phase it out... > > best > Bartek > > On Mon, Dec 17, 2012 at 10:10 AM, Christian Brueffer > wrote: >> Hi Bartek, >> >> thanks for checking. The thing is, the "new" version is actually an >> ancient version: >> >> AlignACE version 2.3 October 27, 1998 >> >> I made it work by installing Fedora Core 3 in a VM and using >> elfstatifier to bind AlignAce and all libraries into one executable. >> It works, but I doubt it's of any use these days. >> >> I wonder whether it's better to remove the wrapper. The AlignAce >> developers are unresponsive, none of the Biopython people has a >> version and from what I can see the current wrapper cannot possibly >> work. >> >> What do you think? >> >> Chris >> >> >> On 12/17/2012 05:01 PM, Bartek Wilczynski wrote: >>> >>> Hi, >>> >>> I've looked around and it seems I don't have it. We probably need to >>> "update" the parser to work with the current version of AlignACE >>> available from Harvard. Were you able to run it?
On my system, it >>> cannot find the libraries it needs... >>> >>> best >>> Bartek >>> >>> On Fri, Dec 14, 2012 at 7:05 AM, Christian Brueffer >>> wrote: >>>> >>>> Hi Bartek, >>>> >>>> I am currently cleaning up the Biopython AlignAce wrapper. Unfortunately >>>> I've been unable to obtain the latest AlignAce version since the >>>> download page disappeared and the Church lab is unresponsive. >>>> >>>> Do you happen to have a version of AlignAce 4.0 for Linux lying around, >>>> that you could send me? >>>> >>>> Thanks a lot, >>>> >>>> Chris >>> From redmine at redmine.open-bio.org Mon Dec 17 13:49:33 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 17 Dec 2012 13:49:33 +0000 Subject: [Biopython-dev] [Biopython - Bug #3401] (New) is_terminal bug in newick trees Message-ID: Issue #3401 has been reported by Aleksey Kladov. ---------------------------------------- Bug #3401: is_terminal bug in newick trees https://redmine.open-bio.org/issues/3401 Author: Aleksey Kladov Status: New Priority: Normal Assignee: Category: Target version: URL: Consider this weird Newick tree (((B,C),D))A; Here 'A' is both a root node and a terminal node (since it has only one child: ((B,C),D);). However, is_terminal for 'A' is False:
from Bio import Phylo
import cStringIO

bad_tree = '(((B,C),D))A'

t = Phylo.read(cStringIO.StringIO(bad_tree), 'newick')

for c in t.find_clades(terminal=True):
    print c,
Gives: B C D ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From MatatTHC at gmx.de Tue Dec 18 12:40:35 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 18 Dec 2012 13:40:35 +0100 Subject: [Biopython-dev] Location Parser Message-ID: Dear list, I have some problems with the GenBank parser in version 1.60. It's again nested location strings like: order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) as found in NC_003048. What happens is that the parser stalls. It seems as if it takes forever matching _re_complex_compound and never gets to the if statement that checks if order and join appear in the location string. I suggest moving the if statement before the regular expressions are tested. I remember that I posted something like this before. But I cannot remember how and if this was solved. Regards, Matthias From k.d.murray.91 at gmail.com Tue Dec 18 13:46:06 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Wed, 19 Dec 2012 00:46:06 +1100 Subject: [Biopython-dev] [biopython] TAIR (Arabidopsis) sequence retrieval module (#132) In-Reply-To: References: Message-ID: Hi Peter, Chris and the mailing list, Thanks very much for the feedback! > Query: It isn't clear to me (from a first read) what MultipartPostHandler is needed for. The arabidopsis.org server form requires the content-type to be a multipart form, not a urlencoded form, which the standard urllib2 does not handle.
I could write a custom handler, however when writing the module I found MultipartPostHandler, and figured I should use that. I may be wrong, but couldn't figure out any other way of doing it. >Minor: The module's docstring should start with a one line summary then a blank line (see PEP8 style guide). >Note: Since your unit test requires internet access, it should include these lines to work nicely in our testing framework (which allows the tests needing network access to be skipped) I'll fix the module docstring and requires_internet check tomorrow. >Why does the NCBI code exist given it is such a thin wrapper round the Bio.Entrez code - the module would be a lot simpler if it was just a wrapper for www.arabidopsis.org alone. The NCBI functions exist to get genbank files for AGIs, as TAIR's sequence retrieval only gives fasta files, so if users need/want the extra metadata a genbank file gives, they can use this module. As you've said, this is a *very* thin wrapper, so would it be better to just provide the mapping dicts in Bio.TAIR._ncbi for people to use however they see fit? >Query: Why do your methods return SeqRecord objects? Is this because the handle might return FASTA with a non-FASTA header which must be stripped off? SeqRecord handles were returned for two reasons, the first being as you said that the raw return text is not always a valid fasta file, despite my efforts to trim extraneous text. The latter is simply that it is what I required when writing it, and I could not think of a better way of returning it. (and I thought that the return of a SeqRecord allowed "pythonic" processing of results, a la the test suite). Again, happy for any suggestions. >Why do classes TAIRDirect and TAIRNCBI exist? Wouldn't module level functions be simpler (or at least, consistent with other modules like Bio.Entrez) >Style: Why introduce the mode argument and two magic values NCBI_RNA and NCBI_PROTEIN? The honest answer to both of these is personal choice.
If consistency is an issue I will reimplement as module-level functions and textual arguments respectively. Regarding the placement of modules, I'm happy for it to go wherever. I would imagine that there are other niche web interface "getters" such as this, and think your suggestion sounds great, although I can't think what we could call it. Perhaps Bio.Web.TAIR? Regards Kevin Murray On 18 December 2012 10:34, Peter Cock wrote: > Hi Kevin, > > Thanks for your code submission. I've not had a chance to play with it, > but I do have some comments/queries - some of which are perhaps just style > issues. > > Note: Since your unit test requires internet access, it should include > these lines to work nicely in our testing framework (which allows the tests > needing network access to be skipped): > > import requires_internet > requires_internet.check() > > Query: It isn't clear to me (from a first read) what MultipartPostHandler > is needed for. > > Minor: The module's docstring should start with a one line summary then a > blank line (see PEP8 style guide). > > Query: Why do classes TAIRDirect and TAIRNCBI exist? Wouldn't module > level functions be simpler (or at least, consistent with other modules like > Bio.Entrez)? > > Query: Why do your methods return SeqRecord objects? Is this because the > handle might return FASTA with a non-FASTA header which must be stripped > off? > > Style: Why introduce the mode argument and two magic values NCBI_RNA and > NCBI_PROTEIN? > > In fact I would go further and ask why does the NCBI code exist given it > is such a thin wrapper round the Bio.Entrez code - the module would be a > lot simpler if it was just a wrapper for www.arabidopsis.org alone.
> > I'm also not sure about the namespace Bio.TAIR, the old Bio.www namespace > might have been better but that was deprecated a while back, and the other > semi-natural fit under Biopython's old OBDA effort is also defunct > (attempting to catalogue a collection of sequence resources, see > http://obda.open-bio.org for background if curious). The namespace issue > at least would be worth bringing up on the dev mailing list... especially > if you can think of many other examples like this for specialised resources. > > Regards, > > Peter > > > Reply to this email directly or view it on GitHub. > > From kjwu at ucsd.edu Wed Dec 19 04:25:35 2012 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 18 Dec 2012 20:25:35 -0800 Subject: [Biopython-dev] KEGG API Wrapper In-Reply-To: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1351219962.39081.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi All, Sorry for the delay in updating this KEGG code. Michiel, I've addressed your suggestions regarding the querying code and the documentation and have committed changes that reflect this. ( https://github.com/kevinwuhoo/biopython/) There's a namespace collision created by the KEGG.list function, so I use KEGG.list_ instead. However, I'm sure there's a more elegant solution than this. Regarding the parsers, there should be a way to unify all parsers and writers for KEGG objects as they list fields for all their objects here: http://www.kegg.jp/kegg/rest/dbentry.html. Each class should extend from a parent while specifying their valid fields. Parsing all files should be generalized, but there should be field-specific code to handle the different fields so that fields like genes are handled correctly and ubiquitously. After solidifying discussion on these, I'll move the tests over to unittest too. Thanks! Kevin On Thu, Oct 25, 2012 at 7:52 PM, Michiel de Hoon wrote: > Hi Kevin, > > Thanks for the documentation! That makes everything a lot clearer.
Overall I like the querying code and I think we should add it to Biopython. > > I have a bunch of comments on the KEGG module, some on the existing code > and some on the new querying code, see below. Most of these are trivial; > some may need some further discussion. Perhaps you could let us know which > of these comments you can address, and which ones you want to skip for now? > > Once we have converged with regard to the querying code and the documentation, > I think we can import your version of the KEGG module into the main > Biopython repository and add your chapter on KEGG to the main > documentation, and continue from there on the parsers and the unit tests. > > Many thanks! > -Michiel. > > > About the querying code: > ---------------------------------- > > I would replace KEGG.query("list", KEGG.query("find", KEGG.query("conv", > KEGG.query("link", KEGG.query("info", KEGG.query("get" by the functions > KEGG.list, KEGG.find, KEGG.conv, KEGG.link, KEGG.info, and KEGG.get. > > For list, find, conv, link, and info, instead of going through > KEGG.generic_parser, I would return the result directly as a Python list. > In contrast, KEGG.get should return the handle to the results, not the > data itself. So the _q function, instead of > ... > resp = urllib2.urlopen(req) > data = resp.read() > return query_url, data > have > ... > resp = urllib2.urlopen(req) > return resp > Then the user can decide whether to parse the data on the fly with > Bio.KEGG, or read the data line by line and pick up what they are > interested in, or to get all data from the handle and save it in a file. > Note that resp will have a .url attribute that contains the url, so you > won't need the ret_url keyword. > > About the parsers: > ------------------------ > > I think that we should drop generic_parser. For link, find, conv, link, > and info, parsing is trivial and can be done by the respective functions > directly.
For get, we already have an appropriate parser for some databases > (compound, map, and enzyme), but it's easy to add parsers for the other > databases. > > For all parsers in Biopython, there is the question whether the record > should store information in attributes (as is currently done in Bio.KEGG), > or alternatively if the record should inherit from a dictionary and store > information in keys in the dictionary. Personally I have a preference for a > dictionary, since that allows us to use the exact same keys in the > dictionary as is used in the file (e.g., we can use "CLASS" as a key, while > we cannot use .class as an attribute since it is a reserved word, so we use > .classname instead). But other Biopython developers may not agree with me, > and to some extent it depends on personal preference. > > The parsers miss some key words. The ones I noticed are ALL_REAC, > REFERENCE, and ORTHOLOGY. Probably we'll find more once we extend the unit > tests. > > Remove the ';' at the end of each term in record.classname. > > Convert record.genes to a dictionary for each organism. So instead of > [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PON', > ['100190836', '100438793']), ('MCC', ['100424648', '699401']... > have > {'HSA': ['5236', '55276'], 'PTR': ['456908', '461162'], 'PON': > ['100190836', '100438793'], 'MCC': ['100424648', '699401'], ... > > Also for record.dblinks, record.disease, record.structures, use a > dictionary. > > In record.pathway, all entries start with 'PATH'. Perhaps we should check > with KEGG if there could be anything else than 'PATH' there, otherwise I > don't see the reason why it's there. Assuming that there could be something > different there, I would also use a dictionary with 'PATH' as the key. > > In record.reaction, some chemical names can be very long and extend over > multiple lines. In such cases, the continuation line starts with a '$'. The > parser should remove the '$' and join the two lines. 
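As an aside, the record.genes conversion suggested above is a one-liner, since dict() accepts any iterable of key/value pairs. A sketch using the two complete entries quoted in the message (variable names are invented for illustration):

```python
# Current representation: a list of (organism, gene_ids) tuples,
# using the first two complete entries quoted above.
genes_as_list = [('HSA', ['5236', '55276']), ('PTR', ['456908', '461162'])]

# Suggested representation: a dictionary keyed by organism code.
genes_as_dict = dict(genes_as_list)

assert genes_as_dict['HSA'] == ['5236', '55276']
assert genes_as_dict['PTR'] == ['456908', '461162']
```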
> > About the tests: > -------------------- > > We should update the data files in Tests/KEGG. This will fix some "bugs" > in these data files. > > We should switch test_KEGG.py to the unit test framework. > > We should do some more extensive testing to make sure we are not missing > some key words. > > About the documentation: > --------------------------------- > It's great that we now have some documentation. > > On page 233, I would suggest to replace the "id_" by "accession" or > something else, since the underscore in "id_" may look funky to new users. > > Also it may be better not to reuse variable names (e.g. "pathway" is used > in three different ways in the example). It's OK of course in general, but > for this example it may be more clear to distinguish the different usages > of this variable from each other. > > For repair_genes, you can use a set instead of a list throughout. > > > > > --- On *Wed, 10/24/12, Kevin Wu * wrote: > > > From: Kevin Wu > Subject: Re: [Biopython-dev] KEGG API Wrapper > To: "Peter Cock" , "Zachary Charlop-Powers" < > zcharlop at mail.rockefeller.edu>, "Michiel de Hoon" > Cc: Biopython-dev at lists.open-bio.org > Date: Wednesday, October 24, 2012, 6:38 PM > > > Hi All, > > Thanks for the comments, I've written a bit of documentation on the entire > KEGG module and have attached those relevant pages to the email. There > didn't seem like an appropriate place for examples, so I just added a new > chapter. I've also committed the updated file to GitHub. > > I did leave out the parsers due to the fact that the current parsers only > cover a small portion of possible responses from the API. Also, I'm not > confident that some of the parsers correctly retrieve all the fields. > However, I've written a really general parser that does a rough job of > retrieving fields if it's a database format returned since I find myself > reusing the code for all database formats.
It's possible to modify this to > correctly account for the different fields, but it would probably take a bit > of work to manually figure each field out. Otherwise it also parses the > tsv/flat file returned. > > Also, @zach, thanks for checking it out and testing it! > > Thanks All! > Kevin > > On Wed, Oct 17, 2012 at 4:09 AM, Peter Cock > > wrote: > > On Wed, Oct 17, 2012 at 12:55 AM, Zachary Charlop-Powers > > > wrote: > > Kevin, > > Michiel, > > > > I just tested Kevin's code for a few simple queries and it worked great. > I > > have always liked KEGG's organization of data and really appreciate this > > RESTful interface to their data; in some ways I think it is easier to use > the > > web interfaces for KEGG than it is for NCBI. Plus the KEGG coverage of > > metabolic networks is awesome. I found the examples in Kevin's test > script > > to be fairly self-explanatory, but a simple spelled-out example in the > > Tutorial would be nice. > > > > One thought, though, is that you can retrieve MANY different types of > data > > from the KEGG REST API - which means that the user will probably have to > > parse the data his/herself. Data retrieved with "list" can return lists > of > > genes or compounds or organisms, and after a cursory look these are each > > formatted differently. Also true with the 'find' command. So I think you > > were right to leave out parsers because I think they will be a moving > target > > highly dependent on the query. > > > > Thank You Kevin, > > zach cp > > Good point about decoupling the web API wrapper and the parsers - > how the Bio.Entrez module and Bio.TogoWS handle this is to return > handles for web results, which you can then parse with an appropriate > parser (e.g. SeqIO for GenBank files, Medline parser, etc). > > Note that this is a little more fiddly under Python 3 due to the text > mode distinction between unicode and binary... just something to > keep in the back of your mind. 
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From gokcen.eraslan at gmail.com Fri Dec 21 00:12:43 2012 From: gokcen.eraslan at gmail.com (Gökçen Eraslan) Date: Fri, 21 Dec 2012 01:12:43 +0100 Subject: [Biopython-dev] numpy/matlab style index arrays for Seq objects Message-ID: <50D3A97B.60108@gmail.com> Hello, During the development of a project, I have come across an issue that I want to share. As far as I know, the Bio.Seq.Seq object can only be indexed using an int or a slice object, just as regular strings: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) However, it would be really nice to be able to index Seq objects using index arrays as in numpy.array, like >>> my_indices = [0, 3, 7] >>> my_seq[my_indices] Seq('GCG', IUPACUnambiguousDNA()) (Since I'm not really familiar with the Biopython API and codebase, please ignore/forgive me if such a thing already exists.) For example in my project, I'm trying to eliminate noisy columns of an MSA FASTA file. Let's assume that I have a list of non-noisy column indices; then this would solve my problem: In [1]: from Bio import AlignIO In [2]: msa = AlignIO.read("s001.fasta", "fasta") In [3]: print msa[:, [0, 3, 4]] SingleLetterAlphabet() alignment with 5 rows and 3 columns KPG sp2 TPG sp11 SPG sp7 KPP sp6 SPG sp10 I have attached a tiny patch (~4 lines) implementing this stuff. At first, I thought of keeping the sequence string as numpy.array(list()) to be able to use the indexing mechanism of numpy, but that would be over-engineering, so I have just used a simple list comprehension trick. Regards. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: biopython-index-array-for-seq.diff Type: text/x-patch Size: 3845 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 13:09:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 13:09:47 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > Dear list, > > I have some problems with the GenBank parser in version 1.60. It's again > nested location strings like: > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > as found in NC_003048. Do you have a URL for that? This looks OK to me: http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 Perhaps the entry came from the FTP site? e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > What happens is that the parser stalls. It seems as if it takes forever to > parse _re_complex_compound and never gets to the if statement that > checks if order and join appear in the location string. > > I suggest moving the if statement before the regular expressions are > tested. > > I remember that I posted something like this before. But I cannot remember > how and whether this was solved. > > Regards, > Matthias Where similar odd locations have come up, in some cases they did seem to be NCBI bugs - could you raise a query with the NCBI for this case please? If this is valid (which I doubt), then our object model doesn't cope. If this is invalid, then Biopython should give a warning and skip this location. Right now I can't find the file to test this (see query above about where it came from). 
Regards, Peter From MatatTHC at gmx.de Fri Dec 21 15:18:45 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Fri, 21 Dec 2012 16:18:45 +0100 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: Dear Peter, you are right the current RefSeq record is valid and can be parsed. In order to reproduce old results I keep old refseq versions (of mitochondrial genomes) on hard disk. So probably this is an old refseq bug. According to the documentation ( http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.4): """ Note : location operator "complement" can be used in combination with either " join" or "order" within the same location; combinations of "join" and "order" within the same location (nested operators) are illegal. """ Since this was urgent I fixed the files manually by removing the nested files. I was not able to find a file in other RefSeq versions that can reproduce the bug (i.e. the parser seemingly takes forever [>5min] and does not raise an exception). You may still reproduce the bug by pasting the location line in another GenBank file. I agree that the desired behaviour would be a warning and skip of the feature. Regards, Matthias 2012/12/21 Peter Cock > On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt wrote: > > Dear list, > > > > I have some problems with the GenBank parser in version 1.60. Its again > > nested location strings like: > > > > > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403) > > as found in NC_003048. > > Do you have a URL for that? This looks OK to me: > http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1 > > Perhaps the entry came from the FTP site? > e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/ > > > What happens is that the parser stalls. 
It seems as if it takes forever > to > > parse _re_complex_compound in and never gets to the if statement that > > checks if order and join appears in the location string. > > > > I suggest to move the if statement before the regular expressions are > > tested. > > > > I remember that I posted something like this before. But I can not > remember > > how and if this was solved. > > > > Regards, > > Matthaas > > Were similar odd locations have come up in some cases they did > seem to be NCBI bugs - could you raise a query with the NCBI > for this case please? > > If this is valid (which I doubt), then our object model doesn't cope. > > If this is invalid, then Biopython should give a warning and skip > this location. Right now I can't find the file to test this (see > query above about where it came from). > > Regards, > > Peter > -------------- next part -------------- A non-text attachment was scrubbed... Name: NC_001326.gb Type: application/octet-stream Size: 65527 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Dec 21 15:34:48 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 15:34:48 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:18 PM, Matthias Bernt wrote: > Dear Peter, > > you are right the current RefSeq record is valid and can be parsed. In order > to reproduce old results I keep old refseq versions (of mitochondrial > genomes) on hard disk. So probably this is an old refseq bug. ... Could you email me (not the list) the old NC_003048.gb file please? Was there a similar issue in the NC_001326.gb file you just sent? It seems to load OK for me... 
Thanks, Peter From p.j.a.cock at googlemail.com Fri Dec 21 16:13:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:13:40 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt wrote: > Dear Peter, > > it's attached (from RefSeq39). For me parsing does not finish for this file > (biopython 1.6, python 2.7.3). > > Regards, > Matthias Got it, thanks. It also seems to get stuck for me too - there is a bug here :( See also: https://redmine.open-bio.org/issues/3197 Peter From p.j.a.cock at googlemail.com Fri Dec 21 16:54:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 21 Dec 2012 16:54:38 +0000 Subject: [Biopython-dev] Location Parser In-Reply-To: References: Message-ID: On Fri, Dec 21, 2012 at 4:13 PM, Peter Cock wrote: > On Fri, Dec 21, 2012 at 3:53 PM, Matthias Bernt > wrote: >> Dear Peter, >> >> it's attached (from RefSeq39). For me parsing does not finish for this file >> (biopython 1.6, python 2.7.3). >> >> Regards, >> Matthias > > Got it, thanks. It also seems to get stuck for me too - there is a bug here :( > > See also: https://redmine.open-bio.org/issues/3197 The problem seems to be in the regular expression search itself getting stuck: $ python Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.GenBank import _re_complex_compound >>> _re_complex_compound.match("order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403)") ^CTraceback (most recent call last): File "<stdin>", line 1, in <module> KeyboardInterrupt Odd. 
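[Editor's note: Matthias's suggested fix - checking for the illegal nested operator before attempting the expensive regular expression match - can be sketched in a few lines. `has_nested_operator` is a hypothetical helper, not the actual Bio.GenBank code.]

```python
def has_nested_operator(location):
    # Per the INSDC feature table definition, nesting "join" inside
    # "order" (or vice versa) is illegal, while "complement" may be
    # combined with either. Detect the illegal case cheaply before
    # running the complex-compound regex, which can backtrack badly.
    inner = location
    for op in ("order(", "join("):
        if inner.startswith(op):
            inner = inner[len(op):]
            break
    return "join(" in inner or "order(" in inner

loc = ("order(6867..6872,6882..6890,"
       "join(7224..7229,8194..8208),8401..8403)")
print(has_nested_operator(loc))                            # True
print(has_nested_operator("join(1..5,complement(10..20))"))  # False
```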
Peter From ben at bendmorris.com Mon Dec 24 16:58:19 2012 From: ben at bendmorris.com (Ben Morris) Date: Mon, 24 Dec 2012 11:58:19 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo Message-ID: Hi all, I've implemented support for two new phylogenetic tree formats: NeXML and RDF (conforming to the Comparative Data Analysis Ontology). I noticed that NeXML support was planned, but I didn't see anyone working on it on GitHub and the feature request hadn't been updated in about a year, so I went ahead and implemented a simple version. At first I tried the generateDS.py approach, but the generated writer doesn't give very much control over the output, so I ended up writing my own parser/writer using ElementTree. As for the RDF/CDAO format, AFAIK this is not a format that's supported by any other phylogenetic libraries, so I'm not sure how useful this is to everyone else. It provides a simple, standards-compliant format that can be imported to a triple store and supports annotation. We'll be using it at NESCent so I wanted to make it available to everyone else as well. The parser and writer require the Redlands Python bindings. The code is available in my fork of Biopython, https://github.com/bendmorris/biopython under branches "cdao" and "nexml." I'd love to get everyone's thoughts and see if these contributions would be a good fit for the Biopython project. 
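[Editor's note: for reference, a minimal sketch - not Ben's actual implementation - of pulling nodes and edges out of a NeXML-like document with ElementTree. The element and attribute names here are illustrative only.]

```python
import io
import xml.etree.ElementTree as ET

# A tiny NeXML-like document; real NeXML uses namespaces and
# richer attributes, omitted here for brevity.
nexml = io.BytesIO(b"""<nexml><trees><tree id="t1">
<node id="n1"/><node id="n2"/><edge source="n1" target="n2"/>
</tree></trees></nexml>""")

nodes, edges = [], []
for event, elem in ET.iterparse(nexml, events=("end",)):
    if elem.tag == "node":
        nodes.append(elem.get("id"))
    elif elem.tag == "edge":
        edges.append((elem.get("source"), elem.get("target")))
        elem.clear()  # free memory for already-processed elements

print(nodes, edges)  # ['n1', 'n2'] [('n1', 'n2')]
```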
~Ben Morris PhD student, Department of Biology University of North Carolina at Chapel Hill and the National Evolutionary Synthesis Center ben at bendmorris.com From p.j.a.cock at googlemail.com Mon Dec 24 18:05:29 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Dec 2012 18:05:29 +0000 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 4:58 PM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. Sounds good - and the librdf Redlands Python bindings do seem to be a safe choice for RDF under Python. I guess we need Eric to take a look... and some tests would be needed too. 
Thanks, Peter From eric.talevich at gmail.com Tue Dec 25 07:18:40 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 24 Dec 2012 23:18:40 -0800 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: > Hi all, > > I've implemented support for two new phylogenetic tree formats: NeXML and > RDF (conforming to the Comparative Data Analysis Ontology). > > I noticed that NeXML support was planned, but I didn't see anyone working > on it on GitHub and the feature request hadn't been updated in about a > year, so I went ahead and implemented a simple version. At first I tried > the generateDS.py approach, but the generated writer doesn't give very much > control over the output, so I ended up writing my own parser/writer using > ElementTree. > > As for the RDF/CDAO format, AFAIK this is not a format that's supported by > any other phylogenetic libraries, so I'm not sure how useful this is to > everyone else. It provides a simple, standards-compliant format that can be > imported to a triple store and supports annotation. We'll be using it at > NESCent so I wanted to make it available to everyone else as well. The > parser and writer require the Redlands Python bindings. > > The code is available in my fork of Biopython, > > https://github.com/bendmorris/biopython > > under branches "cdao" and "nexml." I'd love to get everyone's thoughts and > see if these contributions would be a good fit for the Biopython project. > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. 
I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Best, Eric From ben at bendmorris.com Fri Dec 28 15:50:02 2012 From: ben at bendmorris.com (Ben Morris) Date: Fri, 28 Dec 2012 10:50:02 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich wrote: > > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: >> >> Hi all, >> >> I've implemented support for two new phylogenetic tree formats: NeXML and >> RDF (conforming to the Comparative Data Analysis Ontology). >> >> I noticed that NeXML support was planned, but I didn't see anyone working >> on it on GitHub and the feature request hadn't been updated in about a >> year, so I went ahead and implemented a simple version. At first I tried >> the generateDS.py approach, but the generated writer doesn't give very much >> control over the output, so I ended up writing my own parser/writer using >> ElementTree. >> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by >> any other phylogenetic libraries, so I'm not sure how useful this is to >> everyone else. It provides a simple, standards-compliant format that can be >> imported to a triple store and supports annotation. We'll be using it at >> NESCent so I wanted to make it available to everyone else as well. The >> parser and writer require the Redlands Python bindings. >> >> The code is available in my fork of Biopython, >> >> https://github.com/bendmorris/biopython >> >> under branches "cdao" and "nexml." 
I'd love to get everyone's thoughts and >> see if these contributions would be a good fit for the Biopython project. > > > > Thanks for letting us know! I'll try it out soonish. Looking at the code on your nexml branch, I have a few comments: > > - The parser uses ElementTree.parse rather than iterparse, so in its current state it would not be able to parse massive files (those larger than available RAM). Worth fixing eventually? Great point. I rewrote it to use iterparse instead. > - The parser creates Newick.Tree and Newick.Clade objects, which is nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you don't have any additional attributes to attach to those classes at the moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and PhyloXMLIO.py.) Went ahead and did this as well. > - The 'confidence' or 'confidences' attribute isn't used (for e.g. bootstrap support values). Does NeXML define it? Not that I'm aware of, but I'm not sure. I searched http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything. I'm going to ask some people who know more about this than I do. ~Ben From diego_zea at yahoo.com.ar Fri Dec 28 23:33:35 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:33:35 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB Message-ID: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> One of the PDBs (I have a very large dataset of PDBs and a lot of them generate this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb And the error output is: /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. PDBConstructionWarning) /usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. 
PDBConstructionWarning) Traceback (most recent call last): File "AsignarPDBaMIfile.py", line 45, in <module> cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) File "funciones_pdb.py", line 15, in contactos_CB cadena = model[cad] File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ return self.child_dict[id] KeyError: 'A' How can this be fixed? P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think the TER line may be the cause of the problem but I'm not sure): 2893 ATOM 2455 N PHE I 8 38.110 -15.236 4.503 0.89 0.76 N 2894 TER 2456 PHE I 8 2895 HETATM 2457 O HOH E 327 10.873 -3.134 11.448 0.89 0.01 O if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } From diego_zea at yahoo.com.ar Fri Dec 28 23:59:28 2012 From: diego_zea at yahoo.com.ar (Diego Zea) Date: Fri, 28 Dec 2012 15:59:28 -0800 (PST) Subject: [Biopython-dev] Error on Bio.PDB In-Reply-To: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> References: <1356737615.28816.YahooMailNeo@web140601.mail.bf1.yahoo.com> Message-ID: <1356739168.13594.YahooMailNeo@web140606.mail.bf1.yahoo.com> Excuse me, there is no error, only a warning on a lot of PDBs. I confused the chain in my example :/ if ((dx*dp)>=(h/(2*pi))) { printf("Diego Javier Zea\n"); } >________________________________ > From: Diego Zea >To: "biopython-dev at biopython.org" >Sent: Friday, 28 December 2012, 20:33 >Subject: [Biopython-dev] Error on Bio.PDB > >One of the PDBs (I have a very large dataset of PDBs and a lot of them generate this kind of error) that gives me the error is: http://www.rcsb.org/pdb/files/2ER9.pdb > >And the error output is: >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 2895. > 
PDBConstructionWarning) >/usr/lib/pymodules/python2.7/Bio/PDB/StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 3216. > PDBConstructionWarning) >Traceback (most recent call last): > File "AsignarPDBaMIfile.py", line 45, in <module> > cmap, pdb = contactos_CB(pdb_file,pdb_cad,cutoff=1,n_salto=1) > File "funciones_pdb.py", line 15, in contactos_CB > cadena = model[cad] > File "/usr/lib/pymodules/python2.7/Bio/PDB/Entity.py", line 38, in __getitem__ > return self.child_dict[id] >KeyError: 'A' > >How can this be fixed? > >P.S.: The lines of the first warning are (at the beginning I wrote the line numbers for reference; I think the TER line may be the cause of the problem but I'm not sure): > >2893 ATOM 2455 N PHE I 8 38.110 -15.236 4.503 0.89 0.76 N >2894 TER 2456 PHE I 8 >2895 HETATM 2457 O HOH E 327 10.873 -3.134 11.448 0.89 0.01 O > >if ((dx*dp)>=(h/(2*pi))) >{ >printf("Diego Javier Zea\n"); >} >_______________________________________________ >Biopython-dev mailing list >Biopython-dev at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From redmine at redmine.open-bio.org Sun Dec 30 12:46:35 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 30 Dec 2012 12:46:35 +0000 Subject: [Biopython-dev] [Biopython - Feature #3388] add annotation and letter_annotations attributes for the Bio.Align.MultipleSeqAlignment object References: Message-ID: Issue #3388 has been updated by Peter Cock. 
Support for a generic annotation dictionary done, https://github.com/biopython/biopython/commit/793f9210696e0acc9606faeca3d6ca47a9d97813 Started work on per-column annotation as well - currently on this branch: https://github.com/peterjc/biopython/tree/per-column-annotation ---------------------------------------- Feature #3388: add annotation and letter_annotations attributes for the Bio.Align.MultipleSeqAlignment object https://redmine.open-bio.org/issues/3388 Author: saverio vicario Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: At the moment I cannot add annotations at the alignment level. An annotation could be useful for tracking info linked to the loci (i.e. the name of a domain), while a letter annotation could be useful for tracking the quality score of the alignment, or whether the sites belong to a given character set. In particular, when two alignments are merged it would be useful for the boundary of the merge to be tracked. For example, in the merge of an alignment a with 10 sites and an alignment b with 5 sites, the letter_annotations would be as follows: {locus1:'111111111100000',locus2:'000000000011111'} This could also be useful for annotating the 3 positions of codons: {pos1:'1001001001',pos2:'0100100100', pos3:'0010010010'} If this letter_annotation were supported, the annotation could be kept across merging and splitting of the alignment -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
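[Editor's note: the per-column bookkeeping the ticket asks for can be sketched with plain dicts of 0/1 strings; no Biopython API is assumed, and `merged_column_masks` is a hypothetical helper.]

```python
def merged_column_masks(widths):
    # Given the widths of the alignments being concatenated, build one
    # 0/1 mask string per locus marking which columns it contributed.
    total = sum(widths)
    masks, offset = {}, 0
    for i, width in enumerate(widths, start=1):
        masks[f"locus{i}"] = "0" * offset + "1" * width + "0" * (total - offset - width)
        offset += width
    return masks

# The ticket's example: alignment a with 10 sites merged with b with 5 sites.
print(merged_column_masks([10, 5]))
# {'locus1': '111111111100000', 'locus2': '000000000011111'}
```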