From redmine at redmine.open-bio.org  Sun Sep  2 15:20:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 2 Sep 2012 19:20:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse
	PDBs produced by PatchDock
References: <redmine.issue-3379.20120821102714@redmine.open-bio.org>
Message-ID: <redmine.journal-14952.20120902192001@redmine.open-bio.org>


Issue #3379 has been updated by João Rodrigues.


I contacted the developers of PatchDock and they updated their code. Their PDBs no longer have the double END statement, but they might have conflicting chains though: the parser will likely break if by chance both chains have id A and overlapping residue numbers. Still, a slight improvement.
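
The chain conflict described above can be checked before parsing. The following is only a rough sketch, not anything PatchDock or Biopython provides: the function names are made up for illustration, and it relies solely on the fixed-width PDB column layout (chain ID in column 22, residue number in columns 23-26, insertion code in column 27).

```python
def residue_keys(pdb_lines):
    """Collect (chain id, residue number, insertion code) for coordinate records."""
    keys = set()
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            # Fixed-width PDB columns: chain id is column 22,
            # residue number columns 23-26, insertion code column 27.
            keys.add((line[21], int(line[22:26]), line[26]))
    return keys


def find_clashes(receptor_lines, ligand_lines):
    """Return the residue keys that appear in both inputs, sorted."""
    return sorted(residue_keys(receptor_lines) & residue_keys(ligand_lines))


# Two tiny hand-written inputs that both use chain A, residue 1.
receptor = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
]
ligand = [
    "ATOM      1  N   GLY A   1      21.104  16.134  -6.504  1.00  0.00           N",
    "ATOM      2  N   GLY A   2      22.104  17.134  -6.504  1.00  0.00           N",
]
clashes = find_clashes(receptor, ligand)
```

A non-empty result would suggest renaming one of the chains before handing the complex to a parser.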
----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379

Author: David Cain
Status: New
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.57
URL: 


I apologize in advance if this technically doesn't count as a bug, as the problem arises from improperly formatted PDBs.


h3. Background

Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file.

h3. Why PDBParser fails

Utilities like ZDOCK strip a lot of data from the input files, creating a poorly formed PDB file that raises PDBConstructionWarnings but that PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were; the only thing that changes is the ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and to stop parsing when one is encountered. This means that many complexes parse cleanly but completely exclude the ligand.

h3. How to fix the problem

Now, in an ideal world, the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files?

h3. Potential change to @PDBParser._parse_coordinates@?

If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure.

If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing.

My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output.
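
As a rough idea of the kind of cleaning script mentioned above, here is a minimal sketch. It is not part of Biopython; the function name is hypothetical, and it simply drops every @END@ and @CONECT@ record and appends a single @END@, so that the second file's coordinates are no longer treated as trailer data. It discards the bonding information held in @CONECT@ records and does nothing about conflicting chain IDs.

```python
def merge_docked_pdb(lines):
    """Yield the lines of a concatenated complex with a single final END.

    CONECT and intermediate END records are dropped so that coordinate
    records from the second input file are not treated as trailer data.
    """
    for line in lines:
        record = line[:6].strip()  # record name lives in columns 1-6
        if record in ("END", "CONECT"):
            continue
        yield line.rstrip("\n")
    yield "END"


# Hand-written stand-in for a receptor file followed by a ligand file.
concatenated = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N\n",
    "CONECT    1    2\n",
    "END\n",
    "ATOM      1  N   GLY B   1      21.104  16.134  -6.504  1.00  0.00           N\n",
    "END\n",
]
cleaned = list(merge_docked_pdb(concatenated))
```

The cleaned lines could then be written to a temporary file and passed to PDBParser as usual.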


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Sun Sep  2 21:05:19 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 3 Sep 2012 01:05:19 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse
	PDBs produced by PatchDock
References: <redmine.issue-3379.20120821102714@redmine.open-bio.org>
Message-ID: <redmine.journal-14953.20120903010519@redmine.open-bio.org>


Issue #3379 has been updated by David Cain.


That's awesome! Thanks for doing that. Well, chain renumbering is definitely a problem, but I don't see any easy fix for that. I still think the "pull request":https://github.com/biopython/biopython/pull/60 is relevant for detecting otherwise malformed PDB files (additionally, parsing will still stop after the first file if it contains @CONECT@ records).


From w.arindrarto at gmail.com  Mon Sep  3 06:14:59 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 3 Sep 2012 12:14:59 +0200
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
	<87lim4h07o.fsf@fastmail.fm>
	<CAKVJ-_7b=RGpDGX3v0x5PJFWgW5dB3Otfg8Sq2Gehhg4SU2bUg@mail.gmail.com>
	<CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
Message-ID: <CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>

Hello everyone,

I'd like to update everyone on my latest SearchIO(?) developments. There
has been some progress and bug fixes since GSoC officially ended two weeks
ago. Some of them I'd like to share here:

1. I've written a draft tutorial chapter for the submodule. It's been pushed
to my development repo (https://github.com/bow/biopython/tree/searchio) and
I'm hosting the HTML temporarily on my site
(http://bow.web.id/biopython/Tutorial.html). Comments and critiques are
welcome :).

2. Back on the naming issue, I'm still using SearchIO for now. I've
experimented with other names (Bio.Search and Bio.SeqSearch), and my
impression is I like Bio.SeqSearch the most, followed by Bio.Search, and
Bio.SearchIO. It does feel confusing initially (we have SeqUtils,
SeqFeature, etc.), but after a while it's the one that feels most natural.

3. And finally, Peter and I discussed this briefly before: what if we
merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch
/ Search / SearchIO)? I felt there was a lot of overlap between this
submodule and Bio.Blast when writing the tutorial, so merging surfaced in
my thoughts again. We could put the BLAST wrappers under
Bio.SeqSearch.Applications (for example), along with other wrappers (I have
a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could
be put here as well). As for qblast (and other remote searches, like the one
provided by HMMER at the moment), we could put them in
Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone
who works with BLAST / other sequence search tools, as all the related
Biopython functionality would be grouped in one place.

This is just a thought for now, but I'd love to hear your thoughts on the
merge (and the naming ;) ).

cheers,
Bow


On Tue, Aug 21, 2012 at 6:01 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:

> On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote:
> >> Michiel;
> >>> Hi Eric, Peter,
> >>>
> >>> > How about Bio.Search, for now?
> >>>
> >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> >>> users something about what the module is for. Bio.Search could be
> >>> anything (search PubMed? search the Entrez databases? search Google?
> >>> anyway Bio.Search does not suggest that this module is about pairwise
> >>> alignments). But Peter previously mentioned that he doesn't like
> >>> Bio.Pairwise; can we convince you?
> >>
> >> I agree with Peter on this one. The module is primarily about searching
> >> a sequence database with an input via multiple methods, not about
> >> pairwise alignment of two sequences with is what Bio.Align.Pairwise
> >> suggests to me.
> >>
> >> Brad
> >
> > One potential problem with Bio.Search (on top of the concerns raised
> > here about vagueness), which Bow and I were just talking about during
> > our weekly GSoC video call, is the existence of Bio/Search.py,
> > which is obsolete and long overdue for removal. I have just
> > deprecated it (something I forgot to do before the last release):
> >
> > https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8
> >
> > We'd earlier talked about using Bio.Search as the namespace. I was
> > worried about the potential existence on a user's machine of both
> > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py
> > (aka SearchIO, the new module) and which would take precedence
> > when doing: from Bio import Search
> >
> > Given how Python module installations work, that seems highly
> > likely to occur. The good news is that the package would take
> > priority - see http://www.python.org/doc/essays/packages.html
> >
> >>>>> What If I Have a Module and a Package With The Same Name?
> >>>>>
> >>>>> You may have a directory (on sys.path) which has both a module
> >>>>> spam.py and a subdirectory spam that contains an __init__.py
> >>>>> (without the __init__.py, a directory is not recognized as a
> >>>>> package).
> >>>>> In this case, the subdirectory has precedence, and importing spam
> >>>>> will ignore the spam.py file, loading the package spam instead. If
> >>>>> you want the module spam.py to have precedence, it must be
> >>>>> placed in a directory that comes earlier in sys.path.
> >
> > So there is no technical reason to avoid Bio.Search as an
> > option for the Bio.SearchIO namespace. We could then
> > have Bio.Search.Applications for command line wrappers,
> > consistent with Bio.Phylo.Applications, Bio.Motif.Applications
> > and Bio.Align.Applications.
> >
> > Of course, Bio.Search is still perhaps too broad a name... but
> > on balance perhaps it is still better than Bio.SearchIO?
> >
> > Regards,
> >
> > Peter
>
> Hi everyone,
>
> If I may add my two cents, for now I am in favor of putting the module
> under Bio.Search. It is not the best name out there (it does sound a
> bit vague), but it's the one that seems to be the most intuitive (until
> a better alternative comes along). There were some other alternatives
> that Peter and I have discussed, but they seem less appealing to us.
> You're free to add your thoughts on these of course :) :
>
> - Bio.SeqSearch. This sounds ok, but when you consider we have
> Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes
> quite confusing quickly.
>
> - Bio.PSearch ('p' for pairwise). This one seemed the least intuitive
> of the three options, so I'm not so big on this.
>
> For now, I'm still writing everything (code, docstrings, tutorial)
> using SearchIO. I suppose it's better if we could agree on a more
> suitable name, though.
>
> On another note, I'm also in favor of using the Bio.Phylo module
> skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence
> search-related application wrappers under Applications (I actually
> prefer 'app' for better PEP8 compliance, but that's another
> discussion) and perhaps even refactor our remote search calls (e.g.
> the 'qblast' module) under Bio.Search as well.
>
> cheers,
> Bow
>
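
The precedence rule quoted above from the Python packages essay (a package directory wins over a same-named module file within the same sys.path entry) can be checked with a throwaway sketch like the one below. The name spam_demo is made up so that nothing real is shadowed, and everything is created in a temporary directory. It is only meant as a quick experiment, not as a statement about how any particular Biopython installation behaves.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # A plain module and a package with the same (hypothetical) name,
    # side by side in one directory.
    with open(os.path.join(root, "spam_demo.py"), "w") as handle:
        handle.write("KIND = 'module'\n")
    os.mkdir(os.path.join(root, "spam_demo"))
    with open(os.path.join(root, "spam_demo", "__init__.py"), "w") as handle:
        handle.write("KIND = 'package'\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        spam_demo = importlib.import_module("spam_demo")
        kind = spam_demo.KIND  # records which of the two was loaded
    finally:
        # Leave the interpreter as we found it.
        sys.path.remove(root)
        sys.modules.pop("spam_demo", None)
```

On CPython the value of kind should come from the package, which matches the essay's description.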

From p.j.a.cock at googlemail.com  Mon Sep  3 08:28:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:28:30 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
	<87lim4h07o.fsf@fastmail.fm>
	<CAKVJ-_7b=RGpDGX3v0x5PJFWgW5dB3Otfg8Sq2Gehhg4SU2bUg@mail.gmail.com>
	<CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
	<CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>
Message-ID: <CAKVJ-_6jPPdarh3XoVxKtCmoZPw1vOz5249BQtgAmr+P_gMHSg@mail.gmail.com>

On Mon, Sep 3, 2012 at 11:14 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hello everyone,
>
> I'd like to update everyone on my latest SearchIO(?) developments. There
> has been some progress and bug fixes since GSoC officially ended two weeks
> ago. Some of them I'd like to share here:
>
> 1. I've written a draft tutorial chapter for the submodule. It's been pushed
> to my development repo (https://github.com/bow/biopython/tree/searchio) and
> I'm hosting the HTML temporarily on my site
> (http://bow.web.id/biopython/Tutorial.html). Comments and critiques are
> welcome :).

Oh - excellent - I'll read that in the next few days :)

> 2. Back on the naming issue, I'm still using SearchIO for now. I've
> experimented with other names (Bio.Search and Bio.SeqSearch), and my
> impression is I like Bio.SeqSearch the most, followed by Bio.Search, and
> Bio.SearchIO. It does feel confusing initially (we have SeqUtils,
> SeqFeature, etc.), but after a while it's the one that feels most natural.


Initially Bio.SeqSearch sounds a bit long... but maybe it will
grow on me...

> 3. And finally, Peter and I discussed this briefly before: what if we
> merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch
> / Search / SearchIO)? I felt there was a lot of overlap between this
> submodule and Bio.Blast when writing the tutorial, so merging surfaced in
> my thoughts again. We could put the BLAST wrappers under
> Bio.SeqSearch.Applications (for example), along with other wrappers (I have
> a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could
> be put here as well). As for qblast (and other remote searches, like the one
> provided by HMMER at the moment), we could put them in
> Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone
> who works with BLAST / other sequence search tools, as all the related
> Biopython functionality would be grouped in one place.

As per my discussion with Bow, I'm OK with aiming to deprecate the
Bio.Blast namespace as part of introducing Bio.SeqSearch/Search/..,
although I don't have a strong preference on a naming convention for any
online functionality. Possibly www is shorter than remote and also
clear?

> This is just a thought for now, but I'd love to hear your thoughts on the
> merge (and the naming ;) ).
>
> cheers,
> Bow

Thanks Bow :)

Peter

From p.j.a.cock at googlemail.com  Mon Sep  3 08:55:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:55:07 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
Message-ID: <CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>

On Wed, Aug 29, 2012 at 6:54 PM, Sczesnak, Andrew
<Andrew.Sczesnak at med.nyu.edu> wrote:
> +1
>
> It's been over a year since I first submitted my MAF code!

Already? Ouch, my apologies.

I'm at a hackathon this week with the OBF GSoC mentors who
looked at MAF for BioRuby - looking at this for inclusion in the
next Biopython release (perhaps with a beta tag) is on my agenda.

Peter

From anaryin at gmail.com  Mon Sep  3 18:07:39 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 4 Sep 2012 01:07:39 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
Message-ID: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>

Hi all,

A quick update on some recent work. I finally found some time to work a bit
on the PDB parser and Bio.PDB in general. I started by optimizing the
current code. I ran cProfile on a script that parsed a set of structures
without headers and without element columns. I did this because one of the
optimizations rendered the current header parser useless (I replaced the
readlines call on the PDB file handle with iteration over the handle). I
still need to work a bit on the memory leak, but for now it seems pretty ok
(parsed 400-ish large structures without a glitch).
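
For anyone wanting to repeat this kind of measurement, the general shape of a cProfile run is sketched below. This is not the actual benchmark script: parse_lines is a made-up stand-in for a parser, and the input is a small in-memory handle rather than a set of PDB files. It also shows the handle being iterated lazily instead of being read with readlines.

```python
import cProfile
import io
import pstats


def parse_lines(handle):
    """Stand-in for a parser: iterate the handle lazily, no readlines()."""
    count = 0
    for line in handle:  # a file object is its own line iterator
        if line.startswith(("ATOM", "HETATM")):
            count += 1
    return count


# Small in-memory file used in place of a real structure.
handle = io.StringIO("HEADER    DEMO\n" + "ATOM      1\n" * 1000 + "END\n")

profiler = cProfile.Profile()
profiler.enable()
atom_count = parse_lines(handle)
profiler.disable()

# Collect the five most expensive entries by cumulative time as text.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
profile_text = report.getvalue()
```

Comparing two such reports, one per code version, is one way to arrive at a relative speed-up figure.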

I am attaching two pictures of the cProfile results and the two output files.
There is a nice improvement of about 25%, but this can surely be improved
further. I just replaced some methods here and there, pre-initialized the
numpy arrays, etc. I pushed this version to my github pdb_enhancements branch
<https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements>.

One big change I would propose is to eliminate the child_list/child_dict
duality. I think that keeping child_dict and generating child_list from
the sorted dict keys would be good enough. OrderedDict also looks
appropriate, but it's Py2.7+. I still need to look into this, but all
those "append" calls in the profiling hint at a nice speed-up, and also
at much cleaner code.
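
To make the child_dict-only idea a bit more concrete, here is a toy sketch. The Entity class below is hypothetical and much simpler than the real Bio.PDB classes; it only illustrates storing children in a single dict and deriving the list from the sorted keys on demand.

```python
class Entity:
    """Hypothetical container that stores its children in one dict only."""

    def __init__(self, id):
        self.id = id
        self.child_dict = {}

    def add(self, child):
        if child.id in self.child_dict:
            raise ValueError("%r defined twice" % (child.id,))
        self.child_dict[child.id] = child

    @property
    def child_list(self):
        # Derived on demand, so there is no second structure to keep in sync.
        return [self.child_dict[key] for key in sorted(self.child_dict)]


chain = Entity("A")
for number in (3, 1, 2):
    chain.add(Entity(number))
ordered_ids = [residue.id for residue in chain.child_list]
```

One trade-off to keep in mind: sorting on every access has a cost, and sorted order is not necessarily the order in which records appear in the file, which is where an ordered dictionary would behave differently.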

Let me know of your opinion if you have some time,

Cheers,

João

PS. Attached complex_1.pdb as an example of the structures in the dataset
used for this particular test.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.png
Type: image/png
Size: 166144 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.profile
Type: application/octet-stream
Size: 252112 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.png
Type: image/png
Size: 148137 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.profile
Type: application/octet-stream
Size: 273487 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: complex_1w.pdb
Type: chemical/x-pdb
Size: 649559 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0001.pdb>

From p.j.a.cock at googlemail.com  Tue Sep  4 01:56:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 06:56:55 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
Message-ID: <CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>

On Mon, Sep 3, 2012 at 11:07 PM, João Rodrigues <anaryin at gmail.com> wrote:

> One big change I would propose is to eliminate the duality
> child_list/child_dict. I think that keeping child_dict and generating
> child_list from sorted dict keys would be good enough. OrderedDict also
> looks appropriate, but it's Py2.7+.. Still need to look into this, but by
> looking at all those "append" methods in the profiling it hints at a nice
> speed up, and also at much cleaner code.
>

Where there are back-ports of the OrderedDict and other useful
classes like NamedTuple, we could probably include these as
part of our Python 2/3 compatibility code. i.e. In Bio.PDB use:

from Bio._py3k import OrderedDict

(Until we drop older versions of Python which don't come with
this). In Bio._py3k we would have something like this:

# Use in preference the system OrderedDict (Python 2.7 and 3.x),
# the backport from PyPI, or our own bundled implementation
try:
    from collections import OrderedDict
except ImportError:
    try:
        # The PyPI backport (http://pypi.python.org/pypi/ordereddict)
        # installs a module named ordereddict:
        from ordereddict import OrderedDict
    except ImportError:
        # Import local bundled implementation, e.g.
        from _ordereddict import OrderedDict

See http://code.activestate.com/recipes/576693-ordered-dictionary-for-py24/

Are there any objections to this plan?

Regards,

Peter


From anaryin at gmail.com  Tue Sep  4 01:59:36 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 4 Sep 2012 08:59:36 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
Message-ID: <CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>

Sounds great, I saw the ActiveState link before but I never thought of
including it. Thanks!

From w.arindrarto at gmail.com  Tue Sep  4 02:11:05 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 08:11:05 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
Message-ID: <CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>

Hi Peter, João,

Just a little FYI. I ran into the OrderedDict issue when I started writing
SearchIO a few months ago as well, so I added an OrderedDict implementation
in Bio._py3k (
https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c
).

The code is from the ordereddict module from PyPI at that time. I haven't
checked if it's the same as the one shown in the link (there may have been
some updates), but it seems to work fine up to now.

Hope this is useful :),
Bow


On Tue, Sep 4, 2012 at 7:59 AM, João Rodrigues <anaryin at gmail.com> wrote:

> Sounds great, I saw the active state link before but I never thought of
> including it. Thanks!


From p.j.a.cock at googlemail.com  Tue Sep  4 02:30:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 07:30:51 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
Message-ID: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>

Hello all,

Over on one of Bow's pull requests Michiel made a suggestion about
consolidating the Bio.Seq* namespace under Bio.Seq.* which we can
do by replacing Bio/Seq.py with Bio/Seq/__init__.py

See: https://github.com/biopython/biopython/pull/63#issuecomment-8252340

I agree that Bio.Seq, Bio.SeqUtils, Bio.SeqIO, Bio.SeqRecord,
and Bio.SeqFeature isn't ideal. However, changing this would be
a big disruption - so perhaps any large change like this should
also address the mixed case module names which are not PEP8
conformant (Modules should have short, all-lowercase names).

http://www.python.org/dev/peps/pep-0008/#package-and-module-names

One idea I was pondering is a new parallel namespace, ideally
bio.* but we can't use that due to case insensitive file systems
like Windows and (by default) Mac OS X. So perhaps biopy,
or bp? [I've not checked for clashes with other libraries yet.]

We could gradually move code over to the new namespace,
using imports to preserve back compatibility - but support both
namespaces during a (long) transition period.
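
A transition of this kind could look roughly like the sketch below. The package names OldDemo and newdemo are invented for the example (and differ in more than case, so it also works on case-insensitive file systems); the real code lives in the new lower-case package, and the old mixed-case module is reduced to an import that keeps existing scripts working. This is only an illustration of the mechanism, not a proposed layout.

```python
import importlib
import os
import sys
import tempfile

LAYOUT = {
    # New lower-case namespace holding the real code (hypothetical names).
    os.path.join("newdemo", "__init__.py"): "",
    os.path.join("newdemo", "seq.py"): "class Seq(str):\n    pass\n",
    # Old mixed-case namespace: a thin module re-exporting the new names.
    os.path.join("OldDemo", "__init__.py"): "",
    os.path.join("OldDemo", "Seq.py"): "from newdemo.seq import Seq\n",
}

with tempfile.TemporaryDirectory() as root:
    for relative_path, source in LAYOUT.items():
        path = os.path.join(root, relative_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write(source)

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        new_seq = importlib.import_module("newdemo.seq")
        old_seq = importlib.import_module("OldDemo.Seq")
        # Both import paths should hand back the very same class object.
        same_class = old_seq.Seq is new_seq.Seq
    finally:
        sys.path.remove(root)
        for name in ("OldDemo", "OldDemo.Seq", "newdemo", "newdemo.seq"):
            sys.modules.pop(name, None)
```

Because both names resolve to one class, isinstance checks keep working for scripts that mix old and new imports during the transition.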

What I like about this is that it allows people to make a gradual
conversion - and we wouldn't have the burden of maintaining two main
branches, as we would if we attempted a single jump to a Biopython v2.

Does this seem worth considering?

Regards,

Peter

From mjldehoon at yahoo.com  Tue Sep  4 06:27:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 4 Sep 2012 03:27:57 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
Message-ID: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>

Hi Peter,

--- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> One idea I was pondering is a new parallel namespace,
> ideally bio.* but we can't use that due to case
> insensitive file systems like Windows and (by default)
> Mac OS X. So perhaps biopy, or bp?

As you say, the ideal namespace is bio.*, so let's use that. We have been using Bio.* for more than 10 years. We should not get stuck with a non-ideal namespace for the next 10+ years because there may be some glitches switching from Bio.* to bio.*. Frankly I doubt that this will cause huge problems in practice.

> We could gradually move code over to the new namespace,
> using imports to preserve back compatibility - but support
> both namespaces during a (long) transition period.

Why do we need a transition period? It's just a matter of replacing upper case with lower case in the imports.

> What I like about this is that it allows people to make a gradual
> conversion - and we wouldn't have the burden of maintaining two main
> branches, as we would if we attempted a single jump to a Biopython v2.
> 
> Does this seem worth considering?

Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.

Best,
-Michiel.


From p.j.a.cock at googlemail.com  Tue Sep  4 06:59:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 11:59:00 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>

On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi Peter,
>
> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> One idea I was pondering is a new parallel namespace,
>> ideally bio.* but we can't use that due to case
>> insensitive file systems like Windows and (by default)
>> Mac OS X. So perhaps biopy, or bp?
>
> As you say, the ideal namespace is bio.*, so let's use
> that. We have been using Bio.* for more than 10 years.
> We should not get stuck with a non-ideal namespace for
> the next 10+ years because there may be some glitches
> switching from Bio.* to bio.*. Frankly I doubt that this
> will cause huge problems in practice.

So you'd advocate a simple switch where from one
release to the next we change all the module names
(making them lower case, perhaps with consolidation
under bio.seq too)?

This may cause some difficulties for upgrades - it may
require manual intervention to remove the old Bio folder
in order to allow creation of the new bio folder.

>> We could gradually move code over to the new namespace,
>> using imports to preserve back compatibility - but support
>> both namespaces during a (long) transition period.
>
> Why do we need a transition period? It's just a matter
> of replacing upper case with lower case in the imports.

That forces people to update all their scripts at once.
Of course, we can document how to do this so a script
would work before and after the case change, e.g.

try:
    from bio.seq import Seq
except ImportError:
    from Bio.Seq import Seq

>> What I like about this is that it allows people to make a gradual
>> conversion - and we wouldn't have the burden of maintaining two main
>> branches, as we would if we attempted a single jump to a Biopython v2.
>>
>> Does this seem worth considering?
>
> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>
> Best,
> -Michiel.
>

From p.j.a.cock at googlemail.com  Tue Sep  4 08:16:26 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 13:16:26 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
Message-ID: <CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>

On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Peter, João,
>
> Just a little FYI. I ran into the OrderedDict issue when I started writing
> SearchIO a few months ago as well, so I added an OrderedDict implementation
> in Bio._py3k
> (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
>
> The code is from the ordereddict module from PyPI at that time. I haven't
> checked if it's the same as the one shown in the link (there may have been
> some updates), but it seems to work fine up to now.
>
> Hope this is useful :),
> Bow

Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
that seems quite a good case for including it. How does this look
(on the 'od' branch in my repository)?

https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f

This differs from Bow's version in that I put the module in as a separate
file (Bio/_ordereddict.py), and that it will prefer the ordereddict package
if already installed (e.g. from PyPI).

Peter


From w.arindrarto at gmail.com  Tue Sep  4 08:36:55 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 14:36:55 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
Message-ID: <CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>

On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > Hi Peter, João,
> >
> > Just a little FYI. I ran into the OrderedDict issue when I started writing
> > SearchIO a few months ago as well, so I added an OrderedDict implementation
> > in Bio._py3k
> > (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
> >
> > The code is from the ordereddict module from PyPI at that time. I haven't
> > checked if it's the same as the one shown in the link (there may have been
> > some updates), but it seems to work fine up to now.
> >
> > Hope this is useful :),
> > Bow
>
> Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
> that seems quite a good case for including it. How does this look
> (on the 'od' branch in my repository)?
>
>
> https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f
>
> This differs from Bow's version in that I put the module in as a separate
> file (Bio/_ordereddict.py), and that it will prefer the ordereddict package
> if already installed (e.g. from PyPI).
>
> Peter

Hi Peter,

This looks good. I like the 'ordereddict' module import check prior to
using our bundled version.

One more thing I would suggest is about the namespace. I feel that in
the future, we may run into similar issues (non-Python3 compatibility
issues) since Python2.7 deprecation is still a long way off. Perhaps
create a new subpackage in the root folder (maybe Bio._compat, but I
don't have a strong preference), to keep code like this in one place?
Or we could even put Bio._py3k under this subpackage and have one
central place for compatibility-related code? This would prevent
further root namespace clutter.

regards,
Bow


From k.d.murray.91 at gmail.com  Tue Sep  4 08:57:22 2012
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Tue, 4 Sep 2012 22:57:22 +1000
Subject: [Biopython-dev] TAIR/AGI support
Message-ID: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>

Hi All,

What's the status of TAIR AGIs in BioPython (I can see no mention of them,
or support for them)? I've written a brief module which allows a user to
query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
any interest in including such functionality in BioPython?

More generally, are there any particular areas of BioPython development
which could use an extra pair of hands?

Regards
Kevin Murray

From anaryin at gmail.com  Tue Sep  4 10:19:11 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 4 Sep 2012 17:19:11 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
Message-ID: <CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>

Guys,

Looks great, I will try to 'cherry pick' that branch and merge it with
mine. I have to solve some issues with the tests, but it seems to be a
straightforward change.

Cheers,

João

On 4 Sep 2012 15:37, "Wibowo Arindrarto" <w.arindrarto at gmail.com> wrote:

> On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
> > <w.arindrarto at gmail.com> wrote:
> > > Hi Peter, João,
> > >
> > > Just a little FYI. I ran into the OrderedDict issue when I started writing
> > > SearchIO a few months ago as well, so I added an OrderedDict implementation
> > > in Bio._py3k
> > > (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
> > >
> > > The code is from the ordereddict module from PyPI at that time. I haven't
> > > checked if it's the same as the one shown in the link (there may have been
> > > some updates), but it seems to work fine up to now.
> > >
> > > Hope this is useful :),
> > > Bow
> >
> > Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
> > that seems quite a good case for including it. How does this look
> > (on the 'od' branch in my repository)?
> >
> > https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f
> >
> > This differs from Bow's version in that I put the module in as a separate
> > file (Bio/_ordereddict.py), and that it will prefer the ordereddict package
> > if already installed (e.g. from PyPI).
> >
> > Peter
>
> Hi Peter,
>
> This looks good. I like the 'ordereddict' module import check prior to
> using our bundled version.
>
> One more thing I would suggest is about the namespace. I feel that in
> the future, we may run into similar issues (non-Python3 compatibility
> issues) since Python2.7 deprecation is still a long way off. Perhaps
> create a new subpackage in the root folder (maybe Bio._compat, but I
> don't have a strong preference), to keep code like this in one place?
> Or we could even put Bio._py3k under this subpackage and have one
> central place for compatibility-related code? This would prevent
> further root namespace clutter.
>
> regards,
> Bow
>


From p.j.a.cock at googlemail.com  Tue Sep  4 10:42:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 15:42:35 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
Message-ID: <CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>

On Tue, Sep 4, 2012 at 3:19 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Guys,
>
> Looks great, I will try to 'cherry pick' that branch and merge it with mine.

I've applied it to the master now, which might make it easier.
I think Bow might have a point about namespaces - although the
underscore modules are 'private', they still show up in dir(Bio)
so having a single folder for our inter-Python version compatibility
code seems sensible if we add any more (e.g. NamedTuples).

> I have to solve some issues with the tests, but it seems to be a
> straightforward change.

Great.

Peter


From anaryin at gmail.com  Tue Sep  4 12:02:42 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 4 Sep 2012 19:02:42 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
Message-ID: <CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>

I agree, we could move them to a folder then?

On 4 Sep 2012 17:42, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Tue, Sep 4, 2012 at 3:19 PM, João Rodrigues <anaryin at gmail.com> wrote:
> > Guys,
> >
> > Looks great, I will try to 'cherry pick' that branch and merge it with
> > mine.
>
> I've applied it to the master now, which might make it easier.
> I think Bow might have a point about namespaces - although the
> underscore modules are 'private', they still show up in dir(Bio)
> so having a single folder for our inter-Python version compatibility
> code seems sensible if we add any more (e.g. NamedTuples).
>
> > I have to solve some issues with the tests, but it seems to be a
> > straightforward change.
>
> Great.
>
> Peter
>


From p.j.a.cock at googlemail.com  Tue Sep  4 19:54:56 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Sep 2012 00:54:56 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
Message-ID: <CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>

On Tue, Sep 4, 2012 at 5:02 PM, João Rodrigues <anaryin at gmail.com> wrote:
> I agree, we could move them to a folder then?
>

OK - I moved Bio/_py3k.py to Bio/_py3k/__init__.py and also the
new file Bio/_ordereddict.py to Bio/_py3k/ordereddict.py - this
avoids having to change any of our import statements:
https://github.com/biopython/biopython/commit/1a9bd6eeab0de3283bd1e6cc28c7754fbffefe2d

Peter


From redmine at redmine.open-bio.org  Tue Sep  4 23:19:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 5 Sep 2012 03:19:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3382] (New)
	Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3
Message-ID: <redmine.issue-3382.20120905031953@redmine.open-bio.org>


Issue #3382 has been reported by Alexander Campbell.

----------------------------------------
Bug #3382: Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3
https://redmine.open-bio.org/issues/3382

Author: Alexander Campbell
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


At present, calling @Bio.PDB.PDBList.retrieve_pdb_file()@ on any PDB ID will fail, giving the following traceback:
<pre>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-4ecf112b58e0> in <module>()
----> 1 pdbl.retrieve_pdb_file('1FAT')

/usr/lib64/python3.2/site-packages/Bio/PDB/PDBList.py in retrieve_pdb_file(self, pdb_code, obsolete, compression, uncompress, pdir)
    245         gz = gzip.open(filename, 'rb')
    246         out = open(final_file, 'wb')
--> 247         out.writelines(gz.read())
    248         gz.close()
    249         out.close()

TypeError: 'int' does not support the buffer interface
</pre>

This occurs because in Python3 a file opened in binary mode will return type @bytes@ for @read()@, or a list of @bytes@ objects for @readlines()@. The @writelines()@ method expects an iterable whose elements can each be written to the file: @str@ objects in Python2, @bytes@ objects for a binary-mode file in Python3. This worked in Python2 as a @str@ can be viewed as a sequence of @str@ objects, and so line 247 effectively wrote one character at a time for the single @str@ yielded by @read()@. In Python3 iterating over a @bytes@ yields @int@ objects, leading to the TypeError.

This issue can be fixed by changing line 247's call to @writelines()@ to just @write()@. This does not break functionality in Python2, according to my testing with Python 3.2.3 and 2.7.3 on Fedora 17.
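
The difference can be reproduced outside Biopython with a few lines. This is a minimal sketch, not the PDBList code itself: the @decompress@ helper and the demo file names are made up for illustration.

```python
import gzip
import os
import tempfile


def decompress(gz_path, out_path):
    """Copy the decompressed contents of gz_path to out_path (hypothetical helper)."""
    with gzip.open(gz_path, "rb") as gz, open(out_path, "wb") as out:
        # write() takes the whole bytes object in one go; writelines() would
        # iterate over it and, on Python 3, hand the file one int at a time.
        out.write(gz.read())


# Build a tiny gzip file so the example is self-contained.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "demo.ent.gz")
out_path = os.path.join(workdir, "demo.ent")
payload = b"HEADER    DEMO\nEND\n"
with gzip.open(gz_path, "wb") as handle:
    handle.write(payload)

decompress(gz_path, out_path)
with open(out_path, "rb") as handle:
    roundtrip = handle.read()

# On Python 3, iterating over bytes yields int objects, which is what trips
# up writelines() in the traceback above.
first_item = next(iter(payload))
```

Using @write()@ also avoids the character-at-a-time behaviour that the Python2 code relied on.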

There are 4 more instances of @writelines()@ calls in the codebase, but in each of those cases the argument is a list or generator of @str@ or @bytes@ objects, so I don't think they will raise an error. I haven't tested them though.


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From w.arindrarto at gmail.com  Wed Sep  5 05:53:36 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 5 Sep 2012 11:53:36 +0200
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
Message-ID: <CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>

Hi guys,

If I may add my two cents on this issue,  I think it's also a chance
to rectify all other namespace issues that we may have (not just
PEP8-related).

For instance:

* In the root namespace, we now have Bio.Align and Bio.AlignIO. Since
we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the
Github discussion[1]), I suppose we should do the same with Bio.Align
as well (perhaps into bio[py].seq.align or bio[py].align).

* With the change above, we might also want to change some of the
submodule names completely. For example, if we merge Bio.Align into
bio[py].align we'll have bio[py].align.applications, which I
personally think could be shortened into bio[py].align.app.

* As per the Github discussion[1] as well, perhaps Bio.SeqUtils
should also be merged as Seq object methods.

There may be other changes as well, but the bottom line is all these
changes will be quite considerable. As such, I think we could go all
the way and be explicit in stating that the changes will be
incompatible with previous Biopython versions (i.e. old scripts will
break).

As for bio.* and biopy.*, if we do decide to go all the way, bio.*
seems like a better choice since there will be other incompatible
changes anyway. But if we eventually decide to only fix PEP8-related
issues while keeping compatibility with older versions, I'm leaning
more towards biopy.*.

regards,
Bow

[1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340

On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Hi Peter,
>>
>> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> One idea I was pondering is a new parallel namespace,
>>> ideally bio.* but we can't use that due to case
>>> insensitive file systems like Windows and (by default)
>>> Mac OS X. So perhaps biopy, or bp?
>>
>> As you say, the ideal namespace is bio.*, so let's use
>> that. We have been using Bio.* for more than 10 years.
>> We should not get stuck with a non-ideal namespace for
>> the next 10+ years because there may be some glitches
>> switching from Bio.* to bio.*. Frankly I doubt that this
>> will cause huge problems in practice.
>
> So you'd advocate a simple switch where from one
> release to the next we change all the module names
> (making them lower case, perhaps from consolidation
> under bio.seq too)?
>
> This may cause some difficulties for upgrades - it may
> require manual intervention to remove the old Bio folder
> in order to allow creation of the new bio folder.
>
>>> We could gradually move code over to the new namespace,
>>> using imports to preserve back compatibility - but support
>>> both namespaces during a (long) transition period.
>>
>> Why do we need a transition period? It's just a matter
>> of replacing upper case with lower case in the imports.
>
> That forces people to update all their scripts at once.
> Of course, we can document how to do this so a script
> would work before and after the case change, e.g.
>
> try:
>     from bio.seq import Seq
> except ImportError:
>     from Bio.Seq import Seq
>
>>> What I like about this is it allows people to make a
>>> gradual
>>> conversion - and we don't have the burden of two main
>>> branches if we attempted a single jump to a Biopython v2.
>>>
>>> Does this seem worth considering?
>>
>> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>>
>> Best,
>> -Michiel.
>>

From anaryin at gmail.com  Wed Sep  5 16:24:23 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 5 Sep 2012 23:24:23 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
Message-ID: <CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>

Hello all,

Some news.

A. The OrderedDict implementation is quite slow. It essentially slows down
the parser by 30%, rendering all the improvements I had done moot.
Therefore, although it's a great idea, a major reason for these updates is
speed so I think it might not be worth it.

B. As an alternative to this, I implemented the following. Entity has now
only child_dict, and is a general dictionary. However, each Object (Model,
Chain, Residue, Atom) gets their own __cmp__ method overloaded with the
information in the "_sort" methods that already existed. In this way, a
simple sorting of the values of the dictionary returns an ordered list. I
tweaked the Atom.__cmp__ to first sort N CA C O atoms and then
alphabetically. I also added that inorganic atoms such as Calcium come at
the end. This will make things a bit nicer when Calcium is involved for
example. Finally, the only downside to this seems to be that we lose the
order in which residues are inserted, i.e. if residue 151 is the first of
the PDB file and all others range from 1-150, then this first 151 is going
to be placed at the end when you iterate. However, from my experience and
in my opinion, not only is this logical, but it also rarely happens in real
PDB files.

C. I am strongly in favour of removing most (if not all) set/get methods
and replace them by direct attribute access. For instance,
"atom.get_parent() --> atom.parent". Saves some space in the code and makes
things more transparent.

D. I edited the PDBParser to tweak a few things, nothing major. The file
handle is now treated as an iterator throughout the parsing and it should
be more memory-friendly. The line counter is still preserved. I also added
a test to make the get_header argument actually work.

E. Other small changes here and there that I can't recall right now.

F. Unittests are breaking everywhere. Checking why, but it all seems
related to this sorting issue.
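
For reference, the atom ordering described in point B could be sketched along
these lines. This is not the code in my branch: it uses a sort key function
instead of __cmp__ (which Python 3 no longer supports), and the function name
and the INORGANIC set are made up for illustration.

```python
# Backbone atoms come first, in this fixed order.
BACKBONE = {"N": 0, "CA": 1, "C": 2, "O": 3}

# Illustrative, incomplete set of element symbols treated as inorganic ions.
INORGANIC = {"CA", "MG", "ZN", "FE", "NA", "K"}


def atom_sort_key(name, element):
    """Return a tuple that sorts backbone first, then by name, ions last."""
    if element.upper() in INORGANIC:
        # The element is checked before the name, so a calcium ion named
        # "CA" is not mistaken for an alpha carbon named "CA".
        return (2, 0, name)
    if name in BACKBONE:
        return (0, BACKBONE[name], name)
    return (1, 0, name)


# (atom name, element) pairs in scrambled order.
atoms = [("CB", "C"), ("CA", "CA"), ("O", "O"), ("CA", "C"),
         ("N", "N"), ("C", "C"), ("CG", "C")]
ordered = sorted(atoms, key=lambda atom: atom_sort_key(*atom))
```

Sorting the values of a plain child_dict with such a key gives a stable
iteration order without needing an ordered dictionary.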

Cheers,

João


From p.j.a.cock at googlemail.com  Wed Sep  5 19:31:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 00:31:42 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
Message-ID: <CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>

On Wed, Sep 5, 2012 at 9:24 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> Some news.
>
> A. The OrderedDict implementation is quite slow. It essentially slows down
> the parser by 30%, rendering all the improvements I had done moot.
> Therefore, although it's a great idea, a major reason for these updates is
> speed so I think it might not be worth it.

Which Python was that? i.e. The OrderedDict from the standard lib
(which I hope is optimised), or the back port (which might be slower).

> B. As an alternative to this, I implemented the following. Entity has now
> only child_dict, and is a general dictionary. However, each Object (Model,
> Chain, Residue, Atom) gets their own __cmp__ method overloaded with the
> information in the "_sort" methods that already existed. In this way, a
> simple sorting of the values of the dictionary returns an ordered list. I
> tweaked the Atom.__cmp__ to first sort N CA C O atoms and then
> alphabetically. I also added that inorganic atoms such as Calcium come at
> the end. This will make things a bit nicer when Calcium is involved for
> example. Finally, the only downside to this seems to be that we lose the
> order in which residues are inserted, i.e. if residue 151 is the first of the
> PDB file and all others range from 1-150, then this first 151 is going to be
> placed at the end when you iterate. However, from my experience and in my
> opinion, not only is this logical, but it also rarely happens in real PDB
> files.

That seems risky - but see if you can sort out what is happening
with the unit tests (below).

I'm not sure about your atomic sorting... it seems a bit magic. Would
sorting on atomic number be nicer (and simple)?

> C. I am strongly in favour of removing most (if not all) set/get methods and
> replace them by direct attribute access. For instance, "atom.get_parent()
> --> atom.parent". Saves some space in the code and makes things more
> transparent.

It would also look less like Java code ;)

I like this plan - but initially define and document the new properties,
and deprecate the old get/set methods. Without that you'll break
almost every PDB-using script out there.
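
As a rough sketch of that transition (the DemoAtom class is a stand-in for
illustration, not the real Bio.PDB class, and the warning text is made up):

```python
import warnings


class DemoAtom(object):
    """Stand-in class showing an attribute plus a deprecated getter."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # new style: plain attribute access

    def get_parent(self):
        """Old style getter, kept so existing scripts continue to run."""
        warnings.warn(
            "get_parent() is deprecated; use the .parent attribute instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.parent


# Old scripts still work, but are told what to change.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    atom = DemoAtom("CA", parent="residue-placeholder")
    old_style = atom.get_parent()

new_style = atom.parent
```

The getters could then be removed after a release or two of warnings.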

> D. I edited the PDBParser to tweak a few things, nothing major. The file
> handle is now treated as an iterator throughout the parsing and it should be
> more memory-friendly. The line counter is still preserved. I also added a
> test to make the get_header argument actually work.
>
> E. General things here and there that I can't just remember..
>
> F. Unittests are breaking everywhere. Checking why, but it all seems related
> to this sorting issue.
>
> Cheers,
>
> João

Regards,

Peter


From p.j.a.cock at googlemail.com  Wed Sep  5 20:10:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:10:57 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
	<CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>
	<1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID: <CAKVJ-_5LE9uu9b5ExDKf63yOFymZWFb+CjHC-6OGqTTM6Gxh-g@mail.gmail.com>

On Wed, Sep 5, 2012 at 8:19 PM, Sczesnak, Andrew wrote:
> Yeah, it would be great if this module could finally be included.
> I've e-mailed the list numerous times asking what would be
> necessary to include it and have done all you and Brad have
> asked. I've watched you include bits and pieces of code from
> other contributors quickly and without much scrutiny, so I
> can't help but feel singled out. What is the logic in delaying
> this? We've heard from people who are already using the
> code and have asked when it will be pulled. Is it serving the
> community to not even include the basic reader/writer? Am
> I wasting my time? Is it your goal to actively discourage
> contributions?

In my mind, the main technical issue regarding MAF and AlignIO
and the common alignment object is the lack of a common way
of handling the idea of start/end (and sometimes strand) for
each sequence (in a consistent co-ordinate system using Python
counting). Evidently I haven't managed to adequately convey my
interpretation/concern.

Some file formats like EMBOSS' have these numbers explicitly
but we're not parsing them:
http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html

In the case of "fasta-m10" the numbers are stored in private
properties as a 'short term' hack:
http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html

Others like Stockholm have identifier/start-end as a combined
name (but this is not mandatory). Here the start and end are
being stored in the annotations dictionary (as unparsed strings,
still using 1-based co-ordinates).

In MAF the start/end are explicit and much more important.
It would be near pointless to parse the file ignoring these.
Maybe your approach is good enough for MAF, and we
should have adopted it as is, and delayed better integration
with the other AlignIO formats?

i.e. This is a general limitation in AlignIO and the object
model, somewhat annoying in the formats already supported,
but information critical to the MAF format.

I was expecting a convention for this to fall out of Bow's GSoC
work for 'pairwise alignments' in SearchIO - but the object
model he came up with was not SeqRecord based (many
of the file formats he was using didn't include sequences).

Right now my inclination is still to add a location property to
the SeqRecord, usually a FeatureLocation, but it could also
be the proposed CompoundLocation for more complex cases.
The question then is if/when this would be propagated, e.g.
SeqRecord slicing/addition.
http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html

So the wheels are turning, but slowly. I have not had as
much time to dedicate to this as I would like - but other
smaller or less inter-connected things are much easier to
review and merge.

Peter

From p.j.a.cock at googlemail.com  Wed Sep  5 20:34:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:34:19 +0100
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound
 feature locations
In-Reply-To: <CAKVJ-_5Mc-gSREu6xUA9qE4rJg-befDM3GjAB7Cs2FxC+cnvHg@mail.gmail.com>
References: <CAKVJ-_6yFU7kjgT2Orikra95r_qEq18BkYhEnDdikbKK3NYU5w@mail.gmail.com>
	<CALfq9t+FiUnnBeDfzspddNrJr1a3sky57npiJMGo4KEPhre3Ww@mail.gmail.com>
	<CAKVJ-_5YkFMnyE1oeCXS6rK1L+9UJH+B04LQ30zymhATAp8MdA@mail.gmail.com>
	<CALfq9tJ_MNWkDyAXNJbtSzQY+bgbvoaRNo3gmCPd3_EMHGH2rw@mail.gmail.com>
	<CAKVJ-_5Mc-gSREu6xUA9qE4rJg-befDM3GjAB7Cs2FxC+cnvHg@mail.gmail.com>
Message-ID: <CAKVJ-_6gwzWDNJbBCP=Tv3dOnzimvbWyVcKtnLiF5EmMM7v1_w@mail.gmail.com>

On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>> I agree that an "upgraded" FeatureLocation could be more
>> elegant.
>
> It could turn out to be simpler having just one location object...
> certainly worth trying out before committing this branch as is.

Such a new "upgraded" FeatureLocation would need to hold
a list/tuple of its parts (rather like the proposed CompoundLocation),
and those could simply be tuples of start, end, strand, db_ref
etc (essentially everything currently held in a FeatureLocation).

I'm not sure that that is any better than the new class
CompoundLocation holding a list of existing FeatureLocation
objects.

On the bright side, the branch still works nicely with the
extra BioSQL tests I added.

One of the issues worth a bit more discussion is the start
and end values of the CompoundLocation - which I am
considering making act as the left/minimum and right/
maximum boundary of the region spanned by the parts.

For normal forward strand features this does give the
biological start and end, likewise for reverse strand
features but inverted (location's start gives the biological
end). i.e. for *most* features this means no change to
the current behaviour.

My proposal would mean that for a feature spanning
the origin on a circular genome of length N, the start
would be 0 and the end N.

Similarly for weird cases from trans-splicing, the
start/end coordinates would give the total region
spanned. As shown below, sometimes that happens
to match the current behaviour, but in other cases
the current behaviour isn't useful anyway.

Adopting start/end as the spanned region makes a
lot of sense for things like drawing features in a
region of interest, or other more abstract tasks
doing feature/region intersection. Here knowing the
min/max boundaries of the region spanned is more
useful than any attempt to capture the biological
start/end of the feature.

Note that already for the simple FeatureLocation for
reverse strand features we have start < end, i.e. the
start coordinate property does NOT represent the
biological starting point.

Under the proposed CompoundLocation behaviour,
the desirable property of the FeatureLocation that
start < end would also hold for compound locations.
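
To make the proposal concrete, here is a minimal sketch of the spanned-region
calculation. Part and spanned_region are made-up names for illustration, not
the API on the branch; the coordinates are the zero-based values from the two
NC_000932 examples at the end of this message.

```python
from collections import namedtuple

# Hypothetical stand-in for one part of a compound location.
Part = namedtuple("Part", ["start", "end", "strand"])


def spanned_region(parts):
    """Return (start, end) as the left-most and right-most boundaries."""
    return min(p.start for p in parts), max(p.end for p in parts)


# join(complement(69611..69724),139856..140650)
mixed = [Part(69610, 69724, -1), Part(139855, 140650, +1)]

# complement(join(97999..98793,69611..69724))
reverse = [Part(97998, 98793, -1), Part(69610, 69724, -1)]

mixed_span = spanned_region(mixed)
reverse_span = spanned_region(reverse)
```

With this definition start < end holds for both examples, regardless of
strand or of the order in which the parts are listed.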

Pathological examples at the end,

Regards,

Peter

P.S.  One of the advantages of the CompoundLocation
is when constructing the location you don't give the
overall start/end - these are inferred from the list of parts
automatically. Currently the GenBank/EMBL parser
is having to do this.

P.P.S. I've also confirmed Lenna's testing that sum
of feature locations works if we define integer
addition with locations (so that sum can include
zero and several locations), see:

https://github.com/peterjc/biopython/commit/dc6bc658141cc42e7e6802bbe8baf6c87a6874c0
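
The reason integer addition matters is that the built-in sum starts from the
integer 0. A minimal sketch, assuming that adding an integer shifts every
coordinate by that offset (so adding 0 changes nothing); DemoLocation is a
made-up class for illustration, not the code in the commit above:

```python
class DemoLocation(object):
    """Hypothetical location holding a list of (start, end) parts."""

    def __init__(self, parts):
        self.parts = [tuple(part) for part in parts]

    def __add__(self, other):
        if isinstance(other, DemoLocation):
            # location + location: concatenate the parts
            return DemoLocation(self.parts + other.parts)
        if isinstance(other, int):
            # location + offset: shift every coordinate
            return DemoLocation([(s + other, e + other) for s, e in self.parts])
        return NotImplemented

    def __radd__(self, other):
        # Called for 0 + location, which is the first step sum() performs.
        if isinstance(other, int):
            return self + other
        return NotImplemented


total = sum([DemoLocation([(0, 10)]),
             DemoLocation([(20, 30)]),
             DemoLocation([(40, 50)])])
```

Without __radd__ the first step of sum, 0 + location, would raise a
TypeError.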

-----------------------------------------------------------------
Trans-splicing: Mixed Strands

An example where the range/span idea is simpler is
mixed strand features like this trans-spliced example
from NC_000932 (in our unit tests),

join(complement(69611..69724),139856..140650)

What would you expect as the start/end here? The
biological start is base 69724 (one based) and the
last base is 140650. Currently:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "gb").features[135]
>>> print f.location
[69610:140650]
>>> f.location.start
ExactPosition(69610)
>>> f.location.end
ExactPosition(140650)
>>> for sub in f.sub_features: print sub.location
...
[69610:69724](-)
[139855:140650](+)

Here the end value does match the last base in the
feature following the biological order - the start value
is actually a base in the middle of the combined
sequence. In fact, for this example the start/end
are already acting like the range/span idea.

-----------------------------------------------------------------
Trans-splicing: Reverse strand

The example above is a real corner case, and so is this
single strand trans-splicing example, also in NC_000932,
which is a bit like a circular genome origin spanning
annotation:

complement(join(97999..98793,69611..69724))

With the current master branch:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "genbank").features[1]
>>> print f.location
[97998:69724](-)
>>> f.location.start
ExactPosition(97998)
>>> f.location.end
ExactPosition(69724)
>>> for sub in f.sub_features: print sub.location
...
[97998:98793](-)
[69610:69724](-)

Notice that we do not have start < end as you might
expect. However the start and end DO capture the
biological end and start (order inverted - this is on
the reverse strand). To verify this I find it helps to
transform the GenBank style location:

complement(join(97999..98793,69611..69724))

into the old EMBL equivalent:

join(complement(69611..69724),complement(97999..98793))

i.e. The first base is 69724 (one based counting), and
the last base is 97999 (one based counting). So if
you wanted to look at the upstream or downstream
(assuming that makes sense for a trans-spliced
gene), the current start/end values are useful (but
you have to choose start vs end dependent on the
strand).

On the other hand, the range of co-ordinate values
is 69611 to 98793 (one based, inclusive). Therefore
one might expect start 69610 and end 98793 (Python
counting), giving the spanned region.
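
The biological first and last bases quoted for the two trans-splicing
examples can be checked with a small sketch. It assumes the parts are
given as (start, end, strand) tuples in Python counting, listed in
biological order as in the EMBL style form; the function name is
hypothetical.

```python
def biological_ends(parts):
    """Return the one-based first and last base, following strand."""

    def first_base(start, end, strand):
        # On the reverse strand a part is read from its right-hand end.
        return end if strand == -1 else start + 1

    def last_base(start, end, strand):
        return start + 1 if strand == -1 else end

    return first_base(*parts[0]), last_base(*parts[-1])


# join(complement(69611..69724),139856..140650)
mixed = [(69610, 69724, -1), (139855, 140650, +1)]
# join(complement(69611..69724),complement(97999..98793))
reverse = [(69610, 69724, -1), (97998, 98793, -1)]

print(biological_ends(mixed))
print(biological_ends(reverse))
```

This gives the upstream/downstream view, which is a different question
from the min/max boundaries of the spanned region.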

From chapmanb at 50mail.com  Wed Sep  5 20:37:57 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:37:57 -0400
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
	<CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
Message-ID: <87wr08x9y2.fsf@fastmail.fm>


Hi all;
I don't know if there's going to be a clean way around mucking up the
API for older scripts if we make this change.

If we want to do this my thoughts would be:

- Use the 'bio' module since that's the cleanest.
- Hack together something that will remove old 'Bio' modules on install
  of the new version.
- Write a Biopython1to2 script that will fix the imports on older
  scripts to the new module structure.
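
A very rough sketch of what the import-fixing step of such a script
could look like, assuming the only change is lower-casing the module
path. The function name is hypothetical, and a real tool would also
need a mapping for modules that get moved or merged.

```python
import re

# Match the dotted module path on "import ..." / "from ... import" lines.
_IMPORT_RE = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+)(Bio(?:\.\w+)*)\b", re.MULTILINE
)


def lowercase_bio_imports(source):
    """Lower-case Bio.* module paths in import statements only."""
    return _IMPORT_RE.sub(
        lambda match: match.group(1) + match.group(2).lower(), source
    )


old_script = "from Bio.Seq import Seq\nimport Bio.SeqIO as sio\n"
print(lowercase_bio_imports(old_script))
```

Names imported from the package (e.g. "from Bio import SeqIO") and
attribute access later in the script are left alone here, which is
part of why a simple search and replace would not be enough.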

However, my vote would be to stick with everything as is. I know we
aren't PEP8 compliant but things aren't that awful that we need an
upheaval. I wish Python library installs weren't so messy that we could
do this more cleanly,
Brad

> Hi guys,
>
> If I may add my two cents on this issue,  I think it's also a chance
> to rectify all other namespace issues that we may have (not just
> PEP8-related).
>
> For instance:
>
> * In the root namespace, we now have Bio.Align and Bio.AlignIO. Since
> we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the
> Github discussion[1]), I suppose we should do the same with Bio.Align
> as well (perhaps into bio[py].seq.align or bio[py].align).
>
> * With the change above, we might also want to change some of the
> submodule names completely. For example, if we merge Bio.Align into
> bio[py].align we'll have bio[py].align.applications, which I
> personally think could be shortened into bio[py].align.app.
>
> * As per the Github discussion[1] as well, perhaps Bio.SeqUtils
> should also be merged as Seq object methods.
>
> There may be other changes as well, but the bottom line is all these
> changes will be quite considerable. As such, I think we could go all
> the way and be explicit in stating that the changes will be
> incompatible with previous Biopython versions (i.e. old scripts will
> break).
>
> As for bio.* and biopy.*, if we do decide to go all the way, bio.*
> seems like a better choice since there will be other incompatible
> changes anyway. But if we eventually decide to only fix PEP8-related
> issues while keeping compatibility with older versions, I'm leaning
> more towards biopy.*.
>
> regards,
> Bow
>
> [1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340
>
> On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>> Hi Peter,
>>>
>>> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>> One idea I was pondering is a new parallel namespace,
>>>> ideally bio.* but we can't use that due to case
>>>> insensitive file systems like Windows and (by default)
>>>> Mac OS X. So perhaps biopy, or bp?
>>>
>>> As you say, the ideal namespace is bio.*, so let's use
>>> that. We have been using Bio.* for more than 10 years.
>>> We should not get stuck with a non-ideal namespace for
>>> the next 10+ years because there may be some glitches
>>> switching from Bio.* to bio.*. Frankly I doubt that this
>>> will cause huge problems in practice.
>>
>> So you'd advocate a simple switch where from one
>> release to the next we change all the module names
>> (making them lower case, perhaps from consolidation
>> under bio.seq too)?
>>
>> This may cause some difficulties for upgrades - it may
>> require manual intervention to remove the old Bio folder
>> in order to allow creation of the new bio folder.
>>
>>>> We could gradually move code over to the new namespace,
>>>> using imports to preserve back compatibility - but support
>>>> both namespaces during a (long) transition period.
>>>
>>> Why do we need a transition period? It's just a matter
>>> of replacing upper case with lower case in the imports.
>>
>> That forces people to update all their scripts at once.
>> Of course, we can document how to do this so a script
>> would work before and after the case change, e.g.
>>
>> try:
>>     from bio.seq import Seq
>> except ImportError:
>>     from Bio.Seq import Seq
>>
>>>> What I like about this is it allows people to make a
>>>> gradual conversion - and we don't have the burden of two
>>>> main branches if we attempted a single jump to a Biopython v2.
>>>>
>>>> Does this seem worth considering?
>>>
>>> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>>>
>>> Best,
>>> -Michiel.
>>>
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From chapmanb at 50mail.com  Wed Sep  5 20:31:58 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:31:58 -0400
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
	<CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>
	<1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID: <87zk54xa81.fsf@fastmail.fm>


Andrew;

> Yeah, it would be great if this module could finally be included. I've
> e-mailed the list numerous times asking what would be necessary to
> include it and have done all you and Brad have asked. I've watched you
> include bits and pieces of code from other contributors quickly and
> without much scrutiny, so I can't help but feel singled out. What is
> the logic in delaying this? We've heard from people who are already
> using the code and have asked when it will be pulled. Is it serving
> the community to not even include the basic reader/writer? Am I
> wasting my time? Is it your goal to actively discourage contributions?

In addition to Peter's technical comments, from a personal side I hope
you don't take offense. We definitely value contributions and your work.

Some changes can end up being tricky because of the need to work with or
fix previous non-optimal design decisions. When they require extra
attention and decisions this can make it hard to allocate time for
folks that volunteer on the project.

This is definitely nothing personal and I hope you don't feel that way.
My GFF parser has languished for even longer for similar reasons.

I think the long term solution for this is incorporating beta code so we
can get these in, recognize the contributions, make them available,
and still give wiggle room to improve the design before locking into
an API that we need to support long term.

Thanks again for all the work. We do appreciate it,
Brad

From chapmanb at 50mail.com  Wed Sep  5 20:45:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:45:19 -0400
Subject: [Biopython-dev] TAIR/AGI support
In-Reply-To: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
References: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
Message-ID: <87txvcx9ls.fsf@fastmail.fm>


Kevin;
Thanks for the e-mail and offers of code. Always happy to have other
folks involved with the project.

> What's the status of TAIR AGIs in BioPython (I can see no mention of them,
> or support for them)? I've written a brief module which allows a user to
> query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> any interest in including such functionality in BioPython?

Is the code available on GitHub to get a better sense of all the
functionality it supports? Do you have an idea where it would fit best?
As a tair submodule inside of Bio.Entrez, or somewhere else?

> More generally, are there any particular areas of BioPython development
> which could use an extra pair of hands?

Following the mailing list for discussions on current projects is the
best way to get a sense of what different folks are working on. The
issue tracker also has open issues and features that could use attention
if anything there strikes your fancy:

https://redmine.open-bio.org/projects/biopython

Hope this helps,
Brad


From p.j.a.cock at googlemail.com  Wed Sep  5 20:57:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:57:19 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <87wr08x9y2.fsf@fastmail.fm>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
	<CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
	<87wr08x9y2.fsf@fastmail.fm>
Message-ID: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:37 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Hi all;
> I don't know if there's going to be a clean way around mucking up the
> API for older scripts if we make this change.
>
> If we want to do this my thoughts would be:
>
> - Use the 'bio' module since that's the cleanest.
> - Hack together something that will remove old 'Bio' modules on install
>   of the new version.
> - Write a Biopython1to2 script that will fix the imports on older
>   scripts to the new module structure.

I really don't like using "bio" since (due to Python's use of
folders for package names) you couldn't in general also have
the old code available under "Bio". i.e. This forces a hard
switch on our users which is a very bad idea I think.

Thus my suggestion of something else like "biopy" (although
the Mac's autocorrection keeps turning it into biopsy  which
would be annoying - grin), or if not already taken "bp".

To expand on my earlier email, the transition structure I
had in mind was that we'd have something like this:

biopy/seq/__init__.py - real code for Seq object etc

Bio/Seq/__init__.py - just "from biopy.seq import Seq"
and a deprecation warning.
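
A minimal sketch of such a shim, runnable without creating any
packages on disk. The module names biopy_demo_seq and Bio_demo_Seq
and the placeholder Seq class are stand-ins invented for this
example; only the pattern (warn, then re-export) is the point.

```python
import sys
import types
import warnings

# Stand-in for the new-style module holding the real code.
new_module = types.ModuleType("biopy_demo_seq")


class Seq(str):
    """Placeholder for the real Seq class."""


new_module.Seq = Seq
sys.modules["biopy_demo_seq"] = new_module

# What the legacy module body might contain.
LEGACY_SHIM_SOURCE = '''
import warnings
warnings.warn(
    "This module has moved; please import from the new namespace.",
    DeprecationWarning,
)
from biopy_demo_seq import Seq
'''


def load_legacy_shim():
    """Execute the shim source as if the old module were imported."""
    legacy = types.ModuleType("Bio_demo_Seq")
    exec(LEGACY_SHIM_SOURCE, legacy.__dict__)
    return legacy
```

Old scripts keep working because the legacy module exposes the very
same class object, and they see the warning only if deprecation
warnings are enabled.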

> However, my vote would be to stick with everything as is. I know we
> aren't PEP8 compliant but things aren't that awful that we need an
> upheaval. I wish Python library installs weren't so messy that we could
> do this more cleanly,
> Brad

That does seem safer, and we can still do the less invasive
restructuring discussed, e.g.

Bio/Seq.py -> Bio/Seq/__init__.py allowing us to (gradually)
move Bio.Seq* things under Bio.Seq, while preserving the
legacy imports under a deprecation warning.

Also if we're considering moving Bio.SeqIO to Bio.Seq, as
Bow points out, we'd want to do Bio/AlignIO.py -> Bio.Align
(perhaps pushing the core objects into Bio/Align/_objects.py
or similar but exposing them in the current namespace
location).

Regards,

Peter

From p.j.a.cock at googlemail.com  Wed Sep  5 21:34:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 02:34:50 +0100
Subject: [Biopython-dev] SeqRecord locations;
	was: Beta code in the official releases?
Message-ID: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> In my mind, the main technical issue regarding MAF and AlignIO
> and the common alignment object is the lack of a common way
> of handling the idea of start/end (and sometimes strand) for
> each sequence (in a consistent co-ordinate system using Python
> counting). Evidently I haven't managed to adequately convey my
> interpretation/concern.
>
> Some file formats like EMBOSS' have these numbers explicitly
> but we're not parsing them:
> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>
> In the case of "fasta-m10" the numbers are stored in private
> properties as a 'short term' hack:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>
> Others like Stockholm have identifier/start-end as combined
> names (but this is not mandatory). Here the start and end are
> being stored in the annotations dictionary (as unparsed strings,
> still using 1-based co-ordinates).
>
> In MAF the start/end are explicit and much more important.
> It would be near pointless to parse the file ignoring these.
> Maybe your approach is good enough for MAF, and we
> should have adopted it as is, and delayed better integration
> with the other AlignIO formats?
>
> i.e. This is a general limitation in AlignIO and the object
> model, somewhat annoying in the formats already supported,
> but information critical to the MAF format.
>
> I was expecting a convention for this to fall out of Bow's GSoC
> work for 'pairwise alignments' in SearchIO - but the object
> model he came up with was not SeqRecord based (many
> of the file formats he was using didn't include sequences).
>
> Right now my inclination is still to add a location property to
> the SeqRecord, usually a FeatureLocation, but it could also
> be the proposed CompoundLocation for more complex cases.
> The question then is if/when this would be propagated, e.g.
> SeqRecord slicing/addition.
> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>
> So the wheels are turning, but slowly. I have not had as
> much time to dedicate to this as I would like - but other
> smaller or less inter-connected things are much easier to
> review and merge.

To expand on the SeqRecord.location property idea, I am
thinking about (in the typical use cases) using a normal
FeatureLocation object (from Bio.SeqFeature) where the
start, end or strand are in the same co-ordinate system
as the sequence of the SeqRecord.

i.e. For a protein fragment, they would be in amino acids.
For a nucleotide fragment, they would be in base pairs.

Note that you might want to describe the CDS region
for a protein sequence (which would be possible even
for a join using the proposed CompoundLocation), so
maybe 'location' is the wrong name here, perhaps
'fragment' or 'subregion', or something is clearer?

When I talked about adding SeqRecords, and what would
the combined SeqRecord's location be, we could use
FeatureLocation addition (as defined on the branch for
CompoundLocation objects).

For slicing a SeqRecord, provided len(record.location)
== len(record), this is well defined. However, I expect
that quite often if used for alignments, what we will have
instead is len(record.location) == len(record.seq.ungapped())
so we might be able to update the sub-record's location
if we count the gap characters and factor them in. This
equality could be verified in the SeqRecord __init__
(which would require the gap character, but the AlignIO
parsers should all set that).
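
A sketch of the gap counting this would involve, using plain strings
and integers instead of SeqRecord and FeatureLocation objects. It
assumes a forward strand region and 0 <= i <= j <= len(gapped_seq);
the function name is hypothetical.

```python
def slice_region(gapped_seq, start, end, i, j, gap="-"):
    """Return (start, end) describing gapped_seq[i:j]."""
    ungapped_length = len(gapped_seq) - gapped_seq.count(gap)
    if end - start != ungapped_length:
        # The equality that could be verified in __init__.
        raise ValueError("region length does not match ungapped sequence")
    letters_before = i - gapped_seq[:i].count(gap)
    letters_inside = (j - i) - gapped_seq[i:j].count(gap)
    new_start = start + letters_before
    return new_start, new_start + letters_inside


# Four letters covering 10..14 (Python counting), sliced as [1:5].
print(slice_region("AC--GT", 10, 14, 1, 5))
```

Only the letters count towards the coordinates, so the two gap
characters inside the slice do not move the end position.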

I would like slicing to update the start/end because
slicing alignment objects seems to be a quite common
operation - so if you started from an alignment file
using start/end (like Stockholm or MAF) it would be
good to update these fields for the sub-alignment.

This feels like it would work, but would it be useful or
just over engineering? Would a simple static location
property which is not automatically propagated in
SeqRecord manipulations be enough (at least initially)?

If so, is Brad's suggestion to just use special values in
the annotations dictionary a simpler way forward (where
we already have policies in place for handling generic
annotation during SeqRecord annotation - in general
dropping it)?

If so, would this be keys 'start', 'end', 'strand' for
integer start and end using Python counting, and
a strand value of +1 or -1 for forward and reverse?
[We could use strand None for unavailable as in
the SeqFeature location object, but I think no entry
in the dictionary is nicer here].
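
As a sketch of that convention, converting one-based inclusive values
(as found in the file formats) into such dictionary entries might look
like this. The helper name is hypothetical; the keys are the ones
suggested above.

```python
def region_annotations(start_one_based, end_one_based, strand=None):
    """Build annotation entries using Python counting."""
    annotations = {
        "start": start_one_based - 1,  # one-based inclusive -> zero-based
        "end": end_one_based,          # an inclusive end is unchanged
    }
    if strand is not None:
        if strand not in (+1, -1):
            raise ValueError("strand should be +1 or -1")
        annotations["strand"] = strand
    # No "strand" key at all when the strand is unavailable.
    return annotations


print(region_annotations(12, 45))
print(region_annotations(12, 45, strand=-1))
```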

Peter

From anaryin at gmail.com  Thu Sep  6 01:52:34 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Sep 2012 08:52:34 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
Message-ID: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>

Hey,

> Which Python was that? i.e. The OrderedDict from the standard lib
> (which I hope is optimised), or the back port (which might be slower).

Both. I also found it strange and googled it:
http://stackoverflow.com/questions/8176513/ordereddict-performance-compared-to-deque
Apparently OrderedDict is pure Python, not C like dict, thus the
difference.


> That seems risky - but see if you can sort out what is happening
> with the unit tests (below).

What Bio.PDB does right now is rely on the list to iterate over things.
Thus, you get the order in which you read the PDB file. However, if you
sort it using each object's sort method, the following rules
apply:

Atom.py - N CA C O first, then alphabetically
Residue.py - Amino acids and nucleic acids first, then heteroatoms.
Chain.py - Empty chains last.
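
The Atom.py rule can be sketched as a sort key on atom names. This is
plain Python on strings, not the Bio.PDB code, and the names
BACKBONE_ORDER and atom_sort_key are made up for the example.

```python
# Backbone atoms come first, in this fixed order.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}


def atom_sort_key(name):
    """Backbone atoms by rank, everything else alphabetically after."""
    return (BACKBONE_ORDER.get(name, len(BACKBONE_ORDER)), name)


print(sorted(["CB", "O", "CA", "N", "C", "CG"], key=atom_sort_key))
```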

These are already in place somewhere in the code. I just used them to
overload the __cmp__ method, with a couple of additions because I
personally disagree with the following:

Atom.py - Inorganic atoms should come out last. For simplicity.
Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get
in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
PDB files already have weird large numbers for water and ions for example,
so these come out last anyway. Pushing all HETATMs to the end will
sometimes disrupt the "natural" order of things, for instance modified
residues. Magic perhaps :)

I sorted out all relevant issues with the unittests. I had a small problem
with build_peptides because of this HETATM last rule, so I took it away and
now it works. All tests pass except 4: 2 because of the header, which is
not read decently right now, and 2 because of the ordering which is
explicit in the assert statement of the test. So it's a matter of changing
these assertions and they will work.


> It would also look less like Java code ;)
>
> I like this plan - but initially define and document the new properties,
> and deprecate the old get/set properties. Without that you'll break
> almost every PDB using script out there.

How do I deprecate the old ones? Is there a DeprecationWarning or so?
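
One common pattern, sketched here with the standard library only: keep
the old getter as a thin wrapper around the new property and have it
emit a DeprecationWarning. The Atom class below is a toy stand-in, not
the Bio.PDB class.

```python
import warnings


class Atom:
    """Toy class with a new property and a deprecated getter."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    def get_name(self):
        # stacklevel=2 points the warning at the caller's line.
        warnings.warn(
            "get_name() is deprecated; use the .name property instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.name
```

Existing scripts calling get_name() continue to run, and the warning
tells their authors what to switch to.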

Just a reminder, if you want to test/check the code, it's on my
GitHub: https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements

Cheers,

João


From w.arindrarto at gmail.com  Thu Sep  6 01:57:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 6 Sep 2012 07:57:04 +0200
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
Message-ID: <CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>

Hi guys,

To add my two cents, I am in favor of creating a dynamic SeqRecord
coordinate system using SeqFeature. However, I think it would also be
good if we set some limitations as there are so many ways that slicing
and addition could be used to create new SeqRecords, and anticipating
all these scenarios may create an over-engineered (and probably
slower) SeqRecord.

Some scenarios that I can think of now:

1. Slicing SeqRecord objects using step values > 1 (e.g. new_seq = seq[1:120:3])
2. Adding two or more SeqRecord objects with noncontiguous coordinates
(i.e. end coordinate of the first sequence is not directly followed by
the second sequence's start coordinate), and then slicing the resulting
object

So maybe some limitations that we could set are:

1. Only update the coordinates if slicing step is 1 (or -1), otherwise
discard it.
2. Only update the coordinates if addition is between contiguous
coordinates, otherwise discard it.

Personally, I think this would cover most use cases for slicing while
allowing us to keep it simple.
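
A sketch of these two limitations using (start, end, strand) tuples in
Python counting. The function names are hypothetical, the sequence is
assumed to be ungapped, and flipping the strand for a step of -1 is
just one possible choice.

```python
def sliced_region(region, length, key):
    """Region for record[key], or None if the step is not 1 or -1."""
    start, end, strand = region
    if end - start != length:
        raise ValueError("region length does not match sequence length")
    i, j, step = key.indices(length)
    if step == 1:
        low, high = i, max(i, j)
    elif step == -1:
        low, high = min(i, j) + 1, i + 1
        strand = -strand
    else:
        return None  # limitation 1: discard the coordinates
    return start + low, start + high, strand


def combined_region(first, second):
    """Region for first + second, or None if they are not contiguous."""
    if first[1] != second[0] or first[2] != second[2]:
        return None  # limitation 2: discard the coordinates
    return first[0], second[1], first[2]


print(sliced_region((100, 110, +1), 10, slice(2, 5)))
print(sliced_region((100, 110, +1), 10, slice(1, 10, 3)))
print(combined_region((0, 5, +1), (5, 9, +1)))
```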

As for the name, 'region' sounds better than 'location'. Maybe
'coverage'? I don't have any strong preference between these, but
'subregion' doesn't feel that nice.

Finally, for the coordinate system, I imagine it will use Python's
coordinate system, too? (zero-based, half-open, and the parsers /
writers should do the conversion). Should we also reverse the
coordinates if the objects are sliced in reverse (e.g.
seqrecord[::-1]) or simply invert the strand value but keep the
coordinates unchanged?

regards,
Bow


On Thu, Sep 6, 2012 at 3:34 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> In my mind, the main technical issue regarding MAF and AlignIO
>> and the common alignment object is the lack of a common way
>> of handling the idea of start/end (and sometimes strand) for
>> each sequence (in a consistent co-ordinate system using Python
>> counting). Evidently I haven't managed to adequately convey my
>> interpretation/concern.
>>
>> Some file formats like EMBOSS' have these numbers explicitly
>> but we're not parsing them:
>> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>>
>> In the case of "fasta-m10" the numbers are stored in private
>> properties as a 'short term' hack:
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>>
>> Others like Stockholm have identifier/start-end as combined
>> names (but this is not mandatory). Here the start and end are
>> being stored in the annotations dictionary (as unparsed strings,
>> still using 1-based co-ordinates).
>>
>> In MAF the start/end are explicit and much more important.
>> It would be near pointless to parse the file ignoring these.
>> Maybe your approach is good enough for MAF, and we
>> should have adopted it as is, and delayed better integration
>> with the other AlignIO formats?
>>
>> i.e. This is a general limitation in AlignIO and the object
>> model, somewhat annoying in the formats already supported,
>> but information critical to the MAF format.
>>
>> I was expecting a convention for this to fall out of Bow's GSoC
>> work for 'pairwise alignments' in SearchIO - but the object
>> model he came up with was not SeqRecord based (many
>> of the file formats he was using didn't include sequences).
>>
>> Right now my inclination is still to add a location property to
>> the SeqRecord, usually a FeatureLocation, but it could also
>> be the proposed CompoundLocation for more complex cases.
>> The question then is if/when this would be propagated, e.g.
>> SeqRecord slicing/addition.
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>>
>> So the wheels are turning, but slowly. I have not had as
>> much time to dedicate to this as I would like - but other
>> smaller or less inter-connected things are much easier to
>> review and merge.
>
> To expand on the SeqRecord.location property idea, I am
> thinking about (in the typical use cases) using a normal
> FeatureLocation object (from Bio.SeqFeature) where the
> start, end or strand are in the same co-ordinate system
> as the sequence of the SeqRecord.
>
> i.e. For a protein fragment, they would be in amino acids.
> For a nucleotide fragment, they would be in base pairs.
>
> Note that you might want to describe the CDS region
> for a protein sequence (which would be possible even
> for a join using the proposed CompoundLocation), so
> maybe 'location' is the wrong name here, perhaps
> 'fragment' or 'subregion', or something is clearer?
>
> When I talked about adding SeqRecords, and what would
> the combined SeqRecord's location be, we could use
> FeatureLocation addition (as defined on the branch for
> CompoundLocation objects).
>
> For slicing a SeqRecord, provided len(record.location)
> == len(record), this is well defined. However, I expect
> that quite often if used for alignments, what we will have
> instead is len(record.location) == len(record.seq.ungapped())
> so we might be able to update the sub-record's location
> if we count the gap characters and factor them in. This
> equality could be verified in the SeqRecord __init__
> (which would require the gap character, but the AlignIO
> parsers should all set that).
>
> I would like slicing to update the start/end because
> slicing alignment objects seems to be a quite common
> operation - so if you started from an alignment file
> using start/end (like Stockholm or MAF) it would be
> good to update these fields for the sub-alignment.
>
> This feels like it would work, but would it be useful or
> just over engineering? Would a simple static location
> property which is not automatically propagated in
> SeqRecord manipulations be enough (at least initially)?
>
> If so, is Brad's suggestion to just use special values in
> the annotations dictionary a simpler way forward (where
> we already have policies in place for handling generic
> annotation during SeqRecord annotation - in general
> dropping it)?
>
> If so, would this be keys 'start', 'end', 'strand' for
> integer start and end using Python counting, and
> a strand value of +1 or -1 for forward and reverse?
> [We could use strand None for unavailable as in
> the SeqFeature location object, but I think no entry
> in the dictionary is nicer here].
>
> Peter

From mjldehoon at yahoo.com  Thu Sep  6 02:31:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 5 Sep 2012 23:31:57 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>
Message-ID: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>

[Brad]
> Hack together something that will remove old 'Bio' modules
> on install of the new version.

We could check in setup.py if we can import Bio, and ask the user to remove the old Biopython installation before proceeding. Since we can tell the user exactly which directory to remove, this would be straightforward. I would prefer this to removing the directory automatically.
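
A minimal sketch of such a check, assuming it is enough to locate the
package without importing it. The function name is hypothetical, and
the real setup.py would still need to decide what to do with the
answer.

```python
import importlib.util
import os


def find_existing_install(package_name):
    """Return the directory of an installed package, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


old_location = find_existing_install("Bio")
if old_location is not None:
    print("Please remove the old installation in %s first" % old_location)
```

Reporting the directory and stopping leaves the actual removal to the
user, rather than deleting anything automatically.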

[Peter]
> I really don't like using "bio" since (due to Python's use
> of folders for package names) you couldn't in general also
> have the old code available under "Bio". i.e. This forces
> a hard switch on our users which is a very bad idea I think.

I don't see why a user would like to have both an old Biopython under Bio and a new Biopython under bio. Unless he wants to run some scripts with the old Biopython and other scripts with the new Biopython, but I don't see the point of that.

[Peter]
> Thus my suggestion of something else like "biopy" [...]
> , or if not already taken "bp".

[Brad]
> However, my vote would be to stick with everything as is.

If the choice is between "bp", "biopy", or "Bio", then I agree with Brad; I prefer keeping a nice but PEP8-noncompliant module name "Bio" rather than switching to a PEP8-compliant but less attractive name like "biopy" or "bp".

Best,
-Michiel.

From p.j.a.cock at googlemail.com  Thu Sep  6 03:06:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:06:07 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>
	<1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>

On Thu, Sep 6, 2012 at 7:31 AM, Michiel de Hoon wrote:
> [Brad]
>> Hack together something that will remove old 'Bio' modules
>> on install of the new version.
>
> We could check in setup.py if we can import Bio, and ask
> the user to remove the old Biopython installation before
> proceeding. Since we can tell the user exactly which directory
> to remove, this would be straightforward. I would prefer this
> to removing the directory automatically.

I agree automatically removing the old install is risky.

For single user machines, where the single user has only a
small collection of scripts this isn't such an issue. For any
shared server, or user with lots of Biopython scripts (some
of which may have been written by different people), you
would be forced into a mass change at one go.

You would also have considerable hassle later on with any
attempt to re-run old scripts.

> [Peter]
>> I really don't like using "bio" since (due to Python's use
>> of folders for package names) you couldn't in general also
>> have the old code available under "Bio". i.e. This forces
>> a hard switch on our users which is a very bad idea I think.
>
> I don't see why a user would like to have both an old
> Biopython under Bio and a new Biopython under bio.
> Unless he wants to run some scripts with the old Biopython
> and other scripts with the new Biopython, but I don't see
> the point of that.

Really? That is exactly what I am concerned about (both
for single user machines like my desktop, and shared
machines like our servers). How about the common
situation of wanting to re-run old scripts from old
projects on new data?

If we were just changing the case, this might not be
too complex (it would still be a frustrating transition
period), but if we're also moving things around at the
same time it is too much I feel.

> [Peter]
>> Thus my suggestion of something else like "biopy" [...]
>> , or if not already taken "bp".
>
> [Brad]
>> However, my vote would be to stick with everything as is.
>
> If the choice is between "bp", "biopy", or "Bio", then
> I agree with Brad; I prefer keeping a nice but
> PEP8-noncompliant module name "Bio" rather than
> switching to a PEP8-compliant but less attractive
> name like "biopy" or "bp".

There is 'biopython' but it is rather long? No other ideas
from anyone else?

How about over the next year we gradually consolidate
modules under the existing mixed case names? e.g.
move Bio.AlignIO functionality and Bio.Align, and
Bio.Seq* under Bio.Seq (leaving backwards compatible
imports supported but deprecated).

Here's a further (and slightly more radical) idea: We
stick with using 'Bio' and the current mixed case
names on Python 2, but adopt 'bio' and other PEP8
compatible names for Python 3 (as a uniform
strict automatic rule: mixed case -> lower case)?
i.e. Do this as part of our 2to3 process.

Some nasty downside might occur to me later
but right now it seems like a neat idea... other
than not being quite in line with the expectation
that Python 3 should not be used as an excuse
to make API changes. Too radical?

Regards,

Peter

From p.j.a.cock at googlemail.com  Thu Sep  6 03:16:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:16:41 +0100
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
	<CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
Message-ID: <CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:57 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi guys,
>
> To add my two cents, I am in favor of creating a dynamic SeqRecord
> coordinate system using SeqFeature. However, I think it would also be
> good if we set some limitations as there are so many ways that slicing
> and addition could be used to create new SeqRecords, and anticipating
> all these scenarios may create an over-engineered (and probably
> slower) SeqRecord.
>
> Some scenarios that I can think now:
>
> 1. Slicing SeqRecord objects using step values > 1
> (e.g. new_seq = seq[1:120:3])

Absolutely - here I would expect to lose the location information.
We already have similar restrictions in the SeqRecord slicing
for how SeqFeatures are handled.

> 2. Adding two or more SeqRecord objects with noncontiguous coordinate
> (i.e. end coordinate of the first sequence is not directly followed by
> the second sequence's start coordinate), and then slice the resulting
> object

Adding *could* be done via the CompoundLocation, although that
raises the question of whether nicely-abutting locations should
be merged, e.g. in GenBank notation 100..201 and 202..300 could
be 100..300 rather than join(100..201,202..300) which is what my
CompoundLocation code currently does.
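
To make the merging question concrete, here is a small sketch using plain (start, end) tuples in zero-based, half-open coordinates. It is not the CompoundLocation code on my branch, and merge_abutting and to_genbank are hypothetical helper names:

```python
def merge_abutting(parts):
    """Merge (start, end) pairs where one part ends where the next begins.

    Coordinates are zero-based and half-open, so GenBank 100..201
    is (99, 201) and GenBank 202..300 is (201, 300).
    """
    merged = []
    for start, end in parts:
        if merged and merged[-1][1] == start:
            # Previous part stops exactly where this one starts: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def to_genbank(parts):
    """Format a non-empty list of parts in one-based, inclusive notation."""
    spans = ["%d..%d" % (start + 1, end) for start, end in parts]
    if len(spans) == 1:
        return spans[0]
    return "join(%s)" % ",".join(spans)


parts = [(99, 201), (201, 300)]
print(to_genbank(parts))                  # join(100..201,202..300)
print(to_genbank(merge_abutting(parts)))  # 100..300
```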

> So maybe some limitations that we could set are:
>
> 1. Only update the coordinates if slicing step is 1 (or -1), otherwise
> discard it.

Yep.

> 2. Only update the coordinates if addition is between contiguous
> coordinates, otherwise discard it.

That does seem simple - especially as the primary driver for this
is multiple sequence alignments and those only support simple
continuous locations with a start and end.

> Personally, I think this would cover most use cases for slicing while
> allowing us to keep it simple.

That is perhaps a good balance (and as a bonus means we
don't have to link this to the CompoundLocation unless we
want to).
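
As a sketch of the slicing rule, assuming the region is stored as a plain (start, end) tuple in zero-based, half-open coordinates: slice_region is a hypothetical helper, and it only keeps the region for forward slices with step 1. A step of -1 would also need the strand and the flip discussed further down.

```python
def slice_region(region, index):
    """Region of record[index], or None when it cannot be kept.

    region is the (start, end) of the record on its parent sequence,
    index is the slice object applied to the record.
    """
    start, end = region
    lo, hi, step = index.indices(end - start)
    if step != 1:
        return None  # strided or reversed slice: discard the region
    hi = max(lo, hi)  # an empty slice gives an empty region
    return (start + lo, start + hi)


print(slice_region((100, 300), slice(10, 20)))     # (110, 120)
print(slice_region((100, 300), slice(1, 120, 3)))  # None
```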

> As for the name, 'region' sounds better than 'location'. Maybe
> 'coverage'? I don't have any strong preference between these, but
> 'subregion' doesn't feel that nice.

Region seems fine.

> Finally, for the coordinate system, I imagine it will use Python's
> coordinate system, too? (zero-based, half-open, and the parsers /
> writers should do the conversion).

Yes. I'm suggesting using the FeatureLocation object (from
Bio.SeqFeature), which does this.

> Should we also reverse the
> coordinates if the objects are sliced in reverse (e.g.
> seqrecord[::-1]) or simply inverse the strand value but keep the
> coordinates unchanged?

The strand changes, and the start/end must also be recalculated
from the length of the parent sequence. The FeatureLocation
has a (private) _flip method to do this. In some cases we won't
have the parent sequence length, so would have to drop the
location.
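
The arithmetic is simple enough to sketch without Biopython. This is not the private _flip method itself; Region and flip are stand-in names, with coordinates zero-based and half-open:

```python
from collections import namedtuple

# Hypothetical stand-in for a location: [start, end) plus strand (+1 or -1).
Region = namedtuple("Region", ["start", "end", "strand"])


def flip(region, parent_length):
    """Return the region as seen on the reversed parent sequence.

    Returns None if the parent length is unknown, in which case the
    location has to be dropped.
    """
    if parent_length is None:
        return None
    return Region(parent_length - region.end,
                  parent_length - region.start,
                  -region.strand)


print(flip(Region(2, 5, +1), 10))  # Region(start=5, end=8, strand=-1)
```

Flipping twice with the same parent length gives back the original region, which is a handy sanity check.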

I'll have a go at implementing this on a branch in the next
few hours (unless something more pressing comes up at
the BioHackathon). As it happens this overlaps nicely with
some of the group discussion about how to represent feature
locations in RDF.

Regards,

Peter

From p.j.a.cock at googlemail.com  Thu Sep  6 03:21:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:21:16 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
Message-ID: <CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:52 AM, João Rodrigues <anaryin at gmail.com> wrote:
>
>> It would also look less like Java code ;)
>>
>> I like this plan - but initially define and document the new properties,
>> and deprecate the old get/set properties. Without that you'll break
>> almost every PDB using script out there.
>
> How do I deprecate the old ones? Is there a DeprecationWarning or so?
>

Yes, we use Bio.BiopythonDeprecationWarning rather than the
default DeprecationWarning because the latter is now silent
by default. Grep the code for example usage, see also:
http://biopython.org/wiki/Deprecation_policy
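
As a minimal sketch of the pattern (not copied from the Biopython source): the Atom class below is a toy, and the warning class is a local stand-in so the example runs without Biopython installed. In real code you would import BiopythonDeprecationWarning from Bio instead.

```python
import warnings


class BiopythonDeprecationWarning(Warning):
    """Local stand-in for Bio.BiopythonDeprecationWarning.

    It derives from Warning rather than DeprecationWarning, so the
    default filters do not silence it.
    """


class Atom(object):
    """Toy class with a plain attribute and a deprecated getter."""

    def __init__(self, name):
        self.name = name

    def get_name(self):
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(
            "Atom.get_name() is deprecated; use the Atom.name "
            "attribute instead.",
            BiopythonDeprecationWarning,
            stacklevel=2,
        )
        return self.name
```

Old scripts calling get_name() keep working and see the warning, while new code reads atom.name directly.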

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 05:36:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 10:36:41 +0100
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
	<CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
	<CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>
Message-ID: <CAKVJ-_7kojE7wJuxQncVc0+3pE+d6KBKrtvoefm6R30w+XzmMw@mail.gmail.com>

On Thu, Sep 6, 2012 at 8:16 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I'll have a go at implementing this on a branch in the next
> few hours (unless something more pressing comes up at
> the BioHackathon). As it happens this overlaps nicely with
> some of the group discussion about how to represent feature
> locations in RDF.
>

I've made a start, will do more later:
https://github.com/peterjc/biopython/tree/sr_loc

Peter

From mjldehoon at yahoo.com  Thu Sep  6 06:13:38 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 6 Sep 2012 03:13:38 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
Message-ID: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>

--- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> For any shared server, [...] you
> would be forced into a mass change at one go.

OK, for multiple users on a shared server I see your point.

> Here's a further (and slightly more radical) idea: We
> stick with using 'Bio' and the current mixed case
> names on Python 2, but adopt 'bio' and other PEP8
> compatible names for Python 3 (as a uniform
> strict automatic rule: mixed case -> lower case)?
> i.e. Do this as part of our 2to3 process.

The Python developers argue against combining a switch to Python 3 with other major changes, since then if bugs arise it is unclear if it is due to the switch to Python 3 or due to the other changes. But perhaps it's OK if we have one Bio.* version for Python 2 and one bio.* version for Python 3 that are otherwise completely identical to each other.

> How about over the next year we gradually consolidate
> modules under the existing mixed case names? e.g.
> move Bio.AlignIO functionality and Bio.Align, 

I guess you meant "merge Bio.AlignIO functionality into Bio.Align".

> and Bio.Seq* under Bio.Seq (leaving backwards compatible
> imports supported but deprecated).

Sounds good to me. AFAIAC, we don't need to do this gradually over the next year. May as well do it for the next release.

-Michiel.

From anaryin at gmail.com  Thu Sep  6 09:48:51 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Sep 2012 16:48:51 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
Message-ID: <CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>

Ok, thanks.

The modules are littered with set/get methods and adding DeprecationWarning
to all of them might be a bit too much.. Instead, should we add one single
warning at the top of the PDBParser, since this is the only obligatory
module for Bio.PDB so that everyone gets the warning message once and once
only? Otherwise I can imagine several warnings popping up everywhere..

Cheers,

João


From eric.talevich at gmail.com  Thu Sep  6 10:17:03 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Sep 2012 10:17:03 -0400
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
Message-ID: <CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:52 AM, João Rodrigues <anaryin at gmail.com> wrote:

>
> What Bio.PDB does right now is rely on the list to iterate over things.
> Thus, you get the order in which you read the PDB file. However, if you
> sort it using the several Objects sort method you will get the following
> rules:
>
> Atom.py - N CA C O first, then alphabetically
> Residue.py - First aminoacids and nucleic acids, then heteroatoms.
> Chain.py - Empty chains last.
>
> These are already in place somewhere in the code. I just used them to
> overload the __cmp__ method, with a couple of additions because I
> personally disagree with the following:
>
> Atom.py - Inorganic atoms should come out last. For simplicity.
> Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get
> in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
> PDB files already have weird large numbers for water and ions for example,
> so these come out last anyway. Pushing all HETATMs to the end will
> sometimes disrupt the "natural" order of things, for instance modified
> residues. Magic perhaps :)
>
>
Here's another edge case to think about:
3BEG<http://www.rcsb.org/pdb/explore/explore.do?structureId=3BEG>.
The enzyme is chain A, starting from residue number 69; the substrate
peptide is chain B; and then after listing the atoms for chain B they jump
back to chain A and add the three ligands as individual residues, with
residue numbers 1, 2 and 3, on HETATM lines.

The current PDBParser complains about this structure but parses it so that
the extra HETATM residues are at the end of chain A's child_list. If I were
to try to generate a polypeptide sequence from each of the chains in this
structure, I think I'd want to just ignore the three extra residues, rather
than list them as the first three residues of the peptide as "SAX".

How do you think this should be handled? Maybe treat in-sequence modified
residues differently from out-of-sequence HETATMs?

-E


From eric.talevich at gmail.com  Thu Sep  6 10:40:13 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Sep 2012 10:40:13 -0400
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID: <CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > For any shared server, [...] you
> > would be forced into a mass change at one go.
>
> OK, for multiple users on a shared server I see your point.


True, and old scripts/pipelines have a way of sticking around, especially
once they've been shared with others in the lab.


> Here's a further (and slightly more radical) idea: We
> > stick with using 'Bio' and the current mixed case
> > names on Python 2, but adopt 'bio' and other PEP8
> > compatible names for Python 3 (as a uniform
> > strict automatic rule: mixed case -> lower case)?
> > i.e. Do this as part of our 2to3 process.
>
> The Python developers argue against combining a switch to Python 3 with
> other major changes, since then if bugs arise it is unclear if it is due to
> the switch to Python 3 or due to the other changes. But perhaps it's OK if
> we have one Bio.* version for Python 2 and one bio.* version for Python 3
> that are otherwise completely identical to each other.
>

Agreed, since the bio.* version is generated by the 2to3 script it should
still be easy enough to distinguish "this is a bug in the library" from
"this is a problem with Py3, 2to3 or your environment". The extra
separation on the filesystem provided by Py2/Py3 should also prevent some
problems with case-insensitivity and the environment.


> > How about over the next year we gradually consolidate
> > modules under the existing mixed case names? e.g.
> > move Bio.AlignIO functionality and Bio.Align,
>
> I guess you meant "merge Bio.AlignIO functionality into Bio.Align".
>
> > and Bio.Seq* under Bio.Seq (leaving backwards compatible
> > imports supported but deprecated).
>
> Sounds good to me. AFAIAC, we don't need to do this gradually over the
> next year. May as well do it for the next release.
>
>
Doing this in a single release might be better, so we can document/remember
the release number when the Grand Reshuffling took place and troubleshoot
users' resulting problems more easily.

Should we call that Biopython 2.0.0 and switch to semantic version numbers?

From anaryin at gmail.com  Thu Sep  6 10:51:11 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Sep 2012 17:51:11 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>
Message-ID: <CAJ9sUYP5ZK0KSpT4QKr2_HR6ojMtsAorJXQsErQLLkAReKQB1w@mail.gmail.com>

Well... :) If this is what the authors put in.. well, that's just it. The
parser should not be an interpreter.

However, when building peptides, you should get two peptides: the ALA-SEP,
and the protein chain A. And I think this is what you will get. Also, the
fact that they are heteroatoms is already a good filter if you want them
out of the equation.

From p.j.a.cock at googlemail.com  Thu Sep  6 21:01:04 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Sep 2012 02:01:04 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
Message-ID: <CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>

On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> > Here's a further (and slightly more radical) idea: We
>> > stick with using 'Bio' and the current mixed case
>> > names on Python 2, but adopt 'bio' and other PEP8
>> > compatible names for Python 3 (as a uniform
>> > strict automatic rule: mixed case -> lower case)?
>> > i.e. Do this as part of our 2to3 process.
>>
>> The Python developers argue against combining a switch to Python 3 with
>> other major changes, since then if bugs arise it is unclear if it is due to
>> the switch to Python 3 or due to the other changes. But perhaps it's OK if
>> we have one Bio.* version for Python 2 and one bio.* version for Python 3
>> that are otherwise completely identical to each other.
>
>
> Agreed, since the bio.* version is generated by the 2to3 script it should
> still be easy enough to distinguish "this is a bug in the library" from
> "this is a problem with Py3, 2to3 or your environment". The extra separation
> on the filesystem provided by Py2/Py3 should also prevent some problems with
> case-insensitivity and the environment.

Yes - they would be in different site-packages folders, and since
we have a tiny Python 3 install base, moving them from Bio to
bio seems low impact.

I guess we need to have a little hack with the 2to3 library and
try defining our own custom fixer for the imports...

Note this case difference will slightly complicate our documentation -
but that is always going to be an issue for the Python 2 to 3 move.

>>
>> > How about over the next year we gradually consolidate
>> > modules under the existing mixed case names? e.g.
>> > move Bio.AlignIO functionality and Bio.Align,
>>
>> I guess you meant "merge Bio.AlignIO functionality into Bio.Align".

Yes, sorry.

>> > and Bio.Seq* under Bio.Seq (leaving backwards compatible
>> > imports supported but deprecated).
>>
>> Sounds good to me. AFAIAC, we don't need to do this gradually
>> over the next year. May as well do it for the next release.
>
> Doing this in a single release might be better, so we can document/remember
> the release number when the Grand Reshuffling took place and troubleshoot
> users' resulting problems more easily.

Doing it in one release makes sense - but we can do it gradually in
a series of self contained commits - and feel our way.

Michiel - do you want to start with the Bio/Seq.py to Bio/Seq/__init__.py
change? We'll need to do that before any consolidation steps.

> Should we call that Biopython 2.0.0 and switch to semantic version numbers?
>

Maybe... at some point a Biopython 2 would be a good excuse for
some publicity and another application note.

The eventual move from developing under Python 2 (and using 2to3
for Python 3) to natively developing under Python 3 would be an
excuse for a major version bump.

Peter

From p.j.a.cock at googlemail.com  Thu Sep  6 21:03:22 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Sep 2012 02:03:22 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
	<CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
Message-ID: <CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>

On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Ok, thanks.
>
> The modules are littered with set/get methods and adding DeprecationWarning
> to all of them might be a bit too much.. Instead, should we add one single
> warning at the top of the PDBParser, since this is the only obligatory
> module for Bio.PDB so that everyone gets the warning message once and once
> only? Otherwise I can imagine several warnings popping up everywhere..

If you use the exact same message, then I think you'll only see the
warning once. Try it with a couple of the get/set methods to confirm.

Having the warning happen even if you don't use the get/set seems
wrong.
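
A way to check the "only once" guess with the standard warnings module alone. As far as I understand the filter actions, "default" shows a warning once per location (module plus line number) while "once" shows it once per message text regardless of location, so identical messages raised from different get/set methods would still each appear under "default". The getter names and the warning class here are stand-ins, not Bio.PDB code:

```python
import warnings


class BiopythonDeprecationWarning(Warning):
    """Local stand-in so the sketch runs without Biopython."""


MESSAGE = "get/set methods are deprecated; use the attributes instead."


def get_a():
    warnings.warn(MESSAGE, BiopythonDeprecationWarning)
    return "a"


def get_b():
    warnings.warn(MESSAGE, BiopythonDeprecationWarning)
    return "b"


def count_warnings(action):
    """Call both getters three times and count the warnings shown."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(3):
            get_a()
            get_b()
    return len(caught)
```

On CPython I would expect count_warnings("default") to give 2 (one per warn line), count_warnings("once") to give 1, and count_warnings("always") to give 6.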

Peter


From anaryin at gmail.com  Fri Sep  7 03:21:56 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Fri, 7 Sep 2012 10:21:56 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
	<CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
	<CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>
Message-ID: <CAJ9sUYPt_u5XZJfJHYF7ALKRR8eOjJ4YD8-1PJ7Wis-QAg50hQ@mail.gmail.com>

Likely true.

I'm writing a txt file with the changes. I don't think they can be merged
easily without breaking a lot of stuff, in particular the removal of
child_list. Therefore, I suggest we add deprecation warnings wherever the
changes we agree on affect existing code, and wait a few releases before
we actually merge them.

Also, once I'm happy with the changes, I'll make a new branch to allow
'beta testing' by anyone who wants and write a wiki page on it.

Cheers,

João
On 7 Sep 2012 04:03, "Peter Cock" <p.j.a.cock at googlemail.com>
wrote:

> On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues <anaryin at gmail.com> wrote:
> > Ok, thanks.
> >
> > The modules are littered with set/get methods and adding
> DeprecationWarning
> > to all of them might be a bit too much.. Instead, should we add one
> single
> > warning at the top of the PDBParser, since this is the only obligatory
> > module for Bio.PDB so that everyone gets the warning message once and
> once
> > only? Otherwise I can imagine several warnings popping up everywhere..
>
> If you use the exact same message, then I think you'll only see the
> warning once. Try it with a couple of the get/set methods to confirm.
>
> Having the warning happen even if you don't use the get/set seems
> wrong.
>
> Peter
>


From mjldehoon at yahoo.com  Sun Sep  9 03:31:05 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 9 Sep 2012 00:31:05 -0700 (PDT)
Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
Message-ID: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Returning to a previous discussion...

[Michiel:]
> ..., currently Bio.Motif._Motif.Motif objects also perform
> functions that are more appropriate for a separate PWM
> (position-weight matrix) class within Bio.Motif. It may be
> a good idea to have a separate PWM class for this functionality.

[Bartek:]
> I'm not sure. I think it is valuable to be able to load
> instances from a file and then convert them to a PWM.
> It could be done with separate classes,
> but I'm not sure it would be easier then...

I think there is one confusing issue here.
The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method).
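
To spell out the difference, here is a sketch in plain Python, independent of the Bio.Motif code. It assumes each position is stored as a dict of counts per letter and uses a base-2 logarithm. The function names are only for illustration, and real code would probably add pseudocounts to avoid zero probabilities:

```python
import math


def normalize(counts):
    """Counts matrix -> probability matrix (each position sums to 1)."""
    matrix = []
    for column in counts:
        total = float(sum(column.values()))
        matrix.append({letter: n / total for letter, n in column.items()})
    return matrix


def log_odds(probabilities, background):
    """Probability matrix -> position-weight matrix of log2(p / background)."""
    pwm = []
    for column in probabilities:
        scores = {}
        for letter, p in column.items():
            if p > 0:
                scores[letter] = math.log(p / background[letter], 2)
            else:
                scores[letter] = float("-inf")  # letter never observed here
        pwm.append(scores)
    return pwm


counts = [{"A": 1, "C": 4, "G": 3, "T": 2}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
probabilities = normalize(counts)          # what .pwm() returns today
pwm = log_odds(probabilities, background)  # what a PWM class would hold
```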

So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
Also this avoids some confusion with regard to which methods operate on which object. For example, currently motif.scanPWM and motif.score_hit actually operate on the log-odds matrix, whereas motif.anticonsensus, motif.consensus, and motif[:] use the probability matrix. Meanwhile motif.max_score and motif.min_score use the log-odds matrix to evaluate the scores of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).

So I would suggest to keep the various types of matrices explicit; something along these lines:

>>> motif = Motif.read(...)
>>> counts = motif.counts
# .counts is a property of motif
# counts is an instance of the Motif.FrequencyMatrix class
# you can also make a FrequencyMatrix object directly from
# the frequencies, as in
>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>> counts[2,:]
array([1.0, 4.0, 3.0, 2.0])
# indices refer explicitly to the counts matrix
>>> counts[2,'G']
3.0

>>> my_consensus_sequence = counts.consensus
# .consensus is a property of counts
>>> my_anticonsensus_sequence = counts.anticonsensus
# .anticonsensus is a property of counts

>>> my_probability_matrix = counts.normalize()
# this can be a numpy array, or a Motif.ProbabilityMatrix
# class that inherits from a numpy array
>>> my_probability_matrix[2,:]
array([0.1, 0.4, 0.3, 0.2])
# indices refer explicitly to the probability matrix

>>> pwm = counts.make_pwm(...)
# or pwm = motif.PositionWeightMatrix(my_matrix)
>>> pwm[0,:]
array([ -2.3,  0.1,  1.2,  1.8])
>>> pwm[0,2]
1.2
>>> pwm[0,'C']
0.1
# indices explicitly refer to the pwm

>>> scores = pwm.scan(sequence)
>>> score = pwm.score(sequence)
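
As a rough illustration of how the indexing and scoring above could fit together, here is a pure-Python sketch. The class body, the alphabet argument and the list-of-rows storage are assumptions for this example only; none of this exists in Bio.Motif, and a real version would probably build on numpy.

```python
class PositionWeightMatrix(object):
    """Hypothetical sketch of the proposed class; not an existing API."""

    def __init__(self, rows, alphabet="ACGT"):
        self.alphabet = alphabet
        self.rows = [list(row) for row in rows]
        for row in self.rows:
            if len(row) != len(alphabet):
                raise ValueError("each row needs one score per letter")

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        # Supports pwm[0, 2], pwm[0, 'C'] and pwm[0, :].
        position, column = key
        if isinstance(column, str):
            column = self.alphabet.index(column)
        return self.rows[position][column]

    def score(self, sequence):
        """Sum the scores of a sequence as long as the motif."""
        if len(sequence) != len(self):
            raise ValueError("sequence length must equal motif length")
        return sum(self[i, letter] for i, letter in enumerate(sequence))

    def scan(self, sequence):
        """Score every window of motif length along a longer sequence."""
        width = len(self)
        return [self.score(sequence[i:i + width])
                for i in range(len(sequence) - width + 1)]


pwm = PositionWeightMatrix([[-2.3, 0.1, 1.2, 1.8],
                            [0.5, -1.0, 0.0, 0.25]])
```

Because every accessor lives on the one object that holds the log-odds scores, there is no question about which matrix an index or a score refers to.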


Does that sound reasonable? Any comments, suggestions?

Best,
-Michiel.

From bartek at rezolwenta.eu.org  Mon Sep 10 03:12:59 2012
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 10 Sep 2012 09:12:59 +0200
Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
In-Reply-To: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID: <CABHxouV7Mc9VeNHX2mw7kSEB3V3=6uvNXt3HCG27zXhBOjfqJQ@mail.gmail.com>

Hi,

I think it is an idea worth discussing a little bit more. Thanks for
bringing it up Michiel.

It captures at least some of the issues caused by the fact that
different motifs might be internally represented differently.

I'm not sure I'm all excited about having to deal with explicit extra
classes for PWMs and aligned instances, but maybe this is the price
for having a clear separation of where certain things are calculated.

The issue I think still needs discussion is where the searching is
done. If I want to search for instances, do I do it from the PWM
object? This seems to be the natural idea, but then can we find a
nice interface for people who don't want to be bothered with overly
complicated interfaces?

I'll try to come up with a more thought through and longer response
later in the week...
best
Bartek

On Sun, Sep 9, 2012 at 9:31 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Returning to a previous discussion...
>
> [Michiel:]
>> ..., currently Bio.Motif._Motif.Motif objects also perform
>> functions that are more appropriate for a separate PWM
>> (position-weight matrix) class within Bio.Motif. It may be
>> a good idea to have a separate PWM class for this functionality.
>
> [Bartek:]
>> I'm not sure. I think it is valuable to be able to load
>> instances from a file and then convert them to a PWM.
>> It could be done with separate classes,
>> but I'm not sure it would be easier then...
>
> I think there is one confusing issue here.
> The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method).
>
> So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
> Also this avoids some confusion with regard to which methods operate on which object. For example, currently motif.scanPWM and motif.score_hit actually operate on the log-odds matrix; motif.anticonsensus, motif.consensus and motif[:] use the probability matrix; and motif.max_score and motif.min_score use the log-odds matrix to evaluate the scores of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).
>
> So I would suggest to keep the various types of matrices explicit; something along these lines:
>
>>>> motif = Motif.read(...)
>>>> counts = motif.counts
> # .counts is a property of motif
> # counts is an instance of the Motif.FrequencyMatrix class
> # you can also make a FrequencyMatrix object directly from
> # the frequencies, as in
>>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>>> counts[2,:]
> array([1.0, 4.0, 3.0, 2.0])
> # indices refer explicitly to the counts matrix
>>>> counts[2,'G']
> 3.0
>
>>>> my_consensus_sequence = counts.consensus
> # .consensus is a property of counts
>>>> my_anticonsensus_sequence = counts.anticonsensus
> # .anticonsensus is a property of counts
>
>>>> my_probability_matrix = counts.normalize()
> # this can be a numpy array, or a Motif.ProbabilityMatrix
> # class that inherits from a numpy array
>>>> my_probability_matrix[2,:]
> array([0.1, 0.4, 0.3, 0.2])
> # indices refer explicitly to the probability matrix
>
>>>> pwm = counts.make_pwm(...)
> # or pwm = motif.PositionWeightMatrix(my_matrix)
>>>> pwm[0,:]
> array([ -2.3,  0.1,  1.2,  1.8])
>>>> pwm[0,2]
> 1.2
>>>> pwm[0,'C']
> 0.1
> # indices explicitly refer to the pwm
>
>>>> scores = pwm.scan(sequence)
>>>> score = pwm.score(sequence)
>
>
> Does that sound reasonable? Any comments, suggestions?
>
> Best,
> -Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


-- 
Bartek Wilczynski


From p.j.a.cock at googlemail.com  Mon Sep 10 04:39:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Sep 2012 09:39:30 +0100
Subject: [Biopython-dev] Most buildbot slaves down
Message-ID: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>

Hi all,

For those of you actively monitoring the nightly BuildBot
for Biopython and/or BioRuby, all the buildslaves at my
institute are currently effectively offline. A new stricter
firewall policy was introduced last week while I was away.
I hope we'll have the necessary outgoing ports opened
again soon.

In the meantime, additional buildslaves hosted elsewhere
would be very useful. The machines need to be online
and are typically only used once every 24 hours for the
scheduled builds. Non-Linux machines are particularly
important for cross-platform testing (while for Linux
the TravisCI testing seems to be working nicely overall).

Any volunteers?

Thanks,

Peter

From tiagoantao at gmail.com  Mon Sep 10 04:50:41 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 10 Sep 2012 09:50:41 +0100
Subject: [Biopython-dev] [BioRuby] Most buildbot slaves down
In-Reply-To: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>
References: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>
Message-ID: <CAA9RGEOQYVgwf8NxS52TH84+WPdFcNFawJPTcZaLHM1XiZ+E3A@mail.gmail.com>

Hi,

Not much help on the non-Linux front, but I noticed that my machine
was down for some reason. I restarted it and it is now doing at least a
few of the builds.

Tiago

On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi all,
>
> For those of you actively monitoring the nightly BuildBot
> for Biopython and/or BioRuby, all the buildslaves at my
> institute are currently effectively offline. A new stricter
> firewall policy was introduced last week while I was away.
> I hope we'll have the necessary outgoing ports opened
> again soon.
>
> In the meantime, additional buildslaves hosted elsewhere
> would be very useful. The machines need to be online
> and are typically only used once every 24 hours for the
> scheduled builds. Non-Linux machines are particularly
> important for cross-platform testing (while for Linux
> the TravisCI testing seems to be working nicely overall).
>
> Any volunteers?
>
> Thanks,
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From redmine at redmine.open-bio.org  Thu Sep 13 22:23:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 14 Sep 2012 02:23:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] (New) Installation fails
	with pip-3.2
Message-ID: <redmine.issue-3384.20120914022353@redmine.open-bio.org>


Issue #3384 has been reported by Roy Crihfield.

----------------------------------------
Bug #3384: Installation fails with pip-3.2
https://redmine.open-bio.org/issues/3384

Author: Roy Crihfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


Linux 3.5.3-1-ARCH x86_64 GNU/Linux
Python 3.2.3
Bio.__version__ == '1.60'

Installation fails with pip 1.2:

$ sudo pip-3.2 install biopython

:
:

Converting build/py3.2/Doc/examples/fasta_dictionary.py

Converting build/py3.2/Doc/examples/nmr/simplepredict.py

Python 2to3 processing done.

running egg_info

error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory

----------------------------------------

Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython

Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main
    status = self.run(options, args)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Fri Sep 14 04:46:08 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 14 Sep 2012 08:46:08 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with
	pip-3.2
References: <redmine.issue-3384.20120914022353@redmine.open-bio.org>
Message-ID: <redmine.journal-14960.20120914084608@redmine.open-bio.org>


Issue #3384 has been updated by Peter Cock.


Does the standard install mechanism work on your machine? i.e.

python3.2 setup.py build
python3.2 setup.py test
sudo python3.2 setup.py install

If you want to investigate the pip error, there is a possible workaround developed by NumPy (who also use 2to3 in a similar way to us), see http://projects.scipy.org/numpy/ticket/1857

Thanks

From redmine at redmine.open-bio.org  Fri Sep 14 21:57:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 15 Sep 2012 01:57:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with
	pip-3.2
References: <redmine.issue-3384.20120914022353@redmine.open-bio.org>
Message-ID: <redmine.journal-14961.20120915015753@redmine.open-bio.org>


Issue #3384 has been updated by Roy Crihfield.


Yes, installing manually works. I found that hack but was hoping there would be a better solution, or support for pip planned for the future. 

From redmine at redmine.open-bio.org  Sat Sep 15 17:29:29 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 15 Sep 2012 21:29:29 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] Example using Bio.Clustalw
	in Tutorial
References: <redmine.issue-3340.20120410202908@redmine.open-bio.org>
Message-ID: <redmine.journal-14964.20120915212929@redmine.open-bio.org>


Issue #3340 has been updated by Grace Yeo.


I've submitted a pull request for this here:  
https://github.com/biopython/biopython/pull/71
----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.




From p.j.a.cock at googlemail.com  Sun Sep 16 08:34:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Sep 2012 13:34:31 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
Message-ID: <CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>

On Fri, Sep 7, 2012 at 2:01 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> > Here's a further (and slightly more radical) idea: We
>>> > stick with using 'Bio' and the current mixed case
>>> > names on Python 2, but adopt 'bio' and other PEP8
>>> > compatible names for Python 3 (as a uniform
>>> > strict automatic rule: mixed case -> lower case)?
>>> > i.e. Do this as part of our 2to3 process.
>>>
>>> The Python developers argue against combining a switch to Python 3 with
>>> other major changes, since then if bugs arise it is unclear if it is due to
>>> the switch to Python 3 or due to the other changes. But perhaps it's OK if
>>> we have one Bio.* version for Python 2 and one bio.* version for Python 3
>>> that are otherwise completely identical to each other.
>>
>>
>> Agreed, since the bio.* version is generated by the 2to3 script it should
>> still be easy enough to distinguish "this is a bug in the library" from
>> "this is a problem with Py3, 2to3 or your environment". The extra separation
>> on the filesystem provided by Py2/Py3 should also prevent some problems with
>> case-insensitivity and the environment.
>
> Yes - they would be in different site-packages folders, and since
> we have a tiny Python 3 install base, moving them from Bio to
> bio seems low impact.
>
> I guess we need to have a little hack with the 2to3 library and
> try defining our own custom fixer for the imports...
>
> Note this case difference will slightly complicate our documentation -
> but that is always going to be an issue for the Python 2 to 3 move.
>

I've made a start at this - the easy part seems to work :)

https://github.com/peterjc/biopython/commits/py3lower

The hard bit will be fixing all the import lines... ;)

Peter

From k.d.murray.91 at gmail.com  Thu Sep 20 00:28:08 2012
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Thu, 20 Sep 2012 14:28:08 +1000
Subject: [Biopython-dev] TAIR/AGI support
In-Reply-To: <87txvcx9ls.fsf@fastmail.fm>
References: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
	<87txvcx9ls.fsf@fastmail.fm>
Message-ID: <CAH80STVrvSnxp4JkgrZoywMQqiMg8t=nJtTcGnNggCe4k-Y4aQ@mail.gmail.com>

Hi Brad,

My TAIR/AGI script is on github here:
https://github.com/kdmurray91/biopython/blob/master/Bio/TAIR/__init__.py

I got it to work directly from TAIR's website, however it has not been
rigorously tested. I plan on implementing the process as I described in my
previous email, whereby it fetches the GenBank record from TogoWS or via
NCBI's EFetch (using Biopython's interfaces of course). I will keep you all
posted.

To the list in general: I'm open to suggestions on what to work on next.


Regards
Kevin Murray


On 6 September 2012 10:45, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> Kevin;
> Thanks for the e-mail and offers of code. Always happy to have other
> folks involved with the project.
>
> > What's the status of TAIR AGIs in BioPython (I can see no mention of
> them,
> > or support for them)? I've written a brief module which allows a user to
> > query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> > any interest in including such functionality in BioPython?
>
> Is the code available on GitHub to get a better sense of all the
> functionality it supports? Do you have an idea where it would fit best?
> As a tair submodule inside of Bio.Entrez, or somewhere else?
>
> > More generally, are there any particular areas of BioPython development
> > which could use an extra pair of hands?
>
> Following the mailing list for discussions on current projects is the
> best way to get a sense of what different folks are working on. The
> issue tracker also has open issues and features that could use attention
> if anything there strikes your fancy:
>
> https://redmine.open-bio.org/projects/biopython
>
> Hope this helps,
> Brad
>
>

From p.j.a.cock at googlemail.com  Thu Sep 20 05:08:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 20 Sep 2012 10:08:58 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
	<CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
Message-ID: <CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>

On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> I guess we need to have a little hack with the 2to3 library and
>> try defining our own custom fixer for the imports...
>>
>> Note this case difference will slightly complicate our documentation -
>> but that is always going to be an issue for the Python 2 to 3 move.
>>
>
> I've made a start at this - the easy part seems to work :)
>
> https://github.com/peterjc/biopython/commits/py3lower
>
> The hard bit will be fixing all the import lines... ;)
>
> Peter

Progress - but slow. I think this will work with a bit more
time spent on it.

With hindsight I'd have made more effort to try and reuse
lib2to3, but the documentation is sketchy and they do warn
it is liable to change between releases.

What I've got instead is a pattern matching script which
spots imports line by line and updates them, and also notes
what knock-on changes must be made later in the file. It
is also aware of and updates doctest examples. e.g.

from Bio import SeqIO
record = SeqIO.read("my_chr.gbk", "genbank")

becomes:

from bio import seqio
record = seqio.read("my_chr.gbk", "genbank")
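
For anyone curious what such a line-by-line rewrite looks like, here is a heavily simplified sketch. It is not the script on the py3lower branch: the regular expression, the function name and the use of plain str.lower() for the new names are assumptions, and it only handles the single "from Bio import X" form plus later "X." attribute access.

```python
import re

# Matches "from Bio import Name", optionally behind a doctest prompt.
IMPORT_RE = re.compile(r"^(\s*(?:>>> |\.\.\. )?)from Bio import (\w+)\s*$")


def lowercase_imports(source):
    """Rewrite simple Bio imports and the names that depend on them."""
    renames = {}  # old module name -> new lower case name
    output = []
    for line in source.splitlines():
        match = IMPORT_RE.match(line)
        if match:
            prefix, name = match.groups()
            # Remember the rename so later lines can be updated too.
            renames[name] = name.lower()
            line = "%sfrom bio import %s" % (prefix, name.lower())
        else:
            # Knock-on change: update attribute access on renamed modules.
            for old, new in renames.items():
                line = re.sub(r"\b%s\." % re.escape(old), new + ".", line)
        output.append(line)
    return "\n".join(output)


before = 'from Bio import SeqIO\nrecord = SeqIO.read("my_chr.gbk", "genbank")'
after = lowercase_imports(before)
```

The same function also copes with doctest lines because the optional prompt is captured and put back unchanged. Multi-name imports, "import Bio.X" and aliases would each need their own pattern.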

In the process I've spotted some minor style issues and
some quote mistakes in the code base which I have
fixed on the main branch as well, e.g.
https://github.com/biopython/biopython/commit/b396844401da8b5c5ed1f7f13d69622a6ad0c0cd
https://github.com/biopython/biopython/commit/165e2b8da445250f070c3860c9082ff6a0c919e0

I also reformatted a few import lines to make
processing them easier - and arguably easier
to read too:
https://github.com/biopython/biopython/commit/f6940e8a4fcf056fa725225ede5e848c5d6f4fd6

One slightly more complicated issue with lower case module
names is we get clashes in some code with existing variable
or argument names. This seems particularly common with seq,
alphabet and motif.

Most of the fixes for this are on the experimental branch.
In some cases I've opted to change the import, e.g.

from Bio import Alphabet

to:

from Bio import Alphabet as _alphabet

This seemed simplest to avoid changing argument names in
functions/methods.
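
The same trick can be shown with the standard library, which avoids needing Biopython installed. In this sketch the stdlib module "string" stands in for a lower case module such as alphabet, and the function is a made-up example rather than anything from the code base.

```python
import string as _string  # aliased so the argument name below stays free


def count_letters(string):
    """Count ASCII letters; 'string' is the argument, '_string' the module."""
    return sum(1 for character in string
               if character in _string.ascii_letters)


letters = count_letters("ab1")
```

Without the alias, the argument would shadow the module inside the function and the ascii_letters lookup would fail on the str object. With it, the public argument name can stay as it was.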

I'll continue to work on this as time allows - right now the code
is due for a refactoring (e.g. avoid code duplication where I
handle doctests), and would benefit from some self-tests.

But the message remains: This should work :)

Peter

From yhtgrace at gmail.com  Fri Sep 21 12:57:19 2012
From: yhtgrace at gmail.com (Hui Ting Grace Yeo)
Date: Fri, 21 Sep 2012 12:57:19 -0400
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
Message-ID: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>

Hey everyone,

I'm working on this bug here https://redmine.open-bio.org/issues/3340 and I've updated the example in the tutorial (on substitution matrices, 17.4.2) using Bio.AlignIO on github here https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace. I'm able to reproduce the dictionary replace_info, but when I go on to finish the example, I get the following log odds matrix:

D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
   D   E   H   K   R

which is different from the one given in the tutorial. I'm wondering if I've missed something. 

Thanks!
Grace Yeo

From p.j.a.cock at googlemail.com  Mon Sep 24 04:53:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 24 Sep 2012 09:53:07 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
Message-ID: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>

Hello all,

Last week Leighton was doing some work with Biopython
and GenomeDiagram using the cross-links functionality
we worked on for Biopython 1.59, which I described here:
http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/

As you may have noticed via Twitter or his blog, Leighton has
generated an enormous (5m by 1m) PDF poster printout
comparing 29 bacterial genomes:
http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html

As he describes on his blog post, this required generating
arbitrary color sets, with the option of adding some noise
(or jitter as he called it) to make neighbouring colours
visually distinct (rather than the more typical requirement
of a smooth value to color mapping).
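
To give a flavour of the idea without reading the branch, here is a small sketch using only the standard library. It is not Leighton's code and does not reproduce his algorithm: the function name, the particular path through HSV space and the way jitter is applied are all assumptions for illustration.

```python
import colorsys
import random


def spiral_colors(count, jitter=0.0, seed=None):
    """Return 'count' RGB tuples taken along a path through HSV space.

    jitter -- maximum random offset added to the brightness of each colour
    seed   -- optional seed so that a jittered palette is reproducible
    """
    rng = random.Random(seed)
    colors = []
    for index in range(count):
        fraction = (index + 0.5) / float(count)
        hue = fraction % 1.0             # one turn around the colour wheel
        value = 0.4 + 0.6 * fraction     # brightness climbs along the path
        value += rng.uniform(-jitter, jitter)
        value = min(1.0, max(0.0, value))  # keep brightness in range
        colors.append(colorsys.hsv_to_rgb(hue, 1.0, value))
    return colors


palette = spiral_colors(29, jitter=0.1, seed=1)
```

With jitter set to zero, neighbouring colours differ only by a small step in hue and brightness. A non-zero jitter trades that smoothness for more contrast between adjacent tracks.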

His code to do that is now on this branch (with a minor
bug fix and a few more docstrings added), ready for
possible merging into Biopython:
https://github.com/peterjc/biopython/tree/colorspiral

Does this seem like a sensible addition to Bio.Graphics?

Does anyone have any thoughts on the namespace
Bio.Graphics.ColorSpiral given it defines an object
ColorSpiral? Might a Bio.Graphics.Colors be useful?

(If, as discussed on the other thread, we move to lower
case module names for Python 3, this namespace clash,
which is also present in many other Biopython modules,
goes away):
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html

Regards,

Peter

From p.j.a.cock at googlemail.com  Tue Sep 25 12:00:45 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 17:00:45 +0100
Subject: [Biopython-dev] [Biopython] Legacy blastn XML outfile parsing
 is slow. What XML parser is actually used?
In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk>
References: <CADEGkF7DX-t1bwRp66i3+6FBxpaU1KMCS6V86svGA7g2De=aoA@mail.gmail.com>
	<1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
	<CAKVJ-_521uq+Lf756munwjBt2H-P5+1jY1HwVvaK1DNqt5Rm2g@mail.gmail.com>
	<506194F6.9000103@fold.natur.cuni.cz>
	<CAKVJ-_45v6EwoT+JBir1joZrs80uV6+D_xBFV=5+83fR=kmE6w@mail.gmail.com>
	<CAKVJ-_5ynOC_zn-LCTUB4PNvpNi=PsQAexY+w58BhC8CJjgVgg@mail.gmail.com>
	<5061C20F.7040209@stats.ox.ac.uk>
Message-ID: <CAKVJ-_70vvk0wp2HnH12cD4qDf==Ph8LaYBAA-1kBr0N6LHJ9g@mail.gmail.com>

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik
<golubchi at stats.ox.ac.uk> wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not going
to change it (except possibly to make it a little faster if people
want to work on that).

We're planning to add a new module based on Bow's GSoC
code, under the working name SearchIO, which would cover
BLAST, BLAT, HMMER, etc. This would have a different API
and in the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think)
only about this new code (and would have been better off
on the development mailing list but this thread went off on
a slight tangent).

Once 'SearchIO' is released, we'd want to encourage
people to use that instead of NCBIXML, with a view to
deprecating and eventually removing NCBIXML. See:
http://biopython.org/wiki/Deprecation_policy

Regards,

Peter

From p.j.a.cock at googlemail.com  Thu Sep 27 09:01:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Sep 2012 14:01:44 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
In-Reply-To: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
References: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
Message-ID: <CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>

On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> Last week Leighton was doing some work with Biopython
> and GenomeDiagram using the cross-links functionality
> we worked on for Biopython 1.59, which I described here:
> http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
>
> As you may have noticed via Twitter or his blog, Leighton has
> generated an enormous (5m by 1m) PDF poster printout
> comparing 29 bacterial genomes:
> http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html
>
> As he describes on his blog post, this required generating
> arbitrary color sets, with the option of adding some noise
> (or jitter as he called it) to make neighbouring colours
> visually distinct (rather than the more typical requirement
> of a smooth value to color mapping).
>
> His code to do that is now on this branch (with a minor
> bug fix and a few more docstrings added), ready for
> possible merging into Biopython:
> https://github.com/peterjc/biopython/tree/colorspiral
>
> Does this seem like a sensible addition to Bio.Graphics?
>
> Does anyone have any thoughts on the namespace
> Bio.Graphics.ColorSpiral given it defines an object
> ColorSpiral? Might a Bio.Graphics.Colors be useful?
>
> (If as discussed on the other thread we move to lower
> case module names for Python 3, this namespace
> clash also present in many other Biopython modules
> goes away):
> http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html
>
> Regards,
>
> Peter

I've committed it - we can still move/rename/etc until the
next release if anyone has suggestions for improvement.
https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7
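
As a rough illustration of the idea described above (colours taken
along a path through HSV space, with optional jitter so that
neighbouring colours stay visually distinct), here is a minimal
standard-library sketch. The function name spiral_colors and all of
its parameters are hypothetical; this is not the API of the committed
Bio.Graphics.ColorSpiral module.

```python
import colorsys
import random


def spiral_colors(n, turns=1.0, v_start=0.4, v_end=1.0, jitter=0.0, seed=None):
    """Return n RGB tuples (components in 0..1) along a hue/value spiral.

    Hypothetical helper for illustration only: hue advances with the
    position along the spiral, brightness (value) rises from v_start to
    v_end, and 'jitter' adds bounded random noise to the brightness.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    colors = []
    for i in range(n):
        t = i / float(n)                      # position along the spiral, 0 <= t < 1
        hue = (t * turns) % 1.0               # angle around the colour wheel
        value = v_start + t * (v_end - v_start)
        if jitter:
            value += rng.uniform(-jitter, jitter)
        value = min(1.0, max(0.0, value))     # clamp back into the valid range
        colors.append(colorsys.hsv_to_rgb(hue, 1.0, value))
    return colors
```

Calling spiral_colors(29, jitter=0.1, seed=1) would give one colour
per genome for a 29-genome comparison; fixing the seed keeps the
palette the same between runs.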

Peter

From p.j.a.cock at googlemail.com  Thu Sep 27 09:55:21 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Sep 2012 14:55:21 +0100
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
Message-ID: <CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>

On Fri, Sep 21, 2012 at 5:57 PM, Hui Ting Grace Yeo <yhtgrace at gmail.com> wrote:
> Hey everyone,
>
> I'm working on this bug here https://redmine.open-bio.org/issues/3340
> and I've updated the example in the tutorial (on substitution matrices,
> 17.4.2) using Bio.AlignIO on github here
> https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace.
> I'm able to reproduce the dictionary replace_info, but when I go on to
> finish the example, I get the following log odds matrix:
>
> D   2
> E  -1   1
> H  -5  -4   3
> K -10  -5  -4   1
> R  -4  -8  -4  -2   2
>    D   E   H   K   R
>
> which is different from the one given in the tutorial. I'm wondering if I've
> missed something.

Hi Grace,

Using the current code and the example as it is, I also observe
the same result as you. According to github's "blame" feature
the current text dates back 4 years:

https://github.com/biopython/biopython/commit/bed3ab39d8a635f1e74be99e6730a48d2460f8b7

However, that was just a reformatting of an older example which
Brad wrote 11 years ago while converting the example from DNA
to protein:

https://github.com/biopython/biopython/commit/21df476c66b279824c51e6abd3f4ae549d003813

The example file itself, protein.aln, has not changed since it was committed:

https://github.com/biopython/biopython/commit/ccbe2d72014eafb064994bc3782ca5529d0b0448

See also Doc/examples/make_subsmat.py

So, since the example hasn't been changed in 11 years, this
suggests either Brad committed the wrong output (and no-one
noticed), or something changed in the calculation during that
time.
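
For anyone wanting to check such a matrix by hand, a generic log odds
calculation can be sketched with the standard library as below. This
is an assumption-laden illustration of the textbook formula (observed
pair frequency divided by the frequency expected from the residue
background), not a copy of what Bio/SubsMat/__init__.py does; the
function name log_odds and the input layout are hypothetical.

```python
import math
from collections import Counter


def log_odds(pair_counts, base=2.0):
    """Return {(a, b): log-odds score} from unordered replacement counts.

    pair_counts maps residue pairs (a, b) with a <= b to observed counts.
    Pairs with a zero count are skipped, since log(0) is undefined.
    """
    total = float(sum(pair_counts.values()))
    observed = {pair: n / total for pair, n in pair_counts.items() if n > 0}

    # Background frequency of each residue: half of every pair it appears in.
    background = Counter()
    for (a, b), q in observed.items():
        background[a] += q / 2.0
        background[b] += q / 2.0

    scores = {}
    for (a, b), q in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2.0  # an unordered pair (a, b) can arise as ab or ba
        scores[(a, b)] = math.log(q / expected, base)
    return scores
```

A positive score means the pair was seen more often than expected by
chance, a negative score less often.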

(Nowadays we try to use doctests for the examples in the
API and in the Tutorial where possible, so that code changes
which affect our examples are detected automatically.)
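
As a small illustration of that approach, a doctest embeds the
expected output next to the example, so the two cannot silently drift
apart. The function below is a made-up example for this purpose and
is not part of Biopython.

```python
def percent_identity(a, b):
    """Return the percentage of identical columns in two aligned strings.

    >>> percent_identity("DEHKR", "DEHRR")
    80.0
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)
```

Running "python -m doctest" on a file containing this function would
report a failure if the code ever stopped returning 80.0 here.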

The most likely candidates would be something in the file
Bio/SubsMat/__init__.py

https://github.com/biopython/biopython/commits/master/Bio/SubsMat/__init__.py

A little detective work might be needed to explain this... sadly
trying to use Biopython from back then is complicated by the
reliance on the Martel/mxTextTools dependency.

Maybe Brad or Michiel has some insight?

--

In the meantime, I have applied your changes to the
example to use AlignIO,

https://github.com/biopython/biopython/commit/19f9317fe0e346f6c3f197d027076d9a1265def7
https://github.com/biopython/biopython/commit/5949f54dadb6d4ac8400e11d2afa33db549afba5

This will now get tested via test_Tutorial.py automatically
(except for the final line about printing the odds matrix):

https://github.com/biopython/biopython/commit/15dd6ba17eb092d0d7df674ac45617d99256d098

Thank you,

Peter

From redmine at redmine.open-bio.org  Thu Sep 27 09:57:38 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 27 Sep 2012 13:57:38 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] (Resolved) Example using
	Bio.Clustalw in Tutorial
References: <redmine.issue-3340.20120410202908@redmine.open-bio.org>
Message-ID: <redmine.journal-14965.20120927135738@redmine.open-bio.org>


Issue #3340 has been updated by Peter Cock.

Status changed from New to Resolved
% Done changed from 0 to 100

Fixed with Grace's commits, although she has also spotted a separate issue with the log odds matrix output later in the example:
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009958.html
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009962.html

----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: Resolved
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From p.j.a.cock at googlemail.com  Fri Sep 28 06:50:52 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 11:50:52 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
	<CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
	<CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>
Message-ID: <CAKVJ-_4PV3VMx5pju65578gq8TSN936T5ePH_cjhtUQcrECHYg@mail.gmail.com>

On Thu, Sep 20, 2012 at 10:08 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>> I guess we need to have a little hack with the 2to3 library and
>>> try defining our own custom fixer for the imports...
>>
>> I've made a start at this - the easy part seems to work :)
>>
>> https://github.com/peterjc/biopython/commits/py3lower
>>
>> ...

The code to do this lower case name mangling remains
quite a spaghetti-like mess in do2to3.py but it now works
enough to pass the test suite (with some but not all 3rd
party dependencies installed) under Linux and my Mac
OS X machine (where like Windows I have a case
insensitive file system).

Here's a clean run on TravisCI (Linux with a case sensitive
file system):
https://travis-ci.org/#!/peterjc/biopython/jobs/2584146

I've not tried Windows itself yet, and I have only tested Python 3.2 so far.

Note if you want to try this, after switching to (and after
switching from) the py3lower branch you should delete
the build/py3.* folder where the 2to3 converted code
is cached.

The good news is that only a handful of bits of code
needed special case code (e.g. finding the Entrez DTD
files), with most tweaks just to import lines (as mentioned
earlier) or renaming of internal variables.

So this idea to adopt PEP8 lower case module names
as part of supporting Python 3 appears to be technically
viable.
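
To give a flavour of the "easy part" (rewriting import lines), here
is a minimal regex-based sketch. It is not the code in do2to3.py and
does not use the 2to3 fixer machinery; the function name
lower_bio_import is hypothetical, and it only lower-cases the dotted
module path, not any imported names that are themselves modules.

```python
import re

# Match "from Bio.X.Y" or "import Bio.X.Y" at the start of a line.
_BIO_IMPORT = re.compile(r"^(\s*(?:from|import)\s+)(Bio(?:\.\w+)*)\b")


def lower_bio_import(line):
    """Lower-case the Bio.* module path in a single import line."""
    return _BIO_IMPORT.sub(lambda m: m.group(1) + m.group(2).lower(), line)
```

For example, "from Bio.Nexus import Nexus, Trees" becomes
"from bio.nexus import Nexus, Trees". The imported name Nexus is left
alone here, which is exactly the kind of case that needs extra
handling in a real converter.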

Peter

From p.j.a.cock at googlemail.com  Fri Sep 28 05:35:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 10:35:42 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
In-Reply-To: <CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>
References: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
	<CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>
Message-ID: <CAKVJ-_5ZTT93qF=zyzn2JC_3u_pR0KXjD4MC01A8eZkbTcUPaA@mail.gmail.com>

On Thu, Sep 27, 2012 at 2:01 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> As he describes on his blog post, this required generating
>> arbitrary color sets, with the option of adding some noise
>> (or jitter as he called it) to make neighbouring colours
>> visually distinct (rather than the more typical requirement
>> of a smooth value to color mapping).
>>
>> ...
>
> I've committed it - we can still move/rename/etc until the
> next release if anyone has suggestions for improvement.
> https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7

The buildbot run last night spotted a problem under Python 2.5
(no cmath.rect function) which I've now fixed.
https://github.com/biopython/biopython/commit/ee933c3f5c4b98ab232c5180492dc11a46b89f0d
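
For reference, cmath.rect was added in Python 2.6, so code that must
also run on Python 2.5 needs its own polar-to-rectangular conversion.
A minimal sketch of such a fallback follows; it is not necessarily
the exact change made in the commit above.

```python
import math


def rect(r, phi):
    """Return the complex number with modulus r and phase phi (radians).

    Equivalent to cmath.rect(r, phi) for finite inputs, but written
    with math.cos and math.sin so it does not depend on cmath.rect.
    """
    return complex(r * math.cos(phi), r * math.sin(phi))
```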

We do test under Python 2.5 with TravisCI as well, but at
the moment we don't install the ReportLab dependency.
There is a balance between installing more dependencies
(to get more of our code tested) and the extra runtime
required (meaning the job is more likely to be killed, or
fail due to a network issue) giving false test failures.

Peter

From p.j.a.cock at googlemail.com  Fri Sep 28 06:06:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 11:06:10 +0100
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <87ipaywk47.fsf@fastmail.fm>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
	<CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
	<87ipaywk47.fsf@fastmail.fm>
Message-ID: <CAKVJ-_5J_d5=Ao1e+kkt-rZcB3rWebFj2gp2UZeUe4BXO4pcRw@mail.gmail.com>

On Fri, Sep 28, 2012 at 10:51 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> So, since the example hasn't been changed in 11 years, this
>> suggests either Brad committed the wrong output (and no-one
>> noticed), or something changed in the calculation during that
>> time.
>
> Seriously, I could have easily copy/pasted something wrong when writing
> this, so if there is no obvious code change I'd go with that assumption
> and fix the docs to be correct.

OK - I've done that:
https://github.com/biopython/biopython/commit/b57707f9f3afc0980a3dbf936f6642a4d9cc8a69

Thanks Brad & Grace,

Peter

P.S. I've included Grace as a contributor in the upcoming release notes
(please let me know if you'd prefer this as Hui Ting Grace Yeo instead):
https://github.com/biopython/biopython/commit/5af03e78f37cbce82ce167c762d892cce9cb062e

From bjoern at gruenings.eu  Fri Sep 28 09:03:22 2012
From: bjoern at gruenings.eu (=?ISO-8859-1?Q?Bj=F6rn_Gr=FCning?=)
Date: Fri, 28 Sep 2012 15:03:22 +0200
Subject: [Biopython-dev] [Patch] Genbank Parser
Message-ID: <1348837402.21455.1.camel@threonin>

Hi,

the tbl2asn tool from the NCBI creates GenBank files that do not have a
version number. Unfortunately that version number is used to fill
consumer.data.id.
I implemented the following fall-back:
If there is no version information available, then it takes the
consumer.data.name for the consumer.data.id. Does that make sense?
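
The attached patch is not reproduced in the archive, but the
fall-back described above can be sketched roughly as follows. The
function name choose_record_id and the default placeholder are
assumptions made for illustration; they are not taken from the patch
or from the GenBank parser.

```python
def choose_record_id(version, name, default="<unknown id>"):
    """Pick a record id: the VERSION value if present, else the name.

    Hypothetical helper: 'version' and 'name' stand in for the values
    the parser collected from the VERSION and LOCUS lines.
    """
    version = (version or "").strip()
    if version:
        return version
    name = (name or "").strip()
    return name or default  # last resort when both are missing
```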

Thanks!
Bjoern

-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_genbank_id-fallback.diff
Type: text/x-patch
Size: 1016 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120928/8f4fe694/attachment.bin>

From p.j.a.cock at googlemail.com  Fri Sep 28 09:38:11 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 14:38:11 +0100
Subject: [Biopython-dev] [Patch] Genbank Parser
In-Reply-To: <1348837402.21455.1.camel@threonin>
References: <1348837402.21455.1.camel@threonin>
Message-ID: <CAKVJ-_5nekcTBYejUTVV6VvjV+mB0WV0eoEWKytGZOTmgfmw1g@mail.gmail.com>

On Fri, Sep 28, 2012 at 2:03 PM, Björn Grüning <bjoern at gruenings.eu> wrote:
> Hi,
>
> the tbl2asn tool from the NCBI creates GenBank files that do not have a
> version number. Unfortunately that version number is used to fill
> consumer.data.id.
> I implemented the following fall-back:
> If there is no version information available, then it takes the
> consumer.data.name for the consumer.data.id. Does that make sense?
>
> Thanks!
> Bjoern

Can you share some example output from tbl2asn that shows
this problem? Ideally something small we could include as a
unit test.

Thanks,

Peter


From chapmanb at 50mail.com  Fri Sep 28 05:51:36 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 28 Sep 2012 05:51:36 -0400
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
	<CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
Message-ID: <87ipaywk47.fsf@fastmail.fm>


Grace and Peter;

[Different log odds matrix in documentation]

> However, that was just a reformatting of an older example which
> Brad wrote 11 years ago while converting the example from DNA
> to protein:

Gee, thanks for making me feel old.

> So, since the example hasn't been changed in 11 years, this
> suggests either Brad committed the wrong output (and no-one
> noticed), or something changed in the calculation during that
> time.

Seriously, I could have easily copy/pasted something wrong when writing
this, so if there is no obvious code change I'd go with that assumption
and fix the docs to be correct.

Thanks for spotting this,
Brad

From bjoern at gruenings.eu  Thu Sep 27 18:11:05 2012
From: bjoern at gruenings.eu (bjoern at gruenings.eu)
Date: Fri, 28 Sep 2012 00:11:05 +0200 (CEST)
Subject: [Biopython-dev] [Patch] Genbank Parser fall-back data.id
Message-ID: <59367.132.230.56.143.1348783865.squirrel@mail.gruenings.eu>

Hi,

the tbl2asn tool from the NCBI creates GenBank files that do not have a
version number. Unfortunately that version number is used to fill
consumer.data.id.
I implemented the following fall-back:
If there is no version information available, then it takes the
consumer.data.name for the consumer.data.id. Does that make sense?

Thanks!
Bjoern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_genbank.diff
Type: text/x-patch
Size: 1015 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120928/6027b056/attachment.bin>

From p.j.a.cock at googlemail.com  Sat Sep 29 08:10:24 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 13:10:24 +0100
Subject: [Biopython-dev] Fwd: [Utilities-announce] PubMed E-Utility 2013 DTD
	updates
In-Reply-To: <mailman.155270.1348855208.20059.utilities-announce@ncbi.nlm.nih.gov>
References: <A9D8BF3D8A74DF4A925FB541C0F39D2A16F079160B@NIHMLBX15.nih.gov>
	<mailman.155270.1348855208.20059.utilities-announce@ncbi.nlm.nih.gov>
Message-ID: <CAKVJ-_7c7fcp7BrQcJ82gV+oEASvNZ3B0+bg364_Y5UweM=HqA@mail.gmail.com>

I've added the two new DTD files mentioned below:
https://github.com/biopython/biopython/commit/2a09b03ab4d861e91eb543bd6df717ecb4fdf097

Peter

---------- Forwarded message ----------
From: **
Date: Friday, September 28, 2012
Subject: [Utilities-announce] PubMed E-Utility 2013 DTD updates
To: NLM/NCBI List utilities-announce <utilities-announce at ncbi.nlm.nih.gov>


NCBI PubMed E-Utility Users,

We anticipate updating the PubMed E-Utility DTDs for 2012 in mid-December,
approximately on December 10 or 11, 2012.

The forthcoming DTDs are available from:

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedlinecitationset_130101.dtd
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_130101.dtd

Changes to NLMMedlineCitationSet DTD AND MEDLINE/PubMed XML:

- Indicating abstracts not in MEDLINE/PubMed but available from publishers

English-language abstracts are taken directly from the published article
and included in the <Abstract> and <AbstractText> elements. If the article
does not have a published abstract, the record lacks the <Abstract> and
<AbstractText> elements. However, publishers may create English-language
abstracts that are not published with the article, as well as
non-English-language abstracts that may or may not be published with the
article.

These other abstracts will be indicated in the <OtherAbstract> element. A
new "Language" attribute is added to the <OtherAbstract> element. The
<AbstractText> element will carry the standard phrase: "Abstract available
from the publisher."

DTD:

<!ELEMENT OtherAbstract (AbstractText+,CopyrightInformation?)>
<!ATTLIST OtherAbstract Type (AAMC | AIDS | KIE | PIP | NASA | Publisher) #REQUIRED
                        Language (#PCDATA ) "eng">

Sample XML:

<OtherAbstract Type="Publisher" Language="fre"> <AbstractText> Abstract available from the publisher.</AbstractText>
</OtherAbstract>

- Rename NameID to Identifier

The NameID element was created in 2010 and modified in 2011 but has not yet
been used. NameID is renamed to Identifier. Identifier is an optional,
possibly multiply-occurring element permissible within the Author (personal
and collective) and Investigator elements.  The value in the Identifier
attribute Source designates the organizational authority that established
the unique identifier.

DTD:

<!ELEMENT     Author (((LastName, ForeName?, Initials?, Suffix?) | CollectiveName),Identifier*)>
<!ATTLIST     Author ValidYN (Y | N) "Y">

<!ELEMENT     Investigator (LastName,ForeName?, Initials?,Suffix?,Identifier*,Affiliation?)>
<!ATTLIST     Investigator ValidYN (Y | N) "Y">

<!ELEMENT     Identifier (#PCDATA)>
<!ATTLIST     Identifier
                      Source CDATA #REQUIRED >

Sample XML:

<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>John</ForeName>
<Initials>A</Initials>
<Identifier Source="ORCID">55555555555555</Identifier>
</Author>

Thank you.
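
To see what the two changes in the announcement look like to a
client, the following standard-library sketch reads the new Language
attribute and the renamed Identifier element. It does not use
Bio.Entrez; the <Article> wrapper element, the second OtherAbstract
entry and the function name summarize are made up for illustration.
Note that ElementTree does not apply DTD attribute defaults, so the
"eng" default has to be supplied by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the sample fragments from the announcement.
SAMPLE = """\
<Article>
  <OtherAbstract Type="Publisher" Language="fre">
    <AbstractText>Abstract available from the publisher.</AbstractText>
  </OtherAbstract>
  <OtherAbstract Type="AIDS">
    <AbstractText>Example text.</AbstractText>
  </OtherAbstract>
  <Author ValidYN="Y">
    <LastName>Smith</LastName>
    <ForeName>John</ForeName>
    <Initials>A</Initials>
    <Identifier Source="ORCID">55555555555555</Identifier>
  </Author>
</Article>
"""


def summarize(xml_text):
    """Return ([(Type, Language)], [(Source, identifier)]) for a record."""
    root = ET.fromstring(xml_text)
    abstracts = [
        (elem.get("Type"), elem.get("Language", "eng"))  # DTD default is "eng"
        for elem in root.iter("OtherAbstract")
    ]
    identifiers = [
        (elem.get("Source"), elem.text) for elem in root.iter("Identifier")
    ]
    return abstracts, identifiers
```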


From p.j.a.cock at googlemail.com  Sat Sep 29 16:25:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:25:14 +0100
Subject: [Biopython-dev] Nexus __slots__ and Python 3.3
Message-ID: <CAKVJ-_4-7hma258+wo8kSwjDD5H=Fxxtc87j_3Ywm-UdmdjYDw@mail.gmail.com>

Hello all,

I've started testing under the newly released Python 3.3,
and there is a new problem which I don't recall running
into when I tried one of the Python 3.3 alpha releases:

$ python3 test_Nexus.py
Traceback (most recent call last):
  File "test_Nexus.py", line 7, in <module>
    from Bio.Nexus import Nexus, Trees
  File "/Users/peterjc/lib/python3.3/site-packages/Bio/Nexus/Nexus.py",
line 513, in <module>
    class Nexus(object):
ValueError: 'original_taxon_order' in __slots__ conflicts with class variable

I can fix this with the following change, which appears
to have no side effects under Python 2 (the unit tests
still all pass):

$ git diff
diff --git a/Bio/Nexus/Nexus.py b/Bio/Nexus/Nexus.py
index 1d6abd2..8c7fbcc 100644
--- a/Bio/Nexus/Nexus.py
+++ b/Bio/Nexus/Nexus.py
@@ -511,8 +511,6 @@ class Block(object):

 class Nexus(object):

-    __slots__=['original_taxon_order','__dict__']
-
     def __init__(self, input=None):
         self.ntax=0                     # number of taxa
         self.nchar=0                    # number of characters

I have committed this:
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634

However, I'm not really sure what the intention of this
line was in the first place. It is (assuming I didn't miss
anything with grep), or now was, the only use of
__slots__ in the whole of Biopython.
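
For anyone unfamiliar with the error, the general mechanism can be
reproduced with a toy class: Python refuses to create a class when a
name listed in __slots__ is also assigned in the class body. This
sketch only shows that mechanism under Python 3; the attribute name
is made up, and it does not claim to pin down which class-level name
in Nexus.py triggered the conflict.

```python
def slots_conflict_message():
    """Return the error raised when a slot name clashes with a class variable."""
    try:
        class Example(object):
            __slots__ = ["taxon_order", "__dict__"]
            taxon_order = None  # same name as a slot, so class creation fails
    except ValueError as err:
        return str(err)
    return None


def without_slots():
    """The same class without __slots__ is created without complaint."""
    class Example(object):
        taxon_order = None

    return Example().taxon_order
```

Since '__dict__' was listed in the slots anyway, instances could
still take arbitrary attributes, so dropping the line should not
restrict anything that worked before.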

Regards,

Peter

From p.j.a.cock at googlemail.com  Sat Sep 29 16:34:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:34:27 +0100
Subject: [Biopython-dev] PAML test problems under Python 3.3.0
Message-ID: <CAKVJ-_4DCG=_d097D=M5Ld1AthCVmZ50qixL4HR7OLOK68ZkuQ@mail.gmail.com>

Hi Brandon (et al),

Could you have a look at the PAML unit tests under Python 3.3 please?
I see a mix of failures and 'blocking' under a self-compiled Python 3.3.0
on Mac OS X 10.8 (Mountain Lion):

$ python3 test_PAML_yn00.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_codeml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAA (__main__.ModTest) ... ok
testParseAAPairwise (__main__.ModTest) ... ok
testParseAllNSsites (__main__.ModTest) ... ok
testParseBranchSiteA (__main__.ModTest) ... ok
testParseCladeModelC (__main__.ModTest) ... ok
testParseFreeRatio (__main__.ModTest) ... ok
testParseNSsite3 (__main__.ModTest) ... ok
testParseNgene2Mgene02 (__main__.ModTest) ... ok
testParseNgene2Mgene1 (__main__.ModTest) ... ok
testParseNgene2Mgene34 (__main__.ModTest) ... ok
testParsePairwise (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_baseml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testParseAlpha1Rho1 (__main__.ModTest) ... ok
testParseModel (__main__.ModTest) ... ok
testParseNhomo (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

If you've not tried this before, the procedure I'm using is:

$ python3 setup.py build
$ cd build/py3.3/Tests
$ python3 test_PAML_baseml.py
etc

The key point is that to run the tests directly (rather than
just via 'python3 setup.py test') you must change
directory to the 2to3 converted folder under the build
folder.

By commenting out the test methods which seem to be
blocking, it seems some of the failures are to do with
exception handling. I've not dug any further into this.

Thanks,

Peter

From redmine at redmine.open-bio.org  Sun Sep  2 19:20:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 2 Sep 2012 19:20:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse
	PDBs produced by PatchDock
References: <redmine.issue-3379.20120821102714@redmine.open-bio.org>
Message-ID: <redmine.journal-14952.20120902192001@redmine.open-bio.org>


Issue #3379 has been updated by João Rodrigues.


I contacted the developers of PatchDock and they updated their code. Their PDBs no longer have the double END statement, but they might have conflicting chains though: the parser will likely break if by chance both chains have id A and overlapping residue numbers. Still, a slight improvement.
----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379

Author: David Cain
Status: New
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.57
URL: 


I apologize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs.


h3. Background

Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file.

h3. Why PDBParser fails

Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings but that PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were; the only thing that changes is the ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and to cease parsing when one is encountered. This means that many complexes parse cleanly but completely exclude the ligand.

h3. How to fix the problem

Now, in an ideal world, the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files?

h3. Potential change to @PDBParser._parse_coordinates@?

If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure.
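
The proposal above can be sketched, independently of Bio.PDB, as a
simple line scan that keeps reading after the first @END@ or @CONECT@
record and warns when it finds coordinates there. This is only an
illustration of the idea using the standard library; the function
name collect_coordinate_lines is hypothetical, it uses a plain
UserWarning rather than PDBConstructionWarning, and it is not the
code in the pull request.

```python
import warnings


def collect_coordinate_lines(lines):
    """Return all ATOM/HETATM lines, including any found in the trailer."""
    kept = []
    in_trailer = False
    trailer_hits = 0
    for line in lines:
        record = line[:6].strip()      # PDB record name lives in columns 1-6
        if record in ("END", "CONECT"):
            in_trailer = True          # keep scanning instead of stopping here
            continue
        if record in ("ATOM", "HETATM"):
            if in_trailer:
                trailer_hits += 1
            kept.append(line)
    if trailer_hits:
        warnings.warn(
            "%d coordinate record(s) found after END/CONECT; "
            "the file may be a concatenation of several PDB files" % trailer_hits
        )
    return kept
```

A scan like this would still leave the chain problem mentioned in the
thread: if both concatenated files use chain A with overlapping
residue numbers, the extra records cannot simply be added to the same
structure.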

If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing.

My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use Biopython to parse the PDB output.


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Mon Sep  3 01:05:19 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 3 Sep 2012 01:05:19 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse
	PDBs produced by PatchDock
References: <redmine.issue-3379.20120821102714@redmine.open-bio.org>
Message-ID: <redmine.journal-14953.20120903010519@redmine.open-bio.org>


Issue #3379 has been updated by David Cain.


That's awesome! Thanks for doing that. Well, chain renumbering is definitely a problem, but I don't see any easy fix for that. I still think the "pull request":https://github.com/biopython/biopython/pull/60 is relevant for detecting otherwise malformed PDB files (additionally, parsing will still stop after the first file if @CONECT@ records are present).


From w.arindrarto at gmail.com  Mon Sep  3 10:14:59 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 3 Sep 2012 12:14:59 +0200
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
	<87lim4h07o.fsf@fastmail.fm>
	<CAKVJ-_7b=RGpDGX3v0x5PJFWgW5dB3Otfg8Sq2Gehhg4SU2bUg@mail.gmail.com>
	<CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
Message-ID: <CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>

Hello everyone,

I'd like to update everyone on my latest SearchIO(?) developments. There
has been some progress and bug fixes since GSoC officially ended two weeks
ago. Some of them I'd like to share here:

1. I've written a draft tutorial chapter for the submodule. It's been pushed
to my development repo (https://github.com/bow/biopython/tree/searchio) and
I'm hosting the HTML temporarily on my site (
http://bow.web.id/biopython/Tutorial.html). Comments and critiques are
welcome :).

2. Back on the naming issue, I'm still using SearchIO for now. I've
experimented with other names (Bio.Search and Bio.SeqSearch), and my
impression is I like Bio.SeqSearch the most, followed by Bio.Search, and
Bio.SearchIO. It does feel confusing initially (we have SeqUtils,
SeqFeature, etc.), but after a while it's the one that feels most natural.

3. And finally, Peter and I discussed this briefly previously: what if
we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch
/ Search / SearchIO)? I felt there was a lot of overlap between this
submodule and Bio.BLAST when writing the tutorial, so merging surfaced in
my thoughts again. We could put the BLAST wrappers under
Bio.SeqSearch.Applications (for example), along with other wrappers (I have
a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could
be put here as well). As for qblast (and other remote searches, like the one
provided by HMMER at the moment), we could put them in
Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone
who works with BLAST / other sequence search tools, as all Biopython-related
functionality would be grouped in one place.

This is just a thought for now, but I'd love to hear your thoughts on the
merge (and the naming ;) ).

cheers,
Bow


On Tue, Aug 21, 2012 at 6:01 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:

> On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote:
> >> Michiel;
> >>> Hi Eric, Peter,
> >>>
> >>> > How about Bio.Search, for now?
> >>>
> >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> >>> users something about what the module is for. Bio.Search could be
> >>> anything (search PubMed? search the Entrez databases? search Google?
> >>> anyway Bio.Search does not suggest that this module is about pairwise
> >>> alignments). But Peter previously mentioned that he doesn't like
> >>> Bio.Pairwise; can we convince you?
> >>
> >> I agree with Peter on this one. The module is primarily about searching
> >> a sequence database with an input via multiple methods, not about
> >> pairwise alignment of two sequences which is what Bio.Align.Pairwise
> >> suggests to me.
> >>
> >> Brad
> >
> > One potential problem with Bio.Search (on top of the concerns raised
> > here about vagueness), which Bow and I were just talking about during
> > our weekly GSoC video call, is the existence of Bio/Search.py,
> > which is obsolete and long overdue for removal. I have just
> > deprecated it (something I forgot to do before the last release):
> >
> https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8
> >
> > We'd earlier talked about using Bio.Search as the namespace. I was
> > worried about the potential existence on a user's machine of both
> > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py
> > (aka SearchIO, the new module) and which would take precedence
> > when doing: from Bio import Search
> >
> > Given how Python module installations work, that seems highly
> > likely to occur. The good news is that the package would take
> > priority - see http://www.python.org/doc/essays/packages.html
> >
> >>>>> What If I Have a Module and a Package With The Same Name?
> >>>>>
> >>>>> You may have a directory (on sys.path) which has both a module
> >>>>> spam.py and a subdirectory spam that contains an __init__.py
> >>>>> (without the __init__.py, a directory is not recognized as a
> >>>>> package).
> >>>>> In this case, the subdirectory has precedence, and importing spam
> >>>>> will ignore the spam.py file, loading the package spam instead. If
> >>>>> you want the module spam.py to have precedence, it must be
> >>>>> placed in a directory that comes earlier in sys.path.
> >
> > So there is no technical reason to avoid Bio.Search as an
> > option for the Bio.SearchIO namespace. We could then
> > have Bio.Search.Applications for command line wrappers,
> > consistent with Bio.Phylo.Applications, Bio.Motif.Applications
> > and Bio.Align.Applications.
> >
> > Of course, Bio.Search is still perhaps too broad a name... but
> > on balance perhaps it is still better than Bio.SearchIO?
> >
> > Regards,
> >
> > Peter
>
> Hi everyone,
>
> If I may add my two cents, for now I am in favor of putting the module
> under Bio.Search. It is not the best name out there (it does sound a
> bit vague), but it's the one that seems to be the most intuitive (until
> a better alternative comes out). There were some other alternatives
> that Peter and I have discussed, but they seem less appealing to us.
> You're free to add your thoughts on these of course :) :
>
> - Bio.SeqSearch. This sounds ok, but when you consider we have
> Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes
> quite confusing quickly.
>
> - Bio.PSearch ('p' for pairwise). This one seemed the least intuitive
> of the three options, so I'm not so big on this.
>
> For now, I'm still writing everything (code, docstrings, tutorial)
> using SearchIO. I suppose it's better if we could agree on a more
> suitable name, though.
>
> On another note, I'm also in favor of using the Bio.Phylo module
> skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence
> search-related application wrappers under Applications (I actually
> prefer 'app' for better PEP8 compliance, but that's another
> discussion) and perhaps even refactor our remote search calls (e.g.
> the 'qblast' module) under Bio.Search as well.
>
> cheers,
> Bow
>


From p.j.a.cock at googlemail.com  Mon Sep  3 12:28:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:28:30 +0100
Subject: [Biopython-dev] GSoC SearchIO project
In-Reply-To: <CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com>
	<87lim4h07o.fsf@fastmail.fm>
	<CAKVJ-_7b=RGpDGX3v0x5PJFWgW5dB3Otfg8Sq2Gehhg4SU2bUg@mail.gmail.com>
	<CADEGkF4URxn5zwXOwU1J6s21U22aLwTdUw3aU6G0=MRt+LbfOA@mail.gmail.com>
	<CADEGkF6Si_gFOQZ8HKto-8GMbk53Z1J8HKrO-UAm2puPQ0mp-Q@mail.gmail.com>
Message-ID: <CAKVJ-_6jPPdarh3XoVxKtCmoZPw1vOz5249BQtgAmr+P_gMHSg@mail.gmail.com>

On Mon, Sep 3, 2012 at 11:14 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hello everyone,
>
> I'd like to update everyone on my latest SearchIO(?) developments. There
> has been some progress and bug fixes since GSoC officially ended two weeks
> ago. Some of them I'd like to share here:
>
> 1. I've written a draft tutorial chapter for the submodule. It's been pushed
> to my development repo (https://github.com/bow/biopython/tree/searchio) and
> I'm hosting the HTML temporarily on my site (
> http://bow.web.id/biopython/Tutorial.html). Comments and critiques are
> welcome :).

Oh - excellent - I'll read that in the next few days :)

> 2. Back on the naming issue, I'm still using SearchIO for now. I've
> experimented with other names (Bio.Search and Bio.SeqSearch), and my
> impression is I like Bio.SeqSearch the most, followed by Bio.Search, and
> Bio.SearchIO. It does feel confusing initially (we have SeqUtils,
> SeqFeature, etc.), but after a while it's the one that feels most natural.


Initially Bio.SeqSearch sounds a bit long... but maybe it will
grow on me...

> 3. And finally, Peter and I discussed this briefly previously: what if
> we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch
> / Search / SearchIO)? I felt there was a lot of overlap between this
> submodule and Bio.BLAST when writing the tutorial, so merging surfaced in
> my thoughts again. We could put the BLAST wrappers under
> Bio.SeqSearch.Applications (for example), along with other wrappers (I have
> a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could
> be put here as well). As for qblast (and other remote searches, like the one
> provided by HMMER at the moment), we could put them in
> Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone
> who works with BLAST / other sequence search tools, as all Biopython-related
> functionality would be grouped in one place.

As per my discussion with Bow, I'm OK with aiming to deprecate the
Bio.BLAST namespace as part of introducing Bio.SeqSearch/Search/..,
although I hadn't a strong preference on a naming convention for any
online functionality. Possibly www is shorter than remote and also
clear?

> This is just a thought for now, but I'd love to hear your thoughts on the
> merge (and the naming ;) ).
>
> cheers,
> Bow

Thanks Bow :)

Peter


From p.j.a.cock at googlemail.com  Mon Sep  3 12:55:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:55:07 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
Message-ID: <CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>

On Wed, Aug 29, 2012 at 6:54 PM, Sczesnak, Andrew
<Andrew.Sczesnak at med.nyu.edu> wrote:
> +1
>
> It's been over a year since I first submitted my MAF code!

Already? Ouch, my apologies.

I'm at a hackathon this week with the OBF GSoC mentors who
looked at MAF for BioRuby - looking at this for inclusion in the
next Biopython release (perhaps with a beta tag) is on my agenda.

Peter


From anaryin at gmail.com  Mon Sep  3 22:07:39 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 01:07:39 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
Message-ID: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>

Hi all,

A quick update on some recent work. I found some time to finally work a bit
on the PDB parser and Bio.PDB in general. I started by optimizing the
current code. I ran cProfile on a script that parsed a set of structures
without header and without element columns. I did this because one of the
optimizations rendered the current header parser useless (the PDB file
handle is now consumed as an iterator instead of through the readlines
method). I still need to work a bit on the memory leak, but for now it seems
pretty ok (parsed 400-ish large structures without a glitch).
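
A minimal sketch of this kind of measurement, assuming a stand-in function (count_atoms, hypothetical) in place of the real parser: it iterates over the file handle line by line rather than calling readlines, and is run under cProfile with the report sorted by cumulative time.

```python
import cProfile
import io
import pstats


def count_atoms(handle):
    """Stand-in for a parser: walk the handle lazily, without readlines()."""
    return sum(1 for line in handle if line.startswith(("ATOM", "HETATM")))


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()
```

Running the same call against the old and the new code and comparing the two reports is one way to arrive at a figure like the one quoted below.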

I am attaching two pictures of cProfile and the two output files. There is
a nice improvement of about 25%, but this can still be improved for sure. I
just replaced some methods here and there, pre-initialized the numpy
arrays, etc. I pushed this version to my GitHub pdb_enhancements branch
<https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements>.

One big change I would propose is to eliminate the duality
child_list/child_dict. I think that keeping child_dict and generating
child_list from sorted dict keys would be good enough. OrderedDict also
looks appropriate, but it's Py2.7+ only. I still need to look into this, but
all those "append" calls in the profiling hint at a nice speed up, and
also at much cleaner code.
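
To make the idea concrete, here is a simplified sketch of a container that keeps only child_dict and derives child_list on demand. The Entity class below is a toy stand-in, not the Bio.PDB implementation, and it assumes insertion order is the order wanted (hence OrderedDict rather than sorting the keys).

```python
from collections import OrderedDict


class Entity(object):
    """Toy container with a single source of truth for its children."""

    def __init__(self, id):
        self.id = id
        # Maps child id -> child, remembering insertion order.
        self.child_dict = OrderedDict()

    def add(self, child):
        """Register a child, refusing duplicate ids."""
        if child.id in self.child_dict:
            raise KeyError("%r defined twice" % (child.id,))
        self.child_dict[child.id] = child

    @property
    def child_list(self):
        """Children in insertion order, rebuilt from child_dict each time."""
        return list(self.child_dict.values())
```

Rebuilding the list on every access trades some lookup speed for not having to keep two structures in sync; whether that pays off would have to be checked with the profiler.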

Let me know your opinion if you have some time,

Cheers,

João

PS. Attached complex_1.pdb as an example of the structures in the dataset
used for this particular test.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.png
Type: image/png
Size: 166144 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-master-TBEV.profile
Type: application/octet-stream
Size: 252112 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.png
Type: image/png
Size: 148137 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BioPDB-optimized-TBEV.profile
Type: application/octet-stream
Size: 273487 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: complex_1w.pdb
Type: chemical/x-pdb
Size: 649559 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120904/51a8c7b6/attachment.bin>

From p.j.a.cock at googlemail.com  Tue Sep  4 05:56:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 06:56:55 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
Message-ID: <CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>

On Mon, Sep 3, 2012 at 11:07 PM, João Rodrigues <anaryin at gmail.com> wrote:

> One big change I would propose is to eliminate the duality
> child_list/child_dict. I think that keeping child_dict and generating
> child_list from sorted dict keys would be good enough. OrderedDict also
> looks appropriate, but it's Py2.7+.. Still need to look into this, but by
> looking at all those "append" methods in the profiling it hints at a nice
> speed up, and also at much cleaner code.
>

Where there are back-ports of the OrderedDict and other useful
classes like NamedTuple, we could probably include these as
part of our Python 2/3 compatibility code. i.e. In Bio.PDB use:

from Bio._py3k import OrderedDict

(Until we drop older versions of Python which don't come with
this). In Bio._py3k we would have something like this:

# Prefer, in order: the system OrderedDict (Python 2.7 and 3.x),
# the backport from PyPI, or our own bundled implementation
try:
    from collections import OrderedDict
except ImportError:
    try:
        #Whatever http://pypi.python.org/pypi/ordereddict uses:
        from xxx import OrderedDict
    except ImportError:
        #Import local bundled implementation, e.g.
        from _ordereddict import OrderedDict

See http://code.activestate.com/recipes/576693-ordered-dictionary-for-py24/

Are there any objections to this plan?

Regards,

Peter


From anaryin at gmail.com  Tue Sep  4 05:59:36 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 08:59:36 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
Message-ID: <CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>

Sounds great, I saw the ActiveState link before but I never thought of
including it. Thanks!


From w.arindrarto at gmail.com  Tue Sep  4 06:11:05 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 08:11:05 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
Message-ID: <CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>

Hi Peter, João,

Just a little FYI. I ran into the OrderedDict issue when I started writing
SearchIO a few months ago as well, so I added an OrderedDict implementation
in Bio._py3k (
https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c
).

The code is from the ordereddict module from PyPI at that time. I haven't
checked if it's the same as the one shown in the link (there may have been
some updates), but it seems to work fine up to now.

Hope this is useful :),
Bow


On Tue, Sep 4, 2012 at 7:59 AM, João Rodrigues <anaryin at gmail.com> wrote:

> Sounds great, I saw the active state link before but I never thought of
> including it. Thanks!
>


From p.j.a.cock at googlemail.com  Tue Sep  4 06:30:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 07:30:51 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
Message-ID: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>

Hello all,

Over on one of Bow's pull requests Michiel made a suggestion about
consolidating the Bio.Seq* namespace under Bio.Seq.* which we can
do by replacing Bio/Seq.py with Bio/Seq/__init__.py

See: https://github.com/biopython/biopython/pull/63#issuecomment-8252340

I agree that Bio.Seq, Bio.SeqUtils, Bio.SeqIO, Bio.SeqRecord,
and Bio.SeqFeature isn't ideal. However, changing this would be
a big disruption - so perhaps any large change like this should
also address the mixed case module names which are not PEP8
conformant (Modules should have short, all-lowercase names).

http://www.python.org/dev/peps/pep-0008/#package-and-module-names

One idea I was pondering is a new parallel namespace, ideally
bio.* but we can't use that due to case insensitive file systems
like Windows and (by default) Mac OS X. So perhaps biopy,
or bp? [I've not checked for clashes with other libraries yet.]

We could gradually move code over to the new namespace,
using imports to preserve back compatibility - but support both
namespaces during a (long) transition period.
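
A small sketch of how the old namespace could keep working by re-exporting from the new one. The module names newns_seq and oldns_seq are made up for this demonstration and are built in memory so the example is self-contained; in a real package the old module file would simply contain the import line.

```python
import sys
import types


def make_module(name, source):
    """Create an in-memory module from source and register it in sys.modules."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


# Hypothetical new-style (lower case) module holding the real code.
make_module("newns_seq", "class Seq(str):\n    pass\n")

# Hypothetical old-style module: nothing but a re-export, kept for old scripts.
make_module("oldns_seq", "from newns_seq import Seq\n")

# Old and new imports resolve to the very same class object.
from oldns_seq import Seq as OldSeq
from newns_seq import Seq as NewSeq
```

Because both names point at one class, objects created through either import path stay interchangeable during the transition.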

What I like about this is it allows people to make a gradual
conversion - and we don't have the burden of two main
branches, as we would if we attempted a single jump to a Biopython v2.

Does this seem worth considering?

Regards,

Peter


From mjldehoon at yahoo.com  Tue Sep  4 10:27:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 4 Sep 2012 03:27:57 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
Message-ID: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>

Hi Peter,

--- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> One idea I was pondering is a new parallel namespace,
> ideally bio.* but we can't use that due to case
> insensitive file systems like Windows and (by default)
> Mac OS X. So perhaps biopy, or bp?

As you say, the ideal namespace is bio.*, so let's use that. We have been using Bio.* for more than 10 years. We should not get stuck with a non-ideal namespace for the next 10+ years because there may be some glitches switching from Bio.* to bio.*. Frankly I doubt that this will cause huge problems in practice.

> We could gradually move code over to the new namespace,
> using imports to preserve back compatibility - but support
> both namespaces during a (long) transition period.

Why do we need a transition period? It's just a matter of replacing upper case with lower case in the imports.

> What I like about this is it allows people to make a
> gradual conversion - and we don't have the burden of two main
> branches, as we would if we attempted a single jump to a Biopython v2.
> 
> Does this seem worth considering?

Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.

Best,
-Michiel.


From p.j.a.cock at googlemail.com  Tue Sep  4 10:59:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 11:59:00 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>

On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi Peter,
>
> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> One idea I was pondering is a new parallel namespace,
>> ideally bio.* but we can't use that due to case
>> insensitive file systems like Windows and (by default)
>> Mac OS X. So perhaps biopy, or bp?
>
> As you say, the ideal namespace is bio.*, so let's use
> that. We have been using Bio.* for more than 10 years.
> We should not get stuck with a non-ideal namespace for
> the next 10+ years because there may be some glitches
> switching from Bio.* to bio.*. Frankly I doubt that this
> will cause huge problems in practice.

So you'd advocate a simple switch where from one
release to the next we change all the module names
(making them lower case, perhaps along with consolidation
under bio.seq too)?

This may cause some difficulties for upgrades - it may
require manual intervention to remove the old Bio folder
in order to allow creation of the new bio folder.

>> We could gradually move code over to the new namespace,
>> using imports to preserve back compatibility - but support
>> both namespaces during a (long) transition period.
>
> Why do we need a transition period? It's just a matter
> of replacing upper case with lower case in the imports.

That forces people to update all their scripts at once.
Of course, we can document how to do this so a script
would work before and after the case change, e.g.

try:
    from bio.seq import Seq
except ImportError:
    from Bio.Seq import Seq

>> What I like about this is it allows people to make a
>> gradual conversion - and we don't have the burden of two main
>> branches, as we would if we attempted a single jump to a Biopython v2.
>>
>> Does this seem worth considering?
>
> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>
> Best,
> -Michiel.
>


From p.j.a.cock at googlemail.com  Tue Sep  4 12:16:26 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 13:16:26 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
Message-ID: <CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>

On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Peter, João,
>
> Just a little FYI. I ran into the OrderedDict issue when I started writing
> SearchIO a few months ago as well, so I added an OrderedDict implementation
> in Bio._py3k
> (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
>
> The code is from the ordereddict module from PyPI at that time. I haven't
> checked if it's the same as the one shown in the link (there may have been
> some updates), but it seems to work fine up to now.
>
> Hope this is useful :),
> Bow

Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
that seems quite a good case for including it. How does this look
(on the 'od' branch in my repository)?

https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f

This differs from Bow's version in that I put the module in as a separate
file (Bio/_ordereddict.py), and that it will prefer the ordereddict package
if already installed (e.g. from PyPI).

Peter


From w.arindrarto at gmail.com  Tue Sep  4 12:36:55 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 14:36:55 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
Message-ID: <CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>

On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > Hi Peter, João,
> >
> > Just a little FYI. I ran into the OrderedDict issue when I started
> > writing
> > SearchIO a few months ago as well, so I added an OrderedDict
> > implementation
> > in Bio._py3k
> >
> > (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
> >
> > The code is from the ordereddict module from PyPI at that time. I
> > haven't
> > checked if it's the same as the one shown in the link (there may have
> > been
> > some updates), but it seems to work fine up to now.
> >
> > Hope this is useful :),
> > Bow
>
> Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
> that seems quite a good case for including it. How does this look
> (on the 'od' branch in my repository)?
>
>
> https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f
>
> This differs from Bow's version in that I put the module in as a separate
> file (Bio/_ordereddict.py), and that it will prefer the ordereddict
> package
> if already installed (e.g. from PyPI).
>
> Peter

Hi Peter,

This looks good. I like the 'ordereddict' module import check prior to
using our bundled version.

One more thing I would suggest is about the namespace. I feel that in
the future, we may run into similar issues (non-Python3 compatibility
issues) since Python 2.7 deprecation is still a long way off. Perhaps
create a new subpackage in the root folder (maybe Bio._compat, but I
don't have a strong preference), to keep code like this in one place?
Or we could even put Bio._py3k under this subpackage and have one
central place for compatibility-related code? This would prevent
further root namespace clutter.

regards,
Bow


From k.d.murray.91 at gmail.com  Tue Sep  4 12:57:22 2012
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Tue, 4 Sep 2012 22:57:22 +1000
Subject: [Biopython-dev] TAIR/AGI support
Message-ID: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>

Hi All,

What's the status of TAIR AGIs in BioPython (I can see no mention of them,
or support for them)? I've written a brief module which allows a user to
query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
any interest in including such functionality in BioPython?

More generally, are there any particular areas of BioPython development
which could use an extra pair of hands?

Regards
Kevin Murray


From anaryin at gmail.com  Tue Sep  4 14:19:11 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 17:19:11 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
Message-ID: <CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>

Guys,

Looks great, I will try to 'cherry pick' that branch and merge it with
mine. I have to solve some issues with the tests, but it seems to be a
straightforward change.

Cheers,

João
On 4 Sep 2012 15:37, "Wibowo Arindrarto" <w.arindrarto at gmail.com>
wrote:

> On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> >
> > On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto
> > <w.arindrarto at gmail.com> wrote:
> > > Hi Peter, João,
> > >
> > > Just a little FYI. I ran into the OrderedDict issue when I started
> > > writing
> > > SearchIO a few months ago as well, so I added an OrderedDict
> > > implementation
> > > in Bio._py3k
> > >
> > > (
> https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c
> ).
> > >
> > > The code is from the ordereddict module from PyPI at that time. I
> > > haven't
> > > checked if it's the same as the one shown in the link (there may have
> > > been
> > > some updates), but it seems to work fine up to now.
> > >
> > > Hope this is useful :),
> > > Bow
> >
> > Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
> > that seems quite a good case for including it. How does this look
> > (on the 'od' branch in my repository)?
> >
> >
> >
> https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f
> >
> > This differs from Bow's version in that I put the module in as a separate
> > file (Bio/_ordereddict.py), and that it will prefer the ordereddict
> > package
> > if already installed (e.g. from PyPI).
> >
> > Peter
>
> Hi Peter,
>
> This looks good. I like the 'ordereddict' module import check prior to
> using our bundled version.
>
> One more thing I would suggest is about the namespace. I feel that in
> the future, we may run into similar issues (non-Python3 compatibility
> issues) since Python2.7 deprecation is still a long way. Perhaps
> create a new subpackage in the root folder (maybe Bio._compat, but I
> don't have a strong preference), to keep code like this in one place?
> Or we could even put Bio._py3k under this subpackage and have one
> central place for compatibility-related code? This would prevent
> further root namespace clutter.
>
> regards,
> Bow
>


From p.j.a.cock at googlemail.com  Tue Sep  4 14:42:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 15:42:35 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
Message-ID: <CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>

On Tue, Sep 4, 2012 at 3:19 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Guys,
>
> Looks great, I will try to 'cherry pick' that branch and merge it with mine.

I've applied it to the master now, which might make it easier.
I think Bow might have a point about namespaces - although the
underscore modules are 'private', they still show up in dir(Bio)
so having a single folder for our inter-Python version compatibility
code seems sensible if we add any more (e.g. NamedTuples).

> I have to solve some issues with the tests, but it seems to be a
> straightforward change.

Great.

Peter


From anaryin at gmail.com  Tue Sep  4 16:02:42 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 19:02:42 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
Message-ID: <CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>

I agree, we could move them to a folder then?
On 4 Sep 2012 17:42, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Tue, Sep 4, 2012 at 3:19 PM, João Rodrigues <anaryin at gmail.com> wrote:
> > Guys,
> >
> > Looks great, I will try to 'cherry pick' that branch and merge it with
> > mine.
>
> I've applied it to the master now, which might make it easier.
> I think Bow might have a point about namespaces - although the
> underscore modules are 'private', they still show up in dir(Bio)
> so having a single folder for our inter-Python version compatibility
> code seems sensible if we add any more (e.g. NamedTuples).
>
> > I have to solve some issues with the tests, but it seems to be a
> > straightforward change.
>
> Great.
>
> Peter
>


From p.j.a.cock at googlemail.com  Tue Sep  4 23:54:56 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Sep 2012 00:54:56 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
Message-ID: <CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>

On Tue, Sep 4, 2012 at 5:02 PM, João Rodrigues <anaryin at gmail.com> wrote:
> I agree, we could move them to a folder then?
>

OK - I moved Bio/_py3k.py to Bio/_py3k/__init__.py and also the
new file Bio/_ordereddict.py to Bio/_py3k/ordereddict.py - this
avoids having to change any of our import statements:
https://github.com/biopython/biopython/commit/1a9bd6eeab0de3283bd1e6cc28c7754fbffefe2d
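
For code that needs an ordered dictionary, the import preference discussed
in this thread could look roughly like the sketch below. The exact order of
fallbacks used in the commit is an assumption here; only the module path
Bio/_py3k/ordereddict.py and the preference for an installed 'ordereddict'
package over the bundled copy come from the messages above.

```python
try:
    # Python 2.7+ and 3.1+ ship OrderedDict in the standard library.
    from collections import OrderedDict
except ImportError:
    try:
        # Older Pythons: prefer the stand-alone 'ordereddict' package
        # (e.g. from PyPI) if it happens to be installed.
        from ordereddict import OrderedDict
    except ImportError:
        # Last resort: the copy bundled with Biopython at its new location.
        from Bio._py3k.ordereddict import OrderedDict

children = OrderedDict()
children["B"] = 1
children["A"] = 2
# Keys come back in insertion order, not sorted order.
insertion_order = list(children.keys())
```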

Peter


From redmine at redmine.open-bio.org  Wed Sep  5 03:19:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 5 Sep 2012 03:19:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3382] (New)
	Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3
Message-ID: <redmine.issue-3382.20120905031953@redmine.open-bio.org>


Issue #3382 has been reported by Alexander Campbell.

----------------------------------------
Bug #3382: Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3
https://redmine.open-bio.org/issues/3382

Author: Alexander Campbell
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


At present, calling @Bio.PDB.PDBList.retrieve_pdb_file()@ on any PDB ID will fail, giving the following traceback:
<pre>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-4ecf112b58e0> in <module>()
----> 1 pdbl.retrieve_pdb_file('1FAT')

/usr/lib64/python3.2/site-packages/Bio/PDB/PDBList.py in retrieve_pdb_file(self, pdb_code, obsolete, compression, uncompress, pdir)
    245         gz = gzip.open(filename, 'rb')
    246         out = open(final_file, 'wb')
--> 247         out.writelines(gz.read())
    248         gz.close()
    249         out.close()

TypeError: 'int' does not support the buffer interface
</pre>

This occurs because in Python3 a file opened in binary mode will return type @bytes@ for @read()@, or a list of type @bytes@ objects for @readlines()@. The @writelines()@ method expects an iterable where each element is of type @str@. This worked in Python2 as a @str@ can be viewed as a sequence of @str@ objects, and so line 247 effectively wrote one character at a time for the single @str@ yielded by @read()@. In Python3 iterating over a @bytes@ yields @int@ objects, leading to the TypeError.

This issue can be fixed by changing line 247's call to @writelines()@ to just @write()@. This does not break functionality in Python2, according to my testing with Python 3.2.3 and 2.7.3 on Fedora 17.
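
A minimal sketch of the proposed change, written as a stand-alone helper rather than as the actual @retrieve_pdb_file()@ code (the function name @gunzip_file@ is hypothetical):

```python
import gzip


def gunzip_file(gz_path, out_path):
    """Decompress gz_path into out_path; intended to work on Python 2 and 3."""
    with gzip.open(gz_path, "rb") as gz:
        with open(out_path, "wb") as out:
            # read() returns a single bytes object (a str on Python 2), so
            # one write() call is enough. writelines() would iterate over
            # that object, which yields ints on Python 3.
            out.write(gz.read())


# Iterating over bytes on Python 3 gives integers, which is what trips up
# writelines() in the traceback above.
byte_values = list(b"AB")
```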

There are 4 more instances of @writelines()@ calls in the codebase, but in each of those cases the argument is a list or generator of @str@ or @bytes@ objects, so I don't think they will raise an error. I haven't tested them though.




From w.arindrarto at gmail.com  Wed Sep  5 09:53:36 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 5 Sep 2012 11:53:36 +0200
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
Message-ID: <CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>

Hi guys,

If I may add my two cents on this issue, I think it's also a chance
to rectify all other namespace issues that we may have (not just
PEP8-related).

For instance:

* In the root namespace, we now have Bio.Align and Bio.AlignIO. Since
we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the
Github discussion[1]), I suppose we should do the same with Bio.Align
as well (perhaps into bio[py].seq.align or bio[py].align).

* With the change above, we might also want to change some of the
submodule names completely. For example, if we merge Bio.Align into
bio[py].align we'll have bio[py].align.applications, which I
personally think could be shortened into bio[py].align.app.

* As per the Github discussion[1] as well, perhaps Bio.SeqUtils
should also be merged as Seq object methods.

There may be other changes as well, but the bottom line is all these
changes will be quite considerable. As such, I think we could go all
the way and be explicit in stating that the changes will be
incompatible with previous Biopython versions (i.e. old scripts will
break).

As for bio.* and biopy.*, if we do decide to go all the way, bio.*
seems like a better choice since there will be other incompatible
changes anyway. But if we eventually decide to only fix PEP8-related
issues while keeping compatibility with older versions, I'm leaning
more towards biopy.*.

regards,
Bow

[1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340

On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Hi Peter,
>>
>> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> One idea I was pondering is a new parallel namespace,
>>> ideally bio.* but we can't use that due to case
>>> insensitive file systems like Windows and (by default)
>>> Mac OS X. So perhaps biopy, or bp?
>>
>> As you say, the ideal namespace is bio.*, so let's use
>> that. We have been using Bio.* for more than 10 years.
>> We should not get stuck with a non-ideal namespace for
>> the next 10+ years because there may be some glitches
>> switching from Bio.* to bio.*. Frankly I doubt that this
>> will cause huge problems in practice.
>
> So you'd advocate a simple switch where from one
> release to the next we change all the module names
> (making them lower case, perhaps from consolidation
> under bio.seq too)?
>
> This may cause some difficulties for upgrades - it may
> require manual intervention to remove the old Bio folder
> in order to allow creation of the new bio folder.
>
>>> We could gradually move code over to the new namespace,
>>> using imports to preserve back compatibility - but support
>>> both namespaces during a (long) transition period.
>>
>> Why do we need a transition period? It's just a matter
>> of replacing upper case with lower case in the imports.
>
> That forces people to update all their scripts at once.
> Of course, we can document how to do this so a script
> would work before and after the case change, e.g.
>
> try:
>     from bio.seq import Seq
> except ImportError:
>     from Bio.Seq import Seq
>
>>> What I like about this is it allows people to make a
>>> gradual conversion - and we don't have the burden of two
>>> main branches if we attempted a single jump to a Biopython v2.
>>>
>>> Does this seem worth considering?
>>
>> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>>
>> Best,
>> -Michiel.
>>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From anaryin at gmail.com  Wed Sep  5 20:24:23 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 5 Sep 2012 23:24:23 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
Message-ID: <CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>

Hello all,

Some news.

A. The OrderedDict implementation is quite slow. It essentially slows down
the parser by 30%, rendering all the improvements I had done moot.
Therefore, although it's a great idea, a major reason for these updates is
speed so I think it might not be worth it.

B. As an alternative to this, I implemented the following. Entity has now
only child_dict, and is a general dictionary. However, each Object (Model,
Chain, Residue, Atom) gets their own __cmp__ method overloaded with the
information in the "_sort" methods that already existed. In this way, a
simple sorting of the values of the dictionary returns an ordered list. I
tweaked the Atom.__cmp__ to first sort N CA C O atoms and then
alphabetically. I also added that inorganic atoms such as Calcium come at
the end. This will make things a bit nicer when Calcium is involved for
example. Finally, the only downside to this seems to be that we lose the
order in which residues are inserted, i.e. if residue 151 is the first of
the PDB file and all others range from 1-150, then this first 151 is going
to be placed at the end when you iterate. However, from my experience and
in my opinion, not only is this logical, but it also rarely happens in real
PDB files.
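
The atom ordering described in B can be sketched as below. This is only an
illustration: it uses a sort key instead of __cmp__ so that it also runs on
Python 3, atoms are plain (name, element) tuples rather than Bio.PDB Atom
objects, and the set of "inorganic" elements is an assumed, incomplete list.

```python
# Backbone atoms come first, in this fixed order.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}

# Illustrative only: elements treated as inorganic ions and sorted last.
INORGANIC_ELEMENTS = frozenset(["CA", "ZN", "MG", "FE", "NA", "K"])


def atom_sort_key(name, element):
    """Return a tuple that sorts backbone first, ions last, rest by name."""
    if element.upper() in INORGANIC_ELEMENTS:
        return (2, 0, name)
    if name in BACKBONE_ORDER:
        return (0, BACKBONE_ORDER[name], name)
    return (1, 0, name)


# Note the two "CA" entries: an alpha carbon (element C) and a calcium ion.
atoms = [("CB", "C"), ("O", "O"), ("CA", "CA"), ("N", "N"),
         ("CA", "C"), ("C", "C"), ("CG", "C")]
ordered = sorted(atoms, key=lambda atom: atom_sort_key(*atom))
```

The element column is what separates the alpha carbon from calcium here;
sorting on the atom name alone could not tell them apart.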

C. I am strongly in favour of removing most (if not all) set/get methods
and replacing them with direct attribute access. For instance,
"atom.get_parent() --> atom.parent". Saves some space in the code and makes
things more transparent.

D. I edited the PDBParser to tweak a few things, nothing major. The file
handle is now treated as an iterator throughout the parsing and it should
be more memory-friendly. The line counter is still preserved. I also added
a test to make the get_header argument actually work.
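
A rough sketch of the idea in D, treating the handle as an iterator while
keeping a line counter. The function name iter_atom_lines is hypothetical
and this is not the PDBParser code itself.

```python
import io


def iter_atom_lines(handle):
    """Lazily yield (line_number, line) for coordinate records.

    The handle is consumed one line at a time, so the whole file never
    needs to be held in memory; enumerate() keeps the 1-based line count.
    """
    for line_number, line in enumerate(handle, start=1):
        if line.startswith(("ATOM  ", "HETATM")):
            yield line_number, line.rstrip("\n")


demo = io.StringIO(u"HEADER    TEST\nATOM      1  N   ALA A   1\nEND\n")
records = list(iter_atom_lines(demo))
```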

E. General things here and there that I can't remember right now.

F. Unittests are breaking everywhere. Checking why, but it all seems
related to this sorting issue.

Cheers,

João


From p.j.a.cock at googlemail.com  Wed Sep  5 23:31:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 00:31:42 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
Message-ID: <CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>

On Wed, Sep 5, 2012 at 9:24 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Hello all,
>
> Some news.
>
> A. The OrderedDict implementation is quite slow. It essentially slows down
> the parser by 30%, rendering all the improvements I had done moot.
> Therefore, although it's a great idea, a major reason for these updates is
> speed so I think it might not be worth it.

Which Python was that? i.e. The OrderedDict from the standard lib
(which I hope is optimised), or the back port (which might be slower).

> B. As an alternative to this, I implemented the following. Entity has now
> only child_dict, and is a general dictionary. However, each Object (Model,
> Chain, Residue, Atom) gets their own __cmp__ method overloaded with the
> information in the "_sort" methods that already existed. In this way, a
> simple sorting of the values of the dictionary returns an ordered list. I
> tweaked the Atom.__cmp__ to first sort N CA C O atoms and then
> alphabetically. I also added that inorganic atoms such as Calcium come at
> the end. This will make things a bit nicer when Calcium is involved for
> example. Finally, the only downside to this seems to be that we lose the
> order in which residues are inserted, i.e. if residue 151 is the first of the
> PDB file and all others range from 1-150, then this first 151 is going to be
> placed at the end when you iterate. However, from my experience and in my
> opinion, not only is this logical, but it also rarely happens in real PDB
> files.

That seems risky - but see if you can sort out what is happening
with the unit tests (below).

I'm not sure about your atomic sorting... it seems a bit magic. Would
sorting on atomic number be nicer (and simple)?

> C. I am strongly in favour of removing most (if not all) set/get methods and
> replacing them with direct attribute access. For instance, "atom.get_parent()
> --> atom.parent". Saves some space in the code and makes things more
> transparent.

It would also look less like Java code ;)

I like this plan - but initially define and document the new properties,
and deprecate the old get/set methods. Without that you'll break
almost every PDB-using script out there.
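
One possible shape for that transition is sketched below. The Atom class
here is a stand-in and not Bio.PDB.Atom, and the use of the built-in
DeprecationWarning (rather than a Biopython-specific warning class) is an
assumption made to keep the example self-contained.

```python
import warnings


class Atom(object):
    """Stand-in class showing attribute access plus deprecated accessors."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # new style: plain attribute access

    def get_parent(self):
        """Deprecated: use the .parent attribute instead."""
        warnings.warn("get_parent() is deprecated; use .parent instead",
                      DeprecationWarning, stacklevel=2)
        return self.parent

    def set_parent(self, parent):
        """Deprecated: assign to the .parent attribute instead."""
        warnings.warn("set_parent() is deprecated; assign .parent instead",
                      DeprecationWarning, stacklevel=2)
        self.parent = parent


atom = Atom("CA", parent="residue 42")
```

Old scripts calling atom.get_parent() keep working but see a warning, while
new code can simply read atom.parent.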

> D. I edited the PDBParser to tweak a few things, nothing major. The file
> handle is now treated as an iterator throughout the parsing and it should be
> more memory-friendly. The line counter is still preserved. I also added a
> test to make the get_header argument actually work.
>
> E. General things here and there that I can't remember right now.
>
> F. Unittests are breaking everywhere. Checking why, but it all seems related
> to this sorting issue.
>
> Cheers,
>
> João

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 00:10:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:10:57 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
	<CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>
	<1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID: <CAKVJ-_5LE9uu9b5ExDKf63yOFymZWFb+CjHC-6OGqTTM6Gxh-g@mail.gmail.com>

On Wed, Sep 5, 2012 at 8:19 PM, Sczesnak, Andrew wrote:
> Yeah, it would be great if this module could finally be included.
> I've e-mailed the list numerous times asking what would be
> necessary to include it and have done all you and Brad have
> asked. I've watched you include bits and pieces of code from
> other contributors quickly and without much scrutiny, so I
> can't help but feel singled out. What is the logic in delaying
> this? We've heard from people who are already using the
> code and have asked when it will be pulled. Is it serving the
> community to not even include the basic reader/writer? Am
> I wasting my time? Is it your goal to actively discourage
> contributions?

In my mind, the main technical issue regarding MAF and AlignIO
and the common alignment object is the lack of a common way
of handling the idea of start/end (and sometimes strand) for
each sequence (in a consistent co-ordinate system using Python
counting). Evidently I haven't managed to adequately convey my
interpretation/concern.

Some file formats like EMBOSS' have these numbers explicitly
but we're not parsing them:
http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html

In the case of "fasta-m10" the numbers are stored in private
properties as a 'short term' hack:
http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html

Others like Stockholm have identifier/start-end as a combined
name (but this is not mandatory). Here the start and end are
being stored in the annotations dictionary (as unparsed strings,
still using 1-based co-ordinates).
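
To illustrate the kind of conversion involved, here is a small sketch that
splits a Stockholm-style "identifier/start-end" name and converts the range
to Python counting. The helper name split_name_range and the example
identifier are hypothetical; this is not what AlignIO currently does.

```python
def split_name_range(name):
    """Split 'identifier/start-end' into (identifier, start, end).

    The returned start/end use Python counting (zero-based, end exclusive).
    Names without a well-formed numeric range are returned unchanged with
    None for both coordinates, since the range is not mandatory.
    """
    identifier, slash, coords = name.rpartition("/")
    if not slash:
        return name, None, None
    start_text, dash, end_text = coords.partition("-")
    if not (dash and start_text.isdigit() and end_text.isdigit()):
        return name, None, None
    # One-based inclusive start becomes zero-based; the end is unchanged.
    return identifier, int(start_text) - 1, int(end_text)


with_range = split_name_range("O83071/192-246")
without_range = split_name_range("seq1")
```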

In MAF the start/end are explicit and much more important.
It would be near pointless to parse the file ignoring these.
Maybe your approach is good enough for MAF, and we
should have adopted it as is, and delayed better integration
with the other AlignIO formats?

i.e. This is a general limitation in AlignIO and the object
model, somewhat annoying in the formats already supported,
but information critical to the MAF format.

I was expecting a convention for this to fall out of Bow's GSoC
work for 'pairwise alignments' in SearchIO - but the object
model he came up with was not SeqRecord based (many
of the file formats he was using didn't include sequences).

Right now my inclination is still to add a location property to
the SeqRecord, usually a FeatureLocation, but it could also
be the proposed CompoundLocation for more complex cases.
The question then is if/when this would be propagated, e.g.
SeqRecord slicing/addition.
http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html

So the wheels are turning, but slowly. I have not had as
much time to dedicate to this as I would like - but other
smaller or less inter-connected things are much easier to
review and merge.

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 00:34:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:34:19 +0100
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound
 feature locations
In-Reply-To: <CAKVJ-_5Mc-gSREu6xUA9qE4rJg-befDM3GjAB7Cs2FxC+cnvHg@mail.gmail.com>
References: <CAKVJ-_6yFU7kjgT2Orikra95r_qEq18BkYhEnDdikbKK3NYU5w@mail.gmail.com>
	<CALfq9t+FiUnnBeDfzspddNrJr1a3sky57npiJMGo4KEPhre3Ww@mail.gmail.com>
	<CAKVJ-_5YkFMnyE1oeCXS6rK1L+9UJH+B04LQ30zymhATAp8MdA@mail.gmail.com>
	<CALfq9tJ_MNWkDyAXNJbtSzQY+bgbvoaRNo3gmCPd3_EMHGH2rw@mail.gmail.com>
	<CAKVJ-_5Mc-gSREu6xUA9qE4rJg-befDM3GjAB7Cs2FxC+cnvHg@mail.gmail.com>
Message-ID: <CAKVJ-_6gwzWDNJbBCP=Tv3dOnzimvbWyVcKtnLiF5EmMM7v1_w@mail.gmail.com>

On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>> I agree that an "upgraded" FeatureLocation could be more
>> elegant.
>
> It could turn out to be simpler having just one location object...
> certainly worth trying out before committing this branch as is.

Such a new "upgraded" FeatureLocation would need to hold
a list/tuple of its parts (rather like the proposed CompoundLocation),
and those could simply be tuples of start, end, strand, db_ref
etc (essentially everything currently held in a FeatureLocation).

I'm not sure that that is any better than the new class
CompoundLocation holding a list of existing FeatureLocation
objects.

On the bright side, the branch still works nicely with the
extra BioSQL tests I added.

One of the issues worth a bit more discussion is the start
and end values of the CompoundLocation - which I am
considering making act as the left/minimum and right/
maximum boundary of the region spanned by the parts.

For normal forward strand features this does give the
biological start and end, likewise for reverse strand
features but inverted (location's start gives the biological
end). i.e. for *most* features this means no change to
the current behaviour.

My proposal would mean that for a feature spanning
the origin on a circular genome of length N, the start
would be 0 and the end N.

Similarly for weird cases from trans-splicing, the
start/end coordinates would give the total region
spanned. As shown below, sometimes that happens
to match the current behaviour, but in other cases
the current behaviour isn't useful anyway.

Adopting start/end as the spanned region makes a
lot of sense for things like drawing features in a
region of interest, or other more abstract tasks
doing feature/region intersection. Here knowing the
min/max boundaries of the region spanned is more
useful than any attempt to capture the biological
start/end of the feature.

Note that already for the simple FeatureLocation for
reverse strand features we have start < end, i.e. the
start coordinate property does NOT represent the
biological starting point.

Under the proposed CompoundLocation behaviour,
the desirable property of the FeatureLocation that
start < end would also hold for compound locations.

Pathological examples at the end,

Regards,

Peter

P.S.  One of the advantages of the CompoundLocation
is when constructing the location you don't give the
overall start/end - these are inferred from the list of parts
automatically. Currently the GenBank/EMBL parser
is having to do this.

P.P.S. I've also confirmed Lenna's testing that sum
of feature locations works if we define integer
addition with locations (so that sum can include
zero and several locations), see:

https://github.com/peterjc/biopython/commit/dc6bc658141cc42e7e6802bbe8baf6c87a6874c0
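
Why integer addition matters for sum() can be seen in a toy example. The
built-in sum() starts from the integer 0, so the first step is effectively
"0 + location". The SimpleLocation class below is a hypothetical stand-in,
and treating an integer as a coordinate offset is an assumption about how
the addition is defined.

```python
class SimpleLocation(object):
    """Toy location holding a list of (start, end) parts."""

    def __init__(self, parts):
        self.parts = [tuple(part) for part in parts]

    def _shift(self, offset):
        return SimpleLocation((start + offset, end + offset)
                              for start, end in self.parts)

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)  # location + 5 shifts the coordinates
        if isinstance(other, SimpleLocation):
            return SimpleLocation(self.parts + other.parts)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, int):
            return self._shift(other)  # 0 + location, as used by sum()
        return NotImplemented


combined = sum([SimpleLocation([(0, 10)]), SimpleLocation([(20, 30)])])
```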

-----------------------------------------------------------------
Trans-splicing: Mixed Strands

An example where the range/span idea is simpler is
mixed strand features like this trans-spliced example
from NC_000932 (in our unit tests),

join(complement(69611..69724),139856..140650)

What would you expect as the start/end here? The
biological start is base 69724 (one based) and the
last base is 140650. Currently:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "gb").features[135]
>>> print f.location
[69610:140650]
>>> f.location.start
ExactPosition(69610)
>>> f.location.end
ExactPosition(140650)
>>> for sub in f.sub_features: print sub.location
...
[69610:69724](-)
[139855:140650](+)

Here the end value does match the last base in the
feature following the biological order - the start value
is actually a base in the middle of the combined
sequence. In fact, for this example the start/end
are already acting like the range/span idea.

-----------------------------------------------------------------
Trans-splicing: Reverse strand

The example above is a real corner case, and so is this
single strand trans-splicing example, also in NC_000932,
which is a bit like a circular genome origin spanning
annotation:

complement(join(97999..98793,69611..69724))

With the current master branch:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "genbank").features[1]
>>> print f.location
[97998:69724](-)
>>> f.location.start
ExactPosition(97998)
>>> f.location.end
ExactPosition(69724)
>>> for sub in f.sub_features: print sub.location
...
[97998:98793](-)
[69610:69724](-)

Notice that we do not have start < end as you might
expect. However the start and end DO capture the
biological end and start (order inverted - this is on
the reverse strand). To verify this I find it helps to
transform the GenBank style location:

complement(join(97999..98793,69611..69724))

into the old EMBL equivalent:

join(complement(69611..69724),complement(97999..98793))

i.e. The first base is 69724 (one based counting), and
the last base is 97999 (one based counting). So if
you wanted to look at the upstream or downstream
(assuming that makes sense for a trans-spliced
gene), the current start/end values are useful (but
you have to choose start vs end dependent on the
strand).

On the other hand, the range of co-ordinate values
is 69611 to 98793 (one based, inclusive). Therefore
one might expect start 69610 and end 98793 (Python
counting), giving the spanned region.


From chapmanb at 50mail.com  Thu Sep  6 00:37:57 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:37:57 -0400
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
	<CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
Message-ID: <87wr08x9y2.fsf@fastmail.fm>


Hi all;
I don't know if there's going to be a clean way around mucking up the
API for older scripts if we make this change.

If we want to do this my thoughts would be:

- Use the 'bio' module since that's the cleanest.
- Hack together something that will remove old 'Bio' modules on install
  of the new version.
- Write a Biopython1to2 script that will fix the imports on older
  scripts to the new module structure.
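
As a rough idea of what such a script might do, the sketch below lower-cases
the module path in "from Bio... import ..." lines. Everything here is
hypothetical: the function name, the assumption that the new structure is a
plain lower-casing of the old one, and the decision to leave other import
styles (e.g. "import Bio.SeqIO") and the imported names untouched.

```python
import re

# Matches the start of "from Bio.Something import" and captures the
# dotted module path so that only that part is rewritten.
_FROM_IMPORT = re.compile(
    r"^([ \t]*from[ \t]+)(Bio(?:\.\w+)*)([ \t]+import\b)", re.MULTILINE)


def rewrite_imports(source):
    """Lower-case the module path of 'from Bio... import' statements."""
    def _lower_path(match):
        return match.group(1) + match.group(2).lower() + match.group(3)
    return _FROM_IMPORT.sub(_lower_path, source)


old_script = "from Bio.Seq import Seq\nmy_seq = Seq('ACGT')\n"
new_script = rewrite_imports(old_script)
```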

However, my vote would be to stick with everything as is. I know we
aren't PEP8 compliant but things aren't so awful that we need an
upheaval. I wish Python library installs weren't so messy, so that we
could do this more cleanly.
Brad

> Hi guys,
>
> If I may add my two cents on this issue,  I think it's also a chance
> to rectify all other namespace issues that we may have (not just
> PEP8-related).
>
> For instance:
>
> * In the root namespace, we now have Bio.Align and Bio.AlignIO. Since
> we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the
> Github discussion[1]), I suppose we should do the same with Bio.Align
> as well (perhaps into bio[py].seq.align or bio[py].align).
>
> * With the change above, we might also want to change some of the
> submodule names completely. For example, if we merge Bio.Align into
> bio[py].align we'll have bio[py].align.applications, which I
> personally think could be shortened into bio[py].align.app.
>
> * As per the Github discussion[1] as well, perhaps Bio.SeqUtils
> should also be merged as Seq object methods.
>
> There may be other changes as well, but the bottom line is all these
> changes will be quite considerable. As such, I think we could go all
> the way and be explicit in stating that the changes will be
> incompatible with previous Biopython versions (i.e. old scripts will
> break).
>
> As for bio.* and biopy.*, if we do decide to go all the way, bio.*
> seems like a better choice since there will be other incompatible
> changes anyway. But if we eventually decide to only fix PEP8-related
> issues while keeping compatibility with older versions, I'm leaning
> more towards biopy.*.
>
> regards,
> Bow
>
> [1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340
>
> On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>> Hi Peter,
>>>
>>> --- On Tue, 9/4/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>> One idea I was pondering is a new parallel namespace,
>>>> ideally bio.* but we can't use that due to case
>>>> insensitive file systems like Windows and (by default)
>>>> Mac OS X. So perhaps biopy, or bp?
>>>
>>> As you say, the ideal namespace is bio.*, so let's use
>>> that. We have been using Bio.* for more than 10 years.
>>> We should not get stuck with a non-ideal namespace for
>>> the next 10+ years because there may be some glitches
>>> switching from Bio.* to bio.*. Frankly I doubt that this
>>> will cause huge problems in practice.
>>
>> So you'd advocate a simple switch where from one
>> release to the next we change all the module names
>> (making them lower case, perhaps from consolidation
>> under bio.seq too)?
>>
>> This may cause some difficulties for upgrades - it may
>> require manual intervention to remove the old Bio folder
>> in order to allow creation of the new bio folder.
>>
>>>> We could gradually move code over to the new namespace,
>>>> using imports to preserve back compatibility - but support
>>>> both namespaces during a (long) transition period.
>>>
>>> Why do we need a transition period? It's just a matter
>>> of replacing upper case with lower case in the imports.
>>
>> That forces people to update all their scripts at once.
>> Of course, we can document how to do this so a script
>> would work before and after the case change, e.g.
>>
>> try:
>>     from bio.seq import Seq
>> except ImportError:
>>     from Bio.Seq import Seq
>>
>>>> What I like about this is it allows people to make a
>>>> gradual conversion - and we don't have the burden of two
>>>> main branches if we attempted a single jump to a Biopython v2.
>>>>
>>>> Does this seem worth considering?
>>>
>>> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.
>>>
>>> Best,
>>> -Michiel.
>>>
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From chapmanb at 50mail.com  Thu Sep  6 00:31:58 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:31:58 -0400
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <CAKVJ-_6_8JUXmCx5q-eghSczNxqPmSbaaTc_GJ_QCqQOtjGUbg@mail.gmail.com>
	<877gsq8mn2.fsf@fastmail.fm>
	<1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
	<CAKVJ-_5ThfYEBmrhpGcHNRvrf2_QEe4pTUgF0JV+bBQOwyW0Fg@mail.gmail.com>
	<1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID: <87zk54xa81.fsf@fastmail.fm>


Andrew;

> Yeah, it would be great if this module could finally be included. I've
> e-mailed the list numerous times asking what would be necessary to
> include it and have done all you and Brad have asked. I've watched you
> include bits and pieces of code from other contributors quickly and
> without much scrutiny, so I can't help but feel singled out. What is
> the logic in delaying this? We've heard from people who are already
> using the code and have asked when it will be pulled. Is it serving
> the community to not even include the basic reader/writer? Am I
> wasting my time? Is it your goal to actively discourage contributions?

In addition to Peter's technical comments, from a personal side I hope
you don't take offense. We definitely value contributions and your work.

Some changes can end up being tricky because of the need to work with or
fix previous non-optimal design decisions. When they require extra
attention and decisions this can make it hard to allocate time for
folks that volunteer on the project.

This is definitely nothing personal and I hope you don't feel that way.
My GFF parser has languished for even longer for similar reasons.

I think the long term solution for this is incorporating beta code so we
can get these in, recognize the contributions, make them available,
and still give wiggle room to improve the design before locking into
an API that we need to support long term.

Thanks again for all the work. We do appreciate it,
Brad


From chapmanb at 50mail.com  Thu Sep  6 00:45:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:45:19 -0400
Subject: [Biopython-dev] TAIR/AGI support
In-Reply-To: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
References: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
Message-ID: <87txvcx9ls.fsf@fastmail.fm>


Kevin;
Thanks for the e-mail and offers of code. Always happy to have other
folks involved with the project.

> What's the status of TAIR AGIs in BioPython (I can see no mention of them,
> or support for them)? I've written a brief module which allows a user to
> query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> any interest in including such functionality in BioPython?

Is the code available on GitHub to get a better sense of all the
functionality it supports? Do you have an idea where it would fit best?
As a tair submodule inside of Bio.Entrez, or somewhere else?

> More generally, are there any particular areas of BioPython development
> which could use an extra pair of hands?

Following the mailing list for discussions on current projects is the
best way to get a sense of what different folks are working on. The
issue tracker also has open issues and features that could use attention
if anything there strikes your fancy:

https://redmine.open-bio.org/projects/biopython

Hope this helps,
Brad


From p.j.a.cock at googlemail.com  Thu Sep  6 00:57:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:57:19 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <87wr08x9y2.fsf@fastmail.fm>
References: <CAKVJ-_6QRvzos8DUOmqr9cnEU0tP1s7oQMwNJ8L3KHaAEUKY_w@mail.gmail.com>
	<1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
	<CAKVJ-_6q8hp92QqCn-2EGsOkb5mS30O1EpnMENgeShDtP6avUA@mail.gmail.com>
	<CADEGkF4rnJ21Ys3m9C2PO8JtzTXxgTV5G7tcGuw=Q3x57REy-w@mail.gmail.com>
	<87wr08x9y2.fsf@fastmail.fm>
Message-ID: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:37 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Hi all;
> I don't know if there's going to be a clean way around mucking up the
> API for older scripts if we make this change.
>
> If we want to do this my thoughts would be:
>
> - Use the 'bio' module since that's the cleanest.
> - Hack together something that will remove old 'Bio' modules on install
>   of the new version.
> - Write a Biopython1to2 script that will fix the imports on older
>   scripts to the new module structure.

I really don't like using "bio" since (due to Python's use of
folders for package names) you couldn't in general also have
the old code available under "Bio". i.e. This forces a hard
switch on our users which is a very bad idea I think.

Thus my suggestion of something else like "biopy" (although
the Mac's autocorrection keeps turning it into biopsy, which
would be annoying - grin), or if not already taken "bp".

To expand on my earlier email, the transition structure I
had in mind was that we'd have something like this:

biopy/seq/__init__.py - real code for Seq object etc

Bio/Seq/__init__.py - just "from biopy.seq import Seq"
and a deprecation warning.
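
The transition layout described above can be tried out without touching a real Biopython install. The following is only a minimal sketch: the package names biopy_demo and BioDemo are made up for the demonstration, the Seq class is a stand-in, and the standard DeprecationWarning is used in place of any project-specific warning class. It writes both packages to a temporary directory, imports the legacy name, and records the warning that the shim raises.

```python
import importlib
import sys
import tempfile
import warnings
from pathlib import Path

# Source of the "real" code in its new, lower-case home (hypothetical name).
NEW_MODULE = '''
class Seq(str):
    """Stand-in for the real sequence class."""
'''

# Source of the legacy shim: re-export the class and warn on import.
LEGACY_SHIM = '''
import warnings
from biopy_demo.seq import Seq  # re-export from the new location

warnings.warn(
    "BioDemo.Seq is deprecated; please import biopy_demo.seq instead",
    DeprecationWarning,
    stacklevel=2,
)
'''


def write_package(root, dotted_name, leaf_source):
    """Create nested packages under root; only the last one gets leaf_source."""
    path = Path(root)
    parts = dotted_name.split(".")
    for depth, part in enumerate(parts, start=1):
        path = path / part
        path.mkdir(exist_ok=True)
        source = leaf_source if depth == len(parts) else ""
        (path / "__init__.py").write_text(source)


def import_legacy_name():
    """Import the legacy package and return (legacy, new, recorded warnings)."""
    # Forget earlier demo imports so the shim's module body runs again.
    for name in list(sys.modules):
        if name.split(".")[0] in ("BioDemo", "biopy_demo"):
            del sys.modules[name]
    with tempfile.TemporaryDirectory() as root:
        write_package(root, "biopy_demo.seq", NEW_MODULE)
        write_package(root, "BioDemo.Seq", LEGACY_SHIM)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            with warnings.catch_warnings(record=True) as caught:
                warnings.simplefilter("always")
                legacy = importlib.import_module("BioDemo.Seq")
            new = importlib.import_module("biopy_demo.seq")
        finally:
            sys.path.remove(root)
    return legacy, new, caught
```

Calling import_legacy_name() should give back two modules that expose the very same Seq class, plus one recorded deprecation warning. Old scripts keep working while being nudged towards the new import.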

> However, my vote would be to stick with everything as is. I know we
> aren't PEP8 compliant but things aren't that awful that we need an
> upheaval. I wish Python library installs weren't so messy that we could
> do this more cleanly,
> Brad

That does seem safer, and we can still do the less invasive
restructuring discussed, e.g.

Bio/Seq.py -> Bio/Seq/__init__.py allowing us to (gradually)
move Bio.Seq* things under Bio.Seq, while preserving the
legacy imports under a deprecation warning.

Also if we're considering moving Bio.SeqIO to Bio.Seq, as
Bow points out, we'd want to do Bio/AlignIO.py -> Bio.Align
(perhaps pushing the core objects into Bio/Align/_objects.py
or similar but exposing them in the current namespace
location).

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 01:34:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 02:34:50 +0100
Subject: [Biopython-dev] SeqRecord locations;
	was: Beta code in the official releases?
Message-ID: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> In my mind, the main technical issue regarding MAF and AlignIO
> and the common alignment object is the lack of a common way
> of handling the idea of start/end (and sometimes strand) for
> each sequence (in a consistent co-ordinate system using Python
> counting). Evidently I haven't managed to adequately convey my
> interpretation/concern.
>
> Some file formats like EMBOSS' have these numbers explicitly
> but we're not parsing them:
> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>
> In the case of "fasta-m10" the numbers are stored in private
> properties as a 'short term' hack:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>
> Others like Stockholm have identifier/start-end as a combined
> name (but this is not mandatory). Here the start and end are
> being stored in the annotations dictionary (as unparsed strings,
> still using 1-based co-ordinates).
>
> In MAF the start/end are explicit and much more important.
> It would be near pointless to parse the file ignoring these.
> Maybe your approach is good enough for MAF, and we
> should have adopted it as is, and delayed better integration
> with the other AlignIO formats?
>
> i.e. This is a general limitation in AlignIO and the object
> model, somewhat annoying in the formats already supported,
> but information critical to the MAF format.
>
> I was expecting a convention for this to fall out of Bow's GSoC
> work for 'pairwise alignments' in SearchIO - but the object
> model he came up with was not SeqRecord based (many
> of the file formats he was using didn't include sequences).
>
> Right now my inclination is still to add a location property to
> the SeqRecord, usually a FeatureLocation, but it could also
> be the proposed CompoundLocation for more complex cases.
> The question then is if/when this would be propagated, e.g.
> SeqRecord slicing/addition.
> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>
> So the wheels are turning, but slowly. I have not had as
> much time to dedicate to this as I would like - but other
> smaller or less inter-connected things are much easier to
> review and merge.

To expand on the SeqRecord.location property idea, I am
thinking about (in the typical use cases) using a normal
FeatureLocation object (from Bio.SeqFeature) where the
start, end or strand are in the same co-ordinate system
as the sequence of the SeqRecord.

i.e. For a protein fragment, they would be in amino acids.
For a nucleotide fragment, they would be in base pairs.

Note that you might want to describe the CDS region
for a protein sequence (which would be possible even
for a join using the proposed CompoundLocation), so
maybe 'location' is the wrong name here, perhaps
'fragment' or 'subregion', or something is clearer?

When I talked about adding SeqRecords, and what would
the combined SeqRecord's location be, we could use
FeatureLocation addition (as defined on the branch for
CompoundLocation objects).

For slicing a SeqRecord, provided len(record.location)
== len(record), this is well defined. However, I expect
that quite often if used for alignments, what we will have
instead is len(record.location) == len(record.seq.ungapped())
so we might be able to update the sub-record's location
if we count the gap characters and factor them in. This
equality could be verified in the SeqRecord __init__
(which would require the gap character, but the AlignIO
parsers should all set that).

I would like slicing to update the start/end because
slicing alignment objects seems to be a quite common
operation - so if you started from an alignment file
using start/end (like Stockholm or MAF) it would be
good to update these fields for the sub-alignment.
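
As a rough illustration of the gap counting described above, here is a sketch in plain Python. It uses a simple (start, end) tuple in Python counting instead of a FeatureLocation, assumes the forward strand only, and the function name slice_region is made up for this example.

```python
def slice_region(row, region, index, gap="-"):
    """Slice a gapped alignment row and update its region.

    row    -- the gapped sequence as a string
    region -- (start, end) on the ungapped parent, Python counting
    index  -- a slice object applied to the gapped row

    Returns (sub_row, sub_region); sub_region is None when the slice
    step is not 1, in which case the location is simply dropped.
    """
    start, end = region
    # The equality check discussed above: the region must be as long
    # as the row once the gap characters are taken out.
    if end - start != len(row) - row.count(gap):
        raise ValueError("region length does not match ungapped row length")
    sub_row = row[index]
    first, _last, step = index.indices(len(row))
    if step != 1:
        return sub_row, None
    # Letters (non-gap characters) to the left of the slice shift the start.
    letters_before = first - row[:first].count(gap)
    new_start = start + letters_before
    new_end = new_start + len(sub_row) - sub_row.count(gap)
    return sub_row, (new_start, new_end)
```

For example, the row "AC--GTA-C" with region (10, 16) sliced with slice(2, 7) gives "--GTA" and the region (12, 15), since two letters precede the slice and three letters remain inside it.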

This feels like it would work, but would it be useful or
just over engineering? Would a simple static location
property which is not automatically propagated in
SeqRecord manipulations be enough (at least initially)?

If so, is Brad's suggestion to just use special values in
the annotations dictionary a simpler way forward (where
we already have policies in place for handling generic
annotation during SeqRecord manipulation - in general
dropping it)?

If so, would this be keys 'start', 'end', 'strand' for
integer start and end using Python counting, and
a strand value of +1 or -1 for forward and reverse?
[We could use strand None for unavailable as in
the SeqFeature location object, but I think no entry
in the dictionary is nicer here].
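
To make the annotations-dictionary variant concrete, here is a small sketch of how a Stockholm-style "name/start-end" identifier might be turned into such entries. The helper name split_name_and_region is made up, no strand key is set (following the "no entry" preference above), and identifiers whose start is greater than their end are left alone rather than guessed at.

```python
import re

# Matches identifiers such as "O83071/192-246" (1-based, inclusive).
_NAME_RANGE = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def split_name_and_region(identifier):
    """Return (name, annotations) for a 'name/start-end' identifier.

    The annotations use Python counting: zero-based start, exclusive end.
    Identifiers without a usable range come back with an empty dictionary.
    """
    match = _NAME_RANGE.match(identifier)
    if match is None:
        return identifier, {}
    first = int(match.group("start"))
    last = int(match.group("end"))
    if first < 1 or first > last:
        # Not a simple forward range; leave it untouched in this sketch.
        return identifier, {}
    return match.group("name"), {"start": first - 1, "end": last}
```

With this, "O83071/192-246" becomes the name "O83071" with start 191 and end 246, so end - start equals the 55 residues covered.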

Peter


From anaryin at gmail.com  Thu Sep  6 05:52:34 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 6 Sep 2012 08:52:34 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
Message-ID: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>

Hey,

> Which Python was that? i.e. The OrderedDict from the standard lib
> (which I hope is optimised), or the back port (which might be slower).
>

Both. I also found it strange and googled it:
http://stackoverflow.com/questions/8176513/ordereddict-performance-compared-to-deque
Apparently OrderedDict is pure Python, not C like dict, thus the
difference.


> That seems risky - but see if you can sort out what is happening
> with the unit tests (below).
>

What Bio.PDB does right now is rely on the list to iterate over things.
Thus, you get the order in which you read the PDB file. However, if you
sort it using the sort methods of the various objects, the following
rules apply:

Atom.py - N CA C O first, then alphabetically
Residue.py - First amino acids and nucleic acids, then heteroatoms.
Chain.py - Empty chains last.

These are already in place somewhere in the code. I just used them to
overload the __cmp__ method, with a couple of additions because I
personally disagree with the following:

Atom.py - Inorganic atoms should come out last. For simplicity.
Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get
in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
PDB files already have weird large numbers for water and ions for example,
so these come out last anyway. Pushing all HETATMs to the end will
sometimes disrupt the "natural" order of things, for instance modified
residues. Magic perhaps :)
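
For illustration only, the two ordering rules could be written as sort keys along the following lines. This is not the code on the branch: it works on plain atom names and on residue id tuples of the form (hetero flag, sequence number, insertion code), and it uses key functions rather than __cmp__, which Python 3 no longer consults.

```python
# Backbone atoms come first, in this fixed order.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}


def atom_sort_key(atom_name):
    """N, CA, C, O first (in that order), then everything else alphabetically."""
    rank = BACKBONE_ORDER.get(atom_name, len(BACKBONE_ORDER))
    return (rank, atom_name)


def residue_sort_key(residue_id):
    """Order by sequence number, then insertion code, ignoring the hetero flag.

    Ignoring the flag keeps a modified residue such as MSE in its
    natural position instead of pushing it behind the standard residues.
    """
    _hetero_flag, number, insertion_code = residue_id
    return (number, insertion_code)
```

Sorting the names CB, O, CG, N, C, CA with atom_sort_key gives N, CA, C, O, CB, CG, and the residues 152 VAL, 153 CYS, 151 MSE come back as 151, 152, 153 with residue_sort_key.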

I sorted out all relevant issues with the unit tests. I had a small problem
with build_peptides because of this HETATM last rule, so I took it away and
now it works. All tests pass except 4: 2 because of the header, which is
not read decently right now, and 2 because of the ordering which is
explicit in the assert statement of the test. So it's a matter of changing
these assertions and they will work.


> It would also look less like Java code ;)
>
> I like this plan - but initially define and document the new properties,
> and deprecate the old get/set properties. Without that you'll break
> almost every PDB using script out there.
>

How do I deprecate the old ones? Is there a DeprecationWarning or so?

Just a reminder, if you want to test/check the code, it's on my
GitHub: https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements

Cheers,

João


From w.arindrarto at gmail.com  Thu Sep  6 05:57:04 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 6 Sep 2012 07:57:04 +0200
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
Message-ID: <CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>

Hi guys,

To add my two cents, I am in favor of creating a dynamic SeqRecord
coordinate system using SeqFeature. However, I think it would also be
good if we set some limitations as there are so many ways that slicing
and addition could be used to create new SeqRecords, and anticipating
all these scenarios may create an over-engineered (and probably
slower) SeqRecord.

Some scenarios that I can think of now:

1. Slicing SeqRecord objects using step values > 1 (e.g. new_seq = seq[1:120:3])
2. Adding two or more SeqRecord objects with noncontiguous coordinates
(i.e. end coordinate of the first sequence is not directly followed by
the second sequence's start coordinate), and then slice the resulting
object

So maybe some limitations that we could set are:

1. Only update the coordinates if slicing step is 1 (or -1), otherwise
discard it.
2. Only update the coordinates if addition is between contiguous
coordinates, otherwise discard it.

Personally, I think this would cover most use cases for slicing while
allowing us to keep it simple.

As for the name, 'region' sounds better than 'location'. Maybe
'coverage'? I don't have any strong preference between these, but
'subregion' doesn't feel that nice.

Finally, for the coordinate system, I imagine it will use Python's
coordinate system, too? (zero-based, half-open, and the parsers /
writers should do the conversion). Should we also reverse the
coordinates if the objects are sliced in reverse (e.g.
seqrecord[::-1]) or simply invert the strand value but keep the
coordinates unchanged?

regards,
Bow




From mjldehoon at yahoo.com  Thu Sep  6 06:31:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 5 Sep 2012 23:31:57 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>
Message-ID: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>

[Brad]
> Hack together something that will remove old 'Bio' modules
> on install of the new version.

We could check in setup.py if we can import Bio, and ask the user to remove the old Biopython installation before proceeding. Since we can tell the user exactly which directory to remove, this would be straightforward. I would prefer this to removing the directory automatically.
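
A check of this kind could look roughly like the sketch below. It is not taken from Biopython's setup.py; the helper names are made up, and it only locates an importable package and builds the message, leaving the removal itself to the user.

```python
import importlib.util
import os


def find_existing_install(package_name):
    """Return the directory of an importable top-level package, or None.

    Only real packages (directories) are reported, so that the user is
    never told to remove a directory that holds unrelated modules.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


def removal_message(package_name):
    """Build the message asking the user to remove an old installation."""
    directory = find_existing_install(package_name)
    if directory is None:
        return None
    return (
        "An existing %s installation was found in %s. "
        "Please remove that directory and run the installer again."
        % (package_name, directory)
    )
```

In a setup script one might call removal_message("Bio") before installing and stop with that message if it is not None.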

[Peter]
> I really don't like using "bio" since (due to Python's use
> of folders for package names) you couldn't in general also
> have the old code available under "Bio". i.e. This forces
> a hard switch on our users which is a very bad idea I think.

I don't see why a user would like to have both an old Biopython under Bio and a new Biopython under bio. Unless he wants to run some scripts with the old Biopython and other scripts with the new Biopython, but I don't see the point of that.

[Peter]
> Thus my suggestion of something else like "biopy" [...]
> , or if not already taken "bp".

[Brad]
> However, my vote would be to stick with everything as is.

If the choice is between "bp", "biopy", or "Bio", then I agree with Brad; I prefer keeping a nice but PEP8-noncompliant module name "Bio" rather than switching to a PEP8-compliant but less attractive name like "biopy" or "bp".

Best,
-Michiel.


From p.j.a.cock at googlemail.com  Thu Sep  6 07:06:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:06:07 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>
References: <CAKVJ-_7=qK=_XjV4DYBgY8g1E5K=9dRVoe590HU_cwLfTdvCjQ@mail.gmail.com>
	<1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>

On Thu, Sep 6, 2012 at 7:31 AM, Michiel de Hoon wrote:
> [Brad]
>> Hack together something that will remove old 'Bio' modules
>> on install of the new version.
>
> We could check in setup.py if we can import Bio, and ask
> the user to remove the old Biopython installation before
> proceeding. Since we can tell the user exactly which directory
> to remove, this would be straightforward. I would prefer this
> to removing the directory automatically.

I agree automatically removing the old install is risky.

For single user machines, where the single user has only a
small collection of scripts this isn't such an issue. For any
shared server, or user with lots of Biopython scripts (some
of which may have been written by different people), you
would be forced into a mass change at one go.

You would also have considerable hassle later on with any
attempt to re-run old scripts.

> [Peter]
>> I really don't like using "bio" since (due to Python's use
>> of folders for package names) you couldn't in general also
>> have the old code available under "Bio". i.e. This forces
>> a hard switch on our users which is a very bad idea I think.
>
> I don't see why a user would like to have both an old
> Biopython under Bio and a new Biopython under bio.
> Unless he wants to run some scripts with the old Biopython
> and other scripts with the new Biopython, but I don't see
> the point of that.

Really? That is exactly what I am concerned about (both
for single user machines like my desktop, and shared
machines like our servers). How about the common
situation of wanting to re-run old scripts from old
projects on new data?

If we were just changing the case, this might not be
too complex (it would still be a frustrating transition
period), but if we're also moving things around at the
same time it is too much I feel.

> [Peter]
>> Thus my suggestion of something else like "biopy" [...]
>> , or if not already taken "bp".
>
> [Brad]
>> However, my vote would be to stick with everything as is.
>
> If the choice is between "bp", "biopy", or "Bio", then
> I agree with Brad; I prefer keeping a nice but
> PEP8-noncompliant module name "Bio" rather than
> switching to a PEP8-compliant but less attractive
> name like "biopy" or "bp".

There is 'biopython' but it is rather long? No other ideas
from anyone else?

How about over the next year we gradually consolidate
modules under the existing mixed case names? e.g.
move Bio.AlignIO functionality under Bio.Align, and
Bio.Seq* under Bio.Seq (leaving backwards compatible
imports supported but deprecated).

Here's a further (and slightly more radical) idea: We
stick with using 'Bio' and the current mixed case
names on Python 2, but adopt 'bio' and other PEP8
compatible names for Python 3 (as a uniform
strict automatic rule: mixed case -> lower case)?
i.e. Do this as part of our 2to3 process.

Some nasty downside might occur to me later
but right now it seems like a neat idea... other
than not being quite in line with the expectation
that Python 3 should not be used as an excuse
to make API changes. Too radical?
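
To give a feel for what a "mixed case -> lower case" rule might do to import lines, here is a deliberately naive sketch. It is a plain regular expression pass, not a 2to3 fixer, and the function name is made up. It lower-cases only the dotted module path that follows "from" or "import", so names imported from a package (as in "from Bio import SeqIO") are left as they are and would need separate handling.

```python
import re

# An import statement at the start of a line whose module path begins
# with the package name "Bio" (but not, say, "Biology").
_BIO_IMPORT = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+)(Bio(?:\.\w+)*)\b", re.MULTILINE
)


def lowercase_bio_imports(source):
    """Lower-case the Bio.* module path in import statements."""
    return _BIO_IMPORT.sub(
        lambda match: match.group(1) + match.group(2).lower(), source
    )
```

For instance, "from Bio.Seq import Seq" becomes "from bio.seq import Seq", while an unrelated "from Biology import x" is not touched.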

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 07:16:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:16:41 +0100
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
	<CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
Message-ID: <CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:57 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi guys,
>
> To add my two cents, I am in favor of creating a dynamic SeqRecord
> coordinate system using SeqFeature. However, I think it would also be
> good if we set some limitations as there are so many ways that slicing
> and addition could be used to create new SeqRecords, and anticipating
> all these scenarios may create an over-engineered (and probably
> slower) SeqRecord.
>
> Some scenarios that I can think of now:
>
> 1. Slicing SeqRecord objects using step values > 1
> (e.g. new_seq = seq[1:120:3])

Absolutely - here I would expect to lose the location information.
We already have similar restrictions in the SeqRecord slicing
for how SeqFeatures are handled.

> 2. Adding two or more SeqRecord objects with noncontiguous coordinates
> (i.e. end coordinate of the first sequence is not directly followed by
> the second sequence's start coordinate), and then slice the resulting
> object

Adding *could* be done via the CompoundLocation, although that
in itself might want to consider if nicely-abutting locations should
be merged, e.g. in GenBank notation 100..201 and 202..300 could
be 100..300 rather than join(100..201,202..300) which is what my
CompoundLocation code currently does.
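
As a sketch of the merging idea (not the CompoundLocation code itself), the example can be worked through with plain (start, end) tuples in Python counting. The helper names are made up, and anything that does not abut exactly is dropped by returning None, which matches the simple rule proposed in the quoted text below.

```python
def genbank_to_python(first, last):
    """Convert a 1-based inclusive GenBank range to Python counting."""
    return (first - 1, last)


def python_to_genbank(start, end):
    """Format a Python-counting region in GenBank 'first..last' notation."""
    return "%i..%i" % (start + 1, end)


def add_regions(left, right):
    """Merge two regions if right starts exactly where left ends.

    Returns the merged (start, end), or None when either region is
    missing or the two do not abut.
    """
    if left is None or right is None:
        return None
    if left[1] != right[0]:
        return None
    return (left[0], right[1])
```

In Python counting 100..201 is (99, 201) and 202..300 is (201, 300); the end of the first equals the start of the second, so they merge to (99, 300), which prints as 100..300.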

> So maybe some limitations that we could set are:
>
> 1. Only update the coordinates if slicing step is 1 (or -1), otherwise
> discard it.

Yep.

> 2. Only update the coordinates if addition is between contiguous
> coordinates, otherwise discard it.

That does seem simple - especially as the primary driver for this
is multiple sequence alignments and those only support simple
continuous locations with a start and end.

> Personally, I think this would cover most use cases for slicing while
> allowing us to keep it simple.

That is perhaps a good balance (and as a bonus means we
don't have to link this to the CompoundLocation unless we
want to).

> As for the name, 'region' sounds better than 'location'. Maybe
> 'coverage'? I don't have any strong preference between these, but
> 'subregion' doesn't feel that nice.

Region seems fine.

> Finally, for the coordinate system, I imagine it will use Python's
> coordinate system, too? (zero-based, half-open, and the parsers /
> writers should do the conversion).

Yes. I'm suggesting using the FeatureLocation object (from
Bio.SeqFeature), which does this.

> Should we also reverse the
> coordinates if the objects are sliced in reverse (e.g.
> seqrecord[::-1]) or simply invert the strand value but keep the
> coordinates unchanged?

The strand changes, and the start/end must also be recalculated
from the length of the parent sequence. The FeatureLocation
has a (private) _flip method to do this. In some cases we won't
have the parent sequence length, so would have to drop the
location.
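
The arithmetic behind such a flip is short enough to sketch in plain Python. This is an illustration under the Python-counting convention, not the private _flip method itself, and the function name is made up: a region (start, end) on a parent of length L maps to (L - end, L - start) on the other strand.

```python
def flip_region(start, end, strand, parent_length):
    """Return the region as seen from the opposite strand of the parent.

    Returns None when the parent length is unknown, in which case the
    location has to be dropped.
    """
    if parent_length is None:
        return None
    if not 0 <= start <= end <= parent_length:
        raise ValueError("region does not fit within the parent sequence")
    # Swap +1 and -1; leave any other strand value (e.g. None) alone.
    new_strand = -strand if strand in (1, -1) else strand
    return (parent_length - end, parent_length - start, new_strand)
```

For a parent of length 100, the forward region (10, 25) becomes (75, 90) on the reverse strand, and flipping that again restores the original.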

I'll have a go at implementing this on a branch in the next
few hours (unless something more pressing comes up at
the BioHackathon). As it happens this overlaps nicely with
some of the group discussion about how to represent feature
locations in RDF.

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 07:21:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:21:16 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
Message-ID: <CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:52 AM, João Rodrigues <anaryin at gmail.com> wrote:
>
>> It would also look less like Java code ;)
>>
>> I like this plan - but initially define and document the new properties,
>> and deprecate the old get/set properties. Without that you'll break
>> almost every PDB using script out there.
>
> How do I deprecate the old ones? Is there a DeprecationWarning or so?
>

Yes, we use Bio.BiopythonDeprecationWarning rather than the
default DeprecationWarning because the latter is now silent
by default. Grep the code for example usage, see also:
http://biopython.org/wiki/Deprecation_policy
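
A minimal sketch of the pattern, assuming a toy Atom class and a stand-in warning category (both hypothetical; real code would use the library's own warning class instead):

```python
import warnings


class LibraryDeprecationWarning(Warning):
    """Stand-in for a library-specific deprecation warning category.

    It derives from Warning rather than DeprecationWarning, so the
    default filters do not hide it.
    """


class Atom:
    """Toy class, not the Bio.PDB one."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        """New-style access: atom.name"""
        return self._name

    def get_name(self):
        """Old-style getter, kept for now but flagged as deprecated."""
        warnings.warn(
            "get_name() is deprecated; use the .name property instead",
            LibraryDeprecationWarning,
            stacklevel=2,  # report the caller's line, not this one
        )
        return self.name
```

Existing scripts calling get_name() keep working and see the warning, while new code uses the property and stays silent.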

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep  6 09:36:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 10:36:41 +0100
Subject: [Biopython-dev] SeqRecord locations;
 was: Beta code in the official releases?
In-Reply-To: <CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>
References: <CAKVJ-_66Mv8=zJg9e8RYnLdHiTG_TBeHM_Hh8HL=68is7_bM7w@mail.gmail.com>
	<CADEGkF6MZ4JY5kCn8ucMFPTTJ6+3ovovG6SfViATJ_jhH2u1ZA@mail.gmail.com>
	<CAKVJ-_5DDEkFhsuSZMb2c_oJ3z74tOTqDZb7gELdSsiShXBBLA@mail.gmail.com>
Message-ID: <CAKVJ-_7kojE7wJuxQncVc0+3pE+d6KBKrtvoefm6R30w+XzmMw@mail.gmail.com>

On Thu, Sep 6, 2012 at 8:16 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I'll have a go at implementing this on a branch in the next
> few hours (unless something more pressing comes up at
> the BioHackathon). As it happens this overlaps nicely with
> some of the group discussion about how to represent feature
> locations in RDF.
>

I've made a start, will do more later:
https://github.com/peterjc/biopython/tree/sr_loc

Peter


From mjldehoon at yahoo.com  Thu Sep  6 10:13:38 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 6 Sep 2012 03:13:38 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
Message-ID: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>

--- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> For any shared server, [...] you
> would be forced into a mass change at one go.

OK, for multiple users on a shared server I see your point.

> Here's a further (and slightly more radical) idea: We
> stick with using 'Bio' and the current mixed case
> names on Python 2, but adopt 'bio' and other PEP8
> compatible names for Python 3 (as a uniform
> strict automatic rule: mixed case -> lower case)?
> i.e. Do this as part of our 2to3 process.

The Python developers argue against combining a switch to Python 3 with other major changes, since then if bugs arise it is unclear if it is due to the switch to Python 3 or due to the other changes. But perhaps it's OK if we have one Bio.* version for Python 2 and one bio.* version for Python 3 that are otherwise completely identical to each other.

> How about over the next year we gradually consolidate
> modules under the existing mixed case names? e.g.
> move Bio.AlignIO functionality and Bio.Align, 

I guess you meant "merge Bio.AlignIO functionality into Bio.Align".

> and Bio.Seq* under Bio.Seq (leaving backwards compatible
> imports supported but deprecated).

Sounds good to me. AFAIAC, we don't need to do this gradually over the next year. May as well do it for the next release.

-Michiel.


From anaryin at gmail.com  Thu Sep  6 13:48:51 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Sep 2012 16:48:51 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
Message-ID: <CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>

Ok, thanks.

The modules are littered with set/get methods and adding DeprecationWarning
to all of them might be a bit too much.. Instead, should we add one single
warning at the top of the PDBParser, since this is the only obligatory
module for Bio.PDB so that everyone gets the warning message once and once
only? Otherwise I can imagine several warnings popping up everywhere..

Cheers,

João


From eric.talevich at gmail.com  Thu Sep  6 14:17:03 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Sep 2012 10:17:03 -0400
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
Message-ID: <CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>

On Thu, Sep 6, 2012 at 1:52 AM, João Rodrigues <anaryin at gmail.com> wrote:

>
> What Bio.PDB does right now is rely on the list to iterate over things.
> Thus, you get the order in which you read the PDB file. However, if you
> sort it using the several Objects sort method you will get the following
> rules:
>
> Atom.py - N CA C O first, then alphabetically
> Residue.py - First amino acids and nucleic acids, then heteroatoms.
> Chain.py - Empty chains last.
>
> These are already in place somewhere in the code. I just used them to
> overload the __cmp__ method, with a couple of additions because I
> personally disagree with the following:
>
> Atom.py - Inorganic atoms should come out last. For simplicity.
> Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get
> in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
> PDB files already have weird large numbers for water and ions for example,
> so these come out last anyway. Pushing all HETATMs to the end will
> sometimes disrupt the "natural" order of things, for instance modified
> residues. Magic perhaps :)
>
>
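
For reference, the Atom rule quoted above (N, CA, C, O first, then alphabetically) could be written as a sort key along these lines. This is a sketch on bare atom names with a made-up helper name, using a key function rather than __cmp__:

```python
# Backbone atoms get fixed ranks; everything else shares a later rank.
BACKBONE_RANK = {"N": 0, "CA": 1, "C": 2, "O": 3}


def atom_sort_key(atom_name):
    """Sort key: backbone atoms in N, CA, C, O order, the rest by name."""
    return (BACKBONE_RANK.get(atom_name, len(BACKBONE_RANK)), atom_name)


names = ["CB", "O", "CA", "OG", "N", "C"]
print(sorted(names, key=atom_sort_key))
```
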
Here's another edge case to think about:
3BEG<http://www.rcsb.org/pdb/explore/explore.do?structureId=3BEG>.
The enzyme is chain A, starting from residue number 69; the substrate
peptide is chain B; and then after listing the atoms for chain B they jump
back to chain A and add the three ligands as individual residues, with
residue numbers 1, 2 and 3, on HETATM lines.

The current PDBParser complains about this structure but parses it so that
the extra HETATM residues are at the end of chain A's child_list. If I were
to try to generate a polypeptide sequence from each of the chains in this
structure, I think I'd want to just ignore the three extra residues, rather
than list them as the first three residues of the peptide as "SAX".

How do you think this should be handled? Maybe treat in-sequence modified
residues differently from out-of-sequence HETATMs?
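
To show the trade-off on toy data (the residue records below are invented for illustration and are not taken from 3BEG), here are the two orderings under discussion as sort keys:

```python
from collections import namedtuple

# Toy residue record; hetero is True for residues read from HETATM lines.
Residue = namedtuple("Residue", ["number", "name", "hetero"])

# Made-up chain in file order: protein residues with one in-sequence
# modified residue (MSE), followed by a ligand numbered 1.
chain = [
    Residue(69, "GLY", False),
    Residue(70, "MSE", True),
    Residue(71, "VAL", False),
    Residue(1, "LIG", True),
]


def hetero_last_key(residue):
    """Existing-style rule: all hetero residues after the standard ones."""
    return (residue.hetero, residue.number)


def numeric_key(residue):
    """Alternative rule: residue number only, ignoring the hetero flag."""
    return residue.number


def numbers(residues):
    return [r.number for r in residues]


print(numbers(sorted(chain, key=hetero_last_key)))
print(numbers(sorted(chain, key=numeric_key)))
```

Neither key handles both cases: the first moves the modified residue 70 out of sequence, and the second puts the ligand ahead of the protein.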

-E


From eric.talevich at gmail.com  Thu Sep  6 14:40:13 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 6 Sep 2012 10:40:13 -0400
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID: <CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>

On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > For any shared server, [...] you
> > would be forced into a mass change at one go.
>
> OK, for multiple users on a shared server I see your point.


True, and old scripts/pipelines have a way of sticking around, especially
once they've been shared with others in the lab.


> Here's a further (and slightly more radical) idea: We
> > stick with using 'Bio' and the current mixed case
> > names on Python 2, but adopt 'bio' and other PEP8
> > compatible names for Python 3 (as a uniform
> > strict automatic rule: mixed case -> lower case)?
> > i.e. Do this as part of our 2to3 process.
>
> The Python developers argue against combining a switch to Python 3 with
> other major changes, since then if bugs arise it is unclear if it is due to
> the switch to Python 3 or due to the other changes. But perhaps it's OK if
> we have one Bio.* version for Python 2 and one bio.* version for Python 3
> that are otherwise completely identical to each other.
>

Agreed, since the bio.* version is generated by the 2to3 script it should
still be easy enough to distinguish "this is a bug in the library" from
"this is a problem with Py3, 2to3 or your environment". The extra
separation on the filesystem provided by Py2/Py3 should also prevent some
problems with case-insensitivity and the environment.


> > How about over the next year we gradually consolidate
> > modules under the existing mixed case names? e.g.
> > move Bio.AlignIO functionality and Bio.Align,
>
> I guess you meant "merge Bio.AlignIO functionality into Bio.Align".
>
> > and Bio.Seq* under Bio.Seq (leaving backwards compatible
> > imports supported but deprecated).
>
> Sounds good to me. AFAIAC, we don't need to do this gradually over the
> next year. May as well do it for the next release.
>
>
Doing this in a single release might be better, so we can document/remember
the release number when the Grand Reshuffling took place and troubleshoot
users' resulting problems more easily.

Should we call that Biopython 2.0.0 and switch to semantic version numbers?


From anaryin at gmail.com  Thu Sep  6 14:51:11 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 6 Sep 2012 17:51:11 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAMC681mE3KwUuXgPsQWD1duiWJe-jvoY9NSTSLBE6BYZ6zEdpg@mail.gmail.com>
Message-ID: <CAJ9sUYP5ZK0KSpT4QKr2_HR6ojMtsAorJXQsErQLLkAReKQB1w@mail.gmail.com>

Well... :) If this is what the authors put in.. well, that's just it. The
parser should not be an interpreter.

However, when building peptides, you should get two peptides: the ALA-SEP,
and the protein chain A. And I think this is what you will get. Also, the
fact that they are heteroatoms is already a good filter if you want them
out of the equation.


From p.j.a.cock at googlemail.com  Fri Sep  7 01:01:04 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Sep 2012 02:01:04 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
Message-ID: <CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>

On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> > Here's a further (and slightly more radical) idea: We
>> > stick with using 'Bio' and the current mixed case
>> > names on Python 2, but adopt 'bio' and other PEP8
>> > compatible names for Python 3 (as a uniform
>> > strict automatic rule: mixed case -> lower case)?
>> > i.e. Do this as part of our 2to3 process.
>>
>> The Python developers argue against combining a switch to Python 3 with
>> other major changes, since then if bugs arise it is unclear if it is due to
>> the switch to Python 3 or due to the other changes. But perhaps it's OK if
>> we have one Bio.* version for Python 2 and one bio.* version for Python 3
>> that are otherwise completely identical to each other.
>
>
> Agreed, since the bio.* version is generated by the 2to3 script it should
> still be easy enough to distinguish "this is a bug in the library" from
> "this is a problem with Py3, 2to3 or your environment". The extra separation
> on the filesystem provided by Py2/Py3 should also prevent some problems with
> case-insensitivity and the environment.

Yes - they would be in different site-packages folders, and since
we have a tiny Python 3 install base, moving them from Bio to
bio seems low impact.

I guess we need to have a little hack with the 2to3 library and
try defining our own custom fixer for the imports...
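
Just to pin down the naming rule itself, here is a plain regex sketch. It is not a 2to3 fixer and only illustrates "mixed case -> lower case" for dotted Bio.* paths on import lines; the function name is made up.

```python
import re

# Matches a dotted module path rooted at the top-level 'Bio' package.
_BIO_MODULE = re.compile(r"\bBio(?:\.\w+)*")


def lowercase_bio_imports(line):
    """Lower-case Bio.* module paths on import lines; leave other lines alone."""
    stripped = line.lstrip()
    if not stripped.startswith(("import ", "from ")):
        return line
    return _BIO_MODULE.sub(lambda match: match.group(0).lower(), line)


print(lowercase_bio_imports("import Bio.SeqIO"))
print(lowercase_bio_imports("from Bio.Seq import Seq"))
```

A real fixer would also have to handle names after the import keyword (for example "from Bio import SeqIO") and attribute access in the code body, which this sketch deliberately ignores.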

Note this case difference will slightly complicate our documentation -
but that is always going to be an issue for the Python 2 to 3 move.

>>
>> > How about over the next year we gradually consolidate
>> > modules under the existing mixed case names? e.g.
>> > move Bio.AlignIO functionality and Bio.Align,
>>
>> I guess you meant "merge Bio.AlignIO functionality into Bio.Align".

Yes, sorry.

>> > and Bio.Seq* under Bio.Seq (leaving backwards compatible
>> > imports supported but deprecated).
>>
>> Sounds good to me. AFAIAC, we don't need to do this gradually
>> over the next year. May as well do it for the next release.
>
> Doing this in a single release might be better, so we can document/remember
> the release number when the Grand Reshuffling took place and troubleshoot
> users' resulting problems more easily.

Doing it in one release makes sense - but we can do it gradually in
a series of self contained commits - and feel our way.

Michiel - do you want to start with the Bio/Seq.py to Bio/Seq/__init__.py
change? We'll need to do that before any consolidation steps.

> Should we call that Biopython 2.0.0 and switch to semantic version numbers?
>

Maybe... at some point a Biopython 2 would be a good excuse for
some publicity and another application note.

The eventual move from developing under Python 2 (and using 2to3
for Python 3) to natively developing under Python 3 would be an
excuse for a major version bump.

Peter


From p.j.a.cock at googlemail.com  Fri Sep  7 01:03:22 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 7 Sep 2012 02:03:22 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
	<CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
Message-ID: <CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>

On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues <anaryin at gmail.com> wrote:
> Ok, thanks.
>
> The modules are littered with set/get methods and adding DeprecationWarning
> to all of them might be a bit too much.. Instead, should we add one single
> warning at the top of the PDBParser, since this is the only obligatory
> module for Bio.PDB so that everyone gets the warning message once and once
> only? Otherwise I can imagine several warnings popping up everywhere..

If you use the exact same message, then I think you'll only see the
warning once. Try it with a couple of the get/set methods to confirm.
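
One way to check this: the standard warnings module deduplicates per location (module and line number) under its "default" action, and per message text and category under "once". The sketch below uses a stand-in warning category and two toy getters (all hypothetical names) that share one message:

```python
import warnings


class LibraryDeprecationWarning(Warning):
    """Stand-in category for a library-specific deprecation warning."""


MESSAGE = "get/set methods are deprecated; use the properties instead"


def get_name():
    warnings.warn(MESSAGE, LibraryDeprecationWarning, stacklevel=2)
    return "CA"


def get_serial():
    warnings.warn(MESSAGE, LibraryDeprecationWarning, stacklevel=2)
    return 1


def count_warnings(action):
    """Call both getters repeatedly and count the warnings that get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        get_name()
        get_serial()
        get_name()
    return len(caught)
```

In a fresh interpreter, count_warnings("always") should report every call and count_warnings("once") only the first. Note that the "once" bookkeeping lasts for the life of the process, so repeating the "once" run in the same session will not show the warning again.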

Having the warning happen even if you don't use the get/set seems
wrong.

Peter


From anaryin at gmail.com  Fri Sep  7 07:21:56 2012
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Fri, 7 Sep 2012 10:21:56 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To: <CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>
References: <CAJ9sUYPsMu_9OrW7EQwSM5mrRNrc=x2Tok4hX7xLBMMGM4HmUQ@mail.gmail.com>
	<CAKVJ-_4AtU=W3sET1frLqbog5mCQkfKxF5CPHoE2L2nPt=eVbw@mail.gmail.com>
	<CAJ9sUYP8mQddEW+vxZpYyUCdxJ3TnYpqpTBdzwpeEf+r=o9vGg@mail.gmail.com>
	<CADEGkF6FQCsExs3E0rL3gjkjU3=r+phxxDysCVehdkAZMq94Sg@mail.gmail.com>
	<CAKVJ-_5G2gVf8rmW-gZ0U-TYcZkwg4Mvfi2P0qWTfrxfPEz=+g@mail.gmail.com>
	<CADEGkF6GDMDsXUSOK+PM4_Opgnqky6pPoNutiQ31w6xh4c_n8A@mail.gmail.com>
	<CAJ9sUYMemvHQG4oUgHkx_VGPTN4mUz55Ou-3wEG=mURRU39OmQ@mail.gmail.com>
	<CAKVJ-_7x-v8ZLuo6Wi-bZCh9NO1zbzyMQgL8ixMcgTm6+=OUcg@mail.gmail.com>
	<CAJ9sUYP-wyXwsgrRVM2fN1P5vf3MEr5dY1hWJzccbtHq8RU7kQ@mail.gmail.com>
	<CAKVJ-_4HoaLh09G9pCrzrnGBWYOrJfLc-Dq2SoH7P4=pGu-hmw@mail.gmail.com>
	<CAJ9sUYPzevnQ5k4ABhSHpNUjAM3VgHNv9QPOx9dgiVRLLMG0Vg@mail.gmail.com>
	<CAKVJ-_6CyztvUAKz=zP3n3hJxrH+8czq_Bfw1_hsBkyCZxqvXA@mail.gmail.com>
	<CAJ9sUYNVUGe9N3Pyjao+8NDAR+UwYzPS8txP+zO4PgyMxwfRcw@mail.gmail.com>
	<CAKVJ-_6fFK1jXqiAmt3rDa7d6Ff1sq8SVBq=MxvDJMDzw7Lt7g@mail.gmail.com>
	<CAJ9sUYNEs582xgo4AmbRtvEUTMn1OP+4Wt_VEBVW+_LUMkHmSg@mail.gmail.com>
	<CAKVJ-_5HG-b-BiTrdfhUvWqRLKgvSqg7sWK9d+pJRkapBJSTVw@mail.gmail.com>
Message-ID: <CAJ9sUYPt_u5XZJfJHYF7ALKRR8eOjJ4YD8-1PJ7Wis-QAg50hQ@mail.gmail.com>

Likely true.

I'm writing a txt file with the changes. I don't think they can be merged
easily without breaking a lot of stuff, in particular the removal of
child_list. Therefore, I suggest we add deprecation warnings wherever the
code is affected by the changes we agree on, and allow a few releases
before we actually merge them.

Also, once I'm happy with the changes, I'll make a new branch to allow
'beta testing' by anyone who wants to, and write a wiki page about it.

Cheers,

João

On 7 Sep 2012 04:03, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues <anaryin at gmail.com> wrote:
> > Ok, thanks.
> >
> > The modules are littered with set/get methods and adding DeprecationWarning
> > to all of them might be a bit too much.. Instead, should we add one single
> > warning at the top of the PDBParser, since this is the only obligatory
> > module for Bio.PDB so that everyone gets the warning message once and once
> > only? Otherwise I can imagine several warnings popping up everywhere..
>
> If you use the exact same message, then I think you'll only see the
> warning once. Try it with a couple of the get/set methods to confirm.
>
> Having the warning happen even if you don't use the get/set seems
> wrong.
>
> Peter
>


From mjldehoon at yahoo.com  Sun Sep  9 07:31:05 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 9 Sep 2012 00:31:05 -0700 (PDT)
Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
Message-ID: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Returning to a previous discussion...

[Michiel:]
> ..., currently Bio.Motif._Motif.Motif objects also perform
> functions that are more appropriate for a separate PWM
> (position-weight matrix) class within Bio.Motif. It may be
> a good idea to have a separate PWM class for this functionality.

[Bartek:]
> I'm not sure. I think it is valuable to be able to load
> instances from a file and then convert them to a PWM.
> It could be done with separate classes,
> but I'm not sure it would be easier then...

I think there is one confusing issue here.
The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method).
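
A minimal sketch of the two steps for a single motif position, in plain Python. The function names, the base-2 logarithm and the uniform background are assumptions for illustration, and no pseudocounts are applied, so a zero count would fail at the logarithm:

```python
import math


def normalize(counts):
    """Turn one position's letter counts into probabilities."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def log_odds(probabilities, background):
    """Log2 ratio of observed to background probability for each letter."""
    return {
        letter: math.log2(p / background[letter])
        for letter, p in probabilities.items()
    }


counts = {"A": 1, "C": 4, "G": 3, "T": 2}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
probabilities = normalize(counts)  # what .pwm() currently amounts to
weights = log_odds(probabilities, background)  # the actual PWM entries
```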

So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix; motif.anticonsensus, motif.consensus and motif[:] use the probability matrix; and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).
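
A small made-up example of that last point: with a non-uniform background, the most probable letter at a position need not be the letter with the highest log-odds score. The numbers below are invented, and base-2 logarithms are an assumption.

```python
import math

# One motif position and a skewed background distribution.
probabilities = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
background = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

log_odds = {
    letter: math.log2(p / background[letter])
    for letter, p in probabilities.items()
}

# Consensus picks the most probable letter; the maximum score comes
# from the letter with the largest log-odds value.
consensus_letter = max(probabilities, key=probabilities.get)
best_scoring_letter = max(log_odds, key=log_odds.get)
print(consensus_letter, best_scoring_letter)
```

Here the consensus letter is as common in the background as in the motif, so it scores lower than a rarer letter that is enriched in the motif.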

So I would suggest to keep the various types of matrices explicit; something along these lines:

>>> motif = Motif.read(...)
>>> counts = motif.counts
# .counts is a property of motif
# counts is an instance of the Motif.FrequencyMatrix class
# you can also make a FrequencyMatrix object directly from
# the frequencies, as in
>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>> counts[2,:]
array([1.0, 4.0, 3.0, 2.0])
# indices refer explicitly to the counts matrix
>>> counts[2,'G']
3.0

>>> my_consensus_sequence = counts.consensus
# .consensus is a property of counts
>>> my_anticonsensus_sequence = counts.anticonsensus
# .anticonsensus is a property of counts

>>> my_probability_matrix = counts.normalize()
# this can be a numpy array, or a Motif.ProbabilityMatrix
# class that inherits from a numpy array
>>> my_probability_matrix[2,:]
array([0.1, 0.4, 0.3, 0.2])
# indices refer explicitly to the probability matrix

>>> pwm = counts.make_pwm(...)
# or pwm = Motif.PositionWeightMatrix(my_matrix)
>>> pwm[0,:]
array([ -2.3,  0.1,  1.2,  1.8])
>>> pwm[0,2]
1.2
>>> pwm[0,'C']
0.1
# indices explicitly refer to the pwm

>>> scores = pwm.scan(sequence)
>>> score = pwm.score(sequence)
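
To give a feel for how small such a class could be, here is a toy sketch in plain Python. The class name, the list-based storage and the method bodies are hypothetical; only the indexing style and the scan/score names follow the outline above.

```python
class LetterIndexedMatrix:
    """Toy matrix with one row per motif position and one column per letter."""

    def __init__(self, rows, alphabet="ACGT"):
        self.alphabet = alphabet
        self.rows = [list(row) for row in rows]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        position, column = key
        row = self.rows[position]
        if isinstance(column, str):
            # Translate a letter such as 'C' into its column index.
            column = self.alphabet.index(column)
        return row[column]  # works for an integer or a slice

    def score(self, sequence):
        """Sum the entries selected by a sequence as long as the motif."""
        if len(sequence) != len(self):
            raise ValueError("sequence length must match the motif length")
        return sum(self[i, letter] for i, letter in enumerate(sequence))

    def scan(self, sequence):
        """Score every window of motif length along a longer sequence."""
        width = len(self)
        return [
            self.score(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)
        ]


pwm = LetterIndexedMatrix([[-2.3, 0.1, 1.2, 1.8], [0.5, -1.0, 0.2, 0.0]])
print(pwm[0, :], pwm[0, 2], pwm[0, "C"])
```

Keeping a separate class per matrix type means each object only offers the operations that make sense for it, which is the point of the proposal.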


Does that sound reasonable? Any comments, suggestions?

Best,
-Michiel.


From bartek at rezolwenta.eu.org  Mon Sep 10 07:12:59 2012
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 10 Sep 2012 09:12:59 +0200
Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
In-Reply-To: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID: <CABHxouV7Mc9VeNHX2mw7kSEB3V3=6uvNXt3HCG27zXhBOjfqJQ@mail.gmail.com>

Hi,

I think it is an idea worth discussing a little bit more. Thanks for
bringing it up Michiel.

It captures at least some of the issues caused by the fact that
different motifs might be internally represented differently.

I'm not sure I'm all excited about having to deal with explicit extra
classes for PWMs and aligned instances, but maybe this is the price
for having a clear separation of where certain things are calculated.

The issue I think still needs discussion is where the searching is
done. If I want to search for instances, do I do it from the PWM
object? This seems to be the natural idea, but then can we find a
nice interface for people who don't want to be bothered with overly
complicated interfaces?

I'll try to come up with a more thought through and longer response
later in the week...
best
Bartek

On Sun, Sep 9, 2012 at 9:31 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Returning to a previous discussion...
>
> [Michiel:]
>> ..., currently Bio.Motif._Motif.Motif objects also perform
>> functions that are more appropriate for a separate PWM
>> (position-weight matrix) class within Bio.Motif. It may be
>> a good idea to have a separate PWM class for this functionality.
>
> [Bartek:]
>> I'm not sure. I think it is valuable to be able to load
>> instances from a file and then convert them to a PWM.
>> It could be done with separate classes,
>> but I'm not sure it would be easier then...
>
> I think there is one confusing issue here.
> The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method).
>
> So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments).
> Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix,
> motif.anticonsensus, motif.consensus, motif[:] uses the probability matrix, and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus, motif.anticonsensus which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).
>
> So I would suggest to keep the various types of matrices explicit; something along these lines:
>
>>>> motif = Motif.read(...)
>>>> counts = motif.counts
> # .counts is a property of motif
> # counts is an instance of the Motif.FrequencyMatrix class
> # you can also make a FrequencyMatrix object directly from
> # the frequencies, as in
>>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>>> counts[2,:]
> array([1.0, 4.0, 3.0, 2.0])
> # indices refer explicitly to the counts matrix
>>>> counts[2,'G']
> 3.0
>
>>>> my_consensus_sequence = counts.consensus
> # .consensus is a property of counts
>>>> my_anticonsensus_sequence = counts.anticonsensus
> # .anticonsensus is a property of counts
>
>>>> my_probability_matrix = counts.normalize()
> # this can be a numpy array, or a Motif.ProbabilityMatrix
> # class that inherits from a numpy array
>>>> my_probability_matrix[2,:]
> array([0.1, 0.4, 0.3, 0.2])
> # indices refer explicitly to the probability matrix
>
>>>> pwm = counts.make_pwm(...)
> # or pwm = motif.PositionWeightMatrix(my_matrix)
>>>> pwm[0,:]
> array([ -2.3,  0.1,  1.2,  1.8])
>>>> pwm[0,2]
> 1.2
>>>> pwm[0,'C']
> 0.1
> # indices explicitly refer to the pwm
>
>>>> scores = pwm.scan(sequence)
>>>> score = pwm.score(sequence)
>
>
> Does that sound reasonable? Any comments, suggestions?
>
> Best,
> -Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


-- 
Bartek Wilczynski


From p.j.a.cock at googlemail.com  Mon Sep 10 08:39:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Sep 2012 09:39:30 +0100
Subject: [Biopython-dev] Most buildbot slaves down
Message-ID: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>

Hi all,

For those of you actively monitoring the nightly BuildBot
for Biopython and/or BioRuby, all the buildslaves at my
institute are currently effectively offline. A new stricter
firewall policy was introduced last week while I was away.
I hope we'll have the necessary outgoing ports opened
again soon.

In the meantime, additional buildslaves hosted elsewhere
would be very useful. The machines need to be online
and are typically only used once every 24 hours for the
scheduled builds. Non-Linux machines are particularly
important for cross-platform testing (while for Linux
the TravisCI testing seems to be working nicely overall).

Any volunteers?

Thanks,

Peter


From tiagoantao at gmail.com  Mon Sep 10 08:50:41 2012
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 10 Sep 2012 09:50:41 +0100
Subject: [Biopython-dev] [BioRuby] Most buildbot slaves down
In-Reply-To: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>
References: <CAKVJ-_6mJn+0y9OOGnqt6H4N5yhFSCsw-4fE6OeVjaHRAWLoyA@mail.gmail.com>
Message-ID: <CAA9RGEOQYVgwf8NxS52TH84+WPdFcNFawJPTcZaLHM1XiZ+E3A@mail.gmail.com>

Hi,

Not much help on the non-Linux front, but I noticed that my machine
was down for some reason; I restarted it and it is now doing at least
a few of the builds.

Tiago

On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi all,
>
> For those of you actively monitoring the nightly BuildBot
> for Biopython and/or BioRuby, all the buildslaves at my
> institute are currently effectively offline. A new stricter
> firewall policy was introduced last week while I was away.
> I hope we'll have the necessary outgoing ports opened
> again soon.
>
> In the meantime, additional buildslaves hosted elsewhere
> would be very useful. The machines need to be online
> and are typically only used once every 24 hours for the
> scheduled builds. Non-Linux machines are particularly
> important for cross-platform testing (while for Linux
> the TravisCI testing seems to be working nicely overall).
>
> Any volunteers?
>
> Thanks,
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


-- 
"Liberty for wolves is death to the lambs" - Isaiah Berlin


From redmine at redmine.open-bio.org  Fri Sep 14 02:23:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 14 Sep 2012 02:23:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] (New) Installation fails
	with pip-3.2
Message-ID: <redmine.issue-3384.20120914022353@redmine.open-bio.org>


Issue #3384 has been reported by Roy Crihfield.

----------------------------------------
Bug #3384: Installation fails with pip-3.2
https://redmine.open-bio.org/issues/3384

Author: Roy Crihfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


Linux 3.5.3-1-ARCH x86_64 GNU/Linux
Python 3.2.3
Bio.__version__ == '1.60'

Installation fails with pip 1.2:

$ sudo pip-3.2 install biopython

:
:

Converting build/py3.2/Doc/examples/fasta_dictionary.py

Converting build/py3.2/Doc/examples/nmr/simplepredict.py

Python 2to3 processing done.

running egg_info

error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory

----------------------------------------

Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython

Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main
    status = self.run(options, args)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython


----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin


-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org


From redmine at redmine.open-bio.org  Fri Sep 14 08:46:08 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 14 Sep 2012 08:46:08 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with
	pip-3.2
References: <redmine.issue-3384.20120914022353@redmine.open-bio.org>
Message-ID: <redmine.journal-14960.20120914084608@redmine.open-bio.org>


Issue #3384 has been updated by Peter Cock.


Does the standard install mechanism work on your machine? i.e.

python3.2 setup.py build
python3.2 setup.py test
sudo python3.2 setup.py install

If you want to investigate the pip error, there is a possible workaround developed by NumPy (who also use 2to3 in a similar way to us), see http://projects.scipy.org/numpy/ticket/1857

Thanks
----------------------------------------
Bug #3384: Installation fails with pip-3.2
https://redmine.open-bio.org/issues/3384

Author: Roy Crihfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


Linux 3.5.3-1-ARCH x86_64 GNU/Linux
Python 3.2.3
Bio.__version__ == '1.60'

Installation fails with pip 1.2:

$ sudo pip-3.2 install biopython

:
:

Converting build/py3.2/Doc/examples/fasta_dictionary.py

Converting build/py3.2/Doc/examples/nmr/simplepredict.py

Python 2to3 processing done.

running egg_info

error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory

----------------------------------------

Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython

Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main
    status = self.run(options, args)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython




From redmine at redmine.open-bio.org  Sat Sep 15 01:57:53 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 15 Sep 2012 01:57:53 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with
	pip-3.2
References: <redmine.issue-3384.20120914022353@redmine.open-bio.org>
Message-ID: <redmine.journal-14961.20120915015753@redmine.open-bio.org>


Issue #3384 has been updated by Roy Crihfield.


Yes, installing manually works. I found that hack but was hoping there would be a better solution, or support for pip planned for the future. 
----------------------------------------
Bug #3384: Installation fails with pip-3.2
https://redmine.open-bio.org/issues/3384

Author: Roy Crihfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 
URL: 


Linux 3.5.3-1-ARCH x86_64 GNU/Linux
Python 3.2.3
Bio.__version__ == '1.60'

Installation fails with pip 1.2:

$ sudo pip-3.2 install biopython

:
:

Converting build/py3.2/Doc/examples/fasta_dictionary.py

Converting build/py3.2/Doc/examples/nmr/simplepredict.py

Python 2to3 processing done.

running egg_info

error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory

----------------------------------------

Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython

Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main
    status = self.run(options, args)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython




From redmine at redmine.open-bio.org  Sat Sep 15 21:29:29 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 15 Sep 2012 21:29:29 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] Example using Bio.Clustalw
	in Tutorial
References: <redmine.issue-3340.20120410202908@redmine.open-bio.org>
Message-ID: <redmine.journal-14964.20120915212929@redmine.open-bio.org>


Issue #3340 has been updated by Grace Yeo.


I've submitted a pull request for this here:  
https://github.com/biopython/biopython/pull/71
----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.




From p.j.a.cock at googlemail.com  Sun Sep 16 12:34:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 16 Sep 2012 13:34:31 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
Message-ID: <CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>

On Fri, Sep 7, 2012 at 2:01 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>> --- On Thu, 9/6/12, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> > Here's a further (and slightly more radical) idea: We
>>> > stick with using 'Bio' and the current mixed case
>>> > names on Python 2, but adopt 'bio' and other PEP8
>>> > compatible names for Python 3 (as a uniform
>>> > strict automatic rule: mixed case -> lower case)?
>>> > i.e. Do this as part of our 2to3 process.
>>>
>>> The Python developers argue against combining a switch to Python 3 with
>>> other major changes, since then if bugs arise it is unclear if it is due to
>>> the switch to Python 3 or due to the other changes. But perhaps it's OK if
>>> we have one Bio.* version for Python 2 and one bio.* version for Python 3
>>> that are otherwise completely identical to each other.
>>
>>
>> Agreed, since the bio.* version is generated by the 2to3 script it should
>> still be easy enough to distinguish "this is a bug in the library" from
>> "this is a problem with Py3, 2to3 or your environment". The extra separation
>> on the filesystem provided by Py2/Py3 should also prevent some problems with
>> case-insensitivity and the environment.
>
> Yes - they would be in different site-packages folders, and since
> we have a tiny Python 3 install base, moving them from Bio to
> bio seems low impact.
>
> I guess we need to have a little hack with the 2to3 library and
> try defining our own custom fixer for the imports...
>
> Note this case difference will slightly complicate our documentation -
> but that is always going to be an issue for the Python 2 to 3 move.
>

I've made a start at this - the easy part seems to work :)

https://github.com/peterjc/biopython/commits/py3lower

The hard bit will be fixing all the import lines... ;)

Peter


From k.d.murray.91 at gmail.com  Thu Sep 20 04:28:08 2012
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Thu, 20 Sep 2012 14:28:08 +1000
Subject: [Biopython-dev] TAIR/AGI support
In-Reply-To: <87txvcx9ls.fsf@fastmail.fm>
References: <CAH80STXOOUjqYcQ82C2C25-gACyzwx0D4-VD+CMTes90CdZbnw@mail.gmail.com>
	<87txvcx9ls.fsf@fastmail.fm>
Message-ID: <CAH80STVrvSnxp4JkgrZoywMQqiMg8t=nJtTcGnNggCe4k-Y4aQ@mail.gmail.com>

Hi Brad,

My TAIR/AGI script is on github here:
https://github.com/kdmurray91/biopython/blob/master/Bio/TAIR/__init__.py

I got it to work directly from TAIR's website; however, it has not been
rigorously tested. I plan on implementing the process as I described in my
previous email, whereby it fetches the GenBank record from TogoWS or via
NCBI's EFetch (using Biopython's interfaces of course). I will keep you all
posted.

To the list in general: I'm open to suggestions on what to work on next.


Regards
Kevin Murray


On 6 September 2012 10:45, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> Kevin;
> Thanks for the e-mail and offers of code. Always happy to have other
> folks involved with the project.
>
> > What's the status of TAIR AGIs in BioPython (I can see no mention of
> them,
> > or support for them)? I've written a brief module which allows a user to
> > query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> > any interest in including such functionality in BioPython?
>
> Is the code available on GitHub to get a better sense of all the
> functionality it supports? Do you have an idea where it would fit best?
> As a tair submodule inside of Bio.Entrez, or somewhere else?
>
> > More generally, are there any particular areas of BioPython development
> > which could use an extra pair of hands?
>
> Following the mailing list for discussions on current projects is the
> best way to get a sense of what different folks are working on. The
> issue tracker also has open issues and features that could use attention
> if anything there strikes your fancy:
>
> https://redmine.open-bio.org/projects/biopython
>
> Hope this helps,
> Brad
>
>


From p.j.a.cock at googlemail.com  Thu Sep 20 09:08:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 20 Sep 2012 10:08:58 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
	<CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
Message-ID: <CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>

On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> I guess we need to have a little hack with the 2to3 library and
>> try defining our own custom fixer for the imports...
>>
>> Note this case difference will slightly complicate our documentation -
>> but that is always going to be an issue for the Python 2 to 3 move.
>>
>
> I've made a start at this - the easy part seems to work :)
>
> https://github.com/peterjc/biopython/commits/py3lower
>
> The hard bit will be fixing all the import lines... ;)
>
> Peter

Progress - but slow. I think this will work with a bit more
time spent on it.

With hindsight I'd have made more effort to try and reuse
lib2to3, but the documentation is sketchy and they do warn
it is liable to change between releases.

What I've got instead is a pattern-matching script which goes
line by line, spots imports and updates them, and also notes
what knock-on changes must be made later in the file. It is
also aware of doctest examples and updates them, e.g.

from Bio import SeqIO
record = SeqIO.read("my_chr.gbk", "genbank")

becomes:

from bio import seqio
record = seqio.read("my_chr.gbk", "genbank")
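
As a rough illustration of the line-by-line approach described above,
here is a minimal sketch. It is not the code in do2to3.py: the names
IMPORT_RE and lower_imports are made up for this example, it assumes a
strict lower-casing rule (SeqIO becoming seqio), and it only handles
the simple "from Bio... import Name" form, including doctest lines.

```python
import re

# Hypothetical pattern: an optional doctest prompt, then a simple
# "from Bio[.Sub] import Name" line with a single imported name.
IMPORT_RE = re.compile(
    r"^(\s*(?:>>> |\.\.\. )?)from Bio(\.[\w.]+)? import (\w+)\s*$"
)


def lower_imports(source):
    """Lower-case simple Bio imports and later uses of the imported names."""
    renames = {}  # imported name -> lower-case replacement
    converted = []
    for line in source.splitlines():
        match = IMPORT_RE.match(line)
        if match:
            prefix, subpackage, name = match.groups()
            # Remember the knock-on change needed later in the file.
            renames[name] = name.lower()
            line = "%sfrom bio%s import %s" % (
                prefix, (subpackage or "").lower(), name.lower())
        else:
            # Apply the knock-on renames to attribute access like SeqIO.read
            for old, new in renames.items():
                line = re.sub(r"\b%s\." % re.escape(old), new + ".", line)
        converted.append(line)
    return "\n".join(converted)
```

A real converter would also need to cope with "import Bio.X", multiple
names per import line, aliases, and names that clash with local
variables, which this sketch ignores.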

In the process I've spotted some minor style issues and
some quote mistakes in the code base which I have
fixed on the main branch as well, e.g.
https://github.com/biopython/biopython/commit/b396844401da8b5c5ed1f7f13d69622a6ad0c0cd
https://github.com/biopython/biopython/commit/165e2b8da445250f070c3860c9082ff6a0c919e0

I also reformatted a few import lines to make
processing them easier - and arguably easier
to read too:
https://github.com/biopython/biopython/commit/f6940e8a4fcf056fa725225ede5e848c5d6f4fd6

One slightly more complicated issue with lower case module
names is we get clashes in some code with existing variable
or argument names. This seems particularly common with seq,
alphabet and motif.

Most of the fixes for this are on the experimental branch.
In some cases I've opted to change the import, e.g.

from Bio import Alphabet

to:

from Bio import Alphabet as _alphabet

This seemed simplest to avoid changing argument names in
functions/methods.
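
To show why the alias helps, here is a small self-contained sketch.
It uses the standard library json module purely as a stand-in for a
lower-cased Biopython module such as alphabet (so that it runs without
Biopython), and the function name describe is made up for this example.

```python
# Stand-in: the standard library module json plays the role of a
# lower-cased Biopython module such as alphabet.
import json as _json


def describe(json):
    """Serialise the argument, which happens to be called json.

    With a plain "import json" the argument would shadow the module
    inside this function; the private alias avoids that without
    renaming the argument.
    """
    return _json.dumps(json, sort_keys=True)
```

Calling describe({"b": 1, "a": 2}) still reaches the module through
the alias, even though the argument hides the name json locally.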

I'll continue to work on this as time allows - right now the code
is due for a refactoring (e.g. avoid code duplication where I
handle doctests), and would benefit from some self-tests.

But the message remains: This should work :)

Peter


From yhtgrace at gmail.com  Fri Sep 21 16:57:19 2012
From: yhtgrace at gmail.com (Hui Ting Grace Yeo)
Date: Fri, 21 Sep 2012 12:57:19 -0400
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
Message-ID: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>

Hey everyone,

I'm working on this bug here https://redmine.open-bio.org/issues/3340 and I've updated the example in the tutorial (on substitution matrices, 17.4.2) using Bio.AlignIO on github here https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace. I'm able to reproduce the dictionary replace_info, but when I go on to finish the example, I get the following log odds matrix:

D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
   D   E   H   K   R

which is different from the one given in the tutorial. I'm wondering if I've missed something. 
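
For readers unfamiliar with the term, a log-odds score compares how
often a residue pair is observed with how often it would be expected
by chance. Below is a minimal sketch of that general idea only. The
function name log_odds and the toy counts are made up for
illustration, and Bio.SubsMat may well differ in logarithm base,
scaling and rounding, so this is not expected to reproduce either
matrix.

```python
import math


def log_odds(pair_counts):
    """Map each unordered residue pair to log2(observed / expected).

    pair_counts maps (residue, residue) tuples to positive counts.
    A zero count would raise ValueError here, since log(0) is undefined.
    """
    total = float(sum(pair_counts.values()))
    # Background frequency of each residue, counting both pair members.
    residue_freq = {}
    for (a, b), count in pair_counts.items():
        for residue in (a, b):
            residue_freq[residue] = (
                residue_freq.get(residue, 0.0) + count / (2 * total))
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total
        expected = residue_freq[a] * residue_freq[b]
        if a != b:
            expected *= 2  # an unordered pair (a, b) can arise two ways
        scores[(a, b)] = math.log(observed / expected, 2)
    return scores
```

With toy counts such as {("D", "D"): 6, ("D", "E"): 2, ("E", "E"): 2},
the identical pairs score above zero and the mixed pair below zero.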

Thanks!
Grace Yeo


From p.j.a.cock at googlemail.com  Mon Sep 24 08:53:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 24 Sep 2012 09:53:07 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
Message-ID: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>

Hello all,

Last week Leighton was doing some work with Biopython
and GenomeDiagram using the cross-links functionality
we worked on for Biopython 1.59, which I described here:
http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/

As you may have noticed via Twitter or his blog, Leighton has
generated an enormous (5m by 1m) PDF poster printout
comparing 29 bacterial genomes:
http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html

As he describes on his blog post, this required generating
arbitrary color sets, with the option of adding some noise
(or jitter as he called it) to make neighbouring colours
visually distinct (rather than the more typical requirement
of a smooth value to color mapping).
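
To give a feel for the idea of jittered color sets, here is a small
sketch using only the standard library. It is not Leighton's
ColorSpiral code: the function name jittered_colors, the base
saturation and value of 0.8, and the uniform noise are all assumptions
made for this example.

```python
import colorsys
import random


def jittered_colors(count, jitter=0.0, seed=None):
    """Return count RGB tuples with evenly spaced hues and optional noise."""
    rng = random.Random(seed)
    colors = []
    for index in range(count):
        hue = index / float(count)
        # Perturb saturation and value, clamped to the range 0.0 to 1.0,
        # so that colors with similar hues still differ visibly.
        saturation = min(1.0, max(0.0, 0.8 + rng.uniform(-jitter, jitter)))
        value = min(1.0, max(0.0, 0.8 + rng.uniform(-jitter, jitter)))
        colors.append(colorsys.hsv_to_rgb(hue, saturation, value))
    return colors
```

For example, jittered_colors(29, jitter=0.1, seed=1) would give one
color per genome, and passing the same seed repeats the same set.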

His code to do that is now on this branch (with a minor
bug fix and a few more docstrings added), ready for
possible merging into Biopython:
https://github.com/peterjc/biopython/tree/colorspiral

Does this seem like a sensible addition to Bio.Graphics?

Does anyone have any thoughts on the namespace
Bio.Graphics.ColorSpiral given it defines an object
ColorSpiral? Might a Bio.Graphics.Colors be useful?

(If, as discussed on the other thread, we move to lower
case module names for Python 3, this namespace clash,
which is also present in many other Biopython modules,
goes away):
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html

Regards,

Peter


From p.j.a.cock at googlemail.com  Tue Sep 25 16:00:45 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Sep 2012 17:00:45 +0100
Subject: [Biopython-dev] [Biopython] Legacy blastn XML outfile parsing
 is slow. What XML parser is actually used?
In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk>
References: <CADEGkF7DX-t1bwRp66i3+6FBxpaU1KMCS6V86svGA7g2De=aoA@mail.gmail.com>
	<1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
	<CAKVJ-_521uq+Lf756munwjBt2H-P5+1jY1HwVvaK1DNqt5Rm2g@mail.gmail.com>
	<506194F6.9000103@fold.natur.cuni.cz>
	<CAKVJ-_45v6EwoT+JBir1joZrs80uV6+D_xBFV=5+83fR=kmE6w@mail.gmail.com>
	<CAKVJ-_5ynOC_zn-LCTUB4PNvpNi=PsQAexY+w58BhC8CJjgVgg@mail.gmail.com>
	<5061C20F.7040209@stats.ox.ac.uk>
Message-ID: <CAKVJ-_70vvk0wp2HnH12cD4qDf==Ph8LaYBAA-1kBr0N6LHJ9g@mail.gmail.com>

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik
<golubchi at stats.ox.ac.uk> wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not going
to change it (except possibly to make it a little faster if people
want to work on that).

We're planning to add a new module based on Bow's GSoC
code, under the working name SearchIO, which would cover
BLAST, BLAT, HMMER, etc. This would have a different API
and in the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think)
only about this new code (and would have been better off
on the development mailing list but this thread went off on
a slight tangent).

Once 'SearchIO' is released, we'd want to encourage
people to use that instead of NCBIXML, with a view to
deprecating and eventually removing NCBIXML. See:
http://biopython.org/wiki/Deprecation_policy

Regards,

Peter


From p.j.a.cock at googlemail.com  Thu Sep 27 13:01:44 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Sep 2012 14:01:44 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
In-Reply-To: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
References: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
Message-ID: <CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>

On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hello all,
>
> Last week Leighton was doing some work with Biopython
> and GenomeDiagram using the cross-links functionality
> we worked on for Biopython 1.59, which I described here:
> http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
>
> As you may have noticed via Twitter or his blog, Leighton has
> generated an enormous (5m by 1m) PDF poster printout
> comparing 29 bacterial genomes:
> http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html
>
> As he describes on his blog post, this required generating
> arbitrary color sets, with the option of adding some noise
> (or jitter as he called it) to make neighbouring colours
> visually distinct (rather than the more typical requirement
> of a smooth value to color mapping).
>
> His code to do that is now on this branch (with a minor
> bug fix and a few more docstrings added), ready for
> possible merging into Biopython:
> https://github.com/peterjc/biopython/tree/colorspiral
>
> Does this seem like a sensible addition to Bio.Graphics?
>
> Does anyone have any thoughts on the namespace
> Bio.Graphics.ColorSpiral given it defines an object
> ColorSpiral? Might a Bio.Graphics.Colors be useful?
>
> (If as discussed on the other thread we move to lower
> case module names for Python 3, this namespace
> clash also present in many other Biopython modules
> goes away):
> http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html
>
> Regards,
>
> Peter

I've committed it - we can still move/rename/etc until the
next release if anyone has suggestions for improvement.
https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7

Peter


From p.j.a.cock at googlemail.com  Thu Sep 27 13:55:21 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Sep 2012 14:55:21 +0100
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
Message-ID: <CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>

On Fri, Sep 21, 2012 at 5:57 PM, Hui Ting Grace Yeo <yhtgrace at gmail.com> wrote:
> Hey everyone,
>
> I'm working on this bug here https://redmine.open-bio.org/issues/3340
> and I've updated the example in the tutorial (on substitution matrices,
> 17.4.2) using Bio.AlignIO on github here
> https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace.
> I'm able to reproduce the dictionary replace_info, but when I go on to
> finish the example, I get the following log odds matrix:
>
> D   2
> E  -1   1
> H  -5  -4   3
> K -10  -5  -4   1
> R  -4  -8  -4  -2   2
>    D   E   H   K   R
>
> which is different from the one given in the tutorial. I'm wondering if I've
> missed something.

Hi Grace,

Using the current code and the example as it is, I also observe
the same result as you. According to github's "blame" feature
the current text dates back 4 years,

https://github.com/biopython/biopython/commit/bed3ab39d8a635f1e74be99e6730a48d2460f8b7

However, that was just a reformatting of an older example which
Brad wrote 11 years ago while converting the example from DNA
to protein:

https://github.com/biopython/biopython/commit/21df476c66b279824c51e6abd3f4ae549d003813

The example file itself, protein.aln, has not changed since it was committed:

https://github.com/biopython/biopython/commit/ccbe2d72014eafb064994bc3782ca5529d0b0448

See also Doc/examples/make_subsmat.py

So, since the example hasn't been changed in 11 years, this
suggests either Brad committed the wrong output (and no-one
noticed), or something changed in the calculation during that
time.

(Nowadays we try to use doctests for the examples in the
API and in the Tutorial where possible, so that code changes
which affect our examples are detected automatically.)
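
As a small illustration of how a doctest pins down expected output,
here is a self-contained sketch. The function count_pairs is made up
for this example and is not part of Biopython; the point is only that
the output shown in the docstring is checked whenever the tests run.

```python
import doctest


def count_pairs(first, second):
    """Count aligned residue pairs in two equal-length sequences.

    >>> counts = count_pairs("DEHK", "DEKK")
    >>> sorted(counts.items())
    [(('D', 'D'), 1), (('E', 'E'), 1), (('H', 'K'), 1), (('K', 'K'), 1)]
    """
    counts = {}
    for pair in zip(first, second):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


if __name__ == "__main__":
    # Reports a failure if the docstring output no longer matches.
    print(doctest.testmod())
```

If a later change altered what count_pairs returns, the docstring
example would fail rather than silently going stale.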

The most likely candidates would be something in the file
Bio/SubsMat/__init__.py

https://github.com/biopython/biopython/commits/master/Bio/SubsMat/__init__.py

A little detective work might be needed to explain this... sadly
trying to use Biopython from back then is complicated by the
reliance on the Martel/mxTextTools dependency.

Maybe Brad or Michiel has some insight?

--

In the meantime, I have applied your changes to the
example to use AlignIO,

https://github.com/biopython/biopython/commit/19f9317fe0e346f6c3f197d027076d9a1265def7
https://github.com/biopython/biopython/commit/5949f54dadb6d4ac8400e11d2afa33db549afba5

This will now get tested via test_Tutorial.py automatically
(except for the final line about printing the odds matrix):

https://github.com/biopython/biopython/commit/15dd6ba17eb092d0d7df674ac45617d99256d098

Thank you,

Peter


From redmine at redmine.open-bio.org  Thu Sep 27 13:57:38 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Thu, 27 Sep 2012 13:57:38 +0000
Subject: [Biopython-dev] [Biopython - Bug #3340] (Resolved) Example using
	Bio.Clustalw in Tutorial
References: <redmine.issue-3340.20120410202908@redmine.open-bio.org>
Message-ID: <redmine.journal-14965.20120927135738@redmine.open-bio.org>


Issue #3340 has been updated by Peter Cock.

Status changed from New to Resolved
% Done changed from 0 to 100

Fixed with Grace's commits, although she has also spotted a separate issue with the log odds matrix output later in the example:
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009958.html
http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009962.html

----------------------------------------
Bug #3340: Example using Bio.Clustalw in Tutorial
https://redmine.open-bio.org/issues/3340

Author: Peter Cock
Status: Resolved
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Documentation
Target version: 
URL: 


The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example.




From p.j.a.cock at googlemail.com  Fri Sep 28 10:50:52 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 11:50:52 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>
References: <CAKVJ-_4M1q9fw4N9XZ+hQ4BzeWsg4vX5NBwjSbB0J3Yss-pAPw@mail.gmail.com>
	<1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
	<CAMC681n6=UuotEUdxGVEWDK4vPGd3=4O0yW82UQ3upTNMfy1iw@mail.gmail.com>
	<CAKVJ-_6rTsfqphX6i+YGA8ijLN+04kP+Gxk=BjwWCcXJtF97Vg@mail.gmail.com>
	<CAKVJ-_7-KXVZ96bHLG6XD88zcN9rPvnTf7yQ0E6J1jhb_5yx+g@mail.gmail.com>
	<CAKVJ-_6U0PrsTWM8sMPgsSX8cnfTandTGKz5j829K8so7whPgA@mail.gmail.com>
Message-ID: <CAKVJ-_4PV3VMx5pju65578gq8TSN936T5ePH_cjhtUQcrECHYg@mail.gmail.com>

On Thu, Sep 20, 2012 at 10:08 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>> I guess we need to have a little hack with the 2to3 library and
>>> try defining our own custom fixer for the imports...
>>
>> I've made a start at this - the easy part seems to work :)
>>
>> https://github.com/peterjc/biopython/commits/py3lower
>>
>> ...

The code to do this lower case name mangling remains
quite a spaghetti-like mess in do2to3.py, but it now works
well enough to pass the test suite (with some but not all 3rd
party dependencies installed) under Linux and on my Mac
OS X machine (where, like Windows, I have a case
insensitive file system).

Here's a clean run on TravisCI (Linux with a case sensitive
file system):
https://travis-ci.org/#!/peterjc/biopython/jobs/2584146

I've not tried Windows itself yet, and I have only tested under Python 3.2.

Note if you want to try this, after switching to (and after
switching from) the py3lower branch you should delete
the build/py3.* folder where the 2to3 converted code
is cached.
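
A minimal sketch of that clean-up step, assuming it is run from the top
of a Biopython checkout. The helper name clear_2to3_cache is made up for
illustration and is not part of do2to3.py or setup.py:

```python
import glob
import os
import shutil


def clear_2to3_cache(repo_root="."):
    """Delete build/py3.* folders under repo_root and return their paths.

    Hypothetical helper: it only automates the manual deletion of the
    cached 2to3 output described above.
    """
    removed = []
    pattern = os.path.join(repo_root, "build", "py3.*")
    for path in glob.glob(pattern):
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    for folder in clear_2to3_cache():
        print("Removed %s" % folder)
```

Running it before and after switching branches should force the next
build to redo the 2to3 conversion from scratch.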

The good news is that only a handful of bits of code
needed special case code (e.g. finding the Entrez DTD
files), with most tweaks just to import lines (as mentioned
earlier) or renaming of internal variables.

So this idea to adopt PEP8 lower case module names
as part of supporting Python 3 appears to be technically
viable.

Peter


From p.j.a.cock at googlemail.com  Fri Sep 28 09:35:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 10:35:42 +0100
Subject: [Biopython-dev] ColorSpiral for Bio.Graphics
In-Reply-To: <CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>
References: <CAKVJ-_4SxNiZciWt4aL4PJ=8Q_VfeowZPEds2o6kAAGZT4rYSQ@mail.gmail.com>
	<CAKVJ-_69rK70ZEROxedDO4iU9uTJJ3096-DMYQhd0QRg9eEF7w@mail.gmail.com>
Message-ID: <CAKVJ-_5ZTT93qF=zyzn2JC_3u_pR0KXjD4MC01A8eZkbTcUPaA@mail.gmail.com>

On Thu, Sep 27, 2012 at 2:01 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> As he describes on his blog post, this required generating
>> arbitrary color sets, with the option of adding some noise
>> (or jitter as he called it) to make neighbouring colours
>> visually distinct (rather than the more typical requirement
>> of a smooth value to color mapping).
>>
>> ...
>
> I've committed it - we can still move/rename/etc until the
> next release if anyone has suggestions for improvement.
> https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7

The buildbot run last night spotted a problem under Python 2.5
(no cmath.rect function) which I've now fixed.
https://github.com/biopython/biopython/commit/ee933c3f5c4b98ab232c5180492dc11a46b89f0d
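
For readers hitting the same gap, one possible workaround is sketched
below. It is not necessarily what the commit above does, and the name
rect_fallback is an assumption used only for this illustration:

```python
import math


def rect_fallback(r, phi):
    """Convert polar coordinates (r, phi) to a complex number.

    Computes r * (cos(phi) + i*sin(phi)), the same conversion that
    cmath.rect performs on Python versions that provide it.
    """
    return r * (math.cos(phi) + math.sin(phi) * 1j)


try:
    from cmath import rect  # Missing under Python 2.5
except ImportError:
    rect = rect_fallback
```

Code can then call rect(r, phi) without caring which Python version it
is running under.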

We do test under Python 2.5 with TravisCI as well, but at
the moment we don't install the ReportLab dependency.
There is a balance between installing more dependencies
(to get more of our code tested) and the extra runtime
required (meaning the job is more likely to be killed, or
fail due to a network issue) giving false test failures.

Peter


From p.j.a.cock at googlemail.com  Fri Sep 28 10:06:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 11:06:10 +0100
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <87ipaywk47.fsf@fastmail.fm>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
	<CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
	<87ipaywk47.fsf@fastmail.fm>
Message-ID: <CAKVJ-_5J_d5=Ao1e+kkt-rZcB3rWebFj2gp2UZeUe4BXO4pcRw@mail.gmail.com>

On Fri, Sep 28, 2012 at 10:51 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> So, since the example hasn't been changed in 11 years, this
>> suggests either Brad committed the wrong output (and no-one
>> noticed), or something changed in the calculation during that
>> time.
>
> Seriously, I could have easily copy/pasted something wrong when writing
> this, so if there is no obvious code change I'd go with that assumption
> and fix the docs to be correct.

OK - I've done that:
https://github.com/biopython/biopython/commit/b57707f9f3afc0980a3dbf936f6642a4d9cc8a69

Thanks Brad & Grace,

Peter

P.S. I've included Grace as a contributor in the upcoming release notes
(please let me know if you'd prefer this as Hui Ting Grace Yeo instead):
https://github.com/biopython/biopython/commit/5af03e78f37cbce82ce167c762d892cce9cb062e


From bjoern at gruenings.eu  Fri Sep 28 13:03:22 2012
From: bjoern at gruenings.eu (=?ISO-8859-1?Q?Bj=F6rn_Gr=FCning?=)
Date: Fri, 28 Sep 2012 15:03:22 +0200
Subject: [Biopython-dev] [Patch] Genbank Parser
Message-ID: <1348837402.21455.1.camel@threonin>

Hi,

the tbl2asn tool from the NCBI creates GenBank files that do not have a
version number. Unfortunately that version number is used to fill
consumer.data.id.
I implemented the following fall-back:
If there is no version information available then it takes the
consumer.data.name for the consumer.data.id. Does that make sense?
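
The fall-back rule can be sketched as below. This is only an
illustration of the rule as described, not the attached patch: the
function choose_record_id and its arguments are hypothetical and do not
exist in Bio.GenBank.

```python
def choose_record_id(version, name):
    """Pick a record id: the version if present, otherwise the name.

    Hypothetical stand-in for the logic that fills consumer.data.id.
    'version' may be None or an empty string when the file has no
    VERSION line, in which case 'name' is used instead.
    """
    if version:
        return version
    return name


if __name__ == "__main__":
    print(choose_record_id("AB000001.1", "AB000001"))
    print(choose_record_id(None, "scaffold_1"))
```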

Thanks!
Bjoern

-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_genbank_id-fallback.diff
Type: text/x-patch
Size: 1016 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120928/8f4fe694/attachment-0002.bin>

From p.j.a.cock at googlemail.com  Fri Sep 28 13:38:11 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Sep 2012 14:38:11 +0100
Subject: [Biopython-dev] [Patch] Genbank Parser
In-Reply-To: <1348837402.21455.1.camel@threonin>
References: <1348837402.21455.1.camel@threonin>
Message-ID: <CAKVJ-_5nekcTBYejUTVV6VvjV+mB0WV0eoEWKytGZOTmgfmw1g@mail.gmail.com>

On Fri, Sep 28, 2012 at 2:03 PM, Björn Grüning <bjoern at gruenings.eu> wrote:
> Hi,
>
> the tbl2asn tool from the NCBI creates GenBank files that do not have a
> version number. Unfortunately that version number is used to fill
> consumer.data.id.
> I implemented the following fall-back:
> If there is no version information available then it takes the
> consumer.data.name for the consumer.data.id. Does that make sense?
>
> Thanks!
> Bjoern

Can you share some example output from tbl2asn that shows
this problem? Ideally something small we could include as a
unit test.

Thanks,

Peter


From chapmanb at 50mail.com  Fri Sep 28 09:51:36 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 28 Sep 2012 05:51:36 -0400
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
	<CAKVJ-_52FJpXZE2Q2vCP==6g8cfy4KOiSpMKmbSP2FWu2mJdVw@mail.gmail.com>
Message-ID: <87ipaywk47.fsf@fastmail.fm>


Grace and Peter;

[Different log odds matrix in documentation]

> However, that was just a reformatting of an older example which
> Brad wrote 11 years ago while converting the example from DNA
> to protein:

Gee, thanks for making me feel old.

> So, since the example hasn't been changed in 11 years, this
> suggests either Brad committed the wrong output (and no-one
> noticed), or something changed in the calculation during that
> time.

Seriously, I could have easily copy/pasted something wrong when writing
this, so if there is no obvious code change I'd go with that assumption
and fix the docs to be correct.

Thanks for spotting this,
Brad


From bjoern at gruenings.eu  Thu Sep 27 22:11:05 2012
From: bjoern at gruenings.eu (bjoern at gruenings.eu)
Date: Fri, 28 Sep 2012 00:11:05 +0200 (CEST)
Subject: [Biopython-dev] [Patch] Genbank Parser fall-back data.id
Message-ID: <59367.132.230.56.143.1348783865.squirrel@mail.gruenings.eu>

Hi,

the tbl2asn tool from the NCBI creates GenBank files that do not have a
version number. Unfortunately that version number is used to fill
consumer.data.id.
I implemented the following fall-back:
If there is no version information available then it takes the
consumer.data.name for the consumer.data.id. Does that make sense?

Thanks!
Bjoern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_genbank.diff
Type: text/x-patch
Size: 1015 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20120928/6027b056/attachment-0002.bin>

From p.j.a.cock at googlemail.com  Sat Sep 29 12:10:24 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 13:10:24 +0100
Subject: [Biopython-dev] Fwd: [Utilities-announce] PubMed E-Utility 2013 DTD
	updates
In-Reply-To: <mailman.155270.1348855208.20059.utilities-announce@ncbi.nlm.nih.gov>
References: <A9D8BF3D8A74DF4A925FB541C0F39D2A16F079160B@NIHMLBX15.nih.gov>
	<mailman.155270.1348855208.20059.utilities-announce@ncbi.nlm.nih.gov>
Message-ID: <CAKVJ-_7c7fcp7BrQcJ82gV+oEASvNZ3B0+bg364_Y5UweM=HqA@mail.gmail.com>

I've added the two new DTD files mentioned below:
https://github.com/biopython/biopython/commit/2a09b03ab4d861e91eb543bd6df717ecb4fdf097

Peter

---------- Forwarded message ----------
From: **
Date: Friday, September 28, 2012
Subject: [Utilities-announce] PubMed E-Utility 2013 DTD updates
To: NLM/NCBI List utilities-announce <utilities-announce at ncbi.nlm.nih.gov>


NCBI PubMed E-Utility Users,

We anticipate updating the PubMed E-Utility DTDs for 2012 in mid-December,
approximately on December 10 or 11, 2012.

The forthcoming DTDs are available from:

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedlinecitationset_130101.dtd

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_130101.dtd

Changes to NLMMedlineCitationSet DTD AND MEDLINE/PubMed XML:

- Indicating abstracts not in MEDLINE/PubMed but available from publishers

English-language abstracts are taken directly from the published article
and included in the <Abstract> and <AbstractText> elements. If the article
does not have a published abstract, the record lacks the <Abstract> and
<AbstractText> elements. However, publishers may create English-language
abstracts that are not published with the article, as well as,
non-English-language abstracts that may or may not be published with the
article.

These other abstracts will be indicated in the <OtherAbstract> element. A
new "Language" attribute is added to the <OtherAbstract> element. The
<AbstractText> element will carry the standard phrase: "Abstract available
from the publisher."

DTD:

<!ELEMENT OtherAbstract (AbstractText+,CopyrightInformation?)>
<!ATTLIST            OtherAbstract Type (AAMC | AIDS | KIE | PIP | NASA |
Publisher) #REQUIRED
                                Language (#PCDATA ) "eng">

Sample XML:

<OtherAbstract Type="Publisher" Language="fre"> <AbstractText> Abstract
available from the publisher.</AbstractText>
</OtherAbstract>

- Rename NameID to Identifier

The NameID element was created in 2010 and modified in 2011 but has not yet
been used. NameID is renamed to Identifier. Identifier is an optional,
possibly multiply-occurring element permissible within the Author (personal
and collective) and Investigator elements.  The value in the Identifier
attribute Source designates the organizational authority that established
the unique identifier.

DTD:

<!ELEMENT     Author (((LastName, ForeName?, Initials?, Suffix?) |
CollectiveName),Identifier*)>
<!ATTLIST     Author ValidYN (Y | N) "Y">

<!ELEMENT     Investigator (LastName,ForeName?,
Initials?,Suffix?,Identifier*,Affiliation?)>
<!ATTLIST     Investigator ValidYN (Y | N) "Y">

<!ELEMENT     Identifier (#PCDATA)>
<!ATTLIST     Identifier
                      Source CDATA #REQUIRED >

Sample XML:

<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>John</ForeName>
<Initials>A</Initials>
<Identifier Source="ORCID">55555555555555</Identifier>
</Author>

Thank you.
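
To make the Identifier change concrete, here is a minimal sketch that
reads the new element from the sample Author record above, written with
plain ASCII quotes around the Source value. It uses only the standard
library's xml.etree.ElementTree rather than Bio.Entrez, and the helper
name author_identifiers is made up for this example:

```python
import xml.etree.ElementTree as ET

# Sample record from the announcement, with ASCII quotes restored.
SAMPLE_AUTHOR = """<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>John</ForeName>
<Initials>A</Initials>
<Identifier Source="ORCID">55555555555555</Identifier>
</Author>"""


def author_identifiers(author_xml):
    """Return a list of (Source, value) pairs for each Identifier child.

    The DTD allows zero or more Identifier elements per Author, so an
    author without any identifiers gives an empty list.
    """
    author = ET.fromstring(author_xml)
    return [(elem.get("Source"), elem.text)
            for elem in author.findall("Identifier")]


if __name__ == "__main__":
    print(author_identifiers(SAMPLE_AUTHOR))
```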


From p.j.a.cock at googlemail.com  Sat Sep 29 20:25:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:25:14 +0100
Subject: [Biopython-dev] Nexus __slots__ and Python 3.3
Message-ID: <CAKVJ-_4-7hma258+wo8kSwjDD5H=Fxxtc87j_3Ywm-UdmdjYDw@mail.gmail.com>

Hello all,

I've started testing under the newly released Python 3.3,
and there is a new problem which I don't recall running
into when I tried one of the Python 3.3 alpha releases:

$ python3 test_Nexus.py
Traceback (most recent call last):
  File "test_Nexus.py", line 7, in <module>
    from Bio.Nexus import Nexus, Trees
  File "/Users/peterjc/lib/python3.3/site-packages/Bio/Nexus/Nexus.py",
line 513, in <module>
    class Nexus(object):
ValueError: 'original_taxon_order' in __slots__ conflicts with class variable

I can fix this with the following change, which appears
to have no side effects under Python 2 (the unit tests
still all pass):

$ git diff
diff --git a/Bio/Nexus/Nexus.py b/Bio/Nexus/Nexus.py
index 1d6abd2..8c7fbcc 100644
--- a/Bio/Nexus/Nexus.py
+++ b/Bio/Nexus/Nexus.py
@@ -511,8 +511,6 @@ class Block(object):

 class Nexus(object):

-    __slots__=['original_taxon_order','__dict__']
-
     def __init__(self, input=None):
         self.ntax=0                     # number of taxa
         self.nchar=0                    # number of characters

I have committed this:
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634

However, I'm not really sure what the intention of this
line was in the first place. It is (assuming I didn't miss
anything with grep), or now was, the only use of
__slots__ in the whole of Biopython.
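
For anyone wanting to reproduce the error outside Biopython, here is a
minimal sketch, assuming the conflict comes from a class-level attribute
sharing its name with a __slots__ entry. The Demo class is invented for
this illustration and is not the Nexus class itself:

```python
def define_conflicting_class():
    """Try to define a class whose slot name is also a class attribute.

    Returns the error message if class creation fails with ValueError,
    or None if the class was created without complaint.
    """
    try:
        class Demo(object):
            __slots__ = ['original_taxon_order', '__dict__']
            # Class-level attribute with the same name as the slot:
            original_taxon_order = []
    except ValueError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(define_conflicting_class())
```

Removing either the __slots__ line or the clashing class attribute
should let the class definition succeed.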

Regards,

Peter


From p.j.a.cock at googlemail.com  Sat Sep 29 20:34:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:34:27 +0100
Subject: [Biopython-dev] PAML test problems under Python 3.3.0
Message-ID: <CAKVJ-_4DCG=_d097D=M5Ld1AthCVmZ50qixL4HR7OLOK68ZkuQ@mail.gmail.com>

Hi Brandon (et al),

Could you have a look at the PAML unit tests under Python 3.3 please?
I see a mix of failures and 'blocking' under a self-compiled Python 3.3.0
on Mac OS X 10.8 (Mountain Lion):

$ python3 test_PAML_yn00.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_codeml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAA (__main__.ModTest) ... ok
testParseAAPairwise (__main__.ModTest) ... ok
testParseAllNSsites (__main__.ModTest) ... ok
testParseBranchSiteA (__main__.ModTest) ... ok
testParseCladeModelC (__main__.ModTest) ... ok
testParseFreeRatio (__main__.ModTest) ... ok
testParseNSsite3 (__main__.ModTest) ... ok
testParseNgene2Mgene02 (__main__.ModTest) ... ok
testParseNgene2Mgene1 (__main__.ModTest) ... ok
testParseNgene2Mgene34 (__main__.ModTest) ... ok
testParsePairwise (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_baseml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testParseAlpha1Rho1 (__main__.ModTest) ... ok
testParseModel (__main__.ModTest) ... ok
testParseNhomo (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

If you've not tried this before, the procedure I'm using is:

$ python3 setup.py build
$ cd build/py3.3/Tests
$ python3 test_PAML_baseml.py
etc

The key point is that to run the tests directly (rather than
just via 'python3 setup.py test') you must change
directory to the 2to3 converted folder under the build
folder.

By commenting out the test methods which seem to be
blocking, it seems some of the failures are to do with
exception handling. I've not dug any further into this.

Thanks,

Peter