From chapmanb at 50mail.com  Tue Sep  1 09:06:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Sep 2009 09:06:39 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
Message-ID: <20090901130639.GI75451@sobchak.mgh.harvard.edu>

Hi Peter;

[indexed dict usage]
> What file formats were you working on, and how many records?

It was a 100Mb fasta file with about 41,000 records. Nothing too
heavy but it worked great. The only change I made was to generalize
the record building line:
                
self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)

to allow an arbitrary function to be passed to define the
identifier, instead of defaulting to the first part of the line.
This is helpful for those fun NCBI ids
(gi|83029091|ref|XM_357633.3|) where other parts of the program only
have the accession number.
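
A minimal sketch of such an identifier function, assuming the
pipe-delimited layout shown above. The name accession_from_ncbi_id and
the strip_version flag are hypothetical, not part of Bio.SeqIO:

```python
def accession_from_ncbi_id(identifier, strip_version=False):
    """Return the accession from an NCBI style FASTA identifier.

    Hypothetical helper for illustration only. It assumes the layout
    gi|number|database|accession.version| and returns the identifier
    unchanged if it does not look like that.
    """
    fields = identifier.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        accession = fields[3]
        if strip_version:
            # Drop the ".3" style version suffix if one is present
            accession = accession.split(".", 1)[0]
        return accession
    return identifier


print(accession_from_ncbi_id("gi|83029091|ref|XM_357633.3|"))
```

Whether the version suffix should be kept depends on what the rest of
the program uses as its key, hence the optional flag.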

> True. Have you got any bright ideas for a better name? While the
> index is in memory, the SeqRecord objects are not (unlike the
> original Bio.SeqIO.to_dict() function).
> 
> Or we have one function Bio.SeqIO.indexed_dict() which can
> either use an in memory index, OR an on disk index, offering
> the same functionality.

That's a nice idea -- provide some reasonable defaults based on file
size and type, and allow them to be over-ridden with function
params.

> >> Another option (like the shelve idea we talked about last month)
> >> is to parse the sequence file with SeqIO, and serialise all the
> >> SeqRecord objects to disk, e.g. with pickle or some key/value
> >> database. This is potentially very complex (e.g. arbitrary Python
> >> objects in the annotation), and could lead to a very large "index"
> >> file on disk. On the other hand, some possible back ends would
> >> allow editing the database... which could be very useful.
> >
> > My thought here was to use BioSQL and the SQLite mappings for
> > serializing. We build off a tested and existing serialization, and
> > also guide people into using BioSQL for larger projects.
> > Essentially, we would build an API on top of existing BioSQL
> > functionality that creates the index by loading the SQL and then
> > pushes the parsed records into it.
> 
> Using BioSQL in this way is a much more general tool than
> simply "indexing a sequence file". It feels like a sledgehammer
> to crack a nut. Also, do you expect it to scale well for 10 million
> plus short reads? It may do, but on the other hand it may not.

Agreed that it would introduce extra overhead for something like
short reads. If you are talking about serializing SeqRecords, it
would make sense to re-use what we have in BioSQL. If you are
talking about storing just file offsets, then a lightweight solution
makes more sense.

For me, the initial parse time to prepare an index is not as much of an
issue since it happens once while queries on it will happen multiple
times.

> Also while the current BioSQL mappings are "tried and tested",
> they don't cover everything, in particular per-letter-annotation
> such as a set of quality scores (something that needs addressing
> anyway, probably with JSON or XML serialisation).

Agreed, but the advantage is that improvements can feed back into
BioSQL, instead of work in parallel.

> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).

That makes sense. SQLite has in-memory caching which could help with
some of the decision making as it would handle writing and holding
in memory without having to reimplement that bit. Another file based
indexing scheme is the one in bx-python:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py

This is a bit more specific as it also handles queries based on
genomic intervals in addition to retrieving by file position. It may
be useful for looking at the underlying storage details.
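
As a rough sketch of the lightweight "file offsets in SQLite" idea
(this is not the Biopython or bx-python implementation; the table name,
column names and function names are invented for illustration), using
only the standard library sqlite3 module:

```python
import sqlite3


def build_offset_table(fasta_path, db_path=":memory:"):
    """Store record key -> byte offset pairs for a FASTA file in SQLite.

    Illustrative only. The key is simply the first word of each ">"
    title line, and the offset is where that line starts in the file.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (key TEXT PRIMARY KEY, offset INTEGER)")
    # Binary mode so that tell() reports true byte offsets
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                key = words[0].decode("ascii") if words else ""
                con.execute("INSERT INTO offsets VALUES (?, ?)", (key, offset))
    con.commit()
    return con


def lookup_offset(con, key):
    """Return the stored byte offset for key, or raise KeyError."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

Passing a fresh filename as db_path instead of ":memory:" would keep the
offsets on disk, which is the trade-off being discussed above.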

Brad

From biopython at maubp.freeserve.co.uk  Tue Sep  1 09:25:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:25:22 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [indexed dict usage]
>> What file formats were you working on, and how many records?
>
> It was a 100Mb fasta file with about 41,000 records. Nothing too
> heavy but it worked great.

Yeah, with just 41,000 keys and offsets the in memory dict would
be pretty small too. This is within the range of file sizes I expect
the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.

> The only change I made was to generalize the record building line:
>
> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>
> to allow an arbitrary function to be passed to define the
> identifier, instead of defaulting to the first part of the line.
> This is helpful for those fun NCBI ids
> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
> have the accession number.

Did your callback function get given the "title string" and return
the desired key?

I had wondered about this, but the only way for this to be general
(to work on all file formats) is for the callback function to be given
a SeqRecord object - which means having to fully parse the file
during the indexing, which ends up being *much* slower. We can
do this if you think it adds a lot of utility, i.e. mimic the key_function
argument we already have on Bio.SeqIO.to_dict()

Peter

From biopython at maubp.freeserve.co.uk  Tue Sep  1 09:38:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:38:07 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
	<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
	<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
	<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
	<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
	<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
	<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
	<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
	<320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
Message-ID: <320fb6e00909010638v5c9cec06t66b24e1e755c46cb@mail.gmail.com>

On Fri, Aug 14, 2009 at 1:00 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>> Jose's code uses seek/tell which means it has to have a handle
>>> to an actual file. He also used binary read mode - I'm not sure if
>>> this was essential or not.
>>
>> Binary mode was not essential - opening an SFF file in default
>> mode also seemed to work fine with Jose's code.
>
> Having worked on this more, default mode or binary mode are fine.
> However, as you might expect, you can't use Python's universal
> read lines mode when parsing SFF files.

Just to clarify this for the record - on Unix you can parse an SFF file
opened in default mode ("r") or binary mode ("rb") but not universal
read line mode ("rU"). However, on Windows only binary mode works.

I've updated my SFF code on github to catch this (as otherwise the
error messages are rather cryptic).
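
A minimal sketch of one way to catch a wrongly opened handle early,
written for Python 3 where handles are either text or binary on every
platform (and where the "rU" mode no longer exists as of Python 3.11).
The function name require_binary_handle is hypothetical, and this is
not necessarily the check used in the SFF code on github:

```python
import io


def require_binary_handle(handle):
    """Return handle if it yields bytes, otherwise raise ValueError.

    Reading zero bytes consumes nothing from the file, but the type of
    the (empty) result shows whether the handle is in binary mode.
    """
    if not isinstance(handle.read(0), bytes):
        raise ValueError(
            "SFF files must be opened in binary mode, e.g. open(name, 'rb')"
        )
    return handle


# An in-memory binary handle passes the check
require_binary_handle(io.BytesIO(b".sff"))
```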

Peter

From biopython at maubp.freeserve.co.uk  Tue Sep  1 09:56:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:56:26 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909010656h594e908cu246138d45442df45@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>
>Peter wrote:
>> Using BioSQL in this way is a much more general tool than
>> simply "indexing a sequence file". It feels like a sledgehammer
>> to crack a nut. Also, do you expect it to scale well for 10 million
>> plus short reads? It may do, but on the other hand it may not.
>
> Agreed that it would introduce extra overhead for something like
> short reads. If you are talking about serializing SeqRecords, it
> would make sense to re-use what we have in BioSQL.

I wasn't talking about serialising SeqRecord objects. I agree
there is (almost) no point implementing new serialisation code
when we already have BioSQL.

> If you are talking about storing just file offsets, then a lightweight
> solution makes more sense.

Indeed.

> For me, the initial parse time to prepare an index is not as much
> of an issue since it happens once while queries on it will happen
> multiple times.

It depends on the expected work load - if you are thinking about
indexing a local copy of GenBank, but only expect to pull out a
few (hundred) records, then the index time may be longer than
the total access time.

But in general, if we are talking about saving the index to a file
(which can then be reloaded) I would agree, the up front cost to
prepare the index isn't critical.

On the subject of how to store an index of file offsets on disk,
I think the old Biopython Martel/Mindy indexing code used to
create OBDA style indexes (either simple flat files or BDB based).
We should certainly consider these for cross project compatibility,
or perhaps introduce a new OBDA version which might use
something like SQLite internally instead?
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
http://lists.open-bio.org/pipermail/open-bio-l/2009-September/000567.html

Peter

From eoc210 at googlemail.com  Wed Sep  2 08:25:24 2009
From: eoc210 at googlemail.com (Ed Cannon)
Date: Wed, 2 Sep 2009 13:25:24 +0100
Subject: [Biopython-dev] OBO2OWL parser / converter
In-Reply-To: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
	<3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
Message-ID: <9e02410b0909020525w5cbf59dek46e0ab1b5144f8@mail.gmail.com>

Hi Hilmar,

My OBO2OWL parser is implemented based on Tirmizi & Miranker's paper titled
"OBO2OWL: Roundtrip between OBO and OWL"
(http://www.cs.utexas.edu/~hamid/pub/tirmizi-obo2owl-tr-06-47.pdf) [1].

After having looked at the link you sent me to the OBO2OWL mappings google
spreadsheet, it appears that there are some differences, which I'm looking
into at the minute.

Ref:

1. Syed Hamid Tirmizi and Daniel P Miranker. (2006). OBO2OWL: Roundtrip
between OBO and OWL. The University of Texas at Austin, Department of
Computer Sciences, Technical Report TR-06-47, October 2, 16 pages.


Cheers,

Ed

2009/8/31 Hilmar Lapp <hlapp at gmx.net>

> Hi Ed -
>
> is your converter operating in a way that is congruent with (or even
> utilizing) the mapping and the converter provided by the NCBO and Berkeley
> Ontology projects?
>
> http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
>
> If not, I'm not sure how beneficial it is for users to have multiple and
> possibly conflicting mappings.
>
>        -hilmar
>
>
> On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote:
>
>> Hi All,
>>
>> I would like to thank you guys for all your hard work and effort in making
>> biopython a great piece of open software.
>>
>> I would also like to introduce myself, my name is Ed Cannon, I am a
>> postdoc at Cambridge University working in the fields of
>> chemo/bioinformatics and semantic web technologies in the group of
>> Peter Murray-Rust.
>>
>> Since a fair amount of my work involves ontologies, I have written an open
>> biomedical ontology (.obo) to web ontology language (.owl) converter. The
>> resultant file can be loaded and used from Protege. I was wondering if
>> this software would be of any interest to the biopython community? I have
>> just sent a pull request to biopython on github. The code is located at
>> my branch on my account: http://github.com/eoc21/biopython/tree/eoc21Branch.
>>
>> Thanks,
>>
>> Ed
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


From bugzilla-daemon at portal.open-bio.org  Wed Sep  2 11:24:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Sep 2009 11:24:19 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909021524.n82FOJ7U021693@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-02 11:24 EST -------
(In reply to comment #3)
> I can now parse the Roche SFF index, allowing fast random access to
> the reads. See:
> 
> http://github.com/peterjc/biopython/commits/index
> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html
> 
> Peter

That branch now has support for SeqIO parsing, indexing and *writing* of
SFF files. The write support is still very new and needs more testing,
but is looking promising. Note that while currently I read the undocumented
Roche style SFF index block, I have not yet attempted to write out such an
index (probably unwise unless the format does get published?).

Also note that there is still scope for improvement for how the trimming
information is presented in the SeqRecord object (perhaps some kind of
masked SeqRecord/Seq as has been suggested on the mailing lists).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Wed Sep  2 12:45:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Sep 2009 12:45:48 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909021645.n82GjmbA023923@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-02 12:45 EST -------
(In reply to comment #4)
> That branch now has support for SeqIO parsing, indexing and *writing* of
> SFF files. The write support is still very new and needs more testing,
> but is looking promising. Note that while currently I read the undocumented
> Roche style SFF index block, I have not yet attempted to write out such an
> index (probably unwise unless the format does get published?).

It now has a first attempt at writing a Roche style SFF index, which my code
will parse back again happily. I have not yet tried the resulting file with
the Roche SFF tools. Note that this does not preserve any Roche XML meta data.
Note also that the index is skipped if any of the record names are not 14 chars
long (which is true of all the Roche indexes I have looked at).



From bugzilla-daemon at portal.open-bio.org  Fri Sep  4 06:23:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Sep 2009 06:23:26 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909041023.n84ANQgj023187@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-04 06:23 EST -------
I've been working on the Roche SFF indexes, and via their tools have discovered
there are at least two index block formats used:

Most SFF files I have looked at have an index block which starts ".mft1.00"
(short for Manifest v1.00 is my guess) and holds both an XML "manifest"
(meta data) and a read offset index.

You can also get SFF files where the index block starts ".srt1.00" (Short Read
Table v1.00 maybe?) which have just an index.

The index details themselves are the same in both cases, and support
arbitrary read name lengths. The offset is in base 255 (not 256), apparently so
that byte 255 (0xFF) can be used as a separator character. For typical Roche
SFF files, the read names are 14 characters, and the index uses 20 bytes per
read.
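
To make the base 255 arithmetic concrete, here is a hedged sketch of
decoding one such entry. The layout assumed here (read name, then five
digit bytes, then the 0xFF separator, giving 14 + 5 + 1 = 20 bytes for
a typical read) and the most-significant-digit-first order are
assumptions made for illustration, as are the function names:

```python
def decode_base255(digits):
    """Convert base 255 digit bytes to an integer.

    Assumes the most significant digit comes first. Each digit must be
    in the range 0 to 254, since 0xFF is reserved as the separator.
    """
    value = 0
    for digit in digits:
        if digit == 0xFF:
            raise ValueError("0xFF is the separator, not a base 255 digit")
        value = value * 255 + digit
    return value


def split_index_entry(entry):
    """Split one index entry into (read name, file offset).

    Assumed layout: name bytes, five base 255 digits, one 0xFF byte.
    """
    if entry[-1] != 0xFF:
        raise ValueError("Entry should end with the 0xFF separator")
    name = entry[:-6].decode("ascii")
    return name, decode_base255(entry[-6:-1])


# Digits (0, 0, 0, 2, 3) mean 2 * 255 + 3
print(split_index_entry(b"ABCDEFGHIJKLMN" + bytes([0, 0, 0, 2, 3]) + b"\xff"))
```

Because a leading zero digit does not change the value, this reading
also works if the first of the five bytes is really a null terminator
for the name.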



From bugzilla-daemon at portal.open-bio.org  Fri Sep  4 06:54:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Sep 2009 06:54:39 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909041054.n84AsdNe023921@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-04 06:54 EST -------
The Staden IO lib has references to ".srt1.00" (454 sorted v1.00) and also
another SFF index format, which starts ".hsh1.00" (hash table v1.00).

See files io_lib/progs/hash_sff.c and io_lib/open_trace_file.c from
http://sourceforge.net/projects/staden/

Scanning their code also confirms my base 255 deduction for the ".srt" indexes,
see function getuint4_255, and the use of 0xFF as a break character.
Interestingly they only expect 4 bytes for the offset (limiting this to almost
4GB SFF files). There is a fifth byte which is usually null, this could be a
name terminator (although this is not actually needed), or used for 4GB+ SFF
offsets.
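
The "almost 4GB" figure follows directly from the digit count, as this
small worked calculation shows (the function name offset_capacity is
just for illustration):

```python
def offset_capacity(num_digits):
    """Number of distinct offsets that num_digits base 255 digits can hold."""
    return 255 ** num_digits


four = offset_capacity(4)
print(four)            # 4228250625, slightly below 2**32 = 4294967296
print(four / 2 ** 30)  # size in GiB, a little under 4
print(offset_capacity(5) > 2 ** 32)  # a fifth digit would lift the limit
```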



From biopython at maubp.freeserve.co.uk  Fri Sep  4 11:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 16:33:16 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>

Hi David,

[This is a continuation of a thread on the main list, but it is much more
suited to the dev list now.]

On Tue, Sep 1, 2009 at 11:38 PM, David Winter wrote:
> Peter wrote:
>> David - I would prefer we also put your new wrappers in
>> Bio.Emboss.Applications, and would be happy to look at adding
>> those to CVS now that Biopython 1.51 is out (I had forgotten
>> about them actually - so thanks for the reminder).
>>
>> Peter
>
> Hi Peter,
>
> I'd almost forgotten about them myself! I only put them in their own module
> because I had the PhyML wrapper as well and that's not an EMBOSS
> application.

I see you've done that on github. I had a look at merging this into CVS,
but had a few comments first.

I found you had a load of tabs in your file (please use 4 space indentation
in future). http://www.biopython.org/wiki/Contributing#Coding_conventions

I am unclear why you are subclassing _EmbossMinimalCommandLine
instead of _EmbossCommandLine since most (all?) of the new wrappers
use the "outfile" parameter. As I recall EMBOSS isn't fussy about the
presence of the equals sign (right now our wrappers mostly omit the
equals, but not all the time - which looks odd to me).

Also your code seems to be missing the __str__ / _validate changes
on the trunk.

And finally, I think you can add yourself to the copyright at the top of
the file for this work ;)

Peter

From biopython at maubp.freeserve.co.uk  Fri Sep  4 13:22:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 18:22:27 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
Message-ID: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:25 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>> Hi Peter;
>>
>> [indexed dict usage]
>>> What file formats were you working on, and how many records?
>>
>> It was a 100Mb fasta file with about 41,000 records. Nothing too
>> heavy but it worked great.
>
> Yeah, with just 41,000 keys and offsets the in memory dict would
> be pretty small too. This is within the range of file sizes I expect
> the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.
>
>> The only change I made was to generalize the record building line:
>>
>> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>>
>> to allow an arbitrary function to be passed to define the
>> identifier, instead of defaulting to the first part of the line.
>> This is helpful for those fun NCBI ids
>> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
>> have the accession number.
>
> Did your callback function get given the "title string" and return
> the desired key?
>
> I had wondered about this, but the only way for this to be general
> (to work on all file formats) is for the callback function to be given
> a SeqRecord object - which means having to fully parse the file
> during the indexing, which ends up being *much* slower. We can
> do this if you think it adds a lot of utility i.e. mimic the key_function
> argument we already have on Bio.SeqIO.to_dict()

A less flexible option is a callback function which maps the default
record.id to a new key. This would solve your NCBI FASTA issue,
and might be handy in other settings (e.g. removing the version
suffix in GenBank identifiers). However, it would not allow for
example switching to a completely different identifier (e.g. the GI
number) which is present elsewhere in the file.

The point is we can support this kind of limited key_function
without suffering the severe speed penalty which doing a full
parse to give SeqRecord objects would impose.
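
A minimal sketch of this limited key_function idea for FASTA, assuming
the default identifier is the first word of the title line. This is not
the Bio.SeqIO code; index_fasta_offsets and strip_version are
hypothetical names. No SeqRecord is built while indexing, which is
where the speed saving would come from:

```python
def index_fasta_offsets(path, key_function=None):
    """Map each record key to the byte offset of its ">" title line.

    key_function, if given, receives the default identifier (the first
    word of the title line) and returns the key to store instead.
    """
    offsets = {}
    # Binary mode so that tell() reports true byte offsets
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                default_id = words[0].decode("ascii") if words else ""
                key = key_function(default_id) if key_function else default_id
                if key in offsets:
                    raise ValueError("Duplicate key %r" % key)
                offsets[key] = offset
    return offsets


def strip_version(identifier):
    """Drop a trailing version suffix, e.g. XM_357633.3 -> XM_357633."""
    return identifier.rsplit(".", 1)[0]
```

Calling index_fasta_offsets(filename, key_function=strip_version) would
then give keys without the version suffix. Switching to an identifier
found elsewhere in the record is not possible with this approach, as
noted above.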

How does that sound Brad? It should add just a little complexity
to the current code, and allows some neat tricks. Or we can
leave things as they are (KISS).

Peter

From mjldehoon at yahoo.com  Sat Sep  5 04:17:00 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 5 Sep 2009 01:17:00 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez.parse
Message-ID: <339938.48242.qm@web62405.mail.re1.yahoo.com>

Hi everybody,
Recently I was trying to parse a huge Entrez XML file containing Entrez gene
records. Because of the size of the file, Entrez.read failed with a memory
error since it could not keep the entire information in the XML file in
memory. I decided to add a parse() function to Bio.Entrez that can iterate
over such large files. This function is useful if the XML file essentially
contains a list of records; the parse() function is a generator function
that returns these records one by one.
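
The general technique can be sketched with the standard library alone.
This is not the Bio.Entrez implementation; iter_records and the tag
names in the usage line are hypothetical, and the sketch assumes the
records are direct children of the root element:

```python
import io
import xml.etree.ElementTree as ET


def iter_records(handle, record_tag):
    """Yield each <record_tag> element in turn without loading the whole file."""
    context = ET.iterparse(handle, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Detach finished records from the root so memory use stays flat
            root.clear()


xml_data = b"<Set><Rec><Id>1</Id></Rec><Rec><Id>2</Id></Rec></Set>"
for record in iter_records(io.BytesIO(xml_data), "Rec"):
    print(record.findtext("Id"))
```

Only one record is held at a time, so the caller should extract what it
needs from each element before asking for the next one.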

--Michiel.


From p.j.a.cock at googlemail.com  Sat Sep  5 08:59:09 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 5 Sep 2009 13:59:09 +0100
Subject: [Biopython-dev] Bio.Entrez.parse
In-Reply-To: <339938.48242.qm@web62405.mail.re1.yahoo.com>
References: <339938.48242.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00909050559p2c9da2f1o60905ac3dfe0cb35@mail.gmail.com>

On Sat, Sep 5, 2009 at 9:17 AM, Michiel de Hoon<mjldehoon at yahoo.com> wrote:
> Hi everybody,
> Recently I was trying to parse a huge Entrez XML file containing Entrez gene
> records. Because of the size of the file, Entrez.read failed with a memory
> error since it could not keep the entire information in the XML file in memory.
> I decided to add a parse() function to Bio.Entrez that can iterate over such large
> files. This function is useful if the XML file essentially contains a list of records;
> the parse() function is a generator function that returns these records one by one.

That sounds excellent - I'd noticed that usually Bio.Entrez.read() would return
a list of (large nested) records, so this should be a natural extension.

Peter

From biopython at maubp.freeserve.co.uk  Mon Sep  7 07:56:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 12:56:17 +0100
Subject: [Biopython-dev] Anonymous CVS working again :)
Message-ID: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>

Just an FYI,

While the developer server dev.open-bio.org has been fine, recently
our public read only mirror at cvs.open-bio.org (and cvs.biopython.org)
had not been updated. This affected Biopython and EMBOSS.

And for Biopython as a knock on effect, this had meant the latest
code at http://biopython.org/SRC/biopython/ was a little out of date.
[Biopython's github mirror was not affected]

These all seem to be working fine once again - thanks to someone
at the OBF - let me know who and I'll buy you a beer when we (next)
meet up :)

Peter

From biopython at maubp.freeserve.co.uk  Mon Sep  7 13:34:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 18:34:53 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
Message-ID: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>

On Fri, Sep 4, 2009 at 4:33 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi David,
>
> [This is a continuation of a thread on the main list, but it is much more
> suited to the dev list now.]
>
> ...
>
> I see you've done that on github. I had a look at merging this into CVS,
> but had a few comments first.
>
> I found you had a load of tabs in your file (please use 4 space indentation
> in future). http://www.biopython.org/wiki/Contributing#Coding_conventions

Thanks.

> I am unclear why you are subclassing _EmbossMinimalCommandLine
> instead of _EmbossCommandLine since most (all?) of the new wrappers
> use the "outfile" parameter. As I recall EMBOSS isn't fussy about the
> presence of the equals sign (right now our wrappers mostly omit the
> equals, but not all the time - which looks odd to me).

I see you've switched to _EmbossCommandLine - fine.

> Also your code seems to be missing the __str__ / _validate changes
> on the trunk.

Also fixed, thanks.

> And finally, I think you can add yourself to the copyright at the top of
> the file for this work ;)

Cool.

I have checked this into CVS, but did also fix an old typo (in a docstring)
and one new typo (in an argument name). Thanks David!

Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
based on test_Emboss.py? Continuing on the github branch is fine.

We should put you in the CONTRIB file now too (are there any other
recent people we've missed?). Would you like to give a webpage, or
is this email address fine (be warned it may get harvested for spam)?

Thank you,

Peter

From biopython at maubp.freeserve.co.uk  Mon Sep  7 16:00:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 21:00:46 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To: <B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
	<B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
Message-ID: <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>

> Are these being kept in sync? bioperl's moved completely away from
> cvs to svn with very little pain. We found sync-ing the two more trouble
> than it was worth.

Perhaps we are talking at cross purposes here Chris.

Right now Biopython and EMBOSS are using CVS, with developers
committing to dev.open-bio.org, which then updates a read only CVS
mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
to provide anonymous access.

Likewise, BioPerl etc are using SVN, with developers committing to
dev.open-bio.org, which then updates a read only SVN mirror at
code.open-bio.org (or its other aliases) to provide anonymous access.

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep  7 17:26:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 22:26:26 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To: <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
	<B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
	<320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>
	<9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
Message-ID: <320fb6e00909071426w1dfed95bx703384b3227eee6b@mail.gmail.com>

On Mon, Sep 7, 2009 at 9:44 PM, Chris Fields<cjfields at illinois.edu> wrote:
> On Sep 7, 2009, at 3:00 PM, Peter wrote:
>
>>> Are these being kept in sync? bioperl's moved completely away from
>>> cvs to svn with very little pain. We found sync-ing the two more trouble
>>> than it was worth.
>>
>> Perhaps we are talking at cross purposes here Chris.
>>
>> Right now Biopython and EMBOSS are using CVS, with developers
>> committing to dev.open-bio.org, which then updates a read only CVS
>> mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
>> to provide anonymous access.
>>
>> Likewise, BioPerl etc are using SVN, with developers committing to
>> dev.open-bio.org, which then updates a read only SVN mirror at
>> code.open-bio.org (or its other aliases) to provide anonymous access.
>>
>> Peter
>
> Right, I understand that, but you also have a git repo on github (unless I'm
> mistaken).  Based on that I assume you plan on migrating over to dev git
> and/or github eventually, but I'm unsure of the future of the CVS repo.

Right! For now, CVS changes are pushed to github. Once we move to
git, the CVS repo will no longer be used, and will be left frozen in time.

> My point was, we had been in a similar situation.  We had thought of having
> a sync'ed CVS <-> SVN repo at one point, but it was way too much trouble to
> deal with and just dropped CVS altogether after the migration.  Instead, we
> just started switching all docs over to point to svn instead with lots of
> ample warning on the mail lists, and it all worked out in the end (we have
> had very few users inquiring about CVS).

Likewise, we could have git changes pushed into CVS, but there
is little point. We plan to just quit using CVS.

Peter


From david.winter at gmail.com  Mon Sep  7 18:54:52 2009
From: david.winter at gmail.com (David Winter)
Date: Tue, 08 Sep 2009 10:54:52 +1200
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>	
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>	
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>	
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
Message-ID: <4AA58F3C.6080200@student.otago.ac.nz>

Hi Peter and all,

Sorry for the lack of communication from me on this. I successfully made it 
off the grid for the weekend then found I couldn't push to github from 
work (no ssh over the proxy for students) and couldn't email the list 
from home (can't use the uni's SMTP from off campus ) - IT-security 
catch 22!

> I see you've switched to _EmbossCommandLine - fine.
>
>   
Yeah, this was my stupid fault - you'd given me a heads up about the two 
different versions of the _EmbossCommandLine and I tried out what I 
already had with the 'normal' version and saw that it failed but 
didn't read the error message properly (of course it failed because I 
was trying to give it the outfile parameter twice...)

> [... snip the other things you asked about...]
>
>
> I have checked this into CVS, but did also fix an old typo (in a docstring)
> and one new typo (in an argument name). Thanks David!
>
> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
> based on test_Emboss.py? Continuing on the github branch is fine.
>   
Sounds good, will have a go at getting something going in the next 
couple of days

> We should put you in the CONTRIB file now too (are there any other
> recent people we've missed?). Would you like to give a webpage, or
> is this email address fine (be warned it may get harvested for spam)?
>
>   
Well, I'm not sure it's much of a contribution from me, but thanks :) 
Perhaps add david.winter at gmail.com - gmail seems to handle spam pretty 
well and I won't be a student here for ever (right?...)


Cheers,
David

From biopython at maubp.freeserve.co.uk  Tue Sep  8 05:21:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 10:21:11 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <320fb6e00909080221m7377f033ue9b1617b0bc38f5b@mail.gmail.com>

On Mon, Sep 7, 2009 at 11:54 PM, David Winter<david.winter at gmail.com> wrote:
> Hi Peter and all,
>
> Sorry for the lack of communication from me on this. I successfully made it off
> the grid for the weekend then found I couldn't push to github from work (no
> ssh over the proxy for students) and couldn't email the list from home
> (can't use the uni's SMTP from off campus ) - IT-security catch 22!

Tricky.

>> I see you've switched to _EmbossCommandLine - fine.
>
> Yeah, this was my stupid fault - you'd given me a heads up about the two
> different versions of the _EmbossCommandLine and I tried out what I already
> had with the 'normal' version and saw that it failed but didn't read the
> error message properly (of course it failed because I was trying to give it
> the outfile parameter twice...)

OK - I wondered if there was some other reason I couldn't see, so worth
checking.

>> I have checked this into CVS, but did also fix an old typo (in a
>> docstring) and one new typo (in an argument name). Thanks David!
>>
>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>
> Sounds good, will have a go at getting something going in the next
> couple of days

Great - whenever you get time. Thanks!

>> We should put you in the CONTRIB file now too (are there any other
>> recent people we've missed?). Would you like to give a webpage, or
>> is this email address fine (be warned it may get harvested for spam)?
>
> Well, I'm not sure it's much of a contribution from me, but thanks :)

But I'm expecting more in future *grin*

> Perhaps add david.winter at gmail.com - gmail seems to handle spam
> pretty well and I won't be a student here for ever (right?...)

There is always a postdoc ;)

Also can someone remind me at some point that we should include
at least one of the EMBOSS PHYLIP tools in the alignment command
line bit of the tutorial...

Peter

From chapmanb at 50mail.com  Tue Sep  8 08:14:05 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 8 Sep 2009 08:14:05 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
	<320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
Message-ID: <20090908121405.GF63266@sobchak.mgh.harvard.edu>

Hi Peter;

[... callback function for specifying an ID ...]
> > Did your callback function get given the "title string" and return
> > the desired key?
> >
> > I had wondered about this, but the only way for this to be general
> > (to work on all file formats) is for the callback function to be given
> > a SeqRecord object - which means having to fully parse the file
> > during the indexing, which ends up being *much* slower. We can
> > do this if you think it adds a lot of utility i.e. mimic the key_function
> > argument we already have on Bio.SeqIO.to_dict()
> 
> A less flexible option is a callback function which maps the default
> record.id to a new key. This would solve your NCBI FASTA issue,
> and might be handy in other settings (e.g. removing the version
> suffix in GenBank identifiers). However, it would not allow for
> example switching to a completely different identifier (e.g. the GI
> number) which is present elsewhere in the file.
> 
> The point is we can support this kind of limited key_function
> without suffering the severe speed penalty which doing a full
> parse to give SeqRecord objects would impose.

This is a great compromise. You're right, parsing the SeqRecord is too
much, and allowing manipulation of the default identifier would work fine.
If people need to do something much more complicated to get the ID
they would probably be better off extending the existing classes and
writing a custom indexer that pulls the IDs they need.
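
A minimal sketch of the kind of limited key function discussed above,
assuming the default identifier is the first word of the FASTA title line.
The function names here are hypothetical and are not part of Biopython; each
one just maps the default id string to a new key, so no SeqRecord needs to be
built while indexing.

```python
def accession_from_ncbi_id(record_id):
    """Map an NCBI style id such as 'gi|83029091|ref|XM_357633.3|'
    to its accession ('XM_357633.3'). Other ids are returned unchanged."""
    parts = record_id.split("|")
    # Expected layout: gi | <GI number> | <database> | <accession.version> |
    if len(parts) >= 4 and parts[0] == "gi" and parts[3]:
        return parts[3]
    return record_id


def strip_version(record_id):
    """Drop a trailing '.N' version suffix, e.g. 'XM_357633.3' -> 'XM_357633'."""
    base, dot, version = record_id.rpartition(".")
    if dot and version.isdigit():
        return base
    return record_id


if __name__ == "__main__":
    print(accession_from_ncbi_id("gi|83029091|ref|XM_357633.3|"))
    print(strip_version("XM_357633.3"))
```

A function like this only ever sees the default identifier, so it cannot
switch to a key that lives elsewhere in the record, which is the trade-off
described above.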

Brad

From biopython at maubp.freeserve.co.uk  Tue Sep  8 09:22:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:22:35 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090908121405.GF63266@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
	<320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
	<20090908121405.GF63266@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:14 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [... callback function for specifying an ID ...]
>
>> A less flexible option is a callback function which maps the default
>> record.id to a new key. This would solve your NCBI FASTA issue,
>> and might be handy in other settings (e.g. removing the version
>> suffix in GenBank identifiers). However, it would not allow for
>> example switching to a completely different identifier (e.g. the GI
>> number) which is present elsewhere in the file.
>>
>> The point is we can support this kind of limited key_function
>> without suffering the severe speed penalty which doing a full
>> parse to give SeqRecord objects would impose.
>
> This is a great compromise. You're right, parsing the SeqRecord is too
> much, and allowing manipulation of the default identifier would work fine.

Cool - done in CVS, including the docstring and the tutorial.

> If people need to do something much more complicated to get the ID
> they would probably be better off extending the existing classes and
> writing a custom indexer that pulls the IDs they need.

Certainly - we can't expect to cover every possible use case, and
trying to do so will result in an overly complicated API.

Did you have any ideas for a better name than Bio.SeqIO.indexed_dict()?

Peter

From mjldehoon at yahoo.com  Tue Sep  8 09:30:30 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Sep 2009 06:30:30 -0700 (PDT)
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
Message-ID: <184931.66541.qm@web62403.mail.re1.yahoo.com>

--- On Tue, 9/8/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Did you have any ideas for a better name than
> Bio.SeqIO.indexed_dict()?
> 
Is indexed_dict a function? If so, I suggest we use a verb instead of a noun. Maybe just "index"?

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Sep  8 09:53:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:53:36 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <184931.66541.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>

On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon<mjldehoon at yahoo.com> wrote:
> --- On Tue, 9/8/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Did you have any ideas for a better name than
>> Bio.SeqIO.indexed_dict()?
>
> Is indexed_dict a function? If so, I suggest we use a verb instead
> of a noun. Maybe just "index"?
>
> --Michiel.

Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
object. So yes, a verb would be better, and "index" is short and sweet.

Peter

From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 09:24:41 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 09:24:41 -0400
Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be
	deepcopied
In-Reply-To: <bug-2781-42@http.bugzilla.open-bio.org/>
Message-ID: <200909091324.n89DOf4Q013555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2781


klaus.kopec at tuebingen.mpg.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME


------- Comment #2 from klaus.kopec at tuebingen.mpg.de  2009-09-09 09:24 EST -------
this seems to be resolved in 1.51 with Python 2.6.2 under 64Bit Ubuntu?


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 11:18:01 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:01 -0400
Subject: [Biopython-dev] [Bug 2910] New: Parsing some pdb files results in
	shorter peptide sequences than expected
Message-ID: <bug-2910-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

           Summary: Parsing some pdb files results in shorter peptide
                    sequences than expected
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: schafer at rostlab.org


Parsing the one-letter sequence for a specific chain out of a given pdb file
often seems to result in shorter sequences than expected. 

The following code demonstrates this behavior for structure 1a2d chain A.
Amino acid #118 (VAL), after the HETATM (#117) block, is missing from the result.

------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())

print(sequence)
------------------CODE----------------

Another example is structure 13gs, chains C and D. Both sequences are ECG, but
the code above returns only CG.
So this behavior seems to be independent of whether a HETATM block is present.
This bug is also present in version 1.51.



From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 11:18:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:48 -0400
Subject: [Biopython-dev] [Bug 2910] Parsing some pdb files results in
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909091518.n89FImn5016415@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


schafer at rostlab.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schafer at rostlab.org



From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 08:55:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:55:03 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101255.n8ACt3Jd017456@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
            Summary|Parsing some pdb files      |Bio.PDB build_peptides
                   |results in shorter peptide  |sometimes gives shorter
                   |sequences than expected     |peptide sequences than
                   |                            |expected


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:55 EST -------
Retitled as this appears to be a bug in the PPBuilder build_peptides method,
not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code
parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.

Chain A
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Chain B
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Notice there are discontinuities in both chains A and B, and a missing residue
in their peptides.

And the output from 13GS,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948.
PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033.

Chain A
Raw chain:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain B
Raw chain:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
From peptide builder:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain C
Raw chain:
ECG
From peptide builder:
CG

Chain D
Raw chain:
ECG
From peptide builder:
CG

Notice there are discontinuities in chains A, B and C, but missing residues in
the peptide chains C and D. This suggests the discontinuities are not required
to trigger the problem. Also there are no HETATM residues for chains C and D.
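
The raw chain versus peptide builder comparison in the output above can be
automated. This is only a sketch using the standard library difflib module;
the function name dropped_residues is hypothetical, and the reported positions
are zero-based offsets into the raw one-letter string, not PDB residue
numbers.

```python
from difflib import SequenceMatcher


def dropped_residues(raw_seq, built_seq):
    """Return (offset, residues) for each stretch of raw_seq that has no
    counterpart in built_seq."""
    matcher = SequenceMatcher(None, raw_seq, built_seq, autojunk=False)
    dropped = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete" means raw_seq[i1:i2] is absent from built_seq;
        # "replace" means it was substituted by something else.
        if tag in ("delete", "replace"):
            dropped.append((i1, raw_seq[i1:i2]))
    return dropped


if __name__ == "__main__":
    # Chains C and D of 13GS as shown in the output above.
    print(dropped_residues("ECG", "CG"))
```

Running this on the two sequences printed for each chain shows which residue
the peptide builder left out without comparing long strings by eye.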



From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 08:57:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:13 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
	assertion in CondonTable Fix+Patch
In-Reply-To: <bug-2894-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvDe1017562@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:57 EST -------
I'm marking this as a duplicate of bug 2887, and believe it to be fixed on the
trunk.

*** This bug has been marked as a duplicate of bug 2887 ***



From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 08:57:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:16 -0400
Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in
	Bio.Data.CodonTable
In-Reply-To: <bug-2887-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvGRn017574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2887


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kellrott at ucsd.edu


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:57 EST -------
*** Bug 2894 has been marked as a duplicate of this bug. ***



From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 08:57:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:20 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvKL9017592@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


Bug 2895 depends on bug 2894, which changed state.

Bug 2894 Summary: Jython List difference causes failed assertion in CondonTable Fix+Patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2894

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE



From biopython at maubp.freeserve.co.uk  Tue Sep 15 09:51:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 14:51:43 +0100
Subject: [Biopython-dev] Another Biopython release?
Message-ID: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>

Hi all,

Looking ahead, Tiago has some population genetics code he hopes to
merge into the trunk at the end of the month (or in October), and we
still have Brad's GFF stuff, my SFF work, Kristian's RNA code, Kyle's
misc suggestions, and perhaps most importantly the phylogenetics
GSoC work to consider.

I know it's been only a month since we released Biopython 1.51, but
does anyone (other than me) think that we already have enough done
to warrant another release? The associated CVS freeze would also
serve as a good break point for moving to github (see other threads).

Here is what we have in the NEWS file at the moment:

<quote>
New helper functions Bio.SeqIO.convert() and Bio.AlignIO.convert() allow an
easier way to use Biopython for simple file format conversions. Additionally,
these new functions allow Biopython to offer important file format specific
optimisations (e.g. FASTQ to FASTA, and interconverting FASTQ variants).

New function Bio.SeqIO.indexed_dict() allows indexing of most sequence file
formats (but not alignment file formats), allowing dictionary like random
access to all the entries in the file as SeqRecord objects, keyed on the
record id. This is especially useful for very large sequencing files, where
all the records cannot be held in memory at once. This supplements the more
flexible but memory demanding Bio.SeqIO.to_dict() function.

Bio.SeqIO can now write "phd" format files (used by PHRED, PHRAP and
CONSED), allowing interconversion with FASTQ files, or FASTA+QUAL files.

Bio.Emboss.Applications now includes wrappers for the "new" PHYLIP EMBASSY
package (e.g. fneighbor) which replace the "old" PHYLIP EMBASSY package
(e.g. efneighbor) whose Biopython wrappers are now obsolete.

See also the DEPRECATED file, as several old deprecated modules have finally
been removed (e.g. Bio.EUtils which had been replaced by Bio.Entrez).
</quote>

[As an aside - Cymon and David - do you want to be named in the NEWS
file for the PHD and PHYLIPNEW stuff?]

We're still debating the name of the new function Bio.SeqIO.indexed_dict(),
but I am happy with the code (and new documentation) otherwise. The
related extension to add indexing via a lookup file or an SQLite
database is another big chunk of work which I don't have time for at the
moment, but the code already in CVS is still extremely useful as is.
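
To illustrate the idea for anyone who has not looked at the code: this is a
simplified sketch, not the implementation in CVS. It handles plain FASTA only,
keeps just the file offset of each record in memory, and reads a record from
disk when asked. The names build_fasta_index and fetch_fasta_record are
hypothetical, and the optional key_function takes the default id string, as
discussed in the indexing thread.

```python
def build_fasta_index(path, key_function=None):
    """Scan a FASTA file once and return a dict mapping record key to the
    byte offset of its '>' title line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if not fields:
                    raise ValueError("Empty title line at offset %i" % offset)
                key = fields[0].decode("ascii")  # default id: first word
                if key_function is not None:
                    key = key_function(key)
                if key in index:
                    raise ValueError("Duplicate key %r" % key)
                index[key] = offset
    return index


def fetch_fasta_record(path, offset):
    """Read one record starting at offset; return (title, sequence)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        title = handle.readline()[1:].strip().decode("ascii")
        seq_lines = []
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            seq_lines.append(line.strip().decode("ascii"))
    return title, "".join(seq_lines)
```

The memory cost is one key and one integer per record, which is why this
scales to files whose records cannot all be held in memory at once.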

Again, I'm biased, but I think the Bio.SeqIO.convert(...) function will be
a popular addition for its convenience, but especially valuable for anyone
wanting to convert between the different FASTQ variants, where the optimised
conversion code gives a big speed-up.
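
As a rough illustration of why a format specific conversion can be fast (a
sketch only, not the Bio.SeqIO.convert() code): going from FASTQ to FASTA
never needs the quality scores decoded, so the text can be copied straight
through. The function name fastq_to_fasta is hypothetical, and the sketch
assumes simple four-line FASTQ records with no line wrapping.

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Copy titles and sequences from four-line FASTQ records to FASTA.
    Returns the number of records written."""
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        in_handle.readline()  # quality line: read and discard, never parsed
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), seq.rstrip()))
        count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO("@read1 example\nACGT\n+\nIIII\n")
    target = io.StringIO()
    print(fastq_to_fasta(source, target))
    print(target.getvalue())
```

No SeqRecord objects are created and the quality string is never interpreted,
which is where the saving over a generic parse and write comes from.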

Does doing another quick release (say at some point next week) sound
like a good plan? If people like the idea, then getting some extra testing
in now would be great - especially on the new stuff (it has unit tests of
course, but real world usage is also important - thanks Brad for already
trying out the FASTA indexing).

Peter

From bartek at rezolwenta.eu.org  Tue Sep 15 10:59:43 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 15 Sep 2009 16:59:43 +0200
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
Message-ID: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>

On Tue, Sep 15, 2009 at 3:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I know it's been only a month since we released Biopython 1.51, but
> does anyone (other than me) think that we already have enough done
> to warrant another release? The associated CVS freeze would also
> serve as a good break point for moving to github (see other threads).
>

That would be great. As for the move to github, I've added some (quite
preliminary) docs for developers on how to make commits to the main
branch using git and github to the wiki:
http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

Any comments and/or improvements are most welcome.

cheers
  Bartek

From tiagoantao at gmail.com  Tue Sep 15 11:29:55 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 15 Sep 2009 16:29:55 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
Message-ID: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>

On Tue, Sep 15, 2009 at 2:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> Looking ahead, Tiago has some population genetics code he hopes to

I can put my stuff in CVS (plus I have docs). Question: CVS is still
"the place". Right?

I just need to test stuff on Windows. All the rest seems ok.

Tiago

From biopython at maubp.freeserve.co.uk  Tue Sep 15 11:35:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:35:13 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
Message-ID: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
> On Tue, Sep 15, 2009 at 2:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> Looking ahead, Tiago has some population genetics code he hopes to
>
> I can put my stuff in CVS (plus I have docs). Question: CVS is still
> "the place". Right?
>
> I just need to test stuff on Windows. All the rest seems ok.

Yes, for the short term CVS is still the master repository. If you
have that stuff ready to check in now, then sure - go ahead.
I was assuming you didn't expect to have this ready just yet,
hence the proposal to sneak out a quick release first ;)

Give me a shout and I'll get my Windows test machine up
and running to double check the unit tests there.

Maybe we'll push back the "next week" idea a bit ;)

Peter


From eric.talevich at gmail.com  Tue Sep 15 11:38:45 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 15 Sep 2009 11:38:45 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>

>
> On Tue, Sep 15, 2009 at 3:51 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > Hi all,
> >
> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).
> >
>

Sounds good to me. Completing the Git migration would make it much easier
for me to maintain the Tree/TreeIO stuff, since I already have a few local
branches based on it that an upstream CVS duplication would mangle.


On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski <
bartek at rezolwenta.eu.org> wrote:

> That would be great. As for the move to github, I've added some (quite
> preliminary) docs for developers on how to make commits to the main
> branch using git and github to the wiki:
> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
>
The setup here for committers looks potentially different from the setup in
"Merging upstream changes" (describing read-only tracking), but also
potentially similar. Diff:
- The github:biopython/biopython repository is called "official" here, but
"upstream" there. Different protocol too, but that's intentional.
- It also shows how to treat the upstream/official repo as the origin,
CVS-style. This would mean the developer doesn't have a separate GitHub fork
to use for personal branches, uncertain commits, etc. that don't belong in
the main repo.

Maybe a good way to organize the page would be in terms of how you want to
use the repo:

1. Tracking Biopython with raw Git (without signing up for GitHub)
   - git clone http://.../biopython/biopython
   - remote: upstream
   - how to format a patch and submit on Bugzilla

2. Tracking Biopython on GitHub (e.g. occasional contributors)
   - sign up, click the "fork" button
   - git clone http://.../your-name-here/biopython
   - remotes: origin, upstream
   - how to submit a pull request on GitHub
   - how to add, manage and delete branches locally and on GitHub

3. Collaborating
   - either #1 or #2 is fine
   - how to add and manage more remotes
   - how to apply Git patches, and why copy/paste kills kittens the next
     time you merge

4. Committing to Biopython
   - same as #2, but use the private URL for the "upstream" remote
   - remotes: origin, upstream
   - policy on pushing upstream, code reviews, tagging, etc.


Cheers,
Eric

From tiagoantao at gmail.com  Tue Sep 15 11:39:07 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 15 Sep 2009 16:39:07 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
Message-ID: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>

2009/9/15 Peter <biopython at maubp.freeserve.co.uk>:
> Give me a shout and I'll get my Windows test machine up
> and running to double check the unit tests there.

I think I am not in the mood to impose the burden on you. I will find
a Windows machine and test it myself.

> Maybe we'll push back the "next week" idea a bit ;)

I am OK with "next week". But as I said two months ago, I have
calendarized the extension of Bio.PopGen to October. So the material
can go on the next release after the one on "next week".

I just want to have lots of free time and little travel to be able to
assist potential users (as I intend to announce the new content to the
evolutionary biology crowd quite a lot)

-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard

From biopython at maubp.freeserve.co.uk  Tue Sep 15 11:48:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:48:43 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
Message-ID: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
> 2009/9/15 Peter <biopython at maubp.freeserve.co.uk>:
>> Give me a shout and I'll get my Windows test machine up
>> and running to double check the unit tests there.
>
> I think I am not in the mood to impose the burden on you. I will find
> a Windows machine and test it myself.

I was just going to turn on the machine, update to the latest
CVS, and do a compile/test with Python 2.4, 2.5, 2.6 - It's no
extra effort, as I would be doing this anyway for a new release.

Unless of course you are adding wrappers for more command
line tools, which would ideally require me to install them - that
I might leave for another day ;)

>> Maybe we'll push back the "next week" idea a bit ;)
>
> I am OK with "next week". But as I said two months ago, I have
> calendarized the extension of Bio.PopGen to October. So the material
> can go on the next release after the one on "next week".
>
> I just want to have lots of free time and little travel to be able to
> assist potential users (as I intend to announce the new content to the
> evolutionary biology crowd quite a lot)

If you are happy to merge the code this week (via CVS), and
confident it is ready to release, then I could do the release
next week, and then we move to git.

Or, I can do the release next week, we move to git, and then
you can merge the new code (via git) at your leisure (Oct).

Either plan is fine with me. Which do you prefer?

Peter


From tiagoantao at gmail.com  Tue Sep 15 11:57:17 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 15 Sep 2009 16:57:17 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
	<320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
Message-ID: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>

> Unless of course you are adding wrappers for more command
> line tools, which would ideally require me to install them - that
> I might leave for another day ;)

Spot on ;) .

> If you are happy to merge the code this week (via CVS), and
> confident it is ready to release, then I could do the release
> next week, and then we move to git.

I will only be able to test the code on Windows tomorrow, if I can get
hold of the machine (which I should).

> Either plan is fine with me. Which do you prefer?


I prefer merging on CVS, I am still much more proficient with it. You
should have the merge there on Friday morning when you arrive.
Tutorial included.

Tiago


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard

From biopython at maubp.freeserve.co.uk  Tue Sep 15 12:09:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 17:09:32 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
	<320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
	<6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>
Message-ID: <320fb6e00909150909x2f45e0f5g6c4da77eafcd9a49@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
>> Unless of course you are adding wrappers for more command
>> line tools, which would ideally require me to install them - that
>> I might leave for another day ;)
>
> Spot on ;) .

OK.

>> If you are happy to merge the code this week (via CVS), and
>> confident it is ready to release, then I could do the release
>> next week, and then we move to git.
>
> I will only be able to test the code on Windows tomorrow, if
> I can get hold of the machine (which I should).

Fingers crossed this doesn't throw any surprises at you.

>> Either plan is fine with me. Which do you prefer?
>
> I prefer merging on CVS, I am still much more proficient with it. You
> should have the merge there on Friday morning when you arrive.
> Tutorial included.

OK then :)

Peter


From bartek at rezolwenta.eu.org  Tue Sep 15 15:45:22 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 15 Sep 2009 21:45:22 +0200
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
Message-ID: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>

On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Sounds good to me. Completing the Git migration would make it much easier
> for me to maintain the Tree/TreeIO stuff, since I already have a few local
> branches based on it that an upstream CVS duplication would mangle.
>
Then maybe we should wait with committing your changes until the time
we drop CVS, in order to avoid loss of change history in your code...
What do you think, Peter?

>
> On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski <
> bartek at rezolwenta.eu.org> wrote:
>
>> That would be great. As for the move to github, I've added some (quite
>> preliminary) docs for developers on how to make commits to the main
>> branch using git and github to the wiki:
>> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>>
>>
> The setup here for committers looks potentially different from the setup in
> "Merging upstream changes" (describing read-only tracking), but also
> potentially similar. Diff:
> - The github:biopython/biopython repository is called "official" here, but
> "upstream" there. Different protocol too, but that's intentional.

Yes, indeed. I know this might seem strange but I was trying to
deliberately make the distinction between the main repository in
read-write mode (official) and in read-only mode (upstream). I would
keep it like this at least for a while so that the transition from CVS
is as easy as possible. We have quite a few developers who are new to
git and comfortable with CVS.

> - It also shows how to treat the upstream/official repo as the origin,
> CVS-style.
Yes, exactly.
> This would mean the developer doesn't have a separate GitHub fork
> to use for personal branches, uncertain commits, etc. that don't belong in
> the main repo.
Not necessarily. It just means that these two roles are separate: a
developer can (but does not have to) have his own branch of biopython
tree where he/she makes the changes, but this is not directly linked
to the official (read-write) biopython branch. I know it's  not
necessarily the best way to use github, but I would like to avoid
confusing people who are used to CVS. That's why I decided to describe
the role of developer with read-write access differently.

BTW, I would see the role of the GitUsage wiki page as a guide rather
than a law. That means that if someone understands better how to use
git and github, and does not get lost having both local and remote
branches with different origins, I'm absolutely fine with this.
But I think it is quite complicated, especially for people new to git.

So, in summary, my idea was to (currently) recommend somewhat CVS-like
usage of git on the main branch, which would be simple for people to
use at first and encourage them to  create their own branches and do
development on them.

>
> Maybe a good way to organize the page would be in terms of how you want to
> use the repo:
>
> 1. Tracking Biopython with raw Git (without signing up for GitHub)
>    - git clone http://.../biopython/biopython
>    - remote: upstream
>    - how to format a patch and submit on Bugzilla
>
> 2. Tracking Biopython on GitHub (e.g. occasional contributors)
>    - sign up, click the "fork" button
>    - git clone http://.../your-name-here/biopython
>    - remotes: origin, upstream
>    - how to submit a pull request on GitHub
>    - how to add, manage and delete branches locally and on GitHub
>
> 3. Collaborating
>    - either #1 or #2 is fine
>    - how to add and manage more remotes
>    - how to apply Git patches, and why copy/paste kills kittens the next
>      time you merge
>
> 4. Committing to Biopython
>    - same as #2, but use the private URL for the "upstream" remote
>    - remotes: origin, upstream
>    - policy on pushing upstream, code reviews, tagging, etc.
>
>

Having such documentation would be nice. I think that it is currently
structured more or less like that (now we just don't have #1 and #4
currently recommends a very simple CVS-like usage). I think that
adding #1 and putting in place policies on how to submit patches would
be great. For #4 I would vote for recommending (at least for a while)
the CVS-like way, but I'm absolutely for the development of the
alternative procedure, where the developer works with a single repo
both on his code and on official branch.

I don't want to underestimate the git skills of our current
developers, but so far I think only a few people have gotten their
github accounts, which means the simpler we keep it the better (at
least for a while). I certainly hope that people will get used to git
quickly, but I would like to make the initial change for people who will
be switching from CVS to git as simple as possible.


cheers
  Bartek


From biopython at maubp.freeserve.co.uk  Tue Sep 15 16:25:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 21:25:00 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
	<8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
Message-ID: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>

On Tue, Sep 15, 2009 at 8:45 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
>> Sounds good to me. Completing the Git migration would make it much easier
>> for me to maintain the Tree/TreeIO stuff, since I already have a few local
>> branches based on it that an upstream CVS duplication would mangle.
>
> Then maybe we should wait with committing your changes to the
> time we drop CVS, in order to avoid loss of change history in your
> code... What do you think, Peter?

Yes, I was suggesting getting a final CVS release out soon,
and then look at merging all the new stuff (including Eric's
tree stuff) starting to pile up on github.

I knew Tiago had a lump of code ready to go, and as we have
just discussed, he would prefer to check that in via CVS.
So, Tiago will do that (this Friday), then we'll do the final CVS
release next week, and then switch to git - and start to focus
on merging in new stuff.

Peter


From chapmanb at 50mail.com  Wed Sep 16 08:34:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 Sep 2009 08:34:07 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <20090916123407.GE13500@sobchak.mgh.harvard.edu>

Hi Peter;

> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).

I don't have a strong opinion about the release. It seems a little
early but if you think we are ready, go for it.

I have tested Osvaldo's Novoalign commandline object and have it
ready to get in. Right now it's in a git tree but I can move it
over to a CVS tree and integrate it for the release. It'll live in
Bio/Sequencing/Applications like you suggested. I should be able to
do that this evening.

I am all about the move to Git and GitHub. Anything we can do to
finish that off and make it official is cool by me.

> That would be great. As for the move to github, I've added some (quite
> preliminary) docs for developers on how to make commits to the main
> branch using git and github to the wiki:
> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

This is looking great. I'd agree with Eric that we should be
consistent in the doc for suggestions on naming the official 
biopython branch:

git remote add upstream git://github.com/biopython/biopython.git
git remote add official git at github.com:biopython/biopython.git

My vote is for the "official" naming which is a little more
specific.

Great stuff,
Brad

From biopython at maubp.freeserve.co.uk  Wed Sep 16 09:30:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 14:30:47 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <20090916123407.GE13500@sobchak.mgh.harvard.edu>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<20090916123407.GE13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909160630o4dc1379dwaba667ed13ed9bde@mail.gmail.com>

On Wed, Sep 16, 2009 at 1:34 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Peter;
>
>> > I know it's been only a month since we released Biopython 1.51, but
>> > does anyone (other than me) think that we already have enough done
>> > to warrant another release? The associated CVS freeze would also
>> > serve as a good break point for moving to github (see other threads).
>
> I don't have a strong opinion about the release. It seems a little
> early but if you think we are ready, go for it.

OK.

> I have tested Osvaldo's Novoalign commandline object and have it
> ready to get in. Right now it's in a git tree but I can move it
> over to a CVS tree and integrate it for the release. It'll live in
> Bio/Sequencing/Applications like you suggested. I should be able to
> do that this evening.

Go for it - I presume you have it in a private git repository at the
moment, as I couldn't spot it on github?

> I am all about the move to Git and GitHub. Anything we can do to
> finish that off and make it official is cool by me.
>
>> That would be great. As for the move to github, I've added some (quite
>> preliminary) docs for developers on how to make commits to the main
>> branch using git and github to the wiki:
>> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
> This is looking great. I'd agree with Eric that we should be
> consistent in the doc for suggestions on naming the official
> biopython branch:
>
> git remote add upstream git://github.com/biopython/biopython.git
> git remote add official git at github.com:biopython/biopython.git
>
> My vote is for the "official" naming which is a little more
> specific.

Well, both "official" and "upstream" have merit. I don't mind which,
but it does make sense to be consistent.

Peter

From biopython at maubp.freeserve.co.uk  Wed Sep 16 09:48:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 14:48:39 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
	<320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
Message-ID: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>

On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote:
> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote:
>>  On Tue, 9/8/09, Peter wrote:
>>> Did you have any ideas for a better name than
>>> Bio.SeqIO.indexed_dict()?
>>
>> Is indexed_dict a function? If so, I suggest we use a verb instead
>> of a noun. Maybe just "index"?
>
> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
> object. So yes, a verb would be better, and "index" is short and sweet.

Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
to Bio.SeqIO.index() for the next release.

Thinking ahead, in addition to the current code (indexing a file, keeping
the index in memory) we might in future want to add something like
Bio.SeqIO.sqlite_index() where the index is kept in a database etc.
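
To make the "index in memory, records on disk" idea concrete, here is a
minimal sketch. It is not the Bio.SeqIO code, just an illustration using
the standard library. It assumes a FASTA file opened in binary mode, and
the names build_offset_index, fetch_record and key_function are made up
for this example. Only the file offsets are kept in memory; each record
is re-read from disk when asked for.

```python
import io


def build_offset_index(handle, key_function=None):
    """Return a dict mapping each FASTA record id to its file offset.

    Hypothetical helper for illustration only. The handle must be opened
    in binary mode so that tell() and seek() agree on byte offsets.
    """
    index = {}
    while True:
        offset = handle.tell()  # position of the line we are about to read
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Default key: first word of the title line (assumed present).
            key = line[1:].split(None, 1)[0].decode("ascii")
            if key_function is not None:
                key = key_function(key)
            if key in index:
                raise ValueError("Duplicate key %r" % key)
            index[key] = offset
    return index


def fetch_record(handle, index, key):
    """Seek to one record and return (title, sequence) as strings."""
    handle.seek(index[key])
    title = handle.readline()[1:].decode("ascii").strip()
    parts = []
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        parts.append(line.decode("ascii").strip())
    return title, "".join(parts)


# Toy data standing in for a large file on disk.
example = io.BytesIO(
    b">gi|1|ref|A.1| first\nACGT\nAC\n"
    b">gi|2|ref|B.2| second\nGG\n"
)
# Index by accession only, i.e. the fourth "|" separated field.
index = build_offset_index(example, key_function=lambda k: k.split("|")[3])
title, sequence = fetch_record(example, index, "B.2")
```

A database-backed variant would presumably store the same key to offset
mapping in a table instead of a dict, leaving the fetch step unchanged.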

Peter

From bugzilla-daemon at portal.open-bio.org  Wed Sep 16 18:00:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Sep 2009 18:00:59 -0400
Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign
In-Reply-To: <bug-2904-42@http.bugzilla.open-bio.org/>
Message-ID: <200909162200.n8GM0x7d006226@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2904


chapmanb at 50mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from chapmanb at 50mail.com  2009-09-16 18:00 EST -------
Osvaldo;
Thanks much for the submission. This is committed and lives in:

Bio/Sequencing/Applications

to create a namespace for future sequencing related commandlines. You can
import with:

from Bio.Sequencing.Applications import NovoalignCommandline

It would be great if you wanted to add a cookbook example of using it
(http://biopython.org/wiki/Category:Cookbook) based on a simple pipeline.
Perhaps something involving downstream parsing of the novoalign format, or
converted to SAM as you suggested in Bug 2905.

Thanks,
Brad


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From tiagoantao at gmail.com  Wed Sep 16 18:53:31 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 16 Sep 2009 23:53:31 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
	<8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
	<320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>
Message-ID: <6d941f120909161553l3f9bae6u5ba45e6cde9b33e3@mail.gmail.com>

Hi,

> I knew Tiago had a lump of code ready to go, and as we have
> just discussed, he would prefer to check that in via CVS.


I just tested my stuff on Windows.
It worked at first attempt. Strange...
I actually have a few tests (18 to be precise). They all passed at first.
Murphy's laws took a once-in-a-life vacation.

I still have a minor problem. I will not have time to update the
Tutorial before Tuesday. All is written in
http://biopython.org/wiki/PopGen_dev_Genepop , which will mostly
become the tutorial. But I simply don't have time before Tuesday to
transpose it.

Code and tests will be committed today.

Tiago

From krother at rubor.de  Thu Sep 17 04:40:28 2009
From: krother at rubor.de (Kristian Rother)
Date: Thu, 17 Sep 2009 10:40:28 +0200
Subject: [Biopython-dev] Another Biopython release?
Message-ID: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>


Hi Peter,

I could prepare 2-3 exemplary modules for parsing secondary structures +
tests for the Bio.RNA package. As I've been using GIT so far, it would be
most convenient to stick with it and contribute when the main archive has
migrated. Or is it easy to "jump" to CVS on the last possible occasion?

Best,
   Kristian


From biopython at maubp.freeserve.co.uk  Thu Sep 17 05:17:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:17:37 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>
References: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>
Message-ID: <320fb6e00909170217j24bab86eqae45440f72ed415e@mail.gmail.com>

On Thu, Sep 17, 2009 at 9:40 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> I could prepare 2-3 exemplary modules for parsing secondary structures +
> tests for the Bio.RNA package. As I've been using GIT so far, it would be
> most convenient to stick with it and contribute when the main archive has
> migrated. Or is it easy to "jump" to CVS on the last possible occasion?
>
> Best,
>    Kristian

My plan for this "quick release" was to mark an end to the CVS era, and
not to include any of the really new stuff (like your code), but to wait until
we are on git before looking at it. So keep it in git for now - this should
also make the merge easier.

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 07:27:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 12:27:24 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
	<320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
	<320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>
Message-ID: <320fb6e00909170427o37813aa7kd86464d9c8e81b36@mail.gmail.com>

On Wed, Sep 16, 2009 at 2:48 PM, Peter wrote:
> On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote:
>> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote:
>>>  On Tue, 9/8/09, Peter wrote:
>>>> Did you have any ideas for a better name than
>>>> Bio.SeqIO.indexed_dict()?
>>>
>>> Is indexed_dict a function? If so, I suggest we use a verb instead
>>> of a noun. Maybe just "index"?
>>
>> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
>> object. So yes, a verb would be better, and "index" is short and sweet.
>
> Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
> to Bio.SeqIO.index() for the next release.

Done in CVS.

> Thinking ahead, in addition to the current code (indexing a file, keeping
> the index in memory) we might in future want to add something like
> Bio.SeqIO.sqlite_index() where the index is kept in a database etc.

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 08:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 13:02:18 +0100
Subject: [Biopython-dev] Using PendingDeprecation for obsolete modules
Message-ID: <320fb6e00909170502m14b4e599l66c778bfe67f3625@mail.gmail.com>

Hi all,

Right now we have a deprecation process which usually looks like this:

(1) Label as obsolete in docstrings
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

See: http://biopython.org/wiki/Deprecation_policy

I've relatively recently noticed the PendingDeprecationWarning warning
(added in Python 2.3), which is by default silent, but the user can choose
to enable it with the python command line switch -W. For example,

$ python
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
>>>

So, by default, no warning message. But if you ask for them:

$ python -W all
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
__main__:1: PendingDeprecationWarning: X is obsolete
>>>

So, I'm thinking what we should be doing for deprecating modules is:

(1) Label as obsolete in docstrings, issue PendingDeprecationWarning
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code
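
As a rough sketch of how stages (1) and (2) might look in code, and how
a unit test could check them, consider the following. The function names
are placeholders rather than anything in Biopython, and the
warnings.catch_warnings helper used here needs Python 2.6 or later.

```python
import warnings


def obsolete_function():
    """Placeholder for code at stage (1): obsolete, silent by default."""
    warnings.warn("obsolete_function is obsolete and is likely to be "
                  "deprecated in a future release",
                  PendingDeprecationWarning, stacklevel=2)
    return 42


def deprecated_function():
    """Placeholder for code at stage (2): officially deprecated."""
    warnings.warn("deprecated_function is deprecated and will be removed "
                  "in a future release",
                  DeprecationWarning, stacklevel=2)
    return 42


def collect_warnings(function):
    """Call function with all warnings enabled, return categories seen."""
    with warnings.catch_warnings(record=True) as caught:
        # Override the default filters, which hide pending deprecations.
        warnings.simplefilter("always")
        function()
    return [w.category for w in caught]


stage_one = collect_warnings(obsolete_function)
stage_two = collect_warnings(deprecated_function)
```

Moving a module from stage (1) to stage (2) is then a one-word change to
the warning category, plus the docstring update.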

I guess very few people know about pending deprecation warnings,
and so are unlikely to even try using the warning switch. Therefore
I have little inclination to go through all the current modules tagged as
"obsolete" just to add this silent warning.

However, if we simply start doing this in future, it really isn't any more work.

Any thoughts?

Peter

From winda002 at student.otago.ac.nz  Thu Sep 17 23:52:11 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 18 Sep 2009 15:52:11 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <4AB303EB.1010208@student.otago.ac.nz>


>>
>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>>  

Well, it didn't end up being very short but there is a test on my 
"phylo" branch (http://github.com/dwinter/biopython/tree/phylo) in  
test_PhylipNew.phy  (which uses a couple of new files in Tests/Phylip) 
that I'd welcome comments on.

Writing them actually exposed a bug in the code already in CVS, the 
FProtParsCommandline option "-intreefile" isn't mandatory so 
"is_required" should be set to 0 rather than 1. In my defence the emboss 
documentation has it listed as being both mandatory and optional.

One possibly foolish thing I did was use TreeIO to test the trees that 
came out of these programs made sense, thinking that module would be 
part of the next release. If the plan is for a new release soon and 
having a test for these wrappers is important the tests could be done 
with Nexus.Trees but I found that was difficult to use for files with 
multiple newick trees.

Cheers,
David


From biopython at maubp.freeserve.co.uk  Fri Sep 18 05:26:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 10:26:59 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
Message-ID: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>

On Fri, Sep 18, 2009 at 4:52 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>
>>>
>>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>>> based on test_Emboss.py? Continuing on the github branch is fine.
>>>
>
> Well, it didn't end up being very short but there is a test on my "phylo"
> branch (http://github.com/dwinter/biopython/tree/phylo) in
> test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that
> I'd welcome comments on.

Cool - I'll take a look and try and get (some of) it merged into CVS
for this release.

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1. In my defence the emboss documentation has
> it listed as being both mandatory and optional.

How odd. Maybe EMBOSS switched it at some point?

> One possibly foolish thing I did was use TreeIO to test the trees that came
> out of these programs made sense, thinking that module would be part of the
> next release. If the plan is for a new release soon and having a test for
> these wrappers is important the tests could be done with Nexus.Trees but I
> found that was difficult to use for files with multiple newick trees.

Hmm. In the short term we can either comment out those bits of the test
pending the inclusion of TreeIO in the next release, or add a quick tiny
parser in the test itself to load the trees, split them on the ";" and pass
them one by one to Bio.Nexus.Trees for parsing.
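
As a rough sketch of that second option (not the helper that actually
went into the unit test), splitting a multi-tree file on ";" could look
something like this. The function name split_newick_trees and the example
trees are made up for illustration, and it assumes ";" never appears
inside a quoted label:

```python
def split_newick_trees(text):
    """Split text holding several Newick trees into one string per tree.

    Assumes ";" only ever appears as the tree terminator, which should hold
    for plain PHYLIP output but not for quoted labels containing semicolons.
    """
    trees = []
    for chunk in text.split(";"):
        # Long trees may be wrapped over several lines, so join them up.
        tree = "".join(line.strip() for line in chunk.splitlines())
        if tree:
            trees.append(tree + ";")
    return trees


# Two small made-up trees, the first one wrapped over two lines.
example = "(A:0.1,B:0.2,\n(C:0.3,D:0.4):0.5);\n((A:0.1,C:0.3):0.2,B:0.2,D:0.4);\n"
tree_strings = split_newick_trees(example)
# Each string could then be handed to Bio.Nexus.Trees one at a time.
```

Keeping the splitting separate from the parsing means the same helper
would still be usable if the tests later switch over to TreeIO.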

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep 18 07:09:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:09:24 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
Message-ID: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>

Hi Michiel (et al),

I've been trying to get an example working using the Entrez history
for ELink. Strangely here the URL doesn't use history=y but instead
cmd=neighbor_history (while the default is cmd=neighbor).

However, this appears to show a bug in the Bio.Entrez parser. Consider:

from Bio import Entrez
pmid = "14630660"
print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs",
from_uid=pmid, cmd="neighbor_history").read()

This gives:

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
	<DbFrom>pubmed</DbFrom>
	<IdList>
		<Id>14630660</Id>
	</IdList>
	<LinkSetDbHistory>
		<DbTo>pmc</DbTo>
		<LinkName>pubmed_pmc_refs</LinkName>
		<QueryKey>1</QueryKey>
	</LinkSetDbHistory>
	<WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
</LinkSet>
</eLinkResult>

The XML looks reasonable by eye - although quite different from
the non-history version.

Now if instead of printing that, I try and parse it:

>>> data = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 210, in endElement
    current[name] = value
TypeError: 'str' object does not support item assignment

I can file a Biopython bug if you like, but my initial guess is
the problem lies in the XML itself versus the eLink_020511.dtd
file, which does not mention the LinkSetDbHistory element at
all. Do you agree that this looks like an NCBI problem?
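
In the meantime, here is a minimal sketch of pulling the history values
out of that XML without Bio.Entrez, using only the standard library's
ElementTree. This sidesteps the DTD-driven parser altogether, so it is
just an illustration of what the response contains and not a proposed
fix; the variable names are my own:

```python
import xml.etree.ElementTree as ET

# The neighbor_history response shown above, pasted in as a string.
xml_text = """<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
        <Id>14630660</Id>
    </IdList>
    <LinkSetDbHistory>
        <DbTo>pmc</DbTo>
        <LinkName>pubmed_pmc_refs</LinkName>
        <QueryKey>1</QueryKey>
    </LinkSetDbHistory>
    <WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
</LinkSet>
</eLinkResult>
"""

# ElementTree does not validate against the DTD, so the undeclared
# LinkSetDbHistory element is read like any other element.
root = ET.fromstring(xml_text)
link_set = root.find("LinkSet")
history = link_set.find("LinkSetDbHistory")
query_key = history.findtext("QueryKey")
web_env = link_set.findtext("WebEnv")
```

The QueryKey and WebEnv strings are the two values a later history-based
call would need.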

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep 18 07:40:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:40:06 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
In-Reply-To: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
References: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
Message-ID: <320fb6e00909180440p701d3f5ejd22a605f171989eb@mail.gmail.com>

On Fri, Sep 18, 2009 at 12:09 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Michiel (et al),
>
> I've been trying to get an example working using the Entrez history
> for ELink. Strangely here the URL doesn't use history=y but instead
> cmd=neighbor_history (while the default is cmd=neighbor).
>
> However, this appears to show a bug in the Bio.Entrez parser. Consider:
>
> from Bio import Entrez
> pmid = "14630660"
> print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs",
> from_uid=pmid, cmd="neighbor_history").read()
>
> This gives:
>
> <?xml version="1.0"?>
> <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
>  "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
> <eLinkResult>
> <LinkSet>
>         <DbFrom>pubmed</DbFrom>
>         <IdList>
>                 <Id>14630660</Id>
>         </IdList>
>         <LinkSetDbHistory>
>                 <DbTo>pmc</DbTo>
>                 <LinkName>pubmed_pmc_refs</LinkName>
>                 <QueryKey>1</QueryKey>
>         </LinkSetDbHistory>
>         <WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
> </LinkSet>
> </eLinkResult>
>
> The XML looks reasonable by eye - although quite different from
> the non-history version... but my initial guess is
> the problem lies in the XML itself versus the eLink_020511.dtd
> file, which does not mention the LinkSetDbHistory element at
> all. Do you agree that this looks like an NCBI problem?

I should have done this earlier - but two different XML validators
both agree that the "history" version of the NCBI's ELink XML is
invalid, while the default is fine.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor_history

versus

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor

or:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660

I will get in touch with the NCBI...
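
For anyone wanting to reproduce the comparison, the three URLs above
differ only in the cmd parameter. A small sketch that rebuilds them with
the standard library (Python 3 spelling of urllib; the helper name
elink_url is just for illustration):

```python
from urllib.parse import urlencode

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(cmd=None):
    """Build the ELink URL used above, optionally adding a cmd parameter."""
    params = [
        ("db", "pmc"),
        ("dbfrom", "pubmed"),
        ("LinkName", "pubmed_pmc_refs"),
        ("id", "14630660"),
    ]
    if cmd is not None:
        params.append(("cmd", cmd))
    return BASE + "?" + urlencode(params)


history_url = elink_url("neighbor_history")  # the version the validators reject
neighbor_url = elink_url("neighbor")         # explicit default, validates fine
default_url = elink_url()                    # cmd omitted, same as the default
```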

Peter


From eric.talevich at gmail.com  Fri Sep 18 10:08:40 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 18 Sep 2009 10:08:40 -0400
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>
Message-ID: <3f6baf360909180708w2d06c775w18922106bba003e@mail.gmail.com>

On Fri, Sep 18, 2009 at 5:26 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Fri, Sep 18, 2009 at 4:52 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
>
> > One possibly foolish thing I did was use TreeIO to test the trees that came
> > out of these programs made sense, thinking that module would be part of the
> > next release. If the plan is for a new release soon and having a test for
> > these wrappers is important the tests could be done with Nexus.Trees but I
> > found that was difficult to use for files with multiple newick trees.
>
> Hmm. In the short term we can either comment out those bits of the test
> pending the inclusion of TreeIO in the next release, or add a quick tiny
> parser in the test itself to load the trees, split them on the ";" and pass
> them one by one to Bio.Nexus.Trees for parsing.
>
>
That's all TreeIO does. The relevant loop is in NewickIO.parse(), if you'd
like to copy it verbatim:
http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py

-Eric

From biopython at maubp.freeserve.co.uk  Sun Sep 20 07:20:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 20 Sep 2009 12:20:43 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
Message-ID: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>

On Fri, Sep 18, 2009 at 4:52 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>>>
>>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>>> based on test_Emboss.py? Continuing on the github branch is fine.
>>>
>
> Well, it didn't end up being very short but there is a test on my "phylo"
> branch (http://github.com/dwinter/biopython/tree/phylo) in
> test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that
> I'd welcome comments on.

I've checked in something based on the current version from github.

I added a few checks for missing input files (I was getting cryptic errors),
but then decided we had enough input files in the test suite already, and
that it might be more useful to try writing alignments to the PHYLIP tools
via stdin with AlignIO. Certainly at least one example should try this,
assuming it works. I haven't done this yet - feel free to try.

Note that the stdout from the PHYLIPNEW tools isn't clean, so we
can't avoid having temp output files:
http://lists.open-bio.org/pipermail/emboss-dev/2009-September/000632.html

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1. In my defence the emboss
> documentation has it listed as being both mandatory and optional.

Fixed in CVS - does this affect any of the other tools using this argument?

> One possibly foolish thing I did was use TreeIO to test the trees that came
> out of these programs made sense, thinking that module would be part of the
> next release. If the plan is for a new release soon and having a test for
> these wrappers is important the tests could be done with Nexus.Trees but I
> found that was difficult to use for files with multiple newick trees.

I put a quick crude helper function into the unit test as discussed.

The unit test is working nicely on Linux with EMBOSS PHYLIP
from CVS, I presume you are testing against an official release?
If you could check that the CVS code works fine on your setup before the
release that would be great. There is a bit more time as I won't
be able to do the release on Monday, but it should be Tuesday
or Wednesday... and fingers crossed getting PHYLIPNEW
installed on my Windows machine will be easy.

We can look at adding some more of your example input files,
and uncommenting their tests later (especially for cases where
we can't generate the input from Biopython directly). I did add
the horses.tree file BTW.

Thank you David :)

Peter


From winda002 at student.otago.ac.nz  Mon Sep 21 01:13:24 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:13:24 +1200
Subject: [Biopython-dev] draft release announcement
Message-ID: <4AB70B74.1040308@student.otago.ac.nz>

Hi guys,

A draft release announcement for 1.52 for you to look at and comment on. 
This is written with the idea that there will be a blog post describing 
the convert and indexed_dict() methods for SeqIO which can be linked to, 
so the announcement itself is pretty brief.

I didn't mention the movement from CVS to git in the announcement which 
might be something worth adding?

+++
We are pleased to announce the availability of Biopython 1.52, a new 
stable release of the Biopython library.

It may only have been one month since the last release but in that time 
we've added enough useful features to warrant a new release. Biopython 
1.52 will be of particular interest to people using next generation 
sequencing - new functions added to the AlignIO and SeqIO tools speed up 
the way very large sequence files can be dealt with and you can now 
write phd files like those created by Phred and used in 454 sequencing.

SeqIO and AlignIO both now have a helper function called convert() that 
allows for simple, optimized conversion between file formats while SeqIO 
gets a new method called indexed_dict() which allows random access to 
sequences in a file without reading every record in that file into memory.

The new release also adds command line wrappers for the EMBOSS versions 
of the phylip phylogeny programs and squashes a few minor bugs reported 
since 1.51 was released.

Sources and a Windows Installer are available from the downloads page.

Thanks to the Biopython development team and to everyone who has 
reported bugs since our last release.

++++
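
To give a feel for the random access idea behind indexed_dict() in the
draft above, here is a rough standard-library sketch of indexing a FASTA
file by byte offset. This is not the Bio.SeqIO code; the function names
build_fasta_index and fetch_record are hypothetical, and it returns raw
text rather than SeqRecord objects:

```python
import os
import tempfile


def build_fasta_index(filename):
    """Map each record ID to the byte offset of its ">" line."""
    index = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Use the first word of the title line as the key.
                key = line[1:].split(None, 1)[0].decode("ascii")
                index[key] = offset
    return index


def fetch_record(filename, index, key):
    """Return the raw text of one record by seeking to its stored offset."""
    lines = []
    with open(filename, "rb") as handle:
        handle.seek(index[key])
        lines.append(handle.readline())  # the ">" title line itself
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Tiny made-up file to try it on.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">alpha first record\nACGT\nACGT\n>beta second record\nTTGG\n")
    demo_filename = tmp.name

demo_index = build_fasta_index(demo_filename)
demo_record = fetch_record(demo_filename, demo_index, "beta")
os.remove(demo_filename)
```

Only the ID to offset mapping is held in memory, so the cost scales with
the number of records rather than the size of the sequences.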

From tiagoantao at gmail.com  Mon Sep 21 01:17:39 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 21 Sep 2009 06:17:39 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>

There is a big update to the PopGen module, which is now able to do
frequentist statistics and tests through GenePop. I can draft one
paragraph about the subject. I would imagine it is one of the biggest
changes and probably the one that adds most functionality.

On Mon, Sep 21, 2009 at 6:13 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.
>
> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?
>
> +++
> We are pleased to announce the availability of Biopython 1.52, a new stable
> release of the Biopython library.
>
> It may only have been one month since the last release but in that time
> we've added enough useful features to warrant a new release. Biopython 1.52
> will be of particular interest to people using next generation sequencing -
> new functions added to the AlignIO and SeqIO tools speed up the way very
> large sequence files can be dealt with and you can now write phd files like
> those created by Phred and used in 454 sequencing.
>
> SeqIO and AlignIO both now have a helper function called convert() that
> allows for simple, optimized conversion between file formats while SeqIO
> gets a new method called indexed_dict() which allows random access to
> sequences in a file without reading every record in that file into memory.
>
> The new release also adds command line wrappers for the EMBOSS versions of
> the phylip phylogeny programs and squashes a few minor bugs reported since
> 1.51 was released.
>
> Sources and a Windows Installer are available from the downloads page.
>
> Thanks to the Biopython development team and to everyone who has reported
> bugs since our last release
>
> ++++
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From winda002 at student.otago.ac.nz  Mon Sep 21 01:30:44 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:30:44 +1200
Subject: [Biopython-dev] draft blog post for 1.52 stuff
Message-ID: <4AB70F84.6000709@student.otago.ac.nz>

As I mentioned in the draft release announcement it might be useful to 
have a blog post up explaining how the new functions for SeqIO and AlignIO 
work (thanks to Peter for this idea).

I've written a draft for a post that looks at the convert function that 
could do with a little more detail and ignores the indexed_dict() 
function entirely because I just don't have a good enough idea of how it 
works.

Again, any comments are welcome. Is it a good idea to have a post like 
this or should we just extend the release announcement to include a 
little bit more detail?

++
It's only been a month since we released Biopython 1.51 but in that time the
CVS server has stacked up enough cool new features that we are going to put
together a new release soon. As ever the new functions will be documented in
the official tutorial and cookbook but we thought we'd show off a few of
these tools here.


Simple, optimized format conversion with SeqIO and AlignIO


No one has ever complained that bioinformatics just doesn't have enough file
formats - you probably frequently find yourself converting sequence files to
suit particular applications with SeqIO. At the moment this is usually a two
step process, something like this:

 >>> from Bio import SeqIO
 >>> records = SeqIO.parse(in_handle, "genbank")
 >>> SeqIO.write(records, out_handle, "fasta")

As of Biopython 1.52 you'll be able to achieve the same result in a 
single step:

 >>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")

Adding the convert function to SeqIO will make your scripts more readable and
might even save you a couple of lines of code, but more importantly it allows
the conversion process to be optimized for the two formats being used. In the
above example we are moving from a genbank file, which might include multiple
features for each sequence, to a fasta file, which doesn't include features.
If we used the two step process above we'd be spending time reading each
sequence's features into memory just to skip them when they get passed to the
write function. SeqIO.convert() knows that the sequences in the input file are
destined to be written to a fasta file so it can skip over the features and
save a bit of time in doing the conversion.

Obviously, the optimization in SeqIO.convert() is most powerful when it's used
on very large files like those produced in next generation sequencing
projects. When converting between each of the FASTQ file format's variants
with the "SeqIO two step" a significant amount of time is taken creating
SeqRecord objects for each record in the input file, but none of the
attributes or methods of the SeqRecord object are required to do the
conversion. For this reason SeqIO.convert() deals with each record as two
simple strings, one for the record's sequence, the other for its ID.
[some information on just how much time that saves on a big file should
probably go here!]
+++
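
To illustrate the "simple strings" point in the draft, here is a rough
sketch of a FASTQ to FASTA conversion that never builds any record
objects. It is not how SeqIO.convert() is actually written; the function
name fastq_to_fasta is made up, and it assumes unwrapped four-line FASTQ
records:

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Convert four-line FASTQ records to FASTA, returning the record count."""
    count = 0
    while True:
        title_line = in_handle.readline()
        if not title_line:
            break  # end of file
        if not title_line.strip():
            continue  # tolerate blank lines between records
        if not title_line.startswith("@"):
            raise ValueError("Expected '@' at start of FASTQ record")
        seq = in_handle.readline().rstrip("\n")
        in_handle.readline()  # the "+" separator line
        in_handle.readline()  # the quality string, not needed for FASTA
        # Only two plain strings per record: the title and the sequence.
        out_handle.write(">%s\n%s\n" % (title_line[1:].rstrip("\n"), seq))
        count += 1
    return count


# Two made-up reads to try it on.
fastq_text = "@read1 desc\nACGT\n+\nIIII\n@read2\nTTGA\n+read2\n!!!!\n"
fasta_out = io.StringIO()
n_records = fastq_to_fasta(io.StringIO(fastq_text), fasta_out)
fasta_text = fasta_out.getvalue()
```

The quality lines are read and thrown away untouched, which is where a
string-based approach saves work compared with building a full record
for every read.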

From winda002 at student.otago.ac.nz  Mon Sep 21 01:45:34 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:45:34 +1200
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
References: <4AB70B74.1040308@student.otago.ac.nz>
	<6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
Message-ID: <4AB712FE.2060304@student.otago.ac.nz>

Tiago Antão wrote:
> There is a big update to the PopGen module, which is now able to do
> frequentist statistics and tests through GenePop. I can draft one
> paragraph about the subject. I would imagine it is one of the biggest
> changes and probably the one that adds most functionality.
>   
Cool, I see now that I should've read the original thread about the new 
release more closely.

A paragraph from you on your PopGen code would be really helpful.

Cheers,
David

From tiagoantao at gmail.com  Mon Sep 21 03:23:24 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 21 Sep 2009 08:23:24 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB712FE.2060304@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
	<6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
	<4AB712FE.2060304@student.otago.ac.nz>
Message-ID: <6d941f120909210023v5dc91079s6ec54a04ad8385e7@mail.gmail.com>

Something along the lines of:

The Population Genetics module now allows the calculation of several
tests and statistical estimators via a wrapper to GenePop. Supported
are tests for Hardy-Weinberg equilibrium, linkage disequilibrium and
estimates for various F statistics (Cockerham and Weir Fst and Fis,
Robertson and Hill Fis, ...), null allele frequencies and number of
migrants among many others. Isolation By Distance (IBD) functionality
is also supported.


I suppose the changes to PopGen are the biggest going into this
Biopython version and probably one of the highlights. I should update
the documentation ASAP.

I intend to announce this version to some population genetics and
evolutionary biology communities (something I have never done in the
past).


On Mon, Sep 21, 2009 at 6:45 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Tiago Antão wrote:
>>
>> There is a big update to the PopGen module, which is now able to do
>> frequentist statistics and tests through GenePop. I can draft one
>> paragraph about the subject. I would imagine it is one of the biggest
>> changes and probably the one that adds most functionality.
>>
>
> Cool, I see now that I should've read the original thread about the new
> release more closely
>
> A paragraph from you on your PopGen code would be really helpful.
>
> Cheers,
> David


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From biopython at maubp.freeserve.co.uk  Mon Sep 21 05:01:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 10:01:10 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <320fb6e00909210201u3d9032e5vf64ba2953d83938d@mail.gmail.com>

On Mon, Sep 21, 2009 at 6:13 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.

I switched indexed_dict() to just index() after discussion on the list.

> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?

I think that would warrant a one line paragraph (near the end) :)

Peter

From biopython at maubp.freeserve.co.uk  Mon Sep 21 05:11:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 10:11:17 +0100
Subject: [Biopython-dev] draft blog post for 1.52 stuff
In-Reply-To: <4AB70F84.6000709@student.otago.ac.nz>
References: <4AB70F84.6000709@student.otago.ac.nz>
Message-ID: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>

On Mon, Sep 21, 2009 at 6:30 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> As I mentioned in the draft release announcement it might be useful to have
> a blog post up explaining how the new functions for SeqIO and AlignIO work
> (thanks to Peter for this idea).
>
> I've written a draft for a post that looks at the convert function that
> could do with a little more detail and ignores the indexed_dict() function
> entirely because I just don't have a good enough idea of how it works.

Great job - thanks for doing this. I'll tackle an indexing introduction
blog post since you've done a nice job for convert :)

It would also be worth mentioning that the convert function will also
take filenames (not just handles), which also helps simplify simple
conversion tasks.

I should be able to provide some timings for things like FASTQ
conversion, or FASTQ to FASTA on multi-million read files
(there are probably some on the dev list already...).

> Again, any comments are welcome. Is it a good idea to have a post like
> this or should we just extend the release announcement to include a little
> bit more detail?

Well, as I mentioned the idea to David directly, I think these little
motivational examples on the blog are worth trying out. What does
everyone else think?

Peter

From biopython at maubp.freeserve.co.uk  Mon Sep 21 13:41:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 18:41:40 +0100
Subject: [Biopython-dev] draft blog post for 1.52 stuff
In-Reply-To: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>
References: <4AB70F84.6000709@student.otago.ac.nz>
	<320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>
Message-ID: <320fb6e00909211041n6378595cx39f2d395aee0ec7c@mail.gmail.com>

On Mon, Sep 21, 2009 at 10:11 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Sep 21, 2009 at 6:30 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
>> As I mentioned in the draft release announcement it might be useful to have
>> a blog post up explaining how the new functions for SeqIO and AlignIO work
>> (thanks to Peter for this idea).
>>
>> I've written a draft for a post that looks at the convert function that
>> could do with a little more detail and ignores the indexed_dict() function
>> entirely because I just don't have a good enough idea of how it works.
>
> Great job - thanks for doing this. I'll tackle an indexing introduction
> blog post since you've done a nice job for convert :)

Done, and up online - hopefully without typos:
http://news.open-bio.org/news/2009/09/biopython-seqio-index/

Peter

From winda002 at student.otago.ac.nz  Tue Sep 22 01:05:31 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 22 Sep 2009 17:05:31 +1200
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <4AB85B1B.2000704@student.otago.ac.nz>

David Winter wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment 
> on. This is written with the idea that there will be a blog post 
> describing the convert and indexed_dict() methods for SeqIO which can 
> be linked to so the
> announcement itself is pretty brief.
Thanks to Peter and Tiago for their suggestions, there is now a marked 
up version of this draft with those suggestions ready and waiting to 
go on the blog. There is still time for suggestions from anyone else.

David

From winda002 at student.otago.ac.nz  Tue Sep 22 01:14:07 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 22 Sep 2009 17:14:07 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>	
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>	
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>	
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>	
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>	
	<4AA58F3C.6080200@student.otago.ac.nz>	
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
Message-ID: <4AB85D1F.7010901@student.otago.ac.nz>

Peter wrote:
>
>
>   
>> Writing them actually exposed a bug in the code already in CVS, the
>> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
>> should be set to 0 rather than 1. In my defence the emboss
>> documentation has it listed as being both mandatory and optional.
>>     
>
> Fixed in CVS - does this affect any of the other tools using this argument?
>   
Nope, I only slipped on this one ;)
>   
>> One possibly foolish thing I did was use TreeIO to test the trees that came
>> out of these programs made sense, thinking that module would be part of the
>> next release. If the plan is for a new release soon and having a test for
>> these wrappers is important the tests could be done with Nexus.Trees but I
>> found that was difficult to use for files with multiple newick trees.
>>     
>
> I put a quick crude helper function into the unit test as discussed.
>
> The unit test is working nicely on Linux with EMBOSS PHYLIP
> from CVS, I presume you are testing against an official release?
> If you could check that the CVS code works fine on your setup before the
> release that would be great. 
Finally got in front of the right computer to do this. The tests in the 
(Biopython) CVS work fine with the official EMBOSS 6.1.0 release (on 
ubuntu if that helps). I'd offer to try it out on windows but I don't 
have EMBOSS, a compiler or any of the libraries that I'd need to do that!

Cheers,
David

From biopython at maubp.freeserve.co.uk  Tue Sep 22 05:23:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 10:23:10 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB85D1F.7010901@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
	<4AB85D1F.7010901@student.otago.ac.nz>
Message-ID: <320fb6e00909220223q6f079a39o74916d20291c3400@mail.gmail.com>

On Tue, Sep 22, 2009 at 6:14 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Peter wrote:
>>>
>>> Writing them actually exposed a bug in the code already in CVS,
>>> the FProtParsCommandline option "-intreefile" isn't mandatory so
>>> "is_required" should be set to 0 rather than 1. In my defence the
>>> emboss documentation has it listed as being both mandatory and
>>> optional.
>>
>> Fixed in CVS - does this affect any of the other tools using this
>> argument?
>
> Nope, I only slipped on this one ;)

Great. It looks like the tests have been useful already :)

>> The unit test is working nicely on Linux with EMBOSS PHYLIP
>> from CVS, I presume you are testing against an official release?
>> If you could check that the CVS code works fine on your setup before the
>> release that would be great.
>
> Finally got in front of the right computer to do this. The tests in the
> (Biopython) CVS work fine with the official EMBOSS 6.1.0 release
> (on ubuntu if that helps).

Great - thank you.

> I'd offer to try it out on windows but I don't
> have EMBOSS, a compiler or any of the libraries that I'd need to
> do that!

Hmm - EMBOSS only provide a Windows installer for the core
EMBOSS suite, not the extras like PHYLIP. I do have a C
compiler and cygwin setup on my Windows machine, so it may
work. We'll see...

Peter

From mjldehoon at yahoo.com  Tue Sep 22 06:12:37 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 22 Sep 2009 03:12:37 -0700 (PDT)
Subject: [Biopython-dev] Blast records
Message-ID: <230712.78074.qm@web62406.mail.re1.yahoo.com>

Hi everybody,

I was looking at an older bug report about the plain-text and XML Blast parsers in Biopython:

http://bugzilla.open-bio.org/show_bug.cgi?id=2176

When I was checking the current behavior of Biopython's blast parsers, I noticed that the plain-text parser and the XML parser give different results when parsing psi-blast output. The plain-text parser returns a Blast.Record.PSIBlast object, whereas the XML parser returns Blast.Record.Blast objects. In addition, the XML parser misinterprets the psi-blast XML output (creating a separate Blast record for each psi-blast iteration), whereas the plain-text parser fails on psi-blast output of the current blast program.

To fix this, I guess the first step is to decide whether a psi-blast parser should return a Blast.Record.Blast object or a Blast.Record.PSIBlast object. In theory having a Blast.Record.PSIBlast record seems more appropriate. However, this complicates the parser (it's not clear until halfway through the Blast output if it's Blast or Psi-Blast, which means the user has to tell the parser whether it's Blast or Psi-Blast), and the format of the XML output generated for Blast and Psi-Blast is the same. I would therefore suggest having one Blast.Record class that can contain both Blast and Psi-Blast output.

Any other opinions, comments, suggestions?

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Sep 22 07:40:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 12:40:46 +0100
Subject: [Biopython-dev] Blast records
In-Reply-To: <230712.78074.qm@web62406.mail.re1.yahoo.com>
References: <230712.78074.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>

On Tue, Sep 22, 2009 at 11:12 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi everybody,
>
> When I was checking the current behavior of Biopython's blast parsers,
> I noticed that the plain-text parser and the XML parser give different
> results when parsing psi-blast output. The plain-text parser returns a
> Blast.Record.PSIBlast object, whereas the XML parser returns
> Blast.Record.Blast objects. ...
>
> Any other opinions, comments, suggestions?

As I recall (backed up by what I wrote in the tutorial), when I last
checked, the plain text PSI-BLAST output (i.e. from the command
line tool blastpgp) included a lot of information missing in the XML
output. Perhaps this has improved? If it hasn't, I am inclined to
leave things as they are. If the current PSI-BLAST outputs more
details in the XML we may be able to do a better job.

The next bit is my recollection of some of the background to this:
Classic BLAST (and also RPS-BLAST) allow multiple queries and
use the "iterator" block in the XML file for each query. This was an
odd choice of naming, but I think the XML tag was originally only
intended for the PSI-BLAST output where each "iteration" block
in the XML corresponds to each step of the algorithm. You may
recall early versions of BLAST would output "concatenated" XML
files for multiple queries - which were not true XML files. I guess
they fixed this by reusing the existing "iteration" structure for
multiple queries (rather than adding new XML tags). With this in
mind the current parsing of the XML from PSI-BLAST makes
sense.

[In any case, I plan to do Biopython 1.52 this afternoon, with
the PSI BLAST parsing left as it is].

Peter

From biopython at maubp.freeserve.co.uk  Tue Sep 22 09:29:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 14:29:10 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
Message-ID: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>

Hi all,

As previously announced, I'm going to try and get Biopython 1.52
done this afternoon - and am now declaring a CVS freeze.

If all goes to plan, once I've done the release CVS will remain
"frozen", and we'll probably get it made read only on the server.
Instead, we're going to try and switch over to git (initially on
github with a backup on the OBF servers).

Stay tuned for further announcements...

Peter

From p.j.a.cock at googlemail.com  Tue Sep 22 12:38:21 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 17:38:21 +0100
Subject: [Biopython-dev] Biopython 1.52 released
Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>

Dear all,

Those of you who signed up to our newsfeed will know this already,
but we are pleased to announce the release of Biopython 1.52:

http://news.open-bio.org/news/2009/09/biopython-release-152/

Thank you to all our developers, including David Winter for drafting
the release announcement, and everyone else who has contributed
with feedback, bug reports etc.

Could I also take this opportunity to remind you all we have an
application note out in the OUP journal Bioinformatics:
http://news.open-bio.org/news/2009/03/biopython-paper-published/
http://dx.doi.org/10.1093/bioinformatics/btp163

In any scientific publication using Biopython, we kindly request
you cite this, or another appropriate publication from this list:
http://biopython.org/wiki/Documentation#Papers

Thank you,

Peter

From biopython at maubp.freeserve.co.uk  Tue Sep 22 12:42:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 17:42:49 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
Message-ID: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>

On Tue, Sep 22, 2009 at 2:29 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> As previously announced, I'm going to try and get Biopython 1.52
> done this afternoon - and am now declaring a CVS freeze.
>
> If all goes to plan, once I've done the release CVS will remain
> "frozen", and we'll probably get it made read only on the server.
> Instead, we're going to try and switch over to git (initially on
> github with a backup on the OBF servers).
>
> Stay tuned for further announcements...

OK, the release is done. Let's leave things as they are for a day
or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
with Bartek about the timings for the git transition.

I am considering adding a warning message to setup.py and the
readme file as the final commit to CVS, pointing out that we will
be moving future development to a git repository. One of the first
commits to git would be to remove that warning. Does that make
sense?

Peter

From bartek at rezolwenta.eu.org  Tue Sep 22 15:46:20 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 22 Sep 2009 21:46:20 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
Message-ID: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>

On Tue, Sep 22, 2009 at 6:42 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> OK, the release is done. Let's leave things as they are for a day
> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
> with Bartek about the timings for the git transition.
>
> I am considering adding a warning message to setup.py and the
> readme file as the final commit to CVS, pointing out that we will
> be moving future development to a git repository. One of the first
> commits to git would be to remove that warning. Does that make
> sense?

It seems OK to me. Let me know when you make the last commit, so that
I turn off the scripts pushing CVS changes to github, which would be
the only technical thing to do to make the transition. From then on,
we should commit only to git.

Bartek.

From biopython at maubp.freeserve.co.uk  Tue Sep 22 16:18:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 21:18:12 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
Message-ID: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>

On Tue, Sep 22, 2009 at 8:46 PM, Bartek Wilczynski wrote:
> On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote:
>>
>> OK, the release is done. Let's leave things as they are for a day
>> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
>> with Bartek about the timings for the git transition.
>>
>> I am considering adding a warning message to setup.py and the
>> readme file as the final commit to CVS, pointing out that we will
>> be moving future development to a git repository. One of the first
>> commits to git would be to remove that warning. Does that make
>> sense?
>
> It seems OK to me.

Great.

> Let me know when you make the last commit, so that I turn off
> the scripts pushing CVS changes to github, ...

Will do - I'll give it a day or so just in case we need to do a
re-release for anything critical.

> ... which would be the only technical thing to do to make the
> transition. From then on, we should commit only to git.

Yep - although I'll ask the OBF admins to make CVS read only
as a precaution.

Peter

From p.j.a.cock at googlemail.com  Tue Sep 22 16:20:54 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 21:20:54 +0100
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
Message-ID: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>

> Dear all,
>
> Those of you who signed up to our newsfeed will know this already,
> but we are pleased to announce the release of Biopython 1.52:
>
> http://news.open-bio.org/news/2009/09/biopython-release-152/
>
> Thank you to all our developers, including David Winter for drafting
> the release announcement, and everyone else who has contributed
> with feedback, bug reports etc.

Brad - if everything looks fine, can you do the PyPi upload now?

Thanks,

Peter

From chapmanb at 50mail.com  Tue Sep 22 16:42:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 22 Sep 2009 16:42:26 -0400
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
	<320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
Message-ID: <20090922204226.GA13500@sobchak.mgh.harvard.edu>

Hi Peter;
Congrats to everyone on the release. Peter, thanks as always for all
the hard work.

> Brad - if everything looks fine, can you do the PyPi upload now?

No problem, all set:

http://pypi.python.org/pypi/biopython/

I am tempted to secretly commit something to CVS and then vehemently
deny doing it to mess with everyone's head. Wait, so then how did the
README file get changed? A mystery...

Seriously, looking forward to the Git transition,
Brad

From p.j.a.cock at googlemail.com  Tue Sep 22 17:24:11 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 22:24:11 +0100
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <20090922204226.GA13500@sobchak.mgh.harvard.edu>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
	<320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
	<20090922204226.GA13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909221424t2cd67249pc1555c382c4f5597@mail.gmail.com>

On Tue, Sep 22, 2009 at 9:42 PM, Brad Chapman wrote:
> Hi Peter;
> Congrats to everyone on the release. Peter, thanks as always for all
> the hard work.
>
>> Brad - if everything looks fine, can you do the PyPi upload now?
>
> No problem, all set:
>
> http://pypi.python.org/pypi/biopython/

Lovely :)

> I am tempted to secretly commit something to CVS and then vehemently
> deny doing it to mess with everyone's head. Wait, so then how did the
> README file get changed? A mystery...

Well, unless you have another CVS account that we don't know
about, it wouldn't be much of a mystery would it? Grin.

> Seriously, looking forward to the Git transition,

May you live in interesting times?

But yeah - should be good.

Peter

From biopython at maubp.freeserve.co.uk  Wed Sep 23 06:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 11:28:35 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
Message-ID: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>

On Tue, Sep 22, 2009 at 9:18 PM, Peter  wrote:
> Bartek wrote:
>> Let me know when you make the last commit, so that I turn off
>> the scripts pushing CVS changes to github, ...
>
> Will do - I'll give it a day or so just in case we need to do a
> re-release for anything critical.

Hi Bartek,

OK - I think that's it for final commits to CVS (a few notes about
git, and finally adding the warning in setup.py). Not all of these
changes have made it to github yet.

We also need the 1.52 tag ("biopython-152") to get copied over.

Once that is done, could you turn off your CVS to github
script, and let us know by email?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Wed Sep 23 10:34:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 15:34:42 +0100
Subject: [Biopython-dev] Blast records
In-Reply-To: <154350.7800.qm@web62402.mail.re1.yahoo.com>
References: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>
	<154350.7800.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com>

On Wed, Sep 23, 2009 at 2:51 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> --- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> As I recall (backed up by what I wrote in the tutorial),
>> when I last checked, the plain text PSI-BLAST output
>> (i.e. from the command line tool blastpgp) included a
>> lot of information missing in the XML output. Perhaps
>> this has improved? If it hasn't, I am inclined to leave
>> things as they are. If the current PSI-BLAST outputs
>> more details in the XML we may be able to do a better job.
>
> As far as I can tell, the XML contains the same information
> as the plain-text psiblast output, but the XML parser doesn't
> parse it correctly, since it assumes it is dealing with regular
> blast rather than psi-blast.

It sounds like the NCBI have changed the PSI BLAST XML
output then.

>> The next bit is my recollection of some of the background
>> to this:
>> Classic BLAST (and also RPS-BLAST) allow multiple queries
>> and use the "iterator" block in the XML file for each query.
>> This was an odd choice of naming, but I think the XML tag was
>> originally only intended for the PSI-BLAST output where each
>> "iteration" block in the XML corresponds to each step of the
>> algorithm. You may recall early versions of BLAST would output
>> "concatenated" XML files for multiple queries - which were not
>> true XML files.
>
> That is correct. To make things more complex, if you run
> psi-blast with multiple queries you get concatenated XML
> files again, with the iteration blocks corresponding to the
> psi-blast iterations for each query.

Odd - and arguably a bug, since it isn't valid XML.

>> I guess they fixed this by reusing the existing "iteration"
>> structure for multiple queries (rather than adding new XML
>> tags). With this in mind the current parsing of the XML from
>> PSI-BLAST makes sense.
>
> I don't know if it really makes sense. For a single psi-blast
> query, we're getting multiple Blast records. For multiple
> psi-blast queries, we're iterating over the iteration blocks
> while ignoring the fact that they can come from different
> queries.

Is a single Blast record object for each PSI-BLAST iteration
such a bad thing?

> Ideally, we should be able to see from the XML whether
> it was regular blast with multiple queries, or psi-blast with
> a single query. Right now that is possible by looking at
> the query-def lines, but I wonder if NCBI is considering
> a better solution for this. I'll write an email to them to find out.

Certainly clarification from the NCBI sounds useful.

Peter

From mjldehoon at yahoo.com  Wed Sep 23 09:51:04 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 23 Sep 2009 06:51:04 -0700 (PDT)
Subject: [Biopython-dev] Blast records
In-Reply-To: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>
Message-ID: <154350.7800.qm@web62402.mail.re1.yahoo.com>

--- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> As I recall (backed up by what I wrote in the tutorial),
> when I last checked, the plain text PSI-BLAST output
> (i.e. from the command line tool blastpgp) included a
> lot of information missing in the XML output. Perhaps
> this has improved? If it hasn't, I am inclined to leave
> things as they are. If the current PSI-BLAST outputs
> more details in the XML we may be able to do a better job.

As far as I can tell, the XML contains the same information as the plain-text psiblast output, but the XML parser doesn't parse it correctly, since it assumes it is dealing with regular blast rather than psi-blast.

> The next bit is my recollection of some of the background
> to this:
> Classic BLAST (and also RPS-BLAST) allow multiple queries
> and use the "iterator" block in the XML file for each query.
> This was an odd choice of naming, but I think the XML tag was
> originally only intended for the PSI-BLAST output where each
> "iteration" block in the XML corresponds to each step of the 
> algorithm. You may recall early versions of BLAST would output 
> "concatenated" XML files for multiple queries - which were not
> true XML files.

That is correct. To make things more complex, if you run psi-blast with multiple queries you get concatenated XML files again, with the iteration blocks corresponding to the psi-blast iterations for each query.

> I guess they fixed this by reusing the existing "iteration"
> structure for multiple queries (rather than adding new XML
> tags). With this in mind the current parsing of the XML from
> PSI-BLAST makes sense.

I don't know if it really makes sense. For a single psi-blast query, we're getting multiple Blast records. For multiple psi-blast queries, we're iterating over the iteration blocks while ignoring the fact that they can come from different queries.

Ideally, we should be able to see from the XML whether it was regular blast with multiple queries, or psi-blast with a single query. Right now that is possible by looking at the query-def lines, but I wonder if NCBI is considering a better solution for this. I'll write an email to them to find out.
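
To make the query-def idea concrete, here is a rough sketch (not the Biopython parser, and not a proposal for its API). It groups consecutive iteration blocks that share a query-def, so a run of several blocks with the same query looks like PSI-BLAST rounds, while one block per query looks like regular blast with multiple queries. The helper names make_xml and group_iterations are made up for this illustration, the sample XML is invented and heavily trimmed, and a concatenated multi-query psi-blast file would first have to be split into valid XML documents.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def make_xml(pairs):
    """Build a trimmed, made-up BLAST-style XML document for the demo."""
    blocks = "".join(
        "<Iteration><Iteration_iter-num>%d</Iteration_iter-num>"
        "<Iteration_query-def>%s</Iteration_query-def></Iteration>" % (num, qdef)
        for num, qdef in pairs
    )
    return (
        "<BlastOutput><BlastOutput_iterations>%s"
        "</BlastOutput_iterations></BlastOutput>" % blocks
    )


def group_iterations(xml_text):
    """Group consecutive Iteration blocks that share the same query-def.

    Returns a list of (query_def, [iteration numbers]) tuples in file order.
    """
    root = ET.fromstring(xml_text)
    pairs = [
        (it.findtext("Iteration_query-def"), int(it.findtext("Iteration_iter-num")))
        for it in root.iter("Iteration")
    ]
    return [
        (query, [num for _, num in block])
        for query, block in groupby(pairs, key=lambda pair: pair[0])
    ]


# One query seen in two blocks: looks like two PSI-BLAST rounds.
psi_like = group_iterations(make_xml([(1, "query_A"), (2, "query_A")]))
# Two different queries, one block each: looks like multi-query blast.
multi_like = group_iterations(make_xml([(1, "query_A"), (2, "query_B")]))
```

This only relies on the query-def text, so two distinct queries with identical definition lines would be merged; that is one reason a clearer signal from NCBI would still be welcome.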

--Michiel


From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 10:47:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 10:47:16 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231447.n8NElGi8003751@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-23 10:47 EST -------
I've looked at PDB file 13GS in more detail, and this doesn't look like a bug
in Biopython, but rather just another odd PDB file.

Chains C and D are only three residue peptides, e.g.

ATOM   3301  N   GLU D   1      16.854  13.061  10.252  1.00 65.68           N  
ATOM   3302  CA  GLU D   1      17.100  13.860   9.018  1.00 66.23           C  
ATOM   3303  C   GLU D   1      17.937  15.095   9.363  1.00 65.02           C  
ATOM   3304  O   GLU D   1      18.510  15.724   8.439  1.00 56.86           O  
ATOM   3305  CB  GLU D   1      15.764  14.279   8.389  1.00 66.35           C  
ATOM   3306  CG  GLU D   1      15.913  14.994   7.062  1.00 67.41           C  
ATOM   3307  CD  GLU D   1      14.584  15.456   6.508  1.00 68.72           C  
ATOM   3308  OE1 GLU D   1      13.547  15.340   7.163  1.00 69.08           O  
ATOM   3309  OXT GLU D   1      17.998  15.420  10.569  1.00 66.12           O  
ATOM   3310  N   CYS D   2      14.618  15.966   5.283  1.00 69.97           N  
ATOM   3311  CA  CYS D   2      13.431  16.483   4.614  1.00 70.18           C  
ATOM   3312  C   CYS D   2      13.374  15.898   3.213  1.00 69.53           C  
ATOM   3313  O   CYS D   2      14.409  15.625   2.610  1.00 65.61           O  
ATOM   3314  CB  CYS D   2      13.502  18.008   4.507  1.00 73.18           C  
ATOM   3315  SG  CYS D   2      14.485  18.841   5.796  1.00 76.47           S  
ATOM   3316  N   GLY D   3      12.166  15.713   2.693  1.00 71.49           N  
ATOM   3317  CA  GLY D   3      12.023  15.155   1.360  1.00 75.33           C  
ATOM   3318  C   GLY D   3      11.489  13.733   1.399  1.00 78.72           C  
ATOM   3319  O   GLY D   3      10.840  13.313   0.413  1.00 79.95           O  
ATOM   3320  OXT GLY D   3      11.717  13.031   2.412  1.00 80.37           O  
TER    3321      GLY D   3

Look at the C-alpha distances, (17.100, 13.860, 9.018) to (13.431, 16.483,
4.614) to (12.023, 15.155, 1.360) giving distances of 6.3 and 3.8:

>>> from math import sqrt
>>> import numpy
>>> a = numpy.array((17.100, 13.860, 9.018))
>>> b = numpy.array((13.431, 16.483, 4.614))
>>> c = numpy.array((12.023, 15.155, 1.360))
>>> sqrt(sum((a-b)**2))
6.3037215991825049
>>> sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart, regardless
of whether you use a simple C-alpha distance (as here) or look at the backbone's N
to C bonds.

The "problem" for 13GS goes away if you relax the default distance threshold,
e.g. use PPBuilder(10.0) instead of PPBuilder().
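
As a rough illustration of why the threshold matters, the sketch below splits an ordered list of C-alpha coordinates wherever two neighbours are further apart than a cutoff. This is not the Bio.PDB implementation; the function name split_by_distance is made up, and the 4.3 cutoff is an assumed value chosen only for the demo (check the builder you use for its real default).

```python
from math import dist  # available from Python 3.8


def split_by_distance(coords, threshold):
    """Split ordered C-alpha coordinates into runs of close neighbours.

    A new run starts whenever the distance to the previous point
    exceeds the threshold.
    """
    if not coords:
        return []
    segments = [[coords[0]]]
    for prev, cur in zip(coords, coords[1:]):
        if dist(prev, cur) > threshold:
            segments.append([cur])  # gap too large: start a new run
        else:
            segments[-1].append(cur)
    return segments


# C-alpha coordinates of GLU 1, CYS 2 and GLY 3 from chain D of 13GS.
ca = [
    (17.100, 13.860, 9.018),
    (13.431, 16.483, 4.614),
    (12.023, 15.155, 1.360),
]

strict = [len(seg) for seg in split_by_distance(ca, 4.3)]    # assumed cutoff
relaxed = [len(seg) for seg in split_by_distance(ca, 10.0)]  # relaxed cutoff
```

With the assumed strict cutoff the first residue is separated from the other two, while the relaxed cutoff keeps all three residues together.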

However, whatever affects 1A2D seems to be a different issue...

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bartek at rezolwenta.eu.org  Wed Sep 23 11:10:32 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 23 Sep 2009 17:10:32 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
Message-ID: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>

On Wed, Sep 23, 2009 at 12:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
> OK - I think that's it for final commits to CVS (a few notes about
> git, and finally adding the warning in setup.py). Not all of these
> changes have made it to github yet.
>
> We also need the 1.52 tag ("biopython-152") to get copied over.
>
> Once that is done, could you turn off your CVS to github
> script, and let us know by email?

Ta-da! We are no longer synchronizing from CVS!

Please do not commit  any changes to the CVS because they are not
going to be transferred to git, which is now _the_ repository for
biopython.

Everyone with biopython CVS accounts is welcome to send their github
logins (off the list) to me or Peter to get them added as biopython
collaborators.

cheers
Bartek


From biopython at maubp.freeserve.co.uk  Wed Sep 23 11:16:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 16:16:19 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
Message-ID: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Wed, Sep 23, 2009 at 12:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
>> OK - I think that's it for final commits to CVS (a few notes about
>> git, and finally adding the warning in setup.py). Not all of these
>> changes have made it to github yet.
>>
>> We also need the 1.52 tag ("biopython-152") to get copied over.
>>
>> Once that is done, could you turn off your CVS to github
>> script, and let us know by email?
>
> Ta-da! We are no longer synchronizing from CVS!

Lovely... but could you double check the last few commits made it?
i.e. The final commit should be:

setup.py CVS revision 1.174
date: 2009/09/23 10:06:08;  author: peterc;  state: Exp;  lines: +8 -0
Adding a warning about CVS/git to setup.py (which we will remove
once we switch to git) so people know they are using an out of date
repository.

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 11:40:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 11:40:00 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231540.n8NFe0iU005670@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-23 11:39 EST -------
I think the problem with PDB file 1A2D is due to the atypical PYX residue,

# Report residues that have a C-alpha atom but fail the is_aa test.
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa

structure = PDBParser().get_structure("tmp", "1A2D.pdb")
for model in structure:
    for chain in model:
        for res in chain:
            if "CA" in res.child_dict and not is_aa(res):
                print(chain, res)

The polypeptide code only looks at residues that pass the is_aa test, which
means we can ignore things like water atoms associated with a chain. In this
PDB file there are two residues which fail this test:

<Chain id=A> <Residue PYX het=H_PYX resseq=117 icode= >
<Chain id=B> <Residue PYX het=H_PYX resseq=117 icode= >

According to the SEQADV and MODRES lines, these are modified CYS residues.
Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This
leads me to believe the fix is to add the PYX -> C mapping to Biopython.
[The dictionary used, to_one_letter_code, is actually defined in file
Bio/SCOP/RAF.py for some historical reason.]
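
A minimal sketch of the kind of lookup this implies is below. The tables are deliberately tiny and hypothetical; they are not the to_one_letter_code dictionary from Biopython, and the function name one_letter is made up for the illustration. The only mapping taken from this report is PYX -> C.

```python
# Hypothetical, deliberately tiny table of standard residues.
THREE_TO_ONE = {"ALA": "A", "CYS": "C", "GLU": "E", "GLY": "G"}

# Modified residues mapped to their parent amino acid.
# PYX -> C is the case from 1A2D discussed in this comment.
MODIFIED_TO_ONE = {"PYX": "C"}


def one_letter(resname, unknown="X"):
    """Return a one-letter code, falling back to the modified-residue
    table and finally to the placeholder for anything unrecognised."""
    resname = resname.strip().upper()
    if resname in THREE_TO_ONE:
        return THREE_TO_ONE[resname]
    return MODIFIED_TO_ONE.get(resname, unknown)


# A short made-up run of residue names containing the modified cysteine.
sequence = "".join(one_letter(name) for name in ["GLU", "PYX", "GLY"])
```

Keeping the modified residues in their own table makes it easy to see which entries were added by hand and to extend the list as more problem files turn up.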

Consulting the PDB documentation suggests that there are potentially
many more examples like this of unknown HETATM entries which are
modified amino acid residues... see:
ftp://ftp.wwpdb.org/pub/pdb/data/monomers/

Christian - did you find any other problem PDB files?

Peter



From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 11:47:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 11:47:19 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231547.n8NFlJ39005869@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #4 from schafer at rostlab.org  2009-09-23 11:47 EST -------
Peter,

yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time,
I'll take a look at it and post them here. It's easy to do this. What I did is,
I parsed the structures through the dssp structure assignment tool and compared
the obtained sequence with that obtained from the Bio.PDB parser. Background: I
wanted to map the sequence that dssp sees to atomic coordinates.



From bartek at rezolwenta.eu.org  Wed Sep 23 11:56:42 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 23 Sep 2009 17:56:42 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
Message-ID: <8b34ec180909230856u235a17ah437e578e02d5e6d3@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Lovely... but could you double check the last few commits made it?

Sure, your commit didn't make it to github at first, because it was
just two minutes after the last scheduled synchronization.

Now it's in github.

cheers
 Bartek

From biopython at maubp.freeserve.co.uk  Wed Sep 23 12:04:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:04:30 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
Message-ID: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote:
>>
>> Ta-da! We are no longer synchronizing from CVS!
>>
>
> Lovely... but could you double check the last few commits made it?
> i.e. The final commit should be:
>
> setup.py CVS revision 1.174
> date: 2009/09/23 10:06:08;  author: peterc;  state: Exp;  lines: +8 -0
> Adding a warning about CVS/git to setup.py (which we will remove
> once we switch to git) so people know they are using an out of date
> repository.

It has just shown up in the last few minutes :)

I'm ready to make the first commit directly to github (removing the
new warning from setup.py), assuming everything is fine on your
end Bartek?

Peter


From biopython at maubp.freeserve.co.uk  Wed Sep 23 12:34:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:34:12 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
Message-ID: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:04 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I'm ready to make the first commit directly to github (removing the
> new warning from setup.py), assuming everything is fine on your
> end Bartek?

OK - that's done now. Thank you Bartek.

Ladies and Gentlemen, we are now running Biopython development
with git :)

Remember - CVS remains frozen (and I'll ask the OBF admins to make
it read only to prevent any accidents).

Now, let's make sure all the documentation and the wiki etc is up to date,
and make an official announcement on the news server.

Those of you who already had CVS access, once you think you are happy
with using git (i.e. you've had a play with your own local repository, and also
ideally tried pushing changes to a personal repository on github), please
ask for collaborator status on github.

Peter

From eric.talevich at gmail.com  Wed Sep 23 23:48:49 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 23 Sep 2009 23:48:49 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
Message-ID: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>

Folks,

I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO
modules and I'd like your opinion on what else should be done before merging
this into the mainline.

First, the wiki documentation for PhyloXML has an example pipeline showing
how to build a phylogeny in Biopython, from a raw protein sequence to a
lightly annotated phyloXML file.
http://biopython.org/wiki/PhyloXML#Example_pipeline

Does this look right? I copied the first few steps from the official
docs.

The source code, for your review, is here:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py

Discussion:

*TreeIO*
The read, parse, write and convert functions work essentially the same as in
SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.

(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?

*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:

(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.

(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.

(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?

(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way?  If
not, I'll forget about it.

*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.

Best regards,
Eric

From mjldehoon at yahoo.com  Thu Sep 24 05:33:22 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 24 Sep 2009 02:33:22 -0700 (PDT)
Subject: [Biopython-dev] Blast records
In-Reply-To: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com>
Message-ID: <888743.69260.qm@web62408.mail.re1.yahoo.com>

--- On Wed, 9/23/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> --- Michiel wrote:
> > For a single psi-blast query, we're getting multiple Blast
> > records. For multiple psi-blast queries, we're iterating over
> > the iteration blocks while ignoring the fact that they can come
> > from different queries.
> 
> Is a single Blast record object for each PSI-BLAST
> iteration such a bad thing?
>
Well the plain-text PSI-BLAST parser returns a single Record.PSIBlast object containing all of the PSI-BLAST iterations, whereas the XML parser returns multiple Record.Blast objects. Ideally, the plain-text parser and the XML parser should return the same thing.
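
One way to picture the mismatch: the XML parser yields a flat stream with one
item per iteration, while the plain-text parser hands back one object per
query. The sketch below regroups such a flat stream by query. It is only an
illustration, not the Bio.Blast API: the (query_id, iteration) tuple layout
and the name group_by_query are assumptions made for this example.

```python
from itertools import groupby


def group_by_query(records):
    """Group consecutive (query_id, iteration_data) pairs by query.

    `records` stands in for a flat stream with one item per PSI-BLAST
    iteration; the tuple layout is an assumption for this sketch.
    """
    for query_id, group in groupby(records, key=lambda rec: rec[0]):
        # Collect every iteration belonging to the same query.
        yield query_id, [data for _, data in group]


flat = [("q1", "round 1"), ("q1", "round 2"), ("q2", "round 1")]
grouped = list(group_by_query(flat))
```

This relies on the iterations of one query being adjacent in the stream,
which is how groupby works; interleaved queries would need a dict instead.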

--Michiel.


From biopython at maubp.freeserve.co.uk  Thu Sep 24 05:57:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 10:57:12 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
Message-ID: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>

On Thu, Sep 24, 2009 at 4:48 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Discussion:
>
> *TreeIO*
> The read, parse, write and convert functions work essentially the same as in
> SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

Great.

One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
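
As a rough sketch of the behaviour being asked for (not the actual
Bio.TreeIO code), a parse() written as a generator copes with zero, one or
many entries without special cases, and read() can be layered on top for the
"exactly one" case. The one-tree-per-line toy format here is an assumption
chosen to keep the example short.

```python
from io import StringIO


def parse(handle):
    """Yield one tree string per non-blank line (toy one-tree-per-line format)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single tree in the handle, or raise ValueError."""
    trees = list(parse(handle))
    if len(trees) != 1:
        raise ValueError("expected exactly one tree, found %d" % len(trees))
    return trees[0]


# The same loop works regardless of how many entries there are.
counts = []
for text in ("", "(A,B);\n", "(A,B);\n(C,D);\n"):
    counts.append(sum(1 for _ in parse(StringIO(text))))
```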

On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.

[In fact, as I have said before, I prefer the simplicity of just allowing
handles - and we should make TreeIO and SeqIO/AlignIO consistent]
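
A minimal sketch of that layering, assuming a hypothetical registry called
_PARSERS and a toy Newick parser: only the top-level function knows about
filenames, and the format-specific code sees nothing but handles. None of
these names are taken from the phyloxml branch.

```python
from io import StringIO


def _parse_newick(handle):
    """Format-specific parser: works on handles only, never filenames."""
    for line in handle:
        if line.strip():
            yield line.strip()


# Hypothetical registry mapping format names to handle-based parsers.
_PARSERS = {"newick": _parse_newick}


def parse(source, format):
    """Accept a filename or an open handle; files are opened and closed here only."""
    parser = _PARSERS[format]
    if isinstance(source, str):
        # Treat strings as filenames and manage the file's lifetime here.
        with open(source) as handle:
            yield from parser(handle)
    else:
        # Anything else is assumed to be an open handle owned by the caller.
        yield from parser(source)


trees = list(parse(StringIO("(A,B);\n(C,D);\n"), "newick"))
```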

> (1) 'phyloxml' uses a different object representation than the other two, so
> converting between those formats is not possible until Nexus.Trees is ported
> over to Bio.Tree.

I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.

> (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> make the original Nexus module write out trees that it didn't parse itself.
> Help?

To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to have
a hard coded Nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
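
The template idea could look something like the sketch below. The function
name newick_to_nexus and the fixed tree name "tree1" are made up for this
example, and it assumes taxon labels need no quoting (no spaces or
punctuation), which real data will not always satisfy.

```python
# Hard-coded minimal Nexus file with a taxa block and a trees block.
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
 Tree tree1=%(newick)s
End;
"""


def newick_to_nexus(newick, taxa):
    """Wrap a Newick string and its taxon labels in the template above."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "newick": newick,
    }


nexus_text = newick_to_nexus("(A,(B,C))", ["A", "B", "C"])
```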

> *Tree*
> The BaseTree module is meant to be the basis for Newick trees eventually,
> so I'd like to get the design right with the minimum number of public
> methods:
>
> (1) The find() function, named after the Unix utility that does the same
> thing for directory trees, seems capable of all the iteration and filtering
> necessary for locating data and automatically adding annotations to a tree.
> There's a 'terminal' argument for selecting internal nodes, external nodes,
> or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> to remove it if no one protests.

I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.

> (2) Should find() be based on depth_first_search or breadth_first_search
> (not checked in yet)? DFS would potentially find a leaf node faster, but BFS
> seems more common in phylogenetics. Note that iteration can easily be
> reversed with the standard reversed() function, so we don't need extra
> functions for those cases.

You could do both, either via an argument or having two methods, say
depth_first_search and breadth_first_search instead of find.
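
To make the two options concrete, here is a small sketch with both
traversals as generators plus a find() that picks one via an argument. The
Node class and the "order" argument are stand-ins invented for this example,
not the BaseTree design.

```python
from collections import deque


class Node:
    """Toy tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def depth_first_search(root):
    """Yield nodes in preorder (parent before children, left to right)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def breadth_first_search(root):
    """Yield nodes level by level, starting from the root."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def find(root, order="depth"):
    """Single entry point; 'order' selects the traversal."""
    search = depth_first_search if order == "depth" else breadth_first_search
    return search(root)


tree = Node("root", [Node("x", [Node("A"), Node("B")]), Node("C")])
dfs_names = [n.name for n in find(tree, order="depth")]
bfs_names = [n.name for n in find(tree, order="breadth")]
```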

> (3) I left room in each Node for the left and right indexes used by BioSQL's
> nested-set representation. Now I'm doubting the utility of that -- any
> Biopython function that uses those indexes would need to ensure that the
> index is up to date, which seems tricky. Shall I remove all mention of the
> nested-set representation, or try to support it fully?

A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).
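
For reference, the nested-set numbering itself can be computed in one
traversal, as in the sketch below. Recomputing the whole index after any
edit avoids the stale-index problem at the cost of visiting every node. The
Node class and nested_set_index are illustrative names only, and no BioSQL
column names are assumed.

```python
class Node:
    """Toy tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def nested_set_index(root):
    """Return {name: (left, right)} using nested-set numbering.

    A node's interval encloses the intervals of all its descendants.
    Uses recursion, so very deep trees could hit Python's recursion limit.
    """
    index = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter          # number assigned on the way down
        for child in node.children:
            visit(child)
        counter += 1            # number assigned on the way back up
        index[node.name] = (left, counter)

    visit(root)
    return index


tree = Node("root", [Node("x", [Node("A"), Node("B")]), Node("C")])
index = nested_set_index(tree)
```

With such an index, "is A below x?" becomes a pair of integer comparisons
on the stored intervals rather than a walk up the tree.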

On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.
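
A rough sketch of the load-on-demand idea, with a plain dict standing in for
the BioSQL taxon tables. LazyTaxonNode, the fetch callback and the TAXA dict
are all hypothetical; a real version would issue SQL queries instead.

```python
class Node:
    """In-memory node; children are held directly."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    @property
    def children(self):
        return self._children


class LazyTaxonNode(Node):
    """Same API as Node, but children are fetched on first access."""

    def __init__(self, name, fetch_children):
        super().__init__(name)
        self._fetch_children = fetch_children
        self._loaded = False

    @property
    def children(self):
        if not self._loaded:
            # One lookup per node, performed only when somebody asks.
            names = self._fetch_children(self.name)
            self._children = [LazyTaxonNode(n, self._fetch_children) for n in names]
            self._loaded = True
        return self._children


# A dict stands in for the taxon tables; 'queries' records each lookup.
TAXA = {"root": ["Bacteria", "Archaea"], "Bacteria": ["Escherichia"]}
queries = []


def fetch(name):
    queries.append(name)
    return TAXA.get(name, [])


root = LazyTaxonNode("root", fetch)
bacteria = root.children[0]
```

Only the nodes actually visited trigger a lookup, so extracting the bacteria
subtree would never touch the other branches.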

> (4) There's some mention in the literature of a relationship-matrix
> representation for phylogenies. Does anyone here know how to work with this
> representation, or know if it would let us perform complex calculations with
> blinding speed behind the scenes? If so, should there be a function in
> Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
> not, I'll forget about it.

I don't know.

> *Graphics*
> I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> in the way biologists are used to seeing them. Does anyone know of existing
> code that could be borrowed for this? I looked at ETE (announced on the main
> biopython list last week) and liked the examples, but it uses PyQt4 and a
> standalone GUI for display, which is a substantial departure from the
> Biopython way of doing things.

I still haven't tracked down my old ReportLab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 24 06:23:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 11:23:34 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
	<320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
Message-ID: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Now, let's make sure all the documentation and the wiki etc is up to date,
> and make an official announcement on the news server.
>

How does this look for a draft news post (with links to wiki pages etc):

The release of Biopython 1.52 earlier this week marked the end of an
era: it was our last release using CVS for source code control.

As of now, Biopython is using a git repository, hosted on github.com
who kindly provide git hosting for open source projects free of
charge. The BioRuby project have been using github for some time now,
so we are in good company.

The existing OBF hosted CVS repository will be maintained in the short
to medium term as a backup, but will not be updated.

Although many people have been involved in this move, we'd like to
thank Bartek Wilczynski in particular for handling the CVS to git
conversion, and for mirroring our CVS updates to git during the last
few months' transition period. In the next few weeks hopefully we'll
get our git usage wiki pages perfected, as we start using git for
real.

Peter


From jhuerta at crg.es  Thu Sep 24 06:45:21 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Thu, 24 Sep 2009 12:45:21 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>

Hi,

( I'm the developer of ETE. )
I agree that PyQt4 is a significant dependency. I chose it because the
Qt4 QGraphicsScene environment offers many possibilities, like OpenGL
rendering, unlimited image size, performance, and good bindings to Python.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library, so you could render the same tree images using
different backends. If you think this is useful for you, please let me know
and we can think about how to integrate it with Biopython.
Regarding the GUI, it is not a standalone application but one more method
within the Tree objects. The GUI can be started at any point of the
execution and the main program will continue after you close it. I did it
like this because I think it is quite useful for working within interactive
Python sessions.

I develop a lot of code around tree handling, so if you think I can help,
please tell me.
jaime.


> > *Graphics*
> > I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> > usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> > nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> > in the way biologists are used to seeing them. Does anyone know of existing
> > code that could be borrowed for this? I looked at ETE (announced on the main
> > biopython list last week) and liked the examples, but it uses PyQt4 and a
> > standalone GUI for display, which is a substantial departure from the
> > Biopython way of doing things.
>
> I still haven't tracked down my old report lab code, but it wasn't object
> orientated and would need a lot of work to bring up to standard...
>
>


> Peter


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 07:14:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 07:14:37 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241114.n8OBEbKH005629@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 07:14 EST -------
(In reply to comment #4)
> Peter,
> 
> yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time,
> I'll take a look at it and post them here. It's easy to do this. What I did is,
> I parsed the structures through the dssp structure assignment tool and compared
> the obtained sequence with that obtained from the Bio.PDB parser. Background: I
> wanted to map the sequence that dssp sees to atomic coordinates.
> 

If you can give us some more examples that would be very helpful, thank you.

I have committed a partial fix which means any known modified amino acids
(based on the presence of an alpha carbon) will be treated as an amino
acid for building the peptide (and given the default sequence letter of X).
This will also issue a warning. Any such previously unknown modified amino
acid (like PYX) needs to be added to our hard coded lookup table with the
appropriate single letter symbol as used by the PDB in their FASTA files
(in this case, PYX -> C for cysteine).
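
In outline, the lookup behaves roughly like the sketch below. This is not
the Bio.PDB code: the table contents are truncated, and the function name
one_letter and its has_alpha_carbon flag are assumptions made for the
example. Only the PYX -> C mapping comes from this bug.

```python
import warnings

# Truncated for brevity; a real table covers all twenty standard residues.
STANDARD = {"ALA": "A", "CYS": "C", "GLY": "G"}

# Hard-coded table of known modified residues and their parent letters.
MODIFIED = {"PYX": "C"}


def one_letter(resname, has_alpha_carbon=True):
    """Map a three-letter residue name to a sequence letter.

    Unknown residues with an alpha carbon are treated as amino acids,
    given the default letter X, and reported with a warning.
    """
    resname = resname.upper()
    if resname in STANDARD:
        return STANDARD[resname]
    if resname in MODIFIED:
        return MODIFIED[resname]
    if has_alpha_carbon:
        warnings.warn("Assuming %s is an unknown modified amino acid" % resname)
        return "X"
    # No alpha carbon: not treated as part of the peptide.
    return None
```

Adding a newly identified modified residue is then a one-line change to the
MODIFIED table, after which the warning for that residue goes away.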

I suspect that some of your other problem PDB files still have (currently)
undefined modified amino acids in them...

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Thu Sep 24 07:39:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 12:39:59 +0100
Subject: [Biopython-dev] Committing to github...
Message-ID: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>

Hi all,

My last couple of commits to github have been from a local clone of the
*official* repository: http://github.com/biopython/biopython/

This is a nice and simple work flow for small changes, and the history and
github network graph are easy to understand:
http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

This seems like the easiest way to work for people used to CVS, and you don't
need to bother with your own Biopython cloned repository on github (you just
need a github account and collaborator status). I'll probably continue to do
this in the short term.

--

However, prior to that I did a couple of commits via a local clone of *my*
personal github repository, http://github.com/peterjc/biopython/

I had kept the master branch on *my* repository identical to the official
master. However, while I was only pushing a tiny change, git did this as a
merge - resulting in a flurry of RSS entries and a complicated looking git
network diagram. I think it is probably just down to the way we've been using
the repositories during the migration? With this backlog of merges done, I
expect future commits by this route will look much cleaner...

Peter

From chapmanb at 50mail.com  Thu Sep 24 08:08:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 24 Sep 2009 08:08:00 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <20090924120800.GJ13500@sobchak.mgh.harvard.edu>

Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.

> > (1) 'phyloxml' uses a different object representation than the other two, so
> > converting between those formats is not possible until Nexus.Trees is ported
> > over to Bio.Tree.
> 
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
> that phyloxml allows very minimal trees, the reverse as well). It does look
> like the best plan is to use the same tree objects for all three (updating
> Bio.Nexus if possible).

Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.

Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.

> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.

Agreed. As we move Nexus over, we should be sure to keep current
functionality.

> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and filtering
> > necessary for locating data and automatically adding annotations to a tree.
> > There's a 'terminal' argument for selecting internal nodes, external nodes,
> > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> > to remove it if no one protests.
> 
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.

I'm for keeping it as well, and just having the underlying
implementation of get_leaf_nodes call find with the right arguments.
This seems like an operation that should be dead obvious to do.
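
Something along these lines, as a sketch only: the convenience method stays
in the public API but is a one-line wrapper around find(). The Node class
and the exact meaning of the 'terminal' argument (None for all nodes, True
for leaves, False for internal nodes) are assumptions for this example.

```python
class Node:
    """Toy tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def is_terminal(self):
        """A node with no children is a leaf."""
        return not self.children

    def find(self, terminal=None):
        """Yield nodes depth-first; terminal=True/False filters leaves/internal."""
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            yield from child.find(terminal)

    def get_leaf_nodes(self):
        """Obvious-to-find shortcut that delegates to find()."""
        return list(self.find(terminal=True))


tree = Node("root", [Node("x", [Node("A"), Node("B")]), Node("C")])
leaf_names = [n.name for n in tree.get_leaf_nodes()]
internal_names = [n.name for n in tree.find(terminal=False)]
```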

> > (3) I left room in each Node for the left and right indexes used by BioSQL's
> > nested-set representation. Now I'm doubting the utility of that -- any
> > Biopython function that uses those indexes would need to ensure that the
> > index is up to date, which seems tricky. Shall I remove all mention of the
> > nested-set representation, or try to support it fully?

Again I agree with Peter here -- this would be best supported as a
subclass that is database aware with an identical API, similar to
how the Seq objects and BioSQL Seq objects work. This avoids any
overhead for the in-memory case, which will be more common, but
gives you a point to implement the useful database representation
code in the future. If you don't have time to work on all of this
right now, I'd leave the nested-set stuff out and keep it in mind as
a future addition.

Brad

From biopython at maubp.freeserve.co.uk  Thu Sep 24 08:48:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 13:48:37 +0100
Subject: [Biopython-dev] Git documentation on wiki
Message-ID: <320fb6e00909240548q4db8dfc1l83be8408d3b8718f@mail.gmail.com>

Hi all,

I think I have updated the relevant wiki pages about the CVS to git
migration. I have also made the "git" page redirect to the "Source
Code" page, which is the main access point. This now has a quick
summary with the basic links here for anyone wanting to grab the
latest code:

http://biopython.org/wiki/SourceCode

If anyone spots any errors or typos, feel free to fix them or raise
them here for discussion as needed.

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 10:42:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:42:08 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
	assertion in CondonTable Fix+Patch
In-Reply-To: <bug-2894-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241442.n8OEg8Xo012359@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 10:42 EST -------
I've actually installed Jython 2.5.0 and checked this. A further fix was
required, but this now works with the latest Biopython now in git.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 10:46:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:38 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkc1w012533@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 10:46 EST -------
Testing with Jython 2.5.0 shows my fix didn't work. Reopening...


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 10:46:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:49 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEknEX012555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


Bug 2895 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 10:46:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:53 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To: <bug-2893-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkrFK012570@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2893


Bug 2893 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 10:46:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:55 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To: <bug-2892-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkt93012582@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892


Bug 2892 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:11:22 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:22 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBM3q013469@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2890                        |
OtherBugsDependingO|2892, 2893, 2895            |
              nThis|                            |


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 11:11 EST -------
Removing dependencies on other Jython bugs - they don't block each other.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:11:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:25 -0400
Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython
In-Reply-To: <bug-2890-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBPYu013482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2890


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
OtherBugsDependingO|2891                        |
              nThis|                            |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:11:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:40 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBeug013513@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:11:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:42 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To: <bug-2893-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBgcU013525@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2893


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |



From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:11:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:45 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To: <bug-2892-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBj1e013540@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |



From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 12:10:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 12:10:30 -0400
Subject: [Biopython-dev] [Bug 2918] New: Entrez parser fails on Jython -
	XMLParser lacks SetParamEntityParsing
Message-ID: <bug-2918-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2918

           Summary: Entrez parser fails on Jython - XMLParser lacks
                    SetParamEntityParsing
           Product: Biopython
           Version: 1.52
          Platform: All
               URL: http://bugs.jython.org/issue1447
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
                CC: kellrott at ucsd.edu


I'm filing this as a bug report so we can track it, but the underlying issue is
a known Jython bug, http://bugs.jython.org/issue1447 (thanks Kyle for reporting
this already).

It can be shown just by running our unit test:

 ~/jython2.5.0/jython run_tests.py test_Entrez.py
test_Entrez ... FAIL
======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pjcock/repositories/biopython/Tests/test_Entrez.py", line 3443,
in test_journals
    record = Entrez.read(input)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/__init__.py", line 259,
in read
    record = handler.run(handle)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/Parser.py", line 85, in
run
    self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
AttributeError: 'XMLParser' object has no attribute 'SetParamEntityParsing'

...
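
One possible stopgap, sketched here purely as an assumption and not as what
Bio/Entrez/Parser.py does, is to test for the method before calling it, so
that a runtime whose XMLParser lacks SetParamEntityParsing degrades gracefully
instead of raising AttributeError. The helper name enable_param_entity_parsing
is hypothetical.

```python
from xml.parsers import expat


def enable_param_entity_parsing(parser):
    """Try to switch on parameter entity parsing; return True if it worked.

    Hypothetical helper for illustration, not part of Biopython.
    """
    setter = getattr(parser, "SetParamEntityParsing", None)
    if setter is None:
        # The parser object does not offer the method at all, which is the
        # situation reported for Jython 2.5.0 in the traceback above.
        return False
    # expat returns a non-zero value when the flag was accepted.
    return bool(setter(expat.XML_PARAM_ENTITY_PARSING_ALWAYS))


parser = expat.ParserCreate()
supported = enable_param_entity_parsing(parser)
```

Whether skipping the call is acceptable depends on how much the Entrez parser
relies on parameter entities in the DTDs; that question is not answered here.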



From biopython at maubp.freeserve.co.uk  Thu Sep 24 13:59:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 18:59:06 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090924120800.GJ13500@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<20090924120800.GJ13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909241059yfa43889w82c76cd7f2365dee@mail.gmail.com>

On Thu, Sep 24, 2009 at 1:08 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Eric and Peter;
> Looking forward to seeing the PhyloXML work merged into the main
> branch. Eric, thanks for posting the summary of where things are at.
>
>> > (1) 'phyloxml' uses a different object representation than the other two, so
>> > converting between those formats is not possible until Nexus.Trees is ported
>> > over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
>> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
>> that phyloxml allows very minimal trees, the reverse as well). It does look
>> like the best plan is to use the same tree objects for all three (updating
>> Bio.Nexus if possible).
>
> Agreed that this would be nice to have, but I'm not sure why it's
> blocking getting the base TreeIO framework and all of PhyloXML into
> the main branch. That's a major step forward from the format
> specific phylogenetic code we had before and gets us a portion of
> the way there.

If the Newick/Nexus TreeIO parsers return one object type while the
PhyloXML TreeIO parser returns another *incompatible* object type,
then we don't have a unified tree input/output framework. Furthermore,
if you did release this and then later standardise on a single tree object,
you'd break backwards compatibility. All in all, best avoided.

> Next up should be moving over Bio.Nexus to the new framework and
> then conversions, but this is another project. I think we should
> take this one step at a time.

What we could do in the short term is ignore Bio.Nexus.Trees, and
just leave it as is. Instead of having the Newick/Nexus TreeIO code
calling the old Bio.Nexus.Trees code, we just write some new code
(possibly based on old code) which will use Eric's new objects.

We could then (gradually, perhaps by adding a runtime option to
the Nexus parsing API) move Bio.Nexus over from using the old
Bio.Nexus.Trees code to the new TreeIO, and eventually deprecate
and then remove Bio.Nexus.Trees.

Peter

From eric.talevich at gmail.com  Thu Sep 24 23:54:05 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 24 Sep 2009 23:54:05 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
Message-ID: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>

Hello, Jaime,

Sorry I didn't respond directly to your earlier post -- I wrote half of an
e-mail, then realized I had no good suggestions on what to do so I scrapped
it.

My Tree and TreeIO code is basically a complete parser for the phyloXML
format, plus a few base classes extracted out in hopes of eventually
creating a unified set of format-independent objects, as in SeqIO and
AlignIO. Your code for working with trees looks much more complete than
mine, so if some of it can be incorporated into Biopython, I think that
would be great.

I see these issues with integration:
1. It's GPL, while Biopython uses a more permissive custom license
resembling the BSD and MIT licenses. Would you be willing and able to
relicense parts of your work for Biopython?

2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
require some compatibility fixes -- not a huge problem.

3. Scipy and numpy dependencies: Numpy is considered a semi-optional
dependency in Biopython, so if it can be imported on the fly by just the
functions that need it (hopefully no core ones), that would be best. If
not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
it would be better to make that an optional, on-the-fly import, too.

4. PyQt4 is a big package and I'm not sure it's as common in scientists'
Python installations as numpy and scipy, so if the underlying algorithms for
tree layout could be ported to Reportlab, matplotlib or PIL, that would be
ideal. I personally would like to be able to pair sequence snippets with the
leaves of a standard phylogram, so if you need me to do some additional work
to get this section ported to Biopython, I'd consider it time well spent.

5. Presumably, the tree object type in ETE is different from Bio.Tree or
Bio.Nexus, so porting the core tree manipulation code to Biopython would
require a substantial effort somewhere.

6. The PhylomeDB connector is cool, and browsing the source, looks like it
wouldn't require much effort at all to drop into Biopython.
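
Point 3 above, importing an optional dependency only inside the functions that
need it, could look roughly like the sketch below. It is written for current
Python rather than Py2.4, and both require_optional and branch_length_stats
are hypothetical names used only for illustration.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency at call time, not at module import.

    Hypothetical helper: re-raises ImportError with a clearer message.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            "%s requires the optional package %r" % (feature, module_name)
        ) from err


def branch_length_stats(lengths):
    """Illustrative function that needs numpy only when it is called."""
    numpy = require_optional("numpy", "branch_length_stats")
    data = numpy.asarray(lengths, dtype=float)
    return float(data.mean()), float(data.std())
```

With this layout the module itself imports cleanly without numpy; only a call
to branch_length_stats() would fail on an installation that lacks it.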

Thanks for letting us know about this.

Cheers,
Eric


On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:

> Hi,
>
> ( I'm the developer of ETE. )
> I agree that PyQt4 is a significant dependency. I chose it because the
> Qt4 QGraphicsScene environment offers many possibilities, like OpenGL
> rendering, unlimited image size, good performance, and good Python bindings.
> However, I am working on my code to allow the rendering algorithm to use any
> other graphical library, so you could render the same tree images using
> different backends. If you think this is useful for you, please let me know
> and we can think about how to integrate it with biopython.
> Regarding the GUI, it is not a standalone application but one more method
> within the Tree objects. The GUI can be started at any point of the
> execution and the main program will continue after you close it. I did it
> like this because I think it is quite useful for working within interactive
> python sessions.
>
> I develop a lot of  code around tree handling, so if you think I can help,
> please tell me.
> jaime.
>
>
>
>> > *Graphics*
>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave
>> > unlabeled nodes inconspicuous, so the resulting graphic is much cleaner,
>> > perhaps even usable. Plus, the nodes are now a pretty shade of blue.
>> > Still, it would be nice to have a Reportlab-based module in Bio.Graphics
>> > to print phylogenies in the way biologists are used to seeing them. Does
>> > anyone know of existing code that could be borrowed for this? I looked at
>> > ETE (announced on the main biopython list last week) and liked the
>> > examples, but it uses PyQt4 and a standalone GUI for display, which is a
>> > substantial departure from the Biopython way of doing things.
>>
>> I still haven't tracked down my old report lab code, but it wasn't object
>> orientated and would need a lot of work to bring up to standard...
>>
>> Peter
>
>
>
> --
> =========================
> Jaime Huerta-Cepas, Ph.D.
> CRG-Centre for Genomic Regulation
> Doctor Aiguader, 88
> PRBB Building
> 08003 Barcelona, Spain
> http://www.crg.es/comparative_genomics
> =========================
>
>

From eric.talevich at gmail.com  Fri Sep 25 00:34:17 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 25 Sep 2009 00:34:17 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>

Hi Peter,

Thanks for the feedback.

On Thu, Sep 24, 2009 at 5:57 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
> supported for formats that can represent multiple phylogenetic trees in a
> single file". Is that true, and if so why? For SeqIO and AlignIO you can
> use parse on a file with one entry, the iterator just returns one entry.
> Easy.
> This is important for allowing generic code (e.g. a loop) regardless of
> how many entries there are (one, many, or even zero).
>
>
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.
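
The behaviour being asked for can be sketched as follows, assuming a toy
format with one Newick string per line. The function bodies are illustrative
only and are not the Bio.TreeIO implementation: parse() yields however many
trees there are (zero, one or many), and read() is layered on top of it for
the exactly-one case.

```python
import io


def parse(handle):
    """Yield one tree per non-blank line (toy format, for illustration)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single tree in the handle, or raise ValueError."""
    iterator = parse(handle)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No trees found in handle") from None
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")


# The same loop-friendly call works for empty, single and multi-tree input.
counts = [len(list(parse(io.StringIO(text))))
          for text in ("", "(A,B);\n", "(A,B);\n(C,D);\n")]
```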

> On a more general note, you seem to be recreating the file/handle logic
> in each of the individual parsers. I think it would make much more sense
> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
> and Bio.TreeIO.write() functions *only* and have the underlying format
> specific code just use handles. This avoids the code duplication.
>
>
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best. One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
   tree = FooIO.parse(handle)

Not today, though.
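
For reference, a minimal sketch of what such a maybe_open() helper might look
like on a Python version that has the with statement and contextlib. Only the
name comes from the message above; the signature and behaviour are
assumptions.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def maybe_open(source, mode="r"):
    """Yield a handle given either a filename or an already open handle."""
    if isinstance(source, (str, os.PathLike)):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            handle.close()  # we opened it, so we close it
    else:
        yield source  # the caller's handle is left open


with maybe_open(io.StringIO("(A,B);")) as handle:
    text = handle.read()
```

The design choice here is that the helper closes only what it opened itself,
so callers who pass in their own handle keep control of its lifetime.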


> > (1) 'phyloxml' uses a different object representation than the other two,
> > so converting between those formats is not possible until Nexus.Trees is
> > ported over to Bio.Tree.
>
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
> would actually let you do phyloxml -> newick, and phyloxml -> nexus (and
> assuming that phyloxml allows very minimal trees, the reverse as well). It
> does look like the best plan is to use the same tree objects for all three
> (updating Bio.Nexus if possible).
>
>
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions but leave
the direct NexusIO and NewickIO equivalents intact until the port is
complete.


> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.
>

That's intentional, I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.

For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can probably be
reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)

from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)

> > (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> > make the original Nexus module write out trees that it didn't parse
> > itself. Help?
>
> To get the Newick tree, you can just call str(tree), which is basically
> what you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to
> be more complicated. You'll need to create a minimal Nexus file - have a
> look at the Bio.AlignIO.NexusIO code. An alternative to look at is having
> a hard coded nexus template, and just insert the tree as a Newick string
> (and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
>
>
OK, thanks, I'll give it a shot. I see some default Nexus template stuff in
Bio.Nexus.Nexus already.


> > *Tree*
> > The BaseTree module is meant to be the basis for Newick trees eventually,
> > so I'd like to get the design right with the minimum number of public
> > methods:
> >
> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and
> > filtering necessary for locating data and automatically adding
> > annotations to a tree. There's a 'terminal' argument for selecting
> > internal nodes, external nodes, or both, and I think this means
> > get_leaf_nodes() is unnecessary. I'm going to remove it if no one
> > protests.
>
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.
>

I think .find(terminal=True) will do the right thing and looks reasonably
simple, but as Brad said, this is a ridiculously common operation so finding
it in the API should be ridiculously easy. I'll rename this function to
get_leaves() and rename find() to findall() (to match ElementTree and make
it clear that it returns an iterable).
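
A rough sketch of how the two methods could relate, assuming a bare-bones
Node class invented for this example (it is not the BaseTree code):
findall() is the general iterator with an optional 'terminal' filter, and
get_leaves() is just the obvious shortcut on top of it.

```python
class Node:
    """Minimal stand-in for a tree node, for illustration only."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children

    def findall(self, terminal=None):
        """Yield nodes in preorder.

        terminal=True keeps only leaves, terminal=False only internal
        nodes, and terminal=None (the default) keeps everything.
        """
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            for node in child.findall(terminal):
                yield node

    def get_leaves(self):
        """Convenience wrapper for the most common query."""
        return list(self.findall(terminal=True))


# The tree ((A,B),C)
root = Node(children=[Node(children=[Node("A"), Node("B")]), Node("C")])
leaf_names = [leaf.name for leaf in root.get_leaves()]
```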


> > (3) I left room in each Node for the left and right indexes used by
> > BioSQL's nested-set representation. Now I'm doubting the utility of that
> > -- any Biopython function that uses those indexes would need to ensure
> > that the index is up to date, which seems tricky. Shall I remove all
> > mention of the nested-set representation, or try to support it fully?
>
> A partial implementation doesn't seem helpful, and wastes memory
> allocating unused properties. I would remove it from the base Node,
> but a full implementation might be useful for something (would it be
> possible via a subclass?).
>
> On a related point, do you think a BioSQL TaxonTree subclass is possible?
> i.e. Something mimicking the new Tree objects (as a subclass), but which
> loads data on demand from the taxon tables in a BioSQL database? This
> would provide a nice way to work with the NCBI taxonomy (once loaded
> into BioSQL), which is a very large tree. For an example use case, I might
> want to extract just the bacteria as a subtree, and save that to a file.
>
>
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.


Cheers,
Eric

From biopython at maubp.freeserve.co.uk  Fri Sep 25 05:59:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 10:59:08 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On a related point, do you think a BioSQL TaxonTree subclass is possible?
>> i.e. Something mimicking the new Tree objects (as a subclass), but which
>> loads data on demand from the taxon tables in a BioSQL database? This
>> would provide a nice way to work with the NCBI taxonomy (once loaded
>> into BioSQL), which is a very large tree. For an example use case, I might
>> want to extract just the bacteria as a subtree, and save that to a file.
>>
>
> Doing BioSQL integration was on the original roadmap, but research hasn't
> taken me back there lately. I would like to do it eventually... anyway, that
> would solve the indexing issue nicely. I'll drop the extra attributes -- I
> get the impression they're not meant to be accessed directly in BioSQL
> either, so there's no use for them in Biopython.

As things stand, there is no usage of the left/right index fields in
Biopython.

The current Biopython BioSQL code focusses on the database
variants of the Seq and SeqRecord objects. The only interaction
with the taxon tables is to load/retrieve the species annotations,
and for this we don't need the complications of the left/right index.
We leave them empty if we populate the taxonomy via Entrez
(recalculating the left/right values is computationally expensive).

However, any "DBTaxonTree" object (or whatever we call it) could
potentially offer us a way to (a) populate and (b) use these
alternative indexes as a way to speed up various subtree operations.

Peter

From biopython at maubp.freeserve.co.uk  Fri Sep 25 06:08:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 11:08:56 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250308s35a286e7x67a7bb3fec6a0673@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
>> supported for formats that can represent multiple phylogenetic trees in a
>> single file". Is that true, and if so why? For SeqIO and AlignIO you can
>> use parse on a file with one entry, the iterator just returns one entry.
>> This is important for allowing generic code (e.g. a loop) regardless of
>> how many entries there are (one, many, or even zero).
>
> I'll delete that sentence. I don't know why it's there -- you're right, it's
> easy to return an iterable regardless of what the format itself supports.

OK.

>> On a more general note, you seem to be recreating the file/handle logic
>> in each of the individual parsers. I think it would make much more sense
>> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
>> and Bio.TreeIO.write() functions *only* and have the underlying format
>> specific code just use handles. This avoids the code duplication.
>
> I did the handle management case-by-case because some of the underlying
> libraries already do filename-to-handle conversion -- ElementTree and
> Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
> ad-hoc handle management, but of course I can move it all to the top if you
> think it's best.

Having a single layer of handle/filename conversion in Bio.TreeIO does
seem cleanest to me (even if some of the back ends allow either) and
will ensure our code is consistent.

> One day, perhaps we'll have a context manager that we can
> reuse everywhere to make magic easy:
>
> with maybe_open(file) as handle:
>    tree = FooIO.parse(handle)
>
> Not today, though.

Not yet, no. For one thing we'll have to phase out Python 2.4 support.

>>> (1) 'phyloxml' uses a different object representation than the other two,
>>> so converting between those formats is not possible until Nexus.Trees
>>> is ported over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
>> would actually let you do phyloxml -> newick, and phyloxml -> nexus
>> (and assuming that phyloxml allows very minimal trees, the reverse
>> as well). It does look like the best plan is to use the same tree objects
>> for all three (updating Bio.Nexus if possible).
>
> I could comment out the 'nexus' and 'newick' lines from the
> supported_formats dict. That would disable the top-level functions
> but leave the direct NexusIO and NewickIO equivalents intact until
> the port is complete.

I guess shipping a "phyloxml" only Bio.TreeIO would work, but it
would be rather less useful. We could certainly start with just that
on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no
Bio.TreeIO.NexusIO modules - initially have just a single backend).

>> Note that Bio.Nexus.Trees still has some useful methods you don't
>> appear to support, like finding the last common ancestor and
>> distances between nodes.
>
> That's intentional, I was just going to port those methods directly from
> Bio.Nexus.Trees rather than invent a new API myself.

OK - sounds good.

> Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
> combined parsers and object representations. My goal is to chop out the
> pure-object parts and merge them into Bio.Tree, and let the remaining
> parsers return objects built from the new Bio.Tree classes. This looks like
> it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
> done.

Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has
been to separate the data object from the (many possible) parsers.

> For backward compatibility, I'll leave some wrappers that trigger
> DeprecationWarnings in the original places. Nexus.Trees can
> probably be reduced ...

Something like that, sure.

>>> (1) The find() function, named after the Unix utility that does the
>>> same thing for directory trees, seems capable of all the iteration
>>> and filtering necessary for locating data and automatically adding
>>> annotations to a tree. There's a 'terminal' argument for selecting
>>> internal nodes, external nodes, or both, and I think this means
>>> get_leaf_nodes() is unnecessary. I'm going to remove it if no one
>>> protests.
>>
>> I'm in two minds - iterating over the leaves (taxa) seems like a very
>> common operation, and having an explicit method for this might be
>> clearer than calling find with special arguments.
>
> I think .find(terminal=True) will do the right thing and looks reasonably
> simple, but as Brad said, this is a ridiculously common operation so
> finding it in the API should be ridiculously easy. I'll rename this function
> to get_leaves() and rename find() to findall() (to match ElementTree
> and make it clear that it returns an iterable).

OK.

Peter


From hlapp at gmx.net  Fri Sep 25 07:39:03 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 25 Sep 2009 07:39:03 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
	<320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>
Message-ID: <B877367C-9144-488C-B018-A8B5771051D1@gmx.net>


On Sep 25, 2009, at 5:59 AM, Peter wrote:

> On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>>
>>> On a related point, do you think a BioSQL TaxonTree subclass is possible?
>>> i.e. Something mimicking the new Tree objects (as a subclass), but which
>>> loads data on demand from the taxon tables in a BioSQL database? This
>>> would provide a nice way to work with the NCBI taxonomy (once loaded
>>> into BioSQL), which is a very large tree. For an example use case, I might
>>> want to extract just the bacteria as a subtree, and save that to a file.
>>>
>>
>> Doing BioSQL integration was on the original roadmap, but research hasn't
>> taken me back there lately. I would like to do it eventually... anyway, that
>> would solve the indexing issue nicely. I'll drop the extra attributes -- I
>> get the impression they're not meant to be accessed directly in BioSQL
>> either, so there's no use for them in Biopython.
>
> As things stand, there is no usage of the left/right index fields in
> Biopython.

The left/right fields are really a crutch for doing hierarchical  
(recursive) queries in SQL more efficiently. SQL doesn't have native  
support for recursive queries, and the left/right index values allow  
you to rewrite an otherwise recursive query as a single-hit set.

Within an object-oriented programming language that supports recursion  
these values are of no use - they don't let you traverse a tree faster  
than you would already be able to do through recursing up or down your  
tree data structure. If there's a natural order of nodes, you can  
speed up finding nodes through binary search. But for pulling out  
lineages or subtrees I doubt that this will help at all - it'll have  
to be your data structure (such as having double links) that makes  
those operations efficient.
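
To make the comparison concrete, here is a small sketch with an invented
four-taxon tree held in a plain dict. The function names are hypothetical and
this is not BioSQL or Biopython code. One depth-first walk assigns the
left/right values; a subtree is then every node whose interval lies inside
the parent's interval, which is the kind of single range test SQL can do
without recursion. The recursive version at the end returns the same nodes
without needing the indexes at all.

```python
def assign_nested_set(tree, root):
    """Return {node: (left, right)} from a single depth-first walk.

    tree maps each node name to a list of its children.
    """
    index = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in tree.get(node, []):
            visit(child)
        counter += 1
        index[node] = (left, counter)

    visit(root)
    return index


def subtree_by_index(index, node):
    """Select a subtree with one range comparison per row, no recursion."""
    left, right = index[node]
    return {n for n, (lo, hi) in index.items() if left <= lo and hi <= right}


def subtree_by_recursion(tree, node):
    """The same query done by simply walking the data structure."""
    found = {node}
    for child in tree.get(node, []):
        found |= subtree_by_recursion(tree, child)
    return found


tree = {"root": ["Bacteria", "Eukaryota"], "Bacteria": ["Ecoli", "Bsubtilis"]}
index = assign_nested_set(tree, "root")
bacteria = subtree_by_index(index, "Bacteria")
```

Note that any change to the tree invalidates the stored values, which is why
keeping them up to date is expensive.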

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri Sep 25 08:26:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 13:26:38 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
	<320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
	<320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>
Message-ID: <320fb6e00909250526s294eee65ubbc508136f26f48a@mail.gmail.com>

On Thu, Sep 24, 2009 at 11:23 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Sep 23, 2009 at 5:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Now, let's make sure all the documentation and the wiki etc is up to date,
>> and make an official announcement on the news server.
>
> How does this look for a draft news post (with links to wiki pages etc):
>
> The release of Biopython 1.52 earlier this week marked the end of an
> era, it was our last release using CVS for source code control. ...

I went ahead and posted something based on that draft:
http://news.open-bio.org/news/2009/09/biopython-cvs-to-git-migration/

Nice to see several more people have started following the github
repository already :)

Peter

From jhuerta at crg.es  Fri Sep 25 11:28:36 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Fri, 25 Sep 2009 17:28:36 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
Message-ID: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>

Hi Eric,

Thanks for your comments,
I really see a lot of potential parts in ETE that could be used from
biopython, however, for the moment, we would rather prefer not to modify
current ETE's  GPL license. As far as I know, the main difference between
GPL and BSD-like licenses is that, with the second, you could relicense the
code at any moment under any other policy, including private and close
licenses. GPL includes a protection for this by ensuring that any code based
on GPL sources must be always GPL compatible, and that's why we have chosen
it. Moreover, the use of a BSD-like license would prevent us to use a lot of
great GPL code out there.

It is not my purpose to open a debate about licenses. I just wonder if
Biopython could provide some way to link or bind external software, perhaps
as add-ons or plugins. This would be great, since many extra features (not
only from ETE but from other sources) could be added on demand. It would also
mitigate the problem of very specific dependencies, since many of them would
be optional. From my side, I could work on providing bindings between
Biopython and ETE's tree graphical rendering features, inline visualization
GUI, extended newick support, tree manipulation and the other methods within
the ETE package.
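
A minimal sketch of how an optional add-on could be bound without becoming a
hard dependency, assuming a plain try/import approach. The names
load_optional, draw_tree and hypothetical_tree_gui (with its show function)
are made up for illustration; they are not part of Biopython or ETE.

```python
import importlib


def load_optional(module_name):
    """Return the named module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def draw_tree(tree_text, backend_name="hypothetical_tree_gui"):
    """Render with an optional backend when present, else fall back to text.

    `hypothetical_tree_gui` and its `show` function are invented names.
    """
    backend = load_optional(backend_name)
    if backend is None:
        # Optional add-on missing: degrade gracefully instead of failing.
        return "text:" + tree_text
    return backend.show(tree_text)
```

With this shape the import only happens when the feature is requested, so
users without the add-on are unaffected.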

I will be out of the office for several weeks, but if you see any way to
collaborate I will be happy to discuss this in more detail...

Cheers!
Jaime

On Fri, Sep 25, 2009 at 5:54 AM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Hello, Jaime,
>
> Sorry I didn't respond directly to your earlier post -- I wrote half of an
> e-mail, then realized I had no good suggestions on what to do so I scrapped
> it.
>
> My Tree and TreeIO code is basically a complete parser for the phyloXML
> format, plus a few base classes extracted out in hopes of eventually
> creating a unified set of format-independent objects, as in SeqIO and
> AlignIO. Your code for working with trees looks much more complete than
> mine, so if some of it can be incorporated into Biopython, I think that
> would be great.
>
> I see these issues with integration:
> 1. It's GPL, while Biopython uses a more permissive custom license
> resembling the BSD and MIT licenses. Would you be willing and able to
> relicense parts of your work for Biopython?
>
> 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
> require some compatibility fixes -- not a huge problem.
>
> 3. Scipy and numpy dependencies: Numpy is considered a semi-optional
> dependency in Biopython, so if it can be imported on the fly by just the
> functions that need it (hopefully no core ones), that would be best. If
> not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
> it would be better to make that an optional, on-the-fly import, too.
>
> 4. PyQt4 is a big package and I'm not sure it's as common in scientists'
> Python installations as numpy and scipy, so if the underlying algorithms for
> tree layout could be ported to Reportlab, matplotlib or PIL, that would be
> ideal. I personally would like to be able to pair sequence snippets with the
> leaves of a standard phylogram, so if you need me to do some additional work
> to get this section ported to Biopython, I'd consider it time well spent.
>
> 5. Presumably, the tree object type in ETE is different from Bio.Tree or
> Bio.Nexus, so porting the core tree manipulation code to Biopython would
> require a substantial effort somewhere.
>
> 6. The PhylomeDB connector is cool, and browsing the source, looks like it
> wouldn't require much effort at all to drop into Biopython.
>
> Thanks for letting us know about this.
>
> Cheers,
> Eric
>
>
>
> On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:
>
>> Hi,
>>
>> ( I'm the developer of ETE. )
>> I agree that PyQt4 is an important dependence. I chose it because
>> Qt4-QGraphicsScene environment offers many possibilities like openGL
>> rendering, unlimited image size, performance, and good bindings to python.
>> However, I am working on my code to allow the rendering algorithm to use any
>> other graphical library. So, you could render the same tree images using
>> different backends. If you think this is useful for you, please let me know
>> and we can think how to integrate it with biopython.
>> Regarding the GUI, it is not a standalone application but one more method
>> within the Tree objects. The GUI  can be started at any point of the
>> execution and the main program will continue after you close it. I did it
>> like this because I think is quite useful for working within interactive
>> python sessions.
>>
>> I develop a lot of  code around tree handling, so if you think I can help,
>> please tell me.
>> jaime.
>>
>>
>>
>>> > *Graphics*
>>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave
>>> > unlabeled nodes inconspicuous, so the resulting graphic is much cleaner,
>>> > perhaps even usable. Plus, the nodes are now a pretty shade of blue.
>>> > Still, it would be nice to have a Reportlab-based module in Bio.Graphics
>>> > to print phylogenies in the way biologists are used to seeing them. Does
>>> > anyone know of existing code that could be borrowed for this? I looked at
>>> > ETE (announced on the main biopython list last week) and liked the
>>> > examples, but it uses PyQt4 and a standalone GUI for display, which is a
>>> > substantial departure from the Biopython way of doing things.
>>>
>>> I still haven't tracked down my old report lab code, but it wasn't object
>>> orientated and would need a lot of work to bring up to standard...
>>>
>>> Peter
>>>
>>
>>
>>
>> --
>> =========================
>> Jaime Huerta-Cepas, Ph.D.
>> CRG-Centre for Genomic Regulation
>> Doctor Aiguader, 88
>> PRBB Building
>> 08003 Barcelona, Spain
>> http://www.crg.es/comparative_genomics
>> =========================
>>
>>
>


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================

From eric.talevich at gmail.com  Fri Sep 25 11:51:15 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 25 Sep 2009 11:51:15 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
Message-ID: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>

Hi Jaime,

Just working on bindings would certainly be easier. The best way to transfer
tree information from Biopython to ETE would be serializing the trees in
phyloXML format (to preserve the annotations) and loading that file in ETE.
I see that ETE allows rich annotation of tree objects, but I don't see
phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information? If not, I think ETE
would benefit from a phyloXML parser. Since Biopython's license is
GPL-compatible (I believe), you could borrow Bio.TreeIO.PhyloXMLIO directly
and just port the Phylogeny and Clade classes to ETE's base classes instead
of Bio.Tree.BaseTree's Tree and Node classes.
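
As a rough illustration of the phyloXML hand-off described above, here is a
minimal sketch using only the standard library. It assumes a tree held as
nested dicts with optional "name", "branch_length" and "children" keys; that
layout and the function names are hypothetical, and this is not the
Bio.TreeIO.PhyloXMLIO code. Only the basic phyloxml, phylogeny, clade, name
and branch_length elements are written.

```python
import xml.etree.ElementTree as ET

PHYLOXML_NS = "http://www.phyloxml.org"


def clade_to_element(clade):
    """Convert a nested dict into a phyloXML <clade> element, recursively."""
    elem = ET.Element("clade")
    if "name" in clade:
        ET.SubElement(elem, "name").text = clade["name"]
    if "branch_length" in clade:
        ET.SubElement(elem, "branch_length").text = str(clade["branch_length"])
    for child in clade.get("children", []):
        elem.append(clade_to_element(child))
    return elem


def to_phyloxml(root_clade):
    """Serialize one rooted tree (nested dicts) to a phyloXML string."""
    root = ET.Element("phyloxml", {"xmlns": PHYLOXML_NS})
    phylogeny = ET.SubElement(root, "phylogeny", {"rooted": "true"})
    phylogeny.append(clade_to_element(root_clade))
    return ET.tostring(root, encoding="unicode")
```

A receiving package would then parse the same file and map each clade onto
its own node class.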

Beyond that, some support for BioSQL to store sequences etc. would also help
link ETE to any of the other Bio* projects. There's some example code in
Biopython's top-level BioSQL directory, if you're interested.

Cheers,
Eric


From jhuerta at crg.es  Fri Sep 25 12:13:44 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Fri, 25 Sep 2009 18:13:44 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
Message-ID: <c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>

Hi,


> Just working on bindings would certainly be easier. The best way to
> transfer tree information from Biopython to ETE would be serializing the
> trees in phyloXML format (to preserve the annotations) and loading that file
> in ETE. I see that ETE allows rich annotation of tree objects, but I don't
> see phyloXML or NeXML listed as supported file formats -- is there another
> standard format you're using to store this information?

Extended newick (http://www.phylosoft.org/NHX/) is the only rich format
currently supported by ETE. However, this standard only allows text string
representations of tree node annotations. Beyond this, you would need a
cPickle approach to save complex annotated trees. I'm certainly interested
in phyloXML and NeXML support, so this could be a nice starting point.
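
To make the text-only limitation concrete, here is a small sketch of how one
annotated node might be written in NHX style. The helper name nhx_leaf and
the dict of annotations are assumptions for illustration, not ETE's API.

```python
def nhx_leaf(name, branch_length, annotations):
    """Format one node as `name:length[&&NHX:key=value:...]`.

    Every annotation value goes through string formatting, which is the
    limitation described above: the original Python type is not kept.
    """
    tags = ":".join(
        "%s=%s" % (key, annotations[key]) for key in sorted(annotations)
    )
    comment = "[&&NHX:%s]" % tags if tags else ""
    return "%s:%s%s" % (name, branch_length, comment)
```

An integer such as a bootstrap value comes back as the text "100" when the
file is read again, so anything richer than a string needs another format.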

> If not, I think ETE would benefit from a phyloXML parser. Since Biopython
> license is GPL-compatible (I believe), you could borrow
> Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes
> to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes.
>
I think there is no problem in using BSD-licensed code from GPL sources; the
problem would be the other way around. So I will take a look at your
phyloXML code to find the best way to bind both packages through phyloXML
serialization.


> Beyond that, some support for BioSQL to store sequences etc. would also
> help link ETE to any of the other Bio* projects. There's some example code
> in Biopython's top-level BioSQL directory, if you're interested.
>
Ok. I'll take a look also. Thanks.

cheers,
Jaime.




-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================

From biopython at maubp.freeserve.co.uk  Fri Sep 25 12:22:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 17:22:40 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
	<c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>
Message-ID: <320fb6e00909250922y858c172xf1ee51f7673a4fe2@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:13 PM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:
>
> I think there is no problem in using BSD license from GPL sources, the
> problem would be in the other way around.
>

Yes, that way round is fine from a license point of view (taking Biopython's
BSD/MIT style licensed code and using it in a GPL project). But we can't
take your GPL code into Biopython unless you re-license it more liberally.

I can see the appeal of the (L)GPL for forcing the code to stay open, but
Biopython (like Python) went for the other option of basically letting anyone
use the code in any way they like.

Peter

From hlapp at gmx.net  Fri Sep 25 16:58:36 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 25 Sep 2009 16:58:36 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
Message-ID: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>


On Sep 25, 2009, at 11:28 AM, Jaime Huerta Cepas wrote:

> As far as I know, the main difference between GPL and BSD-like  
> licenses is that, with the second, you could relicense the code at  
> any moment under any other policy, including private and close  
> licenses.


This is not true. None of the open-source licenses that I'm aware of  
allows anyone to relicense code under a license that is less liberal,  
or to relicense code at all. It is the copyright owner who can  
relicense code, not the distributor.

One of the differences between GPL and BSD is that GPL is viral.  
Specifically, code that links to GPL-licensed code must also be GPL- 
licensed *when it is distributed.*

(It is a common misconception that GPL is unconditionally viral. I can  
take GPL code and link to it and keep my code closed source for as  
long as I please if I never redistribute it. GPL was written with  
software vendors in mind, whose business consists of distributing  
software for commercial gain. GPL has therefore sometimes been called  
anti-commercial. This is wrong, too, but I won't go into the details  
here.)

Biopython can freely utilize GPL-licensed (or closed source, for that  
matter) software if it doesn't link to it. IANAL but I think it can  
also redistribute GPL-licensed code along with Biopython so long as  
Biopython doesn't link to it, and it is made clear that some of the  
distribution falls under a different license than BSD. (Linux  
distributions mix BSD and GPL software, too.)

As for ETE itself, a BSD/MIT style license seems to be by far the most  
widely used license for Python modules. If you want to facilitate  
adoption of the software as a library by other programmers, GPL is  
going to stand in the way of that. Also, really all that you are  
accomplishing with GPL is that a software company can't take advantage  
of ETE. Is that your chief concern? GPL won't prevent any scientific  
lab from writing closed source code that builds on ETE and publishing  
the results, so long as they don't distribute their closed source code.


	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From chapmanb at 50mail.com  Fri Sep 25 17:48:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 25 Sep 2009 17:48:00 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
Message-ID: <20090925214800.GE29829@sobchak.mgh.harvard.edu>

Hi all;
Hilmar -- thanks for writing up a nice summary of the license
details. Jaime, I think it would be a shame to let these issues
prevent us from working together. It sounds like you and Eric have
some shared goals and it would be great to see that evolve into some
useful functionality in Biopython.

Generally, the BSD-like license which Biopython uses encourages
cooperation and keeps people in both academia and industry happy. As
scientists, our goal should be to avoid letting these types of issues
prevent collaboration. Truthfully, there is very little opportunity
for exploitation of bioinformatics software; the economics are just not
there for companies to sell code.

> (It is a common misconception that GPL is unconditionally viral. I can  
> take GPL code and link to it and keep my code closed source for as  
> long as I please if I never redistribute it. GPL was written with  
> software vendors in mind, whose business consists of distributing  
> software for commercial gain. GPL has therefore sometimes been called  
> anti-commercial. This is wrong, too, but I won't go into the details  
> here.)

I agree 100%, but in practical terms it is very difficult to have this
argument at a company. Speaking from experience, GPL creates all kinds
of nasty thoughts in people's heads which prevents adoption of code in
corporate environments. For Biopython and other bioinformatics projects,
we should be actively encouraging contributions from companies as
well as academia.

> Biopython can freely utilize GPL-licensed (or closed source, for that  
> matter) software if it doesn't link to it. IANAL but I think it can  
> also redistribute GPL-licensed code along with Biopython so long as  
> Biopython doesn't link to it, and it is made clear that some of the  
> distribution falls under a different license than BSD. (Linux  
> distributions mix BSD and GPL software, too.)

Yes, but this complication is bad. Let's keep it simple,
Brad

From bugzilla-daemon at portal.open-bio.org  Fri Sep 25 18:48:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Sep 2009 18:48:13 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909252248.n8PMmDa9028782@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1214 is|0                           |1
           obsolete|                            |


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-25 18:48 EST -------
(From update of attachment 1214)
Checked into git, leaving this bug open until we've run some more tests on
this.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sat Sep 26 07:36:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 26 Sep 2009 07:36:45 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909261136.n8QBajsI014127@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-26 07:36 EST -------
We'll also need to update the SeqIO GenBank output to record the CONTIG string
if present.



From hlapp at gmx.net  Sat Sep 26 11:25:41 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 26 Sep 2009 11:25:41 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
	<20090925214800.GE29829@sobchak.mgh.harvard.edu>
Message-ID: <C0D714C1-2EAA-46C5-95F6-CC814CB518AD@gmx.net>


On Sep 25, 2009, at 5:48 PM, Brad Chapman wrote:

> I agree 100%, but in practical terms it is very difficult to have this
> argument at a company.

Yes, I know.

> For Biopython and other bioinformatics projects, we should be  
> actively encouraging contributions from companies as well as academia.


Having worked in commercial and private sector for almost a decade, I  
couldn't agree more. There is a huge amount of open-source code  
development contributed by people working in the private sector, and  
which is hence sponsored by companies.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From jhuerta at crg.es  Sat Sep 26 13:12:59 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Sat, 26 Sep 2009 19:12:59 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
Message-ID: <c5882df30909261012r44773ef6s92c4f83efadf8b16@mail.gmail.com>

Hey! Sorry, it was not my intention to start a flame war about licences, nor
to sound rude. I apologize if I did.


>> As far as I know, the main difference between GPL and BSD-like licenses is
>> that, with the second, you could relicense the code at any moment under any
>> other policy, including private and close licenses.
>
> This is not true. None of the open-source licenses that I'm aware of allows
> anyone to relicense code under a license that is less liberal, or to
> relicense code at all. It is the copyright owner who can relicense code, not
> the distributor.

I'm not an expert on software licences, so I cannot go into this issue
very deeply. What I said in my previous email is what I understood
from this information: http://www.gnu.org/philosophy/license-list.html,
http://www.gnu.org/philosophy/categories.html#Non-CopyleftedFreeSoftware
If I was wrong and modified BSD-like sources cannot be relicensed under
other less liberal licenses, then we will consider a change of the
ETE license in the future.


> One of the differences between GPL and BSD is that GPL is viral.
> Specifically, code that links to GPL-licensed code must also be GPL-licensed
> *when it is distributed.*
>
> (It is a common misconception that GPL is unconditionally viral. I can take
> GPL code and link to it and keep my code closed source for as long as I
> please if I never redistribute it. GPL was written with software vendors in
> mind, whose business consists of distributing software for commercial gain.
> GPL has therefore sometimes been called anti-commercial. This is wrong, too,
> but I won't go into the details here.)
>
I see, so the only problem is about distribution...


> Biopython can freely utilize GPL-licensed (or closed source, for that
> matter) software if it doesn't link to it. IANAL but I think it can also
> redistribute GPL-licensed code along with Biopython so long as Biopython
> doesn't link to it, and it is made clear that some of the distribution falls
> under a different license than BSD. (Linux distributions mix BSD and GPL
> software, too.)
>
Yes, I agree. This is what I meant by Biopython add-ons. With this in mind,
Biopython could be aware of a lot of other software out there and benefit
from it. Is there any work along these lines in Biopython?


> As for ETE itself, a BSD/MIT style license seems to be the by far most
> widely used license for Python modules. If you want to facilitate adoption
> of the software as a library by other programmers, GPL is going to stand in
> the way of that. Also, really all that you are accomplishing with GPL is
> that a software company can't take advantage of ETE. Is that your chief
> concern?

Well, our intention was that code based on ETE sources (other tools or
improvements) would also be distributed/published as free software. We also
wanted to leave the door open to using other GPL software from ETE.


> GPL won't prevent any scientific lab from writing closed source code that
> builds on ETE and publishing the results, so long as they don't distribute
> their closed source code.

Yes. You are right. We don't want to avoid this.

In any case, thanks for your comments. I will try to find out more about
what you say and, if we have to modify something, we will do it. :)

cheers,
Jaime


>
>
>        -hilmar
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================

From jhuerta at crg.es  Sat Sep 26 13:28:02 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Sat, 26 Sep 2009 19:28:02 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
	<20090925214800.GE29829@sobchak.mgh.harvard.edu>
Message-ID: <c5882df30909261028y4c6026b3l4b5b5842bc99d512@mail.gmail.com>

Hi Brad,

> Jaime, I think it's a shame we would let these issues
> prevent working together. It sounds like you and Eric have some
> shared goals and it would be great to see that evolve into some
> useful functionality in Biopython.
>

Sure!! My only intention was to find the best way to contribute!
However, the "viral" GPL license was chosen for exactly this reason:
to encourage free software and academic scientific resources.
We have a lot of shared goals, so I trust we will find a happy way to
collaborate.

Jaime.


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================

From jblanca at btc.upv.es  Mon Sep 28 07:36:14 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 28 Sep 2009 13:36:14 +0200
Subject: [Biopython-dev] fpc and gff
Message-ID: <200909281336.14794.jblanca@btc.upv.es>

Sorry for the previous incomplete mail. :(

Hi:
I'm interested in parsing an fpc physical map and writing a gff3 file from it. 
That's done by the fpc people in bioperl and they go from fpc to gff2. I 
would like to do it in python.
I've written the fpc parser looking at the bioperl one. You can take a look 
at:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py

Now I have to create the gff structure and writer. I've been reading Brad's 
code regarding the GFF parser and writer. I would like to integrate my fpc 
work as much as possible with Biopython and, if you like it, we could add the 
fpc parser to Biopython in the future.
But I don't have a clear idea of the relation between GFF and SeqFeature. The 
main problem is the subfeature and the gff feature hierarchy. My take on that 
at the moment is to write a GFFFeature class similar to the gff feature with 
seqid, source, type, start, end, score, etc. and go from the fpc to 
GFFFeature objects. I know that this would not integrate nicely with 
Biopython. Could you give some hints on how to do it in a proper way?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From jblanca at btc.upv.es  Mon Sep 28 07:28:06 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 28 Sep 2009 13:28:06 +0200
Subject: [Biopython-dev] fpc and gff
Message-ID: <200909281328.06817.jblanca@btc.upv.es>

Hi:
I'm interested in parsing an fpc physical map and writing a gff3 file from it. 
That's done by the fpc people in bioperl and they go from fpc to gff2. I 
would like to do it in python.
I've written the fpc parser looking at the bioperl one. You can take a look 
at:
-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Mon Sep 28 07:52:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 12:52:56 +0100
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <200909281336.14794.jblanca@btc.upv.es>
References: <200909281336.14794.jblanca@btc.upv.es>
Message-ID: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>

On Mon, Sep 28, 2009 at 12:36 PM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Sorry for the previous incomplete mail. :(
>
> Hi:
> I'm interested in parsing an fpc physical map and writing a gff3 file from it.
> That's done by the fpc people in bioperl and they go from fpc to gff2. I
> would like to do it in python.
> I've written the fpc parser looking at the bioperl one. You can take a look
> at:
> http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py
>
> Now I have to create the gff structure and writer. I've been reading Brad's
> code regarding the GFF parser and writer. I would like to integrate my fpc
> work as much as posible with biopython and if you like it we could add the
> fpc to Biopython in the future.
> But I have not a clear idea on the relation between GFF and SeqFeature. The
> main problem is the subfeature and the gff feature hierarchy. My take on that
> at the moment is to write a GFFfeature class similar to the gff feature with
> seqid, source, type, start, end, score, etc. and go from the fpc to
> GFFFeature objects. I know that this would not integrate nicely with
> BioPython. Could you give some hint on how to do it in a proper way?
> Best regards,

Right now there isn't a "proper way" as Brad's GFF code hasn't
been integrated into Biopython yet.

I think Brad was thinking of using the SeqFeature object "as is" to hold
GFF features, with the sub-features list used for the hierarchy.

Michiel and I had suggested a simpler structure more faithful to the
GFF model might be useful (even if it was just a standardised tuple
of the start, end, strand, id, etc, and an annotation dictionary). For
the SeqIO interface, these GFF features would have to be turned
into normal SeqFeature objects of course.

Peter

From chapmanb at 50mail.com  Mon Sep 28 08:52:38 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Sep 2009 08:52:38 -0400
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
References: <200909281336.14794.jblanca@btc.upv.es>
	<320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
Message-ID: <20090928125238.GG29829@sobchak.mgh.harvard.edu>

Jose;
Glad you're interested in working on this. I'm happy to get the GFF3
writing up to speed for this task.

> > I'm interested in parsing an fpc physical map and writing a gff3 file from it.
[...]
> > But I have not a clear idea on the relation between GFF and SeqFeature. The
> > main problem is the subfeature and the gff feature hierarchy. My take on that
> > at the moment is to write a GFFfeature class similar to the gff feature with
> > seqid, source, type, start, end, score, etc. and go from the fpc to
> > GFFFeature objects. 

> Right now there isn't a "proper way" as Brad's GFF code hasn't
> been integrated into Biopython yet.

Yes, we still have some flexibility here since it hasn't been merged
into Biopython yet, so let's talk about what works best.

> I think Brad was thinking of using the SeqFeature object "as is" to hold
> GFF features, with the sub-features list used for the hierarchy.

What exists now takes an iterator of SeqRecord objects, and writes
each SeqFeature as a GFF3 line:

seqid -- SeqRecord ID
source -- Feature qualifier with key "source"
type -- Feature type attribute
start, end -- The Feature Location
score -- Feature qualifier with key "score"
strand -- Feature strand attribute
phase -- Feature qualifier with key "phase"

The remaining qualifiers are the final key/value pairs of the
attribute.

The hierarchy is represented as sub_features of the parent feature.
This handles any arbitrarily deep nesting of parent and child 
features.
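
A minimal sketch of the mapping described above, not the actual writer
code. The function name gff3_lines, the make_feature stand-in (used here
so the example runs without Biopython) and the "ID"/"Parent" handling
are assumptions for illustration. It also assumes 0-based, half-open
feature locations that need converting to GFF3's 1-based coordinates,
and qualifier values that need no escaping.

```python
from types import SimpleNamespace


def make_feature(ftype, start, end, strand, qualifiers, sub_features=()):
    """Hypothetical stand-in exposing the SeqFeature attributes used below."""
    return SimpleNamespace(
        type=ftype,
        location=SimpleNamespace(start=start, end=end),
        strand=strand,
        qualifiers=qualifiers,
        sub_features=list(sub_features),
    )


def gff3_lines(seqid, feature, parent_id=None):
    """Yield one GFF3 line for the feature, then lines for its sub_features."""
    quals = dict(feature.qualifiers)  # copy so the caller's dict is untouched
    # Reserved qualifier keys fill their own columns; "." marks a missing value.
    source = quals.pop("source", ["."])[0]
    score = quals.pop("score", ["."])[0]
    phase = quals.pop("phase", ["."])[0]
    strand = {1: "+", -1: "-"}.get(feature.strand, ".")
    if parent_id is not None:
        quals["Parent"] = [parent_id]
    # Remaining qualifiers become the ninth (attributes) column.
    attributes = ";".join(
        "%s=%s" % (key, ",".join(str(v) for v in values))
        for key, values in sorted(quals.items())
    )
    start = int(feature.location.start) + 1  # 0-based -> 1-based
    end = int(feature.location.end)
    yield "\t".join([seqid, source, feature.type, str(start), str(end),
                     str(score), strand, str(phase), attributes or "."])
    # Children point back at this feature through the Parent attribute.
    own_id = quals.get("ID", [None])[0]
    for sub in getattr(feature, "sub_features", []):
        for line in gff3_lines(seqid, sub, parent_id=own_id):
            yield line


exon = make_feature("exon", 10, 50, 1, {"source": ["demo"], "ID": ["exon1"]})
gene = make_feature("gene", 0, 100, 1, {"source": ["demo"], "ID": ["gene1"]},
                    [exon])
lines = list(gff3_lines("chr1", gene))
```

Because the function recurses over sub_features, the same code covers
any depth of nesting.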

There is some really basic code on the documentation page:

http://biopython.org/wiki/GFF_Parsing#Writing_GFF3

> Michiel and I had suggested a simpler structure more faithful to the
> GFF model might be useful - even if it was just a standardised tuple
> of the start, end, strand, id, etc, and an annotation dictionary). For
> the SeqIO interface, these GFF features would have to be turned
> into normal SeqFeature objects of course.

This could also be useful for a more lightweight representation. I
would rather see this type of representation with primary Python
types, as opposed to a GFFFeature specific class. The current
SeqRecord/SeqFeature implementation is relatively close to what 
a GFF specific class would be so there would be a lot of duplication
without saving much in terms of speed or memory.

Jose, let me know if you'd rather go with a SeqRecord approach or a
lightweight approach. If you provide a couple of examples of the
features you want to store, we can work through how to best
represent those in the GFF hierarchy and then the details of
prepping them for writing.

Brad

From biopython at maubp.freeserve.co.uk  Mon Sep 28 09:10:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 14:10:22 +0100
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <20090928125238.GG29829@sobchak.mgh.harvard.edu>
References: <200909281336.14794.jblanca@btc.upv.es>
	<320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
	<20090928125238.GG29829@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909280610q75f7bf4eqae49a1fb6d7eae38@mail.gmail.com>

On Mon, Sep 28, 2009 at 1:52 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Michiel and I had suggested a simpler structure more faithful to the
>> GFF model might be useful - even if it was just a standardised tuple
>> of the start, end, strand, id, etc, and an annotation dictionary). For
>> the SeqIO interface, these GFF features would have to be turned
>> into normal SeqFeature objects of course.
>
> This could also be useful for a more lightweight representation. I
> would rather see this type of representation with primary Python
> types, as opposed to a GFFFeature specific class. The current
> SeqRecord/SeqFeature implementations is relatively close to what
> a GFF specific class would be so there would be a lot of duplication
> without saving much in terms of speed or memory.

Indeed. Which is why I quite like the idea of a simple tuple of ints,
strings and a dict for the annotation (the final column of a GFF file).
This should also be fast for people dealing with big GFF files.
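
As a rough illustration of the "simple tuple" idea, a sketch only: the
function name parse_gff3_line and the exact tuple layout are
assumptions, nothing has been agreed. It assumes nine tab-separated
columns and does not undo any percent-encoding in the attributes.

```python
def parse_gff3_line(line):
    """Turn one GFF3 data line into a tuple of basic Python objects."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if not pair:
                continue  # tolerate a trailing semicolon
            key, _, value = pair.partition("=")
            attributes[key] = value.split(",")  # multi-valued attributes
    # Ints for the coordinates, strings elsewhere, a dict for the last column.
    return (seqid, source, ftype, int(start), int(end),
            score, strand, phase, attributes)


record = parse_gff3_line(
    "chr1\tdemo\texon\t11\t50\t.\t+\t.\tID=exon1;Parent=gene1\n")
```

Turning such a tuple into a SeqFeature would then be a separate step,
which is the natural break mentioned in this thread.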

The other plus point here is we can get this (GFF parsing/writing
using basic Python objects) into Biopython first, and then look at
the SeqIO side of things more carefully as a second merge. I may
be overly cautious but I want the resulting GFF <-> SeqRecord <->
GenBank/EMBL/etc mapping to try and follow established practice
as closely as possible, which will need lots of testing and probably
some tweaking of this mapping.

i.e. To me there is a natural break between the basics of GFF
parsing/writing, and the transformation into our existing object
models.

[This applies to all file formats in principle, but most are so simple
that it isn't really an issue worth worrying about.]

Peter

From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 15:37:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 15:37:21 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909281937.n8SJbLYq012300@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-28 15:37 EST -------
(In reply to comment #6)
> We'll also need to update the SeqIO GenBank output to record the CONTIG
> string if present.

Done, marking as fixed. Assuming there are no objections to the whole
approach (treating the CONTIG data as a string) that is...

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Mon Sep 28 16:09:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 21:09:12 +0100
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
Message-ID: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>

On Thu, Sep 24, 2009 at 12:39 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> My last couple of commits to github have been from a local clone
> of the *official* repository: http://github.com/biopython/biopython/
>
> This is a nice and simple work flow for small changes, and the
> history and github network graph are easy to understand:
> http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
> This seems like the easiest way to work for people used to CVS,
> and you don't need to bother with your own Biopython cloned
> repository on github (you just need a github account and
> collaborator status). I'll probably continue to do this in the short
> term.

This way of working (described above) is what I have been using
for the last week. If there are multiple developers working (or in
this case, developers using multiple machines), you can still get
interesting mini-branches and merges even like this. Have a look
at the Biopython github network diagram for today for a nice
simple example (which was accidental - but serves as a nice
illustration).

[I know for some of you the following discussion isn't needed,
but I think it is worth trying to explain -  even if just for me, to
make sure it is clear in my head what git is doing.]

In words, the main trunk was split, with a (trivial) change to the
tutorial done on one branch (me at work) and then two separate
commits on a separate branch (unit tests tweak, and GenBank
bug fix), again by me, but on my home computer. The two
branches were then merged into one.

Why did this happen? I was working on a local and very slightly
out of date copy of the repository at home, and made these
local commits. I then tried to push them to github. At that
point git gave me an error saying something else had been
committed in the meantime (in fact by me but on a different
computer) so my local repository was out of date. So I pulled
and merged the latest code from github (the tutorial change),
and then pushed this to github. Done. The merge was 100%
automatic because the files changed were independent.

Back on CVS, as these changes were on separate files, there
wouldn't have been any issue about merging.

Does it matter? No. But we can reduce the likelihood of these
baby branches and merges by getting into the habit of pulling
the latest code from github *before* making any local commits
(a sensible thing to do anyway).

[Did that make sense? On the one hand this is very simple,
but on the other hand, it is rather different to how I used to
think about the code history under CVS.]

Peter

From eric.talevich at gmail.com  Mon Sep 28 16:47:38 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 28 Sep 2009 16:47:38 -0400
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
	<320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
Message-ID: <3f6baf360909281347r32c39918s4a2c8a64cff44622@mail.gmail.com>

On Mon, Sep 28, 2009 at 4:09 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

>
> Does it matter? No. But we can reduce the likelihood of these
> baby branches and merges by getting into the habit of pulling
> the latest code from github *before* making any local commits
> (a sensible thing to do anyway).
>
>
If you've committed local changes while your repository is out of date and
want to avoid a baby branch, you can also use "git rebase origin/master" to
fix the history. (But probably, most developers will find it easier and
safer to leave the baby branches there.)

Extended example:

git checkout dev           # a development branch
# hack hack
git commit -a              # oops, we're out of sync
git checkout master        # a clean copy of upstream
git pull origin master     # updating like we should have earlier
git rebase master dev      # replay dev on top of master (leaves dev checked out)
git checkout master        # switch back before merging
git merge dev              # should be a fast-forward
git push


Cheers,
Eric

From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 17:01:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:01:08 -0400
Subject: [Biopython-dev] [Bug 2919] New: Writing SeqFeature qualifiers
Message-ID: <bug-2919-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

           Summary: Writing SeqFeature qualifiers
           Product: Biopython
           Version: 1.51
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: estrain at gmail.com


When writing SeqFeature qualifier key-value pairs, the output contains one
line for each character in the value, rather than simply printing the string.
The sample code at the bottom produces a genbank sequence file that illustrates
the problem.

If I create a qualifiers dictionary using "qualDict = dict(gene="geneA")",
the genbank output contains
     gene            1..6
                     /gene="g"
                     /gene="e"
                     /gene="n"
                     /gene="e"
                     /gene="A"


The offending code appears to be in the InsdcIO.py file, lines 482-483.
If I change

482: for value in values :
483:   self.write_feature_qualifier(key,value)

to

self.write_feature_qualifier(key,values)

then the function appears to work correctly.  

     gene            1..6
                     /gene="geneA"


###########################################################
## Sample code
###########################################################
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

qualDict = dict(gene="geneA")

my_seq = SeqRecord(Seq("ATGATC",IUPAC.ambiguous_dna),id="seq1")
my_seq.features.append((SeqFeature(FeatureLocation(0,6),type="gene",qualifiers=qualDict)))

out_handle = open("test.gbk","w")

SeqIO.write([my_seq],out_handle,"genbank")
out_handle.close()  # make sure the record is flushed to disk


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 17:22:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:22:32 -0400
Subject: [Biopython-dev] [Bug 2919] Writing SeqFeature qualifiers
In-Reply-To: <bug-2919-42@http.bugzilla.open-bio.org/>
Message-ID: <200909282122.n8SLMW8w014482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-28 17:22 EST -------
It was working as intended - for consistency with the GenBank (and other)
parsers, you were expected to use lists of strings as the feature qualifier
dictionary values (not just strings).
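
To illustrate why a plain string value produced one line per character,
and one way a writer could tolerate both forms, here is a sketch. It is
not the fix that was checked in, and the helper name
qualifier_values_as_list is made up for this example.

```python
def qualifier_values_as_list(value):
    """Return a qualifier value as a list of strings.

    A bare string is wrapped in a list rather than iterated, because
    iterating a string yields its individual characters.
    """
    if isinstance(value, str):
        return [value]
    return list(value)


# Looping directly over the string reproduces the reported behaviour:
per_character = [v for v in "geneA"]

# Both forms of the qualifier dictionary give the same single value:
from_string = qualifier_values_as_list(dict(gene="geneA")["gene"])
from_list = qualifier_values_as_list({"gene": ["geneA"]}["gene"])
```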

However, a similar request was made on the mailing list recently, and a fix
checked in (after Biopython 1.52 was released):

http://lists.open-bio.org/pipermail/biopython/2009-September/005585.html

Marking as fixed.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Tue Sep 29 12:41:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 29 Sep 2009 12:41:08 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909291641.n8TGf8HE011375@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-29 12:41 EST -------
Really fixed this time, tested on Jython 2.5.0 and 2.5.1rc3


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk  Wed Sep 30 11:27:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Sep 2009 16:27:03 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
Message-ID: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>

Hi all,

A few months back on the main mailing list, Cedar and I were
talking about taking a SeqRecord, and how to write out its reverse
complement to a file. The thread is archived here:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Cedar - I cc'd you, as I am not sure if you are on the dev list.
I expect this could get technical pretty quickly, so I wanted
to float this idea on the dev list first...

-----------------------------------------------------------------

So, the background to this discussion:

Unless there is some complicated annotation to transfer,
using Biopython as is, making a new SeqRecord using
the reverse complement sequence of the old SeqRecord
isn't very hard, see:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement

This has meant that generally the current status quo isn't
a problem (at least for me). However, what prompted me
to work on this issue was a real world example.

We have a draft genome where after doing a basic
annotation, it would make sense to flip the strands. I
want to be able to load our current GenBank file, apply
the reverse complement, and have all the annotated
features recalculated to match. With more and more
sequencing projects, this isn't such an odd thing to
want to do.

Dealing with the details of potentially complex locations
in SeqFeature objects isn't very nice, so I think it would
be useful to have this particular functionality built into
Biopython. It is also a small step towards making the
SeqRecord more Seq like (which in general seems a
good idea).

On Thu, Jun 25, 2009 at 12:20 AM, Peter wrote:
>
> What you are doing is fine - although personally I might wrap up the
> first line as a function, as done in the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement
>
> While we could add a reverse_complement() method to the SeqRecord
> (and other Seq methods, like translate etc), there is one big problem:
> What to do with the annotation. If your record used to have a name
> based on an accession or a GI number, then this really does not apply
> to the reverse complement (or a translation etc). We could do something
> arbitrary like adding an "rc_" prefix (or variants) but I think the only safe
> answer is to make the user think about this and do what is appropriate
> in their context. And as you have demonstrated, this can still be done
> in one line :)
>
> I make a habit of using this as a justification, but I feel the zen of
> Python "Explicit is better than implicit" applies quite well here.

I've been thinking about this on and off since then, and I still
maintain that for much of the annotation there is no easy answer.
For the sequence itself, the behaviour is well defined. For all
the annotation, there are three possible actions:
(a) User supplies a new value
(b) Reuse the old value
(c) No annotation (the default for a new SeqRecord)

We can do something sensible with the features (if present) and
it will probably make sense to copy but reverse any per-letter
annotation (if present).

On a github branch I have posted some experimental code
which adds a reverse_complement() method to the SeqRecord.
I propose to give the new reverse_complement() a set of
optional arguments (id, name, etc) following the same names
as the existing attributes (and __init__ arguments), allowing
the user to choose between these three actions.

Assuming the general scheme is popular, I'm quite open
to discussing changing these defaults. But for the first
implementation this is what I picked: For the id, name and
description I still lean towards making the user decide this,
and therefore the default is (c). Likewise for the annotations
dictionary and the database cross refs.

For the features and per-letter-annotation, I would opt to
make the default behaviour be to reuse the old data, option
(b) above. For the per-letter-annotation (the restricted
dictionary, letter_annotations) this just means reversing
each entry. For the features, this means reversing the
order of the features, switching their strands (if set), and
calculating the new coordinates (taking care of all the
possible fuzzy locations and sub-features).
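
To make the default behaviour for the sequence, the per-letter annotation and the features concrete, here is a minimal sketch. It is not the code on the github branch: plain (start, end, strand) tuples stand in for SeqFeature objects, the function name is made up for illustration, and fuzzy locations and sub-features are ignored.

```python
# Hypothetical helper; tuples stand in for SeqFeature objects.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_record(seq, letter_annotations, features):
    """Return (seq, letter_annotations, features) for the reverse strand.

    features is a list of (start, end, strand) tuples using Python style
    half-open coordinates; strand is +1, -1 or None (not set).
    """
    length = len(seq)
    new_seq = seq.translate(COMPLEMENT)[::-1]
    # Per-letter annotation: same values, reversed order.
    new_letters = {key: values[::-1] for key, values in letter_annotations.items()}
    new_features = []
    for start, end, strand in features:
        # Switch the strand only if it was set.
        new_strand = -strand if strand is not None else None
        # A feature ending at the old end starts at the new beginning.
        new_features.append((length - end, length - start, new_strand))
    # Reverse the order so features stay sorted along the new sequence.
    new_features.reverse()
    return new_seq, new_letters, new_features
```

The id, name, description, annotations dictionary and database cross references are deliberately left out of the sketch, matching default (c) above.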

The code is here if anyone wants to look at the
technical details:
http://github.com/peterjc/biopython/commits/seqrecords

Peter

From chapmanb at 50mail.com  Tue Sep  1 13:06:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Sep 2009 09:06:39 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
Message-ID: <20090901130639.GI75451@sobchak.mgh.harvard.edu>

Hi Peter;

[indexed dict usage]
> What file formats were you working on, and how many records?

It was a 100Mb fasta file with about 41,000 records. Nothing too
heavy but it worked great. The only change I made was to generalize
the record building line:
                
self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)

to allow an arbitrary function to be passed to define the
identifier, instead of defaulting to the first part of the line.
This is helpful for those fun NCBI ids
(gi|83029091|ref|XM_357633.3|) where other parts of the program only
have the accession number.
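
As an illustration of the kind of function that could be passed in, here is a minimal sketch. The function name is hypothetical, and it assumes the "gi|number|database|accession|" layout shown above; anything else falls back to the first word of the title.

```python
def accession_from_title(title):
    """Pick the accession out of an NCBI style FASTA title line."""
    # Default behaviour: the first whitespace-delimited word is the id.
    identifier = title.split(None, 1)[0]
    fields = identifier.split("|")
    # gi|83029091|ref|XM_357633.3| -> ['gi', '83029091', 'ref', 'XM_357633.3', '']
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return identifier
```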

> True. Have you got any bright ideas for a better name? While the
> index is in memory, the SeqRecord objects are not (unlike the
> original Bio.SeqIO.to_dict() function).
> 
> Or we have one function Bio.SeqIO.indexed_dict() which can
> either use an in memory index, OR an on disk index, offering
> the same functionality.

That's a nice idea -- provide some reasonable defaults based on file
size and type, and allow them to be over-ridden with function
params.

> >> Another option (like the shelve idea we talked about last month)
> >> is to parse the sequence file with SeqIO, and serialise all the
> >> SeqRecord objects to disk, e.g. with pickle or some key/value
> >> database. This is potentially very complex (e.g. arbitrary Python
> >> objects in the annotation), and could lead to a very large "index"
> >> file on disk. On the other hand, some possible back ends would
> >> allow editing the database... which could be very useful.
> >
> > My thought here was to use BioSQL and the SQLite mappings for
> > serializing. We build off a tested and existing serialization, and
> > also guide people into using BioSQL for larger projects.
> > Essentially, we would build an API on top of existing BioSQL
> > functionality that creates the index by loading the SQL and then
> > pushes the parsed records into it.
> 
> Using BioSQL in this way is a much more general tool than
> simply "indexing a sequence file". It feels like a sledgehammer
> to crack a nut. Also, do you expect it to scale well for 10 million
> plus short reads? It may do, but on the other hand it may not.

Agreed that it would introduce extra overhead for something like
short reads. If you are talking about serializing SeqRecords, it
would make sense to re-use what we have in BioSQL. If you are
talking about storing just file offsets, then a lightweight solution
makes more sense.

For me, the initial parse time to prepare an index is not as much of an
issue since it happens once while queries on it will happen multiple
times.

> Also while the current BioSQL mappings are "tried and tested",
> they don't cover everything, in particular per-letter-annotation
> such as a set of quality scores (something that needs addressing
> anyway, probably with JSON or XML serialisation).

Agreed, but the advantage is that improvements can feed back into
BioSQL, instead of work in parallel.

> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).

That makes sense. SQLite has in-memory caching which could help with
some of the decision making as it would handle writing and holding
in memory without having to reimplement that bit. Another file based
indexing scheme is the one in bx-python:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py

This is a bit more specific as it also handles queries based on
genomic intervals in addition to retrieving by file position. It may
be useful for looking at the underlying storage details.
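
A minimal sketch of the lightweight option, storing only FASTA file offsets in SQLite via the standard library. The table layout and function names are assumptions for illustration, not an agreed design; passing ":memory:" keeps the index in memory, while a file name puts it on disk.

```python
import sqlite3


def build_offset_index(fasta_path, db_path=":memory:"):
    """Scan a FASTA file once and store {record id: byte offset} in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets (key TEXT PRIMARY KEY, offset INTEGER)"
    )
    # Binary mode so that tell() gives true byte offsets on every platform.
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:
                    key = fields[0].decode("ascii")
                    conn.execute("INSERT INTO offsets VALUES (?, ?)", (key, offset))
    conn.commit()
    return conn


def lookup_offset(conn, key):
    """Return the stored offset for key, or raise KeyError like a dict."""
    row = conn.execute("SELECT offset FROM offsets WHERE key = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]
```

Duplicate identifiers would raise sqlite3.IntegrityError here because of the primary key, which seems a reasonable default for a dictionary-like interface.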

Brad


From biopython at maubp.freeserve.co.uk  Tue Sep  1 13:25:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:25:22 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [indexed dict usage]
>> What file formats were you working on, and how many records?
>
> It was a 100Mb fasta file with about 41,000 records. Nothing too
> heavy but it worked great.

Yeah, with just 41,000 keys and offsets the in memory dict would
be pretty small too. This is within the range of file sizes I expect
the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.
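
For reference, the in memory approach amounts to something like the following minimal sketch (FASTA only, with hypothetical function names; this is not the Bio.SeqIO.indexed_dict() code itself). Only the keys and integer offsets are held in memory, and a record is read from disk when requested.

```python
def index_fasta(path):
    """Map each record identifier to the byte offset of its '>' line."""
    index = {}
    # Binary mode keeps len(line) equal to the number of bytes on disk.
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                key = line[1:].split(None, 1)[0].decode("ascii")
                index[key] = offset
            offset += len(line)
    return index


def get_raw(path, index, key):
    """Return the raw text of one record by seeking to its offset."""
    with open(path, "rb") as handle:
        handle.seek(index[key])
        lines = [handle.readline()]  # the '>' title line
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```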

> The only change I made was to generalize the record building line:
>
> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>
> to allow an arbitrary function to be passed to define the
> identifier, instead of defaulting to the first part of the line.
> This is helpful for those fun NCBI ids
> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
> have the accession number.

Did your callback function get given the "title string" and return
the desired key?

I had wondered about this, but the only way for this to be general
(to work on all file formats) is for the callback function to be given
a SeqRecord object - which means having to fully parse the file
during the indexing, which ends up being *much* slower. We can
do this if you think it adds a lot of utility i.e. mimic the key_function
argument we already have on Bio.SeqIO.to_dict()

Peter


From biopython at maubp.freeserve.co.uk  Tue Sep  1 13:38:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:38:07 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
	<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
	<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
	<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
	<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
	<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
	<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
	<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
	<320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
Message-ID: <320fb6e00909010638v5c9cec06t66b24e1e755c46cb@mail.gmail.com>

On Fri, Aug 14, 2009 at 1:00 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>> Jose's code uses seek/tell which means it has to have a handle
>>> to an actual file. He also used binary read mode - I'm not sure if
>>> this was essential or not.
>>
>> Binary mode was not essential - opening an SFF file in default
>> mode also seemed to work fine with Jose's code.
>
> Having worked on this more, default mode or binary mode are fine.
> However, as you might expect, you can't use Python's universal
> read lines mode when parsing SFF files.

Just to clarify this for the record - on Unix you can parse an SFF file
opened in default mode ("r") or binary mode ("rb") but not universal
read line mode ("rU"). However, on Windows only binary mode works.
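
A check along these lines could give a clearer error message. This is only a sketch with a hypothetical function name, not the code on the github branch, and it is stricter than the Unix behaviour described above: it insists on binary mode everywhere, which is also the only sensible choice on Python 3 where text mode handles decode the bytes.

```python
def check_sff_handle_mode(handle):
    """Raise ValueError unless the handle looks safe for binary SFF data."""
    mode = getattr(handle, "mode", None)
    if not isinstance(mode, str):
        # No usable mode attribute (e.g. io.BytesIO); nothing to check.
        return
    if "U" in mode:
        raise ValueError("SFF files must not be opened in universal new lines mode.")
    if "b" not in mode:
        raise ValueError("SFF files must be opened in binary mode.")
```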

I've updated my SFF code on github to catch this (as otherwise the
error messages are rather cryptic).

Peter


From biopython at maubp.freeserve.co.uk  Tue Sep  1 13:56:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:56:26 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909010656h594e908cu246138d45442df45@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>
>Peter wrote:
>> Using BioSQL in this way is a much more general tool than
>> simply "indexing a sequence file". It feels like a sledgehammer
>> to crack a nut. Also, do you expect it to scale well for 10 million
>> plus short reads? It may do, but on the other hand it may not.
>
> Agreed that it would introduce extra overhead for something like
> short reads. If you are talking about serializing SeqRecords, it
> would make sense to re-use what we have in BioSQL.

I wasn't talking about serialising SeqRecord objects. I agree
there is (almost) no point implementing new serialisation code
when we already have BioSQL.

> If you are talking about storing just file offsets, then a lightweight
> solution makes more sense.

Indeed.

> For me, the initial parse time to prepare an index is not as much
> of an issue since it happens once while queries on it will happen
> multiple times.

It depends on the expected work load - if you are thinking about
indexing a local copy of GenBank, but only expect to pull out a
few (hundred) records, then the index time may be longer than
the total access time.

But in general, if we are talking about saving the index to a file
(which can then be reloaded) I would agree, the up front cost to
prepare the index isn't critical.

On the subject of how to store an index of file offsets on disk,
I think the old Biopython Martel/Mindy indexing code used to
create OBDA style indexes (either simple flat files or BDB based).
We should certainly consider these for cross project compatibility,
or perhaps introduce a new OBDA version which might use
something like SQLite internally instead?
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
http://lists.open-bio.org/pipermail/open-bio-l/2009-September/000567.html

Peter


From eoc210 at googlemail.com  Wed Sep  2 12:25:24 2009
From: eoc210 at googlemail.com (Ed Cannon)
Date: Wed, 2 Sep 2009 13:25:24 +0100
Subject: [Biopython-dev] OBO2OWL parser / converter
In-Reply-To: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
	<3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
Message-ID: <9e02410b0909020525w5cbf59dek46e0ab1b5144f8@mail.gmail.com>

Hi Hilmar,

My OBO2OWL parser is implemented based on Tirmizi & Miranker's paper titled
"OBO2OWL: Roundtrip between OBO and OWL" (ref. 1):
http://www.cs.utexas.edu/~hamid/pub/tirmizi-obo2owl-tr-06-47.pdf

After having looked at the link you sent me to the OBO2OWL mappings google
spreadsheet, it appears that there are some differences, which I'm looking
into at the minute.

Ref:

1. Syed Hamid Tirmizi and Daniel P Miranker. (2006). OBO2OWL: Roundtrip
between OBO and OWL. The University of Texas at Austin, Department of
Computer Sciences, Technical Report TR-06-47, October 2, 16 pages.


Cheers,

Ed

2009/8/31 Hilmar Lapp <hlapp at gmx.net>

> Hi Ed -
>
> is your converter operating in a way that is congruent with (or even
> utilizing) the mapping and the converter provided by the NCBO and Berkeley
> Ontology projects?
>
> http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
>
> If not, I'm not sure how beneficial it is for users to have multiple and
> possibly conflicting mappings.
>
>        -hilmar
>
>
> On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote:
>
>  Hi All,
>>
>> I would like to thank you guys for all your hard work and effort in making
>> biopython a great piece of open software.
>>
>> I would also like to introduce myself, my name is Ed Cannon, I am a
>> postdoc
>> at Cambridge University working in the fields of  chemo/bioinformatics and
>> semantic web technologies in the group of Peter Murray-Rust.
>>
>> Since a fair amount of my work involves ontologies, I have written an open
>> biomedical ontology (.obo) to web ontology language (.owl) converter. The
>> resultant file can be loaded and used from Protege. I was wondering if
>> this
>> software would be of any interest to  the biopython community? I have just
>> sent a pull request to biopython on github. The code is located at my
>> branch
>> on my account: http://github.com/eoc21/biopython/tree/eoc21Branch.
>>
>> Thanks,
>>
>> Ed
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>>
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>


From bugzilla-daemon at portal.open-bio.org  Wed Sep  2 15:24:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Sep 2009 11:24:19 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909021524.n82FOJ7U021693@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-02 11:24 EST -------
(In reply to comment #3)
> I can now parse the Roche SFF index, allowing fast random access to
> the reads. See:
> 
> http://github.com/peterjc/biopython/commits/index
> http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html
> 
> Peter

That branch now has support for SeqIO parsing, indexing and *writing* of
SFF files. The write support is still very new and needs more testing,
but is looking promising. Note that while currently I read the undocumented
Roche style SFF index block, I have not yet attempted to write out such an
index (probably unwise unless the format does get published?).

Also note that there is still scope for improvement for how the trimming
information is presented in the SeqRecord object (perhaps some kind of
masked SeqRecord/Seq as has been suggested on the mailing lists).

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed Sep  2 16:45:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Sep 2009 12:45:48 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909021645.n82GjmbA023923@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-02 12:45 EST -------
(In reply to comment #4)
> That branch now has support for SeqIO parsing, indexing and *writing* of
> SFF files. The write support is still very new and needs more testing,
> but is looking promising. Note that while currently I read the undocumented
> Roche style SFF index block, I have not yet attempted to write out such an
> index (probably unwise unless the format does get published?).

It now has a first attempt at writing a Roche style SFF index, which my code
will parse back again happily. I have not yet tried the resulting file with
the Roche SFF tools. Note that this does not preserve any Roche XML meta data.
Note also that the index is skipped if any of the record names are not 14 chars
long (which is true on all the Roche indexes I have looked at).




From bugzilla-daemon at portal.open-bio.org  Fri Sep  4 10:23:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Sep 2009 06:23:26 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909041023.n84ANQgj023187@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-04 06:23 EST -------
I've been working on the Roche SFF indexes, and via their tools have discovered
there are at least two index block formats used:

Most SFF files I have looked at have an index block which starts ".mft1.00"
(short for Manifest v1.00 is my guess) which holds both an XML "manifest" or
meta data, plus a read offset index.

You can also get SFF files where the index block starts ".srt1.00" (Short Read
Table v1.00 maybe?) which have just an index.

The index details themselves are the same in both cases, and support
arbitrary read name lengths. The offset is in base 255 (not 256), apparently so
that byte 255 (0xFF) can be used as a separator character. For typical Roche
SFF files, the read names are 14 characters, and the index uses 20 bytes per
read.




From bugzilla-daemon at portal.open-bio.org  Fri Sep  4 10:54:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Sep 2009 06:54:39 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <200909041054.n84AsdNe023921@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-04 06:54 EST -------
The Staden IO lib has references to ".srt1.00" (454 sorted v1.00) and also
another SFF index format, which starts ".hsh1.00" (hash table v1.00).

See files io_lib/progs/hash_sff.c and io_lib/open_trace_file.c from
http://sourceforge.net/projects/staden/

Scanning their code also confirms my base 255 deduction for the ".srt" indexes,
see function getuint4_255, and the use of 0xFF as a break character.
Interestingly they only expect 4 bytes for the offset (limiting this to almost
4GB SFF files). There is a fifth byte which is usually null, this could be a
name terminator (although this is not actually needed), or used for 4GB+ SFF
offsets.
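
The base 255 arithmetic can be sketched as follows. The function names are hypothetical, and both the byte order (most significant digit first) and the placement of the null byte between the name and the offset are assumptions made for illustration; they should be checked against real files or the Staden code before relying on them.

```python
def encode_base255(offset):
    """Encode an offset as four base 255 digits, most significant first."""
    if not 0 <= offset < 255 ** 4:
        raise ValueError("offset does not fit in four base 255 digits")
    digits = []
    for _ in range(4):
        offset, digit = divmod(offset, 255)
        digits.append(digit)
    # Every digit is in the range 0 to 254, so byte 0xFF never appears.
    return bytes(reversed(digits))


def decode_base255(data):
    """Inverse of encode_base255 for a bytes object of digits."""
    value = 0
    for byte in data:
        value = value * 255 + byte
    return value


def index_entry(name, offset):
    """Assumed layout: name, null byte, four offset bytes, 0xFF separator."""
    return name.encode("ascii") + b"\x00" + encode_base255(offset) + b"\xff"
```

With a 14 character read name this gives 14 + 1 + 4 + 1 = 20 bytes per read, and four base 255 digits cover offsets below 255**4 = 4228250625, i.e. just under 4GB.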




From biopython at maubp.freeserve.co.uk  Fri Sep  4 15:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 16:33:16 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>

Hi David,

[This is a continuation of a thread on the main list, but it is much more
suited to the dev list now.]

On Tue, Sep 1, 2009 at 11:38 PM, David Winter wrote:
> Peter wrote:
>> David - I would prefer we also put your new wrappers in
>> Bio.Emboss.Applications, and would be happy to look at adding
>> those to CVS now that Biopython 1.51 is out (I had forgotten
>> about them actually - so thanks for the reminder).
>>
>> Peter
>
> Hi Peter,
>
> I'd almost forgotten about them myself! I only put them in their own module
> because I had the PhyML wrapper as well and that's not an EMBOSS
> application.

I see you've done that on github. I had a look at merging this into CVS,
but had a few comments first.

I found you had a load of tabs in your file (please use 4 space indentation
in future). http://www.biopython.org/wiki/Contributing#Coding_conventions

I am unclear why you are subclassing _EmbossMinimalCommandLine
instead of _EmbossCommandLine since most (all?) of the new wrappers
use the "outfile" parameter. As I recall EMBOSS isn't fussy about the
presence of the equals sign (right now our wrappers mostly omit the
equals, but not all the time - which looks odd to me).

Also your code seems to be missing the __str__ / _validate changes
on the trunk.

And finally, I think you can add yourself to the copyright at the top of
the file for this work ;)

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep  4 17:22:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 4 Sep 2009 18:22:27 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
Message-ID: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>

On Tue, Sep 1, 2009 at 2:25 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>> Hi Peter;
>>
>> [indexed dict usage]
>>> What file formats were you working on, and how many records?
>>
>> It was a 100Mb fasta file with about 41,000 records. Nothing too
>> heavy but it worked great.
>
> Yeah, with just 41,000 keys and offsets the in memory dict would
> be pretty small too. This is within the range of file sizes I expect
> the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.
>
>> The only change I made was to generalize the record building line:
>>
>> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>>
>> to allow an arbitrary function to be passed to define the
>> identifier, instead of defaulting to the first part of the line.
>> This is helpful for those fun NCBI ids
>> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
>> have the accession number.
>
> Did your callback function get given the "title string" and return
> the desired key?
>
> I had wondered about this, but the only way for this to be general
> (to work on all file formats) is for the callback function to be given
> a SeqRecord object - which means having to fully parse the file
> during the indexing, which ends up being *much* slower. We can
> do this if you think it adds a lot of utility i.e. mimic the key_function
> argument we already have on Bio.SeqIO.to_dict()

A less flexible option is a callback function which maps the default
record.id to a new key. This would solve your NCBI FASTA issue,
and might be handy in other settings (e.g. removing the version
suffix in GenBank identifiers). However, it would not allow for
example switching to a completely different identifier (e.g. the GI
number) which is present elsewhere in the file.

The point is we can support this kind of limited key_function
without suffering the severe speed penalty which doing a full
parse to give SeqRecord objects would impose.
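
As a rough sketch of this limited key_function idea (the names below are hypothetical, and a plain {id: offset} dict stands in for the real index), the callback only ever sees the default identifier string:

```python
def strip_version(identifier):
    """Example mapping: 'XM_357633.3' -> 'XM_357633'."""
    return identifier.rsplit(".", 1)[0]


def rekey_index(index, key_function=None):
    """Apply key_function to each default id in an {id: offset} dict."""
    if key_function is None:
        return dict(index)
    new_index = {}
    for identifier, offset in index.items():
        key = key_function(identifier)
        # Two ids mapping to one key would silently lose a record.
        if key in new_index:
            raise ValueError("Duplicate key %r" % key)
        new_index[key] = offset
    return new_index
```

In practice the function would be applied while scanning the file rather than afterwards, but the cost is the same: one cheap string operation per record, with no SeqRecord objects created.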

How does that sound Brad? It should add just a little complexity
to the current code, and allows some neat tricks. Or we can
leave things as they are (KISS).

Peter


From mjldehoon at yahoo.com  Sat Sep  5 08:17:00 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 5 Sep 2009 01:17:00 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez.parse
Message-ID: <339938.48242.qm@web62405.mail.re1.yahoo.com>

Hi everybody,
Recently I was trying to parse a huge Entrez XML file containing Entrez gene records. Because of the size of the file, Entrez.read failed with a memory error since it could not keep the entire information in the XML file in memory. I decided to add a parse() function to Bio.Entrez that can iterate over such large files. This function is useful if the XML file essentially contains a list of records; the parse() function is a generator function that returns these records one by one.
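
The general idea of such a generator can be sketched with the standard library alone. This is not the Bio.Entrez implementation: the function name and the record_tag argument are assumptions, it yields raw ElementTree elements rather than Entrez record objects, and it assumes the records are direct children of the root element.

```python
import xml.etree.ElementTree as ET


def parse_records(handle, record_tag):
    """Yield one element per record instead of loading the whole file."""
    context = ET.iterparse(handle, events=("start", "end"))
    _, root = next(context)  # the enclosing list element
    for event, element in context:
        if event == "end" and element.tag == record_tag:
            yield element
            # Detach finished records from the root to free memory.
            root.clear()
```

Because the records are produced one at a time, memory use depends on the largest single record rather than on the size of the file.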

--Michiel.


From p.j.a.cock at googlemail.com  Sat Sep  5 12:59:09 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 5 Sep 2009 13:59:09 +0100
Subject: [Biopython-dev] Bio.Entrez.parse
In-Reply-To: <339938.48242.qm@web62405.mail.re1.yahoo.com>
References: <339938.48242.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00909050559p2c9da2f1o60905ac3dfe0cb35@mail.gmail.com>

On Sat, Sep 5, 2009 at 9:17 AM, Michiel de Hoon<mjldehoon at yahoo.com> wrote:
> Hi everybody,
> Recently I was trying to parse a huge Entrez XML file containing Entrez gene
> records. Because of the size of the file, Entrez.read failed with a memory
> error since it could not keep the entire information in the XML file in memory.
> I decided to add a parse() function to Bio.Entrez that can iterate over such large
> files. This function is useful if the XML file essentially contains a list of records;
> the parse() function is a generator function that returns these records one by one.

That sounds excellent - I'd noticed that usually Bio.Entrez.read() would return
a list of (large nested) records, so this should be a natural extension.

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep  7 11:56:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 12:56:17 +0100
Subject: [Biopython-dev] Anonymous CVS working again :)
Message-ID: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>

Just an FYI,

While the developer server dev.open-bio.org has been fine, recently
our public read only mirror at cvs.open-bio.org (and cvs.biopython.org)
had not been updated. This affected Biopython and EMBOSS.

And for Biopython as a knock on effect, this had meant the latest
code at http://biopython.org/SRC/biopython/ was a little out of date.
[Biopython's github mirror was not affected]

These all seem to be working fine once again - thanks to someone
at the OBF - let me know who and I'll buy you a beer when we (next)
meet up :)

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep  7 17:34:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 18:34:53 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
Message-ID: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>

On Fri, Sep 4, 2009 at 4:33 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi David,
>
> [This is a continuation of a thread on the main list, but it is much more
> suited to the dev list now.]
>
> ...
>
> I see you've done that on github. I had a look at merging this into CVS,
> but had a few comments first.
>
> I found you had a load of tabs in your file (please use 4 space indentation
> in future). http://www.biopython.org/wiki/Contributing#Coding_conventions

Thanks.

> I am unclear why you are subclassing _EmbossMinimalCommandLine
> instead of _EmbossCommandLine since most (all?) of the new wrappers
> use the "outfile" parameter. As I recall EMBOSS isn't fussy about the
> presence of the equals sign (right now our wrappers mostly omit the
> equals, but not all the time - which looks odd to me).

I see you've switched to _EmbossCommandLine - fine.

> Also your code seems to be missing the __str__ / _validate changes
> on the trunk.

Also fixed, thanks.

> And finally, I think you can add yourself to the copyright at the top of
> the file for this work ;)

Cool.

I have checked this into CVS, but did also fix an old typo (in a docstring)
and one new typo (in an argument name). Thanks David!

Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
based on test_Emboss.py? Continuing on the github branch is fine.

We should put you in the CONTRIB file now too (are there any other
recent people we've missed?). Would you like to give a webpage, or
is this email address fine (be warned it may get harvested for spam)?

Thank you,

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep  7 20:00:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 21:00:46 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To: <B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
	<B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
Message-ID: <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>

> Are these being kept in sync? bioperl's moved completely away from
> cvs to svn with very little pain. We found sync-ing the two more trouble
> than it was worth.

Perhaps we are talking at cross purposes here Chris.

Right now Biopython and EMBOSS are using CVS, with developers
committing to dev.open-bio.org, which then updates a read only CVS
mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
to provide anonymous access.

Likewise, BioPerl etc are using SVN, with developers committing to
dev.open-bio.org, which then updates a read only SVN mirror at
code.open-bio.org (or its other aliases) to provide anonymous access.

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep  7 21:26:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 22:26:26 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To: <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
	<B86883FE-D894-4933-81AE-7D5452075301@illinois.edu>
	<320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>
	<9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
Message-ID: <320fb6e00909071426w1dfed95bx703384b3227eee6b@mail.gmail.com>

On Mon, Sep 7, 2009 at 9:44 PM, Chris Fields<cjfields at illinois.edu> wrote:
> On Sep 7, 2009, at 3:00 PM, Peter wrote:
>
>>> Are these being kept in sync? bioperl's moved completely away from
>>> cvs to svn with very little pain. We found sync-ing the two more trouble
>>> than it was worth.
>>
>> Perhaps we are talking at cross purposes here Chris.
>>
>> Right now Biopython and EMBOSS are using CVS, with developers
>> committing to dev.open-bio.org, which then updates a read only CVS
>> mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
>> to provide anonymous access.
>>
>> Likewise, BioPerl etc are using SVN, with developers committing to
>> dev.open-bio.org, which then updates a read only SVN mirror at
>> code.open-bio.org (or its other aliases) to provide anonymous access.
>>
>> Peter
>
> Right, I understand that, but you also have a git repo on github (unless I'm
> mistaken). Based on that I assume you plan on migrating over to dev git
> and/or github eventually, but I'm unsure of the future of the CVS repo.

Right! For now, CVS changes are pushed to github. Once we move to
git, the CVS repo will no longer be used, and will be left frozen in time.

> My point was, we had been in a similar situation. We had thought of having
> a sync'ed CVS <-> SVN repo at one point, but it was way too much trouble to
> deal with and just dropped CVS altogether after the migration. Instead, we
> just started switching all docs over to point to svn instead with lots of
> ample warning on the mail lists, and it all worked out in the end (we have
> had very few users inquiring about CVS).

Likewise, we could have git changes pushed into CVS, but there
is little point. We plan to just quit using CVS.

Peter


From david.winter at gmail.com  Mon Sep  7 22:54:52 2009
From: david.winter at gmail.com (David Winter)
Date: Tue, 08 Sep 2009 10:54:52 +1200
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>	
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>	
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>	
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
Message-ID: <4AA58F3C.6080200@student.otago.ac.nz>

Hi Peter and all,

Sorry for the lack of communication from me on this. I successfully made it
off the grid for the weekend, then found I couldn't push to github from
work (no ssh over the proxy for students) and couldn't email the list
from home (can't use the uni's SMTP from off campus) - IT-security
catch 22!

> I see you've switched to _EmbossCommandLine - fine.
>
>   
Yeah, this was my stupid fault - you'd given me a heads up about the two
different versions of the _EmbossCommandLine and I tried out what I
already had with the 'normal' version and saw that it failed, but
didn't read the error message properly (of course it failed because I
was trying to give it the outfile parameter twice...)

> [... snip the other things you asked about...]
>
>
> I have checked this into CVS, but did also fix an old typo (in a docstring)
> and one new typo (in an argument name). Thanks David!
>
> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
> based on test_Emboss.py? Continuing on the github branch is fine.
>   
Sounds good, will have a go at getting something going in the next 
couple of days

> We should put you in the CONTRIB file now too (are there any other
> recent people we've missed?). Would you like to give a webpage, or
> is this email address fine (be warned it may get harvested for spam)?
>
>   
Well, I'm not sure it's much of a contribution from me, but thanks :) 
Perhaps add david.winter at gmail.com - gmail seems to handle spam pretty 
well and I won't be a student here for ever (right?...)


Cheers,
David


From biopython at maubp.freeserve.co.uk  Tue Sep  8 09:21:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 10:21:11 +0100
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <320fb6e00909080221m7377f033ue9b1617b0bc38f5b@mail.gmail.com>

On Mon, Sep 7, 2009 at 11:54 PM, David Winter<david.winter at gmail.com> wrote:
> Hi Peter and all,
>
> Sorry for the lack of communication from me on this. I successfully made it
> off the grid for the weekend, then found I couldn't push to github from work
> (no ssh over the proxy for students) and couldn't email the list from home
> (can't use the uni's SMTP from off campus) - IT-security catch 22!

Tricky.

>> I see you've switched to _EmbossCommandLine - fine.
>
> Yeah, this was my stupid fault - you'd given me a heads up about the two
> different versions of the _EmbossCommandLine and I tried out what I already
> had with the 'normal' version and saw that it failed, but didn't read the
> error message properly (of course it failed because I was trying to give it
> the outfile parameter twice...)

OK - I wondered if there was some other reason I couldn't see, so worth
checking.

>> I have checked this into CVS, but did also fix an old typo (in a
>> docstring) and one new typo (in an argument name). Thanks David!
>>
>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>
> Sounds good, will have a go at getting something going in the next
> couple of days

Great - whenever you get time. Thanks!

>> We should put you in the CONTRIB file now too (are there any other
>> recent people we've missed?). Would you like to give a webpage, or
>> is this email address fine (be warned it may get harvested for spam)?
>
> Well, I'm not sure it's much of a contribution from me, but thanks :)

But I'm expecting more in future *grin*

> Perhaps add david.winter at gmail.com - gmail seems to handle spam
> pretty well and I won't be a student here for ever (right?...)

There is always a postdoc ;)

Also can someone remind me at some point that we should include
at least one of the EMBOSS PHYLIP tools in the alignment command
line bit of the tutorial...

Peter


From chapmanb at 50mail.com  Tue Sep  8 12:14:05 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 8 Sep 2009 08:14:05 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
	<320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
Message-ID: <20090908121405.GF63266@sobchak.mgh.harvard.edu>

Hi Peter;

[... callback function for specifying an ID ...]
> > Did your callback function get given the "title string" and return
> > the desired key?
> >
> > I had wondered about this, but the only way for this to be general
> > (to work on all file formats) is for the callback function to be given
> > a SeqRecord object - which means having to fully parse the file
> > during the indexing, which ends up being *much* slower. We can
> > do this if you think it adds a lot of utility i.e. mimic the key_function
> > argument we already have on Bio.SeqIO.to_dict()
> 
> A less flexible option is a callback function which maps the default
> record.id to a new key. This would solve your NCBI FASTA issue,
> and might be handy in other settings (e.g. removing the version
> suffix in GenBank identifiers). However, it would not allow for
> example switching to a completely different identifier (e.g. the GI
> number) which is present elsewhere in the file.
> 
> The point is we can support this kind of limited key_function
> without suffering the severe speed penalty which doing a full
> parse to give SeqRecord objects would impose.

This is a great compromise. You're right, parsing the SeqRecord is too
much, and allowing manipulation of the default identifier would work fine.
If people need to do something much more complicated to get the ID
they would probably be better off extending the existing classes and
writing a custom indexer that pulls the IDs they need.
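
As a rough illustration of the limited key function discussed above, here
is a minimal sketch of a callback that maps the default identifier to a new
key. The function name and the keep_version parameter are hypothetical (they
are not part of Biopython), and it assumes the NCBI style layout
gi|number|database|accession| seen in the example identifier earlier in this
thread:

```python
def accession_from_ncbi_id(identifier, keep_version=True):
    """Map an NCBI style FASTA identifier to its accession.

    Hypothetical helper for illustration only, not a Biopython function.
    Identifiers which do not look like gi|number|db|accession| are
    returned unchanged.
    """
    fields = identifier.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return identifier
    accession = fields[3]
    if not keep_version:
        # Drop the ".3" style version suffix, e.g. XM_357633.3 -> XM_357633
        accession = accession.split(".", 1)[0]
    return accession
```

Because it only looks at the identifier string, a callback like this never
needs a parsed SeqRecord, which is the point of the compromise.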

Brad


From biopython at maubp.freeserve.co.uk  Tue Sep  8 13:22:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:22:35 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090908121405.GF63266@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
	<20090901130639.GI75451@sobchak.mgh.harvard.edu>
	<320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
	<320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
	<20090908121405.GF63266@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:14 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [... callback function for specifying an ID ...]
>
>> A less flexible option is a callback function which maps the default
>> record.id to a new key. This would solve your NCBI FASTA issue,
>> and might be handy in other settings (e.g. removing the version
>> suffix in GenBank identifiers). However, it would not allow for
>> example switching to a completely different identifier (e.g. the GI
>> number) which is present elsewhere in the file.
>>
>> The point is we can support this kind of limited key_function
>> without suffering the severe speed penalty which doing a full
>> parse to give SeqRecord objects would impose.
>
> This is a great compromise. You're right, parsing the SeqRecord is too
> much, and allowing manipulation of the default identifier would work fine.

Cool - done in CVS, including the docstring and the tutorial.

> If people need to do something much more complicated to get the ID
> they would probably be better off extending the existing classes and
> writing a custom indexer that pulls the IDs they need.

Certainly - we can't expect to cover every possible use case, and
trying to do so will result in an overly complicated API.

Did you have any ideas for a better name than Bio.SeqIO.indexed_dict()?

Peter


From mjldehoon at yahoo.com  Tue Sep  8 13:30:30 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Sep 2009 06:30:30 -0700 (PDT)
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
Message-ID: <184931.66541.qm@web62403.mail.re1.yahoo.com>

--- On Tue, 9/8/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Did you have any ideas for a better name than
> Bio.SeqIO.indexed_dict()?
> 
Is indexed_dict a function? If so, I suggest we use a verb instead of a noun. Maybe just "index"?

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Sep  8 13:53:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:53:36 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <184931.66541.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>

On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon<mjldehoon at yahoo.com> wrote:
> --- On Tue, 9/8/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Did you have any ideas for a better name than
>> Bio.SeqIO.indexed_dict()?
>
> Is indexed_dict a function? If so, I suggest we use a verb instead
> of a noun. Maybe just "index"?
>
> --Michiel.

Bio.SeqIO.indexed_dict() is a function which returns a dictionary-like
object. So yes, a verb would be better, and "index" is short and sweet.

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 13:24:41 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 09:24:41 -0400
Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be
	deepcopied
In-Reply-To: <bug-2781-42@http.bugzilla.open-bio.org/>
Message-ID: <200909091324.n89DOf4Q013555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2781


klaus.kopec at tuebingen.mpg.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME


------- Comment #2 from klaus.kopec at tuebingen.mpg.de  2009-09-09 09:24 EST -------
this seems to be resolved in 1.51 with Python 2.6.2 under 64Bit Ubuntu?


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 15:18:01 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:01 -0400
Subject: [Biopython-dev] [Bug 2910] New: Parsing some pdb files results in
	shorter peptide sequences than expected
Message-ID: <bug-2910-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

           Summary: Parsing some pdb files results in shorter peptide
                    sequences than expected
           Product: Biopython
           Version: 1.49
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: schafer at rostlab.org


Parsing the one-letter sequence for a specific chain out of a given pdb file
often seems to result in shorter sequences than expected. 

The following code demonstrates this behavior for structure 1a2d chain A.
Amino acid #118 (VAL), after the HETATM block (#117), is missing in the result.

------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())

print sequence
------------------CODE----------------

Another example is structure 13gs chains C and D. Both sequences are ECG; the
code above, however, returns only CG.
So this behavior seems to be independent of whether a HETATM block is present.
This bug is also present in version 1.51.




From bugzilla-daemon at portal.open-bio.org  Wed Sep  9 15:18:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:48 -0400
Subject: [Biopython-dev] [Bug 2910] Parsing some pdb files results in
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909091518.n89FImn5016415@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


schafer at rostlab.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schafer at rostlab.org




From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 12:55:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:55:03 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101255.n8ACt3Jd017456@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
            Summary|Parsing some pdb files      |Bio.PDB build_peptides
                   |results in shorter peptide  |sometimes gives shorter
                   |sequences than expected     |peptide sequences than
                   |                            |expected


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:55 EST -------
Retitled as this appears to be a bug in the PPBuilder build_peptides method,
not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code
parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.

Chain A
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
>From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Chain B
Raw chain:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA
>From peptide builder:
CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA

Notice there are discontinuities in both chains A and B, and a missing residue
in their peptides.

And the output from 13GS,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948.
PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033.

Chain A
Raw chain:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
>From peptide builder:
MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain B
Raw chain:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
>From peptide builder:
PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

Chain C
Raw chain:
ECG
>From peptide builder:
CG

Chain D
Raw chain:
ECG
>From peptide builder:
CG

Notice there are discontinuities in chains A, B and C, but missing residues in
the peptide chains C and D. This suggests the discontinuities are not required
to trigger the problem. Also there are no HETATM residues for chains C and D.
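
To locate the dropped residues without comparing the long strings by eye,
the two sequences can be diffed. This is only a sketch using the standard
library difflib module; the helper name missing_residues is made up for
this example, and the positions it reports are 1-based offsets into the raw
chain string, not PDB residue numbers:

```python
from difflib import SequenceMatcher


def missing_residues(raw, built):
    """Return (position, residues) pairs found in raw but not in built.

    Hypothetical helper for illustration. Positions are 1-based offsets
    into the raw string. Inside a run of identical letters the reported
    position is ambiguous, since any one of them could be the lost one.
    """
    matcher = SequenceMatcher(None, raw, built, autojunk=False)
    gaps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean raw[i1:i2] did not survive.
        if tag in ("delete", "replace"):
            gaps.append((i1 + 1, raw[i1:i2]))
    return gaps


# Chain C of 13GS, using the two strings from the output above.
print(missing_residues("ECG", "CG"))
```

For chains C and D of 13GS this reports the leading E as the lost residue.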




From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 12:57:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:13 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
	assertion in CondonTable Fix+Patch
In-Reply-To: <bug-2894-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvDe1017562@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:57 EST -------
I'm marking this as a duplicate of bug 2887, and believe it to be fixed on the
trunk.

*** This bug has been marked as a duplicate of bug 2887 ***




From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 12:57:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:16 -0400
Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in
	Bio.Data.CodonTable
In-Reply-To: <bug-2887-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvGRn017574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2887


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kellrott at ucsd.edu


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-10 08:57 EST -------
*** Bug 2894 has been marked as a duplicate of this bug. ***




From bugzilla-daemon at portal.open-bio.org  Thu Sep 10 12:57:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:20 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909101257.n8ACvKL9017592@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


Bug 2895 depends on bug 2894, which changed state.

Bug 2894 Summary: Jython List difference causes failed assertion in CondonTable Fix+Patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2894

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE




From biopython at maubp.freeserve.co.uk  Tue Sep 15 13:51:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 14:51:43 +0100
Subject: [Biopython-dev] Another Biopython release?
Message-ID: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>

Hi all,

Looking ahead, Tiago has some population genetics code he hopes to
merge into the trunk at the end of the month (or in October), and we
still have Brad's GFF stuff, my SFF work, Kristian's RNA code, Kyle's
misc suggestions, and perhaps most importantly the phylogenetics
GSoC work to consider.

I know it's been only a month since we released Biopython 1.51, but
does anyone (other than me) think that we already have enough done
to warrant another release? The associated CVS freeze would also
serve as a good break point for moving to github (see other threads).

Here is what we have in the NEWS file at the moment:

<quote>
New helper functions Bio.SeqIO.convert() and Bio.AlignIO.convert() allow an
easier way to use Biopython for simple file format conversions. Additionally,
these new functions allow Biopython to offer important file format specific
optimisations (e.g. FASTQ to FASTA, and interconverting FASTQ variants).

New function Bio.SeqIO.indexed_dict() allows indexing of most sequence file
formats (but not alignment file formats), allowing dictionary like random
access to all the entries in the file as SeqRecord objects, keyed on the
record id. This is especially useful for very large sequencing files, where
all the records cannot be held in memory at once. This supplements the more
flexible but memory demanding Bio.SeqIO.to_dict() function.

Bio.SeqIO can now write "phd" format files (used by PHRED, PHRAP and
CONSED), allowing interconversion with FASTQ files, or FASTA+QUAL files.

Bio.Emboss.Applications now includes wrappers for the "new" PHYLIP EMBASSY
package (e.g. fneighbor) which replace the "old" PHYLIP EMBASSY package
(e.g. efneighbor) whose Biopython wrappers are now obsolete.

See also the DEPRECATED file, as several old deprecated modules have finally
been removed (e.g. Bio.EUtils which had been replaced by Bio.Entrez).
</quote>

[As an aside - Cymon and David - do you want to be named in the NEWS
file for the PHD and PHYLIPNEW stuff?]

We're still debating the name of the new function Bio.SeqIO.indexed_dict(),
but I am happy with the code (and new documentation) otherwise. The
related extension to support indexing via a lookup file or an SQLite
database is another big chunk of work which I don't have time for at the
moment, but the code already in CVS is still extremely useful as is.
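
For anyone wondering what "the index is in memory but the records are not"
means in practice, here is a minimal sketch of the general idea for FASTA
files only. It is not the code in CVS; the two function names are made up
for this example, and it assumes the key is the first word of the title
line:

```python
def build_fasta_offsets(path):
    """Map each record id (first word after '>') to its byte offset.

    Illustrative sketch only. Only the small id -> offset dictionary is
    kept in memory, never the sequences themselves.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:
                    offsets[fields[0].decode("ascii")] = offset
    return offsets


def fetch_fasta_record(path, offset):
    """Return (title, sequence) for the record starting at offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        title = handle.readline()[1:].strip().decode("ascii")
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip().decode("ascii"))
    return title, "".join(chunks)
```

Building the dictionary needs one pass over the file; after that each
lookup is a seek plus reading a single record, so memory use stays
proportional to the number of records rather than the file size.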

Again, I'm biased, but I think the Bio.SeqIO.convert(...) function will be
a popular addition for its convenience, and especially valuable for anyone
wanting to convert between the different FASTQ variants, where the optimised
conversion code gives a big speed-up.
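
To give a flavour of why a format specific shortcut can be so much faster,
here is a rough sketch of FASTQ to FASTA done directly on the text, without
building any record objects or decoding the quality scores. This is not the
Biopython code; the function name is made up, and it assumes the simple
four line FASTQ layout with no line wrapping:

```python
def fastq_to_fasta(in_handle, out_handle):
    """Copy titles and sequences from FASTQ to FASTA, return record count.

    Illustrative sketch only. Assumes each record is exactly four lines
    (@title, sequence, +, quality); the quality line is read and ignored.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        in_handle.readline()  # quality string, not needed for FASTA
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Input does not look like four line FASTQ")
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip("\n"), seq.rstrip("\n")))
        count += 1
    return count
```

Any pair of text mode handles will do, for example two open files or
io.StringIO objects when experimenting.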

Does doing another quick release (say at some point next week) sound
like a good plan? If people like the idea, then getting some extra testing
in now would be great - especially on the new stuff (it has unit tests of
course, but real world usage is also important - thanks Brad for already
trying out the FASTA indexing).

Peter


From bartek at rezolwenta.eu.org  Tue Sep 15 14:59:43 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 15 Sep 2009 16:59:43 +0200
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
Message-ID: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>

On Tue, Sep 15, 2009 at 3:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I know it's been only a month since we released Biopython 1.51, but
> does anyone (other than me) think that we already have enough done
> to warrant another release? The associated CVS freeze would also
> serve as a good break point for moving to github (see other threads).
>

That would be great. As for the move to github, I've added some (quite
preliminary) docs for developers on how to make commits to the main
branch using git and github to the wiki:
http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

Any comments and/or improvements are most welcome.

cheers
  Bartek


From tiagoantao at gmail.com  Tue Sep 15 15:29:55 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 15 Sep 2009 16:29:55 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
Message-ID: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>

On Tue, Sep 15, 2009 at 2:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> Looking ahead, Tiago has some population genetics code he hopes to

I can put my stuff in CVS (plus I have docs). Question: CVS is still
"the place". Right?

I just need to test stuff on Windows. All the rest seems ok.

Tiago


From biopython at maubp.freeserve.co.uk  Tue Sep 15 15:35:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:35:13 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
Message-ID: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
> On Tue, Sep 15, 2009 at 2:51 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> Looking ahead, Tiago has some population genetics code he hopes to
>
> I can put my stuff in CVS (plus I have docs). Question: CVS is still
> "the place". Right?
>
> I just need to test stuff on Windows. All the rest seems ok.

Yes, for the short term CVS is still the master repository. If you
have that stuff ready to check in now, then sure - go ahead.
I was assuming you didn't expect to have this ready just yet,
hence the proposal to sneak out a quick release first ;)

Give me a shout and I'll get my Windows test machine up
and running to double check the unit tests there.

Maybe we'll push back the "next week" idea a bit ;)

Peter


From eric.talevich at gmail.com  Tue Sep 15 15:38:45 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 15 Sep 2009 11:38:45 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>

>
> On Tue, Sep 15, 2009 at 3:51 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > Hi all,
> >
> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).
> >
>

Sounds good to me. Completing the Git migration would make it much easier
for me to maintain the Tree/TreeIO stuff, since I already have a few local
branches based on it that an upstream CVS duplication would mangle.


On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski <
bartek at rezolwenta.eu.org> wrote:

> That would be great. As for the move to github, I've added some (quite
> preliminary) docs for developers on how to make commits to the main
> branch using git and github to the wiki:
> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
>
The setup here for committers looks potentially different from the setup in
"Merging upstream changes" (describing read-only tracking), but also
potentially similar. Diff:
- The github:biopython/biopython repository is called "official" here, but
"upstream" there. Different protocol too, but that's intentional.
- It also shows how to treat the upstream/official repo as the origin,
CVS-style. This would mean the developer doesn't have a separate GitHub fork
to use for personal branches, uncertain commits, etc. that don't belong in
the main repo.

Maybe a good way to organize the page would be in terms of how you want to
use the repo:

1. Tracking Biopython with raw Git (without signing up for GitHub)
   - git clone http://.../biopython/biopython
   - remote: upstream
   - how to format a patch and submit on Bugzilla

2. Tracking Biopython on GitHub (e.g. occasional contributors)
   - sign up, click the "fork" button
   - git clone http://.../your-name-here/biopython
   - remotes: origin, upstream
   - how to submit a pull request on GitHub
   - how to add, manage and delete branches locally and on GitHub

3. Collaborating
   - either #1 or #2 is fine
   - how to add and manage more remotes
   - how to apply Git patches, and why copy/paste kills kittens the next
time you merge

4. Committing to Biopython
   - same as #2, but use the private URL for the "upstream" remote
   - remotes: origin, upstream
   - policy on pushing upstream, code reviews, tagging, etc.


Cheers,
Eric


From tiagoantao at gmail.com  Tue Sep 15 15:39:07 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 15 Sep 2009 16:39:07 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
Message-ID: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>

2009/9/15 Peter <biopython at maubp.freeserve.co.uk>:
> Give me a shout and I'll get my Windows test machine up
> and running to double check the unit tests there.

I think I am not in the mood to impose the burden on you. I will find
a Windows machine and test it myself.

> Maybe we'll push back the "next week" idea a bit ;)

I am OK with "next week". But as I said two months ago, I have
calendarized the extension of Bio.PopGen to October. So the material
can go on the next release after the one on "next week".

I just want to have lots of free time and little travel to be able to
assist potential users (as I intend to announce the new content to the
evolutionary biology crowd quite a lot)

-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From biopython at maubp.freeserve.co.uk  Tue Sep 15 15:48:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:48:43 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
Message-ID: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
> 2009/9/15 Peter <biopython at maubp.freeserve.co.uk>:
>> Give me a shout and I'll get my Windows test machine up
>> and running to double check the unit tests there.
>
> I think I am not in the mood to impose the burden on you. I will find
> a Windows machine and test it myself.

I was just going to turn on the machine, update to the latest
CVS, and do a compile/test with Python 2.4, 2.5, 2.6 - It's no
extra effort, as I would be doing this anyway for a new release.

Unless of course you are adding wrappers for more command
line tools, which would ideally require me to install them - that
I might leave for another day ;)

>> Maybe we'll push back the "next week" idea a bit ;)
>
> I am OK with "next week". But as I said two months ago, I have
> calendarized the extension of Bio.PopGen to October. So the material
> can go on the next release after the one on "next week".
>
> I just want to have lots of free time and little travel to be able to
> assist potential users (as I intend to announce the new content to the
> evolutionary biology crowd quite a lot)

If you are happy to merge the code this week (via CVS), and
confident it is ready to release, then I could do the release
next week, and then we move to git.

Or, I can do the release next week, we move to git, and then
you can merge the new code (via git) at your leisure (Oct).

Either plan is fine with me. Which do you prefer?

Peter


From tiagoantao at gmail.com  Tue Sep 15 15:57:17 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 15 Sep 2009 16:57:17 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
	<320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
Message-ID: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>

> Unless of course you are adding wrappers for more command
> line tools, which would ideally require me to install them - that
> I might leave for another day ;)

Spot on ;) .

> If you are happy to merge the code this week (via CVS), and
> confident it is ready to release, then I could do the release
> next week, and then we move to git.

I will only be able to test the code on Windows tomorrow, if I can get
hold of the machine (which I should).

> Either plan is fine with me. Which do you prefer?


I prefer merging on CVS, I am still much more proficient with it. You
should have the merge there on Friday morning when you arrive.
Tutorial included.

Tiago


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From biopython at maubp.freeserve.co.uk  Tue Sep 15 16:09:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 17:09:32 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
	<320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>
	<6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com>
	<320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com>
	<6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com>
Message-ID: <320fb6e00909150909x2f45e0f5g6c4da77eafcd9a49@mail.gmail.com>

2009/9/15 Tiago Antão <tiagoantao at gmail.com>:
>> Unless of course you are adding wrappers for more command
>> line tools, which would ideally require me to install them - that
>> I might leave for another day ;)
>
> Spot on ;) .

OK.

>> If you are happy to merge the code this week (via CVS), and
>> confident it is ready to release, then I could do the release
>> next week, and then we move to git.
>
> I will only be able to test the code on Windows tomorrow, if
> I can get hold of the machine (which I should).

Fingers crossed this doesn't throw any surprises at you.

>> Either plan is fine with me. Which do you prefer?
>
> I prefer merging on CVS, I am still much more proficient with it. You
> should have the merge there on Friday morning when you arrive.
> Tutorial included.

OK then :)

Peter


From bartek at rezolwenta.eu.org  Tue Sep 15 19:45:22 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 15 Sep 2009 21:45:22 +0200
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
Message-ID: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>

On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Sounds good to me. Completing the Git migration would make it much easier
> for me to maintain the Tree/TreeIO stuff, since I already have a few local
> branches based on it that an upstream CVS duplication would mangle.
>
Then maybe we should wait with committing your changes until the time
we drop CVS, in order to avoid loss of change history in your code...
What do you think, Peter?

>
> On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski <
> bartek at rezolwenta.eu.org> wrote:
>
>> That would be great. As for the move to github, I've added some (quite
>> preliminary) docs for developers on how to make commits to the main
>> branch using git and github to the wiki:
>> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>>
>>
> The setup here for committers looks potentially different from the setup in
> "Merging upstream changes" (describing read-only tracking), but also
> potentially similar. Diff:
> - The github:biopython/biopython repository is called "official" here, but
> "upstream" there. Different protocol too, but that's intentional.

Yes, indeed. I know this might seem strange but I was trying to
deliberately make the distinction between the main repository in
read-write mode (official) and in read-only mode (upstream). I would
keep it like this at least for a while so that the transition from CVS
is as easy as possible. We have quite a few developers who are new to
git and comfortable with CVS.

> - It also shows how to treat the upstream/official repo as the origin,
> CVS-style.
Yes, exactly.
> This would mean the developer doesn't have a separate GitHub fork
> to use for personal branches, uncertain commits, etc. that don't belong in
> the main repo.
Not necessarily. It just means that these two roles are separate: a
developer can (but does not have to) have his own branch of biopython
tree where he/she makes the changes, but this is not directly linked
to the official (read-write) biopython branch. I know it's not
necessarily the best way to use github, but I would like to avoid
getting people used to CVS confused. That's why I decided to describe
the role of developer with read-write access differently.

BTW, I would see the role of the GitUsage wiki page as a guide rather
than a law. That means that if someone understands better how to use
git and github and does not get lost having both local and remote
branches with different origins, I'm absolutely fine with this.
But I think it is quite complicated, especially for people new to git.

So, in summary, my idea was to (currently) recommend somewhat CVS-like
usage of git on the main branch, which would be simple for people to
use at first and encourage them to create their own branches and do
development on them.

>
> Maybe a good way to organize the page would be in terms of how you want to
> use the repo:
>
> 1. Tracking Biopython with raw Git (without signing up for GitHub)
>    - git clone http://.../biopython/biopython
>    - remote: upstream
>    - how to format a patch and submit on Bugzilla
>
> 2. Tracking Biopython on GitHub (e.g. occasional contributors)
>    - sign up, click the "fork" button
>    - git clone http://.../your-name-here/biopython
>    - remotes: origin, upstream
>    - how to submit a pull request on GitHub
>    - how to add, manage and delete branches locally and on GitHub
>
> 3. Collaborating
>    - either #1 or #2 is fine
>    - how to add and manage more remotes
>    - how to apply Git patches, and why copy/paste kills kittens the next
> time you merge
>
> 4. Committing to Biopython
>    - same as #2, but use the private URL for the "upstream" remote
>    - remotes: origin, upstream
>    - policy on pushing upstream, code reviews, tagging, etc.
>
>

Having such documentation would be nice. I think that it is currently
structured more or less like that (now we just don't have #1 and #4
currently recommends a very simple CVS-like usage). I think that
adding #1 and putting in place policies on how to submit patches would
be great. For #4 I would vote for recommending (at least for a while)
the CVS-like way, but I'm absolutely for the development of the
alternative procedure, where the developer works with a single repo
both on his code and on official branch.

I don't want to underestimate the git skills of our current
developers, but so far I think only a few people have gotten their
github accounts, which means the simpler we keep it the better (at
least for a while). I certainly hope that people will get used to git
quickly, but I would like to make the initial change for people who will
be switching from CVS to git as simple as possible.


cheers
  Bartek


From biopython at maubp.freeserve.co.uk  Tue Sep 15 20:25:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 21:25:00 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
	<8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
Message-ID: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>

On Tue, Sep 15, 2009 at 8:45 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
>> Sounds good to me. Completing the Git migration would make it much easier
>> for me to maintain the Tree/TreeIO stuff, since I already have a few local
>> branches based on it that an upstream CVS duplication would mangle.
>
> Then maybe we should wait with committing your changes to the
> time we drop CVS, in order to avoid loss of change history in your
> code... What do you think, Peter?

Yes, I was suggesting getting a final CVS release out soon,
and then look at merging all the new stuff (including Eric's
tree stuff) starting to pile up on github.

I knew Tiago has a lump of code ready to go, and as we have
just discussed, he would prefer to check that in via CVS.
So, Tiago will do that (this Friday), then we'll do the final CVS
release next week, and then switch to git - and start to focus
on merging in new stuff.

Peter


From chapmanb at 50mail.com  Wed Sep 16 12:34:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 Sep 2009 08:34:07 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <20090916123407.GE13500@sobchak.mgh.harvard.edu>

Hi Peter;

> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).

I don't have a strong opinion about the release. It seems a little
early but if you think we are ready go for it.

I have tested Osvaldo's Novoalign commandline object and have it
ready to get in. Right now it's in a git tree but I can move it
over to a CVS tree and integrate it for the release. It'll live in
Bio/Sequencing/Applications like you suggested. I should be able to
do that this evening.

I am all about the move to Git and GitHub. Anything we can do to
finish that off and make it official is cool by me.

> That would be great. As for the move to github, I've added some (quite
> preliminary) docs for developers on how to make commits to the main
> branch using git and github to the wiki:
> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

This is looking great. I'd agree with Eric that we should be
consistent in the doc for suggestions on naming the official 
biopython branch:

git remote add upstream git://github.com/biopython/biopython.git
git remote add official git@github.com:biopython/biopython.git

My vote is for the "official" naming which is a little more
specific.

Great stuff,
Brad


From biopython at maubp.freeserve.co.uk  Wed Sep 16 13:30:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 14:30:47 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <20090916123407.GE13500@sobchak.mgh.harvard.edu>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<20090916123407.GE13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909160630o4dc1379dwaba667ed13ed9bde@mail.gmail.com>

On Wed, Sep 16, 2009 at 1:34 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Peter;
>
>> > I know it's been only a month since we released Biopython 1.51, but
>> > does anyone (other than me) think that we already have enough done
>> > to warrant another release? The associated CVS freeze would also
>> > serve as a good break point for moving to github (see other threads).
>
> I don't have a strong opinion about the release. It seems a little
> early but if you think we are ready go for it.

OK.

> I have tested Osvaldo's Novoalign commandline object and have it
> ready to get in. Right now it's in a git tree but I can move it
> over to a CVS tree and integrate it for the release. It'll live in
> Bio/Sequencing/Applications like you suggested. I should be able to
> do that this evening.

Go for it - I presume you have it in a private git repository at the
moment, as I couldn't spot it on github?

> I am all about the move to Git and GitHub. Anything we can do to
> finish that off and make it official is cool by me.
>
>> That would be great. As for the move to github, I've added some (quite
>> preliminary) docs for developers on how to make commits to the main
>> branch using git and github to the wiki:
>> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
> This is looking great. I'd agree with Eric that we should be
> consistent in the doc for suggestions on naming the official
> biopython branch:
>
> git remote add upstream git://github.com/biopython/biopython.git
> git remote add official git@github.com:biopython/biopython.git
>
> My vote is for the "official" naming which is a little more
> specific.

Well, both "official" and "upstream" have merit. I don't mind which,
but it does make sense to be consistent.

Peter


From biopython at maubp.freeserve.co.uk  Wed Sep 16 13:48:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 14:48:39 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
	<320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
Message-ID: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>

On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote:
> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote:
>>  On Tue, 9/8/09, Peter wrote:
>>> Did you have any ideas for a better name than
>>> Bio.SeqIO.indexed_dict()?
>>
>> Is indexed_dict a function? If so, I suggest we use a verb instead
>> of a noun. Maybe just "index"?
>
> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
> object. So yes, a verb would be better, and "index" is short and sweet.

Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
to Bio.SeqIO.index() for the next release.

Thinking ahead, in addition to the current code (indexing a file, keeping
the index in memory) we might in future want to add something like
Bio.SeqIO.sqlite_index() where the index is kept in a database etc.
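
The current approach mentioned above (index held in memory, records read
from the file only when asked for) can be sketched roughly as follows. This
is not the Bio.SeqIO code, just an illustration for simple FASTA files, and
the names build_offset_index and fetch_record are made up for the example.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA identifier to the byte offset of its '>' line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if words:
                    # Key is the first word after the '>' marker.
                    index[words[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, key):
    """Seek to the stored offset and return (title, sequence) for one record."""
    with open(path, "rb") as handle:
        handle.seek(index[key])
        title = handle.readline()[1:].strip().decode("ascii")
        parts = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode("ascii"))
    return title, "".join(parts)


def demo():
    """Write a two-record FASTA file, index it, and fetch one record."""
    text = b">alpha first record\nACGT\nACGT\n>beta second record\nGGCC\n"
    with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        index = build_offset_index(path)
        return index, fetch_record(path, index, "beta")
    finally:
        os.remove(path)
```

Only the small identifier-to-offset dictionary stays in memory, which is
the point of the approach; the sequence data is read from disk on each
lookup.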

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Sep 16 22:00:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Sep 2009 18:00:59 -0400
Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign
In-Reply-To: <bug-2904-42@http.bugzilla.open-bio.org/>
Message-ID: <200909162200.n8GM0x7d006226@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2904


chapmanb at 50mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from chapmanb at 50mail.com  2009-09-16 18:00 EST -------
Osvaldo;
Thanks much for the submission. This is committed and lives in:

Bio/Sequencing/Applications

to create a namespace for future sequencing related commandlines. You can
import with:

from Bio.Sequencing.Applications import NovoalignCommandline

It would be great if you wanted to add a cookbook example of using it
(http://biopython.org/wiki/Category:Cookbook) based on a simple pipeline.
Perhaps something involving downstream parsing of the novoalign format, or
converted to SAM as you suggested in Bug 2905.

Thanks,
Brad


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From tiagoantao at gmail.com  Wed Sep 16 22:53:31 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 16 Sep 2009 23:53:31 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
	<8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
	<3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
	<8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
	<320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>
Message-ID: <6d941f120909161553l3f9bae6u5ba45e6cde9b33e3@mail.gmail.com>

Hi,

> I knew Tiago has a lump of code ready to go, and as we have
> just discussed, as he would prefer to check that in via CVS.


I just tested my stuff on Windows.
It worked at first attempt. Strange...
I actually have a few tests (18 to be precise). They all passed at first.
Murphy's laws took a once-in-a-life vacation.

I still have a minor problem. I will not have time to update the
Tutorial before Tuesday. All is written in
http://biopython.org/wiki/PopGen_dev_Genepop , which will mostly
become the tutorial. But I simply don't have time until Tuesday to
transpose it.

Code and tests will be committed today.

Tiago


From krother at rubor.de  Thu Sep 17 08:40:28 2009
From: krother at rubor.de (Kristian Rother)
Date: Thu, 17 Sep 2009 10:40:28 +0200
Subject: [Biopython-dev] Another Biopython release?
Message-ID: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>


Hi Peter,

I could prepare 2-3 exemplary modules for parsing secondary structures +
tests for the Bio.RNA package. As I've been using GIT so far, it would be
most convenient to stick with it and contribute when the main archive has
migrated. Or is it easy to "jump" to CVS on the last possible occasion?

Best,
   Kristian


From biopython at maubp.freeserve.co.uk  Thu Sep 17 09:17:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:17:37 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>
References: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de>
Message-ID: <320fb6e00909170217j24bab86eqae45440f72ed415e@mail.gmail.com>

On Thu, Sep 17, 2009 at 9:40 AM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi Peter,
>
> I could prepare 2-3 exemplary modules for parsing secondary structures +
> tests for the Bio.RNA package. As I've been using GIT so far, it would be
> most convenient to stick with it and contribute when the main archive has
> migrated. Or is it easy to "jump" to CVS on the last possible occasion?
>
> Best,
>    Kristian

My plan for this "quick release" was to mark an end to the CVS era, and
not to include any of the really new stuff (like your code), but to wait until
we are on git before looking at it. So keep it in git for now - this should
also make the merge easier.

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 11:27:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 12:27:24 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
	<184931.66541.qm@web62403.mail.re1.yahoo.com>
	<320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
	<320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>
Message-ID: <320fb6e00909170427o37813aa7kd86464d9c8e81b36@mail.gmail.com>

On Wed, Sep 16, 2009 at 2:48 PM, Peter wrote:
> On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote:
>> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote:
>>>  On Tue, 9/8/09, Peter wrote:
>>>> Did you have any ideas for a better name than
>>>> Bio.SeqIO.indexed_dict()?
>>>
>>> Is indexed_dict a function? If so, I suggest we use a verb instead
>>> of a noun. Maybe just "index"?
>>
>> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
>> object. So yes, a verb would be better, and "index" is short and sweet.
>
> Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
> to Bio.SeqIO.index() for the next release.

Done in CVS.

> Thinking ahead, in addition to the current code (indexing a file, keeping
> the index in memory) we might in future want to add something like
> Bio.SeqIO.sqlite_index() where the index is kept in a database etc.
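
One way such a database-backed index could look, sketched with the
standard sqlite3 module. Bio.SeqIO.sqlite_index() does not exist at this
point; the table layout and the function names below are assumptions made
purely for illustration.

```python
import os
import sqlite3
import tempfile


def save_offsets(db_path, offsets):
    """Store an {identifier: byte offset} mapping in an SQLite table."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(record_id TEXT PRIMARY KEY, file_offset INTEGER NOT NULL)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offsets (record_id, file_offset) "
            "VALUES (?, ?)",
            list(offsets.items()),
        )
        con.commit()
    finally:
        con.close()


def lookup_offset(db_path, key):
    """Return the stored offset for key, or raise KeyError if it is absent."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT file_offset FROM offsets WHERE record_id = ?", (key,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(key)
    return row[0]


def demo():
    """Save two offsets to a temporary database and read one back."""
    with tempfile.TemporaryDirectory() as folder:
        db_path = os.path.join(folder, "example.idx")
        save_offsets(db_path, {"alpha": 0, "beta": 30})
        return lookup_offset(db_path, "beta")
```

The index then survives between runs and need not fit in memory, at the
cost of a database query per lookup.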

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 17 12:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 13:02:18 +0100
Subject: [Biopython-dev] Using PendingDeprecation for obsolete modules
Message-ID: <320fb6e00909170502m14b4e599l66c778bfe67f3625@mail.gmail.com>

Hi all,

Right now we have a deprecation process which usually looks like this:

(1) Label as obsolete in docstrings
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

See: http://biopython.org/wiki/Deprecation_policy

I've relatively recently noticed the PendingDeprecationWarning warning
(added in Python 2.3), which is by default silent, but the user can choose
to enable it with the python command line switch -W. For example,

$ python
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
>>>

So, by default, no warning message. But if you ask for them:

$ python -W all
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
__main__:1: PendingDeprecationWarning: X is obsolete
>>>

So, I'm thinking what we should be doing for deprecating modules is:

(1) Label as obsolete in docstrings, issue PendingDeprecationWarning
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code
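
As a rough sketch of steps (1) and (2), using made-up function names
(nothing here is existing Biopython code): an obsolete function issues
PendingDeprecationWarning, a deprecated one issues DeprecationWarning, and
a small helper records what was raised regardless of the default filters.

```python
import warnings


def obsolete_function():
    """Step (1): obsolete, so warn with the normally silent category."""
    warnings.warn(
        "obsolete_function is obsolete and may be deprecated in future",
        PendingDeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return 42


def deprecated_function():
    """Step (2): deprecated, so warn with DeprecationWarning."""
    warnings.warn(
        "deprecated_function is deprecated and will be removed",
        DeprecationWarning,
        stacklevel=2,
    )
    return 42


def collect_warning_categories(func):
    """Call func with every warning enabled; return the categories seen."""
    with warnings.catch_warnings(record=True) as caught:
        # Show every warning, overriding the default filters.
        warnings.simplefilter("always")
        func()
    return [w.category for w in caught]
```

The helper is the kind of thing a unit test could use to check that an
obsolete module really does issue the pending warning.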

I guess very few people know about pending deprecation warnings,
and so are unlikely to even try using the warning switch. Therefore
I have little inclination to go through all the current modules tagged as
"obsolete" just to add this silent warning.

However, if we simply start doing this in future, it really isn't any more work.

Any thoughts?

Peter


From winda002 at student.otago.ac.nz  Fri Sep 18 03:52:11 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 18 Sep 2009 15:52:11 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>		<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>		<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>		<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <4AB303EB.1010208@student.otago.ac.nz>


>>
>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>>  

Well, it didn't end up being very short but there is a test on my 
"phylo" branch (http://github.com/dwinter/biopython/tree/phylo) in  
test_PhylipNew.phy  (which uses a couple of new files in Tests/Phylip) 
that I'd welcome comments on.

Writing them actually exposed a bug in the code already in CVS, the 
FProtParsCommandline option "-intreefile" isn't mandatory so 
"is_required" should be set to 0 rather than 1. In my defence the emboss 
documentation has it listed as being both mandatory and optional.

One possibly foolish thing I did was use TreeIO to test the trees that 
came out of these programs made sense, thinking that module would be 
part of the next release. If the plan is for a new release soon and 
having a test for these wrappers is important the tests could be done 
with Nexus.Trees but I found that was difficult to use for files with 
multiple newick trees.

Cheers,
David


From biopython at maubp.freeserve.co.uk  Fri Sep 18 09:26:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 10:26:59 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
Message-ID: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>

On Fri, Sep 18, 2009 at 4:52 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>
>>>
>>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>>> based on test_Emboss.py? Continuing on the github branch is fine.
>>>
>
> Well, it didn't end up being very short but there is a test on my "phylo"
> branch (http://github.com/dwinter/biopython/tree/phylo) in
> test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that
> I'd welcome comments on.

Cool - I'll take a look and try and get (some of) it merged into CVS
for this release.

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1. In my defence the emboss documentation has
> it listed as being both mandatory and optional.

How odd. Maybe EMBOSS switched it at some point?

> One possibly foolish thing I did was use TreeIO to test the trees that came
> out of these programs made sense, thinking that module would be part of the
> next release. If the plan is for a new release soon and having a test for
> these wrappers is important the tests could be done with Nexus.Trees but I
> found that was difficult to use for files with multiple newick trees.

Hmm. In the short term we can either comment out those bits of the test
pending the inclusion of TreeIO in the next release, or add a quick tiny
parser in the test itself to load the trees, split them on the ";" and pass
them one by one to Bio.Nexus.Trees for parsing.
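
A minimal sketch of that "quick tiny parser" idea, assuming plain Newick
text with no ";" or spaces inside quoted labels or comments. The name
split_newick_trees is made up for illustration and is not taken from
Biopython; each returned string could then be handed to Bio.Nexus.Trees
for parsing.

```python
def split_newick_trees(text):
    """Split text holding several Newick trees into one string per tree.

    Assumption: ";" only ever appears as the tree terminator, and labels
    contain no whitespace that needs preserving.
    """
    trees = []
    for chunk in text.split(";"):
        # Trees may be wrapped over several lines, so drop all whitespace.
        chunk = "".join(chunk.split())
        if chunk:
            # Put back the terminator that split() removed.
            trees.append(chunk + ";")
    return trees


example = "(A,(B,C));\n((A,B),\nC);\n"
for newick in split_newick_trees(example):
    print(newick)
```

Splitting on the terminator keeps the helper independent of any tree
module, so the test would only need a parser that copes with one tree at
a time.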

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep 18 11:09:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:09:24 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
Message-ID: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>

Hi Michiel (et al),

I've been trying to get an example working using the Entrez history
for ELink. Strangely here the URL doesn't use history=y but instead
cmd=neighbor_history (while the default is cmd=neighbor).

However, this appears to show a bug in the Bio.Entrez parser. Consider:

from Bio import Entrez
pmid = "14630660"
print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs",
from_uid=pmid, cmd="neighbor_history").read()

This gives:

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
	<DbFrom>pubmed</DbFrom>
	<IdList>
		<Id>14630660</Id>
	</IdList>
	<LinkSetDbHistory>
		<DbTo>pmc</DbTo>
		<LinkName>pubmed_pmc_refs</LinkName>
		<QueryKey>1</QueryKey>
	</LinkSetDbHistory>
	<WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
</LinkSet>
</eLinkResult>

The XML looks reasonable by eye - although quite different from
the non-history version.

Now if instead of printing that, I try and parse it:

>>> data = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 210, in endElement
    current[name] = value
TypeError: 'str' object does not support item assignment

I can file a Biopython bug if you like, but my initial guess is
the problem lies in the XML itself versus the eLink_020511.dtd
file, which does not mention the LinkSetDbHistory element at
all. Do you agree that this looks like an NCBI problem?
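
In the meantime, a possible stopgap sketch for pulling the two history
values out of this XML with only the standard library, rather than
Bio.Entrez. The XML string is the output shown above with the XML
declaration and DOCTYPE lines trimmed; the variable names are just for
illustration.

```python
import xml.etree.ElementTree as ET

# The neighbor_history reply shown above, minus declaration and DOCTYPE.
ELINK_XML = """<eLinkResult>
<LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
        <Id>14630660</Id>
    </IdList>
    <LinkSetDbHistory>
        <DbTo>pmc</DbTo>
        <LinkName>pubmed_pmc_refs</LinkName>
        <QueryKey>1</QueryKey>
    </LinkSetDbHistory>
    <WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
</LinkSet>
</eLinkResult>"""

root = ET.fromstring(ELINK_XML)
link_set = root.find("LinkSet")

# QueryKey sits inside LinkSetDbHistory; WebEnv is a direct child of LinkSet.
query_key = link_set.find("LinkSetDbHistory").findtext("QueryKey")
web_env = link_set.findtext("WebEnv")

print("QueryKey=%s WebEnv=%s" % (query_key, web_env))
```

This ignores the DTD entirely, so it only helps while the element names
stay as they are in the reply above.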

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep 18 11:40:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:40:06 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
In-Reply-To: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
References: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
Message-ID: <320fb6e00909180440p701d3f5ejd22a605f171989eb@mail.gmail.com>

On Fri, Sep 18, 2009 at 12:09 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Michiel (et al),
>
> I've been trying to get an example working using the Entrez history
> for ELink. Strangely here the URL doesn't use history=y but instead
> cmd=neighbor_history (while the default is cmd=neighbor).
>
> However, this appears to show a bug in the Bio.Entrez parser. Consider:
>
> from Bio import Entrez
> pmid = "14630660"
> print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs",
> from_uid=pmid, cmd="neighbor_history").read()
>
> This gives:
>
> <?xml version="1.0"?>
> <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
>  "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
> <eLinkResult>
> <LinkSet>
>         <DbFrom>pubmed</DbFrom>
>         <IdList>
>                 <Id>14630660</Id>
>         </IdList>
>         <LinkSetDbHistory>
>                 <DbTo>pmc</DbTo>
>                 <LinkName>pubmed_pmc_refs</LinkName>
>                 <QueryKey>1</QueryKey>
>         </LinkSetDbHistory>
>         <WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
> </LinkSet>
> </eLinkResult>
>
> The XML looks reasonable by eye - although quite different from
> the non-history version... but my initial guess is
> the problem lies in the XML itself versus the eLink_020511.dtd
> file, which does not mention the LinkSetDbHistory element at
> all. Do you agree that this looks like an NCBI problem?

I should have done this earlier - but two different XML validators
both agree that the "history" version of the NCBI's ELink XML is
invalid, while the default is fine.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor_history

versus

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor

or:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660

I will get in touch with the NCBI...

Peter


From eric.talevich at gmail.com  Fri Sep 18 14:08:40 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 18 Sep 2009 10:08:40 -0400
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>
Message-ID: <3f6baf360909180708w2d06c775w18922106bba003e@mail.gmail.com>

On Fri, Sep 18, 2009 at 5:26 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Fri, Sep 18, 2009 at 4:52 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
>
> > One possibly foolish thing I did was use TreeIO to test the trees that came
> > out of these programs made sense, thinking that module would be part of the
> > next release. If the plan is for a new release soon and having a test for
> > these wrappers is important the tests could be done with Nexus.Trees but I
> > found that was difficult to use for files with multiple newick trees.
>
> Hmm. In the short term we can either comment out those bits of the test
> pending the inclusion of TreeIO in the next release, or add a quick tiny
> parser in the test itself to load the trees, split them on the ";" and pass
> them one by one to Bio.Nexus.Trees for parsing.
>
>
That's all TreeIO does. The relevant loop is in NewickIO.parse(), if you'd
like to copy it verbatim:
http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py

-Eric


From biopython at maubp.freeserve.co.uk  Sun Sep 20 11:20:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 20 Sep 2009 12:20:43 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
Message-ID: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>

On Fri, Sep 18, 2009 at 4:52 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
>>>
>>> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py
>>> based on test_Emboss.py? Continuing on the github branch is fine.
>>>
>
> Well, it didn't end up being very short but there is a test on my "phylo"
> branch (http://github.com/dwinter/biopython/tree/phylo) in
> test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that
> I'd welcome comments on.

I've checked in something based on the current version from github.

I added a few checks for missing input files (I was getting cryptic errors),
but then decided we had enough input files in the test suite already, and
that it might be more useful to try writing alignments to the PHYLIP tools
via stdin with AlignIO. Certainly at least one example should try this,
assuming it works. I haven't done this yet - feel free to try.

Note that the stdout from the PHYLIPNEW tools isn't clean, so we
can't avoid having temp output files:
http://lists.open-bio.org/pipermail/emboss-dev/2009-September/000632.html

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1. In my defence the emboss
> documentation has it listed as being both mandatory and optional.

Fixed in CVS - does this affect any of the other tools using this argument?

> One possibly foolish thing I did was use TreeIO to test the trees that came
> out of these programs made sense, thinking that module would be part of the
> next release. If the plan is for a new release soon and having a test for
> these wrappers is important the tests could be done with Nexus.Trees but I
> found that was difficult to use for files with multiple newick trees.

I put a quick crude helper function into the unit test as discussed.

The unit test is working nicely on Linux with EMBOSS PHYLIP
from CVS, I presume you are testing against an official release?
If you could check that the CVS code works fine on your setup before the
release that would be great. There is a bit more time as I won't
be able to do the release on Monday, but it should be Tuesday
or Wednesday... and fingers crossed getting PHYLIPNEW
installed on my Windows machine will be easy.

We can look at adding some more of your example input files,
and uncommenting their tests later (especially for cases where
we can't generate the input from Biopython directly). I did add
the horses.tree file BTW.

Thank you David :)

Peter


From winda002 at student.otago.ac.nz  Mon Sep 21 05:13:24 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:13:24 +1200
Subject: [Biopython-dev] draft release announcement
Message-ID: <4AB70B74.1040308@student.otago.ac.nz>

Hi guys,

A draft release announcement for 1.52 for you to look at and comment on.
This is written with the idea that there will be a blog post describing
the convert and indexed_dict() methods for SeqIO which can be linked to,
so the announcement itself is pretty brief.

I didn't mention the movement from CVS to git in the announcement which 
might be something worth adding?

+++
We are pleased to announce the availability of Biopython 1.52, a new 
stable release of the Biopython library.

It may only have been one month since the last release but in that time 
we've added enough useful features to warrant a new release. Biopython 
1.52 will be of particular interest to people using next generation 
sequencing - new functions added to the AlignIO and SeqIO tools speed up 
the way very large sequence files can be dealt with and you can now 
write phd files like those created by Phred and used in 454 sequencing.

SeqIO and AlignIO both now have a helper function called convert() that 
allows for simple, optimized conversion between file formats while SeqIO 
gets a new method called indexed_dict() which allows random access to 
sequences in a file without reading every record in that file into memory.

The new release also adds command line wrappers for the EMBOSS versions 
of the phylip phylogeny programs and squashes a few minor bugs reported 
since 1.51 was released.

Sources and a Windows Installer are available from the downloads page.

Thanks to the Biopython development team and to everyone who has
reported bugs since our last release.

++++


From tiagoantao at gmail.com  Mon Sep 21 05:17:39 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 21 Sep 2009 06:17:39 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>

There is a big update to the PopGen module, which is now able to do
frequentist statistics and tests through GenePop. I can draft one
paragraph about the subject. I would imagine it is one of the biggest
changes and probably the one that adds most functionality.

On Mon, Sep 21, 2009 at 6:13 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.
>
> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?
>
> +++
> We are pleased to announce the availability of Biopython 1.52, a new stable
> release of the Biopython library.
>
> It may only have been one month since the last release but in that time
> we've added enough useful features to warrant a new release. Biopython 1.52
> will be of particular interest to people using next generation sequencing -
> new functions added to the AlignIO and SeqIO tools speed up the way very
> large sequence files can be dealt with and you can now write phd files like
> those created by Phred and used in 454 sequencing.
>
> SeqIO and AlignIO both now have a helper function called convert() that
> allows for simple, optimized conversion between file formats while SeqIO
> gets a new method called indexed_dict() which allows random access to
> sequences in a file without reading every record in that file into memory.
>
> The new release also adds command line wrappers for the EMBOSS versions of
> the phylip phylogeny programs and squashes a few minor bugs reported since
> 1.51 was released.
>
> Sources and a Windows Installer are available from the downloads page.
>
> Thanks to the Biopython development team and to everyone who has reported
> bugs since our last release
>
> ++++
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From winda002 at student.otago.ac.nz  Mon Sep 21 05:30:44 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:30:44 +1200
Subject: [Biopython-dev] draft blog post for 1.52 stuff
Message-ID: <4AB70F84.6000709@student.otago.ac.nz>

As I mentioned in the draft release announcement, it might be useful to
have a blog post up explaining how the new functions for SeqIO and AlignIO
work (thanks to Peter for this idea).

I've written a draft post that looks at the convert function. It could
do with a little more detail, and it ignores the indexed_dict() function
entirely because I just don't have a good enough idea of how it works.

Again, any comments are welcome. Is it a good idea to have a post like 
this or should we just extend the release announcement to include a 
little bit more detail?

++
It's only been a month since we released Biopython 1.51 but in that time the
CVS server has stacked up enough cool new features that we are going to put
together a new release soon. As ever the new functions will be documented in
the official tutorial and cookbook but we thought we'd show off a few of
these tools here.


Simple, optimized format conversion with SeqIO and AlignIO


No one has ever complained that bioinformatics just doesn't have enough file
formats - you probably frequently find yourself converting sequence files to
suit particular applications with SeqIO. At the moment this is usually a two
step process, something like this:

 >>> records = SeqIO.parse(in_handle, "genbank")
 >>> SeqIO.write(records, out_handle, "fasta")

As of Biopython 1.52 you'll be able to achieve the same result in a single
step:

 >>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")

Adding the convert function to SeqIO will make your scripts more readable and
might even save you a couple of lines of code, but more importantly it allows
the conversion process to be optimized for the two formats being used. In the
above example we are moving from a genbank file, which might include multiple
features for each sequence, to a fasta file, which doesn't include features.
If we used the two step process above we'd be spending time reading each
sequence's features into memory just to skip them when they get passed to the
write function. SeqIO.convert() knows that the sequences in the input file are
destined to be written to a fasta file so it can skip over the features and
save a bit of time in doing the conversion.

Obviously, the optimization in SeqIO.convert() is most powerful when it's used
on very large files like those produced in next generation sequencing projects.
When converting between the FASTQ file format's variants with the "SeqIO two
step", a significant amount of time is taken creating SeqRecord objects for
each record in the input file, but none of the attributes or methods of the
SeqRecord object are required to do the conversion. For this reason
SeqIO.convert() deals with each record as two simple strings, one for the
record's sequence, the other for its ID.
[some information on just how much time that saves on a big file should
probably go here!]
+++
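
To give a feel for the "two simple strings" idea, here is a rough sketch
of a FASTQ to FASTA conversion that never builds a record object. It is
not the Biopython code: fastq_to_fasta is a made-up name, and it assumes
strict four-line FASTQ records with no line wrapping and no blank lines.

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Copy title and sequence strings from four-line FASTQ records to FASTA.

    Returns the number of records written.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        seq = in_handle.readline()
        in_handle.readline()  # "+" separator line, not needed for FASTA
        in_handle.readline()  # quality line, not needed for FASTA
        if not title.startswith("@"):
            raise ValueError("Expected '@' at start of FASTQ record: %r" % title)
        # Only two plain strings are handled per record: title and sequence.
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), seq.rstrip()))
        count += 1
    return count


fastq = io.StringIO("@read1 first\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
fasta = io.StringIO()
n = fastq_to_fasta(fastq, fasta)
print(n)
print(fasta.getvalue())
```

Skipping the quality line without decoding it is where this kind of
shortcut would save its time on a big file.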


From winda002 at student.otago.ac.nz  Mon Sep 21 05:45:34 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:45:34 +1200
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
References: <4AB70B74.1040308@student.otago.ac.nz>
	<6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
Message-ID: <4AB712FE.2060304@student.otago.ac.nz>

Tiago Antão wrote:
> There is a big update to the PopGen module, which is now able to do
> frequentist statistics and tests through GenePop. I can draft one
> paragraph about the subject. I would imagine it is one of the biggest
> changes and probably the one that adds most functionality.
>   
Cool, I see now that I should've read the original thread about the new
release more closely.

A paragraph from you on your PopGen code would be really helpful.

Cheers,
David


From tiagoantao at gmail.com  Mon Sep 21 07:23:24 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 21 Sep 2009 08:23:24 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB712FE.2060304@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
	<6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
	<4AB712FE.2060304@student.otago.ac.nz>
Message-ID: <6d941f120909210023v5dc91079s6ec54a04ad8385e7@mail.gmail.com>

Something along the lines of:

The Population Genetics module now allows the calculation of several
tests and statistical estimators via a wrapper to GenePop. Supported
are tests for Hardy-Weinberg equilibrium, linkage disequilibrium and
estimates for various F statistics (Cockerham and Weir Fst and Fis,
Robertson and Hill Fis, ...), null allele frequencies and number of
migrants among many others. Isolation By Distance (IBD) functionality
is also supported.


I suppose the changes to PopGen are the biggest going into this
Biopython version and probably one of the highlights. I should update
the documentation ASAP.

I intend to announce this version to some population genetics and
evolutionary biology communities (something I have never done in the
past).


On Mon, Sep 21, 2009 at 6:45 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Tiago Antão wrote:
>>
>> There is a big update to the PopGen module, which is now able to do
>> frequentist statistics and tests through GenePop. I can draft one
>> paragraph about the subject. I would imagine it is one of the biggest
>> changes and probably the one that adds most functionality.
>>
>
> Cool, I see now that I should've read the original thread about the new
> release more closely
>
> A paragraph from you on your PopGen code would be really helpful.
>
> Cheers,
> David


-- 
" It always takes ideology to consummate massive error." - Ambrose
Evans-Pritchard


From biopython at maubp.freeserve.co.uk  Mon Sep 21 09:01:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 10:01:10 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <320fb6e00909210201u3d9032e5vf64ba2953d83938d@mail.gmail.com>

On Mon, Sep 21, 2009 at 6:13 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.

I switched indexed_dict() to just index() after discussion on the list.
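
For anyone wondering what "random access without reading every record
into memory" looks like in practice, here is a toy sketch of the general
idea for FASTA: remember the file offset of each record, then seek back
on demand. It is not the Bio.SeqIO code, and the helper names
build_offset_index and fetch_record are invented for this example. It
assumes every ">" line has a non-empty identifier.

```python
import io


def build_offset_index(handle):
    """Map each FASTA record ID to the byte offset of its '>' line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            # Use the first word of the title line as the key.
            key = line[1:].split(None, 1)[0].decode("ascii")
            index[key] = offset
    return index


def fetch_record(handle, index, key):
    """Seek to one record and return its raw text, reading only that record."""
    handle.seek(index[key])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # reached the next record or the end of the file
        lines.append(line)
    return b"".join(lines).decode("ascii")


# A binary handle keeps tell() and seek() in plain byte offsets.
data = io.BytesIO(b">alpha first\nACGT\nTTGA\n>beta\nGGCC\n")
index = build_offset_index(data)
print(fetch_record(data, index, "beta"))
```

Only the dictionary of offsets stays in memory, which is why this style
of index scales to files far larger than a dictionary of full records.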

> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?

I think that would warrant a one line paragraph (near the end) :)

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep 21 09:11:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 10:11:17 +0100
Subject: [Biopython-dev] draft blog post for 1.52 stuff
In-Reply-To: <4AB70F84.6000709@student.otago.ac.nz>
References: <4AB70F84.6000709@student.otago.ac.nz>
Message-ID: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>

On Mon, Sep 21, 2009 at 6:30 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> As I mentioned in the draft release announcement it might be useful to have
> a blog post up explaining how the new functions for SeqIO and AlignIO work
> (thanks to Peter for this idea).
>
> I've written a draft for a post that looks at the convert function that
> could do with a little more detail and ignores the indexed_dict() function
> entirely because I just don't have a good enough idea of how it works.

Great job - thanks for doing this. I'll tackle an indexing introduction
blog post since you've done a nice job for convert :)

It would also be worth mentioning that the convert function will also
take filenames (not just handles), which also helps simplify simple
conversion tasks.

I should be able to provide some timings for things like FASTQ
conversion, or FASTQ to FASTA on multi-million read files
(there are probably some on the dev list already...).

> Again, any comments are welcome. Is it a good idea to have a post like
> this or should we just extend the release announcement to include a little
> bit more detail?

Well, as I mentioned the idea to David directly, I think these little
motivational examples on the blog are worth trying out. What does
everyone else think?

Peter


From biopython at maubp.freeserve.co.uk  Mon Sep 21 17:41:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 18:41:40 +0100
Subject: [Biopython-dev] draft blog post for 1.52 stuff
In-Reply-To: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>
References: <4AB70F84.6000709@student.otago.ac.nz>
	<320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com>
Message-ID: <320fb6e00909211041n6378595cx39f2d395aee0ec7c@mail.gmail.com>

On Mon, Sep 21, 2009 at 10:11 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Sep 21, 2009 at 6:30 AM, David Winter
> <winda002 at student.otago.ac.nz> wrote:
>> As I mentioned in the draft release announcement it might be useful to have
>> a blog post up explaining how the new functions for SeqIO and AlignIO work
>> (thanks to Peter for this idea).
>>
>> I've written a draft for a post that looks at the convert function that
>> could do with a little more detail and ignores the indexed_dict() function
>> entirely because I just don't have a good enough idea of how it works.
>
> Great job - thanks for doing this. I'll tackle an indexing introduction
> blog post since you've done a nice job for convert :)

Done, and up online - hopefully without typos:
http://news.open-bio.org/news/2009/09/biopython-seqio-index/

Peter


From winda002 at student.otago.ac.nz  Tue Sep 22 05:05:31 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 22 Sep 2009 17:05:31 +1200
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <4AB85B1B.2000704@student.otago.ac.nz>

David Winter wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment 
> on. This is written with the idea that there will be a blog post 
> describing the convert and indexed_dict() methods for SeqIO which can 
> be linked to so the
> announcement itself is pretty brief.

Thanks to Peter and Tiago for their suggestions. There is now a marked-up
version of this draft, incorporating those suggestions, ready and waiting
to go on the blog. There is still time for suggestions from anyone else.

David


From winda002 at student.otago.ac.nz  Tue Sep 22 05:14:07 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 22 Sep 2009 17:14:07 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>	
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>	
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>	
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>	
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>	
	<4AA58F3C.6080200@student.otago.ac.nz>	
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
Message-ID: <4AB85D1F.7010901@student.otago.ac.nz>

Peter wrote:
>
>
>   
>> Writing them actually exposed a bug in the code already in CVS, the
>> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
>> should be set to 0 rather than 1. In my defence the emboss
>> documentation has it listed as being both mandatory and optional.
>>     
>
> Fixed in CVS - does this affect any of the other tools using this argument?
>   
Nope, I only slipped on this one ;)
>   
>> One possibly foolish thing I did was use TreeIO to test the trees that came
>> out of these programs made sense, thinking that module would be part of the
>> next release. If the plan is for a new release soon and having a test for
>> these wrappers is important the tests could be done with Nexus.Trees but I
>> found that was difficult to use for files with multiple newick trees.
>>     
>
> I put a quick crude helper function into the unit test as discussed.
>
> The unit test is working nicely on Linux with EMBOSS PHYLIP
> from CVS, I presume you are testing against an official release?
> If you could check that the CVS code works fine on your setup before the
> release that would be great. 
Finally got in front of the right computer to do this. The tests in the
(Biopython) CVS work fine with the official EMBOSS 6.1.0 release (on
Ubuntu if that helps). I'd offer to try it out on Windows but I don't
have EMBOSS, a compiler or any of the libraries that I'd need to do that!

Cheers,
David


From biopython at maubp.freeserve.co.uk  Tue Sep 22 09:23:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 10:23:10 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB85D1F.7010901@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
	<320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
	<320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
	<4AA58F3C.6080200@student.otago.ac.nz>
	<4AB303EB.1010208@student.otago.ac.nz>
	<320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com>
	<4AB85D1F.7010901@student.otago.ac.nz>
Message-ID: <320fb6e00909220223q6f079a39o74916d20291c3400@mail.gmail.com>

On Tue, Sep 22, 2009 at 6:14 AM, David Winter
<winda002 at student.otago.ac.nz> wrote:
> Peter wrote:
>>>
>>> Writing them actually exposed a bug in the code already in CVS,
>>> the FProtParsCommandline option "-intreefile" isn't mandatory so
>>> "is_required" should be set to 0 rather than 1. In my defence the
>>> emboss documentation has it listed as being both mandatory and
>>> optional.
>>
>> Fixed in CVS - does this affect any of the other tools using this
>> argument?
>
> Nope, I only slipped on this one ;)

Great. It looks like the tests have been useful already :)

>> The unit test is working nicely on Linux with EMBOSS PHYLIP
>> from CVS, I presume you are testing against an official release?
>> If you could check that the CVS code works fine on your setup before the
>> release that would be great.
>
> Finally got in front of the right computer to do this. The tests in the
> (Biopython) CVS work fine with the official EMBOSS 6.1.0 release
> (on ubuntu if that helps).

Great - thank you.

> I'd offer to try it out on windows but I don't
> have EMBOSS, a compiler or any of the libraries that I'd need to
> do that!

Hmm - EMBOSS only provide a Windows installer for the core
EMBOSS suite, not the extras like PHYLIP. I do have a C
compiler and cygwin setup on my Windows machine, so it may
work. We'll see...

Peter


From mjldehoon at yahoo.com  Tue Sep 22 10:12:37 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 22 Sep 2009 03:12:37 -0700 (PDT)
Subject: [Biopython-dev] Blast records
Message-ID: <230712.78074.qm@web62406.mail.re1.yahoo.com>

Hi everybody,

I was looking at an older bug report about the plain-text and XML Blast parsers in Biopython:

http://bugzilla.open-bio.org/show_bug.cgi?id=2176

When I was checking the current behavior of Biopython's blast parsers, I noticed that the plain-text parser and the XML parser give different results when parsing psi-blast output. The plain-text parser returns a Blast.Record.PSIBlast object, whereas the XML parser returns Blast.Record.Blast objects. In addition, the XML parser misinterprets the psi-blast XML output (creating a separate Blast record for each psi-blast iteration), whereas the plain-text parser fails on psi-blast output of the current blast program.

To fix this, I guess the first step is to decide whether a psi-blast parser should return a Blast.Record.Blast object or a Blast.Record.PSIBlast object. In theory having a Blast.Record.PSIBlast record seems more appropriate. However, this complicates the parser (it's not clear until halfway through the Blast output if it's Blast or Psi-Blast, which means the user has to tell the parser whether it's Blast or Psi-Blast), and the format of the XML output generated for Blast and Psi-Blast is the same. I would therefore suggest having one Blast.Record class that can contain both Blast and Psi-Blast output.

Any other opinions, comments, suggestions?

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Sep 22 11:40:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 12:40:46 +0100
Subject: [Biopython-dev] Blast records
In-Reply-To: <230712.78074.qm@web62406.mail.re1.yahoo.com>
References: <230712.78074.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>

On Tue, Sep 22, 2009 at 11:12 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi everybody,
>
> When I was checking the current behavior of Biopython's blast parsers,
> I noticed that the plain-text parser and the XML parser give different
> results when parsing psi-blast output. The plain-text parser returns a
> Blast.Record.PSIBlast object, whereas the XML parser returns
> Blast.Record.Blast objects. ...
>
> Any other opinions, comments, suggestions?

As I recall (backed up by what I wrote in the tutorial), when I last
checked, the plain text PSI-BLAST output (i.e. from the command
line tool blastpgp) included a lot of information missing in the XML
output. Perhaps this has improved? If it hasn't, I am inclined to
leave things as they are. If the current PSI-BLAST outputs more
details in the XML we may be able to do a better job.

The next bit is my recollection of some of the background to this:
Classic BLAST (and also RPS-BLAST) allow multiple queries and
use the "iterator" block in the XML file for each query. This was an
odd choice of naming, but I think the XML tag was originally only
intended for the PSI-BLAST output where each "iteration" block
in the XML corresponds to each step of the algorithm. You may
recall early versions of BLAST would output "concatenated" XML
files for multiple queries - which were not true XML files. I guess
they fixed this by reusing the existing "iteration" structure for
multiple queries (rather than adding new XML tags). With this in
mind the current parsing of the XML from PSI-BLAST makes
sense.

[In any case, I plan to do Biopython 1.52 this afternoon, with
the PSI BLAST parsing left as it is].

Peter


From biopython at maubp.freeserve.co.uk  Tue Sep 22 13:29:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 14:29:10 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
Message-ID: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>

Hi all,

As previously announced, I'm going to try and get Biopython 1.52
done this afternoon - and am now declaring a CVS freeze.

If all goes to plan, once I've done the release CVS will remain
"frozen", and we'll probably get it made read only on the server.
Instead, we're going to try and switch over to git (initially on
github with a backup on the OBF servers).

Stay tuned for further announcements...

Peter


From p.j.a.cock at googlemail.com  Tue Sep 22 16:38:21 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 17:38:21 +0100
Subject: [Biopython-dev] Biopython 1.52 released
Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>

Dear all,

Those of you who signed up to our newsfeed will know this already,
but we are pleased to announce the release of Biopython 1.52:

http://news.open-bio.org/news/2009/09/biopython-release-152/

Thank you to all our developers, including David Winter for drafting
the release announcement, and everyone else who has contributed
with feedback, bug reports etc.

Could I also take this opportunity to remind you all we have an
application note out in the OUP journal Bioinformatics:
http://news.open-bio.org/news/2009/03/biopython-paper-published/
http://dx.doi.org/10.1093/bioinformatics/btp163

In any scientific publication using Biopython, we kindly request
you cite this, or another appropriate publication from this list:
http://biopython.org/wiki/Documentation#Papers

Thank you,

Peter


From biopython at maubp.freeserve.co.uk  Tue Sep 22 16:42:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 17:42:49 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
Message-ID: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>

On Tue, Sep 22, 2009 at 2:29 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> As previously announced, I'm going to try and get Biopython 1.52
> done this afternoon - and am now declaring a CVS freeze.
>
> If all goes to plan, once I've done the release CVS will remain
> "frozen", and we'll probably get it made read only on the server.
> Instead, we're going to try and switch over to git (initially on
> github with a backup on the OBF servers).
>
> Stay tuned for further announcements...

OK, the release is done. Let's leave things as they are for a day
or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
with Bartek about the timings for the git transition.

I am considering adding a warning message to setup.py and the
readme file as the final commit to CVS, pointing out that we will
be moving future development to a git repository. One of the first
commits to git would be to remove that warning. Does that make
sense?

Peter


From bartek at rezolwenta.eu.org  Tue Sep 22 19:46:20 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 22 Sep 2009 21:46:20 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
Message-ID: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>

On Tue, Sep 22, 2009 at 6:42 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> OK, the release is done. Let's leave things as they are for a day
> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
> with Bartek about the timings for the git transition.
>
> I am considering adding a warning message to setup.py and the
> readme file as the final commit to CVS, pointing out that we will
> be moving future development to a git repository. One of the first
> commits to git would be to remove that warning. Does that make
> sense?

It seems OK to me. Let me know when you make the last commit, so that
I can turn off the scripts pushing CVS changes to github, which would be
the only technical thing to do to make the transition. From then on,
we should commit only to git.

Bartek.


From biopython at maubp.freeserve.co.uk  Tue Sep 22 20:18:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 22 Sep 2009 21:18:12 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
Message-ID: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>

On Tue, Sep 22, 2009 at 8:46 PM, Bartek Wilczynski wrote:
> On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote:
>>
>> OK, the release is done. Let's leave things as they are for a day
>> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate
>> with Bartek about the timings for the git transition.
>>
>> I am considering adding a warning message to setup.py and the
>> readme file as the final commit to CVS, pointing out that we will
>> be moving future development to a git repository. One of the first
>> commits to git would be to remove that warning. Does that make
>> sense?
>
> It seems OK to me.

Great.

> Let me know when you make the last commit, so that I turn off
> the scripts pushing CVS changes to github, ...

Will do - I'll give it a day or so just in case we need to do a
re-release for anything critical.

> ... which would be the only technical thing to do to make the
> transition. From then on, we should commit only to git.

Yep - although I'll ask the OBF admins to make CVS read only
as a precaution.

Peter


From p.j.a.cock at googlemail.com  Tue Sep 22 20:20:54 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 21:20:54 +0100
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
Message-ID: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>

> Dear all,
>
> Those of you who signed up to our newsfeed will know this already,
> but we are pleased to announce the release of Biopython 1.52:
>
> http://news.open-bio.org/news/2009/09/biopython-release-152/
>
> Thank you to all our developers, including David Winter for drafting
> the release announcement, and everyone else who has contributed
> with feedback, bug reports etc.

Brad - if everything looks fine, can you do the PyPi upload now?

Thanks,

Peter


From chapmanb at 50mail.com  Tue Sep 22 20:42:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 22 Sep 2009 16:42:26 -0400
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
	<320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
Message-ID: <20090922204226.GA13500@sobchak.mgh.harvard.edu>

Hi Peter;
Congrats to everyone on the release. Peter, thanks as always for all
the hard work.

> Brad - if everything looks fine, can you do the PyPi upload now?

No problem, all set:

http://pypi.python.org/pypi/biopython/

I am tempted to secretly commit something to CVS and then vehemently
deny doing it to mess with everyone's head. Wait, so then how did the
README file get changed? A mystery...

Seriously, looking forward to the Git transition,
Brad


From p.j.a.cock at googlemail.com  Tue Sep 22 21:24:11 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 22 Sep 2009 22:24:11 +0100
Subject: [Biopython-dev] Biopython 1.52 released
In-Reply-To: <20090922204226.GA13500@sobchak.mgh.harvard.edu>
References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com>
	<320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com>
	<20090922204226.GA13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909221424t2cd67249pc1555c382c4f5597@mail.gmail.com>

On Tue, Sep 22, 2009 at 9:42 PM, Brad Chapman wrote:
> Hi Peter;
> Congrats to everyone on the release. Peter, thanks as always for all
> the hard work.
>
>> Brad - if everything looks fine, can you do the PyPi upload now?
>
> No problem, all set:
>
> http://pypi.python.org/pypi/biopython/

Lovely :)

> I am tempted to secretly commit something to CVS and then vehemently
> deny doing it to mess with everyone's head. Wait, so then how did the
> README file get changed? A mystery...

Well, unless you have another CVS account that we don't know
about, it wouldn't be much of a mystery would it? Grin.

> Seriously, looking forward to the Git transition,

May you live in interesting times?

But yeah - should be good.

Peter


From biopython at maubp.freeserve.co.uk  Wed Sep 23 10:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 11:28:35 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
Message-ID: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>

On Tue, Sep 22, 2009 at 9:18 PM, Peter  wrote:
> Bartek wrote:
>> Let me know when you make the last commit, so that I turn off
>> the scripts pushing CVS changes to github, ...
>
> Will do - I'll give it a day or so just in case we need to do a
> re-release for anything critical.

Hi Bartek,

OK - I think that's it for final commits to CVS (a few notes about
git, and finally adding the warning in setup.py). Not all of these
changes have made it to github yet.

We also need the 1.52 tag ("biopython-152") to get copied over.

Once that is done, could you turn off your CVS to github
script, and let us know by email?

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Wed Sep 23 14:34:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 15:34:42 +0100
Subject: [Biopython-dev] Blast records
In-Reply-To: <154350.7800.qm@web62402.mail.re1.yahoo.com>
References: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>
	<154350.7800.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com>

On Wed, Sep 23, 2009 at 2:51 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> --- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> As I recall (backed up by what I wrote in the tutorial),
>> when I last checked, the plain text PSI-BLAST output
>> (i.e. from the command line tool blastpgp) included a
>> lot of information missing in the XML output. Perhaps
>> this has improved? If it hasn't, I am inclined to leave
>> things as they are. If the current PSI-BLAST outputs
>> more details in the XML we may be able to do a better job.
>
> As far as I can tell, the XML contains the same information
> as the plain-text psiblast output, but the XML parser doesn't
> parse it correctly, since it assumes it is dealing with regular
> blast rather than psi-blast.

It sounds like the NCBI have changed the PSI BLAST XML
output then.

>> The next bit is my recollection of some of the background
>> to this:
>> Classic BLAST (and also RPS-BLAST) allow multiple queries
>> and use the "iterator" block in the XML file for each query.
>> This was an odd choice of naming, but I think the XML tag was
>> originally only intended for the PSI-BLAST output where each
>> "iteration" block in the XML corresponds to each step of the
>> algorithm. You may recall early versions of BLAST would output
>> "concatenated" XML files for multiple queries - which were not
>> true XML files.
>
> That is correct. To make things more complex, if you run
> psi-blast with multiple queries you get concatenated XML
> files again, with the iteration blocks corresponding to the
> psi-blast iterations for each query.

Odd - and arguably a bug, since it isn't valid XML.

>> I guess they fixed this by reusing the existing "iteration"
>> structure for multiple queries (rather than adding new XML
>> tags). With this in mind the current parsing of the XML from
>> PSI-BLAST makes sense.
>
> I don't know if it really makes sense. For a single psi-blast
> query, we're getting multiple Blast records. For multiple
> psi-blast queries, we're iterating over the iteration blocks
> while ignoring the fact that they can come from different
> queries.

Is a single Blast record object for each PSI-BLAST iteration
such a bad thing?

> Ideally, we should be able to see from the XML whether
> it was regular blast with multiple queries, or psi-blast with
> a single query. Right now that is possible by looking at
> the query-def lines, but I wonder if NCBI is considering
> a better solution for this. I'll write an email to them to find out.

Certainly clarification from the NCBI sounds useful.

Peter


From mjldehoon at yahoo.com  Wed Sep 23 13:51:04 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 23 Sep 2009 06:51:04 -0700 (PDT)
Subject: [Biopython-dev] Blast records
In-Reply-To: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com>
Message-ID: <154350.7800.qm@web62402.mail.re1.yahoo.com>

--- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> As I recall (backed up by what I wrote in the tutorial),
> when I last checked, the plain text PSI-BLAST output
> (i.e. from the command line tool blastpgp) included a
> lot of information missing in the XML output. Perhaps
> this has improved? If it hasn't, I am inclined to leave
> things as they are. If the current PSI-BLAST outputs
> more details in the XML we may be able to do a better job.

As far as I can tell, the XML contains the same information as the plain-text psiblast output, but the XML parser doesn't parse it correctly, since it assumes it is dealing with regular blast rather than psi-blast.

> The next bit is my recollection of some of the background
> to this:
> Classic BLAST (and also RPS-BLAST) allow multiple queries
> and use the "iterator" block in the XML file for each query.
> This was an odd choice of naming, but I think the XML tag was
> originally only intended for the PSI-BLAST output where each
> "iteration" block in the XML corresponds to each step of the 
> algorithm. You may recall early versions of BLAST would output 
> "concatenated" XML files for multiple queries - which were not
> true XML files.

That is correct. To make things more complex, if you run psi-blast with multiple queries you get concatenated XML files again, with the iteration blocks corresponding to the psi-blast iterations for each query.

> I guess they fixed this by reusing the existing "iteration"
> structure for multiple queries (rather than adding new XML
> tags). With this in mind the current parsing of the XML from
> PSI-BLAST makes sense.

I don't know if it really makes sense. For a single psi-blast query, we're getting multiple Blast records. For multiple psi-blast queries, we're iterating over the iteration blocks while ignoring the fact that they can come from different queries.

Ideally, we should be able to see from the XML whether it was regular blast with multiple queries, or psi-blast with a single query. Right now that is possible by looking at the query-def lines, but I wonder if NCBI is considering a better solution for this. I'll write an email to them to find out.
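
To illustrate the query-def idea, here is a minimal sketch in plain Python
(not the Biopython parser). It assumes each iteration block has already been
reduced to a (query_def, payload) pair; the function name and the sample data
are hypothetical. Consecutive blocks sharing a query-def are collected
together, so a group with more than one block would suggest psi-blast
iterations rather than separate queries. This only works when distinct
queries really do have distinct query-def lines.

```python
from itertools import groupby


def group_by_query(iterations):
    """Collect consecutive (query_def, payload) pairs that share a query_def.

    Returns a list of (query_def, [payload, ...]) tuples in input order.
    """
    return [
        (query_def, [payload for _, payload in group])
        for query_def, group in groupby(iterations, key=lambda item: item[0])
    ]


# Hypothetical stand-in for parsed iteration blocks (not real BLAST output).
blocks = [
    ("queryA", "round 1"),
    ("queryA", "round 2"),
    ("queryB", "round 1"),
]

grouped = group_by_query(blocks)
for query_def, rounds in grouped:
    print(query_def, len(rounds))
```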

--Michiel


From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 14:47:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 10:47:16 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231447.n8NElGi8003751@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-23 10:47 EST -------
I've looked at PDB file 13GS in more detail, and this doesn't look like a bug
in Biopython, but rather just another odd PDB file.

Chains C and D are only three residue peptides, e.g.

ATOM   3301  N   GLU D   1      16.854  13.061  10.252  1.00 65.68           N  
ATOM   3302  CA  GLU D   1      17.100  13.860   9.018  1.00 66.23           C  
ATOM   3303  C   GLU D   1      17.937  15.095   9.363  1.00 65.02           C  
ATOM   3304  O   GLU D   1      18.510  15.724   8.439  1.00 56.86           O  
ATOM   3305  CB  GLU D   1      15.764  14.279   8.389  1.00 66.35           C  
ATOM   3306  CG  GLU D   1      15.913  14.994   7.062  1.00 67.41           C  
ATOM   3307  CD  GLU D   1      14.584  15.456   6.508  1.00 68.72           C  
ATOM   3308  OE1 GLU D   1      13.547  15.340   7.163  1.00 69.08           O  
ATOM   3309  OXT GLU D   1      17.998  15.420  10.569  1.00 66.12           O  
ATOM   3310  N   CYS D   2      14.618  15.966   5.283  1.00 69.97           N  
ATOM   3311  CA  CYS D   2      13.431  16.483   4.614  1.00 70.18           C  
ATOM   3312  C   CYS D   2      13.374  15.898   3.213  1.00 69.53           C  
ATOM   3313  O   CYS D   2      14.409  15.625   2.610  1.00 65.61           O  
ATOM   3314  CB  CYS D   2      13.502  18.008   4.507  1.00 73.18           C  
ATOM   3315  SG  CYS D   2      14.485  18.841   5.796  1.00 76.47           S  
ATOM   3316  N   GLY D   3      12.166  15.713   2.693  1.00 71.49           N  
ATOM   3317  CA  GLY D   3      12.023  15.155   1.360  1.00 75.33           C  
ATOM   3318  C   GLY D   3      11.489  13.733   1.399  1.00 78.72           C  
ATOM   3319  O   GLY D   3      10.840  13.313   0.413  1.00 79.95           O  
ATOM   3320  OXT GLY D   3      11.717  13.031   2.412  1.00 80.37           O  
TER    3321      GLY D   3

Look at the C-alpha distances, (17.100, 13.860, 9.018) to (13.431, 16.483,
4.614) to (12.023, 15.155, 1.360) giving distances of 6.3 and 3.8:

>>> from math import sqrt
>>> import numpy
>>> a = numpy.array((17.100, 13.860, 9.018))
>>> b = numpy.array((13.431, 16.483, 4.614))
>>> c = numpy.array((12.023, 15.155, 1.360))
>>> sqrt(sum((a-b)**2))
6.3037215991825049
>>> sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart, regardless
of whether you use a simple C-alpha distance (as here) or look at the
backbone's N to C bonds.

The "problem" for 13GS goes away if you relax the default distance threshold,
e.g. use PPBuilder(10.0) instead of PPBuilder().
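
The effect of the threshold can be sketched in plain Python. This is only an
illustration of the idea, not the Bio.PDB implementation: split_on_gaps is a
hypothetical helper, and the 4.3 cutoff is an assumed value picked so that
the 6.3 gap breaks the chain while the 3.8 one does not.

```python
from math import sqrt


def split_on_gaps(ca_coords, threshold):
    """Split a list of C-alpha (x, y, z) tuples into fragments.

    A new fragment starts whenever two consecutive C-alpha atoms are
    further apart than threshold (in Angstroms).
    """
    fragments = []
    for coord in ca_coords:
        if fragments:
            prev = fragments[-1][-1]
            dist = sqrt(sum((p - q) ** 2 for p, q in zip(prev, coord)))
            if dist <= threshold:
                fragments[-1].append(coord)
                continue
        fragments.append([coord])
    return fragments


# C-alpha coordinates of chain D in 13GS, copied from the ATOM lines above.
chain_d = [
    (17.100, 13.860, 9.018),  # GLU 1
    (13.431, 16.483, 4.614),  # CYS 2
    (12.023, 15.155, 1.360),  # GLY 3
]

strict = split_on_gaps(chain_d, 4.3)    # assumed cutoff, for illustration only
relaxed = split_on_gaps(chain_d, 10.0)  # the relaxed value suggested above
print([len(f) for f in strict])
print([len(f) for f in relaxed])
```

With the stricter cutoff the GLU is separated from the CYS-GLY pair, whereas
the relaxed cutoff keeps all three residues in one fragment.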

However, whatever affects 1A2D seems to be a different issue...

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bartek at rezolwenta.eu.org  Wed Sep 23 15:10:32 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 23 Sep 2009 17:10:32 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
Message-ID: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>

On Wed, Sep 23, 2009 at 12:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
> OK - I think that's it for final commits to CVS (a few notes about
> git, and finally adding the warning in setup.py). Not all of these
> changes have made it to github yet.
>
> We also need the 1.52 tag ("biopython-152") to get copied over.
>
> Once that is done, could you turn off your CVS to github
> script, and let us know by email?

Ta-da! We are no longer synchronizing from CVS!

Please do not commit any changes to the CVS because they are not
going to be transferred to git, which is now _the_ repository for
biopython.

Everyone with biopython CVS accounts is welcome to send their github
logins (off the list) to me or Peter to get them added as biopython
collaborators.

cheers
Bartek


From biopython at maubp.freeserve.co.uk  Wed Sep 23 15:16:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 16:16:19 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
Message-ID: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Wed, Sep 23, 2009 at 12:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
>> OK - I think that's it for final commits to CVS (a few notes about
>> git, and finally adding the warning in setup.py). Not all of these
>> changes have made it to github yet.
>>
>> We also need the 1.52 tag ("biopython-152") to get copied over.
>>
>> Once that is done, could you turn off your CVS to github
>> script, and let us know by email?
>
> Ta-da! We are no longer synchronizing from CVS!

Lovely... but could you double check the last few commits made it?
i.e. The final commit should be:

setup.py CVS revision 1.174
date: 2009/09/23 10:06:08;  author: peterc;  state: Exp;  lines: +8 -0
Adding a warning about CVS/git to setup.py (which we will remove
once we switch to git) so people know they are using an out of date
repository.

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 15:40:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 11:40:00 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231540.n8NFe0iU005670@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-23 11:39 EST -------
I think the problem with PDB file 1A2D is due to the atypical PYX residue,

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa
structure = PDBParser().get_structure('tmp', '1A2D.pdb')
for model in structure :
    for chain in model :
        for res in chain :
            if "CA" in res.child_dict and not is_aa(res) :
                print chain, res

The polypeptide code only looks at residues that pass the is_aa test, which
means we can ignore things like water atoms associated with a chain. In this
PDB file there are two residues which fail this test:

<Chain id=A> <Residue PYX het=H_PYX resseq=117 icode= >
<Chain id=B> <Residue PYX het=H_PYX resseq=117 icode= >

According to the SEQADV and MODRES lines, these are modified CYS residues.
Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This
leads me to believe the fix is to add the PYX -> C mapping to Biopython.
[The dictionary used, to_one_letter_code, is actually defined in file
Bio/SCOP/RAF.py for some historical reason.]

Consulting the PDB documentation suggests that there are potentially
many more examples like this of unknown HETATM entries which are
modified amino acid residues... see:
ftp://ftp.wwpdb.org/pub/pdb/data/monomers/
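
As a rough sketch of the kind of lookup I mean (plain Python, not the actual
to_one_letter_code dictionary; the names STANDARD, MODIFIED and one_letter
are made up for this example), a modified residue is mapped to its parent
amino acid before falling back to "X":

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
STANDARD = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Modified residues mapped to their parent amino acid. PYX -> C is taken
# from the MODRES lines of 1A2D discussed above; a real table would need
# many more entries.
MODIFIED = {"PYX": "C"}


def one_letter(resname, unknown="X"):
    """Return the one-letter code for a residue name, or unknown if unmapped."""
    key = resname.strip().upper()
    if key in STANDARD:
        return STANDARD[key]
    return MODIFIED.get(key, unknown)


print(one_letter("CYS"), one_letter("PYX"), one_letter("HOH"))
```

Keeping the modified residues in a separate table makes it easy to see which
mappings are a judgement call rather than a standard code.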

Christian - did you find any other problem PDB files?

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Wed Sep 23 15:47:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 11:47:19 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909231547.n8NFlJ39005869@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #4 from schafer at rostlab.org  2009-09-23 11:47 EST -------
Peter,

yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time,
I'll take a look at it and post them here. It's easy to do this. What I did is,
I parsed the structures through the dssp structure assignment tool and compared
the obtained sequence with that obtained from the Bio.PDB parser. Background: I
wanted to map the sequence that dssp sees to atomic coordinates.
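
The comparison step amounts to something like the sketch below. It is a
simplified illustration only: it assumes the two sets of sequences have
already been collected into dictionaries keyed by an identifier, and the
function name and sample entries are placeholders rather than real PDB data.

```python
def find_mismatches(dssp_seqs, pdb_seqs):
    """Return the sorted IDs whose sequences differ between the two dicts.

    An ID present in only one of the dicts also counts as a mismatch.
    """
    ids = set(dssp_seqs) | set(pdb_seqs)
    return sorted(i for i in ids if dssp_seqs.get(i) != pdb_seqs.get(i))


# Placeholder entries for illustration, not real structures.
from_dssp = {"entry1": "MKV", "entry2": "GCA"}
from_bio_pdb = {"entry1": "MKV", "entry2": "GA"}

problem_ids = find_mismatches(from_dssp, from_bio_pdb)
print(problem_ids)
```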


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bartek at rezolwenta.eu.org  Wed Sep 23 15:56:42 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 23 Sep 2009 17:56:42 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
Message-ID: <8b34ec180909230856u235a17ah437e578e02d5e6d3@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Lovely... but could you double check the last few commits made it?

Sure, your commit didn't make it to github at first, because it was
just two minutes after the last scheduled synchronization.

Now it's in github.

cheers
 Bartek


From biopython at maubp.freeserve.co.uk  Wed Sep 23 16:04:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:04:30 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
Message-ID: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote:
>>
>> Ta-da! We are no longer synchronizing from CVS!
>>
>
> Lovely... but could you double check the last few commits made it?
> i.e. The final commit should be:
>
> setup.py CVS revision 1.174
> date: 2009/09/23 10:06:08;  author: peterc;  state: Exp;  lines: +8 -0
> Adding a warning about CVS/git to setup.py (which we will remove
> once we switch to git) so people know they are using an out of date
> repository.

It has just shown up in the last few minutes :)

I'm ready to make the first commit directly to github (removing the
new warning from setup.py), assuming everything is fine on your
end Bartek?

Peter


From biopython at maubp.freeserve.co.uk  Wed Sep 23 16:34:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:34:12 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
Message-ID: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:04 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> I'm ready to make the first commit directly to github (removing the
> new warning from setup.py), assuming everything is fine on your
> end Bartek?

OK - that's done now. Thank you Bartek.

Ladies and Gentlemen, we are now running Biopython development
with git :)

Remember - CVS remains frozen (and I'll ask the OBF admins to make
it read only to prevent any accidents).

Now, let's make sure all the documentation and the wiki etc is up to date,
and make an official announcement on the news server.

Those of you who already had CVS access, once you think you are happy
with using git (i.e. you've had a play with your own local repository, and
ideally also tried pushing changes to a personal repository on github),
please ask for collaborator status on github.

Peter


From eric.talevich at gmail.com  Thu Sep 24 03:48:49 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 23 Sep 2009 23:48:49 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
Message-ID: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>

Folks,

I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO
modules and I'd like your opinion on what else should be done before merging
this into the mainline.

First, the wiki documentation for PhyloXML has an example pipeline showing
how to build a phylogeny in Biopython, from a raw protein sequence to a
lightly annotated phyloXML file.
http://biopython.org/wiki/PhyloXML#Example_pipeline

Does this look right? I copied the first few steps from the official
docs.

The source code, for your review, is here:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py

Discussion:

*TreeIO*
The read, parse, write and convert functions work essentially the same as in
SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.

(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?

*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:

(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.

(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.

(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?

(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way?  If
not, I'll forget about it.

*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.

Best regards,
Eric


From mjldehoon at yahoo.com  Thu Sep 24 09:33:22 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 24 Sep 2009 02:33:22 -0700 (PDT)
Subject: [Biopython-dev] Blast records
In-Reply-To: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com>
Message-ID: <888743.69260.qm@web62408.mail.re1.yahoo.com>

--- On Wed, 9/23/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> --- Michiel wrote:
> > For a single psi-blast query, we're getting multiple Blast
> > records. For multiple psi-blast queries, we're iterating over
> > the iteration blocks while ignoring the fact that they can come
> > from different queries.
> 
> Is a single Blast record object for each PSI-BLAST
> iteration such a bad thing?
>
Well the plain-text PSI-BLAST parser returns a single Record.PSIBlast object containing all of the PSI-BLAST iterations, whereas the XML parser returns multiple Record.Blast objects. Ideally, the plain-text parser and the XML parser should return the same thing.
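
To illustrate what "the same thing" could look like, here is a minimal sketch that collects per-iteration records into one entry per query. The tuples are hypothetical stand-ins for whatever the XML parser yields, not real Record.Blast objects, and the sketch assumes records from the same query arrive consecutively.

```python
from itertools import groupby

# Hypothetical stand-ins for per-iteration records from an XML parser:
# (query id, iteration number, number of hits).
records = [
    ("query1", 1, 12),
    ("query1", 2, 15),
    ("query2", 1, 7),
]

# Consecutive records sharing a query are collected into one list each.
per_query = {}
for query, rounds in groupby(records, key=lambda rec: rec[0]):
    per_query[query] = [rec[1:] for rec in rounds]

print(per_query)
```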

--Michiel.


From biopython at maubp.freeserve.co.uk  Thu Sep 24 09:57:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 10:57:12 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
Message-ID: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>

On Thu, Sep 24, 2009 at 4:48 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> Discussion:
>
> *TreeIO*
> The read, parse, write and convert functions work essentially the same as in
> SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

Great.

One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
supported for formats that can represent multiple phylogenetic trees in a
single file". Is that true, and if so why? For SeqIO and AlignIO you can
use parse on a file with one entry, the iterator just returns one entry. Easy.
This is important for allowing generic code (e.g. a loop) regardless of
how many entries there are (one, many, or even zero).
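
A minimal sketch of that behaviour, not the actual Bio.TreeIO code: the toy parse() below is hypothetical and simply treats every non-blank line of a handle as one Newick string. Because it is a generator, the same loop works for zero, one or many trees, and read() can be built on top of it for the exactly-one case.

```python
from io import StringIO


def parse(handle):
    """Yield one tree string per non-blank line (toy stand-in for a real parser)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the single tree in the handle, or raise ValueError otherwise."""
    trees = list(parse(handle))
    if len(trees) != 1:
        raise ValueError("Expected exactly one tree, found %i" % len(trees))
    return trees[0]


# The same generic loop copes with zero, one or many entries.
for text in ("", "(A,B);\n", "(A,B);\n(C,(D,E));\n"):
    count = sum(1 for tree in parse(StringIO(text)))
    print("%i tree(s)" % count)
```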

On a more general note, you seem to be recreating the file/handle logic
in each of the individual parsers. I think it would make much more sense
to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and
Bio.TreeIO.write() functions *only* and have the underlying format specific
code just use handles. This avoids the code duplication.
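
As a rough sketch of that layout (all names here are hypothetical, not the code in the branch): only the top level function decides whether it was given a filename or a handle, and the format specific parser only ever sees a handle.

```python
import os
import tempfile
from io import StringIO


def _newick_parse(handle):
    """Hypothetical format specific parser: takes a handle, nothing else."""
    for line in handle:
        if line.strip():
            yield line.strip()


_PARSERS = {"newick": _newick_parse}


def parse(file, format):
    """Top level entry point: the only place that opens and closes files."""
    parser = _PARSERS[format]
    if isinstance(file, str):
        # Given a filename: open it here and close it when iteration ends.
        with open(file) as handle:
            for tree in parser(handle):
                yield tree
    else:
        # Given a handle: the caller stays responsible for closing it.
        for tree in parser(file):
            yield tree


# Usage: the same call works with a handle or with a filename.
from_handle = list(parse(StringIO("(A,B);\n"), "newick"))

fd, path = tempfile.mkstemp(suffix=".nwk")
with os.fdopen(fd, "w") as out:
    out.write("(A,B);\n(C,D);\n")
from_file = list(parse(path, "newick"))
os.remove(path)
```

Adding another format then only means registering one more handle-based parser in the dictionary.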

[In fact, as I have said before, I prefer the simplicity of just allowing
handles - and we should make TreeIO and SeqIO/AlignIO consistent]

> (1) 'phyloxml' uses a different object representation than the other two, so
> converting between those formats is not possible until Nexus.Trees is ported
> over to Bio.Tree.

I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
that phyloxml allows very minimal trees, the reverse as well). It does look
like the best plan is to use the same tree objects for all three (updating
Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and distances
between nodes.

> (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> make the original Nexus module write out trees that it didn't parse itself.
> Help?

To get the Newick tree, you can just call str(tree), which is basically what
you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
more complicated. You'll need to create a minimal Nexus file - have a
look at the Bio.AlignIO.NexusIO code. An alternative is to have a hard
coded nexus template, and just insert the tree as a Newick string
(and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
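
The template idea could be sketched roughly like this. It is an assumption about what a minimal Nexus file needs (a Taxa block and a Trees block), write_nexus() is a hypothetical helper rather than anything in Bio.Nexus, and whether every Nexus reader accepts the result has not been checked.

```python
NEXUS_TEMPLATE = """#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)i;
 TaxLabels %(labels)s;
End;
Begin Trees;
 Tree %(name)s = %(newick)s
End;
"""


def write_nexus(newick, taxa, name="tree1"):
    """Wrap a Newick string (ending in ';') in a minimal Nexus file."""
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "name": name,
        "newick": newick,
    }


print(write_nexus("(A,(B,C));", ["A", "B", "C"]))
```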

> *Tree*
> The BaseTree module is meant to be the basis for Newick trees eventually,
> so I'd like to get the design right with the minimum number of public
> methods:
>
> (1) The find() function, named after the Unix utility that does the same
> thing for directory trees, seems capable of all the iteration and filtering
> necessary for locating data and automatically adding annotations to a tree.
> There's a 'terminal' argument for selecting internal nodes, external nodes,
> or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> to remove it if no one protests.

I'm in two minds - iterating over the leaves (taxa) seems like a very
common operation, and having an explicit method for this might be
clearer than calling find with special arguments.

> (2) Should find() be based on depth_first_search or breadth_first_search
> (not checked in yet)? DFS would potentially find a leaf node faster, but BFS
> seems more common in phylogenetics. Note that iteration can easily be
> reversed with the standard reversed() function, so we don't need extra
> functions for those cases.

You could do both, either via an argument or having two methods, say
depth_first_search and breadth_first_search instead of find.
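
A small sketch of both options together, using a toy Node class (hypothetical, not the Bio.Tree one): two explicit generators, plus a find() that picks between them with an argument.

```python
from collections import deque


class Node(object):
    """Toy tree node (hypothetical; not the Bio.Tree class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def depth_first_search(root):
    """Yield nodes in preorder: each node before its descendants."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def breadth_first_search(root):
    """Yield nodes level by level, starting from the root."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def find(root, order="depth"):
    """Single entry point choosing the traversal via an argument."""
    search = {"depth": depth_first_search, "breadth": breadth_first_search}[order]
    return search(root)


tree = Node("root", [Node("X", [Node("A"), Node("B")]), Node("C")])
print([n.name for n in find(tree, "depth")])    # ['root', 'X', 'A', 'B', 'C']
print([n.name for n in find(tree, "breadth")])  # ['root', 'X', 'C', 'A', 'B']
```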

> (3) I left room in each Node for the left and right indexes used by BioSQL's
> nested-set representation. Now I'm doubting the utility of that -- any
> Biopython function that uses those indexes would need to ensure that the
> index is up to date, which seems tricky. Shall I remove all mention of the
> nested-set representation, or try to support it fully?

A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it be
possible via a subclass?).

On a related point, do you think a BioSQL TaxonTree subclass is possible?
i.e. Something mimicking the new Tree objects (as a subclass), but which
loads data on demand from the taxon tables in a BioSQL database? This
would provide a nice way to work with the NCBI taxonomy (once loaded
into BioSQL), which is a very large tree. For an example use case, I might
want to extract just the bacteria as a subtree, and save that to a file.

> (4) There's some mention in the literature of a relationship-matrix
> representation for phylogenies. Does anyone here know how to work with this
> representation, or know if it would let us perform complex calculations with
> blinding speed behind the scenes? If so, should there be a function in
> Bio.Tree.Utils to export a tree to a NumPy array represented this way?  If
> not, I'll forget about it.

I don't know.

> *Graphics*
> I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> in the way biologists are used to seeing them. Does anyone know of existing
> code that could be borrowed for this? I looked at ETE (announced on the main
> biopython list last week) and liked the examples, but it uses PyQt4 and a
> standalone GUI for display, which is a substantial departure from the
> Biopython way of doing things.

I still haven't tracked down my old report lab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...

Peter


From biopython at maubp.freeserve.co.uk  Thu Sep 24 10:23:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 11:23:34 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
	<320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
Message-ID: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Now, let's make sure all the documentation and the wiki etc is up to date,
> and make an official announcement on the news server.
>

How does this look for a draft news post (with links to wiki pages etc):

The release of Biopython 1.52 earlier this week marked the end of an
era: it was our last release using CVS for source code control.

As of now, Biopython is using a git repository, hosted on github.com
who kindly provide git hosting for open source projects free of
charge. The BioRuby project have been using github for some time now,
so we are in good company.

The existing OBF hosted CVS repository will be maintained in the short
to medium term as a backup, but will not be updated.

Although many people have been involved in this move, we'd like to
thank Bartek Wilczynski in particular for handling the CVS to git
conversion, and for mirroring our CVS updates to git during the last
few months' transition period. In the next few weeks hopefully we'll
get our git usage wiki pages perfected, as we start using git for
real.

Peter


From jhuerta at crg.es  Thu Sep 24 10:45:21 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Thu, 24 Sep 2009 12:45:21 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>

Hi,

( I'm the developer of ETE. )
I agree that PyQt4 is a significant dependency. I chose it because the
Qt4 QGraphicsScene environment offers many possibilities like OpenGL
rendering, unlimited image size, performance, and good bindings to python.
However, I am working on my code to allow the rendering algorithm to use any
other graphical library. So, you could render the same tree images using
different backends. If you think this is useful for you, please let me know
and we can think about how to integrate it with biopython.
Regarding the GUI, it is not a standalone application but one more method
within the Tree objects. The GUI can be started at any point of the
execution and the main program will continue after you close it. I did it
like this because I think it is quite useful for working within interactive
python sessions.

I develop a lot of code around tree handling, so if you think I can help,
please tell me.
jaime.


> > *Graphics*
> > I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> > usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> > nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> > in the way biologists are used to seeing them. Does anyone know of existing
> > code that could be borrowed for this? I looked at ETE (announced on the main
> > biopython list last week) and liked the examples, but it uses PyQt4 and a
> > standalone GUI for display, which is a substantial departure from the
> > Biopython way of doing things.
>
> I still haven't tracked down my old report lab code, but it wasn't object
> orientated and would need a lot of work to bring up to standard...
>
> Peter
>


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 11:14:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 07:14:37 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
	shorter peptide sequences than expected
In-Reply-To: <bug-2910-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241114.n8OBEbKH005629@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 07:14 EST -------
(In reply to comment #4)
> Peter,
> 
> yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time,
> I'll take a look at it and post them here. It's easy to do this. What I did is,
> I parsed the structures through the dssp structure assignment tool and compared
> the obtained sequence with that obtained from the Bio.PDB parser. Background: I
> wanted to map the sequence that dssp sees to atomic coordinates.
> 

If you can give us some more examples that would be very helpful, thank you.

I have committed a partial fix which means any unknown modified amino acids
(based on the presence of an alpha carbon) will be treated as an amino
acid for building the peptide (and given the default sequence letter of X).
This will also issue a warning. Any such previously unknown modified amino
acid (like PYX) needs to be added to our hard coded lookup table with the
appropriate single letter symbol as used by the PDB in their FASTA files
(in this case, PYX -> C for cysteine).
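
Roughly, the behaviour described above amounts to the following sketch. This is not the Bio.PDB code: the table is a tiny excerpt, residue_letter() is a hypothetical helper, and "ABC" is a made-up residue name standing in for an unlisted modified amino acid.

```python
import warnings

# Tiny excerpt of a three letter to one letter table. The PYX -> C entry
# is the one named above; a real table would cover all standard residues.
THREE_TO_ONE = {"ALA": "A", "CYS": "C", "GLY": "G", "PYX": "C"}


def residue_letter(resname, has_alpha_carbon):
    """Return a one letter code, or None if this is not an amino acid."""
    if resname in THREE_TO_ONE:
        return THREE_TO_ONE[resname]
    if has_alpha_carbon:
        # Unknown residue with a CA atom: assume a modified amino acid.
        warnings.warn("Assuming %s is an unknown modified amino acid" % resname)
        return "X"
    return None  # e.g. water or a ligand


peptide = [("GLY", True), ("PYX", True), ("ABC", True), ("HOH", False)]
letters = [residue_letter(name, ca) for name, ca in peptide]
print("".join(x for x in letters if x))  # GCX
```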

I suspect that some of your other problem PDB files still have (currently)
undefined modified amino acids in them...

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Thu Sep 24 11:39:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 12:39:59 +0100
Subject: [Biopython-dev] Committing to github...
Message-ID: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>

Hi all,

My last couple of commits to github have been from a local clone of
the *official* repository: http://github.com/biopython/biopython/

This is a nice and simple work flow for small changes, and the history
and github network graph are easy to understand:
http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch

This seems like the easiest way to work for people used to CVS, and you
don't need to bother with your own Biopython cloned repository on github
(you just need a github account and collaborator status). I'll probably
continue to do this in the short term.

--

However, prior to that I did a couple of commits via a local clone of
*my* personal github repository, http://github.com/peterjc/biopython/

I had kept the master branch on *my* repository identical to the
official master. However, while I was only pushing a tiny change, git
did this as a merge - resulting in a flurry of RSS entries and a
complicated looking git network diagram. I think it is probably just
down to the way we've been using the repositories during the migration?
With this backlog of merges done, I expect future commits by this route
will look much cleaner...

Peter


From chapmanb at 50mail.com  Thu Sep 24 12:08:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 24 Sep 2009 08:08:00 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <20090924120800.GJ13500@sobchak.mgh.harvard.edu>

Eric and Peter;
Looking forward to seeing the PhyloXML work merged into the main
branch. Eric, thanks for posting the summary of where things are at.

> > (1) 'phyloxml' uses a different object representation than the other two, so
> > converting between those formats is not possible until Nexus.Trees is ported
> > over to Bio.Tree.
> 
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
> that phyloxml allows very minimal trees, the reverse as well). It does look
> like the best plan is to use the same tree objects for all three (updating
> Bio.Nexus if possible).

Agreed that this would be nice to have, but I'm not sure why it's
blocking getting the base TreeIO framework and all of PhyloXML into
the main branch. That's a major step forward from the format
specific phylogenetic code we had before and gets us a portion of
the way there.

Next up should be moving over Bio.Nexus to the new framework and
then conversions, but this is another project. I think we should
take this one step at a time.

> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.

Agreed. As we move Nexus over, we should be sure to keep current
functionality.

> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and filtering
> > necessary for locating data and automatically adding annotations to a tree.
> > There's a 'terminal' argument for selecting internal nodes, external nodes,
> > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> > to remove it if no one protests.
> 
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.

I'm for keeping it as well, and just having the underlying
implementation of get_leaf_nodes call find with the right arguments.
This seems like an operation that should be dead obvious to do.
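
Something along these lines, as a sketch only: the Node class is a toy, and the meaning given to the 'terminal' argument (True for leaves, False for internal nodes, None for both) is an assumption about how find() works in the branch.

```python
class Node(object):
    """Toy tree node (hypothetical; not the actual BaseTree class)."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)

    def is_terminal(self):
        return not self.children

    def find(self, terminal=None):
        """Yield this node and its descendants, depth first.

        terminal=True selects leaves only, terminal=False internal
        nodes only, and terminal=None (the default) selects both.
        """
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            for node in child.find(terminal):
                yield node

    def get_leaf_nodes(self):
        """Obvious name for a common operation; find() does the work."""
        return list(self.find(terminal=True))


tree = Node("root", [Node("X", [Node("A"), Node("B")]), Node("C")])
print([leaf.name for leaf in tree.get_leaf_nodes()])  # ['A', 'B', 'C']
```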

> > (3) I left room in each Node for the left and right indexes used by BioSQL's
> > nested-set representation. Now I'm doubting the utility of that -- any
> > Biopython function that uses those indexes would need to ensure that the
> > index is up to date, which seems tricky. Shall I remove all mention of the
> > nested-set representation, or try to support it fully?

Again I agree with Peter here -- this would be best supported as a
subclass that is database aware with an identical API, similar to
how the Seq objects and BioSQL Seq objects work. This avoids any
overhead for the in-memory case, which will be more common, but
gives you a point to implement the useful database representation
code in the future. If you don't have time to work on all of this
right now, I'd leave the nested-set stuff out and keep it in mind as
a future addition.
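
For reference, the left and right indexes can be derived from a plain in-memory tree in a single depth first walk, so nothing needs to be stored on the base Node. The sketch below uses a toy Node class and a hypothetical helper, and numbers from 1; whether that matches BioSQL's exact convention has not been checked here.

```python
class Node(object):
    """Toy in-memory node with no database fields (hypothetical)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def nested_set_indexes(root):
    """Return {node name: (left, right)} from one depth first walk.

    Each node gets 'left' on the way down and 'right' on the way back
    up, so a descendant's pair always lies inside its ancestor's pair.
    """
    indexes = {}
    counter = [0]  # boxed so the nested function can update it

    def visit(node):
        counter[0] += 1
        left = counter[0]
        for child in node.children:
            visit(child)
        counter[0] += 1
        indexes[node.name] = (left, counter[0])

    visit(root)
    return indexes


tree = Node("root", [Node("X", [Node("A"), Node("B")]), Node("C")])
indexes = nested_set_indexes(tree)
print(indexes["root"], indexes["X"], indexes["C"])  # (1, 10) (2, 7) (8, 9)
```

Recomputing the indexes when they are needed avoids the problem of keeping them up to date after every edit to the tree, at the cost of one full traversal.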

Brad


From biopython at maubp.freeserve.co.uk  Thu Sep 24 12:48:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 13:48:37 +0100
Subject: [Biopython-dev] Git documentation on wiki
Message-ID: <320fb6e00909240548q4db8dfc1l83be8408d3b8718f@mail.gmail.com>

Hi all,

I think I have updated the relevant wiki pages about the CVS to git
migration. I have also made the "git" page redirect to the "Source
Code" page, which is the main access point. This now has a quick
summary with the basic links here for anyone wanting to grab the
latest code:

http://biopython.org/wiki/SourceCode

If anyone spots any errors or typos, feel free to fix them or raise
them here for discussion as needed.

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 14:42:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:42:08 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed
	assertion in CondonTable Fix+Patch
In-Reply-To: <bug-2894-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241442.n8OEg8Xo012359@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 10:42 EST -------
I've actually installed Jython 2.5.0 and checked this. A further fix was
required, but this now works with the latest Biopython in git.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 14:46:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:38 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkc1w012533@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 10:46 EST -------
Testing with Jython 2.5.0 shows my fix didn't work. Reopening...


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 14:46:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:49 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEknEX012555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


Bug 2895 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 14:46:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:53 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To: <bug-2893-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkrFK012570@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2893


Bug 2893 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 14:46:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 10:46:55 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To: <bug-2892-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241446.n8OEkt93012582@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892


Bug 2892 depends on bug 2891, which changed state.

Bug 2891 Summary: Jython test_NCBITextParser fix+patch
http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 15:11:22 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:22 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBM3q013469@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

                What    |Removed                     |Added
---------------------------------------------------------------------------------
       BugsThisDependsOn|2890                        |
OtherBugsDependingOnThis|2892, 2893, 2895            |


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-24 11:11 EST -------
Removing dependencies on other Jython bugs - they don't block each other.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 15:11:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:25 -0400
Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython
In-Reply-To: <bug-2890-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBPYu013482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2890


biopython-bugzilla at maubp.freeserve.co.uk changed:

                What    |Removed                     |Added
---------------------------------------------------------------------------------
OtherBugsDependingOnThis|2891                        |


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 15:11:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:40 -0400
Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary
	Jython Error Fix+Patch
In-Reply-To: <bug-2895-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBeug013513@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2895


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |




From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 15:11:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:42 -0400
Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch
In-Reply-To: <bug-2893-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBgcU013525@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2893


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |




From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 15:11:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 11:11:45 -0400
Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch
In-Reply-To: <bug-2892-42@http.bugzilla.open-bio.org/>
Message-ID: <200909241511.n8OFBj1e013540@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2892


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |




From bugzilla-daemon at portal.open-bio.org  Thu Sep 24 16:10:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 24 Sep 2009 12:10:30 -0400
Subject: [Biopython-dev] [Bug 2918] New: Entrez parser fails on Jython -
	XMLParser lacks SetParamEntityParsing
Message-ID: <bug-2918-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2918

           Summary: Entrez parser fails on Jython - XMLParser lacks
                    SetParamEntityParsing
           Product: Biopython
           Version: 1.52
          Platform: All
               URL: http://bugs.jython.org/issue1447
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
                CC: kellrott at ucsd.edu


I'm filing this as a bug report so we can track it, but the underlying issue is
a known Jython bug, http://bugs.jython.org/issue1447 (thanks Kyle for reporting
this already).

It can be shown just by running our unit test:

 ~/jython2.5.0/jython run_tests.py test_Entrez.py
test_Entrez ... FAIL
======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pjcock/repositories/biopython/Tests/test_Entrez.py", line 3443, in test_journals
    record = Entrez.read(input)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/Parser.py", line 85, in run
    self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
AttributeError: 'XMLParser' object has no attribute 'SetParamEntityParsing'

...
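
For illustration only, one way calling code could avoid the AttributeError
is to test for the method before using it. This is a sketch, not the fix
adopted in Biopython or Jython: make_parser is a hypothetical helper, and
skipping the call means parameter entities are not processed on platforms
that lack the method.

```python
from xml.parsers import expat


def make_parser():
    """Create an expat parser, enabling parameter entity parsing if possible.

    make_parser is a hypothetical helper name used only for this sketch.
    """
    parser = expat.ParserCreate()
    # The traceback above shows that the parser object may lack this
    # method, so look it up defensively instead of calling it directly.
    set_param = getattr(parser, "SetParamEntityParsing", None)
    if set_param is not None:
        set_param(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    return parser


# Usage: collect element names from a small document.
names = []
parser = make_parser()
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.Parse("<root><child/></root>", True)
print(names)
```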




From biopython at maubp.freeserve.co.uk  Thu Sep 24 17:59:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 18:59:06 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090924120800.GJ13500@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<20090924120800.GJ13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909241059yfa43889w82c76cd7f2365dee@mail.gmail.com>

On Thu, Sep 24, 2009 at 1:08 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Eric and Peter;
> Looking forward to seeing the PhyloXML work merged into the main
> branch. Eric, thanks for posting the summary of where things are at.
>
>> > (1) 'phyloxml' uses a different object representation than the other two, so
>> > converting between those formats is not possible until Nexus.Trees is ported
>> > over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
>> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
>> that phyloxml allows very minimal trees, the reverse as well). It does look
>> like the best plan is to use the same tree objects for all three (updating
>> Bio.Nexus if possible).
>
> Agreed that this would be nice to have, but I'm not sure why it's
> blocking getting the base TreeIO framework and all of PhyloXML into
> the main branch. That's a major step forward from the format
> specific phylogenetic code we had before and gets us a portion of
> the way there.

If the Newick/Nexus TreeIO parsers return one object type while the
PhyloXML TreeIO parser returns another *incompatible* object type,
then we don't have a unified tree input/output framework. Furthermore,
if you did release this and then later standardise on a single tree object,
you'd break backwards compatibility. All in all, best avoided.

> Next up should be moving over Bio.Nexus to the new framework and
> then conversions, but this is another project. I think we should
> take this one step at a time.

What we could do in the short term is ignore Bio.Nexus.Trees, and
just leave it as is. Instead of having the Newick/Nexus TreeIO code
calling the old Bio.Nexus.Trees code, we just write some new code
(possibly based on old code) which will use Eric's new objects.

We could then (gradually, perhaps by adding a runtime option to
the Nexus parsing API) move Bio.Nexus over from using the old
Bio.Nexus.Trees code to the new TreeIO, and eventually deprecate
and then remove Bio.Nexus.Trees.

Peter


From eric.talevich at gmail.com  Fri Sep 25 03:54:05 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 24 Sep 2009 23:54:05 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
Message-ID: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>

Hello, Jaime,

Sorry I didn't respond directly to your earlier post -- I wrote half of an
e-mail, then realized I had no good suggestions on what to do so I scrapped
it.

My Tree and TreeIO code is basically a complete parser for the phyloXML
format, plus a few base classes extracted out in hopes of eventually
creating a unified set of format-independent objects, as in SeqIO and
AlignIO. Your code for working with trees looks much more complete than
mine, so if some of it can be incorporated into Biopython, I think that
would be great.

I see these issues with integration:
1. It's GPL, while Biopython uses a more permissive custom license
resembling the BSD and MIT licenses. Would you be willing and able to
relicense parts of your work for Biopython?

2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
require some compatibility fixes -- not a huge problem.

3. Scipy and numpy dependencies: Numpy is considered a semi-optional
dependency in Biopython, so if it can be imported on the fly by just the
functions that need it (hopefully no core ones), that would be best. If
not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
it would be better to make that an optional, on-the-fly import, too.

4. PyQt4 is a big package and I'm not sure it's as common in scientists'
Python installations as numpy and scipy, so if the underlying algorithms for
tree layout could be ported to Reportlab, matplotlib or PIL, that would be
ideal. I personally would like to be able to pair sequence snippets with the
leaves of a standard phylogram, so if you need me to do some additional work
to get this section ported to Biopython, I'd consider it time well spent.

5. Presumably, the tree object type in ETE is different from Bio.Tree or
Bio.Nexus, so porting the core tree manipulation code to Biopython would
require a substantial effort somewhere.

6. The PhylomeDB connector is cool, and browsing the source, looks like it
wouldn't require much effort at all to drop into Biopython.

Thanks for letting us know about this.

Cheers,
Eric


On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:

> Hi,
>
> (I'm the developer of ETE.)
> I agree that PyQt4 is an important dependency. I chose it because the
> Qt4 QGraphicsScene environment offers many possibilities like OpenGL
> rendering, unlimited image size, performance, and good bindings to Python.
> However, I am working on my code to allow the rendering algorithm to use any
> other graphical library. So, you could render the same tree images using
> different backends. If you think this is useful for you, please let me know
> and we can think about how to integrate it with Biopython.
> Regarding the GUI, it is not a standalone application but one more method
> within the Tree objects. The GUI can be started at any point of the
> execution and the main program will continue after you close it. I did it
> like this because I think it is quite useful for working within interactive
> Python sessions.
>
> I develop a lot of code around tree handling, so if you think I can help,
> please tell me.
> jaime.
>
>
>
>> > *Graphics*
>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave
>> > unlabeled nodes inconspicuous, so the resulting graphic is much cleaner,
>> > perhaps even usable. Plus, the nodes are now a pretty shade of blue.
>> > Still, it would be nice to have a Reportlab-based module in Bio.Graphics
>> > to print phylogenies in the way biologists are used to seeing them. Does
>> > anyone know of existing code that could be borrowed for this? I looked at
>> > ETE (announced on the main biopython list last week) and liked the
>> > examples, but it uses PyQt4 and a standalone GUI for display, which is a
>> > substantial departure from the Biopython way of doing things.
>>
>> I still haven't tracked down my old report lab code, but it wasn't object
>> orientated and would need a lot of work to bring up to standard...
>>
>>
>> Peter
>
> --
> =========================
> Jaime Huerta-Cepas, Ph.D.
> CRG-Centre for Genomic Regulation
> Doctor Aiguader, 88
> PRBB Building
> 08003 Barcelona, Spain
> http://www.crg.es/comparative_genomics
> =========================
>
>


From eric.talevich at gmail.com  Fri Sep 25 04:34:17 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 25 Sep 2009 00:34:17 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
Message-ID: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>

Hi Peter,

Thanks for the feedback.

On Thu, Sep 24, 2009 at 5:57 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
> supported for formats that can represent multiple phylogenetic trees in a
> single file". Is that true, and if so why? For SeqIO and AlignIO you can
> use parse on a file with one entry, the iterator just returns one entry.
> Easy.
> This is important for allowing generic code (e.g. a loop) regardless of
> how many entries there are (one, many, or even zero).
>
>
I'll delete that sentence. I don't know why it's there -- you're right, it's
easy to return an iterable regardless of what the format itself supports.
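
To make the point concrete, here is a minimal sketch of a parse() that
always returns an iterator, so the same loop works for zero, one or many
trees, plus a read() that insists on exactly one. It is not the Bio.TreeIO
code: the toy parser just splits Newick text on ";" (an assumption that
ignores quoted labels) and yields strings rather than tree objects.

```python
from io import StringIO


def parse(handle):
    """Yield one Newick string per tree in the handle (zero or more)."""
    buffer = ""
    for line in handle:
        buffer += line.strip()
        # A semicolon ends a tree in this simplified model.
        while ";" in buffer:
            tree, buffer = buffer.split(";", 1)
            if tree:
                yield tree + ";"


def read(handle):
    """Return the only tree in the handle, or raise ValueError."""
    trees = parse(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle")
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")


# The same loop copes with empty, single-tree and multi-tree input.
for text in ("", "(A,B);", "(A,B);\n(C,(D,E));\n"):
    count = sum(1 for tree in parse(StringIO(text)))
    print(count)
```

The read() behaviour mirrors Bio.SeqIO.read(), which raises ValueError
unless the file holds exactly one record.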

> On a more general note, you seem to be recreating the file/handle logic
> in each of the individual parsers. I think it would make much more sense
> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
> and Bio.TreeIO.write() functions *only* and have the underlying format
> specific code just use handles. This avoids the code duplication.
>
>
I did the handle management case-by-case because some of the underlying
libraries already do filename-to-handle conversion -- ElementTree and
Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
ad-hoc handle management, but of course I can move it all to the top if you
think it's best. One day, perhaps we'll have a context manager that we can
reuse everywhere to make magic easy:

with maybe_open(file) as handle:
   tree = FooIO.parse(handle)

Not today, though.
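
For what it's worth, such a helper could look roughly like the sketch
below. maybe_open is the hypothetical name from the snippet above, not an
existing Biopython function, and the example needs contextlib, which means
Python 2.5 or later. It opens and closes the file only when given a
filename; a handle passed in by the caller is left open.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def maybe_open(source, mode="r"):
    """Yield a handle for either a filename or an already-open handle."""
    if isinstance(source, str):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            # We opened it, so we close it.
            handle.close()
    else:
        # The caller owns this handle, so leave it open.
        yield source


# Usage: the same code path accepts a filename or a handle.
with tempfile.NamedTemporaryFile("w", suffix=".nwk", delete=False) as tmp:
    tmp.write("(A,B);\n")
path = tmp.name

with maybe_open(path) as handle:
    from_name = handle.read()

with open(path) as opened:
    with maybe_open(opened) as handle:
        from_handle = handle.read()
    still_open = not opened.closed

os.remove(path)
```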


> > (1) 'phyloxml' uses a different object representation than the other two,
> > so converting between those formats is not possible until Nexus.Trees is
> > ported over to Bio.Tree.
>
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
> would actually let you do phyloxml -> newick, and phyloxml -> nexus (and
> assuming that phyloxml allows very minimal trees, the reverse as well). It
> does look like the best plan is to use the same tree objects for all three
> (updating Bio.Nexus if possible).
>
>
I could comment out the 'nexus' and 'newick' lines from the
supported_formats dict. That would disable the top-level functions but leave
the direct NexusIO and NewickIO equivalents intact until the port is
complete.


> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.
>

That's intentional; I was just going to port those methods directly from
Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
combined parsers and object representations. My goal is to chop out the
pure-object parts and merge them into Bio.Tree, and let the remaining
parsers return objects built from the new Bio.Tree classes. This looks like
it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
done.

For backward compatibility, I'll leave some wrappers that trigger
DeprecationWarnings in the original places. Nexus.Trees can probably be
reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)

from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)
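
A module-level warning like the one above fires once, on first import. A
per-call variant is sketched below as an alternative; it is self-contained
and every name in it (new_parse, deprecated_alias, the old and new dotted
names in the message) is hypothetical, chosen only to show the pattern.

```python
import functools
import warnings


def new_parse(text):
    """Stand-in for the relocated implementation (hypothetical)."""
    return text.strip().rstrip(";")


def deprecated_alias(func, old_name, new_name):
    """Wrap func so each call through the old name warns, then delegates."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated; use %s instead" % (old_name, new_name),
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)

    return wrapper


# The old location keeps working but nudges callers towards the new one.
old_parse = deprecated_alias(new_parse, "Nexus.Trees.parse", "Bio.TreeIO.parse")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_parse("(A,B);\n")

print(result, len(caught))
```

Recent Python versions hide DeprecationWarning by default outside
__main__, hence the explicit simplefilter("always") in the usage lines.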

> > (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> > make the original Nexus module write out trees that it didn't parse
> > itself. Help?
>
> To get the Newick tree, you can just call str(tree), which is basically
> what you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to
> be more complicated. You'll need to create a minimal Nexus file - have a
> look at the Bio.AlignIO.NexusIO code. An alternative to look at is
> having a hard coded nexus template, and just insert the tree as a Newick
> string (and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
>
>
OK, thanks, I'll give it a shot. I see some default Nexus template stuff in
Bio.Nexus.Nexus already.


> > *Tree*
> > The BaseTree module is meant to be the basis for Newick trees eventually,
> > so I'd like to get the design right with the minimum number of public
> > methods:
> >
> > (1) The find() function, named after the Unix utility that does the same
> > thing for directory trees, seems capable of all the iteration and
> > filtering necessary for locating data and automatically adding
> > annotations to a tree. There's a 'terminal' argument for selecting
> > internal nodes, external nodes, or both, and I think this means
> > get_leaf_nodes() is unnecessary. I'm going to remove it if no one
> > protests.
>
> I'm in two minds - iterating over the leaves (taxa) seems like a very
> common operation, and having an explicit method for this might be
> clearer than calling find with special arguments.
>

I think .find(terminal=True) will do the right thing and looks reasonably
simple, but as Brad said, this is a ridiculously common operation so finding
it in the API should be ridiculously easy. I'll rename this function to
get_leaves() and rename find() to findall() (to match ElementTree and make
it clear that it returns an iterable).
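
Roughly, the relationship between the two methods would be as in this
sketch. The Node class here is a simplified stand-in, not the actual
BaseTree code, and the signatures are assumptions: findall() is a
generator with an optional 'terminal' filter, and get_leaves() is a thin
convenience wrapper around it.

```python
class Node:
    """Simplified stand-in for a tree node (not the real BaseTree class)."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children

    def findall(self, terminal=None):
        """Yield nodes in preorder.

        terminal=True yields only leaves, terminal=False only internal
        nodes, and terminal=None yields every node.
        """
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            yield from child.findall(terminal=terminal)

    def get_leaves(self):
        """Return the terminal nodes as a list."""
        return list(self.findall(terminal=True))


tree = Node("root", [Node("A"), Node("inner", [Node("B"), Node("C")])])
print([node.name for node in tree.get_leaves()])            # ['A', 'B', 'C']
print([node.name for node in tree.findall(terminal=False)])  # ['root', 'inner']
```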


> > (3) I left room in each Node for the left and right indexes used by
> > BioSQL's nested-set representation. Now I'm doubting the utility of that
> > -- any Biopython function that uses those indexes would need to ensure
> > that the index is up to date, which seems tricky. Shall I remove all
> > mention of the nested-set representation, or try to support it fully?
>
> A partial implementation doesn't seem helpful, and wastes memory
> allocating unused properties. I would remove it from the base Node,
> but a full implementation might be useful for something (would it be
> possible via a subclass?).
>
> On a related point, do you think a BioSQL TaxonTree subclass is possible?
> i.e. Something mimicking the new Tree objects (as a subclass), but which
> loads data on demand from the taxon tables in a BioSQL database? This
> would provide a nice way to work with the NCBI taxonomy (once loaded
> into BioSQL), which is a very large tree. For an example use case, I might
> want to extract just the bacteria as a subtree, and save that to a file.
>
>
Doing BioSQL integration was on the original roadmap, but research hasn't
taken me back there lately. I would like to do it eventually... anyway, that
would solve the indexing issue nicely. I'll drop the extra attributes -- I
get the impression they're not meant to be accessed directly in BioSQL
either, so there's no use for them in Biopython.


Cheers,
Eric


From biopython at maubp.freeserve.co.uk  Fri Sep 25 09:59:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 10:59:08 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>
>> On a related point, do you think a BioSQL TaxonTree subclass is possible?
>> i.e. Something mimicking the new Tree objects (as a subclass), but which
>> loads data on demand from the taxon tables in a BioSQL database? This
>> would provide a nice way to work with the NCBI taxonomy (once loaded
>> into BioSQL), which is a very large tree. For an example use case, I might
>> want to extract just the bacteria as a subtree, and save that to a file.
>>
>
> Doing BioSQL integration was on the original roadmap, but research hasn't
> taken me back there lately. I would like to do it eventually... anyway, that
> would solve the indexing issue nicely. I'll drop the extra attributes -- I
> get the impression they're not meant to be accessed directly in BioSQL
> either, so there's no use for them in Biopython.

As things stand, there is no usage of the left/right index fields in
Biopython.

The current Biopython BioSQL code focusses on the database
variants of the Seq and SeqRecord objects. The only interaction
with the taxon tables is to load/retrieve the species annotations,
and for this we don't need the complications of the left/right index.
We leave them empty if we populate the taxonomy via Entrez
(recalculating the left/right values is computationally expensive).

However, any "DBTaxonTree" object (or whatever we call it) could
potentially offer us a way to (a) populate and (b) use these
alternative indexes as a way to speed up various subtree operations.

Peter


From biopython at maubp.freeserve.co.uk  Fri Sep 25 10:08:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 11:08:56 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250308s35a286e7x67a7bb3fec6a0673@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
>> supported for formats that can represent multiple phylogenetic trees in a
>> single file". Is that true, and if so why? For SeqIO and AlignIO you can
>> use parse on a file with one entry, the iterator just returns one entry.
>> This is important for allowing generic code (e.g. a loop) regardless of
>> how many entries there are (one, many, or even zero).
>
> I'll delete that sentence. I don't know why it's there -- you're right, it's
> easy to return an iterable regardless of what the format itself supports.

OK.

>> On a more general note, you seem to be recreating the file/handle logic
>> in each of the individual parsers. I think it would make much more sense
>> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
>> and Bio.TreeIO.write() functions *only* and have the underlying format
>> specific code just use handles. This avoids the code duplication.
>
> I did the handle management case-by-case because some of the underlying
> libraries already do filename-to-handle conversion -- ElementTree and
> Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
> ad-hoc handle management, but of course I can move it all to the top if you
> think it's best.

Having a single layer of handle/filename conversion in Bio.TreeIO does
seem cleanest to me (even if some of the back ends allow either) and
will ensure our code is consistent.

> One day, perhaps we'll have a context manager that we can
> reuse everywhere to make magic easy:
>
> with maybe_open(file) as handle:
>    tree = FooIO.parse(handle)
>
> Not today, though.

Not yet, no. For one thing we'll have to phase out Python 2.4 support.

>>> (1) 'phyloxml' uses a different object representation than the other two,
>>> so converting between those formats is not possible until Nexus.Trees
>>> is ported over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
>> would actually let you do phyloxml -> newick, and phyloxml -> nexus
>> (and assuming that phyloxml allows very minimal trees, the reverse
>> as well). It does look like the best plan is to use the same tree objects
>> for all three (updating Bio.Nexus if possible).
>
> I could comment out the 'nexus' and 'newick' lines from the
> supported_formats dict. That would disable the top-level functions
> but leave the direct NexusIO and NewickIO equivalents intact until
> the port is complete.

I guess shipping a "phyloxml" only Bio.TreeIO would work, but it
would be rather less useful. We could certainly start with just that
on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no
Bio.TreeIO.NexusIO modules - initially have just a single backend).

>> Note that Bio.Nexus.Trees still has some useful methods you don't
>> appear to support, like finding the last common ancestor and
>> distances between nodes.
>
> That's intentional, I was just going to port those methods directly from
> Bio.Nexus.Trees rather than invent a new API myself.

OK - sounds good.

> Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are
> combined parsers and object representations. My goal is to chop out the
> pure-object parts and merge them into Bio.Tree, and let the remaining
> parsers return objects built from the new Bio.Tree classes. This looks like
> it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be
> done.

Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has
been to separate the data object from the (many possible) parsers.

> For backward compatibility, I'll leave some wrappers that trigger
> DeprecationWarnings in the original places. Nexus.Trees can
> probably be reduced ...

Something like that, sure.

>>> (1) The find() function, named after the Unix utility that does the
>>> same thing for directory trees, seems capable of all the iteration
>>> and filtering necessary for locating data and automatically adding
>>> annotations to a tree. There's a 'terminal' argument for selecting
>>> internal nodes, external nodes, or both, and I think this means
>>> get_leaf_nodes() is unnecessary. I'm going to remove it if no one
>>> protests.
>>
>> I'm in two minds - iterating over the leaves (taxa) seems like a very
>> common operation, and having an explicit method for this might be
>> clearer than calling find with special arguments.
>
> I think .find(terminal=True) will do the right thing and looks reasonably
> simple, but as Brad said, this is a ridiculously common operation so
> finding it in the API should be ridiculously easy. I'll rename this function
> to get_leaves() and rename find() to findall() (to match ElementTree
> and make it clear that it returns an iterable).

OK.

Peter


From hlapp at gmx.net  Fri Sep 25 11:39:03 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 25 Sep 2009 07:39:03 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
	<320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com>
Message-ID: <B877367C-9144-488C-B018-A8B5771051D1@gmx.net>


On Sep 25, 2009, at 5:59 AM, Peter wrote:

> On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>>>
>>> On a related point, do you think a BioSQL TaxonTree subclass is possible?
>>> i.e. Something mimicking the new Tree objects (as a subclass), but which
>>> loads data on demand from the taxon tables in a BioSQL database? This
>>> would provide a nice way to work with the NCBI taxonomy (once loaded
>>> into BioSQL), which is a very large tree. For an example use case, I might
>>> want to extract just the bacteria as a subtree, and save that to a file.
>>>
>>
>> Doing BioSQL integration was on the original roadmap, but research hasn't
>> taken me back there lately. I would like to do it eventually... anyway, that
>> would solve the indexing issue nicely. I'll drop the extra attributes -- I
>> get the impression they're not meant to be accessed directly in BioSQL
>> either, so there's no use for them in Biopython.
>
> As things stand, there is no usage of the left/right index fields in
> Biopython.

The left/right fields are really a crutch for doing hierarchical  
(recursive) queries in SQL more efficiently. SQL doesn't have native  
support for recursive queries, and the left/right index values allow  
you to rewrite an otherwise recursive query as a single-hit set.

Within an object-oriented programming language that supports recursion  
these values are of no use - they don't let you traverse a tree faster  
than you would already be able to do through recursing up or down your  
tree data structure. If there's a natural order of nodes, you can  
speed up finding nodes through binary search. But for pulling out  
lineages or subtrees I doubt that this will help at all - it'll have  
to be your data structure (such as having double links) that makes  
those operations efficient.
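
As a small illustration of the nested-set idea, the sketch below numbers
each node with left/right values in one depth-first walk and then pulls
out a subtree with a single range test, which is the comparison a SQL
query would make on the two columns. The dict-of-children layout and the
function names are assumptions made for this example, not BioSQL or
Biopython code.

```python
def build_index(children, root):
    """Return {node: (left, right)} assigned by a depth-first walk."""
    index = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        index[node] = (left, counter)

    visit(root)
    return index


def subtree(index, node):
    """Names of all nodes whose interval lies inside node's interval."""
    left, right = index[node]
    return sorted(
        name for name, (lo, hi) in index.items() if left <= lo and hi <= right
    )


children = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["E. coli", "B. subtilis"],
    "Eukaryota": ["H. sapiens"],
}
index = build_index(children, "root")
print(index["Bacteria"])           # (2, 7)
print(subtree(index, "Bacteria"))  # ['B. subtilis', 'Bacteria', 'E. coli']
```

Any insertion or deletion shifts the numbers of later nodes, which is why
keeping the values up to date is the expensive part.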

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri Sep 25 12:26:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 13:26:38 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
	<320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
	<8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
	<320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
	<320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
	<8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
	<320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
	<320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
	<320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>
	<320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com>
Message-ID: <320fb6e00909250526s294eee65ubbc508136f26f48a@mail.gmail.com>

On Thu, Sep 24, 2009 at 11:23 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Sep 23, 2009 at 5:34 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>> Now, let's make sure all the documentation and the wiki etc is up to date,
>> and make an official announcement on the news server.
>
> How does this look for a draft news post (with links to wiki pages etc):
>
> The release of Biopython 1.52 earlier this week marked the end of an
> era, it was our last release using CVS for source code control. ...

I went ahead and posted something based on that draft:
http://news.open-bio.org/news/2009/09/biopython-cvs-to-git-migration/

Nice to see several more people have started following the github
repository already :)

Peter


From jhuerta at crg.es  Fri Sep 25 15:28:36 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Fri, 25 Sep 2009 17:28:36 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
Message-ID: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>

Hi Eric,

Thanks for your comments,
I see a lot of parts of ETE that could potentially be used from
Biopython. However, for the moment, we would prefer not to change
ETE's current GPL license. As far as I know, the main difference between
GPL and BSD-like licenses is that, with the latter, you could relicense the
code at any moment under any other policy, including private and closed
licenses. The GPL protects against this by ensuring that any code based
on GPL sources must always remain GPL compatible, and that's why we have
chosen it. Moreover, using a BSD-like license would prevent us from using
a lot of great GPL code out there.

It is not my purpose to open a debate about licenses. I just wonder if
Biopython could provide some way to link/bind external software, perhaps as
addons or plugins. This would be great, since many extra features (not only
from ETE but from other sources) could be added on demand. This
would also mitigate the problem of very specific dependencies, since many of
them would be optional. From my side, I could work on providing bindings
between Biopython and ETE's tree graphical rendering features, inline
visualization GUI, extended newick support, tree manipulation and the
other methods within the ETE package.
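
To illustrate the optional-dependency idea, here is a minimal sketch. The
names load_optional and render_tree, and the backend module name, are
hypothetical; nothing like this exists in Biopython or ETE as far as this
thread shows. The point is only that the import happens when a feature is
requested, so a missing package disables one feature rather than the library.

```python
import importlib


def load_optional(module_name):
    """Return the named module, or None if it is not installed.

    Hypothetical helper: the import is attempted only when a feature
    that needs the module is actually requested.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def render_tree(newick, backend_name="some_missing_gui_backend"):
    """Hypothetical entry point: use a graphical backend if it is
    available, otherwise fall back to returning the plain-text tree."""
    backend = load_optional(backend_name)
    if backend is None:
        return "text:" + newick
    return "gui:" + newick


# The default backend name is deliberately one that is not installed,
# so this call takes the plain-text fallback path.
print(render_tree("(A,B);"))
```

A real binding would call into the backend module instead of returning a
tagged string; that part is left out because it depends on the backend's API.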

I will be out of the office for several weeks, but if you see any way to
collaborate I will be happy to discuss this a bit more in detail...

Cheers!
Jaime

On Fri, Sep 25, 2009 at 5:54 AM, Eric Talevich <eric.talevich at gmail.com> wrote:

> Hello, Jaime,
>
> Sorry I didn't respond directly to your earlier post -- I wrote half of an
> e-mail, then realized I had no good suggestions on what to do so I scrapped
> it.
>
> My Tree and TreeIO code is basically a complete parser for the phyloXML
> format, plus a few base classes extracted out in hopes of eventually
> creating a unified set of format-independent objects, as in SeqIO and
> AlignIO. Your code for working with trees looks much more complete than
> mine, so if some of it can be incorporated into Biopython, I think that
> would be great.
>
> I see these issues with integration:
> 1. It's GPL, while Biopython uses a more permissive custom license
> resembling the BSD and MIT licenses. Would you be willing and able to
> relicense parts of your work for Biopython?
>
> 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will
> require some compatibility fixes -- not a huge problem.
>
> 3. Scipy and numpy dependencies: Numpy is considered a semi-optional
> dependency in Biopython, so if it can be imported on the fly by just the
> functions that need it (hopefully no core ones), that would be best. If
> not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so
> it would be better to make that an optional, on-the-fly import, too.
>
> 4. PyQt4 is a big package and I'm not sure it's as common in scientists'
> Python installations as numpy and scipy, so if the underlying algorithms for
> tree layout could be ported to Reportlab, matplotlib or PIL, that would be
> ideal. I personally would like to be able to pair sequence snippets with the
> leaves of a standard phylogram, so if you need me to do some additional work
> to get this section ported to Biopython, I'd consider it time well spent.
>
> 5. Presumably, the tree object type in ETE is different from Bio.Tree or
> Bio.Nexus, so porting the core tree manipulation code to Biopython would
> require a substantial effort somewhere.
>
> 6. The PhylomeDB connector is cool, and browsing the source, looks like it
> wouldn't require much effort at all to drop into Biopython.
>
> Thanks for letting us know about this.
>
> Cheers,
> Eric
>
>
>
> On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:
>
>> Hi,
>>
>> ( I'm the developer of ETE. )
>> I agree that PyQt4 is a significant dependency. I chose it because the
>> Qt4 QGraphicsScene environment offers many possibilities, like OpenGL
>> rendering, unlimited image size, performance, and good bindings to Python.
>> However, I am working on my code to allow the rendering algorithm to use any
>> other graphical library, so you could render the same tree images using
>> different backends. If you think this is useful for you, please let me know
>> and we can think about how to integrate it with Biopython.
>> Regarding the GUI, it is not a standalone application but one more method
>> within the Tree objects. The GUI can be started at any point of the
>> execution and the main program will continue after you close it. I did it
>> like this because I think it is quite useful for working within interactive
>> Python sessions.
>>
>> I develop a lot of  code around tree handling, so if you think I can help,
>> please tell me.
>> jaime.
>>
>>
>>
>>> > *Graphics*
>>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave
>>> > unlabeled nodes inconspicuous, so the resulting graphic is much cleaner,
>>> > perhaps even usable. Plus, the nodes are now a pretty shade of blue.
>>> > Still, it would be nice to have a Reportlab-based module in Bio.Graphics
>>> > to print phylogenies in the way biologists are used to seeing them. Does
>>> > anyone know of existing code that could be borrowed for this? I looked at
>>> > ETE (announced on the main biopython list last week) and liked the
>>> > examples, but it uses PyQt4 and a standalone GUI for display, which is a
>>> > substantial departure from the Biopython way of doing things.
>>>
>>> I still haven't tracked down my old report lab code, but it wasn't object
>>> orientated and would need a lot of work to bring up to standard...
>>>
>>> Peter


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================


From eric.talevich at gmail.com  Fri Sep 25 15:51:15 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 25 Sep 2009 11:51:15 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
Message-ID: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>

Hi Jaime,

Just working on bindings would certainly be easier. The best way to transfer
tree information from Biopython to ETE would be serializing the trees in
phyloXML format (to preserve the annotations) and loading that file in ETE.
I see that ETE allows rich annotation of tree objects, but I don't see
phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information? If not, I think ETE
would benefit from a phyloXML parser. Since Biopython's license is
GPL-compatible (I believe), you could borrow Bio.TreeIO.PhyloXMLIO directly
and just port the Phylogeny and Clade classes to ETE's base classes instead
of Bio.Tree.BaseTree's Tree and Node classes.
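
As a rough sketch of what such an exchange file might contain, here is a
standard-library-only example. It is not Bio.TreeIO.PhyloXMLIO: the
nested-tuple tree and the function names clade_to_xml and tree_to_phyloxml
are made up for illustration, and the element names (phyloxml, phylogeny,
clade, name, branch_length) follow my reading of the phyloXML schema. Real
annotations would add further child elements under each clade.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by phyloXML documents.
PHYLOXML_NS = "http://www.phyloxml.org"


def clade_to_xml(node):
    """Convert a (name, branch_length, children) tuple into a <clade> element.

    The tuple layout is an assumption made for this sketch only.
    """
    name, length, children = node
    clade = ET.Element("clade")
    if name:
        ET.SubElement(clade, "name").text = name
    if length is not None:
        ET.SubElement(clade, "branch_length").text = str(length)
    for child in children:
        clade.append(clade_to_xml(child))
    return clade


def tree_to_phyloxml(root):
    """Wrap the clade hierarchy in <phyloxml><phylogeny> and return a string."""
    doc = ET.Element("phyloxml", {"xmlns": PHYLOXML_NS})
    phylogeny = ET.SubElement(doc, "phylogeny", {"rooted": "true"})
    phylogeny.append(clade_to_xml(root))
    return ET.tostring(doc, encoding="unicode")


# An unnamed root with two named leaves.
tree = ("", None, [("A", 0.1, []), ("B", 0.2, [])])
xml_text = tree_to_phyloxml(tree)
print(xml_text)
```

Any tool that reads phyloXML could then load the resulting file, which is
what makes a shared file format simpler than sharing tree classes.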

Beyond that, some support for BioSQL to store sequences etc. would also help
link ETE to any of the other Bio* projects. There's some example code in
Biopython's top-level BioSQL directory, if you're interested.

Cheers,
Eric


From jhuerta at crg.es  Fri Sep 25 16:13:44 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Fri, 25 Sep 2009 18:13:44 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
Message-ID: <c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>

Hi,


> Just working on bindings would certainly be easier. The best way to
> transfer tree information from Biopython to ETE would be serializing the
> trees in phyloXML format (to preserve the annotations) and loading that file
> in ETE. I see that ETE allows rich annotation of tree objects, but I don't
> see phyloXML or NeXML listed as supported file formats -- is there another
> standard format you're using to store this information?

Extended newick (http://www.phylosoft.org/NHX/) is the only rich format
currently supported by ETE; however, this standard allows only text-string
representations of tree node annotations. Beyond this, you have to use a
cPickle approach to save complex annotated trees. I'm certainly interested
in phyloXML and NeXML support, so this could be a nice starting
point.
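
To make the text-only limitation concrete, here is a minimal sketch of
reading the NHX block attached to one node label. The function name
parse_nhx_tags is hypothetical and this is not ETE's parser; it assumes the
"[&&NHX:key=value:key=value]" layout described on the NHX page above.

```python
import re

# Matches the NHX annotation block that follows a node label,
# e.g. "[&&NHX:S=human:E=1.1.1.1]".
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def parse_nhx_tags(label):
    """Return the NHX key/value pairs of one node label as a dict of strings."""
    match = NHX_BLOCK.search(label)
    if match is None:
        return {}
    tags = {}
    for field in match.group(1).split(":"):
        if "=" in field:
            key, value = field.split("=", 1)
            tags[key] = value  # every value stays a plain string
    return tags


tags = parse_nhx_tags("ADH2:0.1[&&NHX:S=human:E=1.1.1.1]")
print(tags)
```

Every value comes back as a string, so anything richer (numbers, nested
objects) has to be encoded and decoded by the caller.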

> If not, I think ETE would benefit from a phyloXML parser. Since Biopython
> license is GPL-compatible (I believe), you could borrow
> Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes
> to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes.
>
I think there is no problem using BSD-licensed code from GPL sources; the
problem would be the other way around. So I will take a look at your
phyloXML code to find the best way to bind both packages through phyloXML
serialization.


> Beyond that, some support for BioSQL to store sequences etc. would also
> help link ETE to any of the other Bio* projects. There's some example code
> in Biopython's top-level BioSQL directory, if you're interested.
>
Ok. I'll take a look also. Thanks.

cheers,
Jaime.


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================


From biopython at maubp.freeserve.co.uk  Fri Sep 25 16:22:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 17:22:40 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
	<c5882df30909250913w3190aadesac8aaee1c613d761@mail.gmail.com>
Message-ID: <320fb6e00909250922y858c172xf1ee51f7673a4fe2@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:13 PM, Jaime Huerta Cepas <jhuerta at crg.es> wrote:
>
> I think there is no problem in using BSD license from GPL sources, the
> problem would be in the other way around.
>

Yes, that way round is fine from a license point of view (taking Biopython's
BSD/MIT style licensed code and using it in a GPL project). But we can't
take your GPL code into Biopython unless you re-license it more liberally.

I can see the appeal of the (L)GPL for forcing the code to stay open, but
Biopython (like Python) went for the other option of basically letting anyone
use the code in any way they like.

Peter


From hlapp at gmx.net  Fri Sep 25 20:58:36 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 25 Sep 2009 16:58:36 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
Message-ID: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>


On Sep 25, 2009, at 11:28 AM, Jaime Huerta Cepas wrote:

> As far as I know, the main difference between GPL and BSD-like  
> licenses is that, with the second, you could relicense the code at  
> any moment under any other policy, including private and close  
> licenses.


This is not true. None of the open-source licenses that I'm aware of  
allows anyone to relicense code under a license that is less liberal,  
or to relicense code at all. It is the copyright owner who can  
relicense code, not the distributor.

One of the differences between GPL and BSD is that GPL is viral.  
Specifically, code that links to GPL-licensed code must also be GPL- 
licensed *when it is distributed.*

(It is a common misconception that GPL is unconditionally viral. I can  
take GPL code and link to it and keep my code closed source for as  
long as I please if I never redistribute it. GPL was written with  
software vendors in mind, whose business consists of distributing  
software for commercial gain. GPL has therefore sometimes been called  
anti-commercial. This is wrong, too, but I won't go into the details  
here.)

Biopython can freely utilize GPL-licensed (or closed source, for that  
matter) software if it doesn't link to it. IANAL but I think it can  
also redistribute GPL-licensed code along with Biopython so long as  
Biopython doesn't link to it, and it is made clear that some of the  
distribution falls under a different license than BSD. (Linux  
distributions mix BSD and GPL software, too.)

As for ETE itself, a BSD/MIT style license seems to be by far the most
widely used license for Python modules. If you want to facilitate
adoption of the software as a library by other programmers, GPL is  
going to stand in the way of that. Also, really all that you are  
accomplishing with GPL is that a software company can't take advantage  
of ETE. Is that your chief concern? GPL won't prevent any scientific  
lab from writing closed source code that builds on ETE and publishing  
the results, so long as they don't distribute their closed source code.


	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From chapmanb at 50mail.com  Fri Sep 25 21:48:00 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 25 Sep 2009 17:48:00 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
Message-ID: <20090925214800.GE29829@sobchak.mgh.harvard.edu>

Hi all;
Hilmar -- thanks for writing up a nice summary of the license
details. Jaime, I think it's a shame we would let these issues
prevent working together. It sounds like you and Eric have some
shared goals and it would be great to see that evolve into some
useful functionality in Biopython.

Generally, the BSD-like license which Biopython uses encourages
cooperation and keeps people in both academia and industry happy. As
scientists, our goal should be to avoid letting these types of issues
prevent collaboration. Truthfully, there is very little opportunity
for exploitation of bioinformatics software; the economics are just not
there for companies to sell code.

> (It is a common misconception that GPL is unconditionally viral. I can  
> take GPL code and link to it and keep my code closed source for as  
> long as I please if I never redistribute it. GPL was written with  
> software vendors in mind, whose business consists of distributing  
> software for commercial gain. GPL has therefore sometimes been called  
> anti-commercial. This is wrong, too, but I won't go into the details  
> here.)

I agree 100%, but in practical terms it is very difficult to have this
argument at a company. Speaking from experience, GPL creates all kinds
of nasty thoughts in people's heads which prevent adoption of code in
corporate environments. For Biopython and other bioinformatics projects,
we should be actively encouraging contributions from companies as
well as academia.

> Biopython can freely utilize GPL-licensed (or closed source, for that  
> matter) software if it doesn't link to it. IANAL but I think it can  
> also redistribute GPL-licensed code along with Biopython so long as  
> Biopython doesn't link to it, and it is made clear that some of the  
> distribution falls under a different license than BSD. (Linux  
> distributions mix BSD and GPL software, too.)

Yes, but this complication is bad. Let's keep it simple,
Brad


From bugzilla-daemon at portal.open-bio.org  Fri Sep 25 22:48:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Sep 2009 18:48:13 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909252248.n8PMmDa9028782@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1214 is|0                           |1
           obsolete|                            |


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-25 18:48 EST -------
(From update of attachment 1214)
Checked into git, leaving this bug open until we've run some more tests on
this.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Sat Sep 26 11:36:45 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 26 Sep 2009 07:36:45 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909261136.n8QBajsI014127@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-26 07:36 EST -------
We'll also need to update the SeqIO GenBank output to record the CONTIG string
if present.




From hlapp at gmx.net  Sat Sep 26 15:25:41 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 26 Sep 2009 11:25:41 -0400
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com>
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com>
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
	<20090925214800.GE29829@sobchak.mgh.harvard.edu>
Message-ID: <C0D714C1-2EAA-46C5-95F6-CC814CB518AD@gmx.net>


On Sep 25, 2009, at 5:48 PM, Brad Chapman wrote:

> I agree 100%, but in practical terms it is very difficult to have this
> argument at a company.

Yes, I know.

> For Biopython and other bioinformatics projects, we should be  
> actively encouraging contributions from companies as well as academia.


Having worked in the commercial and private sector for almost a decade, I  
couldn't agree more. There is a huge amount of open-source code  
development contributed by people working in the private sector, and  
which is hence sponsored by companies.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From jhuerta at crg.es  Sat Sep 26 17:12:59 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Sat, 26 Sep 2009 19:12:59 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
Message-ID: <c5882df30909261012r44773ef6s92c4f83efadf8b16@mail.gmail.com>

Hey! Sorry, it was not my intention to start a flame war about licences, nor to
sound rude. I apologize if I did.


>> As far as I know, the main difference between GPL and BSD-like licenses is
>> that, with the second, you could relicense the code at any moment under any
>> other policy, including private and close licenses.
>>
>
>
> This is not true. None of the open-source licenses that I'm aware of allows
> anyone to relicense code under a license that is less liberal, or to
> relicense code at all. It is the copyright owner who can relicense code, not
> the distributor.
>

I'm not an expert on software licences, so I cannot go into this issue
very deeply. What I said in my previous email is what I understood
from this info: http://www.gnu.org/philosophy/license-list.html,
http://www.gnu.org/philosophy/categories.html#Non-CopyleftedFreeSoftware
If I was wrong and modified BSD-like sources cannot be relicensed under
other less liberal licenses, then we will gladly consider a change of the
ETE license in the future.


> One of the differences between GPL and BSD is that GPL is viral.
> Specifically, code that links to GPL-licensed code must also be GPL-licensed
> *when it is distributed.*
>
> (It is a common misconception that GPL is unconditionally viral. I can take
> GPL code and link to it and keep my code closed source for as long as I
> please if I never redistribute it. GPL was written with software vendors in
> mind, whose business consists of distributing software for commercial gain.
> GPL has therefore sometimes been called anti-commercial. This is wrong, too,
> but I won't go into the details here.)
>
I see, so the only problem is about distribution...


> Biopython can freely utilize GPL-licensed (or closed source, for that
> matter) software if it doesn't link to it. IANAL but I think it can also
> redistribute GPL-licensed code along with Biopython so long as Biopython
> doesn't link to it, and it is made clear that some of the distribution falls
> under a different license than BSD. (Linux distributions mix BSD and GPL
> software, too.)
>
Yes, I agree. This is what I meant by Biopython add-ons. With this in mind,
Biopython could be aware of much other software out there and benefit from
it. Is there any work along these lines in Biopython?


> As for ETE itself, a BSD/MIT style license seems to be the by far most
> widely used license for Python modules. If you want to facilitate adoption
> of the software as a library by other programmers, GPL is going to stand in
> the way of that. Also, really all that you are accomplishing with GPL is
> that a software company can't take advantage of ETE. Is that your chief
> concern?

Well, our intention was that code based on ETE sources (other tools or
improvements) would also be distributed/published as free software. We also
wanted to leave the door open to using other GPL software from ETE.


> GPL won't prevent any scientific lab from writing closed source code that
> builds on ETE and publishing the results, so long as they don't distribute
> their closed source code.

Yes. You are right. We don't want to prevent this.

In any case, thanks for your comments. I will try to get more info about
what you say and, if we have to modify something, we will do it. :)

cheers,
Jaime


>
>
>        -hilmar
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================


From jhuerta at crg.es  Sat Sep 26 17:28:02 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Sat, 26 Sep 2009 19:28:02 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> 
	<320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> 
	<c5882df30909240345x7139f50bt692bc92dea139f52@mail.gmail.com> 
	<3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> 
	<c5882df30909250828s282fa886tea3d54d9267d63d1@mail.gmail.com> 
	<E7ECBBE5-D20E-492A-A88E-CB517AB4A0BC@gmx.net>
	<20090925214800.GE29829@sobchak.mgh.harvard.edu>
Message-ID: <c5882df30909261028y4c6026b3l4b5b5842bc99d512@mail.gmail.com>

Hi Brad,

> Jaime, I think it's a shame we would let these issues
> prevent working together. It sounds like you and Eric have some
> shared goals and it would be great to see that evolve into some
> useful functionality in Biopython.
>

Sure!! My only intention was to find the best way to contribute!
However, the "viral" GPL license was chosen specifically for
exactly this reason: encouraging free software and academic scientific
resources.
We have a lot of shared goals, so I trust we will find a happy way to
collaborate.

Jaime.


-- 
=========================
Jaime Huerta-Cepas, Ph.D.
CRG-Centre for Genomic Regulation
Doctor Aiguader, 88
PRBB Building
08003 Barcelona, Spain
http://www.crg.es/comparative_genomics
=========================


From jblanca at btc.upv.es  Mon Sep 28 11:36:14 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 28 Sep 2009 13:36:14 +0200
Subject: [Biopython-dev] fpc and gff
Message-ID: <200909281336.14794.jblanca@btc.upv.es>

Sorry for the previous incomplete mail. :(

Hi:
I'm interested in parsing an fpc physical map and writing a gff3 file from it. 
That's done by the fpc people in bioperl and they go from fpc to gff2. I 
would like to do it in python.
I've written the fpc parser looking at the bioperl one. You can take a look 
at:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py

Now I have to create the gff structure and writer. I've been reading Brad's 
code regarding the GFF parser and writer. I would like to integrate my fpc 
work as much as possible with Biopython and, if you like it, we could add the 
fpc parser to Biopython in the future.
But I do not have a clear idea of the relation between GFF and SeqFeature. The 
main problem is the subfeature and the gff feature hierarchy. My take on that 
at the moment is to write a GFFFeature class similar to the gff feature with 
seqid, source, type, start, end, score, etc. and go from the fpc to 
GFFFeature objects. I know that this would not integrate nicely with 
Biopython. Could you give me some hints on how to do it in a proper way?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From jblanca at btc.upv.es  Mon Sep 28 11:28:06 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 28 Sep 2009 13:28:06 +0200
Subject: [Biopython-dev] fpc and gff
Message-ID: <200909281328.06817.jblanca@btc.upv.es>

Hi:
I'm interested in parsing an fpc physical map and writing a gff3 file from it. 
That's done by the fpc people in bioperl and they go from fpc to gff2. I 
would like to do it in python.
I've written the fpc parser looking at the bioperl one. You can take a look 
at:
-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Mon Sep 28 11:52:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 12:52:56 +0100
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <200909281336.14794.jblanca@btc.upv.es>
References: <200909281336.14794.jblanca@btc.upv.es>
Message-ID: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>

On Mon, Sep 28, 2009 at 12:36 PM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Sorry for the previous incomplete mail. :(
>
> Hi:
> I'm interested in parsing an fpc physical map and writing a gff3 file from it.
> That's done by the fpc people in bioperl and they go from fpc to gff2. I
> would like to do it in python.
> I've written the fpc parser looking at the bioperl one. You can take a look
> at:
> http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py
>
> Now I have to create the gff structure and writer. I've been reading Brad's
> code regarding the GFF parser and writer. I would like to integrate my fpc
> work as much as posible with biopython and if you like it we could add the
> fpc to Biopython in the future.
> But I have not a clear idea on the relation between GFF and SeqFeature. The
> main problem is the subfeature and the gff feature hierarchy. My take on that
> at the moment is to write a GFFfeature class similar to the gff feature with
> seqid, source, type, start, end, score, etc. and go from the fpc to
> GFFFeature objects. I know that this would not integrate nicely with
> BioPython. Could you give some hint on how to do it in a proper way?
> Best regards,

Right now there isn't a "proper way" as Brad's GFF code hasn't
been integrated into Biopython yet.

I think Brad was thinking of using the SeqFeature object "as is" to hold
GFF features, with the sub-features list used for the hierarchy.

Michiel and I had suggested a simpler structure more faithful to the
GFF model might be useful (even if it was just a standardised tuple
of the start, end, strand, id, etc, and an annotation dictionary). For
the SeqIO interface, these GFF features would have to be turned
into normal SeqFeature objects of course.

Peter


From chapmanb at 50mail.com  Mon Sep 28 12:52:38 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 28 Sep 2009 08:52:38 -0400
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
References: <200909281336.14794.jblanca@btc.upv.es>
	<320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
Message-ID: <20090928125238.GG29829@sobchak.mgh.harvard.edu>

Jose;
Glad you're interested in working on this. I'm happy to get the GFF3
writing up to speed for this task.

> > I'm interested in parsing an fpc physical map and writing a gff3 file from it.
[...]
> > But I have not a clear idea on the relation between GFF and SeqFeature. The
> > main problem is the subfeature and the gff feature hierarchy. My take on that
> > at the moment is to write a GFFfeature class similar to the gff feature with
> > seqid, source, type, start, end, score, etc. and go from the fpc to
> > GFFFeature objects. 

> Right now there isn't a "proper way" as Brad's GFF code hasn't
> been integrated into Biopython yet.

Yes, we still have some flexibility here since it hasn't been merged
into Biopython yet, so let's talk about what works best.

> I think Brad was thinking of using the SeqFeature object "as is" to hold
> GFF features, with the sub-features list used for the hierarchy.

What exists now takes an iterator of SeqRecord objects, and writes
each SeqFeature as a GFF3 line:

seqid -- SeqRecord ID
source -- Feature qualifier with key "source"
type -- Feature type attribute
start, end -- The Feature Location
score -- Feature qualifier with key "score"
strand -- Feature strand attribute
phase -- Feature qualifier with key "phase"

The remaining qualifiers become the key/value pairs of the final
attributes column.

The hierarchy is represented as sub_features of the parent feature.
This handles any arbitrarily deep nesting of parent and child 
features.

There is some really basic code on the documentation page:

http://biopython.org/wiki/GFF_Parsing#Writing_GFF3
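
As a rough illustration of the column mapping listed above, here is a
minimal sketch in plain Python. It is not the code from the GFF branch:
the function name feature_to_gff3_line and its arguments are hypothetical,
it assumes 0-based half-open start/end (as in a FeatureLocation) and
qualifier values stored as lists of strings, and it skips the escaping of
reserved characters that real GFF3 output needs.

```python
def feature_to_gff3_line(seqid, feature_type, start, end, strand, qualifiers):
    """Build one tab-separated GFF3 line from basic feature data.

    Hypothetical helper: start/end are assumed 0-based, half-open;
    qualifiers is assumed to map keys to lists of strings.
    """
    remaining = dict(qualifiers)  # copy, so the caller's dict is untouched

    def pop_first(key):
        # "source", "score" and "phase" get their own columns;
        # a missing value is written as "." in GFF3.
        values = remaining.pop(key, None)
        return str(values[0]) if values else "."

    source = pop_first("source")
    score = pop_first("score")
    phase = pop_first("phase")
    strand_symbol = {1: "+", -1: "-"}.get(strand, ".")

    # Everything left over goes into the final attributes column.
    # NOTE: no escaping of ";", "=", "," etc. is done in this sketch.
    attributes = ";".join(
        "%s=%s" % (key, ",".join(str(v) for v in remaining[key]))
        for key in sorted(remaining)
    )

    columns = [
        seqid, source, feature_type,
        str(start + 1),  # GFF3 coordinates are 1-based and inclusive
        str(end),
        score, strand_symbol, phase,
        attributes or ".",
    ]
    return "\t".join(columns)


line = feature_to_gff3_line(
    "chr1", "gene", 0, 6, 1,
    {"source": ["fpc"], "ID": ["gene1"], "Name": ["geneA"]},
)
print(line)
```

Child features would be written the same way, one line each, with a
Parent attribute pointing at the ID of the enclosing feature.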

> Michiel and I had suggested a simpler structure more faithful to the
> GFF model might be useful - even if it was just a standardised tuple
> of the start, end, strand, id, etc, and an annotation dictionary). For
> the SeqIO interface, these GFF features would have to be turned
> into normal SeqFeature objects of course.

This could also be useful for a more lightweight representation. I
would rather see this type of representation with primary Python
types, as opposed to a GFFFeature specific class. The current
SeqRecord/SeqFeature implementation is relatively close to what 
a GFF specific class would be so there would be a lot of duplication
without saving much in terms of speed or memory.

Jose, let me know if you'd rather go with a SeqRecord approach or a
lightweight approach. If you provide a couple of examples of the
features you want to store, we can work through how to best
represent those in the GFF hierarchy and then the details of
prepping them for writing.

Brad


From biopython at maubp.freeserve.co.uk  Mon Sep 28 13:10:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 14:10:22 +0100
Subject: [Biopython-dev] fpc and gff
In-Reply-To: <20090928125238.GG29829@sobchak.mgh.harvard.edu>
References: <200909281336.14794.jblanca@btc.upv.es>
	<320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com>
	<20090928125238.GG29829@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909280610q75f7bf4eqae49a1fb6d7eae38@mail.gmail.com>

On Mon, Sep 28, 2009 at 1:52 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Michiel and I had suggested a simpler structure more faithful to the
>> GFF model might be useful - even if it was just a standardised tuple
>> of the start, end, strand, id, etc, and an annotation dictionary). For
>> the SeqIO interface, these GFF features would have to be turned
>> into normal SeqFeature objects of course.
>
> This could also be useful for a more lightweight representation. I
> would rather see this type of representation with primary Python
> types, as opposed to a GFFFeature specific class. The current
> SeqRecord/SeqFeature implementations is relatively close to what
> a GFF specific class would be so there would be a lot of duplication
> without saving much in terms of speed or memory.

Indeed. Which is why I quite like the idea of a simple tuple of ints,
strings and a dict for the annotation (the final column of a GFF file).
This should also be fast for people dealing with big GFF files.
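
A minimal sketch of what such a lightweight representation might look
like, assuming a 9-column GFF3 line. The function name parse_gff3_line and
the exact tuple layout are hypothetical, and no unescaping of attribute
values is attempted.

```python
def parse_gff3_line(line):
    """Turn one GFF3 feature line into a tuple of basic Python types.

    Hypothetical layout: (seqid, source, type, start, end, score,
    strand, phase, attributes), with start/end as ints and
    attributes as a dict mapping each key to a list of strings.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("Expected 9 tab-separated columns, got %i" % len(parts))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = parts

    attributes = {}
    if attr_text != ".":
        for item in attr_text.split(";"):
            if not item:
                continue  # tolerate a trailing semicolon
            key, _, value = item.partition("=")
            attributes[key] = value.split(",")

    # Coordinates are kept exactly as in the file (1-based, inclusive).
    return (seqid, source, ftype, int(start), int(end),
            score, strand, phase, attributes)


record = parse_gff3_line("chr1\tfpc\tgene\t1\t6\t.\t+\t.\tID=gene1;Name=geneA\n")
print(record)
```

Converting such a tuple into a SeqFeature would then be a separate step,
which matches the split between parsing and object mapping discussed here.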

The other plus point here is we can get this (GFF parsing/writing
using basic Python objects) into Biopython first, and then look at
the SeqIO side of things more carefully as a second merge. I may
be overly cautious but I want the resulting GFF <-> SeqRecord <->
GenBank/EMBL/etc mapping to try and follow established practice
as closely as possible, which will need lots of testing and probably
some tweaking of this mapping.

i.e. To me there is a natural break between the basics of GFF
parsing/writing, and the transformation into our existing object
models.

[This applies to all file formats in principle, but most are so simple
that it isn't really an issue worth worrying about.]

Peter


From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 19:37:21 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 15:37:21 -0400
Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a
	GenBank CON file
In-Reply-To: <bug-2745-42@http.bugzilla.open-bio.org/>
Message-ID: <200909281937.n8SJbLYq012300@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2745


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-28 15:37 EST -------
(In reply to comment #6)
> We'll also need to update the SeqIO GenBank output to record the CONTIG
> string if present.

Done, marking as fixed. Assuming there are no objections to the whole
approach (treating the CONTIG data as a string) that is...

Peter




From biopython at maubp.freeserve.co.uk  Mon Sep 28 20:09:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Sep 2009 21:09:12 +0100
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
Message-ID: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>

On Thu, Sep 24, 2009 at 12:39 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> My last couple of commits to github have been from a local clone
> of the *official* repository: http://github.com/biopython/biopython/
>
> This is a nice and simple work flow for small changes, and the
> history and github network graph are easy to understand:
> http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
> This seems like the easiest way to work for people used to CVS,
> and you don't need to bother with your own Biopython cloned
> repository on github (you just need a github account and
> collaborator status). I'll probably continue to do this in the short
> term.

This way of working (described above) is what I have been using
for the last week. If there are multiple developers working (or in
this case, developers using multiple machines), you can still get
interesting mini-branches and merges even like this. Have a look
at the Biopython github network diagram for today for a nice
simple example (which was accidental - but serves as a nice
illustration).

[I know for some of you the following discussion isn't needed,
but I think it is worth trying to explain -  even if just for me, to
make sure it is clear in my head what git is doing.]

In words, the main trunk was split, with a (trivial) change to the
tutorial done on one branch (me at work) and then two separate
commits on a separate branch (unit tests tweak, and GenBank
bug fix), again by me, but on my home computer. The two
branches were then merged into one.

Why did this happen? I was working on a local and very slightly
out of date copy of the repository at home, and made these
local commits. I then tried to push them to github. At that
point git gave me an error saying something else had been
committed in the meantime (in fact by me but on a different
computer) so my local repository was out of date. So I pulled
and merged the latest code from github (the tutorial change),
and then pushed this to github. Done. The merge was 100%
automatic because the files changed were independent.

Back on CVS, as these changes were on separate files, there
wouldn't have been any issue about merging.

Does it matter? No. But we can reduce the likelihood of these
baby branches and merges by getting into the habit of pulling
the latest code from github *before* making any local commits
(a sensible thing to do anyway).

[Did that make sense? On the one hand this is very simple,
but on the other hand, it is rather different to how I used to
think about the code history under CVS.]

Peter


From eric.talevich at gmail.com  Mon Sep 28 20:47:38 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 28 Sep 2009 16:47:38 -0400
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
	<320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
Message-ID: <3f6baf360909281347r32c39918s4a2c8a64cff44622@mail.gmail.com>

On Mon, Sep 28, 2009 at 4:09 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

>
> Does it matter? No. But we can reduce the likelihood of these
> baby branches and merges by getting into the habit of pulling
> the latest code from github *before* making any local commits
> (a sensible thing to do anyway).
>
>
If you've committed local changes while your repository is out of date and
want to avoid a baby branch, you can also use "git rebase origin/master" to
fix the history. (But probably, most developers will find it easier and
safer to leave the baby branches there.)

Extended example:

git checkout dev          # a development branch
# hack hack
git commit -a             # oops, we're out of sync
git checkout master       # a clean copy of upstream
git pull origin master    # updating like we should have earlier
git rebase master dev     # replay dev on top of master (leaves dev checked out)
git checkout master       # switch back before merging
git merge dev
# Should be fast-forward
git push


Cheers,
Eric


From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 21:01:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:01:08 -0400
Subject: [Biopython-dev] [Bug 2919] New: Writing SeqFeature qualifiers
Message-ID: <bug-2919-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

           Summary: Writing SeqFeature qualifiers
           Product: Biopython
           Version: 1.51
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: estrain at gmail.com


When writing SeqFeature qualifier key-value pairs, the output contains one
line for each character in the value, rather than simply printing the string.
The sample code at the bottom produces a genbank sequence file that illustrates
the problem.

If I create a qualifiers dictionary using "qualDict = dict(gene="geneA")",
the genbank output contains
     gene            1..6
                     /gene="g"
                     /gene="e"
                     /gene="n"
                     /gene="e"
                     /gene="A"


The offending code appears to be in the InsdcIO.py file, lines 482-483.
If I change

482: for value in values :
483:   self.write_feature_qualifier(key,value)

to

self.write_feature_qualifier(key,values)

then the function appears to work correctly.  

     gene            1..6
                     /gene="geneA"


###########################################################
## Sample code
###########################################################
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

qualDict = dict(gene="geneA")

my_seq = SeqRecord(Seq("ATGATC",IUPAC.ambiguous_dna),id="seq1")
my_seq.features.append((SeqFeature(FeatureLocation(0,6),type="gene",qualifiers=qualDict)))

out_handle = open("test.gbk","w")

SeqIO.write([my_seq],out_handle,"genbank")




From bugzilla-daemon at portal.open-bio.org  Mon Sep 28 21:22:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:22:32 -0400
Subject: [Biopython-dev] [Bug 2919] Writing SeqFeature qualifiers
In-Reply-To: <bug-2919-42@http.bugzilla.open-bio.org/>
Message-ID: <200909282122.n8SLMW8w014482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-28 17:22 EST -------
It was working as intended - for consistency with the GenBank (and other)
parsers, you were expected to use lists of strings as the feature qualifier
dictionary values (not just strings).

However, a similar request was made on the mailing list recently, and a fix
checked in (after Biopython 1.52 was released):

http://lists.open-bio.org/pipermail/biopython/2009-September/005585.html

Marking as fixed.
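
To illustrate why a plain string produced one line per character, here is
a small stand-alone sketch. It does not use the Biopython writer: the
function names are hypothetical, and the string check in the second
function is only one plausible way to accept both forms, not necessarily
the fix that was checked in.

```python
def naive_qualifier_lines(key, values):
    # Mimics a loop that assumes "values" is a list of strings.
    # Iterating over a plain string yields its characters one by one.
    return ['/%s="%s"' % (key, value) for value in values]


def tolerant_qualifier_lines(key, values):
    # Assumed workaround: wrap a bare string in a list first.
    if isinstance(values, str):
        values = [values]
    return ['/%s="%s"' % (key, value) for value in values]


print(naive_qualifier_lines("gene", "geneA"))     # one entry per character
print(naive_qualifier_lines("gene", ["geneA"]))   # list of strings, as expected
print(tolerant_qualifier_lines("gene", "geneA"))  # bare string handled as one value
```

With the released code, building the qualifiers as
dict(gene=["geneA"]) avoids the problem.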




From bugzilla-daemon at portal.open-bio.org  Tue Sep 29 16:41:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 29 Sep 2009 12:41:08 -0400
Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch
In-Reply-To: <bug-2891-42@http.bugzilla.open-bio.org/>
Message-ID: <200909291641.n8TGf8HE011375@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2891


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-09-29 12:41 EST -------
Really fixed this time, tested on Jython 2.5.0 and 2.5.1rc3




From biopython at maubp.freeserve.co.uk  Wed Sep 30 15:27:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Sep 2009 16:27:03 +0100
Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method?
Message-ID: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com>

Hi all,

A few months back on the main mailing list, Cedar and I were
talking about taking a SeqRecord, and how to write out its reverse
complement to a file. The thread is archived here:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Cedar - I cc'd you, as I am not sure if you are on the dev list.
I expect this could get technical pretty quickly, so I wanted
to float this idea on the dev list first...

-----------------------------------------------------------------

So, the background to this discussion:

Unless there is some complicated annotation to transfer,
using Biopython as is, making a new SeqRecord using
the reverse complement sequence of the old SeqRecord
isn't very hard, see:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement

This has meant that generally the current status quo isn't
a problem (at least for me). However, what prompted me
to work on this issue was a real world example.

We have a draft genome where after doing a basic
annotation, it would make sense to flip the strands. I
want to be able to load our current GenBank file, apply
the reverse complement, and have all the annotated
features recalculated to match. With more and more
sequencing projects, this isn't such an odd thing to
want to do.

Dealing with the details of potentially complex locations
in SeqFeature objects isn't very nice, so I think it would
be useful to have this particular functionality built into
Biopython. It is also a small step towards making the
SeqRecord more Seq like (which in general seems a
good idea).

On Thu, Jun 25, 2009 at 12:20 AM, Peter wrote:
>
> What you are doing is fine - although personally I might wrap up the
> first line as a function, as done in the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement
>
> While we could add a reverse_complement() method to the SeqRecord
> (and other Seq methods, like translate etc), there is one big problem:
> What to do with the annotation. If your record used to have a name
> based on an accession or a GI number, then this really does not apply
> to the reverse complement (or a translation etc). We could do something
> arbitrary like adding an "rc_" prefix (or variants) but I think the only safe
> answer is to make the user think about this and do what is appropriate
> in their context. And as you have demonstrated, this can still be done
> in one line :)
>
> I make a habit of using this as a justification, but I feel the zen of
> Python "Explicit is better than implicit" applies quite well here.

I've been thinking about this on and off since then, and I still
maintain that for much of the annotation there is no easy answer.
For the sequence itself, the behaviour is well defined. For all
the annotation, there are three possible actions:
(a) User supplies a new value
(b) Reuse the old value
(c) No annotation (the default for a new SeqRecord)

We can do something sensible with the features (if present) and
it will probably make sense to copy but reverse any per-letter
annotation (if present).

On a github branch I have posted some experimental code
which adds a reverse_complement() method to the SeqRecord.
I propose to give the new reverse_complement() a set of
optional arguments (id, name, etc) following the same names
as the existing attributes (and __init__ arguments), allowing
the user to choose between these three actions.

Assuming the general scheme is popular, I'm quite open
to discussing changing these defaults. But for the first
implementation this is what I picked: For the id, name and
description I still lean towards making the user decide this,
and therefore the default is (c). Likewise for the annotations
dictionary and the database cross refs.

For the features and per-letter-annotation, I would opt to
make the default behaviour be to reuse the old data, option
(b) above. For the per-letter-annotation (the restricted
dictionary, letter_annotations) this just means reversing
each entry. For the features, this means reversing the
order of the features, switching their strands (if set), and
calculating the new coordinates (taking care of all the
possible fuzzy locations and sub-features).
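
For the simple case of exact locations, the coordinate arithmetic can be
sketched as below. This is only an illustration, not the code on the
branch: the function names are hypothetical, features are reduced to
(start, end, strand) tuples with 0-based half-open coordinates, and fuzzy
locations and sub-features are ignored.

```python
def reverse_complement_location(start, end, strand, seq_length):
    """Map a 0-based, half-open (start, end) onto the reverse strand."""
    new_start = seq_length - end
    new_end = seq_length - start
    # Flip +1/-1; leave None (or 0) untouched.
    new_strand = -strand if strand in (1, -1) else strand
    return new_start, new_end, new_strand


def reverse_complement_features(features, seq_length):
    """Flip every (start, end, strand) tuple and reverse the feature order."""
    flipped = [
        reverse_complement_location(start, end, strand, seq_length)
        for (start, end, strand) in features
    ]
    flipped.reverse()
    return flipped


def reverse_letter_annotations(letter_annotations):
    """Reverse each per-letter annotation so it still lines up with the sequence."""
    return dict((key, values[::-1]) for key, values in letter_annotations.items())


features = [(0, 3, 1), (4, 10, -1)]
print(reverse_complement_features(features, 10))
print(reverse_letter_annotations({"phred_quality": [10, 20, 30]}))
```

Applying the location mapping twice gives back the original coordinates
and strand, which makes a convenient sanity check.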

The code is here if anyone wants to look at the
technical details:
http://github.com/peterjc/biopython/commits/seqrecords

Peter