From chapmanb at 50mail.com Tue Sep 1 09:06:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 1 Sep 2009 09:06:39 -0400 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> Message-ID: <20090901130639.GI75451@sobchak.mgh.harvard.edu> Hi Peter; [indexed dict usage] > What file formats where you working on, and how many records? It was a 100Mb fasta file with about 41,000 records. Nothing too heavy but it worked great. The only change I made was to generalize the record building line: self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) to allow an arbitrary function to be passed to define the identifier, instead of defaulting to the first part of the line. This is helpful for those fun NCBI ids (gi|83029091|ref|XM_357633.3|) where other parts of the program only have the accession number. > True. Have got any bright ideas for a better name? While the > index is in memory, the SeqRecord objects are not (unlike the > original Bio.SeqIO.to_dict() function). > > Or we have one function Bio.SeqIO.indexed_dict() which can > either use an in memory index, OR an on disk index, offering > the same functionality. That's a nice idea -- provide some reasonable defaults based on file size and type, and allow them to be over-ridden with function params. > >> Another option (like the shelve idea we talked about last month) > >> is to parse the sequence file with SeqIO, and serialise all the > >> SeqRecord objects to disk, e.g. with pickle or some key/value > >> database. This is potentially very complex (e.g. arbitrary Python > >> objects in the annotation), and could lead to a very large "index" > >> file on disk. 
On the other hand, some possible back ends would > >> allow editing the database... which could be very useful. > > > > My thought here was to use BioSQL and the SQLite mappings for > > serializing. We build off a tested and existing serialization, and > > also guide people into using BioSQL for larger projects. > > Essentially, we would build an API on top of existing BioSQL > > functionality that creates the index by loading the SQL and then > > pushes the parsed records into it. > > Using BioSQL in this way is a much more general tool than > simply "indexing a sequence file". It feels like a sledgehammer > to crack a nut. Also, do you expect it to scale well for 10 million > plus short reads? It may do, but on the other hand it may not. Agreed that it would introduce extra overhead for something like short reads. If you are talking about serializing SeqRecords, it would make sense to re-use what we have in BioSQL. If you are talking about storing just file offsets, then a lightweight solution makes more sense. For me, the initial parse time to prepare an index is not as much of an issue since it happens once while queries on it will happen multiple times. > Also while the current BioSQL mappings are "tried and tested", > they don't cover everything, in particular per-letter-annotation > such as a set of quality scores (something that needs addressing > anyway, probably with JSON or XML serialisation). Agreed, but the advantage is that improvements can feed back into BioSQL, instead of work in parallel. > All the above make me lean towards a less ambitious target > (read only dictionary access to a sequence file), which just > requires having an (on disk) index of file offsets (which could > be done with SQLite or anything else suitable). This choice > could even be done on the fly at run time (e.g. 
we look at the > size of the file to decide if we should use an in memory index > or on disk - or start out in memory and if the number of records > gets too big, switch to on disk). That makes sense. SQLite has in-memory caching which could help with some of the decision making as it would handle writing and holding in memory without having to reimplement that bit. Another file based indexing scheme is the one in bx-python: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py This is a bit more specific as it also handles queries based on genomic intervals in addition to retrieving by file position. It may be useful for looking at the underlying storage details. Brad From biopython at maubp.freeserve.co.uk Tue Sep 1 09:25:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 14:25:22 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: > Hi Peter; > > [indexed dict usage] >> What file formats where you working on, and how many records? > > It was a 100Mb fasta file with about 41,000 records. Nothing too > heavy but it worked great. Yeah, with just 41,000 keys and offsets the in memory dict would be pretty small too. This is within the range of file sizes I expect the Bio.SeqIO.indexed_dict() functionality to be used on. Cool. 
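As a rough illustration of the approach under discussion (an in-memory dictionary of file offsets rather than of SeqRecord objects), here is a minimal sketch for FASTA files only. It is not the Bio.SeqIO.indexed_dict() code: the names index_fasta_offsets and accession_only are made up for this example, and the optional callback simply post-processes the default key (the first word of the title line).

```python
import io


def index_fasta_offsets(handle, key_function=None):
    """Map each record key to the byte offset of its '>' line.

    The handle should be opened in binary mode so tell() gives byte offsets.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            # Default key: the first word of the title line.
            key = line[1:].strip().split(None, 1)[0].decode()
            if key_function is not None:
                key = key_function(key)
            if key in index:
                raise ValueError("Duplicate key %r" % key)
            index[key] = offset
    return index


def accession_only(identifier):
    """Hypothetical callback: 'gi|83029091|ref|XM_357633.3|' -> 'XM_357633.3'."""
    return identifier.split("|")[3]


data = (
    b">gi|83029091|ref|XM_357633.3| example\nACGT\nACGT\n"
    b">gi|1|ref|XM_000001.1| other\nGGCC\n"
)
handle = io.BytesIO(data)
index = index_fasta_offsets(handle, accession_only)

# Random access: seek to the stored offset and read the title line again.
handle.seek(index["XM_000001.1"])
print(handle.readline())
```

Only the keys and integer offsets are held in memory; the sequence data stays in the file until a record is requested.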
> The only change I made was to generalize the record building line: > > self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) > > to allow an arbitrary function to be passed to define the > identifier, instead of defaulting to the first part of the line. > This is helpful for those fun NCBI ids > (gi|83029091|ref|XM_357633.3|) where other parts of the program only > have the accession number. Did your callback function get given the "title string" and return the desired key? I had wondered about this, but the only way for this to be general (to work on all file formats) is for the callback function to be given a SeqRecord object - which means having to fully parse the file during the indexing, which ends up being *much* slower. We can do this if you think it adds a lot of utility, i.e. mimic the key_function argument we already have on Bio.SeqIO.to_dict() Peter From biopython at maubp.freeserve.co.uk Tue Sep 1 09:38:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 14:38:07 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com> References: <200904161146.28203.jblanca@btc.upv.es> <21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com> <320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com> <21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com> <15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com> <320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com> <15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com> <320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com> <320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com> <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com> Message-ID: <320fb6e00909010638v5c9cec06t66b24e1e755c46cb@mail.gmail.com> On Fri, Aug 14, 2009 at 1:00 PM, Peter wrote: >>> Jose's code uses seek/tell which means it has to have a handle >>> to an actual file. 
He also used binary read mode - I'm not sure if >>> this was essential or not. >> >> Binary mode was not essential - opening an SFF file in default >> mode also seemed to work fine with Jose's code. > > Having worked on this more, default mode or binary mode are fine. > However, as you might expect, you can't use Python's universal > read lines mode when parsing SFF files. Just to clarify this for the record - on Unix you can parse an SFF file opened in default mode ("r") or binary mode ("rb") but not universal read line mode ("rU"). However, on Windows only binary mode works. I've updated my SFF code on github to catch this (as otherwise the error messages are rather cryptic). Peter From biopython at maubp.freeserve.co.uk Tue Sep 1 09:56:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 14:56:26 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909010656h594e908cu246138d45442df45@mail.gmail.com> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: > >Peter wrote: >> Using BioSQL in this way is a much more general tool than >> simply "indexing a sequence file". It feels like a sledgehammer >> to crack a nut. Also, do you expect it to scale well for 10 million >> plus short reads? It may do, but on the other hand it may not. > > Agreed that it would introduce extra overhead for something like > short reads. If you are talking about serializing SeqRecords, it > would make sense to re-use what we have in BioSQL. I wasn't talking about serialising SeqRecord objects. 
I agree there is (almost) no point implementing new serialisation code when we already have BioSQL. > If you are talking about storing just file offsets, then a lightweight > solution makes more sense. Indeed. > For me, the initial parse time to prepare an index is not as much > of an issue since it happens once while queries on it will happen > multiple times. It depends on the expected work load - if you are thinking about indexing a local copy of GenBank, but only expect to pull out a few (hundred) records, then the index time may be longer than the total access time. But in general, if we are talking about saving the index to a file (which can then be reloaded) I would agree, the up front cost to prepare the index isn't critical. On the subject of how to store an index of file offsets on disk, I think the old Biopython Martel/Mindy indexing code used to create OBDA style indexes (either simple flat files or BDB based). We should certainly consider these for cross project compatibility, or perhaps introduce a new OBDA version which might use something like SQLite internally instead? http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html http://lists.open-bio.org/pipermail/open-bio-l/2009-September/000567.html Peter From eoc210 at googlemail.com Wed Sep 2 08:25:24 2009 From: eoc210 at googlemail.com (Ed Cannon) Date: Wed, 2 Sep 2009 13:25:24 +0100 Subject: [Biopython-dev] OBO2OWL parser / converter In-Reply-To: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net> References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com> <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net> Message-ID: <9e02410b0909020525w5cbf59dek46e0ab1b5144f8@mail.gmail.com> Hi Hilmar, My OBO2OWL parser is implemented based on Tirmizi & Miranker's paper titled: "OBO2OWL: Roundtrip between OBO and OWL" ( www.cs.utexas.edu/~hamid/pub/tirmizi-obo2owl-tr-06-47.pdf ) [1]. 
After having looked at the link you sent me to the OBO2OWL mappings Google spreadsheet, it appears that there are some differences, which I'm looking into at the minute. Ref: 1. Syed Hamid Tirmizi and Daniel P Miranker. (2006). OBO2OWL: Roundtrip between OBO and OWL. The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-06-47, October 2, 16 pages. Cheers, Ed 2009/8/31 Hilmar Lapp > Hi Ed - > > is your converter operating in a way that is congruent with (or even > utilizing) the mapping and the converter provided by the NCBO and Berkeley > Ontology projects? > > http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page > > If not, I'm not sure how beneficial it is for users to have multiple and > possibly conflicting mappings. > > -hilmar > > > On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote: > > Hi All, >> >> I would like to thank you guys for all your hard work and effort in making >> biopython a great piece of open software. >> >> I would also like to introduce myself, my name is Ed Cannon, I am a >> postdoc >> at Cambridge University working in the fields of chemo/bioinformatics and >> semantic web technologies in the group of Peter Murray-Rust. >> >> Since a fair amount of my work involves ontologies, I have written an open >> biomedical ontology (.obo) to web ontology language (.owl) converter. The >> resultant file can be loaded and used from Protege. I was wondering if >> this >> software would be of any interest to the biopython community? I have just >> sent a pull request to biopython on github. The code is located at my >> branch >> on my account: http://github.com/eoc21/biopython/tree/eoc21Branch. 
>> >> Thanks, >> >> Ed >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > From bugzilla-daemon at portal.open-bio.org Wed Sep 2 11:24:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Sep 2009 11:24:19 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909021524.n82FOJ7U021693@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-02 11:24 EST ------- (In reply to comment #3) > I can now parse the Roche SFF index, allowing fast random access to > the reads. See: > > http://github.com/peterjc/biopython/commits/index > http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html > > Peter That branch now has support for SeqIO parsing, indexing and *writing* of SFF files. The write support is still very new and needs more testing, but is looking promising. Note that while currently I read the undocumented Roche style SFF index block, I have not yet attempted to write out such an index (probably unwise unless the format does get published?). Also note that there is still scope for improvement for how the trimming information is presented in the SeqRecord object (perhaps some kind of masked SeqRecord/Seq as has been suggested on the mailing lists). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Sep 2 12:45:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Sep 2009 12:45:48 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909021645.n82GjmbA023923@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-02 12:45 EST ------- (In reply to comment #4) > That branch now has support for SeqIO parsing, indexing and *writing* of > SFF files. The write support is still very new and needs more testing, > but is looking promising. Note that while currently I read the undocumented > Roche style SFF index block, I have not yet attempted to write out such an > index (probably unwise unless the format does get published?). It now has a first attempt at writing a Roche style SFF index, which my code will parse back again happily. I have not yet tried the resulting file with the Roche SFF tools. Note that this does not preserve any Roche XML meta data. Note also that the index is skipped if any of the record names are not 14 chars long (all the Roche indexes I have looked at use 14 character names). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Sep 4 06:23:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Sep 2009 06:23:26 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909041023.n84ANQgj023187@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-04 06:23 EST ------- I've been working on the Roche SFF indexes, and via their tools have discovered there are at least two index block formats used: Most SFF files I have looked at have an index block which starts ".mft1.00" (short for Manifest v1.00 is my guess) which holds both an XML "manifest" or meta data, plus a read offset index. You can also get SFF files where the index block starts ".srt1.00" (Short Read Table v1.00 maybe?) which have just an index. The index details themselves are the same in both cases, and support arbitrary read name lengths. The offset is in base 255 (not 256), apparently so that byte 255 (0xFF) can be used as a separator character. For typical Roche SFF files, the read names are 14 characters, and the index uses 20 bytes per read. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
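The base 255 arithmetic described in comment #6 can be sketched as follows. This is only an illustration: the function names are invented here, the four digit width and the most-significant-digit-first byte order are assumptions that the comment does not state, and it is not the code from the github branch.

```python
def encode_base255(offset, digits=4):
    """Encode a non-negative offset as base 255 digits (assumed big-endian)."""
    if not 0 <= offset < 255 ** digits:
        raise ValueError("Offset %r does not fit in %i base 255 digits"
                         % (offset, digits))
    out = bytearray()
    for _ in range(digits):
        offset, digit = divmod(offset, 255)
        out.append(digit)  # least significant digit first...
    out.reverse()          # ...then flip to most significant first
    return bytes(out)


def decode_base255(data):
    """Decode base 255 digits (assumed big-endian) back to an integer."""
    value = 0
    for byte in data:
        if byte == 0xFF:
            raise ValueError("0xFF is not a valid base 255 digit")
        value = value * 255 + byte
    return value


encoded = encode_base255(123456789)
print(encoded, decode_base255(encoded))
```

Because every digit lies in the range 0 to 254, the byte 0xFF never appears inside an encoded offset, which is what frees it for use as a separator.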
From bugzilla-daemon at portal.open-bio.org Fri Sep 4 06:54:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Sep 2009 06:54:39 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909041054.n84AsdNe023921@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-04 06:54 EST ------- The Staden IO lib has references to ".srt1.00" (454 sorted v1.00) and also another SFF index format, which start ".hsh1.00" (hash table v1.00). See files io_lib/progs/hash_sff.c and io_lib/open_trace_file.c from http://sourceforge.net/projects/staden/ Scanning their code also confirms my base 255 deduction for the ".srt" indexes, see function getuint4_255, and the use of 0xFF as a break character. Interestingly they only expect 4 bytes for the offset (limiting this to almost 4GB SFF files). There is a fifth byte which is usually null, this could be a name terminator (although this is not actually needed), or used for 4GB+ SFF offsets. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Sep 4 11:33:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Sep 2009 16:33:16 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? 
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> Hi David, [This is a continuation of a thread on the main list, but it is much more suited to the dev list now.] On Tue, Sep 1, 2009 at 11:38 PM, David Winter wrote: > Peter wrote: >> David - I would prefer we also put your new wrappers in >> Bio.Emboss.Applications, and would be happy to look at adding >> those to CVS now that Biopython 1.51 is out (I had forgotten >> about them actually - so thanks for the reminder). >> >> Peter > > Hi Peter, > > I'd almost forgotten about them myself! I only put them in their own module > because I had the PhyML wrapper as well and that's not an EMBOSS > application. I see you've done that on github. I had a look at merging this into CVS, but had a few comments first. I found you had a load of tabs in your file (please use 4 space indentation in future). http://www.biopython.org/wiki/Contributing#Coding_conventions I am unclear why you are subclassing _EmbossMinimalCommandLine instead of _EmbossCommandLine since most (all?) of the new wrappers use the "outfile" parameter. As I recall EMBOSS isn't fussy about the presence of the equals sign (right now our wrappers mostly omit the equals, but not all the time - which looks odd to me). Also your code seems to me missing the __str__ / _validate changes on the trunk. 
And finally, I think you can add yourself to the copyright at the top of the file for this work ;) Peter From biopython at maubp.freeserve.co.uk Fri Sep 4 13:22:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Sep 2009 18:22:27 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> Message-ID: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> On Tue, Sep 1, 2009 at 2:25 PM, Peter wrote: > On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: >> Hi Peter; >> >> [indexed dict usage] >>> What file formats where you working on, and how many records? >> >> It was a 100Mb fasta file with about 41,000 records. Nothing too >> heavy but it worked great. > > Yeah, with just 41,000 keys and offsets the in memory dict would > be pretty small too. This is within the range of file sizes I expect > the Bio.SeqIO.indexed_dict() functionality to be used on. Cool. > >> The only change I made was to generalize the record building line: >> >> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) >> >> to allow an arbitrary function to be passed to define the >> identifier, instead of defaulting to the first part of the line. >> This is helpful for those fun NCBI ids >> (gi|83029091|ref|XM_357633.3|) where other parts of the program only >> have the accession number. > > Did your callback function get given the "title string" and return > the desired key? 
> > I had wondered about this, but the only way for this to be general > (to work on all file formats) is for the callback function to be given > a SeqRecord object - which means having to fully parse the file > during the indexing, which ends up being *much* slower. We can > do this if you think it adds a lot of utility i.e. mimic the key_function > argument we already have on Bio.SeqIO.to_dict() A less flexible option is a callback function which maps the default record.id to a new key. This would solve your NCBI FASTA issue, and might be handy in other settings (e.g. removing the version suffix in GenBank identifiers). However, it would not allow, for example, switching to a completely different identifier (e.g. the GI number) which is present elsewhere in the file. The point is we can support this kind of limited key_function without suffering the severe speed penalty which doing a full parse to give SeqRecord objects would impose. How does that sound, Brad? It should add just a little complexity to the current code, and allow some neat tricks. Or we can leave things as they are (KISS). Peter From mjldehoon at yahoo.com Sat Sep 5 04:17:00 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 5 Sep 2009 01:17:00 -0700 (PDT) Subject: [Biopython-dev] Bio.Entrez.parse Message-ID: <339938.48242.qm@web62405.mail.re1.yahoo.com> Hi everybody, Recently I was trying to parse a huge Entrez XML file containing Entrez gene records. Because of the size of the file, Entrez.read failed with a memory error since it could not keep all the information in the XML file in memory. I decided to add a parse() function to Bio.Entrez that can iterate over such large files. This function is useful if the XML file essentially contains a list of records; the parse() function is a generator function that returns these records one by one. --Michiel. 
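To illustrate the idea of yielding records one at a time instead of building the whole list in memory, here is a small sketch using only the standard library's xml.etree.ElementTree.iterparse. It is not the Bio.Entrez.parse() implementation, and the element names (RecordSet, Record, Id, Symbol) are hypothetical stand-ins rather than the real Entrez gene schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(handle, record_tag):
    """Yield each <record_tag> element as a dict of child tag -> text."""
    for _event, element in ET.iterparse(handle, events=("end",)):
        if element.tag == record_tag:
            record = {child.tag: child.text for child in element}
            # Drop the element's children once the dict has been built,
            # so finished records do not pile up in memory.
            element.clear()
            yield record


# Hypothetical XML: a flat list of records under one root element.
xml_data = b"""<RecordSet>
  <Record><Id>1</Id><Symbol>abc</Symbol></Record>
  <Record><Id>2</Id><Symbol>xyz</Symbol></Record>
</RecordSet>"""

for record in iter_records(io.BytesIO(xml_data), "Record"):
    print(record["Id"], record["Symbol"])
```

The caller sees an ordinary iterator, so code written as a loop over a list of records should need little change to work on a file too large to read in one go.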
From p.j.a.cock at googlemail.com Sat Sep 5 08:59:09 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 5 Sep 2009 13:59:09 +0100 Subject: [Biopython-dev] Bio.Entrez.parse In-Reply-To: <339938.48242.qm@web62405.mail.re1.yahoo.com> References: <339938.48242.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00909050559p2c9da2f1o60905ac3dfe0cb35@mail.gmail.com> On Sat, Sep 5, 2009 at 9:17 AM, Michiel de Hoon wrote: > Hi everybody, > Recently I was trying to parse a huge Entrez XML file containing Entrez gene > records. Because of the size of the file, Entrez.read failed with a memory > error since it could not keep the entire information in the XML file in memory. > I decided to add a parse() function to Bio.Entrez that can iterate over such large > files. This function is useful if the XML file essentially contains a list of records; > the parse() function is a generator function that returns these records one by one. That sounds excellent - I'd noticed that usually Bio.Entrez.read() would return a list of (large nested) records, so this should be a natural extension. Peter From biopython at maubp.freeserve.co.uk Mon Sep 7 07:56:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Sep 2009 12:56:17 +0100 Subject: [Biopython-dev] Anonymous CVS working again :) Message-ID: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com> Just an FYI, While the developer server dev.open-bio.org has been fine, recently our public read only mirror at cvs.open-bio.org (and cvs.biopython.org) had not been updated. This affected Biopython and EMBOSS. And for Biopython as a knock on effect, this had meant the latest code at http://biopython.org/SRC/biopython/ was a little out of date. 
[Biopython's github mirror was not affected] These all seem to be working fine once again - thanks to someone at the OBF - let me know who and I'll buy you a beer when we (next) meet up :) Peter From biopython at maubp.freeserve.co.uk Mon Sep 7 13:34:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Sep 2009 18:34:53 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? In-Reply-To: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> Message-ID: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> On Fri, Sep 4, 2009 at 4:33 PM, Peter wrote: > Hi David, > > [This is a continuation of a thread on the main list, but it is much more > suited to the dev list now.] > > ... > > I see you've done that on github. I had a look at merging this into CVS, > but had a few comments first. > > I found you had a load of tabs in your file (please use 4 space indentation > in future). http://www.biopython.org/wiki/Contributing#Coding_conventions Thanks. > I am unclear why you are subclassing _EmbossMinimalCommandLine > instead of _EmbossCommandLine since most (all?) of the new wrappers > use the "outfile" parameter. As I recall EMBOSS isn't fussy about the > presence of the equals sign (right now our wrappers mostly omit the > equals, but not all the time - which looks odd to me). I see you've switched to _EmbossCommandLine - fine. > Also your code seems to me missing the __str__ / _validate changes > on the trunk. Also fixed, thanks. > And finally, I think you can add yourself to the copyright at the top of > the file for this work ;) Cool. I have checked this into CVS, but did also fix an old typo (in a docstring) and one new typo (in an argument name). 
Thanks David! Do you fancy writing a short unit test, e.g. test_Emboss_phylip.py based on test_Emboss.py? Continuing on the github branch is fine. We should put you in the CONTRIB file now too (are there any other recent people we've missed?). Would you like to give a webpage, or is this email address fine (be warned it may get harvested for spam)? Thank you, Peter From biopython at maubp.freeserve.co.uk Mon Sep 7 16:00:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Sep 2009 21:00:46 +0100 Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :) In-Reply-To: References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com> Message-ID: <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com> > Are these being kept in sync? bioperl's moved completely away from > cvs to svn with very little pain. We found sync-ing the two more trouble > than it was worth. Perhaps we are talking at cross purposes here, Chris. Right now Biopython and EMBOSS are using CVS, with developers committing to dev.open-bio.org, which then updates a read only CVS mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org) to provide anonymous access. Likewise, BioPerl etc are using SVN, with developers committing to dev.open-bio.org, which then updates a read only SVN mirror at code.open-bio.org (or its other aliases) to provide anonymous access. 
Peter From biopython at maubp.freeserve.co.uk Mon Sep 7 17:26:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Sep 2009 22:26:26 +0100 Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :) In-Reply-To: <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu> References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com> <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com> <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu> Message-ID: <320fb6e00909071426w1dfed95bx703384b3227eee6b@mail.gmail.com> On Mon, Sep 7, 2009 at 9:44 PM, Chris Fields wrote: > On Sep 7, 2009, at 3:00 PM, Peter wrote: > >>> Are these being kept in sync? bioperl's moved completely away from >>> cvs to svn with very little pain. We found sync-ing the two more trouble >>> than it was worth. >> >> Perhaps we are talking at cross purposes here, Chris. >> >> Right now Biopython and EMBOSS are using CVS, with developers >> committing to dev.open-bio.org, which then updates a read only CVS >> mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org) >> to provide anonymous access. >> >> Likewise, BioPerl etc are using SVN, with developers committing to >> dev.open-bio.org, which then updates a read only SVN mirror at >> code.open-bio.org (or its other aliases) to provide anonymous access. >> >> Peter > > Right, I understand that, but you also have a git repo on github (unless I'm > mistaken). Based on that I assume you plan on migrating over to dev git > and/or github eventually, but I'm unsure of the future of the CVS repo. Right! For now, CVS changes are pushed to github. Once we move to git, the CVS repo will no longer be used, and will be left frozen in time. > My point was, we had been in a similar situation. We had thought of having > a sync'ed CVS <-> SVN repo at one point, but it was way too much trouble to > deal with and just dropped CVS altogether after the migration. 
Instead, we > just started switching all docs over to point to svn instead with > ample warning on the mail lists, and it all worked out in the end (we have > had very few users inquiring about CVS). Likewise, we could have git changes pushed into CVS, but there is little point. We plan to just quit using CVS. Peter From david.winter at gmail.com Mon Sep 7 18:54:52 2009 From: david.winter at gmail.com (David Winter) Date: Tue, 08 Sep 2009 10:54:52 +1200 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? In-Reply-To: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> Message-ID: <4AA58F3C.6080200@student.otago.ac.nz> Hi Peter and all, Sorry for the lack of communication from me on this. I successfully made it off the grid for the weekend then found I couldn't push to github from work (no ssh over the proxy for students) and couldn't email the list from home (can't use the uni's SMTP from off campus) - IT-security catch 22! > I see you've switched to _EmbossCommandLine - fine. > > Yeah, this was my stupid fault - you'd given me a heads up about the two different versions of the _EmbossCommandline and I tried out what I already had with the 'normal' version and saw that it failed but didn't read the error message properly (of course it failed because I was trying to give it the outfile parameter twice...) > [... snip the other things you asked about...] > > > I have checked this into CVS, but did also fix an old typo (in a docstring) > and one new typo (in an argument name). Thanks David! > > Do you fancy writing a short unit test, e.g. test_Emboss_phylip.py > based on test_Emboss.py? 
Continuing on the github branch is fine. > Sounds good, will have a go at getting something going in the next couple of days > We should put you in the CONTRIB file now too (are there any other > recent people we've missed?). Would you like to give a webpage, or > is this email address fine (be warned it may get harvested for spam)? > > Well, I'm not sure it's much of a contribution from me, but thanks :) Perhaps add david.winter at gmail.com - gmail seems to handle spam pretty well and I won't be a student here for ever (right?...) Cheers, David From biopython at maubp.freeserve.co.uk Tue Sep 8 05:21:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 10:21:11 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> Message-ID: <320fb6e00909080221m7377f033ue9b1617b0bc38f5b@mail.gmail.com> On Mon, Sep 7, 2009 at 11:54 PM, David Winter wrote: > Hi Peter and all, > > Sorry the lack of communication from me on this. I successfully made it off > the grid for the weekend then found I couldn't push to github from work (no > ssh over the proxy for students) and couldn't email the list from home > (can't use the uni's SMTP from off campus ) - IT-security catch 22! Tricky. >> I see you've switched to _EmbossCommandLine - fine. 
> > Yeah, this was my stupid fault - you'd given me a heads up about the two > different version of the _EmbossCommandline and I tried out what I already > had with the the 'normal' version as saw that it failed but didn't read the > error message properly (of course it failed because I was trying to give it > the outfile parameter twice...) OK - I wondered if there was some other reason I couldn't see, so worth checking, >> I have checked this into CVS, but did also fix an old typo (in a >> docstring) and one new typo (in an argument name). Thanks David! >> >> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py >> based on test_Emboss.py? Continuing on the github branch is fine. > > Sounds good, will have a go at getting something going in the next > couple of days Great - whenever you get time. Thanks! >> We should put you in the CONTRIB file now too (are there any other >> recent people we've missed?). Would you like to give a webpage, or >> is this email address fine (be warned it may get harvested for spam)? > > Well, I'm not sure it's much of a contribution from me, but thanks :) But I'm expecting more in future *grin* > Perhaps add david.winter at gmail.com - gmail seems to handle spam > pretty well and I won't be a student here for ever (right?...) There is always a postdoc ;) Also can someone remind me at some point that we should include at least one of the EMBOSS PHYLIP tools in the alignment command line bit of the tutorial... 
Peter From chapmanb at 50mail.com Tue Sep 8 08:14:05 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 8 Sep 2009 08:14:05 -0400 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> Message-ID: <20090908121405.GF63266@sobchak.mgh.harvard.edu> Hi Peter; [... callback function for specifying an ID ...] > > Did your callback function get given the "title string" and return > > the desired key? > > > > I had wondered about this, but the only way for this to be general > > (to work on all file formats) is for the callback function to be given > > a SeqRecord object - which means having to fully parse the file > > during the indexing, which ends up being *much* slower. We can > > do this if you think it adds a lot of utility i.e. mimic the key_function > > argument we already have on Bio.SeqIO.to_dict() > > A less flexible option is a callback function which maps the default > record.id to a new key. This would solve your NCBI FASTA issue, > and might be handy in other settings (e.g. removing the version > suffix in GenBank identifiers). However, it would not allow for > example switching to a completely different identifier (e.g. the GI > number) which is present elsewhere in the file. > > The point is we can support this kind of limited key_function > without suffering the severe speed penalty which doing a full > parse to give SeqRecord objects would impose. This is a great compromise. 
You're right, parsing the SeqRecord is too much, and allowing manipulation of the default identifier would work fine.

If people need to do something much more complicated to get the ID they would probably be better off extending the existing classes and writing a custom indexer that pulls the IDs they need.

Brad

From biopython at maubp.freeserve.co.uk Tue Sep 8 09:22:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:22:35 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090908121405.GF63266@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
 <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
 <20090831132451.GD75451@sobchak.mgh.harvard.edu>
 <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
 <20090901130639.GI75451@sobchak.mgh.harvard.edu>
 <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
 <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
 <20090908121405.GF63266@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:14 PM, Brad Chapman wrote:
> Hi Peter;
>
> [... callback function for specifying an ID ...]
>
>> A less flexible option is a callback function which maps the default
>> record.id to a new key. This would solve your NCBI FASTA issue,
>> and might be handy in other settings (e.g. removing the version
>> suffix in GenBank identifiers). However, it would not allow for
>> example switching to a completely different identifier (e.g. the GI
>> number) which is present elsewhere in the file.
>>
>> The point is we can support this kind of limited key_function
>> without suffering the severe speed penalty which doing a full
>> parse to give SeqRecord objects would impose.
>
> This is a great compromise. You're right, parsing the SeqRecord is too
> much, and allowing manipulation of the default identifier would work fine.
Cool - done in CVS, including the docstring and the tutorial. > If people need to do something much more complicated to get the ID > they would probably be better off extending the existing classes and > writing a custom indexer that pulls the IDs they need. Certainly - we can't expect to cover every possible use case, and trying to do so will result in an overly complicated API. Did you have any ideas for a better name than Bio.SeqIO.indexed_dict()? Peter From mjldehoon at yahoo.com Tue Sep 8 09:30:30 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Sep 2009 06:30:30 -0700 (PDT) Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> Message-ID: <184931.66541.qm@web62403.mail.re1.yahoo.com> --- On Tue, 9/8/09, Peter wrote: > Did you have any ideas for a better name than > Bio.SeqIO.indexed_dict()? > Is indexed_dict a function? If so, I suggest we use a verb instead of a noun. Maybe just "index"? --Michiel. From biopython at maubp.freeserve.co.uk Tue Sep 8 09:53:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 14:53:36 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <184931.66541.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> <184931.66541.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote: > --- On Tue, 9/8/09, Peter wrote: >> Did you have any ideas for a better name than >> Bio.SeqIO.indexed_dict()? > > Is indexed_dict a function? If so, I suggest we use a verb instead > of a noun. Maybe just "index"? > > --Michiel. Bio.SeqIO.indexed_dict() is a function which returns a dictionary like object. So yes, a verb would be better, and "index" is short and sweet. 
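
The limited key_function idea discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not the Bio.SeqIO code checked into CVS: it handles FASTA only, and index_fasta_offsets and accession_from_ncbi_id are hypothetical names. The point it shows is that the callback sees just the default identifier (the first word of the title line), so no SeqRecord has to be built while indexing.

```python
import io


def accession_from_ncbi_id(identifier):
    """Hypothetical callback: 'gi|83029091|ref|XM_357633.3|' -> 'XM_357633.3'.

    Identifiers that do not look like NCBI 'gi|...' ids are returned unchanged.
    """
    parts = identifier.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    return identifier


def index_fasta_offsets(handle, key_function=None):
    """Map each record key to the byte offset of its '>' title line.

    handle must be opened in binary mode so that tell() gives byte offsets.
    key_function, if given, maps the default id to the key to store.
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            words = line[1:].strip().split(None, 1)
            default_id = words[0].decode("ascii") if words else ""
            key = key_function(default_id) if key_function else default_id
            if key in offsets:
                raise ValueError("Duplicate key %r" % key)
            offsets[key] = offset
    return offsets


if __name__ == "__main__":
    example = b">gi|83029091|ref|XM_357633.3| example\nACGT\n>plain_id other\nGG\n"
    index = index_fasta_offsets(io.BytesIO(example), accession_from_ncbi_id)
    for key in sorted(index):
        print(key, index[key])
```

Fetching a record later would mean seeking to the stored offset and parsing just that one entry. Switching to an identifier found elsewhere in the record (e.g. the GI number in a GenBank file) is out of reach for this kind of callback, as noted above.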
Peter From bugzilla-daemon at portal.open-bio.org Wed Sep 9 09:24:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Sep 2009 09:24:41 -0400 Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be deepcopied In-Reply-To: Message-ID: <200909091324.n89DOf4Q013555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2781 klaus.kopec at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #2 from klaus.kopec at tuebingen.mpg.de 2009-09-09 09:24 EST ------- this seems to be resolved in 1.51 with Python 2.6.2 under 64Bit Ubuntu? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Sep 9 11:18:01 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Sep 2009 11:18:01 -0400 Subject: [Biopython-dev] [Bug 2910] New: Parsing some pdb files results in shorter peptide sequences than expected Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2910 Summary: Parsing some pdb files results in shorter peptide sequences than expected Product: Biopython Version: 1.49 Platform: PC OS/Version: Linux Status: NEW Severity: critical Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: schafer at rostlab.org Parsing the one-letter sequence for a specific chain out of a given pdb file often seems to result in shorter sequences than expected. The following code demonstrates this behavior for structure 1a2d chain A. Aminoacid #118 VAL after the HETATOM (#117) block is missing in the result. 
------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())
print sequence
------------------CODE----------------

Another example is structure 13gs, chains C and D. Both sequences are ECG, but the code above returns only CG. So this behavior seems to be independent of the presence of a HETATOM block.

This bug is also present in version 1.51.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed Sep 9 11:18:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:48 -0400
Subject: [Biopython-dev] [Bug 2910] Parsing some pdb files results in shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909091518.n89FImn5016415@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

schafer at rostlab.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schafer at rostlab.org

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 08:55:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:55:03 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909101255.n8ACt3Jd017456@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
            Summary|Parsing some pdb files      |Bio.PDB build_peptides
                   |results in shorter peptide  |sometimes gives shorter
                   |sequences than expected     |peptide sequences than
                   |                            |expected

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:55 EST -------
Retitled as this appears to be a bug in the PPBuilder build_peptides method, not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code

parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.
Chain A Raw chain: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA >From peptide builder: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA Chain B Raw chain: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA >From peptide builder: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA Notice there are discontinuities in both chains A and B, and a missing residue in their peptides. And the output from 13GS, PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760. PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812. PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852. PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948. PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033. 
Chain A Raw chain: MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ >From peptide builder: MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ Chain B Raw chain: PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ >From peptide builder: PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ Chain C Raw chain: ECG >From peptide builder: CG Chain D Raw chain: ECG >From peptide builder: CG Notice there are discontinuities in chains A, B and C, but missing residues in the peptide chains C and D. This suggests the discontinuities are required to trigger the problem. Also there are no HETATM residues for chains C and D. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
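
One way to pin down where the "Raw chain" and "From peptide builder" strings in the output above diverge is to compare them position by position. The following is a minimal sketch in plain Python with no Bio.PDB dependency. first_missing_index is a hypothetical helper name, and the inputs are short excerpts copied from the output above (chain C of 13GS, and the region around the X residue in chain A of 1A2D).

```python
def first_missing_index(raw, built):
    """Return the first index at which raw and built differ.

    If one string is a prefix of the other, the length of the shorter
    string is returned. Returns None when the strings are identical.
    """
    for i, (a, b) in enumerate(zip(raw, built)):
        if a != b:
            return i
    if len(raw) != len(built):
        return min(len(raw), len(built))
    return None


if __name__ == "__main__":
    # 13GS chain C: the leading E is absent from the builder output.
    print(first_missing_index("ECG", "CG"))            # 0
    # 1A2D chain A excerpt: the V following X is absent.
    print(first_missing_index("VVEXVMKG", "VVEXMKG"))  # 4
```

With runs of identical residues the reported index is the end of the run rather than necessarily the residue that was dropped, so this is a rough locator for eyeballing long chains, not a diagnosis of the build_peptides problem.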
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 08:57:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:13 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch
In-Reply-To:
Message-ID: <200909101257.n8ACvDe1017562@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:57 EST -------
I'm marking this as a duplicate of bug 2887, and believe it to be fixed on the trunk.

*** This bug has been marked as a duplicate of bug 2887 ***

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Sep 10 08:57:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:16 -0400
Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable
In-Reply-To:
Message-ID: <200909101257.n8ACvGRn017574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2887

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kellrott at ucsd.edu

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:57 EST -------
*** Bug 2894 has been marked as a duplicate of this bug. ***

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 08:57:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Sep 2009 08:57:20 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909101257.n8ACvKL9017592@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2894, which changed state. Bug 2894 Summary: Jython List difference causes failed assertion in CondonTable Fix+Patch http://bugzilla.open-bio.org/show_bug.cgi?id=2894 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Sep 15 09:51:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 14:51:43 +0100 Subject: [Biopython-dev] Another Biopython release? Message-ID: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Hi all, Looking ahead, Tiago has some population genetics code he hopes to merge into the trunk at the end of the month (or in October), and we still have Brad's GFF stuff, my SFF work, Kristian's RNA code, Kyle's misc suggestions, and perhaps most importantly the phylogenetics GSoC work to consider. I know it's been only a month since we released Biopython 1.51, but does anyone (other than me) think that we already have enough done to warrant another release? The associated CVS freeze would also serve as a good break point for moving to github (see other threads). Here is what we have in the NEWS file at the moment: New helper functions Bio.SeqIO.convert() and Bio.AlignIO.convert() allow an easier way to use Biopython for simple file format conversions. 
Additionally, these new functions allow Biopython to offer important file format specific optimisations (e.g. FASTQ to FASTA, and interconverting FASTQ variants). New function Bio.SeqIO.indexed_dict() allows indexing of most sequence file formats (but not alignment file formats), allowing dictionary like random access to all the entries in the file as SeqRecord objects, keyed on the record id. This is especially useful for very large sequencing files, where all the records cannot be held in memory at once. This supplements the more flexible but memory demanding Bio.SeqIO.to_dict() function. Bio.SeqIO can now write "phd" format files (used by PHRED, PHRAD and CONSED), allowing interconversion with FASTQ files, or FASTA+QUAL files. Bio.Emboss.Applications now includes wrappers for the "new" PHYLIP EMBASSY package (e.g. fneighbor) which replace the "old" PHYLIP EMBASSY package (e.g. efneighbor) whose Biopython wrappers are now obsolete. See also the DEPRECATED file, as several old deprecated modules have finally been removed (e.g. Bio.EUtils which had been replaced by Bio.Entrez). [As an aside - Cymon and David - do you want to be named in the NEWS file for the PHD and PHLIPNEW stuff?] We're still debating the name of the new function Bio.SeqIO.indexed_dict(), but I am happy with the code (and new documentation) otherwise. The related extensions to adding indexing via a lookup file or an SQLite database is another big chunk of work which I don't have time for at the moment, but the code already in CVS is still extremely useful as is. Again, I'm biased, but I think the Bio.SeqIO.convert(...) function will be a popular addition for its convenience, but especially valuable for anyone wanting to convert between the different FASTQ files where the optimised conversion code makes a big speed up. Does doing another quick release (say at some point next week) sound like a good plan? 
If people like the idea, then getting some extra testing in now would be great - especially on the new stuff (it has unit tests of course, but real world usage is also important - thanks Brad for already trying out the FASTA indexing). Peter From bartek at rezolwenta.eu.org Tue Sep 15 10:59:43 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 15 Sep 2009 16:59:43 +0200 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Message-ID: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> On Tue, Sep 15, 2009 at 3:51 PM, Peter wrote: > Hi all, > > I know it's been only a month since we released Biopython 1.51, but > does anyone (other than me) think that we already have enough done > to warrant another release? The associated CVS freeze would also > serve as a good break point for moving to github (see other threads). > That would be great. As for the move to github, I've added some (quite preliminary) docs for developers on how to make commits to the main branch using git and github to the wiki: http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch Any comments and/or improvements are most welcome. cheers Bartek From tiagoantao at gmail.com Tue Sep 15 11:29:55 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:29:55 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Message-ID: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> On Tue, Sep 15, 2009 at 2:51 PM, Peter wrote: > Hi all, > > Looking ahead, Tiago has some population genetics code he hopes to I can put my stuff in CVS (plus I have docs). Question: CVS is still "the place". Right? 
I just need to test stuff on Windows. All the rest seems ok.

Tiago

From biopython at maubp.freeserve.co.uk Tue Sep 15 11:35:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 16:35:13 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
 <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com>
Message-ID: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com>

2009/9/15 Tiago Antão:
> On Tue, Sep 15, 2009 at 2:51 PM, Peter wrote:
>> Hi all,
>>
>> Looking ahead, Tiago has some population genetics code he hopes to
>
> I can put my stuff in CVS (plus I have docs). Question: CVS is still
> "the place". Right?
>
> I just need to test stuff on Windows. All the rest seems ok.

Yes, for the short term CVS is still the master repository.

If you have that stuff ready to check in now, then sure - go ahead. I was assuming you didn't expect to have this ready just yet, hence the proposal to sneak out a quick release first ;)

Give me a shout and I'll get my Windows test machine up and running to double check the unit tests there.

Maybe we'll push back the "next week" idea a bit ;)

Peter

From eric.talevich at gmail.com Tue Sep 15 11:38:45 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 15 Sep 2009 11:38:45 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> Message-ID: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> > > On Tue, Sep 15, 2009 at 3:51 PM, Peter > wrote: > > Hi all, > > > > I know it's been only a month since we released Biopython 1.51, but > > does anyone (other than me) think that we already have enough done > > to warrant another release? The associated CVS freeze would also > > serve as a good break point for moving to github (see other threads). > > > Sounds good to me. Completing the Git migration would make it much easier for me to maintain the Tree/TreeIO stuff, since I already have a few local branches based on it that an upstream CVS duplication would mangle. On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski < bartek at rezolwenta.eu.org> wrote: > That would be great. As for the move to github, I've added some (quite > preliminary) docs for developers on how to make commits to the main > branch using git and github to the wiki: > http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch > > The setup here for committers looks potentially different from the setup in "Merging upstream changes" (describing read-only tracking), but also potentially similar. Diff: - The github:biopython/biopython repository is called "official" here, but "upstream" there. Different protocol too, but that's intentional. - It also shows how to treat the upstream/official repo as the origin, CVS-style. This would mean the developer doesn't have a separate GitHub fork to use for personal branches, uncertain commits, etc. that don't belong in the main repo. Maybe a good way to organize the page would be in terms of how you want to use the repo: 1. 
Tracking Biopython with raw Git (without signing up for GitHub) - git clone http://.../biopython/biopython - remote: upstream - how to format a patch and submit on Bugzilla 2. Tracking Biopython on GitHub (e.g. occasional contributors) - sign up, click the "fork" button - git clone http://.../your-name-here/biopython - remotes: origin, upstream - how to submit a pull request on GitHub - how to add, manage and delete branches locally and on GitHub 3. Collaborating - either #1 or #2 is fine - how to add and manage more remotes - how to apply Git patches, and why copy/paste kills kittens the next time you merge 4. Committing to Biopython - same as #2, but use the private URL for the "upstream" remote - remotes: origin, upstream - policy on pushing upstream, code reviews, tagging, etc. Cheers, Eric From tiagoantao at gmail.com Tue Sep 15 11:39:07 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:39:07 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> Message-ID: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> 2009/9/15 Peter : > Give me a shout and I'll get my Windows test machine up > and running to double check the unit tests there. I think I am not in the mood to impose the burden on you. I will find a Windows machine and test it myself. > Maybe we'll push back the "next week" idea a bit ;) I am OK with "next week". But as I said two months ago, I have calendarized the extension of Bio.PopGen to October. So the material can go on the next release after the one on "next week". 
I just want to have lots of free time and little travel to be able to assist potential users (as I intend to announce the new content to the evolutionary biology crowd quite a lot) -- " It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard From biopython at maubp.freeserve.co.uk Tue Sep 15 11:48:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:48:43 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> Message-ID: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> 2009/9/15 Tiago Ant?o : > 2009/9/15 Peter : >> Give me a shout and I'll get my Windows test machine up >> and running to double check the unit tests there. > > I think I am not in the mood to impose the burden on you. I will find > a Windows machine and test it myself. I was just going to turn on the machine, update to the latest CVS, and do a compile/test with Python 2.4, 2.5, 2.6 - Its no extra effort, as I would be doing this anyway for a new release. Unless of course you are adding wrappers for more command line tools, which would ideally require me to install them - that I might leave for another day ;) >> Maybe we'll push back the "next week" idea a bit ;) > > I am OK with "next week". But as I said two months ago, I have > calendarized the extension of Bio.PopGen to October. So the material > can go on the next release after the one on "next week". 
> > I just want to have lots of free time and little travel to be able to > assist potential users (as I intend to announce the new content to the > evolutionary biology crowd quite a lot) If you are happy to merge the code this week (via CVS), and confident it is ready to release, then I could do the release next week, and then we move to git. Or, I can do the release next week, we move to git, and then you can merge the new code (via git) at your leisure (Oct). Either plan is fine with me. Which do you prefer? Peter From tiagoantao at gmail.com Tue Sep 15 11:57:17 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:57:17 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> Message-ID: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> > Unless of course you are adding wrappers for more command > line tools, which would ideally require me to install them - that > I might leave for another day ;) Spot on ;) . > If you are happy to merge the code this week (via CVS), and > confident it is ready to release, then I could do the release > next week, and then we move to git. I will be only able to test the code on Windows tomorrow, if I can get hold to the machine (which I should). > Either plan is fine with me. Which do you prefer? I prefer merging on CVS, I am still much more proficient with it. You should have the merge there on Friday morning when you arrive. Tutorial included. Tiago -- " It always takes ideology to consummate massive error." 
- Ambrose Evans-Pritchard From biopython at maubp.freeserve.co.uk Tue Sep 15 12:09:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 17:09:32 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> Message-ID: <320fb6e00909150909x2f45e0f5g6c4da77eafcd9a49@mail.gmail.com> 2009/9/15 Tiago Ant?o : >> Unless of course you are adding wrappers for more command >> line tools, which would ideally require me to install them - that >> I might leave for another day ;) > > Spot on ;) . OK. >> If you are happy to merge the code this week (via CVS), and >> confident it is ready to release, then I could do the release >> next week, and then we move to git. > > I will be only able to test the code on Windows tomorrow, if > I can get hold to the machine (which I should). Fingers crossed this doesn't throw any surprises at you. >> Either plan is fine with me. Which do you prefer? > > I prefer merging on CVS, I am still much more proficient with it. You > should have the merge there on Friday morning when you arrive. > Tutorial included. OK then :) Peter From bartek at rezolwenta.eu.org Tue Sep 15 15:45:22 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 15 Sep 2009 21:45:22 +0200 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> Message-ID: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich wrote: > Sounds good to me. Completing the Git migration would make it much easier > for me to maintain the Tree/TreeIO stuff, since I already have a few local > branches based on it that an upstream CVS duplication would mangle. > Then maybe we should wait with committing your changes to the time we drop CVS, in order to avoid loss of change history in your code... What do you think, Peter? > > On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski < > bartek at rezolwenta.eu.org> wrote: > >> That would be great. As for the move to github, I've added some (quite >> preliminary) docs for developers on how to make commits to the main >> branch using git and github to the wiki: >> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch >> >> > The setup here for committers looks potentially different from the setup in > "Merging upstream changes" (describing read-only tracking), but also > potentially similar. Diff: > - The github:biopython/biopython repository is called "official" here, but > "upstream" there. Different protocol too, but that's intentional. Yes, indeed. I know this might seem strange but I was trying to deliberately make the distinction between the main repository in read-write mode (official) and in read-only mode (upstream). I would keep it like this at least for a while so that the transition from CVS is as easy as possible. We have quite a few developers who are new to git and comfortable with CVS. > - It also shows how to treat the upstream/official repo as the origin, > CVS-style. Yes, exactly. 
> This would mean the developer doesn't have a separate GitHub fork
> to use for personal branches, uncertain commits, etc. that don't belong in
> the main repo.

Not necessarily. It just means that these two roles are separate: a
developer can (but does not have to) have his own branch of the biopython
tree where he/she makes the changes, but this is not directly linked
to the official (read-write) biopython branch. I know it's not
necessarily the best way to use github, but I would like to avoid
confusing people who are used to CVS. That's why I decided to describe
the role of a developer with read-write access differently.

BTW, I would see the role of the GitUsage wiki page as a guide rather than
a law. That means that if someone understands better how to use git and
github and does not get lost having both local and remote branches with
different origins, I'm absolutely fine with this. But I think it is quite
complicated, especially for people new to git.

So, in summary, my idea was to (currently) recommend somewhat
CVS-like usage of git on the main branch, which would be simple for
people to use at first and encourage them to create their own branches
and do development on them.

>
> Maybe a good way to organize the page would be in terms of how you want to
> use the repo:
>
> 1. Tracking Biopython with raw Git (without signing up for GitHub)
>   - git clone http://.../biopython/biopython
>   - remote: upstream
>   - how to format a patch and submit on Bugzilla
>
> 2. Tracking Biopython on GitHub (e.g. occasional contributors)
>   - sign up, click the "fork" button
>   - git clone http://.../your-name-here/biopython
>   - remotes: origin, upstream
>   - how to submit a pull request on GitHub
>   - how to add, manage and delete branches locally and on GitHub
>
> 3. Collaborating
>   - either #1 or #2 is fine
>   - how to add and manage more remotes
>   - how to apply Git patches, and why copy/paste kills kittens the next
> time you merge
>
> 4.
Committing to Biopython
>   - same as #2, but use the private URL for the "upstream" remote
>   - remotes: origin, upstream
>   - policy on pushing upstream, code reviews, tagging, etc.
>
>

Having such documentation would be nice. I think that it is currently
structured more or less like that (we just don't have #1 yet, and #4
currently recommends a very simple CVS-like usage). I think that adding #1
and putting in place policies on how to submit patches would be great.
For #4 I would vote for recommending (at least for a while) the CVS-like
way, but I'm absolutely for the development of the alternative procedure,
where the developer works with a single repo both on his code and on
the official branch.

I don't want to underestimate the git skills of our current developers,
but so far I think only a few people have gotten their github accounts,
which means the simpler we keep it the better (at least for a while).
I certainly hope that people will get used to git quickly, but I would
like to make the initial change as simple as possible for people who
will be switching from CVS to git.

cheers
Bartek

From biopython at maubp.freeserve.co.uk Tue Sep 15 16:25:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 21:25:00 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
 <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
 <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
 <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
Message-ID: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>

On Tue, Sep 15, 2009 at 8:45 PM, Bartek Wilczynski wrote:
> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich wrote:
>
>> Sounds good to me.
Completing the Git migration would make it much easier
>> for me to maintain the Tree/TreeIO stuff, since I already have a few local
>> branches based on it that an upstream CVS duplication would mangle.
>
> Then maybe we should wait with committing your changes to the
> time we drop CVS, in order to avoid loss of change history in your
> code... What do you think, Peter?

Yes, I was suggesting getting a final CVS release out soon, and
then looking at merging all the new stuff (including Eric's tree stuff)
starting to pile up on github.

I knew Tiago had a lump of code ready to go and, as we have
just discussed, he would prefer to check that in via CVS.

So, Tiago will do that (this Friday), then we'll do the final CVS
release next week, and then switch to git - and start to focus
on merging in new stuff.

Peter

From chapmanb at 50mail.com Wed Sep 16 08:34:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 Sep 2009 08:34:07 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
 <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <20090916123407.GE13500@sobchak.mgh.harvard.edu>

Hi Peter;

> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).

I don't have a strong opinion about the release. It seems a little
early but if you think we are ready, go for it.

I have tested Osvaldo's Novoalign commandline object and have it
ready to get in. Right now it's in a git tree but I can move it
over to a CVS tree and integrate it for the release. It'll live in
Bio/Sequencing/Applications like you suggested. I should be able to
do that this evening.
I am all about the move to Git and GitHub. Anything we can do to finish that off and make it official is cool by me. > That would be great. As for the move to github, I've added some (quite > preliminary) docs for developers on how to make commits to the main > branch using git and github to the wiki: > http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch This is looking great. I'd agree with Eric that we should be consistent in the doc for suggestions on naming the official biopython branch: git remote add upstream git://github.com/biopython/biopython.git git remote add official git at github.com:biopython/biopython.git My vote is for the "official" naming which is a little more specific. Great stuff, Brad From biopython at maubp.freeserve.co.uk Wed Sep 16 09:30:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 14:30:47 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <20090916123407.GE13500@sobchak.mgh.harvard.edu> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <20090916123407.GE13500@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909160630o4dc1379dwaba667ed13ed9bde@mail.gmail.com> On Wed, Sep 16, 2009 at 1:34 PM, Brad Chapman wrote: > Hi Peter; > >> > I know it's been only a month since we released Biopython 1.51, but >> > does anyone (other than me) think that we already have enough done >> > to warrant another release? The associated CVS freeze would also >> > serve as a good break point for moving to github (see other threads). > > I don't have a strong opinion about the release. It seems a little > early but if you think we are ready go for it. OK. > I have tested Osvaldo's Novoalign commandline object and have it > ready to get in. Right now it's in a git tree but I can move it > over to a CVS tree and integrate it for the release. It'll live in > Bio/Sequencing/Applications like you suggested. 
I should be able to
> do that this evening.

Go for it - I presume you have it in a private git repository at the
moment, as I couldn't spot it on github?

> I am all about the move to Git and GitHub. Anything we can do to
> finish that off and make it official is cool by me.
>
>> That would be great. As for the move to github, I've added some (quite
>> preliminary) docs for developers on how to make commits to the main
>> branch using git and github to the wiki:
>> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch
>
> This is looking great. I'd agree with Eric that we should be
> consistent in the doc for suggestions on naming the official
> biopython branch:
>
> git remote add upstream git://github.com/biopython/biopython.git
> git remote add official git@github.com:biopython/biopython.git
>
> My vote is for the "official" naming which is a little more
> specific.

Well, both "official" and "upstream" have merit. I don't mind
which, but it does make sense to be consistent.

Peter

From biopython at maubp.freeserve.co.uk Wed Sep 16 09:48:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 14:48:39 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>
 <184931.66541.qm@web62403.mail.re1.yahoo.com>
 <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com>
Message-ID: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com>

On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote:
> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote:
>> On Tue, 9/8/09, Peter wrote:
>>> Did you have any ideas for a better name than
>>> Bio.SeqIO.indexed_dict()?
>>
>> Is indexed_dict a function? If so, I suggest we use a verb instead
>> of a noun. Maybe just "index"?
>
> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
> object.
So yes, a verb would be better, and "index" is short and sweet.

Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
to Bio.SeqIO.index() for the next release.

Thinking ahead, in addition to the current code (indexing a file, keeping
the index in memory) we might in future want to add something like
Bio.SeqIO.sqlite_index() where the index is kept in a database etc.

Peter

From bugzilla-daemon at portal.open-bio.org Wed Sep 16 18:00:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 16 Sep 2009 18:00:59 -0400
Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign
In-Reply-To:
Message-ID: <200909162200.n8GM0x7d006226@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2904

chapmanb at 50mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from chapmanb at 50mail.com 2009-09-16 18:00 EST -------
Osvaldo;
Thanks much for the submission. This is committed and lives in:

Bio/Sequencing/Applications

to create a namespace for future sequencing related commandlines. You can
import with:

from Bio.Sequencing.Applications import NovoalignCommandline

It would be great if you wanted to add a cookbook example of using it
(http://biopython.org/wiki/Category:Cookbook) based on a simple pipeline.
Perhaps something involving downstream parsing of the novoalign format, or
converted to SAM as you suggested in Bug 2905.

Thanks,
Brad

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From tiagoantao at gmail.com Wed Sep 16 18:53:31 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 16 Sep 2009 23:53:31 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com> <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com> Message-ID: <6d941f120909161553l3f9bae6u5ba45e6cde9b33e3@mail.gmail.com> Hi, > I knew Tiago has a lump of code ready to go, and as we have > just discussed, as he would prefer to check that in via CVS. I just tested my stuff on Windows. It worked at first attempt. Strange... I actually have a few tests (18 to be precise). They all passed at first. Murphy's laws took a once-in-a-life vacation. I still have a minor problem. I will not have time to update the Tutorial before Tuesday. All is written in http://biopython.org/wiki/PopGen_dev_Genepop , which it will mostly become tutorial. But I simply don't have time until Tuesday to transpose. Code and tests will be committed today. Tiago From krother at rubor.de Thu Sep 17 04:40:28 2009 From: krother at rubor.de (Kristian Rother) Date: Thu, 17 Sep 2009 10:40:28 +0200 Subject: [Biopython-dev] Another Biopython release? Message-ID: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> Hi Peter, I could prepare 2-3 exemplary modules for parsing secondary structures + tests for the Bio.RNA package. As I've been using GIT so far, it would be most convenient to stick with it and contribute when the main archive has migrated. Or is it easy to "jump" to CVS on the last possible occasion? Best, Kristian From biopython at maubp.freeserve.co.uk Thu Sep 17 05:17:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 10:17:37 +0100 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> References: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> Message-ID: <320fb6e00909170217j24bab86eqae45440f72ed415e@mail.gmail.com> On Thu, Sep 17, 2009 at 9:40 AM, Kristian Rother wrote: > > Hi Peter, > > I could prepare 2-3 exemplary modules for parsing secondary structures + > tests for the Bio.RNA package. As I've been using GIT so far, it would be > most convenient to stick with it and contribute when the main archive has > migrated. Or is it easy to "jump" to CVS on the last possible occasion? > > Best, > ? Kristian My plan for this "quick release" was to mark an end to the CVS era, and not to include any of the really new stuff (like your code), but to wait until we are on git before looking at it. So keep it in git for now - this should also make the merge easier. Peter From biopython at maubp.freeserve.co.uk Thu Sep 17 07:27:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 12:27:24 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com> References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> <184931.66541.qm@web62403.mail.re1.yahoo.com> <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com> Message-ID: <320fb6e00909170427o37813aa7kd86464d9c8e81b36@mail.gmail.com> On Wed, Sep 16, 2009 at 2:48 PM, Peter wrote: > On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote: >> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote: >>> ?On Tue, 9/8/09, Peter wrote: >>>> Did you have any ideas for a better name than >>>> Bio.SeqIO.indexed_dict()? >>> >>> Is indexed_dict a function? 
If so, I suggest we use a verb instead
>>> of a noun. Maybe just "index"?
>>
>> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
>> object. So yes, a verb would be better, and "index" is short and sweet.
>
> Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
> to Bio.SeqIO.index() for the next release.

Done in CVS.

> Thinking ahead, in addition to the current code (indexing a file, keeping
> the index in memory) we might in future want to add something like
> Bio.SeqIO.sqlite_index() where the index is kept in a database etc.

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 08:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 13:02:18 +0100
Subject: [Biopython-dev] Using PendingDeprecation for obsolete modules
Message-ID: <320fb6e00909170502m14b4e599l66c778bfe67f3625@mail.gmail.com>

Hi all,

Right now we have a deprecation process which usually looks like this:

(1) Label as obsolete in docstrings
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

See: http://biopython.org/wiki/Deprecation_policy

I've relatively recently noticed the PendingDeprecationWarning
warning (added in Python 2.3), which is by default silent, but the
user can choose to enable it with the python command line
switch -W. For example,

$ python
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
>>>

So, by default, no warning message. But if you ask for them:

$ python -W all
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
__main__:1: PendingDeprecationWarning: X is obsolete
>>>

So, I'm thinking what we should be doing for deprecating
modules is:

(1) Label as obsolete in docstrings, issue PendingDeprecationWarning
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

I guess very few people know about pending deprecation
warnings, and so are unlikely to even try using the warning
switch. Therefore I have little inclination to go through all the
current modules tagged as "obsolete" just to add this silent
warning. However, if we simply start doing this in future, it really
isn't any more work.

Any thoughts?

Peter

From winda002 at student.otago.ac.nz Thu Sep 17 23:52:11 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 18 Sep 2009 15:52:11 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
 <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
 <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
 <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
 <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
 <4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <4AB303EB.1010208@student.otago.ac.nz>

>>
>> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>>

Well, it didn't end up being very short but there is a test on my
"phylo" branch (http://github.com/dwinter/biopython/tree/phylo) in
test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip)
that I'd welcome comments on.

Writing them actually exposed a bug in the code already in CVS, the
FProtParsCommandline option "-intreefile" isn't mandatory so
"is_required" should be set to 0 rather than 1.
In my defence the emboss documentation has it listed as being both mandatory and optional. One possibly foolish thing I did was use TreeIO to test the trees that came out of these programs made sense, thinking that module would be part of the next release. If the plan is for a new release soon and having a test for these wrappers is important the tests could be done with Nexus.Trees but I found that was difficult to use for files with multiple newick trees. Cheers, David From biopython at maubp.freeserve.co.uk Fri Sep 18 05:26:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 10:26:59 +0100 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> Message-ID: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com> On Fri, Sep 18, 2009 at 4:52 AM, David Winter wrote: > >>> >>> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py >>> based on test_Emboss.py? Continuing on the github branch is fine. >>> > > Well, it didn't end up being very short but there is a test on my "phylo" > branch (http://github.com/dwinter/biopython/tree/phylo) in > ?test_PhylipNew.phy ?(which uses a couple of new files in Tests/Phylip) that > I'd welcome comments on. Cool - I'll take a look and try and get (some of) it merged into CVS for this release. > Writing them actually exposed a bug in the code already in CVS, the > FProtParsCommandline option "-intreefile" isn't mandatory so "is_required" > should be set to 0 rather than 1. 
In my defence the emboss documentation has > it listed as being both mandatory and optional. How odd. Maybe EMBOSS switched it at some point? > One possibly foolish thing I did was use TreeIO to test the trees that came > out of these programs made sense, thinking that module would be part of the > next release. If the plan is for a new release soon and having a test for > these wrappers is important the tests could be done with Nexus.Trees but I > found that was difficult to use for files with multiple newick trees. Hmm. In the short term we can either comment out those bits of the test pending the inclusion of TreeIO in the next release, or add a quick tiny parser in the test itself to load the trees, split them on the ";" and pass them one by one to Bio.Nexus.Trees for parsing. Peter From biopython at maubp.freeserve.co.uk Fri Sep 18 07:09:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 12:09:24 +0100 Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug? Message-ID: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com> Hi Michiel (et al), I've been trying to get an example working using the Entrez history for ELink. Strangely here the URL doesn't use history=y but instead cmd=neighbor_history (while the default is cmd=neighbor). However, this appears to show a bug in the Bio.Entrez parser. Consider: from Bio import Entrez pmid = "14630660" print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history").read() This gives: pubmed 14630660 pmc pubmed_pmc_refs 1 NCID_1_2657216_130.14.18.53_9001_1253271778 The XML looks reasonable by eye - although quite different from the non-history version. 
Now if instead of printing that, I try and parse it:

>>> data = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 210, in endElement
    current[name] = value
TypeError: 'str' object does not support item assignment

I can file a Biopython bug if you like, but my initial guess is
the problem lies in the XML itself versus the eLink_020511.dtd
file, which does not mention the LinkSetDbHistory element at
all. Do you agree that this looks like an NCBI problem?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 18 07:40:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:40:06 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
In-Reply-To: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
References: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
Message-ID: <320fb6e00909180440p701d3f5ejd22a605f171989eb@mail.gmail.com>

On Fri, Sep 18, 2009 at 12:09 PM, Peter wrote:
> Hi Michiel (et al),
>
> I've been trying to get an example working using the Entrez history
> for ELink. Strangely here the URL doesn't use history=y but instead
> cmd=neighbor_history (while the default is cmd=neighbor).
>
> However, this appears to show a bug in the Bio.Entrez parser.
Consider: > > from Bio import Entrez > pmid = "14630660" > print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", > from_uid=pmid, cmd="neighbor_history").read() > > This gives: > > > ?"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd"> > > > ? ? ? ?pubmed > ? ? ? ? > ? ? ? ? ? ? ? ?14630660 > ? ? ? ? > ? ? ? ? > ? ? ? ? ? ? ? ?pmc > ? ? ? ? ? ? ? ?pubmed_pmc_refs > ? ? ? ? ? ? ? ?1 > ? ? ? ? > ? ? ? ?NCID_1_2657216_130.14.18.53_9001_1253271778 > > > > The XML looks reasonable by eye - although quite different from > the non-history version... but my initial guess is > the problem lies in the XML itself versus the eLink_020511.dtd > file, which does not mention the LinkSetDbHistory element at > all. Do you agree that this looks like an NCBI problem? I should have done this earlier - but two different XML validators both agree that the "history" version of the NCBI's ELink XML is invalid, while the default is fine. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor_history versus http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor or: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660 I will get in touch with the NCBI... 
Peter From eric.talevich at gmail.com Fri Sep 18 10:08:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 18 Sep 2009 10:08:40 -0400 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com> Message-ID: <3f6baf360909180708w2d06c775w18922106bba003e@mail.gmail.com> On Fri, Sep 18, 2009 at 5:26 AM, Peter wrote: > On Fri, Sep 18, 2009 at 4:52 AM, David Winter > wrote: > > > One possibly foolish thing I did was use TreeIO to test the trees that > came > > out of these programs made sense, thinking that module would be part of > the > > next release. If the plan is for a new release soon and having a test for > > these wrappers is important the tests could be done with Nexus.Trees but > I > > found that was difficult to use for files with multiple newick trees. > > Hmm. In the short term we can either comment out those bits of the test > pending the inclusion of TreeIO in the next release, or add a quick tiny > parser in the test itself to load the trees, split them on the ";" and pass > them one by one to Bio.Nexus.Trees for parsing. > > That's all TreeIO does. 
The relevant loop is in NewickIO.parse(), if you'd like to copy it verbatim: http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py -Eric From biopython at maubp.freeserve.co.uk Sun Sep 20 07:20:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 20 Sep 2009 12:20:43 +0100 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> Message-ID: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> On Fri, Sep 18, 2009 at 4:52 AM, David Winter wrote: >>> >>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py >>> based on test_Emboss.py? Continuing on the github branch is fine. >>> > > Well, it didn't end up being very short but there is a test on my "phylo" > branch (http://github.com/dwinter/biopython/tree/phylo) in > test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that > I'd welcome comments on. I've checked in something based on the current version from github. I added a few checks for missing input files (I was getting cryptic errors), but then decided we had enough input files in the test suite already, and that it might be more useful to try writing alignments to the PHYLIP tools via stdin with AlignIO. Certainly at least one example should try this, assuming it works. I haven't done this yet - feel free to try.
Note that the stdout from the PHYLIPNEW tools isn't clean, so we can't avoid having temp output files: http://lists.open-bio.org/pipermail/emboss-dev/2009-September/000632.html > Writing them actually exposed a bug in the code already in CVS, the > FProtParsCommandline option "-intreefile" isn't mandatory so "is_required" > should be set to 0 rather than 1. In my defence the emboss > documentation has it listed as being both mandatory and optional. Fixed in CVS - does this affect any of the other tools using this argument? > One possibly foolish thing I did was use TreeIO to test the trees that came > out of these programs made sense, thinking that module would be part of the > next release. If the plan is for a new release soon and having a test for > these wrappers is important the tests could be done with Nexus.Trees but I > found that was difficult to use for files with multiple newick trees. I put a quick crude helper function into the unit test as discussed. The unit test is working nicely on Linux with EMBOSS PHYLIP from CVS, I presume you are testing against an official release? If you could check that the CVS code works fine on your setup before the release that would be great. There is a bit more time as I won't be able to do the release on Monday, but it should be Tuesday or Wednesday... and fingers crossed getting PHYLIPNEW installed on my Windows machine will be easy. We can look at adding some more of your example input files, and uncommenting their tests later (especially for cases where we can't generate the input from Biopython directly). I did add the horses.tree file BTW. Thank you David :) Peter From winda002 at student.otago.ac.nz Mon Sep 21 01:13:24 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 21 Sep 2009 17:13:24 +1200 Subject: [Biopython-dev] draft release announcement Message-ID: <4AB70B74.1040308@student.otago.ac.nz> Hi guys, A draft release announcement for 1.52 for you to look at and comment on.
This is written with the idea that there will be a blog post describing the convert and indexed_dict() methods for SeqIO which can be linked to so the announcement itself is pretty brief. I didn't mention the movement from CVS to git in the announcement which might be something worth adding? +++ We are pleased to announce the availability of Biopython 1.52, a new stable release of the Biopython library. It may only have been one month since the last release but in that time we've added enough useful features to warrant a new release. Biopython 1.52 will be of particular interest to people using next generation sequencing - new functions added to the AlignIO and SeqIO tools speed up the way very large sequence files can be dealt with and you can now write phd files like those created by Phred and used in 454 sequencing. SeqIO and AlignIO both now have a helper function called convert() that allows for simple, optimized conversion between file formats while SeqIO gets a new method called indexed_dict() which allows random access to sequences in a file without reading every record in that file into memory. The new release also adds command line wrappers for the EMBOSS versions of the phylip phylogeny programs and squashes a few minor bugs reported since 1.51 was released. Sources and a Windows Installer are available from the downloads page. Thanks to the Biopython development team and to everyone who has reported bugs since our last release ++++ From tiagoantao at gmail.com Mon Sep 21 01:17:39 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 21 Sep 2009 06:17:39 +0100 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> Message-ID: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> There is a big update to the PopGen module, which is now able to do frequentist statistics and tests through GenePop. 
I can draft one paragraph about the subject. I would imagine it is one of the biggest changes and probably the one that adds most functionality. On Mon, Sep 21, 2009 at 6:13 AM, David Winter wrote: > Hi guys, > > A draft release announcement for 1.52 for you to look at and comment on. > This is written with the idea that there will be a blog post describing the > convert and indexed_dict() methods for SeqIO which can be linked to so the > announcement itself is pretty brief. > > I didn't mention the movement from CVS to git in the announcement which > might be something worth adding? > > +++ > We are pleased to announce the availability of Biopython 1.52, a new stable > release of the Biopython library. > > It may only have been one month since the last release but in that time > we've added enough useful features to warrant a new release. Biopython 1.52 > will be of particular interest to people using next generation sequencing - > new functions added to the AlignIO and SeqIO tools speed up the way very > large sequence files can be dealt with and you can now write phd files like > those created by Phred and used in 454 sequencing. > > SeqIO and AlignIO both now have a helper function called convert() that > allows for simple, optimized conversion between file formats while SeqIO > gets a new method called indexed_dict() which allows random access to > sequences in a file without reading every record in that file into memory. > > The new release also adds command line wrappers for the EMBOSS versions of > the phylip phylogeny programs and squashes a few minor bugs reported since > 1.51 was released. > > Sources and a Windows Installer are available from the downloads page.
> > Thanks to the Biopython development team and to everyone who has reported > bugs since our last release > > ++++ > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- " It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard From winda002 at student.otago.ac.nz Mon Sep 21 01:30:44 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 21 Sep 2009 17:30:44 +1200 Subject: [Biopython-dev] draft blog post for 1.52 stuff Message-ID: <4AB70F84.6000709@student.otago.ac.nz> As I mentioned in the draft release announcement it might be useful to have a a blog post up explaining how the new functions for SeqIO and AlignIO work (thanks to Peter for this idea). I've written a draft for a post that looks at the convert function that could do with a little more detail and ignores the indexed_dict() function entirely because I just don't have a good enough idea of how it works. Again, any comments are welcome. Is it a good idea to have a post like this or should we just extend the release announcement to include a little bit more detail? ++ It's only been a month since we released Biopython 1.51 but in that time the CVS server has stacked up enough cool new features that we are going to put together a new release soon. As ever the new functions will be documented in the official tutorial and cookbook but we thought we'd show off a few of these tools here Simple, optimized format conversion with SeqIO and AlignIO No one has ever complained that bioinformatics just doesn't have enough file formats - you probably frequently find yourself converting sequence files to suit particular applications with SeqIO. 
At the moment this is usually a two step process, something like this: >>> records = SeqIO.parse(in_handle, "genbank") >>> SeqIO.write(records, out_handle, "fasta") As of Biopython 1.52 you'll be able to achieve the same result in a single step: >>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta") Adding the convert function to SeqIO will make your scripts more readable and might even save you a couple of lines of code but more importantly it allows the conversion process to be optimized for the two formats being used. In the above example we are moving from a genbank file, which might include multiple features for each sequence, to a fasta file, which doesn't include features. If we used the two step process above we'd be spending time reading each sequence's features into memory just to skip them when they get passed to the write function. SeqIO.convert() knows that the sequences in the input file are destined to be written to a fasta file so it can skip over the features and save a bit of time in doing the conversion. Obviously, the optimization in SeqIO.convert() is most powerful when it's used on very large files like those produced in next generation sequencing projects. When converting between each of the FASTQ file format's variants with the "SeqIO two step" a significant amount of time is taken creating SeqRecord objects for each record in the input file but none of the attributes or methods of the SeqRecord object are required to do the conversion. For this reason SeqIO.convert() deals with each record as two simple strings, one for the record's sequence, the other for its ID. [some information on just how much time that saves on a big file should probably go here!]
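[Editorial sketch, not part of the original draft.] To illustrate the "two simple strings" idea, here is a minimal standard-library FASTQ to FASTA converter that never builds a record object. It is only a sketch of the principle, not the code inside SeqIO.convert(); the function name fastq_to_fasta is hypothetical, and it assumes the simple four-lines-per-record FASTQ layout (no wrapped sequence or quality lines).

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Copy FASTQ records to FASTA using plain strings only.

    Assumes each record is exactly four lines: '@title', sequence,
    '+...' and quality. Returns the number of records written.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        if not title.startswith("@"):
            raise ValueError("Expected '@' at start of record: %r" % title)
        seq = in_handle.readline()
        in_handle.readline()  # the '+' separator line, ignored
        in_handle.readline()  # the quality string, not needed for FASTA
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), seq.rstrip()))
        count += 1
    return count


# Small in-memory demonstration using StringIO handles.
source = io.StringIO("@read1 first\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
target = io.StringIO()
n_records = fastq_to_fasta(source, target)
```

Because the quality line is read and discarded as a raw string, no per-record object is created, which is the saving the paragraph above describes.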
+++ From winda002 at student.otago.ac.nz Mon Sep 21 01:45:34 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 21 Sep 2009 17:45:34 +1200 Subject: [Biopython-dev] draft release announcement In-Reply-To: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> References: <4AB70B74.1040308@student.otago.ac.nz> <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> Message-ID: <4AB712FE.2060304@student.otago.ac.nz> Tiago Antão wrote: > There is a big update to the PopGen module, which is now able to do > frequentist statistics and tests through GenePop. I can draft one > paragraph about the subject. I would imagine it is one of the biggest > changes and probably the one that adds most functionality. > Cool, I see now that I should've read the original thread about the new release more closely. A paragraph from you on your PopGen code would be really helpful. Cheers, David From tiagoantao at gmail.com Mon Sep 21 03:23:24 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 21 Sep 2009 08:23:24 +0100 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB712FE.2060304@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> <4AB712FE.2060304@student.otago.ac.nz> Message-ID: <6d941f120909210023v5dc91079s6ec54a04ad8385e7@mail.gmail.com> Something along the lines of: The Population Genetics module now allows the calculation of several tests, and statistical estimators via a wrapper to GenePop. Supported are tests for Hardy-Weinberg equilibrium, linkage disequilibrium and estimates for various F statistics (Cockerham and Weir Fst and Fis, Robertson and Hill Fis, ...), null allele frequencies and number of migrants among many others. Isolation By Distance (IBD) functionality is also supported. I suppose the changes to PopGen are the biggest going on this Biopython version and probably one of the highlights.
I should update the documentation ASAP. I intend to announce this version to some population genetics and evolutionary biology communities (something I have never done in the past) On Mon, Sep 21, 2009 at 6:45 AM, David Winter wrote: > Tiago Ant?o wrote: >> >> There is a big update to the PopGen module, which is now able to do >> frequentist statistics and tests through GenePop. I can draft one >> paragraph about the subject. I would imagine it is one of the biggest >> changes and probably the one that adds most functionality. >> > > Cool, I see now that I should've read the original thread about the new > release more closely > > A paragraph from you on your PopGen code would be really helpful. > > Cheers, > David > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- " It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard From biopython at maubp.freeserve.co.uk Mon Sep 21 05:01:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Sep 2009 10:01:10 +0100 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> Message-ID: <320fb6e00909210201u3d9032e5vf64ba2953d83938d@mail.gmail.com> On Mon, Sep 21, 2009 at 6:13 AM, David Winter wrote: > Hi guys, > > A draft release announcement for 1.52 for you to look at and comment on. > This is written with the idea that there will be a blog post describing the > convert and indexed_dict() methods for SeqIO which can be linked to so the > announcement itself is pretty brief. I switched indexed_dict() to just index() after discussion on the list. > I didn't mention the movement from CVS to git in the announcement which > might be something worth adding? 
I think that would warrant a one line paragraph (near the end) :) Peter From biopython at maubp.freeserve.co.uk Mon Sep 21 05:11:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Sep 2009 10:11:17 +0100 Subject: [Biopython-dev] draft blog post for 1.52 stuff In-Reply-To: <4AB70F84.6000709@student.otago.ac.nz> References: <4AB70F84.6000709@student.otago.ac.nz> Message-ID: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> On Mon, Sep 21, 2009 at 6:30 AM, David Winter wrote: > As I mentioned in the draft release announcement it might be useful to have > a blog post up explaining how the new functions for SeqIO and AlignIO work > (thanks to Peter for this idea). > > I've written a draft for a post that looks at the convert function that > could do with a little more detail and ignores the indexed_dict() function > entirely because I just don't have a good enough idea of how it works. Great job - thanks for doing this. I'll tackle an indexing introduction blog post since you've done a nice job for convert :) It would also be worth mentioning that the convert function will also take filenames (not just handles), which also helps simplify simple conversion tasks. I should be able to provide some timings for things like FASTQ conversion, or FASTQ to FASTA on multi-million read files (there are probably some on the dev list already...). > Again, any comments are welcome. Is it a good idea to have a post like > this or should we just extend the release announcement to include a little > bit more detail? Well, as I mentioned the idea to David directly, I think these little motivational examples on the blog are worth trying out. What does everyone else think? 
Peter From biopython at maubp.freeserve.co.uk Mon Sep 21 13:41:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Sep 2009 18:41:40 +0100 Subject: [Biopython-dev] draft blog post for 1.52 stuff In-Reply-To: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> References: <4AB70F84.6000709@student.otago.ac.nz> <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> Message-ID: <320fb6e00909211041n6378595cx39f2d395aee0ec7c@mail.gmail.com> On Mon, Sep 21, 2009 at 10:11 AM, Peter wrote: > On Mon, Sep 21, 2009 at 6:30 AM, David Winter > wrote: >> As I mentioned in the draft release announcement it might be useful to have >> a blog post up explaining how the new functions for SeqIO and AlignIO work >> (thanks to Peter for this idea). >> >> I've written a draft for a post that looks at the convert function that >> could do with a little more detail and ignores the indexed_dict() function >> entirely because I just don't have a good enough idea of how it works. > > Great job - thanks for doing this. I'll tackle an indexing introduction > blog post since you've done a nice job for convert :) Done, and up online - hopefully without typos: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ Peter From winda002 at student.otago.ac.nz Tue Sep 22 01:05:31 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 22 Sep 2009 17:05:31 +1200 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> Message-ID: <4AB85B1B.2000704@student.otago.ac.nz> David Winter wrote: > Hi guys, > > A draft release announcement for 1.52 for you to look at and comment > on. This is written with the idea that there will be a blog post > describing the convert and indexed_dict() methods for SeqIO which can > be linked to so the > announcement itself is pretty brief. 
Thanks to Peter and Tiago for their suggestions, there is now a marked up version of this draft with those suggestions ready and waiting on to go on the blog. Still time for suggestions from anyone else. David From winda002 at student.otago.ac.nz Tue Sep 22 01:14:07 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 22 Sep 2009 17:14:07 +1200 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> Message-ID: <4AB85D1F.7010901@student.otago.ac.nz> Peter wrote: > > > >> Writing them actually exposed a bug in the code already in CVS, the >> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required" >> should be set to 0 rather than 1. In my defence the emboss >> documentation has it listed as being both mandatory and optional. >> > > Fixed in CVS - does this affect any of the other tools using this argument? > Nope, I only slipped on this one ;) > >> One possibly foolish thing I did was use TreeIO to test the trees that came >> out of these programs made sense, thinking that module would be part of the >> next release. If the plan is for a new release soon and having a test for >> these wrappers is important the tests could be done with Nexus.Trees but I >> found that was difficult to use for files with multiple newick trees. >> > > I put a quick crude helper function into the unit test as discussed. 
> > The unit test is working nicely on Linux with EMBOSS PHYLIP > from CVS, I presume you are testing against an official release? > If you could check that the CVS code works fine on your setup before the > release that would be great. Finally got in front of the right computer to do this. The tests in the (Biopython) CVS work fine with the official EMBOSS 6.1.0 release (on ubuntu if that helps). I'd offer to try it out on windows but I don't have EMBOSS, a compiler or any of the libraries that I'd need to do that! Cheers, David From biopython at maubp.freeserve.co.uk Tue Sep 22 05:23:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 10:23:10 +0100 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <4AB85D1F.7010901@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> <4AB85D1F.7010901@student.otago.ac.nz> Message-ID: <320fb6e00909220223q6f079a39o74916d20291c3400@mail.gmail.com> On Tue, Sep 22, 2009 at 6:14 AM, David Winter wrote: > Peter wrote: >>> >>> Writing them actually exposed a bug in the code already in CVS, >>> the FProtParsCommandline option "-intreefile" isn't mandatory so >>> "is_required" should be set to 0 rather than 1. In my defence the >>> emboss documentation has it listed as being both mandatory and >>> optional. >> >> Fixed in CVS - does this affect any of the other tools using this >> argument? > > Nope, I only slipped on this one ;) Great.
It looks like the tests have been useful already :) >> The unit test is working nicely on Linux with EMBOSS PHYLIP >> from CVS, I presume you are testing against an official release? >> If you could check that the CVS code works fine on your setup before the >> release that would be great. > > Finally got in front of the right computer to do this. The tests in the > (Biopython) CVS work fine with the official EMBOSS 6.1.0 release > (on ubuntu if that helps). Great - thank you. > I'd offer to try it out on windows but I don't > have EMBOSS, a compiler or any of the libraries that I'd need to > do that! Hmm - EMBOSS only provide a Windows installer for the core EMBOSS suite, not the extras like PHYLIP. I do have a C compiler and cygwin setup on my Windows machine, so it may work. We'll see... Peter From mjldehoon at yahoo.com Tue Sep 22 06:12:37 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 22 Sep 2009 03:12:37 -0700 (PDT) Subject: [Biopython-dev] Blast records Message-ID: <230712.78074.qm@web62406.mail.re1.yahoo.com> Hi everybody, I was looking at an older bug report about the plain-text and XML Blast parsers in Biopython: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 When I was checking the current behavior of Biopython's blast parsers, I noticed that the plain-text parser and the XML parser give different results when parsing psi-blast output. The plain-text parser returns a Blast.Record.PSIBlast object, whereas the XML parser returns Blast.Record.Blast objects. In addition, the XML parser misinterprets the psi-blast XML output (creating a separate Blast record for each psi-blast iteration), whereas the plain-text parser fails on psi-blast output of the current blast program. To fix this, I guess the first step is to decide whether a psi-blast parser should return a Blast.Record.Blast object or a Blast.Record.PSIBlast object. In theory having a Blast.Record.PSIBlast record seems more appropriate.
However, this complicates the parser (it's not clear until halfway through the Blast output if it's Blast or Psi-Blast, which means the user has to tell the parser whether it's Blast or Psi-Blast), and the format of the XML output generated for Blast and Psi-Blast is the same. I would therefore suggest to have one Blast.Record class that can contain both Blast and Psi-Blast output. Any other opinions, comments, suggestions? --Michiel. From biopython at maubp.freeserve.co.uk Tue Sep 22 07:40:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 12:40:46 +0100 Subject: [Biopython-dev] Blast records In-Reply-To: <230712.78074.qm@web62406.mail.re1.yahoo.com> References: <230712.78074.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> On Tue, Sep 22, 2009 at 11:12 AM, Michiel de Hoon wrote: > Hi everybody, > > When I was checking the current behavior of Biopython's blast parsers, > I noticed that the plain-text parser and the XML parser give different > results when parsing psi-blast output. The plain-text parser returns a > Blast.Record.PSIBlast object, whereas the XML parser returns > Blast.Record.Blast objects. ... > > Any other opinions, comments, suggestions? As I recall (backed up by what I wrote in the tutorial), when I last checked, the plain text PSI-BLAST output (i.e. from the command line tool blastpgp) included a lot of information missing in the XML output. Perhaps this has improved? If it hasn't, I am inclined to leave things as they are. If the current PSI-BLAST outputs more details in the XML we may be able to do a better job. The next bit is my recollection of some of the background to this: Classic BLAST (and also RPS-BLAST) allow multiple queries and use the "iterator" block in the XML file for each query.
This was an odd choice of naming, but I think the XML tag was originally only intended for the PSI-BLAST output where each "iteration" block in the XML corresponds to each step of the algorithm. You may recall early versions of BLAST would output "concatenated" XML files for multiple queries - which were not true XML files. I guess they fixed this by reusing the existing "iteration" structure for multiple queries (rather than adding new XML tags). With this in mind the current parsing of the XML from PSI-BLAST makes sense. [In any case, I plan to do Biopython 1.52 this afternoon, with the PSI BLAST parsing left as it is]. Peter From biopython at maubp.freeserve.co.uk Tue Sep 22 09:29:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 14:29:10 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 Message-ID: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> Hi all, As previously announced, I'm going to try and get Biopython 1.52 done this afternoon - and am now declaring a CVS freeze. If all goes to plan, once I've done the release CVS will remain "frozen", and we'll probably get it made read only on the server. Instead, we're going to try and switch over to git (initially on github with a backup on the OBF servers). Stay tuned for further announcements... Peter From p.j.a.cock at googlemail.com Tue Sep 22 12:38:21 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 17:38:21 +0100 Subject: [Biopython-dev] Biopython 1.52 released Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> Dear all, Those of you who signed up to our newsfeed will know this already, but we are pleased to announce the release of Biopython 1.52: http://news.open-bio.org/news/2009/09/biopython-release-152/ Thank you to all our developers, including David Winter for drafting the release announcement, and everyone else who has contributed with feedback, bug reports etc.
Could I also take this opportunity to remind you all we have an application note out in the OUP journal Bioinformatics: http://news.open-bio.org/news/2009/03/biopython-paper-published/ http://dx.doi.org/10.1093/bioinformatics/btp163 In any scientific publication using Biopython, we kindly request you cite this, or another appropriate publication from this list: http://biopython.org/wiki/Documentation#Papers Thank you, Peter From biopython at maubp.freeserve.co.uk Tue Sep 22 12:42:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 17:42:49 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> Message-ID: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> On Tue, Sep 22, 2009 at 2:29 PM, Peter wrote: > Hi all, > > As previously announced, I'm going to try and get Biopython 1.52 > done this afternoon - and am now declaring a CVS freeze. > > If all goes to plan, once I've done the release CVS will remain > "frozen", and we'll probably get it made read only on the server. > Instead, we're going to try and switch over to git (initially on > github with a backup on the OBF servers). > > Stay tuned for further announcements... OK, the release is done. Let's leave things as they are for a day or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate with Bartek about the timings for the git transition. I am considering adding a warning message to setup.py and the readme file as the final commit to CVS, pointing out that we will be moving future development to a git repository. One of the first commit to git would be to remove that warning. Does that make sense? 
Peter From bartek at rezolwenta.eu.org Tue Sep 22 15:46:20 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 22 Sep 2009 21:46:20 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> Message-ID: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote: > > OK, the release is done. Let's leave things as they are for a day > or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate > with Bartek about the timings for the git transition. > > I am considering adding a warning message to setup.py and the > readme file as the final commit to CVS, pointing out that we will > be moving future development to a git repository. One of the first > commit to git would be to remove that warning. Does that make > sense? It seems OK to me. Let me know when you make the last commit, so that I turn off the scripts pushing CVS changes to github, which would be the only technical thing to do to make the transition. From then on, we should commit only to git. Bartek. From biopython at maubp.freeserve.co.uk Tue Sep 22 16:18:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 21:18:12 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> Message-ID: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> On Tue, Sep 22, 2009 at 8:46 PM, Bartek Wilczynski wrote: > On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote: >> >> OK, the release is done. 
Let's leave things as they are for a day >> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate >> with Bartek about the timings for the git transition. >> >> I am considering adding a warning message to setup.py and the >> readme file as the final commit to CVS, pointing out that we will >> be moving future development to a git repository. One of the first >> commit to git would be to remove that warning. Does that make >> sense? > > It seems OK to me. Great. > Let me know when you make the last commit, so that I turn off > the scripts pushing CVS changes to github, ... Will do - I'll give it a day or so just in case we need to do a re-release for anything critical. > ... which would be the only technical thing to do to make the > transition. From then on, we should commit only to git. Yep - although I'll ask the OBF admins to make CVS read only as a precaution. Peter From p.j.a.cock at googlemail.com Tue Sep 22 16:20:54 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 21:20:54 +0100 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> Message-ID: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> > Dear all, > > Those of you who signed up to our newsfeed will know this already, > but we are pleased to announce the release of Biopython 1.52: > > http://news.open-bio.org/news/2009/09/biopython-release-152/ > > Thank you to all our developers, including David Winter for drafting > the release announcement, and everyone else who as contributed > with feedback, bug reports etc. Brad - if everything looks fine, can you do the PyPi upload now? 
Thanks, Peter From chapmanb at 50mail.com Tue Sep 22 16:42:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 22 Sep 2009 16:42:26 -0400 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> Message-ID: <20090922204226.GA13500@sobchak.mgh.harvard.edu> Hi Peter; Congrats to everyone on the release. Peter, thanks as always for all the hard work. > Brad - if everything looks fine, can you do the PyPi upload now? No problem, all set: http://pypi.python.org/pypi/biopython/ I am tempted to secretly commit something to CVS and then vehemently deny doing it to mess with everyone's head. Wait, so then how did the README file get changed? A mystery... Seriously, looking forward to the Git transition, Brad From p.j.a.cock at googlemail.com Tue Sep 22 17:24:11 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 22:24:11 +0100 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <20090922204226.GA13500@sobchak.mgh.harvard.edu> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> <20090922204226.GA13500@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909221424t2cd67249pc1555c382c4f5597@mail.gmail.com> On Tue, Sep 22, 2009 at 9:42 PM, Brad Chapman wrote: > Hi Peter; > Congrats to everyone on the release. Peter, thanks as always for all > the hard work. > >> Brad - if everything looks fine, can you do the PyPi upload now? > > No problem, all set: > > http://pypi.python.org/pypi/biopython/ Lovely :) > I am tempted to secretly commit something to CVS and then vehemently > deny doing it to mess with everyone's head. Wait, so then how did the > README file get changed? A mystery... 
Well, unless you have another CVS account that we don't know about,
it wouldn't be much of a mystery, would it? Grin.

> Seriously, looking forward to the Git transition,

May you live in interesting times? But yeah - should be good.

Peter

From biopython at maubp.freeserve.co.uk Wed Sep 23 06:28:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 11:28:35 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
 <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
 <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
 <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
Message-ID: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>

On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
> Bartek wrote:
>> Let me know when you make the last commit, so that I turn off
>> the scripts pushing CVS changes to github, ...
>
> Will do - I'll give it a day or so just in case we need to do a
> re-release for anything critical.

Hi Bartek,

OK - I think that's it for final commits to CVS (a few notes about
git, and finally adding the warning in setup.py). Not all of these
changes have made it to github yet.

We also need the 1.52 tag ("biopython-152") to get copied over.

Once that is done, could you turn off your CVS to github script,
and let us know by email?
Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Sep 23 10:34:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 15:34:42 +0100 Subject: [Biopython-dev] Blast records In-Reply-To: <154350.7800.qm@web62402.mail.re1.yahoo.com> References: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> <154350.7800.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com> On Wed, Sep 23, 2009 at 2:51 PM, Michiel de Hoon wrote: > > --- On Tue, 9/22/09, Peter wrote: >> As I recall (backed up by what I wrote in the tutorial), >> when I last checked, the plain text PSI-BLAST output >> (i.e. from the command line tool blastpgp) included a >> lot of information missing in the XML output. Perhaps >> this has improved? If it hasn't, I am inclined to leave >> things as they are. If the current PSI-BLAST outputs >> more details in the XML we may be able to do a better job. > > As far as I can tell, the XML contains the same information > as the plain-text psiblast output, but the XML parser doesn't > parse it correctly, since it assumes it is dealing with regular > blast rather than psi-blast. It sounds like the NCBI have changed the PSI BLAST XML output then. >> The next bit is my recollection of some of the background >> to this: >> Classic BLAST (and also RPS-BLAST) allow multiple queries >> and use the "iterator" block in the XML file for each query. >> This was an odd choice of naming, but I think the XML tag was >> originally only intended for the PSI-BLAST outout where each >> "iteration" block in the XML corresponds to each step of the >> algorithm. You may recall early versions of BLAST would output >> "concatenated" XML files for multiple queries - which were not >> true XML files. > > That is correct. 
To make things more complex, if you run > psi-blast with multiple queries you get concatenated XML > files again, with the iteration blocks corresponding to the > psi-blast iterations for each query. Odd - and arguably a bug, since it isn't valid XML. >> I guess they fixed this by reusing the existing "iteration" >> structure for multiple queries (rather than adding new XML >> tags). With this in mind the current parsing of the XML from >> PSI-BLAST makes sense. > > I don't know if it really makes sense. For a single psi-blast > query, we're getting multiple Blast records. For multiple > psi-blast queries, we're iterating over the iteration blocks > while ignoring the fact that they can come from different > queries. Is a single Blast record object for each PSI-BLAST iteration such a bad thing? > Ideally, we should be able to see from the XML whether > it was regular blast with multiple queries, or psi-blast with > a single query. Right now that is possible by looking at > the query-def lines, but I wonder if NCBI is considering > a better solution for this. I'll write an email to them to find out. Certainly clarification from the NCBI sounds useful. Peter From mjldehoon at yahoo.com Wed Sep 23 09:51:04 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 23 Sep 2009 06:51:04 -0700 (PDT) Subject: [Biopython-dev] Blast records In-Reply-To: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> Message-ID: <154350.7800.qm@web62402.mail.re1.yahoo.com> --- On Tue, 9/22/09, Peter wrote: > As I recall (backed up by what I wrote in the tutorial), > when I last checked, the plain text PSI-BLAST output > (i.e. from the command line tool blastpgp) included a > lot of information missing in the XML output. Perhaps > this has improved? If it hasn't, I am inclined to leave > things as they are. If the current PSI-BLAST outputs > more details in the XML we may be able to do a better job. 
As far as I can tell, the XML contains the same information as the
plain-text psiblast output, but the XML parser doesn't parse it
correctly, since it assumes it is dealing with regular blast rather
than psi-blast.

> The next bit is my recollection of some of the background
> to this:
> Classic BLAST (and also RPS-BLAST) allow multiple queries
> and use the "iterator" block in the XML file for each query.
> This was an odd choice of naming, but I think the XML tag was
> originally only intended for the PSI-BLAST output where each
> "iteration" block in the XML corresponds to each step of the
> algorithm. You may recall early versions of BLAST would output
> "concatenated" XML files for multiple queries - which were not
> true XML files.

That is correct. To make things more complex, if you run psi-blast
with multiple queries you get concatenated XML files again, with the
iteration blocks corresponding to the psi-blast iterations for each
query.

> I guess they fixed this by reusing the existing "iteration"
> structure for multiple queries (rather than adding new XML
> tags). With this in mind the current parsing of the XML from
> PSI-BLAST makes sense.

I don't know if it really makes sense. For a single psi-blast query,
we're getting multiple Blast records. For multiple psi-blast queries,
we're iterating over the iteration blocks while ignoring the fact that
they can come from different queries.

Ideally, we should be able to see from the XML whether it was regular
blast with multiple queries, or psi-blast with a single query. Right
now that is possible by looking at the query-def lines, but I wonder if
NCBI is considering a better solution for this. I'll write an email to
them to find out.
--Michiel

From bugzilla-daemon at portal.open-bio.org Wed Sep 23 10:47:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 10:47:16 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
 shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909231447.n8NElGi8003751@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-23 10:47 EST -------
I've looked at PDB file 13GS in more detail, and this doesn't look like a
bug in Biopython, but rather just another odd PDB file. Chains C and D are
only three residue peptides, e.g.

ATOM   3301  N   GLU D   1      16.854  13.061  10.252  1.00 65.68           N
ATOM   3302  CA  GLU D   1      17.100  13.860   9.018  1.00 66.23           C
ATOM   3303  C   GLU D   1      17.937  15.095   9.363  1.00 65.02           C
ATOM   3304  O   GLU D   1      18.510  15.724   8.439  1.00 56.86           O
ATOM   3305  CB  GLU D   1      15.764  14.279   8.389  1.00 66.35           C
ATOM   3306  CG  GLU D   1      15.913  14.994   7.062  1.00 67.41           C
ATOM   3307  CD  GLU D   1      14.584  15.456   6.508  1.00 68.72           C
ATOM   3308  OE1 GLU D   1      13.547  15.340   7.163  1.00 69.08           O
ATOM   3309  OXT GLU D   1      17.998  15.420  10.569  1.00 66.12           O
ATOM   3310  N   CYS D   2      14.618  15.966   5.283  1.00 69.97           N
ATOM   3311  CA  CYS D   2      13.431  16.483   4.614  1.00 70.18           C
ATOM   3312  C   CYS D   2      13.374  15.898   3.213  1.00 69.53           C
ATOM   3313  O   CYS D   2      14.409  15.625   2.610  1.00 65.61           O
ATOM   3314  CB  CYS D   2      13.502  18.008   4.507  1.00 73.18           C
ATOM   3315  SG  CYS D   2      14.485  18.841   5.796  1.00 76.47           S
ATOM   3316  N   GLY D   3      12.166  15.713   2.693  1.00 71.49           N
ATOM   3317  CA  GLY D   3      12.023  15.155   1.360  1.00 75.33           C
ATOM   3318  C   GLY D   3      11.489  13.733   1.399  1.00 78.72           C
ATOM   3319  O   GLY D   3      10.840  13.313   0.413  1.00 79.95           O
ATOM   3320  OXT GLY D   3      11.717  13.031   2.412  1.00 80.37           O
TER    3321      GLY D   3

Look at the C-alpha distances, (17.100, 13.860, 9.018) to
(13.431, 16.483, 4.614) to (12.023, 15.155, 1.360) giving distances of
6.3 and 3.8:

>>> from math import sqrt
>>> import numpy
>>> a = numpy.array((17.100, 13.860, 9.018))
>>> b = numpy.array((13.431, 16.483, 4.614))
>>> c = numpy.array((12.023, 15.155, 1.360))
>>> sqrt(sum((a-b)**2))
6.3037215991825049
>>> sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart,
regardless of whether you use a simple C-alpha distance (as here), or look
at the backbone's N to C bonds.

The "problem" for 13GS goes away if you relax the default distance
threshold, e.g. use PPBuilder(10.0) instead of PPBuilder().

However, whatever affects 1A2D seems to be a different issue...

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bartek at rezolwenta.eu.org Wed Sep 23 11:10:32 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 23 Sep 2009 17:10:32 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
 <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
 <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
 <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
 <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
Message-ID: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>

On Wed, Sep 23, 2009 at 12:28 PM, Peter wrote:
> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
> OK - I think that's it for final commits to CVS (a few notes about
> git, and finally adding the warning in setup.py). Not all of these
> changes have made it to github yet.
>
> We also need the 1.52 tag ("biopython-152") to get copied over.
>
> Once that is done, could you turn off your CVS to github
> script, and let us know by email?

Ta-da! We are no longer synchronizing from CVS!
Please do not commit any changes to the CVS because they are not going
to be transferred to git, which is now _the_ repository for biopython.

Everyone with biopython CVS accounts is welcome to send their github
logins (off the list) to me or Peter to get them added as biopython
collaborators.

cheers
Bartek

From biopython at maubp.freeserve.co.uk Wed Sep 23 11:16:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 16:16:19 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
 <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
 <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
 <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
 <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
 <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
Message-ID: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote:
> On Wed, Sep 23, 2009 at 12:28 PM, Peter wrote:
>> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote:
>> OK - I think that's it for final commits to CVS (a few notes about
>> git, and finally adding the warning in setup.py). Not all of these
>> changes have made it to github yet.
>>
>> We also need to 1.52 tag ("biopython-152") to get copied over.
>>
>> Once that is done, could you turn off your CVS to github
>> script, and let us know by email?
>
> Ta-da! We are no longer synchronizing from CVS!

Lovely... but could you double check the last few commits made it?
i.e. The final commit should be:

setup.py CVS revision 1.174
date: 2009/09/23 10:06:08; author: peterc; state: Exp; lines: +8 -0
Adding a warning about CVS/git to setup.py (which we will remove
once we switch to git) so people know they are using an out of date
repository.
Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Wed Sep 23 11:40:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 23 Sep 2009 11:40:00 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives
 shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909231540.n8NFe0iU005670@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-23 11:39 EST -------
I think the problem with PDB file 1A2D is due to the atypical PYX residue:

# List residues that have a C-alpha atom but fail the is_aa test
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa
structure = PDBParser().get_structure('tmp', '1A2D.pdb')
for model in structure:
    for chain in model:
        for res in chain:
            if "CA" in res.child_dict and not is_aa(res):
                print(chain, res)

The polypeptide code only looks at residues that pass the is_aa test, which
means we can ignore things like water atoms associated with a chain. In this
PDB file there are two residues which fail this test:

According to the SEQADV and MODRES lines, these are modified CYS residues.
Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This
leads me to believe the fix is to add the PYX -> C mapping to Biopython.
[The dictionary used, to_one_letter_code, is actually defined in file
Bio/SCOP/RAF.py for some historical reason.]

Consulting the PDB documentation suggests that there are potentially many
more examples like this of unknown HETATM entries which are modified amino
acid residues... see: ftp://ftp.wwpdb.org/pub/pdb/data/monomers/

Christian - did you find any other problem PDB files?

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
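
[Editorial sketch, not part of the original message.] The effect described
in Comment #3 can be illustrated with a toy model. Everything here is an
assumption made for illustration: THREE_TO_ONE is a cut-down, hypothetical
stand-in for the real to_one_letter_code dictionary, the residue order is
made up, and the real polypeptide builder also applies a distance check,
which this sketch ignores. It only shows how an unrecognised residue name
can split a chain into shorter fragments, and how adding one mapping
removes the split.

```python
# Hypothetical, cut-down lookup table; the real Biopython dictionary
# (to_one_letter_code) is much larger.
THREE_TO_ONE = {"GLU": "E", "CYS": "C", "GLY": "G", "ALA": "A"}


def looks_like_aa(resname, table):
    """Return True if the residue name has a one-letter code in the table."""
    return resname.strip().upper() in table


def chain_to_fragments(resnames, table):
    """Translate residue names, starting a new fragment at unknown residues.

    Assumption: unknown residues break the peptide. This is a simplified
    model, not the actual Bio.PDB algorithm.
    """
    fragments, current = [], []
    for name in resnames:
        if looks_like_aa(name, table):
            current.append(table[name.strip().upper()])
        elif current:
            fragments.append("".join(current))
            current = []
    if current:
        fragments.append("".join(current))
    return fragments


chain = ["GLU", "ALA", "PYX", "GLY", "CYS"]  # made-up residue order
before = chain_to_fragments(chain, THREE_TO_ONE)

patched = dict(THREE_TO_ONE)
patched["PYX"] = "C"  # the mapping proposed in the comment above
after = chain_to_fragments(chain, patched)

print(before)
print(after)
```

With the extra mapping the chain is reported as one fragment rather than
two, which is the kind of change the proposed fix is meant to produce.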
From bugzilla-daemon at portal.open-bio.org Wed Sep 23 11:47:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 23 Sep 2009 11:47:19 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909231547.n8NFlJ39005869@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #4 from schafer at rostlab.org 2009-09-23 11:47 EST ------- Peter, yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, I'll take a look at it and post them here. It's easy to do this. What I did is, I parsed the structures through the dssp structure assignment tool and compared the obtained sequence with that obtained from the Bio.PDB parser. Background: I wanted to map the sequence that dssp sees to atomic coordinates. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Wed Sep 23 11:56:42 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 23 Sep 2009 17:56:42 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> Message-ID: <8b34ec180909230856u235a17ah437e578e02d5e6d3@mail.gmail.com> On Wed, Sep 23, 2009 at 5:16 PM, Peter wrote: > > Lovely... 
but could you double check the last few commits made it?

Sure, your commit didn't make it to github at first, because it was just
two minutes after the last scheduled synchronization. Now it's in github.

cheers
Bartek

From biopython at maubp.freeserve.co.uk Wed Sep 23 12:04:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:04:30 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
 <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
 <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
 <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
 <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
 <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
 <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
Message-ID: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>

On Wed, Sep 23, 2009 at 4:16 PM, Peter wrote:
> On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote:
>>
>> Ta-da! We are no longer synchronizing from CVS!
>>
>
> Lovely... but could you double check the last few commits made it?
> i.e. The final commit should be:
>
> setup.py CVS revision 1.174
> date: 2009/09/23 10:06:08; author: peterc; state: Exp; lines: +8 -0
> Adding a warning about CVS/git to setup.py (which we will remove
> once we switch to git) so people know they are using an out of date
> repository.

It has just shown up in the last few minutes :)

I'm ready to make the first commit directly to github (removing the
new warning from setup.py), assuming everything is fine on your
end Bartek?
Peter

From biopython at maubp.freeserve.co.uk Wed Sep 23 12:34:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Sep 2009 17:34:12 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.52
In-Reply-To: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com>
 <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com>
 <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com>
 <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com>
 <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com>
 <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com>
 <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com>
 <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com>
Message-ID: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com>

On Wed, Sep 23, 2009 at 5:04 PM, Peter wrote:
>
> I'm ready to make the first commit directly to github (removing the
> new warning from setup.py), assuming everything is fine on your
> end Bartek?

OK - that's done now. Thank you Bartek.

Ladies and Gentlemen, we are now running Biopython development
with git :)

Remember - CVS remains frozen (and I'll ask the OBF admins to make
it read only to prevent any accidents).

Now, let's make sure all the documentation and the wiki etc is up to date,
and make an official announcement on the news server.

Those of you who already had CVS access, once you think you are happy
with using git (i.e. you've had a play with your own local repository, and
also ideally tried pushing changes to a personal repository on github),
please ask for collaborator status on github.
Peter From eric.talevich at gmail.com Wed Sep 23 23:48:49 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 23 Sep 2009 23:48:49 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch Message-ID: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> Folks, I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO modules and I'd like your opinion on what else should be done before merging this into the mainline. First, the wiki documentation for PhyloXML has an example pipeline showing how to build a phylogeny in Biopython, from a raw protein sequence to a lightly annotated phyloXML file. http://biopython.org/wiki/PhyloXML#Example_pipeline Does this look like right? I copied the first few steps from the official docs. The source code, for your review, is here: http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/ http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/ http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py Discussion: *TreeIO* The read, parse, write and convert functions work essentially the same as in SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues: (1) 'phyloxml' uses a different object representation than the other two, so converting between those formats is not possible until Nexus.Trees is ported over to Bio.Tree. (2) NexusIO.write() just doesn't seem to work. I don't understand how to make the original Nexus module write out trees that it didn't parse itself. Help? *Tree *The BaseTree module is meant to be the basis for Newick trees eventually, so I'd like to get the design right with the minimum number of public methods: (1) The find() function, named after the Unix utility that does the same thing for directory trees, seems capable of all the iteration and filtering necessary for locating data and automatically adding annotations to a tree. 
There's a 'terminal' argument for selecting internal nodes, external nodes, or both, and I think this means get_leaf_nodes() is unnecessary. I'm going to remove it if no one protests. (2) Should find() be based on depth_first_search or breadth_first_search (not checked in yet)? DFS would potentially find a leaf node faster, but BFS seems more common in phylogenetics. Note that iteration can easily be reversed with the standard reversed() function, so we don't need extra functions for those cases. (3) I left room in each Node for the left and right indexes used by BioSQL's nested-set representation. Now I'm doubting the utility of that -- any Biopython function that uses those indexes would need to ensure that the index is up to date, which seems tricky. Shall I remove all mention of the nested-set representation, or try to support it fully? (4) There's some mention in the literature of a relationship-matrix representation for phylogenies. Does anyone here know how to work with this representation, or know if it would let us perform complex calculations with blinding speed behind the scenes? If so, should there be a function in Bio.Tree.Utils to export a tree to a NumPy array represented this way? If not, I'll forget about it. *Graphics* I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even usable. Plus, the nodes are now a pretty shade of blue. Still, it would be nice to have a Reportlab-based module in Bio.Graphics to print phylogenies in the way biologists are used to seeing them. Does anyone know of existing code that could be borrowed for this? I looked at ETE (announced on the main biopython list last week) and liked the examples, but it uses PyQt4 and a standalone GUI for display, which is a substantial departure from the Biopython way of doing things. 
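
[Editorial sketch, not part of the original message.] The trade-off raised
in point (2) can be made concrete with a small stand-alone example. The
Node class, the find() signature and the `order` parameter below are all
hypothetical and are not the Bio.Tree API; the sketch only shows that one
filtering function can sit on top of either traversal order, and that a
`terminal` filter is enough to list the leaves.

```python
from collections import deque


class Node:
    """Hypothetical minimal tree node; not the Bio.Tree class."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children


def depth_first(root):
    """Yield nodes in preorder (each parent before its children)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


def breadth_first(root):
    """Yield nodes level by level, starting from the root."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def find(root, terminal=None, order=depth_first):
    """Yield nodes in the given order, optionally filtered.

    terminal=True keeps leaves only, terminal=False keeps internal nodes
    only, and terminal=None keeps every node.
    """
    for node in order(root):
        if terminal is None or node.is_terminal() == terminal:
            yield node


tree = Node("root", [Node("X", [Node("A"), Node("B")]), Node("C")])
dfs_names = [n.name for n in find(tree)]
bfs_names = [n.name for n in find(tree, order=breadth_first)]
leaves = [n.name for n in find(tree, terminal=True)]
```

Passing the traversal as an argument is one way to offer both orders
without doubling the number of public methods; two separately named
methods would work equally well.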
Best regards, Eric From mjldehoon at yahoo.com Thu Sep 24 05:33:22 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 24 Sep 2009 02:33:22 -0700 (PDT) Subject: [Biopython-dev] Blast records In-Reply-To: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com> Message-ID: <888743.69260.qm@web62408.mail.re1.yahoo.com> --- On Wed, 9/23/09, Peter wrote: > --- Michiel wrote: > > For a single psi-blast query, we're getting multiple Blast > > records. For multiple psi-blast queries, we're iterating over > > the iteration blocks while ignoring the fact that they can come > from different queries. > > Is a single Blast record object for each PSI-BLAST > iteration such a bad thing? > Well the plain-text PSI-BLAST parser returns a single Record.PSIBlast object containing all of the PSI-BLAST iterations, whereas the XML parser returns multiple Record.Blast objects. Ideally, the plain-text parser and the XML parser should return the same thing. --Michiel. From biopython at maubp.freeserve.co.uk Thu Sep 24 05:57:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 10:57:12 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> Message-ID: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> On Thu, Sep 24, 2009 at 4:48 AM, Eric Talevich wrote: > Discussion: > > *TreeIO* > The read, parse, write and convert functions work essentially the same as in > SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues: Great. One minor point - the docstring for Bio.TreeIO.parse() says: "This is only supported for formats that can represent multiple phylogenetic trees in a single file". Is that true, and if so why? For SeqIO and AlignIO you can use parse on a file with one entry, the iterator just returns one entry. Easy. 
This is important for allowing generic code (e.g. a loop) regardless
of how many entries there are (one, many, or even zero).

On a more general note, you seem to be recreating the file/handle
logic in each of the individual parsers. I think it would make much
more sense to put this logic in the top level Bio.TreeIO.parse(),
Bio.TreeIO.read() and Bio.TreeIO.write() functions *only* and have
the underlying format specific code just use handles. This avoids
the code duplication. [In fact, as I have said before, I prefer the
simplicity of just allowing handles - and we should make TreeIO and
SeqIO/AlignIO consistent]

> (1) 'phyloxml' uses a different object representation than the other two, so
> converting between those formats is not possible until Nexus.Trees is ported
> over to Bio.Tree.

I think that is a blocker - I wouldn't want to release Bio.TreeIO until
it would actually let you do phyloxml -> newick, and phyloxml -> nexus
(and assuming that phyloxml allows very minimal trees, the reverse
as well). It does look like the best plan is to use the same tree
objects for all three (updating Bio.Nexus if possible).

Note that Bio.Nexus.Trees still has some useful methods you don't
appear to support, like finding the last common ancestor and
distances between nodes.

> (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> make the original Nexus module write out trees that it didn't parse itself.
> Help?

To get the Newick tree, you can just call str(tree), which is basically
what you are doing in Bio.TreeIO.NewickIO.

Getting a Nexus file is going to be more complicated. You'll need to
create a minimal Nexus file - have a look at the Bio.AlignIO.NexusIO
code. An alternative to look at is having a hard-coded Nexus template,
and just inserting the tree as a Newick string (and inserting the list of
taxa?). Perhaps Frank or Cymon can advise us.
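
[Editorial sketch, not part of the original message.] The hard-coded
template idea could look roughly like the following. The function name
minimal_nexus, the template text and the default tree name are assumptions
made for illustration and do not describe Bio.Nexus or Bio.TreeIO; the
sketch simply wraps an existing Newick string and a list of taxon labels
in TAXA and TREES blocks.

```python
# Hypothetical template; block layout follows the usual NEXUS conventions.
NEXUS_TEMPLATE = """#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
 Tree %(name)s = %(newick)s
End;
"""


def minimal_nexus(newick, taxa, name="tree1"):
    """Return a minimal NEXUS document holding one Newick tree.

    newick -- tree as a Newick string, with or without the final ';'
    taxa   -- list of taxon labels appearing in the tree
    name   -- label for the tree inside the TREES block (assumed default)
    """
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    return NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "name": name,
        "newick": newick,
    }


nexus_text = minimal_nexus("(A,(B,C))", ["A", "B", "C"])
print(nexus_text)
```

A template like this does no quoting of unusual taxon labels and no
checking that the labels match the tree, so it would only be a starting
point for a real writer.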
> *Tree*
> The BaseTree module is meant to be the basis for Newick trees eventually,
> so I'd like to get the design right with the minimum number of public
> methods:
>
> (1) The find() function, named after the Unix utility that does the same
> thing for directory trees, seems capable of all the iteration and filtering
> necessary for locating data and automatically adding annotations to a tree.
> There's a 'terminal' argument for selecting internal nodes, external nodes,
> or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
> to remove it if no one protests.

I'm in two minds - iterating over the leaves (taxa) seems like a
very common operation, and having an explicit method for this
might be clearer than calling find with special arguments.

> (2) Should find() be based on depth_first_search or breadth_first_search
> (not checked in yet)? DFS would potentially find a leaf node faster, but BFS
> seems more common in phylogenetics. Note that iteration can easily be
> reversed with the standard reversed() function, so we don't need extra
> functions for those cases.

You could do both, either via an argument or having two methods,
say depth_first_search and breadth_first_search instead of find.

> (3) I left room in each Node for the left and right indexes used by BioSQL's
> nested-set representation. Now I'm doubting the utility of that -- any
> Biopython function that uses those indexes would need to ensure that the
> index is up to date, which seems tricky. Shall I remove all mention of the
> nested-set representation, or try to support it fully?

A partial implementation doesn't seem helpful, and wastes memory
allocating unused properties. I would remove it from the base Node,
but a full implementation might be useful for something (would it
be possible via a subclass?).

On a related point, do you think a BioSQL TaxonTree subclass is
possible? i.e. Something mimicking the new Tree objects (as a
subclass), but which loads data on demand from the taxon tables
in a BioSQL database? This would provide a nice way to work with
the NCBI taxonomy (once loaded into BioSQL), which is a very
large tree. For an example use case, I might want to extract just
the bacteria as a subtree, and save that to a file.

> (4) There's some mention in the literature of a relationship-matrix
> representation for phylogenies. Does anyone here know how to work with this
> representation, or know if it would let us perform complex calculations with
> blinding speed behind the scenes? If so, should there be a function in
> Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
> not, I'll forget about it.

I don't know.

> *Graphics*
> I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
> nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
> usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
> nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
> in the way biologists are used to seeing them. Does anyone know of existing
> code that could be borrowed for this? I looked at ETE (announced on the main
> biopython list last week) and liked the examples, but it uses PyQt4 and a
> standalone GUI for display, which is a substantial departure from the
> Biopython way of doing things.

I still haven't tracked down my old report lab code, but it wasn't object
orientated and would need a lot of work to bring up to standard...
Peter From biopython at maubp.freeserve.co.uk Thu Sep 24 06:23:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 11:23:34 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> Message-ID: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> On Wed, Sep 23, 2009 at 5:34 PM, Peter wrote: > > Now, let's make sure all the documentation and the wiki etc is up to date, > and make an official announcement on the news server. > How does this look for a draft news post (with links to wiki pages etc): The release of Biopython 1.52 earlier this week marked the end of an era, it was our last release using CVS for source code control. As of now, Biopython is using a git repository, hosted on github.com who kindly provide git hosting for open source projects free of charge. The BioRuby project have been using github for some time now, so we are in good company. The existing OBF hosted CVS repository will be maintained in the short to medium term as a backup, but will not be updated. Although many people have been involved in this move, we?d like to thank Bartek Wilczynski in particular for handling the CVS to git conversion, and the mirroring our CVS updates to git during the last few months transition period. 
In the next few weeks hopefully we'll get our git usage wiki pages perfected, as we start using git for real. Peter From jhuerta at crg.es Thu Sep 24 06:45:21 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Thu, 24 Sep 2009 12:45:21 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: Hi, ( I'm the developer of ETE. ) I agree that PyQt4 is an important dependency. I chose it because Qt4-QGraphicsScene environment offers many possibilities like openGL rendering, unlimited image size, performance, and good bindings to python. However, I am working on my code to allow the rendering algorithm to use any other graphical library. So, you could render the same tree images using different backends. If you think this is useful for you, please let me know and we can think about how to integrate it with biopython. Regarding the GUI, it is not a standalone application but one more method within the Tree objects. The GUI can be started at any point of the execution and the main program will continue after you close it. I did it like this because I think it is quite useful for working within interactive python sessions. I develop a lot of code around tree handling, so if you think I can help, please tell me. jaime. > > *Graphics* > > I finally fixed the networkx/graphviz/matplotlib drawing to leave > unlabeled > > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps > even > > usable. Plus, the nodes are now a pretty shade of blue. Still, it would > be > > nice to have a Reportlab-based module in Bio.Graphics to print > phylogenies > > in the way biologists are used to seeing them. Does anyone know of > existing > > code that could be borrowed for this?
I looked at ETE (announced on the > main > > biopython list last week) and liked the examples, but it uses PyQt4 and a > > standalone GUI for display, which is a substantial departure from the > > Biopython way of doing things. > > I still haven't tracked down my old report lab code, but it wasn't object > orientated and would need a lot of work to bring up to standard... > > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From bugzilla-daemon at portal.open-bio.org Thu Sep 24 07:14:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 07:14:37 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909241114.n8OBEbKH005629@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 07:14 EST ------- (In reply to comment #4) > Peter, > > yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, > I'll take a look at it and post them here. It's easy to do this. What I did is, > I parsed the structures through the dssp structure assignment tool and compared > the obtained sequence with that obtained from the Bio.PDB parser. Background: I > wanted to map the sequence that dssp sees to atomic coordinates. > If you can give us some more examples that would be very helpful, thank you. 
I have committed a partial fix which means any known modified amino acids (based on the presence of an alpha carbon) will be treated as an amino acid for building the peptide (and given the default sequence letter of X). This will also issue a warning. Any such previously unknown modified amino acid (like PYX) needs to be added to our hard coded lookup table with the appropriate single letter symbol as used by the PDB in their FASTA files (in this case, PYX -> C for cysteine). I suspect that some of your other problem PDB files still have (currently) undefined modified amino acids in them... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Sep 24 07:39:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 12:39:59 +0100 Subject: [Biopython-dev] Committing to github... Message-ID: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> Hi all, My last couple of commits to github have been from a local clone of the *official* repository: http://github.com/biopython/biopython/ This is a nice and simple work flow for small changes, and the history and github network graph are easy to understand: http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch This seems like the easiest way to work for people used to CVS, and you don't need to bother with your own Biopython cloned repository on github (you just need a github account and collaborator status). I'll probably continue to do this in the short term. -- However, prior to that I did a couple of commits via a local clone of *my* personal github repository, http://github.com/peterjc/biopython/ I had kept the master branch on *my* repository identical to the official master.
However, while I was only pushing a tiny change, git did this as a merge - resulting in a flurry of RSS entries and a complicated looking git network diagram. I think it is probably just down to the way we've been using the repositories during the migration? With this backlog of merges done, I expect future commits by this route will look much cleaner... Peter From chapmanb at 50mail.com Thu Sep 24 08:08:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 24 Sep 2009 08:08:00 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <20090924120800.GJ13500@sobchak.mgh.harvard.edu> Eric and Peter; Looking forward to seeing the PhyloXML work merged into the main branch. Eric, thanks for posting the summary of where things are at. > > (1) 'phyloxml' uses a different object representation than the other two, so > > converting between those formats is not possible until Nexus.Trees is ported > > over to Bio.Tree. > > I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would > actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming > that phyloxml allows very minimal trees, the reverse as well). It does look > like the best plan is to use the same tree objects for all three (updating > Bio.Nexus if possible). Agreed that this would be nice to have, but I'm not sure why it's blocking getting the base TreeIO framework and all of PhyloXML into the main branch. That's a major step forward from the format specific phylogenetic code we had before and gets us a portion of the way there. Next up should be moving over Bio.Nexus to the new framework and then conversions, but this is another project. I think we should take this one step at a time. 
> Note that Bio.Nexus.Trees still has some useful methods you don't > appear to support, like finding the last common ancestor and distances > between nodes. Agreed. As we move Nexus over, we should be sure to keep current functionality. > > (1) The find() function, named after the Unix utility that does the same > > thing for directory trees, seems capable of all the iteration and filtering > > necessary for locating data and automatically adding annotations to a tree. > > There's a 'terminal' argument for selecting internal nodes, external nodes, > > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going > > to remove it if no one protests. > > I'm in two minds - iterating over the leaves (taxa) seems like a very > common operation, and having an explicit method for this might be > clearer than calling find with special arguments. I'm for keeping it as well, and just having the underlying implementation of get_leaf_nodes call find with the right arguments. This seems like an operation that should be dead obvious to do. > > (3) I left room in each Node for the left and right indexes used by BioSQL's > > nested-set representation. Now I'm doubting the utility of that -- any > > Biopython function that uses those indexes would need to ensure that the > > index is up to date, which seems tricky. Shall I remove all mention of the > > nested-set representation, or try to support it fully? Again I agree with Peter here -- this would be best supported as a subclass that is database aware with an identical API, similar to how the Seq objects and BioSQL Seq objects work. This avoids any overhead for the in-memory case, which will be more common, but gives you a point to implement the useful database representation code in the future. If you don't have time to work on all of this right now, I'd leave the nested-set stuff out and keep it in mind as a future addition. 
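The delegation suggested above (keep an obvious leaf accessor, but implement it on top of find()) might look roughly like the sketch below. Only the names find(), the 'terminal' argument and get_leaf_nodes() come from this thread; the Node class, its attributes and the traversal order are assumptions made for illustration and are not the code in the phyloxml branch.

```python
class Node:
    """Hypothetical tree node, used only to illustrate the delegation."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        # A node with no children is a leaf (terminal node).
        return not self.children

    def find(self, terminal=None):
        """Yield nodes in preorder.

        terminal=True  -> leaves only
        terminal=False -> internal nodes only
        terminal=None  -> every node
        """
        if terminal is None or terminal == self.is_terminal():
            yield self
        for child in self.children:
            yield from child.find(terminal=terminal)

    def get_leaf_nodes(self):
        """Easy-to-discover wrapper: the leaves, obtained through find()."""
        return list(self.find(terminal=True))


tree = Node("root", [Node("A"), Node("inner", [Node("B"), Node("C")])])
leaf_names = [leaf.name for leaf in tree.get_leaf_nodes()]
internal_names = [node.name for node in tree.find(terminal=False)]
print(leaf_names)
print(internal_names)
```

The wrapper costs one line and keeps all filtering logic in a single place, so any later change to find() automatically applies to the leaf accessor as well.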
Brad From biopython at maubp.freeserve.co.uk Thu Sep 24 08:48:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 13:48:37 +0100 Subject: [Biopython-dev] Git documentation on wiki Message-ID: <320fb6e00909240548q4db8dfc1l83be8408d3b8718f@mail.gmail.com> Hi all, I think I have updated the relevant wiki pages about the CVS to git migration. I have also made the "git" page redirect to the "Source Code" page, which is the main access point. This now has a quick summary with the basic links here for anyone wanting to grab the latest code: http://biopython.org/wiki/SourceCode If anyone spots any errors or typos, feel free to fix them or raise them here for discussion as needed. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Sep 24 10:42:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:42:08 -0400 Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch In-Reply-To: Message-ID: <200909241442.n8OEg8Xo012359@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2894 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 10:42 EST ------- I've actually installed Jython 2.5.0 and checked this. A further fix was required, but this now works with the latest Biopython now in git. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 10:46:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:38 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkc1w012533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 10:46 EST ------- Testing with Jython 2.5.0 shows my fix didn't work. Reopening... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 10:46:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:49 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909241446.n8OEknEX012555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 10:46:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:53 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkrFK012570@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Bug 2893 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 10:46:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:55 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkt93012582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 Bug 2892 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:11:22 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:22 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBM3q013469@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed:

               What        |Removed                     |Added
----------------------------------------------------------------------------
          BugsThisDependsOn|2890                        |
   OtherBugsDependingOnThis|2892, 2893, 2895            |

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 11:11 EST ------- Removing dependencies on other Jython bugs - they don't block each other. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:11:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:25 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200909241511.n8OFBPYu013482@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 biopython-bugzilla at maubp.freeserve.co.uk changed:

               What        |Removed                     |Added
----------------------------------------------------------------------------
   OtherBugsDependingOnThis|2891                        |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:11:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:40 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909241511.n8OFBeug013513@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:11:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:42 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBgcU013525@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:11:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:45 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBj1e013540@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|2891                        |

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 12:10:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 12:10:30 -0400 Subject: [Biopython-dev] [Bug 2918] New: Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2918

           Summary: Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing
           Product: Biopython
           Version: 1.52
          Platform: All
               URL: http://bugs.jython.org/issue1447
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
                CC: kellrott at ucsd.edu

I'm filing this as a bug report so we can track it, but the underlying issue is a known Jython bug, http://bugs.jython.org/issue1447 (thanks Kyle for reporting this already). It can be shown just by running our unit test:

~/jython2.5.0/jython run_tests.py test_Entrez.py
test_Entrez ...
FAIL
======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pjcock/repositories/biopython/Tests/test_Entrez.py", line 3443, in test_journals
    record = Entrez.read(input)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/Parser.py", line 85, in run
    self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
AttributeError: 'XMLParser' object has no attribute 'SetParamEntityParsing'
...

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Sep 24 13:59:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 18:59:06 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20090924120800.GJ13500@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <20090924120800.GJ13500@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909241059yfa43889w82c76cd7f2365dee@mail.gmail.com> On Thu, Sep 24, 2009 at 1:08 PM, Brad Chapman wrote: > Eric and Peter; > Looking forward to seeing the PhyloXML work merged into the main > branch. Eric, thanks for posting the summary of where things are at. > >> > (1) 'phyloxml' uses a different object representation than the other two, so >> > converting between those formats is not possible until Nexus.Trees is ported >> > over to Bio.Tree.
>> >> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would >> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming >> that phyloxml allows very minimal trees, the reverse as well). It does look >> like the best plan is to use the same tree objects for all three (updating >> Bio.Nexus if possible). > > Agreed that this would be nice to have, but I'm not sure why it's > blocking getting the base TreeIO framework and all of PhyloXML into > the main branch. That's a major step forward from the format > specific phylogenetic code we had before and gets us a portion of > the way there. If the Newick/Nexus TreeIO parsers return one object type while the PhyloXML TreeIO parser returns another *incompatible* object type, then we don't have a unified tree input/output framework. Furthermore, if you did release this and then later standardise on a single tree object, you'd break backwards compatibility. All in all, best avoided. > Next up should be moving over Bio.Nexus to the new framework and > then conversions, but this is another project. I think we should > take this one step at a time. What we could do in the short term is ignore Bio.Nexus.Trees, and just leave it as is. Instead of having the Newick/Nexus TreeIO code calling the old Bio.Nexus.Trees code, we just write some new code (possibly based on old code) which will use Eric's new objects. We could then (gradually, perhaps by adding a runtime option to the Nexus parsing API) move Bio.Nexus over from using the old Bio.Nexus.Trees code to the new TreeIO, and eventually deprecate and then remove Bio.Nexus.Trees. 
Peter From eric.talevich at gmail.com Thu Sep 24 23:54:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 24 Sep 2009 23:54:05 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Hello, Jaime, Sorry I didn't respond directly to your earlier post -- I wrote half of an e-mail, then realized I had no good suggestions on what to do so I scrapped it. My Tree and TreeIO code is basically a complete parser for the phyloXML format, plus a few base classes extracted out in hopes of eventually creating a unified set of format-independent objects, as in SeqIO and AlignIO. Your code for working with trees looks much more complete than mine, so if some of it can be incorporated into Biopython, I think that would be great. I see these issues with integration: 1. It's GPL, while Biopython uses a more permissive custom license resembling the BSD and MIT licenses. Would you be willing and able to relicense parts of your work for Biopython? 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will require some compatibility fixes -- not a huge problem. 3. Scipy and numpy dependencies: Numpy is considered a semi-optional dependency in Biopython, so if it can be imported on the fly by just the functions that need it (hopefully no core ones), that would be best. If not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so it would be better to make that an optional, on-the-fly import, too. 4. PyQt4 is a big package and I'm not sure it's as common in scientists' Python installations as numpy and scipy, so if the underlying algorithms for tree layout could be ported to Reportlab, matplotlib or PIL, that would be ideal. 
I personally would like to be able to pair sequence snippets with the leaves of a standard phylogram, so if you need me to do some additional work to get this section ported to Biopython, I'd consider it time well spent. 5. Presumably, the tree object type in ETE is different from Bio.Tree or Bio.Nexus, so porting the core tree manipulation code to Biopython would require a substantial effort somewhere. 6. The PhylomeDB connector is cool, and browsing the source, looks like it wouldn't require much effort at all to drop into Biopython. Thanks for letting us know about this. Cheers, Eric On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas wrote: > Hi, > > ( I'm the developer of ETE. ) > I agree that PyQt4 is an important dependence. I chose it because > Qt4-QGraphicsScene environment offers many possibilities like openGL > rendering, unlimited image size, performance, and good bindings to python. > However, I am working on my code to allow the rendering algorithm to use any > other graphical library. So, you could render the same tree images using > different backends. If you think this is useful for you, please let me know > and we can think how to integrat it with biopython. > Regarding the GUI, it is not a standalone application but one more method > within the Tree objects. The GUI can be started at any point of the > execution and the main program will continue after you close it. I did it > like this because I think is quite useful for working within interactive > python sessions. > > I develop a lot of code around tree handling, so if you think I can help, > please tell me. > jaime. > > > >> > *Graphics* >> > I finally fixed the networkx/graphviz/matplotlib drawing to leave >> unlabeled >> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps >> even >> > usable. Plus, the nodes are now a pretty shade of blue. 
Still, it would >> be >> > nice to have a Reportlab-based module in Bio.Graphics to print >> phylogenies >> > in the way biologists are used to seeing them. Does anyone know of >> existing >> > code that could be borrowed for this? I looked at ETE (announced on the >> main >> > biopython list last week) and liked the examples, but it uses PyQt4 and >> a >> > standalone GUI for display, which is a substantial departure from the >> > Biopython way of doing things. >> >> I still haven't tracked down my old report lab code, but it wasn't object >> orientated and would need a lot of work to bring up to standard... >> >> > > > > > > > >> Peter >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > > > -- > ========================= > Jaime Huerta-Cepas, Ph.D. > CRG-Centre for Genomic Regulation > Doctor Aiguader, 88 > PRBB Building > 08003 Barcelona, Spain > http://www.crg.es/comparative_genomics > ========================= > > From eric.talevich at gmail.com Fri Sep 25 00:34:17 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 25 Sep 2009 00:34:17 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> Hi Peter, Thanks for the feedback. On Thu, Sep 24, 2009 at 5:57 AM, Peter wrote: > > One minor point - the docstring for Bio.TreeIO.parse() says: "This is only > supported for formats that can represent multiple phylogenetic trees in a > single file". Is that true, and if so why? For SeqIO and AlignIO you can > use parse on a file with one entry, the iterator just returns one entry. > Easy. 
> This is important for allowing generic code (e.g. a loop) regardless of how many entries there are (one, many, or even zero).

I'll delete that sentence. I don't know why it's there -- you're right, it's easy to return an iterable regardless of what the format itself supports.

> On a more general note, you seem to be recreating the file/handle logic in each of the individual parsers. I think it would make much more sense to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and Bio.TreeIO.write() functions *only* and have the underlying format specific code just use handles. This avoids the code duplication.

I did the handle management case-by-case because some of the underlying libraries already do filename-to-handle conversion -- ElementTree and Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of ad-hoc handle management, but of course I can move it all to the top if you think it's best. One day, perhaps we'll have a context manager that we can reuse everywhere to make magic easy:

    with maybe_open(file) as handle:
        tree = FooIO.parse(handle)

Not today, though.

> > (1) 'phyloxml' uses a different object representation than the other two, so converting between those formats is not possible until Nexus.Trees is ported over to Bio.Tree.
>
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming that phyloxml allows very minimal trees, the reverse as well). It does look like the best plan is to use the same tree objects for all three (updating Bio.Nexus if possible).

I could comment out the 'nexus' and 'newick' lines from the supported_formats dict. That would disable the top-level functions but leave the direct NexusIO and NewickIO equivalents intact until the port is complete.
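For illustration, the maybe_open() context manager wished for above could be sketched along these lines. This is an assumption about how such a helper might behave, not code from the branch: the name maybe_open() comes from the message, while the rule "treat a str as a filename, anything else as an open handle" and the count_trees() stand-in parser are hypothetical.

```python
import contextlib
import io


@contextlib.contextmanager
def maybe_open(source, mode="r"):
    """Yield a handle for `source`, which is a filename or an open handle.

    Files opened here are closed on exit. Handles passed in by the
    caller are left open, because the caller owns them.
    """
    if isinstance(source, str):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield source


def count_trees(source):
    """Hypothetical stand-in for a parser: count ';'-terminated trees."""
    with maybe_open(source) as handle:
        return handle.read().count(";")


handle = io.StringIO("(A,B);\n(C,(D,E));\n")
n_trees = count_trees(handle)
still_open = not handle.closed
print(n_trees, still_open)
```

With a helper like this, only the top-level parse(), read() and write() functions would need to accept filenames; the format-specific code could assume it always receives a handle.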
> Note that Bio.Nexus.Trees still has some useful methods you don't appear to support, like finding the last common ancestor and distances between nodes.

That's intentional; I was just going to port those methods directly from Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are combined parsers and object representations. My goal is to chop out the pure-object parts and merge them into Bio.Tree, and let the remaining parsers return objects built from the new Bio.Tree classes. This looks like it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be done. For backward compatibility, I'll leave some wrappers that trigger DeprecationWarnings in the original places. Nexus.Trees can probably be reduced to:

    import warnings
    warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)
    from Bio.Tree.Newick import *
    from Bio.TreeIO.NewickIO import *

(more or less)

> > (2) NexusIO.write() just doesn't seem to work. I don't understand how to make the original Nexus module write out trees that it didn't parse itself. Help?
>
> To get the Newick tree, you can just call str(tree), which is basically what you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be more complicated. You'll need to create a minimal Nexus file - have a look at the Bio.AlignIO.NexusIO code. An alternative to look at is having a hard coded nexus template, and just insert the tree as a Newick string (and insert the list of taxa?). Perhaps Frank or Cymon can advise us.

OK, thanks, I'll give it a shot. I see some default Nexus template stuff in Bio.Nexus.Nexus already.
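The hard-coded template idea could be sketched as follows. This does not use the template in Bio.Nexus.Nexus; NEXUS_TEMPLATE and write_nexus() are hypothetical names, and the sketch assumes simple taxon labels that need no quoting (no whitespace or punctuation) and a tree that is already available as a Newick string.

```python
import io

# Minimal NEXUS skeleton: a TAXA block listing the labels and a TREES
# block holding one Newick string. Illustrative only.
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax={ntax};
 TaxLabels {labels};
End;
Begin Trees;
 Tree {name} = {newick}
End;
"""


def write_nexus(newick, taxa, handle, name="tree1"):
    """Write one Newick tree and its taxon list into the template."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"  # the Tree statement must end with a semicolon
    handle.write(
        NEXUS_TEMPLATE.format(
            ntax=len(taxa),
            labels=" ".join(taxa),
            name=name,
            newick=newick,
        )
    )


buffer = io.StringIO()
write_nexus("(A,(B,C))", ["A", "B", "C"], buffer)
output = buffer.getvalue()
print(output)
```

A real implementation would also have to quote or escape labels and decide how to name multiple trees, which is where reusing the existing Bio.Nexus machinery may turn out to be the better route.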
> > *Tree > > *The BaseTree module is meant to be the basis for Newick trees > eventually, > > so I'd like to get the design right with the minimum number of public > > methods: > > > > (1) The find() function, named after the Unix utility that does the same > > thing for directory trees, seems capable of all the iteration and > filtering > > necessary for locating data and automatically adding annotations to a > tree. > > There's a 'terminal' argument for selecting internal nodes, external > nodes, > > or both, and I think this means get_leaf_nodes() is unnecessary. I'm > going > > to remove it if no one protests. > > I'm in two minds - iterating over the leaves (taxa) seems like a very > common operation, and having an explicit method for this might be > clearer than calling find with special arguments. > I think .find(terminal=True) will do the right thing and looks reasonably simple, but as Brad said, this is a ridiculously common operation so finding it in the API should be ridiculously easy. I'll rename this function to get_leaves() and rename find() to findall() (to match ElementTree and make it clear that it returns an iterable). > > (3) I left room in each Node for the left and right indexes used by > BioSQL's > > nested-set representation. Now I'm doubting the utility of that -- any > > Biopython function that uses those indexes would need to ensure that the > > index is up to date, which seems tricky. Shall I remove all mention of > the > > nested-set representation, or try to support it fully? > > A partial implementation doesn't seem helpful, and wastes memory > allocating unused properties. I would remove it from the base Node, > but a full implementation might be useful for something (would it be > possible via a subclass?). > > On a related point, do you think a BioSQL TaxonTree subclass is possible? > i.e. Something mimicking the new Tree objects (as a subclass), but which > loads data on demand from the taxon tables in a BioSQL database? 
This > would provide a nice way to work with the NCBI taxonomy (once loaded > into BioSQL), which is a very large tree. For an example use case, I might > want to extract just the bacteria as a subtree, and save that to a file. > > Doing BioSQL integration was on the original roadmap, but research hasn't taken me back there lately. I would like to do it eventually... anyway, that would solve the indexing issue nicely. I'll drop the extra attributes -- I get the impression they're not meant to be accessed directly in BioSQL either, so there's no use for them in Biopython. Cheers, Eric From biopython at maubp.freeserve.co.uk Fri Sep 25 05:59:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 10:59:08 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> Message-ID: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich wrote: >> >> On a related point, do you think a BioSQL TaxonTree subclass is possible? >> i.e. Something mimicking the new Tree objects (as a subclass), but which >> loads data on demand from the taxon tables in a BioSQL database? This >> would provide a nice way to work with the NCBI taxonomy (once loaded >> into BioSQL), which is a very large tree. For an example use case, I might >> want to extract just the bacteria as a subtree, and save that to a file. >> > > Doing BioSQL integration was on the original roadmap, but research hasn't > taken me back there lately. I would like to do it eventually... anyway, that > would solve the indexing issue nicely. 
> I'll drop the extra attributes -- I
> get the impression they're not meant to be accessed directly in BioSQL
> either, so there's no use for them in Biopython.

As things stand, there is no usage of the left/right index fields in
Biopython. The current Biopython BioSQL code focusses on the database
variants of the Seq and SeqRecord objects. The only interaction with the
taxon tables is to load/retrieve the species annotations, and for this we
don't need the complications of the left/right index. We leave them empty
if we populate the taxonomy via Entrez (recalculating the left/right
values is computationally expensive).

However, any "DBTaxonTree" object (or whatever we call it) could
potentially offer us a way to (a) populate and (b) use these alternative
indexes as a way to speed up various subtree operations.

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 06:08:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 11:08:56 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
 <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
 <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250308s35a286e7x67a7bb3fec6a0673@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich wrote:
>> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
>> supported for formats that can represent multiple phylogenetic trees in a
>> single file". Is that true, and if so why? For SeqIO and AlignIO you can
>> use parse on a file with one entry, the iterator just returns one entry.
>> This is important for allowing generic code (e.g. a loop) regardless of
>> how many entries there are (one, many, or even zero).
>
> I'll delete that sentence.
> I don't know why it's there -- you're right, it's
> easy to return an iterable regardless of what the format itself supports.

OK.

>> On a more general note, you seem to be recreating the file/handle logic
>> in each of the individual parsers. I think it would make much more sense
>> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
>> and Bio.TreeIO.write() functions *only* and have the underlying format
>> specific code just use handles. This avoids the code duplication.
>
> I did the handle management case-by-case because some of the underlying
> libraries already do filename-to-handle conversion -- ElementTree and
> Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
> ad-hoc handle management, but of course I can move it all to the top if you
> think it's best.

Having a single layer of handle/filename conversion in Bio.TreeIO
does seem cleanest to me (even if some of the back ends allow
either) and will ensure our code is consistent.

> One day, perhaps we'll have a context manager that we can
> reuse everywhere to make magic easy:
>
> with maybe_open(file) as handle:
>     tree = FooIO.parse(handle)
>
> Not today, though.

Not yet, no. For one thing we'll have to phase out Python 2.4 support.

>>> (1) 'phyloxml' uses a different object representation than the other two,
>>> so converting between those formats is not possible until Nexus.Trees
>>> is ported over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
>> would actually let you do phyloxml -> newick, and phyloxml -> nexus
>> (and assuming that phyloxml allows very minimal trees, the reverse
>> as well). It does look like the best plan is to use the same tree objects
>> for all three (updating Bio.Nexus if possible).
>
> I could comment out the 'nexus' and 'newick' lines from the
> supported_formats dict.
That would disable the top-level functions > but leave the direct NexusIO and NewickIO equivalents intact until > the port is complete. I guess shipping a "phyloxml" only Bio.TreeIO would work, but it would be rather less useful. We could certainly start with just that on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no Bio.TreeIO.NexusIO modules - initially have just a single backend). >> Note that Bio.Nexus.Trees still has some useful methods you don't >> appear to support, like finding the last common ancestor and >> distances between nodes. > > That's intentional, I was just going to port those methods directly from > Bio.Nexus.Trees rather than invent a new API myself. OK - sounds good. > Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are > combined parsers and object representations. My goal is to chop out the > pure-object parts and merge them into Bio.Tree, and let the remaining > parsers return objects built from the new Bio.Tree classes. This looks like > it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be > done. Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has been to separate the data object from the (many possible) parsers. > For backward compatibility, I'll leave some wrappers that trigger > DeprecationWarnings in the original places. Nexus.Trees can > probably be reduced ... Something like that, sure. >>> (1) The find() function, named after the Unix utility that does the >>> same thing for directory trees, seems capable of all the iteration >>> and filtering necessary for locating data and automatically adding >>> annotations to a tree. There's a 'terminal' argument for selecting >>> internal nodes, external nodes, or both, and I think this means >>> get_leaf_nodes() is unnecessary. I'm going to remove it if no one >>> protests. 
>> >> I'm in two minds - iterating over the leaves (taxa) seems like a very >> common operation, and having an explicit method for this might be >> clearer than calling find with special arguments. > > I think .find(terminal=True) will do the right thing and looks reasonably > simple, but as Brad said, this is a ridiculously common operation so > finding it in the API should be ridiculously easy. I'll rename this function > to get_leaves() and rename find() to findall() (to match ElementTree > and make it clear that it returns an iterable). OK. Peter From hlapp at gmx.net Fri Sep 25 07:39:03 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 25 Sep 2009 07:39:03 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> Message-ID: On Sep 25, 2009, at 5:59 AM, Peter wrote: > On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich > wrote: >>> >>> On a related point, do you think a BioSQL TaxonTree subclass is >>> possible? >>> i.e. Something mimicking the new Tree objects (as a subclass), but >>> which >>> loads data on demand from the taxon tables in a BioSQL database? >>> This >>> would provide a nice way to work with the NCBI taxonomy (once loaded >>> into BioSQL), which is a very large tree. For an example use case, >>> I might >>> want to extract just the bacteria as a subtree, and save that to a >>> file. >>> >> >> Doing BioSQL integration was on the original roadmap, but research >> hasn't >> taken me back there lately. I would like to do it eventually... >> anyway, that >> would solve the indexing issue nicely. 
I'll drop the extra >> attributes -- I >> get the impression they're not meant to be accessed directly in >> BioSQL >> either, so there's no use for them in Biopython. > > As things stand, there is no usage of the left/right index fields in > Biopython. The left/right fields are really a crutch for doing hierarchical (recursive) queries in SQL more efficiently. SQL doesn't have native support for recursive queries, and the left/right index values allow you to rewrite an otherwise recursive query as a single-hit set. Within an object-oriented programming language that supports recursion these values are of no use - they don't let you traverse a tree faster than you would already be able to do through recursing up or down your tree data structure. If there's a natural order of nodes, you can speed up finding nodes through binary search. But for pulling out lineages or subtrees I doubt that this will help at all - it'll have to be your data structure (such as having double links) that makes those operations efficient. 
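
For anyone who hasn't met the nested-set trick, a toy sketch in Python of
why the left/right values help on the SQL side: once every node carries a
(left, right) pair from a depth-first walk, "all descendants of X" becomes
a plain range comparison with no recursion. The function names and the
tiny taxonomy below are made up for illustration; this is not BioSQL code.

```python
def assign_nested_set(children, root):
    """Return {node: (left, right)} numbered by a depth-first walk."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter          # number assigned on the way down
        counter += 1
        for child in children.get(node, ()):
            visit(child)
        bounds[node] = (left, counter)  # number assigned on the way back up
        counter += 1

    visit(root)
    return bounds


def descendants(bounds, ancestor):
    """Select a subtree by range comparison alone, as a SQL WHERE would."""
    left, right = bounds[ancestor]
    return sorted(
        node for node, (lo, hi) in bounds.items() if left < lo and hi < right
    )


# Toy taxonomy (illustrative only).
children = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["E. coli", "B. subtilis"],
    "Eukaryota": ["Human"],
}
bounds = assign_nested_set(children, "root")
bacteria_subtree = descendants(bounds, "Bacteria")
```

The flip side is also visible here: inserting or moving one node means
renumbering many others, which is the expensive recalculation mentioned
earlier in the thread, and inside Python the children mapping already
gives you the subtree by ordinary recursion.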
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri Sep 25 08:26:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 13:26:38 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> Message-ID: <320fb6e00909250526s294eee65ubbc508136f26f48a@mail.gmail.com> On Thu, Sep 24, 2009 at 11:23 AM, Peter wrote: > On Wed, Sep 23, 2009 at 5:34 PM, Peter wrote: >> >> Now, let's make sure all the documentation and the wiki etc is up to date, >> and make an official announcement on the news server. > > How does this look for a draft news post (with links to wiki pages etc): > > The release of Biopython 1.52 earlier this week marked the end of an > era, it was our last release using CVS for source code control. ... 
I went ahead and posted something based on that draft: http://news.open-bio.org/news/2009/09/biopython-cvs-to-git-migration/ Nice to see several more people have started following the github repository already :) Peter From jhuerta at crg.es Fri Sep 25 11:28:36 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Fri, 25 Sep 2009 17:28:36 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: Hi Eric, Thanks for your comments, I really see a lot of potential parts in ETE that could be used from biopython, however, for the moment, we would rather prefer not to modify current ETE's GPL license. As far as I know, the main difference between GPL and BSD-like licenses is that, with the second, you could relicense the code at any moment under any other policy, including private and close licenses. GPL includes a protection for this by ensuring that any code based on GPL sources must be always GPL compatible, and that's why we have chosen it. Moreover, the use of a BSD-like license would prevent us to use a lot of great GPL code out there. It is not my purpose to open a debate about licenses. I just wonder if biopython could provide any way to link/bind external software, perhaps as addons or plugins. This would be great, since many extra features (not only from ETE but from other sources) could be added on specific demands. This would also mitigate the problem of very specific dependencies, since many of them would be optional. From my side, I could work for providing bindings between biopython and ETE's tree graphical rendering features, inline visualization GUI, extended newick support, tree manipulation and the methods within the ETE package. 
I will be out of the office for several weeks, but if you see any way to collaborate I will be happy to discuss this a bit more in detail... Cheers! Jaime On Fri, Sep 25, 2009 at 5:54 AM, Eric Talevich wrote: > Hello, Jaime, > > Sorry I didn't respond directly to your earlier post -- I wrote half of an > e-mail, then realized I had no good suggestions on what to do so I scrapped > it. > > My Tree and TreeIO code is basically a complete parser for the phyloXML > format, plus a few base classes extracted out in hopes of eventually > creating a unified set of format-independent objects, as in SeqIO and > AlignIO. Your code for working with trees looks much more complete than > mine, so if some of it can be incorporated into Biopython, I think that > would be great. > > I see these issues with integration: > 1. It's GPL, while Biopython uses a more permissive custom license > resembling the BSD and MIT licenses. Would you be willing and able to > relicense parts of your work for Biopython? > > 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will > require some compatibility fixes -- not a huge problem. > > 3. Scipy and numpy dependencies: Numpy is considered a semi-optional > dependency in Biopython, so if it can be imported on the fly by just the > functions that need it (hopefully no core ones), that would be best. If > not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so > it would be better to make that an optional, on-the-fly import, too. > > 4. PyQt4 is a big package and I'm not sure it's as common in scientists' > Python installations as numpy and scipy, so if the underlying algorithms for > tree layout could be ported to Reportlab, matplotlib or PIL, that would be > ideal. I personally would like to be able to pair sequence snippets with the > leaves of a standard phylogram, so if you need me to do some additional work > to get this section ported to Biopython, I'd consider it time well spent. > > 5. 
Presumably, the tree object type in ETE is different from Bio.Tree or > Bio.Nexus, so porting the core tree manipulation code to Biopython would > require a substantial effort somewhere. > > 6. The PhylomeDB connector is cool, and browsing the source, looks like it > wouldn't require much effort at all to drop into Biopython. > > Thanks for letting us know about this. > > Cheers, > Eric > > > > On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas wrote: > >> Hi, >> >> ( I'm the developer of ETE. ) >> I agree that PyQt4 is an important dependence. I chose it because >> Qt4-QGraphicsScene environment offers many possibilities like openGL >> rendering, unlimited image size, performance, and good bindings to python. >> However, I am working on my code to allow the rendering algorithm to use any >> other graphical library. So, you could render the same tree images using >> different backends. If you think this is useful for you, please let me know >> and we can think how to integrat it with biopython. >> Regarding the GUI, it is not a standalone application but one more method >> within the Tree objects. The GUI can be started at any point of the >> execution and the main program will continue after you close it. I did it >> like this because I think is quite useful for working within interactive >> python sessions. >> >> I develop a lot of code around tree handling, so if you think I can help, >> please tell me. >> jaime. >> >> >> >>> > *Graphics* >>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave >>> unlabeled >>> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps >>> even >>> > usable. Plus, the nodes are now a pretty shade of blue. Still, it would >>> be >>> > nice to have a Reportlab-based module in Bio.Graphics to print >>> phylogenies >>> > in the way biologists are used to seeing them. Does anyone know of >>> existing >>> > code that could be borrowed for this? 
I looked at ETE (announced on the >>> main >>> > biopython list last week) and liked the examples, but it uses PyQt4 and >>> a >>> > standalone GUI for display, which is a substantial departure from the >>> > Biopython way of doing things. >>> >>> I still haven't tracked down my old report lab code, but it wasn't object >>> orientated and would need a lot of work to bring up to standard... >>> >>> >> >> >> >> >> >> >> >>> Peter >>> >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> >> >> -- >> ========================= >> Jaime Huerta-Cepas, Ph.D. >> CRG-Centre for Genomic Regulation >> Doctor Aiguader, 88 >> PRBB Building >> 08003 Barcelona, Spain >> http://www.crg.es/comparative_genomics >> ========================= >> >> > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From eric.talevich at gmail.com Fri Sep 25 11:51:15 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 25 Sep 2009 11:51:15 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> Hi Jaime, Just working on bindings would certainly be easier. The best way to transfer tree information from Biopython to ETE would be serializing the trees in phyloXML format (to preserve the annotations) and loading that file in ETE. 
I see that ETE allows rich annotation of tree objects, but I don't
see phyloXML or NeXML listed as supported file formats -- is there another
standard format you're using to store this information?

If not, I think ETE would benefit from a phyloXML parser. Since Biopython
license is GPL-compatible (I believe), you could borrow
Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade
classes to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node
classes.

Beyond that, some support for BioSQL to store sequences etc. would also
help link ETE to any of the other Bio* projects. There's some example code
in Biopython's top-level BioSQL directory, if you're interested.

Cheers,
Eric

From jhuerta at crg.es Fri Sep 25 12:13:44 2009
From: jhuerta at crg.es (Jaime Huerta Cepas)
Date: Fri, 25 Sep 2009 18:13:44 +0200
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
 <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
 <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com>
 <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com>
Message-ID:

Hi,

> Just working on bindings would certainly be easier. The best way to
> transfer tree information from Biopython to ETE would be serializing the
> trees in phyloXML format (to preserve the annotations) and loading that file
> in ETE. I see that ETE allows rich annotation of tree objects, but I don't
> see phyloXML or NeXML listed as supported file formats -- is there another
> standard format you're using to store this information?

Extended newick (http://www.phylosoft.org/NHX/) is the only rich format
currently supported by ETE, however only text string representation of tree
node annotations are allowed by this standard. Beyond this, you should use
a cpickle approach to save complex annotated trees. I'm certainly
interested in PhyloXML and NexML support, so, for sure, this could be a
nice starting point.

> If not, I think ETE would benefit from a phyloXML parser. Since Biopython
> license is GPL-compatible (I believe), you could borrow
> Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes
> to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes.

I think there is no problem in using BSD license from GPL sources, the
problem would be in the other way around.
Then I will take a look at your phyloxml code to find the best way to bind both packages through phyloXML serialization. > Beyond that, some support for BioSQL to store sequences etc. would also > help link ETE to any of the other Bio* projects. There's some example code > in Biopython's top-level BioSQL directory, if you're interested. > Ok. I'll take a look also. Thanks. cheers, Jaime. > > Cheers, > Eric > > > On Fri, Sep 25, 2009 at 11:28 AM, Jaime Huerta Cepas wrote: > >> Hi Eric, >> >> Thanks for your comments, >> I really see a lot of potential parts in ETE that could be used from >> biopython, however, for the moment, we would rather prefer not to modify >> current ETE's GPL license. As far as I know, the main difference between >> GPL and BSD-like licenses is that, with the second, you could relicense the >> code at any moment under any other policy, including private and close >> licenses. GPL includes a protection for this by ensuring that any code based >> on GPL sources must be always GPL compatible, and that's why we have chosen >> it. Moreover, the use of a BSD-like license would prevent us to use a lot of >> great GPL code out there. >> >> It is not my purpose to open a debate about licenses. I just wonder if >> biopython could provide any way to link/bind external software, perhaps as >> addons or plugins. This would be great, since many extra features (not only >> from ETE but from other sources) could be added on specific demands. This >> would also mitigate the problem of very specific dependencies, since many of >> them would be optional. From my side, I could work for providing bindings >> between biopython and ETE's tree graphical rendering features, inline >> visualization GUI, extended newick support, tree manipulation and the >> methods within the ETE package. >> >> I will be out of the office for several weeks, but if you see any way to >> collaborate I will be happy to discuss this a bit more in detail... >> >> Cheers! 
>> Jaime >> >> >> On Fri, Sep 25, 2009 at 5:54 AM, Eric Talevich wrote: >> >>> Hello, Jaime, >>> >>> Sorry I didn't respond directly to your earlier post -- I wrote half of >>> an e-mail, then realized I had no good suggestions on what to do so I >>> scrapped it. >>> >>> My Tree and TreeIO code is basically a complete parser for the phyloXML >>> format, plus a few base classes extracted out in hopes of eventually >>> creating a unified set of format-independent objects, as in SeqIO and >>> AlignIO. Your code for working with trees looks much more complete than >>> mine, so if some of it can be incorporated into Biopython, I think that >>> would be great. >>> >>> I see these issues with integration: >>> 1. It's GPL, while Biopython uses a more permissive custom license >>> resembling the BSD and MIT licenses. Would you be willing and able to >>> relicense parts of your work for Biopython? >>> >>> 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will >>> require some compatibility fixes -- not a huge problem. >>> >>> 3. Scipy and numpy dependencies: Numpy is considered a semi-optional >>> dependency in Biopython, so if it can be imported on the fly by just the >>> functions that need it (hopefully no core ones), that would be best. If >>> not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so >>> it would be better to make that an optional, on-the-fly import, too. >>> >>> 4. PyQt4 is a big package and I'm not sure it's as common in scientists' >>> Python installations as numpy and scipy, so if the underlying algorithms for >>> tree layout could be ported to Reportlab, matplotlib or PIL, that would be >>> ideal. I personally would like to be able to pair sequence snippets with the >>> leaves of a standard phylogram, so if you need me to do some additional work >>> to get this section ported to Biopython, I'd consider it time well spent. >>> >>> 5. 
Presumably, the tree object type in ETE is different from Bio.Tree or >>> Bio.Nexus, so porting the core tree manipulation code to Biopython would >>> require a substantial effort somewhere. >>> >>> 6. The PhylomeDB connector is cool, and browsing the source, looks like >>> it wouldn't require much effort at all to drop into Biopython. >>> >>> Thanks for letting us know about this. >>> >>> Cheers, >>> Eric >>> >>> >>> >>> On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas wrote: >>> >>>> Hi, >>>> >>>> ( I'm the developer of ETE. ) >>>> I agree that PyQt4 is an important dependence. I chose it because >>>> Qt4-QGraphicsScene environment offers many possibilities like openGL >>>> rendering, unlimited image size, performance, and good bindings to python. >>>> However, I am working on my code to allow the rendering algorithm to use any >>>> other graphical library. So, you could render the same tree images using >>>> different backends. If you think this is useful for you, please let me know >>>> and we can think how to integrat it with biopython. >>>> Regarding the GUI, it is not a standalone application but one more >>>> method within the Tree objects. The GUI can be started at any point of the >>>> execution and the main program will continue after you close it. I did it >>>> like this because I think is quite useful for working within interactive >>>> python sessions. >>>> >>>> I develop a lot of code around tree handling, so if you think I can >>>> help, please tell me. >>>> jaime. >>>> >>>> >>>> >>>>> > *Graphics* >>>>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave >>>>> unlabeled >>>>> > nodes inconspicuous, so the resulting graphic is much cleaner, >>>>> perhaps even >>>>> > usable. Plus, the nodes are now a pretty shade of blue. Still, it >>>>> would be >>>>> > nice to have a Reportlab-based module in Bio.Graphics to print >>>>> phylogenies >>>>> > in the way biologists are used to seeing them. 
Does anyone know of >>>>> existing >>>>> > code that could be borrowed for this? I looked at ETE (announced on >>>>> the main >>>>> > biopython list last week) and liked the examples, but it uses PyQt4 >>>>> and a >>>>> > standalone GUI for display, which is a substantial departure from the >>>>> > Biopython way of doing things. >>>>> >>>>> I still haven't tracked down my old report lab code, but it wasn't >>>>> object >>>>> orientated and would need a lot of work to bring up to standard... >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>>> Peter >>>>> >>>>> _______________________________________________ >>>>> Biopython-dev mailing list >>>>> Biopython-dev at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>>>> >>>> >>>> >>>> >>>> -- >>>> ========================= >>>> Jaime Huerta-Cepas, Ph.D. >>>> CRG-Centre for Genomic Regulation >>>> Doctor Aiguader, 88 >>>> PRBB Building >>>> 08003 Barcelona, Spain >>>> http://www.crg.es/comparative_genomics >>>> ========================= >>>> >>>> >>> >> >> >> -- >> ========================= >> Jaime Huerta-Cepas, Ph.D. >> CRG-Centre for Genomic Regulation >> Doctor Aiguader, 88 >> PRBB Building >> 08003 Barcelona, Spain >> http://www.crg.es/comparative_genomics >> ========================= >> >> > -- ========================= Jaime Huerta-Cepas, Ph.D. 
CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From biopython at maubp.freeserve.co.uk Fri Sep 25 12:22:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 17:22:40 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> Message-ID: <320fb6e00909250922y858c172xf1ee51f7673a4fe2@mail.gmail.com> On Fri, Sep 25, 2009 at 5:13 PM, Jaime Huerta Cepas wrote: > > I think there is no problem in using BSD license from GPL sources, the > problem would be in the other way around. > Yes, that way round is fine from a license point of view (taking Biopython's BSD/MIT style licensed code and using it in a GPL project). But we can't take your GPL code into Biopython unless you re-license it more liberally. I can see the appeal of the (L)GPL for forcing the code to stay open, but Biopython (like Python) went for the other option of basically letting anyone use the code in anyway they like. Peter From hlapp at gmx.net Fri Sep 25 16:58:36 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 25 Sep 2009 16:58:36 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: On Sep 25, 2009, at 11:28 AM, Jaime Huerta Cepas wrote: > As far as I know, the main difference between GPL and BSD-like > licenses is that, with the second, you could relicense the code at > any moment under any other policy, including private and close > licenses. 
This is not true. None of the open-source licenses that I'm aware of allows anyone to relicense code under a license that is less liberal, or to relicense code at all. It is the copyright owner who can relicense code, not the distributor. One of the differences between GPL and BSD is that GPL is viral. Specifically, code that links to GPL-licensed code must also be GPL- licensed *when it is distributed.* (It is a common misconception that GPL is unconditionally viral. I can take GPL code and link to it and keep my code closed source for as long as I please if I never redistribute it. GPL was written with software vendors in mind, whose business consists of distributing software for commercial gain. GPL has therefore sometimes been called anti-commercial. This is wrong, too, but I won't go into the details here.) Biopython can freely utilize GPL-licensed (or closed source, for that matter) software if it doesn't link to it. IANAL but I think it can also redistribute GPL-licensed code along with Biopython so long as Biopython doesn't link to it, and it is made clear that some of the distribution falls under a different license than BSD. (Linux distributions mix BSD and GPL software, too.) As for ETE itself, a BSD/MIT style license seems to be the by far most widely used license for Python modules. If you want to facilitate adoption of the software as a library by other programmers, GPL is going to stand in the way of that. Also, really all that you are accomplishing with GPL is that a software company can't take advantage of ETE. Is that your chief concern? GPL won't prevent any scientific lab from writing closed source code that builds on ETE and publishing the results, so long as they don't distribute their closed source code. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From chapmanb at 50mail.com Fri Sep 25 17:48:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 25 Sep 2009 17:48:00 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: <20090925214800.GE29829@sobchak.mgh.harvard.edu> Hi all; Hilmar -- thanks for writing up a nice summary of the license details. Jaime, I think it's a shame we would let these issues prevent working together. It sounds like you and Eric have some shared goals and it would be great to see that evolve into some useful functionality in Biopython. Generally, the BSD-like license which Biopython uses encourages cooperation and keeps people at both academia and industry happy. As scientists, our goal should be to avoid letting these types of issues preventing collaboration. Truthfully, there is very little opportunity for exploitation of bioinformatics software; the economics are just not there for companies to sell code. > (It is a common misconception that GPL is unconditionally viral. I can > take GPL code and link to it and keep my code closed source for as > long as I please if I never redistribute it. GPL was written with > software vendors in mind, whose business consists of distributing > software for commercial gain. GPL has therefore sometimes been called > anti-commercial. This is wrong, too, but I won't go into the details > here.) I agree 100%, but in practical terms it is very difficult to have this argument at a company. 
Speaking from experience, GPL creates all kinds of nasty thoughts in people's heads which prevents adoption of code in corporate environments. For Biopython and other bioinformatics projects, we should be actively encouraging contributions from companies as well as academia. > Biopython can freely utilize GPL-licensed (or closed source, for that > matter) software if it doesn't link to it. IANAL but I think it can > also redistribute GPL-licensed code along with Biopython so long as > Biopython doesn't link to it, and it is made clear that some of the > distribution falls under a different license than BSD. (Linux > distributions mix BSD and GPL software, too.) Yes, but this complication is bad. Let's keep it simple, Brad From bugzilla-daemon at portal.open-bio.org Fri Sep 25 18:48:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Sep 2009 18:48:13 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909252248.n8PMmDa9028782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1214 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-25 18:48 EST ------- (From update of attachment 1214) Checked into git, leaving this bug open until we've run some more tests on this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Sep 26 07:36:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 26 Sep 2009 07:36:45 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909261136.n8QBajsI014127@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-26 07:36 EST ------- We'll also need to update the SeqIO GenBank output to record the CONTIG string if present. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hlapp at gmx.net Sat Sep 26 11:25:41 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 26 Sep 2009 11:25:41 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <20090925214800.GE29829@sobchak.mgh.harvard.edu> Message-ID: On Sep 25, 2009, at 5:48 PM, Brad Chapman wrote: > I agree 100%, but in practical terms it is very difficult to have this > argument at a company. Yes, I know. > For Biopython and other bioinformatics projects, we should be > actively encouraging contributions from companies as well as academia. Having worked in commercial and private sector for almost a decade, I couldn't agree more. There is a huge amount of open-source code development contributed by people working in the private sector, and which is hence sponsored by companies. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jhuerta at crg.es Sat Sep 26 13:12:59 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Sat, 26 Sep 2009 19:12:59 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: Hey! Sorry, It was not my intention to open a flame about licences nor to sound rude. I apologize if I did. > As far as I know, the main difference between GPL and BSD-like licenses is >> that, with the second, you could relicense the code at any moment under any >> other policy, including private and close licenses. >> > > > This is not true. None of the open-source licenses that I'm aware of allows > anyone to relicense code under a license that is less liberal, or to > relicense code at all. It is the copyright owner who can relicense code, not > the distributor. > > I'm not an expert on software licences, so I can not enter into this issue very deeply. What I said in my previous email is what I could understand from these info: http://www.gnu.org/philosophy/license-list.html, http://www.gnu.org/philosophy/categories.html#Non-CopyleftedFreeSoftware If I was wrong and modified BSD-like sources cannot be relicensed under other less liberal licenses, then we will kindly consider a change of the ETE license in the future. > One of the differences between GPL and BSD is that GPL is viral. > Specifically, code that links to GPL-licensed code must also be GPL-licensed > *when it is distributed.* > > (It is a common misconception that GPL is unconditionally viral. 
I can take > GPL code and link to it and keep my code closed source for as long as I > please if I never redistribute it. GPL was written with software vendors in > mind, whose business consists of distributing software for commercial gain. > GPL has therefore sometimes been called anti-commercial. This is wrong, too, > but I won't go into the details here.) > I see, so the only problem is about distribution... > Biopython can freely utilize GPL-licensed (or closed source, for that > matter) software if it doesn't link to it. IANAL but I think it can also > redistribute GPL-licensed code along with Biopython so long as Biopython > doesn't link to it, and it is made clear that some of the distribution falls > under a different license than BSD. (Linux distributions mix BSD and GPL > software, too.) > Yes, I agree. This is what I meant by Biopython addons. With this in mind, Biopython could be aware of much other software out there and benefit from it. Is there any work around this in Biopython? > As for ETE itself, a BSD/MIT style license seems to be the by far most > widely used license for Python modules. If you want to facilitate adoption > of the software as a library by other programmers, GPL is going to stand in > the way of that. Also, really all that you are accomplishing with GPL is > that a software company can't take advantage of ETE. Is that your chief > concern? Well, our intention was that code based on ETE sources (other tools or improvements) would also be distributed/published as free software. We also wanted to leave an open door to use other GPL software from ETE. > GPL won't prevent any scientific lab from writing closed source code that > builds on ETE and publishing the results, so long as they don't distribute > their closed source code. Yes. You are right. We don't want to avoid this. In any case, thanks for your comments. I will try to get more info about what you say and, if we have to modify something, we will do it.
:) cheers, Jaime > > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From jhuerta at crg.es Sat Sep 26 13:28:02 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Sat, 26 Sep 2009 19:28:02 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <20090925214800.GE29829@sobchak.mgh.harvard.edu> Message-ID: Hi Brad, > Jaime, I think it's a shame we would let these issues > prevent working together. It sounds like you and Eric have some > shared goals and it would be great to see that evolve into some > useful functionality in Biopython. > Sure!! My only intention was to find the best way to contribute! However, the "viral" GPL license was specifically chosen for exactly this reason: encouraging free software and academic scientific resources. We have a lot of shared goals, so I trust we will find a happy way to collaborate. Jaime. -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From jblanca at btc.upv.es Mon Sep 28 07:36:14 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 28 Sep 2009 13:36:14 +0200 Subject: [Biopython-dev] fpc and gff Message-ID: <200909281336.14794.jblanca@btc.upv.es> Sorry for the previous incomplete mail.
:( Hi: I'm interested in parsing an fpc physical map and writing a gff3 file from it. That's done by the fpc people in bioperl and they go from fpc to gff2. I would like to do it in python. I've written the fpc parser looking at the bioperl one. You can take a look at: http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py Now I have to create the gff structure and writer. I've been reading Brad's code regarding the GFF parser and writer. I would like to integrate my fpc work as much as possible with Biopython and if you like it we could add the fpc to Biopython in the future. But I do not have a clear idea of the relation between GFF and SeqFeature. The main problem is the subfeature and the gff feature hierarchy. My take on that at the moment is to write a GFFFeature class similar to the gff feature with seqid, source, type, start, end, score, etc. and go from the fpc to GFFFeature objects. I know that this would not integrate nicely with Biopython. Could you give some hint on how to do it in a proper way? Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From jblanca at btc.upv.es Mon Sep 28 07:28:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 28 Sep 2009 13:28:06 +0200 Subject: [Biopython-dev] fpc and gff Message-ID: <200909281328.06817.jblanca@btc.upv.es> Hi: I'm interested in parsing an fpc physical map and writing a gff3 file from it. That's done by the fpc people in bioperl and they go from fpc to gff2. I would like to do it in python. I've written the fpc parser looking at the bioperl one. You can take a look at: -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Mon Sep 28 07:52:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 12:52:56 +0100 Subject: [Biopython-dev] fpc and gff In-Reply-To: <200909281336.14794.jblanca@btc.upv.es> References: <200909281336.14794.jblanca@btc.upv.es> Message-ID: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> On Mon, Sep 28, 2009 at 12:36 PM, Jose Blanca wrote: > Sorry for the previous incomplete mail. :( > > Hi: > I'm interested in parsing an fpc physical map and writing a gff3 file from it. > That's done by the fpc people in bioperl and they go from fpc to gff2. I > would like to do it in python. > I've written the fpc parser looking at the bioperl one. You can take a look > at: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py > > Now I have to create the gff structure and writer. I've been reading Brad's > code regarding the GFF parser and writer. I would like to integrate my fpc > work as much as posible with biopython and if you like it we could add the > fpc to Biopython in the future. > But I have not a clear idea on the relation between GFF and SeqFeature. The > main problem is the subfeature and the gff feature hierarchy. My take on that > at the moment is to write a GFFfeature class similar to the gff feature with > seqid, source, type, start, end, score, etc. and go from the fpc to > GFFFeature objects. I know that this would not integrate nicely with > BioPython. Could you give some hint on how to do it in a proper way? > Best regards, Right now there isn't a "proper way" as Brad's GFF code hasn't been integrated into Biopython yet. 
I think Brad was thinking of using the SeqFeature object "as is" to hold GFF features, with the sub-features list used for the hierarchy. Michiel and I had suggested a simpler structure more faithful to the GFF model might be useful - even if it was just a standardised tuple of the start, end, strand, id, etc, and an annotation dictionary). For the SeqIO interface, these GFF features would have to be turned into normal SeqFeature objects of course. Peter From chapmanb at 50mail.com Mon Sep 28 08:52:38 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 28 Sep 2009 08:52:38 -0400 Subject: [Biopython-dev] fpc and gff In-Reply-To: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> References: <200909281336.14794.jblanca@btc.upv.es> <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> Message-ID: <20090928125238.GG29829@sobchak.mgh.harvard.edu> Jose; Glad you're interested in working on this. I'm happy to get the GFF3 writing up to speed for this task. > > I'm interested in parsing an fpc physical map and writing a gff3 file from it. [...] > > But I have not a clear idea on the relation between GFF and SeqFeature. The > > main problem is the subfeature and the gff feature hierarchy. My take on that > > at the moment is to write a GFFfeature class similar to the gff feature with > > seqid, source, type, start, end, score, etc. and go from the fpc to > > GFFFeature objects. > Right now there isn't a "proper way" as Brad's GFF code hasn't > been integrated into Biopython yet. Yes, we still have some flexibility here since it hasn't been merged into Biopython yet, so let's talk about what works best. > I think Brad was thinking of using the SeqFeature object "as is" to hold > GFF features, with the sub-features list used for the hierarchy. 
What exists now takes an iterator of SeqRecord objects, and writes each SeqFeature as a GFF3 line:

  seqid      -- SeqRecord ID
  source     -- Feature qualifier with key "source"
  type       -- Feature type attribute
  start, end -- The Feature Location
  score      -- Feature qualifier with key "score"
  strand     -- Feature strand attribute
  phase      -- Feature qualifier with key "phase"

The remaining qualifiers are the final key/value pairs of the attributes column. The hierarchy is represented as sub_features of the parent feature. This handles any arbitrarily deep nesting of parent and child features. There is some really basic code on the documentation page: http://biopython.org/wiki/GFF_Parsing#Writing_GFF3 > Michiel and I had suggested a simpler structure more faithful to the > GFF model might be useful - even if it was just a standardised tuple > of the start, end, strand, id, etc, and an annotation dictionary). For > the SeqIO interface, these GFF features would have to be turned > into normal SeqFeature objects of course. This could also be useful for a more lightweight representation. I would rather see this type of representation with primary Python types, as opposed to a GFFFeature specific class. The current SeqRecord/SeqFeature implementation is relatively close to what a GFF specific class would be so there would be a lot of duplication without saving much in terms of speed or memory. Jose, let me know if you'd rather go with a SeqRecord approach or a lightweight approach. If you provide a couple of examples of the features you want to store, we can work through how to best represent those in the GFF hierarchy and then the details of prepping them for writing.
Brad From biopython at maubp.freeserve.co.uk Mon Sep 28 09:10:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 14:10:22 +0100 Subject: [Biopython-dev] fpc and gff In-Reply-To: <20090928125238.GG29829@sobchak.mgh.harvard.edu> References: <200909281336.14794.jblanca@btc.upv.es> <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> <20090928125238.GG29829@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909280610q75f7bf4eqae49a1fb6d7eae38@mail.gmail.com> On Mon, Sep 28, 2009 at 1:52 PM, Brad Chapman wrote: > >> Michiel and I had suggested a simpler structure more faithful to the >> GFF model might be useful - even if it was just a standardised tuple >> of the start, end, strand, id, etc, and an annotation dictionary). For >> the SeqIO interface, these GFF features would have to be turned >> into normal SeqFeature objects of course. > > This could also be useful for a more lightweight representation. I > would rather see this type of representation with primary Python > types, as opposed to a GFFFeature specific class. The current > SeqRecord/SeqFeature implementations is relatively close to what > a GFF specific class would be so there would be a lot of duplication > without saving much in terms of speed or memory. Indeed. Which is why I quite like the idea of a simple tuple of ints, strings and a dict for the annotation (the final column of a GFF file). This should also be fast for people dealing with big GFF files. The other plus point here is we can get this (GFF parsing/writing using basic Python objects) into Biopython first, and then look at the SeqIO side of things more carefully as a second merge. I may be overly cautious but I want the resulting GFF <-> SeqRecord <-> GenBank/EMBL/etc mapping to try and follow established practice as closely as possible, which will need lots of testing and probably some tweaking of this mapping. i.e. 
To me there is a natural break between the basics of GFF parsing/writing, and the transformation into our existing object models. [This applies to all file formats in principle, but most are so simple that it isn't really an issue worth worrying about.] Peter From bugzilla-daemon at portal.open-bio.org Mon Sep 28 15:37:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Sep 2009 15:37:21 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909281937.n8SJbLYq012300@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-28 15:37 EST ------- (In reply to comment #6) > We'll also need to update the SeqIO GenBank output to record the CONTIG > string if present. Done, marking as fixed. Assuming there are no objections to the whole approach (treating the CONTIG data as a string) that is... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Sep 28 16:09:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 21:09:12 +0100 Subject: [Biopython-dev] Committing to github... 
In-Reply-To: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> Message-ID: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com> On Thu, Sep 24, 2009 at 12:39 PM, Peter wrote: > Hi all, > > My last couple of commits to github have been from a local clone > of the *official* repository: http://github.com/biopython/biopython/ > > This is a nice and simple work flow for small changes, and the > history and github network graph are easy to understand: > http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch > > This seems like the easiest way to work for people used to CVS, > and you don't need to bother with your own Biopython cloned > repository on github (you just need a github account and > collaborator status). I'll probably continue to do this in the short > term. This way of working (described above) is what I have been using for the last week. If there are multiple developers working (or in this case, developers using multiple machines), you can still get interesting mini-branches and merges even like this. Have a look at the Biopython github network diagram for today for a nice simple example (which was accidental - but serves as a nice illustration). [I know for some of you the following discussion isn't needed, but I think it is worth trying to explain - even if just for me, to make sure it is clear in my head what git is doing.] In words, the main trunk was split, with a (trivial) change to the tutorial done on one branch (me at work) and then two separate commits on a separate branch (unit tests tweak, and GenBank bug fix), again by me, but on my home computer. The two branches were then merged into one. Why did this happen? I was working on a local and very slightly out of date copy of the repository at home, and made these local commits. I then tried to push them to github.
At that point git gave me an error saying something else had been
committed in the meantime (in fact by me but on a different computer)
so my local repository was out of date. So I pulled and merged the
latest code from github (the tutorial change), and then pushed this
to github. Done. The merge was 100% automatic because the files
changed were independent.

Back on CVS, as these changes were on separate files, there wouldn't
have been any issue about merging.

Does it matter? No. But we can reduce the likelihood of these
baby branches and merges by getting into the habit of pulling
the latest code from github *before* making any local commits
(a sensible thing to do anyway).

[Did that make sense? On the one hand this is very simple, but on
the other hand, it is rather different to how I used to think about
the code history under CVS.]

Peter

From eric.talevich at gmail.com Mon Sep 28 16:47:38 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 28 Sep 2009 16:47:38 -0400
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
	<320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
Message-ID: <3f6baf360909281347r32c39918s4a2c8a64cff44622@mail.gmail.com>

On Mon, Sep 28, 2009 at 4:09 PM, Peter wrote:
>
> Does it matter? No. But we can reduce the likelihood of these
> baby branches and merges by getting into the habit of pulling
> the latest code from github *before* making any local commits
> (a sensible thing to do anyway).
>

If you've committed local changes while your repository is out of date
and want to avoid a baby branch, you can also use "git rebase
origin/master" to fix the history. (But probably, most developers will
find it easier and safer to leave the baby branches there.)
Extended example:

git checkout dev          # a development branch
# hack hack
git commit -a
# oops, we're out of sync
git checkout master       # a clean copy of upstream
git pull origin master    # updating like we should have earlier
git rebase master dev     # replays dev on top of master, leaves us on dev
git checkout master       # switch back before merging
git merge dev             # Should be fast-forward
git push

Cheers,
Eric

From bugzilla-daemon at portal.open-bio.org Mon Sep 28 17:01:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:01:08 -0400
Subject: [Biopython-dev] [Bug 2919] New: Writing SeqFeature qualifiers
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

           Summary: Writing SeqFeature qualifiers
           Product: Biopython
           Version: 1.51
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: estrain at gmail.com

When writing SeqFeature qualifier key-value pairs, the output contains one
line for each character in the value, rather than simply printing the string.
The sample code at the bottom produces a genbank sequence file that
illustrates the problem.

If I create a qualifiers dictionary using "qualDict = dict(gene="geneA")", the
genbank output contains

     gene            1..6
                     /gene="g"
                     /gene="e"
                     /gene="n"
                     /gene="e"
                     /gene="A"

The offending code appears to be in the InsdcIO.py file, lines 482-483. If I
change

482:            for value in values :
483:                self.write_feature_qualifier(key,value)

to

                self.write_feature_qualifier(key,values)

then the function appears to work correctly.
     gene            1..6
                     /gene="geneA"

###########################################################
## Sample code
###########################################################
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

qualDict = dict(gene="geneA")

my_seq = SeqRecord(Seq("ATGATC",IUPAC.ambiguous_dna),id="seq1")
my_seq.features.append((SeqFeature(FeatureLocation(0,6),type="gene",qualifiers=qualDict)))

out_handle = open("test.gbk","w")
SeqIO.write([my_seq],out_handle,"genbank")

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Sep 28 17:22:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:22:32 -0400
Subject: [Biopython-dev] [Bug 2919] Writing SeqFeature qualifiers
In-Reply-To:
Message-ID: <200909282122.n8SLMW8w014482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-28 17:22 EST -------
It was working as intended - for consistency with the GenBank (and other)
parsers, you were expected to use lists of strings as the feature qualifier
dictionary values (not just strings).

However, a similar request was made on the mailing list recently, and a fix
checked in (after Biopython 1.52 was released):
http://lists.open-bio.org/pipermail/biopython/2009-September/005585.html

Marking as fixed.
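
The behaviour in the report follows from iterating over a plain string, which yields one character at a time. A minimal sketch of how a writer could accept either a single string or a list of strings is below; the function name qualifier_lines is made up for this illustration, and this is not the fix that was checked in:

```python
def qualifier_lines(key, values):
    """Return one GenBank style qualifier line per value.

    Hypothetical helper for illustration only. A bare string is
    wrapped in a list first, because looping over a string would
    otherwise give one line per character.
    """
    if isinstance(values, str):
        values = [values]
    return ['/%s="%s"' % (key, value) for value in values]


# A single string and a list of strings both give whole values.
single = qualifier_lines("gene", "geneA")
several = qualifier_lines("db_xref", ["GI:1", "GeneID:2"])
```

With the list-of-strings convention from comment #1, the original sample would instead build its dictionary as dict(gene=["geneA"]).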
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Sep 29 12:41:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 29 Sep 2009 12:41:08 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909291641.n8TGf8HE011375@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-29 12:41 EST ------- Really fixed this time, tested on Jython 2.5.0 and 2.5.1rc3 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Sep 30 11:27:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Sep 2009 16:27:03 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? Message-ID: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> Hi all, A few months back on the main mailing list, Cedar and I were talking about taking a SeqRecord, and how to write out its reverse complement to a file. The thread is archived here: http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html Cedar - I cc'd you, as I am not sure if you are on the dev list. I expect this could get technical pretty quickly, so I wanted to float this idea on the dev list first... 
-----------------------------------------------------------------

So, the background to this discussion:

Unless there is some complicated annotation to transfer, using
Biopython as is, making a new SeqRecord using the reverse
complement sequence of the old SeqRecord isn't very hard, see:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement

This has meant that generally the current status quo isn't a
problem (at least for me).

However, what prompted me to work on this issue was a real
world example. We have a draft genome where after doing a
basic annotation, it would make sense to flip the strands. I
want to be able to load our current GenBank file, apply the
reverse complement, and have all the annotated features
recalculated to match. With more and more sequencing
projects, this isn't such an odd thing to want to do.

Dealing with the details of potentially complex locations in
SeqFeature objects isn't very nice, so I think it would be
useful to have this particular functionality built into Biopython.
It is also a small step towards making the SeqRecord more
Seq like (which in general seems a good idea).

On Thu, Jun 25, 2009 at 12:20 AM, Peter wrote:
>
> What you are doing is fine - although personally I might wrap up the
> first line as a function, as done in the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement
>
> While we could add a reverse_complement() method to the SeqRecord
> (and other Seq methods, like translate etc), there is one big problem:
> What to do with the annotation. If your record used to have a name
> based on an accession or a GI number, then this really does not apply
> to the reverse complement (or a translation etc). We could do something
> arbitrary like adding an "rc_" prefix (or variants) but I think the only safe
> answer is to make the user think about this and do what is appropriate
> in their context.
And as you have demonstrated, this can still be done > in one line :) > > I make a habit of using this as a justification, but I feel the zen of > Python "Explicit is better than implicit" applies quite well here. I've been thinking about this on and off since then, and I still maintain that for much of the annotation there is no easy answer. For the sequence itself, the behaviour is well defined. For all the annotation, there are three possible actions: (a) User supplies a new value (b) Reuse the old value (c) No annotation (the default for a new SeqRecord) We can do something sensible with the features (if present) and it will probably make sense to copy but reverse any per-letter annotation (if present). On a github branch I have posted some experimental code which adds a reverse_complement() method to the SeqRecord. I propose to give the new reverse_complement() a set of optional arguments (id, name, etc) following the same names as the existing attributes (and __init__ arguments), allowing the user to choose between these three actions. Assuming the general scheme is popular, I'm quite open to discussing changing these defaults. But for the first implementation this is what I picked: For the id, name and description I still lean towards making the user decide this, and therefore the default is (c). Likewise for the annotations dictionary and the database cross refs. For the features and per-letter-annotation, I would opt to make the default behaviour be to reuse the old data, option (b) above. For the per-letter-annotation (the restricted dictionary, letter_annotations) this just means reversing each entry. For the features, this means reversing the order of the features, switching their strands (if set), and calculating the new coordinates (taking care of all the possible fuzzy locations and sub-features). 
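
For the simple case of exact (non-fuzzy) locations, the coordinate arithmetic described above can be sketched in plain Python. This is only an illustration: it assumes Python style zero-based, half-open coordinates and strands of +1/-1, the helper names are made up, and it is not the code on the branch:

```python
def reverse_complement(seq):
    """Reverse complement an unambiguous DNA string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def flip_feature(start, end, strand, length):
    """Map a feature onto the reverse complement of a sequence.

    Assumes 0 <= start <= end <= length (half-open), exact positions,
    and strand +1, -1 or None (None is left unset).
    """
    new_strand = -strand if strand is not None else None
    return length - end, length - start, new_strand


seq = "ATGATC"
quality = [10, 20, 30, 40, 50, 60]            # per-letter annotation
features = [(0, 3, +1), (4, 6, -1)]           # (start, end, strand)

rc_seq = reverse_complement(seq)
rc_quality = quality[::-1]                    # per-letter data is just reversed
# Flip each feature, and reverse the list so it stays in positional order.
rc_features = [flip_feature(s, e, st, len(seq)) for s, e, st in reversed(features)]
```

A quick sanity check on the design: the first feature covered "ATG" on the forward strand, and after flipping it sits at 3..6 on the reverse strand of "GATCAT", whose reverse complement is again "ATG".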
The code is here if anyone wants to look at the technical details:
http://github.com/peterjc/biopython/commits/seqrecords

Peter

From chapmanb at 50mail.com Tue Sep 1 13:06:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Sep 2009 09:06:39 -0400
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
	<320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
	<20090831132451.GD75451@sobchak.mgh.harvard.edu>
	<320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
Message-ID: <20090901130639.GI75451@sobchak.mgh.harvard.edu>

Hi Peter;

[indexed dict usage]
> What file formats where you working on, and how many records?

It was a 100Mb fasta file with about 41,000 records. Nothing too
heavy but it worked great.

The only change I made was to generalize the record building line:

self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)

to allow an arbitrary function to be passed to define the
identifier, instead of defaulting to the first part of the line.
This is helpful for those fun NCBI ids
(gi|83029091|ref|XM_357633.3|) where other parts of the program only
have the accession number.

> True. Have got any bright ideas for a better name? While the
> index is in memory, the SeqRecord objects are not (unlike the
> original Bio.SeqIO.to_dict() function).
>
> Or we have one function Bio.SeqIO.indexed_dict() which can
> either use an in memory index, OR an on disk index, offering
> the same functionality.

That's a nice idea -- provide some reasonable defaults based on file
size and type, and allow them to be over-ridden with function
params.

> >> Another option (like the shelve idea we talked about last month)
> >> is to parse the sequence file with SeqIO, and serialise all the
> >> SeqRecord objects to disk, e.g. with pickle or some key/value
> >> database.
This is potentially very complex (e.g. arbitrary Python > >> objects in the annotation), and could lead to a very large "index" > >> file on disk. On the other hand, some possible back ends would > >> allow editing the database... which could be very useful. > > > > My thought here was to use BioSQL and the SQLite mappings for > > serializing. We build off a tested and existing serialization, and > > also guide people into using BioSQL for larger projects. > > Essentially, we would build an API on top of existing BioSQL > > functionality that creates the index by loading the SQL and then > > pushes the parsed records into it. > > Using BioSQL in this way is a much more general tool than > simply "indexing a sequence file". It feels like a sledgehammer > to crack a nut. Also, do you expect it to scale well for 10 million > plus short reads? It may do, but on the other hand it may not. Agreed that it would introduce extra overhead for something like short reads. If you are talking about serializing SeqRecords, it would make sense to re-use what we have in BioSQL. If you are talking about storing just file offsets, then a lightweight solution makes more sense. For me, the initial parse time to prepare an index is not as much of an issue since it happens once while queries on it will happen multiple times. > Also while the current BioSQL mappings are "tried and tested", > they don't cover everything, in particular per-letter-annotation > such as a set of quality scores (something that needs addressing > anyway, probably with JSON or XML serialisation). Agreed, but the advantage is that improvements can feed back into BioSQL, instead of work in parallel. > All the above make me lean towards a less ambitious target > (read only dictionary access to a sequence file), which just > requires having an (on disk) index of file offsets (which could > be done with SQLite or anything else suitable). This choice > could even be done on the fly at run time (e.g. 
we look at the > size of the file to decide if we should use an in memory index > or on disk - or start out in memory and if the number of records > gets too big, switch to on disk). That makes sense. SQLite has in-memory caching which could help with some of the decision making as it would handle writing and holding in memory without having to reimplement that bit. Another file based indexing scheme is the one in bx-python: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py This is a bit more specific as it also handles queries based on genomic intervals in addition to retrieving by file position. It may be useful for looking at the underlying storage details. Brad From biopython at maubp.freeserve.co.uk Tue Sep 1 13:25:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 14:25:22 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: > Hi Peter; > > [indexed dict usage] >> What file formats where you working on, and how many records? > > It was a 100Mb fasta file with about 41,000 records. Nothing too > heavy but it worked great. Yeah, with just 41,000 keys and offsets the in memory dict would be pretty small too. This is within the range of file sizes I expect the Bio.SeqIO.indexed_dict() functionality to be used on. Cool. 
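
To make the "keys and offsets" idea concrete, here is a minimal sketch of an in memory offset index for a FASTA file, including an optional callback that maps the default identifier to a different key. The names index_fasta, get_raw and accession_only are made up for this illustration, it assumes a handle opened in binary mode and a non-empty title line for every record, and it is not the Bio.SeqIO code under discussion:

```python
import io


def index_fasta(handle, key_function=None):
    """Return a dict mapping record keys to byte offsets.

    Only the ">" title lines are inspected, so nothing is parsed into
    record objects. The default key is the first word of the title.
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].strip().split(None, 1)[0].decode()
            if key_function is not None:
                key = key_function(key)
            offsets[key] = offset
    return offsets


def get_raw(handle, offsets, key):
    """Seek to a record and return its raw text (title plus sequence lines)."""
    handle.seek(offsets[key])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines)


def accession_only(identifier):
    """Turn gi|83029091|ref|XM_357633.3| into XM_357633.3 (example callback)."""
    parts = identifier.split("|")
    return parts[3] if len(parts) > 3 else identifier


data = b">gi|83029091|ref|XM_357633.3| test\nACGT\nAC\n>other desc\nGG\n"
handle = io.BytesIO(data)
offsets = index_fasta(handle, key_function=accession_only)
raw = get_raw(handle, offsets, "XM_357633.3")
```

The dictionary holds one small integer per record, which is why 41,000 records is not a concern for memory; the file is only read again when a record is requested.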
> The only change I made was to generalize the record building line:
>
> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>
> to allow an arbitrary function to be passed to define the
> identifier, instead of defaulting to the first part of the line.
> This is helpful for those fun NCBI ids
> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
> have the accession number.

Did your callback function get given the "title string" and return
the desired key?

I had wondered about this, but the only way for this to be general
(to work on all file formats) is for the callback function to be given
a SeqRecord object - which means having to fully parse the file
during the indexing, which ends up being *much* slower. We can
do this if you think it adds a lot of utility i.e. mimic the key_function
argument we already have on Bio.SeqIO.to_dict()

Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 13:38:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 14:38:07 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
References: <200904161146.28203.jblanca@btc.upv.es>
	<21f6d10e0905221848v572b35c3u7f9b697d5fb3700c@mail.gmail.com>
	<320fb6e00905230428i1d3d090fg40eed6478868af4f@mail.gmail.com>
	<21f6d10e0905230911j50d34e90s6a7a2d076e239ced@mail.gmail.com>
	<15d850bd0907221116h4240dff6r6404b065e9464987@mail.gmail.com>
	<320fb6e00907221225r1a54bc2na224c8ee1b1714df@mail.gmail.com>
	<15d850bd0907221351n7b154e62nc592bf01c393a10e@mail.gmail.com>
	<320fb6e00907230234g75b1ee5foca335073b19b5448@mail.gmail.com>
	<320fb6e00908120554y3c116f27lb07d62c665fc1f78@mail.gmail.com>
	<320fb6e00908140500n56e7ccbcl7123099b8de06ccf@mail.gmail.com>
Message-ID: <320fb6e00909010638v5c9cec06t66b24e1e755c46cb@mail.gmail.com>

On Fri, Aug 14, 2009 at 1:00 PM, Peter wrote:
>>> Jose's code uses seek/tell which means it has to have a handle
>>> to an actual file.
He also used binary read mode - I'm not sure if >>> this was essential or not. >> >> Binary mode was not essential - opening an SFF file in default >> mode also seemed to work fine with Jose's code. > > Having worked on this more, default mode or binary mode are fine. > However, as you might expect, you can't use Python's universal > read lines mode when parsing SFF files. Just to clarify this for the record - on Unix you can parse an SFF file opened in default mode ("r") or binary mode ("rb") but not universal read line mode ("rU"). However, on Windows only binary mode works. I've updated my SFF code on github to catch this (as otherwise the error messages are rather cryptic). Peter From biopython at maubp.freeserve.co.uk Tue Sep 1 13:56:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Sep 2009 14:56:26 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <20090901130639.GI75451@sobchak.mgh.harvard.edu> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909010656h594e908cu246138d45442df45@mail.gmail.com> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: > >Peter wrote: >> Using BioSQL in this way is a much more general tool than >> simply "indexing a sequence file". It feels like a sledgehammer >> to crack a nut. Also, do you expect it to scale well for 10 million >> plus short reads? It may do, but on the other hand it may not. > > Agreed that it would introduce extra overhead for something like > short reads. If you are talking about serializing SeqRecords, it > would make sense to re-use what we have in BioSQL. I wasn't talking about serialising SeqRecord objects. 
I agree there is (almost) no point implementing new serialisation
code when we already have BioSQL.

> If you are talking about storing just file offsets, then a lightweight
> solution makes more sense.

Indeed.

> For me, the initial parse time to prepare an index is not as much
> of an issue since it happens once while queries on it will happen
> multiple times.

It depends on the expected work load - if you are thinking about
indexing a local copy of GenBank, but only expect to pull out a
few (hundred) records, then the index time may be longer than
the total access time. But in general, if we are talking about
saving the index to a file (which can then be reloaded) I would
agree, the up front cost to prepare the index isn't critical.

On the subject of how to store an index of file offsets on disk,
I think the old Biopython Martel/Mindy indexing code used to
create OBDA style indexes (either simple flat files or BDB based).
We should certainly consider these for cross project compatibility,
or perhaps introduce a new OBDA version which might use
something like SQLite internally instead?
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
http://lists.open-bio.org/pipermail/open-bio-l/2009-September/000567.html

Peter

From eoc210 at googlemail.com Wed Sep 2 12:25:24 2009
From: eoc210 at googlemail.com (Ed Cannon)
Date: Wed, 2 Sep 2009 13:25:24 +0100
Subject: [Biopython-dev] OBO2OWL parser / converter
In-Reply-To: <3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
References: <9e02410b0908301233k6b43f2e3wba791a405d5028a3@mail.gmail.com>
	<3AA994B7-B2FB-4D3B-A929-D6F5A9297BB2@gmx.net>
Message-ID: <9e02410b0909020525w5cbf59dek46e0ab1b5144f8@mail.gmail.com>

Hi Hilmar,

My OBO2OWL parser is implemented based on Tirmizi & Miranker's paper titled
"OBO2OWL: Roundtrip between OBO and OWL" (
www.cs.utexas.edu/~hamid/pub/tirmizi-obo2owl-tr-06-47.pdf ) [1].
After having looked at the link you sent me to the OBO2OWL mappings google
spreadsheet, it appears that there are some differences, which I'm looking
into at the minute.

Ref:

1. Syed Hamid Tirmizi and Daniel P Miranker. (2006). OBO2OWL: Roundtrip
between OBO and OWL. The University of Texas at Austin, Department of Computer
Sciences, Technical Report TR-06-47, October 2, 16 pages.

Cheers,

Ed

2009/8/31 Hilmar Lapp

> Hi Ed -
>
> is your converter operating in a way that is congruent with (or even
> utilizing) the mapping and the converter provided by the NCBO and Berkeley
> Ontology projects?
>
> http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page
>
> If not, I'm not sure how beneficial it is for users to have multiple and
> possibly conflicting mappings.
>
> -hilmar
>
>
> On Aug 30, 2009, at 3:33 PM, Ed Cannon wrote:
>
> Hi All,
>>
>> I would like to thank you guys for all your hard work and effort in making
>> biopython a great piece of open software.
>>
>> I would also like to introduce myself, my name is Ed Cannon, I am a
>> postdoc
>> at Cambridge University working in the fields of chemo/bioinformatics and
>> semantic web technologies in the group of Peter Murray-Rust.
>>
>> Since a fair amount of my work involves ontologies, I have written an open
>> biomedical ontology (.obo) to web ontology language (.owl) converter. The
>> resultant file can be loaded and used from Protege. I was wondering if
>> this
>> software would be of any interest to the biopython community? I have just
>> sent a pull request to biopython on github. The code is located at my
>> branch
>> on my account: http://github.com/eoc21/biopython/tree/eoc21Branch.
>> >> Thanks, >> >> Ed >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > From bugzilla-daemon at portal.open-bio.org Wed Sep 2 15:24:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Sep 2009 11:24:19 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909021524.n82FOJ7U021693@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-02 11:24 EST ------- (In reply to comment #3) > I can now parse the Roche SFF index, allowing fast random access to > the reads. See: > > http://github.com/peterjc/biopython/commits/index > http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006603.html > > Peter That branch now has support for SeqIO parsing, indexing and *writing* of SFF files. The write support is still very new and needs more testing, but is looking promising. Note that while currently I read the undocumented Roche style SFF index block, I have not yet attempted to write out such an index (probably unwise unless the format does get published?). Also note that there is still scope for improvement for how the trimming information is presented in the SeqRecord object (perhaps some kind of masked SeqRecord/Seq as has been suggested on the mailing lists). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Sep 2 16:45:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Sep 2009 12:45:48 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO
In-Reply-To:
Message-ID: <200909021645.n82GjmbA023923@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-02 12:45 EST -------
(In reply to comment #4)
> That branch now has support for SeqIO parsing, indexing and *writing* of
> SFF files. The write support is still very new and needs more testing,
> but is looking promising. Note that while currently I read the undocumented
> Roche style SFF index block, I have not yet attempted to write out such an
> index (probably unwise unless the format does get published?).

It now has a first attempt at writing a Roche style SFF index, which my code
will parse back again happily. I have not yet tried the resulting file with
the Roche SFF tools. Note that this does not preserve any Roche XML meta
data. Note also that the index is skipped if any of the record names are not
14 chars long (which is true of all the Roche indexes I have looked at).

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Sep 4 10:23:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 4 Sep 2009 06:23:26 -0400
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO
In-Reply-To:
Message-ID: <200909041023.n84ANQgj023187@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-04 06:23 EST -------
I've been working on the Roche SFF indexes, and via their tools have discovered
there are at least two index block formats used:

Most SFF files I have looked at have an index block which starts ".mft1.00"
(short for Manifest v1.00 is my guess) which holds both an XML "manifest" or
meta data, plus a read offset index.

You can also get SFF files where the index block starts ".srt1.00" (Short Read
Table v1.00 maybe?) which have just an index.

The index details themselves are the same in both cases, and support
arbitrary read name lengths. The offset is in base 255 (not 256), apparently so
that byte 255 (0xFF) can be used as a separator character. For typical Roche
SFF files, the read names are 14 characters, and the index uses 20 bytes per
read.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
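
The base 255 scheme in comment #6 can be illustrated with a small round trip in plain Python. This is a sketch under assumptions: the digit order (most significant first) and the width of four digits are guesses for the example, the function names are made up, and none of this is taken from the SFF code on the branch:

```python
def encode_base255(value, width=4):
    """Encode a non-negative integer as `width` base 255 digits.

    Every digit is in the range 0..254, so the byte 0xFF never occurs
    and remains free to act as a separator.
    Assumption: most significant digit first.
    """
    if value < 0 or value >= 255 ** width:
        raise ValueError("value does not fit in %i base 255 digits" % width)
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 255)
        digits.append(digit)
    return bytes(reversed(digits))


def decode_base255(data):
    """Decode base 255 digits (most significant first) into an integer."""
    value = 0
    for digit in data:
        if digit == 255:
            raise ValueError("0xFF is a separator, not a digit")
        value = value * 255 + digit
    return value


encoded = encode_base255(1000000)
decoded = decode_base255(encoded)
largest = 255 ** 4 - 1   # biggest offset four digits can hold
```

Four such digits cover offsets up to 255**4 - 1, a little short of the 2**32 - 1 that four ordinary bytes would allow, which is the price of reserving 0xFF.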
From bugzilla-daemon at portal.open-bio.org Fri Sep 4 10:54:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Sep 2009 06:54:39 -0400 Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read files in Bio.SeqIO In-Reply-To: Message-ID: <200909041054.n84AsdNe023921@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2837 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-04 06:54 EST ------- The Staden IO lib has references to ".srt1.00" (454 sorted v1.00) and also another SFF index format, which start ".hsh1.00" (hash table v1.00). See files io_lib/progs/hash_sff.c and io_lib/open_trace_file.c from http://sourceforge.net/projects/staden/ Scanning their code also confirms my base 255 deduction for the ".srt" indexes, see function getuint4_255, and the use of 0xFF as a break character. Interestingly they only expect 4 bytes for the offset (limiting this to almost 4GB SFF files). There is a fifth byte which is usually null, this could be a name terminator (although this is not actually needed), or used for 4GB+ SFF offsets. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Sep 4 15:33:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Sep 2009 16:33:16 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? 
In-Reply-To: <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
	<320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
	<20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
Message-ID: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>

Hi David,

[This is a continuation of a thread on the main list, but it is
much more suited to the dev list now.]

On Tue, Sep 1, 2009 at 11:38 PM, David Winter wrote:
> Peter wrote:
>> David - I would prefer we also put your new wrappers in
>> Bio.Emboss.Applications, and would be happy to look at adding
>> those to CVS now that Biopython 1.51 is out (I had forgotten
>> about them actually - so thanks for the reminder).
>>
>> Peter
>
> Hi Peter,
>
> I'd almost forgotten about them myself! I only put them in their own module
> because I had the PhyML wrapper as well and that's not an EMBOSS
> application.

I see you've done that on github. I had a look at merging this
into CVS, but had a few comments first.

I found you had a load of tabs in your file (please use 4 space
indentation in future).
http://www.biopython.org/wiki/Contributing#Coding_conventions

I am unclear why you are subclassing _EmbossMinimalCommandLine
instead of _EmbossCommandLine since most (all?) of the new
wrappers use the "outfile" parameter.

As I recall EMBOSS isn't fussy about the presence of the equals
sign (right now our wrappers mostly omit the equals, but not all
the time - which looks odd to me).

Also your code seems to be missing the __str__ / _validate changes
on the trunk.
And finally, I think you can add yourself to the copyright at the top of the file for this work ;) Peter From biopython at maubp.freeserve.co.uk Fri Sep 4 17:22:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Sep 2009 18:22:27 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> Message-ID: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> On Tue, Sep 1, 2009 at 2:25 PM, Peter wrote: > On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman wrote: >> Hi Peter; >> >> [indexed dict usage] >>> What file formats where you working on, and how many records? >> >> It was a 100Mb fasta file with about 41,000 records. Nothing too >> heavy but it worked great. > > Yeah, with just 41,000 keys and offsets the in memory dict would > be pretty small too. This is within the range of file sizes I expect > the Bio.SeqIO.indexed_dict() functionality to be used on. Cool. > >> The only change I made was to generalize the record building line: >> >> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) >> >> to allow an arbitrary function to be passed to define the >> identifier, instead of defaulting to the first part of the line. >> This is helpful for those fun NCBI ids >> (gi|83029091|ref|XM_357633.3|) where other parts of the program only >> have the accession number. > > Did your callback function get given the "title string" and return > the desired key? 
>
> I had wondered about this, but the only way for this to be general
> (to work on all file formats) is for the callback function to be given
> a SeqRecord object - which means having to fully parse the file
> during the indexing, which ends up being *much* slower. We can
> do this if you think it adds a lot of utility i.e. mimic the key_function
> argument we already have on Bio.SeqIO.to_dict()

A less flexible option is a callback function which maps the default
record.id to a new key. This would solve your NCBI FASTA issue,
and might be handy in other settings (e.g. removing the version
suffix in GenBank identifiers). However, it would not allow, for
example, switching to a completely different identifier (e.g. the GI
number) which is present elsewhere in the file.

The point is we can support this kind of limited key_function
without suffering the severe speed penalty which doing a full
parse to give SeqRecord objects would impose.

How does that sound, Brad? It should add just a little complexity
to the current code, and allows some neat tricks. Or we can leave
things as they are (KISS).

Peter

From mjldehoon at yahoo.com Sat Sep 5 08:17:00 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 5 Sep 2009 01:17:00 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez.parse
Message-ID: <339938.48242.qm@web62405.mail.re1.yahoo.com>

Hi everybody,

Recently I was trying to parse a huge Entrez XML file containing Entrez gene
records. Because of the size of the file, Entrez.read failed with a memory
error since it could not keep the entire information in the XML file in memory.
I decided to add a parse() function to Bio.Entrez that can iterate over such
large files. This function is useful if the XML file essentially contains a
list of records; the parse() function is a generator function that returns
these records one by one.

--Michiel.
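
The generator approach described in the message above can be sketched with the standard library alone. This is not the Bio.Entrez.parse() code; it is a minimal illustration of yielding records one at a time instead of building the whole tree, and the tag names (GeneSet, Gene, Id, Name) as well as the function name parse_records are made up for the example rather than taken from the real Entrez Gene XML.

```python
import io
import xml.etree.ElementTree as ET


def parse_records(handle, record_tag):
    """Yield one record (as a plain dict) at a time from an XML handle."""
    context = ET.iterparse(handle, events=("start", "end"))
    # The first event is the start of the root element; keep a reference
    # so finished records can be discarded from the tree as we go.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            root.clear()  # drop the finished record so memory use stays flat


# Hypothetical two-record file, standing in for a huge XML download.
demo_xml = b"""<GeneSet>
  <Gene><Id>1</Id><Name>alpha</Name></Gene>
  <Gene><Id>2</Id><Name>beta</Name></Gene>
</GeneSet>"""

records = list(parse_records(io.BytesIO(demo_xml), "Gene"))
```

Because parse_records() is a generator, a caller looping over it only ever holds one record at a time; wrapping it in list() as above is just for the demonstration and would defeat the purpose on a genuinely large file.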
From p.j.a.cock at googlemail.com Sat Sep 5 12:59:09 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 5 Sep 2009 13:59:09 +0100
Subject: [Biopython-dev] Bio.Entrez.parse
In-Reply-To: <339938.48242.qm@web62405.mail.re1.yahoo.com>
References: <339938.48242.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00909050559p2c9da2f1o60905ac3dfe0cb35@mail.gmail.com>

On Sat, Sep 5, 2009 at 9:17 AM, Michiel de Hoon wrote:
> Hi everybody,
> Recently I was trying to parse a huge Entrez XML file containing Entrez gene
> records. Because of the size of the file, Entrez.read failed with a memory
> error since it could not keep the entire information in the XML file in memory.
> I decided to add a parse() function to Bio.Entrez that can iterate over such large
> files. This function is useful if the XML file essentially contains a list of records;
> the parse() function is a generator function that returns these records one by one.

That sounds excellent - I'd noticed that usually Bio.Entrez.read()
would return a list of (large nested) records, so this should be a
natural extension.

Peter

From biopython at maubp.freeserve.co.uk Mon Sep 7 11:56:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 12:56:17 +0100
Subject: [Biopython-dev] Anonymous CVS working again :)
Message-ID: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>

Just an FYI,

While the developer server dev.open-bio.org has been fine, recently
our public read-only mirror at cvs.open-bio.org (and cvs.biopython.org)
had not been updated. This affected Biopython and EMBOSS.

And for Biopython, as a knock-on effect, this had meant the latest
code at http://biopython.org/SRC/biopython/ was a little out of date.
[Biopython's github mirror was not affected] These all seem to be working fine once again - thanks to someone at the OBF - let me know who and I'll buy you a beer when we (next) meet up :) Peter From biopython at maubp.freeserve.co.uk Mon Sep 7 17:34:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Sep 2009 18:34:53 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? In-Reply-To: <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> Message-ID: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> On Fri, Sep 4, 2009 at 4:33 PM, Peter wrote: > Hi David, > > [This is a continuation of a thread on the main list, but it is much more > suited to the dev list now.] > > ... > > I see you've done that on github. I had a look at merging this into CVS, > but had a few comments first. > > I found you had a load of tabs in your file (please use 4 space indentation > in future). http://www.biopython.org/wiki/Contributing#Coding_conventions Thanks. > I am unclear why you are subclassing _EmbossMinimalCommandLine > instead of _EmbossCommandLine since most (all?) of the new wrappers > use the "outfile" parameter. As I recall EMBOSS isn't fussy about the > presence of the equals sign (right now our wrappers mostly omit the > equals, but not all the time - which looks odd to me). I see you've switched to _EmbossCommandLine - fine. > Also your code seems to me missing the __str__ / _validate changes > on the trunk. Also fixed, thanks. > And finally, I think you can add yourself to the copyright at the top of > the file for this work ;) Cool. I have checked this into CVS, but did also fix an old typo (in a docstring) and one new typo (in an argument name). 
Thanks David!

Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
based on test_Emboss.py? Continuing on the github branch is fine.

We should put you in the CONTRIB file now too (are there any other
recent people we've missed?). Would you like to give a webpage, or
is this email address fine (be warned it may get harvested for spam)?

Thank you,

Peter

From biopython at maubp.freeserve.co.uk Mon Sep 7 20:00:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 21:00:46 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To:
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
Message-ID: <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>

> Are these being kept in sync?  bioperl's moved completely away from
> cvs to svn with very little pain.  We found sync-ing the two more trouble
> than it was worth.

Perhaps we are talking at cross purposes here, Chris.

Right now Biopython and EMBOSS are using CVS, with developers
committing to dev.open-bio.org, which then updates a read only CVS
mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
to provide anonymous access.

Likewise, BioPerl etc are using SVN, with developers committing to
dev.open-bio.org, which then updates a read only SVN mirror at
code.open-bio.org (or its other aliases) to provide anonymous access.

Peter

From biopython at maubp.freeserve.co.uk Mon Sep 7 21:26:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Sep 2009 22:26:26 +0100
Subject: [Biopython-dev] [Root-l] Anonymous CVS working again :)
In-Reply-To: <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
References: <320fb6e00909070456k28122011o59bfb3d640c4a0a8@mail.gmail.com>
 <320fb6e00909071300x6238e828l4440e71c562e792c@mail.gmail.com>
 <9A75D700-AC7B-4D5B-ABB2-D28267735E4C@illinois.edu>
Message-ID: <320fb6e00909071426w1dfed95bx703384b3227eee6b@mail.gmail.com>

On Mon, Sep 7, 2009 at 9:44 PM, Chris Fields wrote:
> On Sep 7, 2009, at 3:00 PM, Peter wrote:
>
>>> Are these being kept in sync?  bioperl's moved completely away from
>>> cvs to svn with very little pain.  We found sync-ing the two more trouble
>>> than it was worth.
>>
>> Perhaps we are talking at cross purposes here Chris.
>>
>> Right now Biopython and EMBOSS are using CVS, with developers
>> committing to dev.open-bio.org, which then updates a read only CVS
>> mirror code.open-bio.org (aka cvs.open-bio.org aka cvs.biopython.org)
>> to provide anonymous access.
>>
>> Likewise, BioPerl etc are using SVN, with developers committing to
>> dev.open-bio.org, which then updates a read only SVN mirror at
>> code.open-bio.org (or its other aliases) to provide anonymous access.
>>
>> Peter
>
> Right, I understand that, but you also have a git repo on github (unless I'm
> mistaken).  Based on that I assume you plan on migrating over to dev git
> and/or github eventually, but I'm unsure of the future of the CVS repo.

Right! For now, CVS changes are pushed to github. Once we move to git,
the CVS repo will no longer be used, and will be left frozen in time.

> My point was, we had been in a similar situation.  We had thought of having
> a sync'ed CVS <-> SVN repo at one point, but it was way too much trouble to
> deal with and just dropped CVS altogether after the migration.  Instead, we
> just started switching all docs over to point to svn instead with lots of
> ample warning on the mail lists, and it all worked out in the end (we have
> had very few users inquiring about CVS).

Likewise, we could have git changes pushed into CVS, but there is
little point. We plan to just quit using CVS.

Peter

From david.winter at gmail.com Mon Sep 7 22:54:52 2009
From: david.winter at gmail.com (David Winter)
Date: Tue, 08 Sep 2009 10:54:52 +1200
Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython?
In-Reply-To: <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz>
 <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com>
 <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz>
 <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com>
 <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com>
Message-ID: <4AA58F3C.6080200@student.otago.ac.nz>

Hi Peter and all,

Sorry for the lack of communication from me on this. I successfully made it
off the grid for the weekend then found I couldn't push to github from work
(no ssh over the proxy for students) and couldn't email the list from home
(can't use the uni's SMTP from off campus) - IT-security catch 22!

> I see you've switched to _EmbossCommandLine - fine.
>

Yeah, this was my stupid fault - you'd given me a heads up about the two
different versions of the _EmbossCommandline and I tried out what I already
had with the 'normal' version and saw that it failed but didn't read the
error message properly (of course it failed because I was trying to give it
the outfile parameter twice...)

> [... snip the other things you asked about...]
>
>
> I have checked this into CVS, but did also fix an old typo (in a docstring)
> and one new typo (in an argument name). Thanks David!
>
> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
> based on test_Emboss.py?
Continuing on the github branch is fine. > Sounds good, will have a go at getting something going in the next couple of days > We should put you in the CONTRIB file now too (are there any other > recent people we've missed?). Would you like to give a webpage, or > is this email address fine (be warned it may get harvested for spam)? > > Well, I'm not sure it's much of a contribution from me, but thanks :) Perhaps add david.winter at gmail.com - gmail seems to handle spam pretty well and I won't be a student here for ever (right?...) Cheers, David From biopython at maubp.freeserve.co.uk Tue Sep 8 09:21:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 10:21:11 +0100 Subject: [Biopython-dev] [Biopython] Phylogenetic trees with biopython? In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> Message-ID: <320fb6e00909080221m7377f033ue9b1617b0bc38f5b@mail.gmail.com> On Mon, Sep 7, 2009 at 11:54 PM, David Winter wrote: > Hi Peter and all, > > Sorry the lack of communication from me on this. I successfully made it off > the grid for the weekend then found I couldn't push to github from work (no > ssh over the proxy for students) and couldn't email the list from home > (can't use the uni's SMTP from off campus ) - IT-security catch 22! Tricky. >> I see you've switched to _EmbossCommandLine - fine. 
> > Yeah, this was my stupid fault - you'd given me a heads up about the two > different version of the _EmbossCommandline and I tried out what I already > had with the the 'normal' version as saw that it failed but didn't read the > error message properly (of course it failed because I was trying to give it > the outfile parameter twice...) OK - I wondered if there was some other reason I couldn't see, so worth checking, >> I have checked this into CVS, but did also fix an old typo (in a >> docstring) and one new typo (in an argument name). Thanks David! >> >> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py >> based on test_Emboss.py? Continuing on the github branch is fine. > > Sounds good, will have a go at getting something going in the next > couple of days Great - whenever you get time. Thanks! >> We should put you in the CONTRIB file now too (are there any other >> recent people we've missed?). Would you like to give a webpage, or >> is this email address fine (be warned it may get harvested for spam)? > > Well, I'm not sure it's much of a contribution from me, but thanks :) But I'm expecting more in future *grin* > Perhaps add david.winter at gmail.com - gmail seems to handle spam > pretty well and I won't be a student here for ever (right?...) There is always a postdoc ;) Also can someone remind me at some point that we should include at least one of the EMBOSS PHYLIP tools in the alignment command line bit of the tutorial... 
Peter From chapmanb at 50mail.com Tue Sep 8 12:14:05 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 8 Sep 2009 08:14:05 -0400 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com> <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com> <20090831132451.GD75451@sobchak.mgh.harvard.edu> <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com> <20090901130639.GI75451@sobchak.mgh.harvard.edu> <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com> <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com> Message-ID: <20090908121405.GF63266@sobchak.mgh.harvard.edu> Hi Peter; [... callback function for specifying an ID ...] > > Did your callback function get given the "title string" and return > > the desired key? > > > > I had wondered about this, but the only way for this to be general > > (to work on all file formats) is for the callback function to be given > > a SeqRecord object - which means having to fully parse the file > > during the indexing, which ends up being *much* slower. We can > > do this if you think it adds a lot of utility i.e. mimic the key_function > > argument we already have on Bio.SeqIO.to_dict() > > A less flexible option is a callback function which maps the default > record.id to a new key. This would solve your NCBI FASTA issue, > and might be handy in other settings (e.g. removing the version > suffix in GenBank identifiers). However, it would not allow for > example switching to a completely different identifier (e.g. the GI > number) which is present elsewhere in the file. > > The point is we can support this kind of limited key_function > without suffering the severe speed penalty which doing a full > parse to give SeqRecord objects would impose. This is a great compromise. 
You're right, parsing the SeqRecord is too much, and allowing
manipulation of the default identifier would work fine.

If people need to do something much more complicated to get the ID
they would probably be better off extending the existing classes and
writing a custom indexer that pulls the IDs they need.

Brad

From biopython at maubp.freeserve.co.uk Tue Sep 8 13:22:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Sep 2009 14:22:35 +0100
Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO
In-Reply-To: <20090908121405.GF63266@sobchak.mgh.harvard.edu>
References: <320fb6e00908200428g1676c41el9ff1c91c0cb1afdc@mail.gmail.com>
 <320fb6e00908310542o2b6fc566k25632c39244b332c@mail.gmail.com>
 <20090831132451.GD75451@sobchak.mgh.harvard.edu>
 <320fb6e00908310649s77684634j8f203de1ddbf2fee@mail.gmail.com>
 <20090901130639.GI75451@sobchak.mgh.harvard.edu>
 <320fb6e00909010625j4dc82796mc1c278ba576feb30@mail.gmail.com>
 <320fb6e00909041022x6bb818edif93104496197b18b@mail.gmail.com>
 <20090908121405.GF63266@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com>

On Tue, Sep 8, 2009 at 1:14 PM, Brad Chapman wrote:
> Hi Peter;
>
> [... callback function for specifying an ID ...]
>
>> A less flexible option is a callback function which maps the default
>> record.id to a new key. This would solve your NCBI FASTA issue,
>> and might be handy in other settings (e.g. removing the version
>> suffix in GenBank identifiers). However, it would not allow for
>> example switching to a completely different identifier (e.g. the GI
>> number) which is present elsewhere in the file.
>>
>> The point is we can support this kind of limited key_function
>> without suffering the severe speed penalty which doing a full
>> parse to give SeqRecord objects would impose.
>
> This is a great compromise. You're right, parsing the SeqRecord is too
> much, and allowing manipulation of the default identifier would work fine.
Cool - done in CVS, including the docstring and the tutorial. > If people need to do something much more complicated to get the ID > they would probably be better off extending the existing classes and > writing a custom indexer that pulls the IDs they need. Certainly - we can't expect to cover every possible use case, and trying to do so will result in an overly complicated API. Did you have any ideas for a better name than Bio.SeqIO.indexed_dict()? Peter From mjldehoon at yahoo.com Tue Sep 8 13:30:30 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Sep 2009 06:30:30 -0700 (PDT) Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> Message-ID: <184931.66541.qm@web62403.mail.re1.yahoo.com> --- On Tue, 9/8/09, Peter wrote: > Did you have any ideas for a better name than > Bio.SeqIO.indexed_dict()? > Is indexed_dict a function? If so, I suggest we use a verb instead of a noun. Maybe just "index"? --Michiel. From biopython at maubp.freeserve.co.uk Tue Sep 8 13:53:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Sep 2009 14:53:36 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <184931.66541.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> <184931.66541.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote: > --- On Tue, 9/8/09, Peter wrote: >> Did you have any ideas for a better name than >> Bio.SeqIO.indexed_dict()? > > Is indexed_dict a function? If so, I suggest we use a verb instead > of a noun. Maybe just "index"? > > --Michiel. Bio.SeqIO.indexed_dict() is a function which returns a dictionary like object. So yes, a verb would be better, and "index" is short and sweet. 
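
The limited key_function idea agreed in this thread (index by file offset, derive the default id from the title line, then let a callback remap that id) can be sketched as below. This is not the code that went into CVS: the names index_fasta, fetch_raw and accession_only are hypothetical, only FASTA is handled, and the two demo records are made up apart from the NCBI-style identifier quoted earlier in the thread.

```python
import os
import tempfile


def index_fasta(path, key_function=None):
    """Return a dict mapping each record key to the byte offset of its '>' line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split(None, 1)
                if not words:
                    continue  # title line with no identifier, skipped in this sketch
                # Default key: the first word of the title line.
                default_id = words[0].decode("ascii")
                key = key_function(default_id) if key_function else default_id
                if key in offsets:
                    raise ValueError("Duplicate key %r" % key)
                offsets[key] = offset
    return offsets


def fetch_raw(path, offsets, key):
    """Seek to one record and return its raw text (title line plus sequence)."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        lines.append(handle.readline())
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


def accession_only(identifier):
    """Map 'gi|83029091|ref|XM_357633.3|' to 'XM_357633.3'."""
    return identifier.split("|")[3]


# Two made-up records written to a temporary file for the demonstration.
demo = (">gi|83029091|ref|XM_357633.3| first made-up record\nACGT\nTTGA\n"
        ">gi|12345|ref|NM_000001.1| second made-up record\nGGCC\n")
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo)
try:
    offsets = index_fasta(tmp.name, key_function=accession_only)
    record_text = fetch_raw(tmp.name, offsets, "XM_357633.3")
finally:
    os.remove(tmp.name)
```

Only the title lines are inspected while indexing, which is why this kind of callback avoids the cost of a full parse; the trade-off, as noted in the thread, is that the callback only ever sees the default identifier and cannot reach data elsewhere in the record.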
Peter From bugzilla-daemon at portal.open-bio.org Wed Sep 9 13:24:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Sep 2009 09:24:41 -0400 Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be deepcopied In-Reply-To: Message-ID: <200909091324.n89DOf4Q013555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2781 klaus.kopec at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #2 from klaus.kopec at tuebingen.mpg.de 2009-09-09 09:24 EST ------- this seems to be resolved in 1.51 with Python 2.6.2 under 64Bit Ubuntu? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Sep 9 15:18:01 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Sep 2009 11:18:01 -0400 Subject: [Biopython-dev] [Bug 2910] New: Parsing some pdb files results in shorter peptide sequences than expected Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2910 Summary: Parsing some pdb files results in shorter peptide sequences than expected Product: Biopython Version: 1.49 Platform: PC OS/Version: Linux Status: NEW Severity: critical Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: schafer at rostlab.org Parsing the one-letter sequence for a specific chain out of a given pdb file often seems to result in shorter sequences than expected. The following code demonstrates this behavior for structure 1a2d chain A. Aminoacid #118 VAL after the HETATOM (#117) block is missing in the result. 
------------------CODE----------------
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import *

parser = PDBParser()
ppb = PPBuilder()
structure = parser.get_structure('tmp', '1a2d.pdb')
polypeptides = ppb.build_peptides(structure[0]['A'])
sequence = str(polypeptides[0].get_sequence())
print sequence
------------------CODE----------------

Another example is structure 13gs chain C and D. Both sequences are ECG,
but the code above returns only CG. So this behavior seems to be
independent of the presence of a HETATOM block.

This bug is also present in version 1.51.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed Sep 9 15:18:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 9 Sep 2009 11:18:48 -0400
Subject: [Biopython-dev] [Bug 2910] Parsing some pdb files results in shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909091518.n89FImn5016415@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

schafer at rostlab.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schafer at rostlab.org

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 12:55:03 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:55:03 -0400
Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected
In-Reply-To:
Message-ID: <200909101255.n8ACt3Jd017456@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2910

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |normal
            Summary|Parsing some pdb files      |Bio.PDB build_peptides
                   |results in shorter peptide  |sometimes gives shorter
                   |sequences than expected     |peptide sequences than
                   |                            |expected

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:55 EST -------
Retitled as this appears to be a bug in the PPBuilder build_peptides method,
not the PDB parser, see:
http://lists.open-bio.org/pipermail/biopython/2009-September/005532.html

Test script:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder, to_one_letter_code
parser = PDBParser()
ppb = PPBuilder()
#structure = parser.get_structure('tmp', '1A2D.pdb')
structure = parser.get_structure('tmp', '13GS.pdb')
for model in structure :
    polypeptides = ppb.build_peptides(model)
    assert len(model) == len(polypeptides)
    for chain, pep in zip(model, polypeptides) :
        print
        print "Chain", chain.id
        print "Raw chain:"
        print "".join(to_one_letter_code.get(res.resname,"X") \
                      for res in chain if "CA" in res.child_dict)
        print "From peptide builder:"
        print pep.get_sequence()

Output for 1A2D,

PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2426.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2427.
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2428.
PDBConstructionWarning: WARNING: Chain B is discontinuous at line 2448.
Chain A Raw chain: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA >From peptide builder: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA Chain B Raw chain: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXVMKGVTSTRVYERA >From peptide builder: CDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDLVTIRSESTFKNTEISFKLGVEFDEITADDRKVKSIITLDGGALVQVQKWDGKSTTIKRKRDGDKLVVEXMKGVTSTRVYERA Notice there are discontinuities in both chains A and B, and a missing residue in their peptides. And the output from 13GS, PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3760. PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3812. PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3852. PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3948. PDBConstructionWarning: WARNING: Chain C is discontinuous at line 4033. 
Chain A Raw chain: MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ >From peptide builder: MPPYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ Chain B Raw chain: PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ >From peptide builder: PYTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGDLTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKDDYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFPLLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ Chain C Raw chain: ECG >From peptide builder: CG Chain D Raw chain: ECG >From peptide builder: CG Notice there are discontinuities in chains A, B and C, but missing residues in the peptide chains C and D. This suggests the discontinuities are required to trigger the problem. Also there are no HETATM residues for chains C and D. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 12:57:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:13 -0400
Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch
In-Reply-To:
Message-ID: <200909101257.n8ACvDe1017562@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2894

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:57 EST -------
I'm marking this as a duplicate of bug 2887, and believe it to be fixed on
the trunk.

*** This bug has been marked as a duplicate of bug 2887 ***

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Sep 10 12:57:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Sep 2009 08:57:16 -0400
Subject: [Biopython-dev] [Bug 2887] set iteration order dependency in Bio.Data.CodonTable
In-Reply-To:
Message-ID: <200909101257.n8ACvGRn017574@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2887

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kellrott at ucsd.edu

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-10 08:57 EST -------
*** Bug 2894 has been marked as a duplicate of this bug. ***

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Sep 10 12:57:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Sep 2009 08:57:20 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909101257.n8ACvKL9017592@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2894, which changed state. Bug 2894 Summary: Jython List difference causes failed assertion in CondonTable Fix+Patch http://bugzilla.open-bio.org/show_bug.cgi?id=2894 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Sep 15 13:51:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 14:51:43 +0100 Subject: [Biopython-dev] Another Biopython release? Message-ID: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Hi all, Looking ahead, Tiago has some population genetics code he hopes to merge into the trunk at the end of the month (or in October), and we still have Brad's GFF stuff, my SFF work, Kristian's RNA code, Kyle's misc suggestions, and perhaps most importantly the phylogenetics GSoC work to consider. I know it's been only a month since we released Biopython 1.51, but does anyone (other than me) think that we already have enough done to warrant another release? The associated CVS freeze would also serve as a good break point for moving to github (see other threads). Here is what we have in the NEWS file at the moment: New helper functions Bio.SeqIO.convert() and Bio.AlignIO.convert() allow an easier way to use Biopython for simple file format conversions. 
Additionally, these new functions allow Biopython to offer important file format specific optimisations (e.g. FASTQ to FASTA, and interconverting FASTQ variants).

New function Bio.SeqIO.indexed_dict() allows indexing of most sequence file formats (but not alignment file formats), allowing dictionary like random access to all the entries in the file as SeqRecord objects, keyed on the record id. This is especially useful for very large sequencing files, where all the records cannot be held in memory at once. This supplements the more flexible but memory demanding Bio.SeqIO.to_dict() function.

Bio.SeqIO can now write "phd" format files (used by PHRED, PHRAP and CONSED), allowing interconversion with FASTQ files, or FASTA+QUAL files.

Bio.Emboss.Applications now includes wrappers for the "new" PHYLIP EMBASSY package (e.g. fneighbor) which replace the "old" PHYLIP EMBASSY package (e.g. efneighbor) whose Biopython wrappers are now obsolete.

See also the DEPRECATED file, as several old deprecated modules have finally been removed (e.g. Bio.EUtils which had been replaced by Bio.Entrez).

[As an aside - Cymon and David - do you want to be named in the NEWS file for the PHD and PHYLIPNEW stuff?]

We're still debating the name of the new function Bio.SeqIO.indexed_dict(), but I am happy with the code (and new documentation) otherwise. The related extension to add indexing via a lookup file or an SQLite database is another big chunk of work which I don't have time for at the moment, but the code already in CVS is still extremely useful as is.

Again, I'm biased, but I think the Bio.SeqIO.convert(...) function will be a popular addition for its convenience, and especially valuable for anyone wanting to convert between the different FASTQ files, where the optimised conversion code gives a big speed up.

Does doing another quick release (say at some point next week) sound like a good plan?
If people like the idea, then getting some extra testing in now would be great - especially on the new stuff (it has unit tests of course, but real world usage is also important - thanks Brad for already trying out the FASTA indexing). Peter From bartek at rezolwenta.eu.org Tue Sep 15 14:59:43 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 15 Sep 2009 16:59:43 +0200 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Message-ID: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> On Tue, Sep 15, 2009 at 3:51 PM, Peter wrote: > Hi all, > > I know it's been only a month since we released Biopython 1.51, but > does anyone (other than me) think that we already have enough done > to warrant another release? The associated CVS freeze would also > serve as a good break point for moving to github (see other threads). > That would be great. As for the move to github, I've added some (quite preliminary) docs for developers on how to make commits to the main branch using git and github to the wiki: http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch Any comments and/or improvements are most welcome. cheers Bartek From tiagoantao at gmail.com Tue Sep 15 15:29:55 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:29:55 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> Message-ID: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> On Tue, Sep 15, 2009 at 2:51 PM, Peter wrote: > Hi all, > > Looking ahead, Tiago has some population genetics code he hopes to I can put my stuff in CVS (plus I have docs). Question: CVS is still "the place". Right? 
I just need to test stuff on Windows. All the rest seems ok. Tiago From biopython at maubp.freeserve.co.uk Tue Sep 15 15:35:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:35:13 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> Message-ID: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> 2009/9/15 Tiago Ant?o : > On Tue, Sep 15, 2009 at 2:51 PM, Peter wrote: >> Hi all, >> >> Looking ahead, Tiago has some population genetics code he hopes to > > I can put my stuff in CVS (plus I have docs). Question: CVS is still > "the place". Right? > > I just need to test stuff on Windows. All the rest seems ok. Yes, for the short term CVS is still the master repository. If you have that stuff ready to check in now, then sure - go ahead I was assuming you didn't expect to have this ready just yet, hence the proposal to sneak out a quick release first ;) Give me a shout and I'll get my Windows test machine up and running to double check the unit tests there. Maybe we'll push back the "next week" idea a bit ;) Peter From eric.talevich at gmail.com Tue Sep 15 15:38:45 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 15 Sep 2009 11:38:45 -0400 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> Message-ID: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> > > On Tue, Sep 15, 2009 at 3:51 PM, Peter > wrote: > > Hi all, > > > > I know it's been only a month since we released Biopython 1.51, but > > does anyone (other than me) think that we already have enough done > > to warrant another release? The associated CVS freeze would also > > serve as a good break point for moving to github (see other threads). > > > Sounds good to me. Completing the Git migration would make it much easier for me to maintain the Tree/TreeIO stuff, since I already have a few local branches based on it that an upstream CVS duplication would mangle. On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski < bartek at rezolwenta.eu.org> wrote: > That would be great. As for the move to github, I've added some (quite > preliminary) docs for developers on how to make commits to the main > branch using git and github to the wiki: > http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch > > The setup here for committers looks potentially different from the setup in "Merging upstream changes" (describing read-only tracking), but also potentially similar. Diff: - The github:biopython/biopython repository is called "official" here, but "upstream" there. Different protocol too, but that's intentional. - It also shows how to treat the upstream/official repo as the origin, CVS-style. This would mean the developer doesn't have a separate GitHub fork to use for personal branches, uncertain commits, etc. that don't belong in the main repo. Maybe a good way to organize the page would be in terms of how you want to use the repo: 1. 
Tracking Biopython with raw Git (without signing up for GitHub) - git clone http://.../biopython/biopython - remote: upstream - how to format a patch and submit on Bugzilla 2. Tracking Biopython on GitHub (e.g. occasional contributors) - sign up, click the "fork" button - git clone http://.../your-name-here/biopython - remotes: origin, upstream - how to submit a pull request on GitHub - how to add, manage and delete branches locally and on GitHub 3. Collaborating - either #1 or #2 is fine - how to add and manage more remotes - how to apply Git patches, and why copy/paste kills kittens the next time you merge 4. Committing to Biopython - same as #2, but use the private URL for the "upstream" remote - remotes: origin, upstream - policy on pushing upstream, code reviews, tagging, etc. Cheers, Eric From tiagoantao at gmail.com Tue Sep 15 15:39:07 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:39:07 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> Message-ID: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> 2009/9/15 Peter : > Give me a shout and I'll get my Windows test machine up > and running to double check the unit tests there. I think I am not in the mood to impose the burden on you. I will find a Windows machine and test it myself. > Maybe we'll push back the "next week" idea a bit ;) I am OK with "next week". But as I said two months ago, I have calendarized the extension of Bio.PopGen to October. So the material can go on the next release after the one on "next week". 
I just want to have lots of free time and little travel to be able to assist potential users (as I intend to announce the new content to the evolutionary biology crowd quite a lot) -- " It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard From biopython at maubp.freeserve.co.uk Tue Sep 15 15:48:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 16:48:43 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> Message-ID: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> 2009/9/15 Tiago Ant?o : > 2009/9/15 Peter : >> Give me a shout and I'll get my Windows test machine up >> and running to double check the unit tests there. > > I think I am not in the mood to impose the burden on you. I will find > a Windows machine and test it myself. I was just going to turn on the machine, update to the latest CVS, and do a compile/test with Python 2.4, 2.5, 2.6 - Its no extra effort, as I would be doing this anyway for a new release. Unless of course you are adding wrappers for more command line tools, which would ideally require me to install them - that I might leave for another day ;) >> Maybe we'll push back the "next week" idea a bit ;) > > I am OK with "next week". But as I said two months ago, I have > calendarized the extension of Bio.PopGen to October. So the material > can go on the next release after the one on "next week". 
> > I just want to have lots of free time and little travel to be able to > assist potential users (as I intend to announce the new content to the > evolutionary biology crowd quite a lot) If you are happy to merge the code this week (via CVS), and confident it is ready to release, then I could do the release next week, and then we move to git. Or, I can do the release next week, we move to git, and then you can merge the new code (via git) at your leisure (Oct). Either plan is fine with me. Which do you prefer? Peter From tiagoantao at gmail.com Tue Sep 15 15:57:17 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 15 Sep 2009 16:57:17 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> Message-ID: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> > Unless of course you are adding wrappers for more command > line tools, which would ideally require me to install them - that > I might leave for another day ;) Spot on ;) . > If you are happy to merge the code this week (via CVS), and > confident it is ready to release, then I could do the release > next week, and then we move to git. I will be only able to test the code on Windows tomorrow, if I can get hold to the machine (which I should). > Either plan is fine with me. Which do you prefer? I prefer merging on CVS, I am still much more proficient with it. You should have the merge there on Friday morning when you arrive. Tutorial included. Tiago -- " It always takes ideology to consummate massive error." 
- Ambrose Evans-Pritchard From biopython at maubp.freeserve.co.uk Tue Sep 15 16:09:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 15 Sep 2009 17:09:32 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <6d941f120909150829k67d3f344gf9648bfc290cc9f0@mail.gmail.com> <320fb6e00909150835g231e6491n7d167ef42a9116c9@mail.gmail.com> <6d941f120909150839i452d7cees25099da3ae68eb97@mail.gmail.com> <320fb6e00909150848q23b109cdg9ca38ad7882ef637@mail.gmail.com> <6d941f120909150857s6531b1f1o77674106efa050ed@mail.gmail.com> Message-ID: <320fb6e00909150909x2f45e0f5g6c4da77eafcd9a49@mail.gmail.com> 2009/9/15 Tiago Ant?o : >> Unless of course you are adding wrappers for more command >> line tools, which would ideally require me to install them - that >> I might leave for another day ;) > > Spot on ;) . OK. >> If you are happy to merge the code this week (via CVS), and >> confident it is ready to release, then I could do the release >> next week, and then we move to git. > > I will be only able to test the code on Windows tomorrow, if > I can get hold to the machine (which I should). Fingers crossed this doesn't throw any surprises at you. >> Either plan is fine with me. Which do you prefer? > > I prefer merging on CVS, I am still much more proficient with it. You > should have the merge there on Friday morning when you arrive. > Tutorial included. OK then :) Peter From bartek at rezolwenta.eu.org Tue Sep 15 19:45:22 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 15 Sep 2009 21:45:22 +0200 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> Message-ID: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich wrote: > Sounds good to me. Completing the Git migration would make it much easier > for me to maintain the Tree/TreeIO stuff, since I already have a few local > branches based on it that an upstream CVS duplication would mangle. > Then maybe we should wait with committing your changes to the time we drop CVS, in order to avoid loss of change history in your code... What do you think, Peter? > > On Tue, Sep 15, 2009 at 10:59 AM, Bartek Wilczynski < > bartek at rezolwenta.eu.org> wrote: > >> That would be great. As for the move to github, I've added some (quite >> preliminary) docs for developers on how to make commits to the main >> branch using git and github to the wiki: >> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch >> >> > The setup here for committers looks potentially different from the setup in > "Merging upstream changes" (describing read-only tracking), but also > potentially similar. Diff: > - The github:biopython/biopython repository is called "official" here, but > "upstream" there. Different protocol too, but that's intentional. Yes, indeed. I know this might seem strange but I was trying to deliberately make the distinction between the main repository in read-write mode (official) and in read-only mode (upstream). I would keep it like this at least for a while so that the transition from CVS is as easy as possible. We have quite a few developers who are new to git and comfortable with CVS. > - It also shows how to treat the upstream/official repo as the origin, > CVS-style. Yes, exactly. 
> This would mean the developer doesn't have a separate GitHub fork
> to use for personal branches, uncertain commits, etc. that don't belong in
> the main repo.

Not necessarily. It just means that these two roles are separate: a developer can (but does not have to) have his own branch of the biopython tree where he/she makes the changes, but this is not directly linked to the official (read-write) biopython branch. I know it's not necessarily the best way to use github, but I would like to avoid confusing people who are used to CVS. That's why I decided to describe the role of developer with read-write access differently.

BTW, I would see the role of the GitUsage wiki page as a guide rather than a law. That means that if someone understands better how to use git and github, and does not get lost having both local and remote branches with different origins, I'm absolutely fine with this. But I think it is quite complicated, especially for people new to git.

So, in summary, my idea was to (currently) recommend somewhat CVS-like usage of git on the main branch, which would be simple for people to use at first, and to encourage them to create their own branches and do development on them.

>
> Maybe a good way to organize the page would be in terms of how you want to
> use the repo:
>
> 1. Tracking Biopython with raw Git (without signing up for GitHub)
>    - git clone http://.../biopython/biopython
>    - remote: upstream
>    - how to format a patch and submit on Bugzilla
>
> 2. Tracking Biopython on GitHub (e.g. occasional contributors)
>    - sign up, click the "fork" button
>    - git clone http://.../your-name-here/biopython
>    - remotes: origin, upstream
>    - how to submit a pull request on GitHub
>    - how to add, manage and delete branches locally and on GitHub
>
> 3. Collaborating
>    - either #1 or #2 is fine
>    - how to add and manage more remotes
>    - how to apply Git patches, and why copy/paste kills kittens the next
> time you merge
>
> 4. Committing to Biopython
>    - same as #2, but use the private URL for the "upstream" remote
>    - remotes: origin, upstream
>    - policy on pushing upstream, code reviews, tagging, etc.
>
>

Having such documentation would be nice. I think that it is currently structured more or less like that (now we just don't have #1 and #4 currently recommends a very simple CVS-like usage). I think that adding #1 and putting in place policies on how to submit patches would be great. For #4 I would vote for recommending (at least for a while) the CVS-like way, but I'm absolutely for the development of the alternative procedure, where the developer works with a single repo both on his code and on the official branch.

I don't want to underestimate the git skills of our current developers, but so far I think only a few people have gotten their github accounts, which means the simpler we keep it the better (at least for a while). I certainly hope that people will get used to git quickly, but I would like to make the initial change as simple as possible for people who will be switching from CVS to git.

cheers
Bartek

From biopython at maubp.freeserve.co.uk Tue Sep 15 20:25:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 15 Sep 2009 21:25:00 +0100
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
 <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
 <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com>
 <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com>
Message-ID: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com>

On Tue, Sep 15, 2009 at 8:45 PM, Bartek Wilczynski wrote:
> On Tue, Sep 15, 2009 at 5:38 PM, Eric Talevich wrote:
>
>> Sounds good to me. Completing the Git migration would make it much easier
>> for me to maintain the Tree/TreeIO stuff, since I already have a few local
>> branches based on it that an upstream CVS duplication would mangle.
>
> Then maybe we should wait with committing your changes to the
> time we drop CVS, in order to avoid loss of change history in your
> code... What do you think, Peter?

Yes, I was suggesting getting a final CVS release out soon, and then looking at merging all the new stuff (including Eric's tree stuff) starting to pile up on github.

I knew Tiago had a lump of code ready to go and, as we have just discussed, he would prefer to check that in via CVS. So, Tiago will do that (this Friday), then we'll do the final CVS release next week, and then switch to git - and start to focus on merging in new stuff.

Peter

From chapmanb at 50mail.com Wed Sep 16 12:34:07 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 Sep 2009 08:34:07 -0400
Subject: [Biopython-dev] Another Biopython release?
In-Reply-To: <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com>
 <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com>
Message-ID: <20090916123407.GE13500@sobchak.mgh.harvard.edu>

Hi Peter;

> > I know it's been only a month since we released Biopython 1.51, but
> > does anyone (other than me) think that we already have enough done
> > to warrant another release? The associated CVS freeze would also
> > serve as a good break point for moving to github (see other threads).

I don't have a strong opinion about the release. It seems a little early, but if you think we are ready, go for it.

I have tested Osvaldo's Novoalign commandline object and have it ready to get in. Right now it's in a git tree but I can move it over to a CVS tree and integrate it for the release. It'll live in Bio/Sequencing/Applications like you suggested. I should be able to do that this evening.
I am all about the move to Git and GitHub. Anything we can do to finish that off and make it official is cool by me. > That would be great. As for the move to github, I've added some (quite > preliminary) docs for developers on how to make commits to the main > branch using git and github to the wiki: > http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch This is looking great. I'd agree with Eric that we should be consistent in the doc for suggestions on naming the official biopython branch: git remote add upstream git://github.com/biopython/biopython.git git remote add official git at github.com:biopython/biopython.git My vote is for the "official" naming which is a little more specific. Great stuff, Brad From biopython at maubp.freeserve.co.uk Wed Sep 16 13:30:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 14:30:47 +0100 Subject: [Biopython-dev] Another Biopython release? In-Reply-To: <20090916123407.GE13500@sobchak.mgh.harvard.edu> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <20090916123407.GE13500@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909160630o4dc1379dwaba667ed13ed9bde@mail.gmail.com> On Wed, Sep 16, 2009 at 1:34 PM, Brad Chapman wrote: > Hi Peter; > >> > I know it's been only a month since we released Biopython 1.51, but >> > does anyone (other than me) think that we already have enough done >> > to warrant another release? The associated CVS freeze would also >> > serve as a good break point for moving to github (see other threads). > > I don't have a strong opinion about the release. It seems a little > early but if you think we are ready go for it. OK. > I have tested Osvaldo's Novoalign commandline object and have it > ready to get in. Right now it's in a git tree but I can move it > over to a CVS tree and integrate it for the release. It'll live in > Bio/Sequencing/Applications like you suggested. 
I should be able to > do that this evening. Go for it - I presume you have it in a private git repostory at the moment, as I couldn't spot it on github? > I am all about the move to Git and GitHub. Anything we can do to > finish that off and make it official is cool by me. > >> That would be great. As for the move to github, I've added some (quite >> preliminary) docs for developers on how to make commits to the main >> branch using git and github to the wiki: >> http://www.biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch > > This is looking great. I'd agree with Eric that we should be > consistent in the doc for suggestions on naming the official > biopython branch: > > git remote add upstream git://github.com/biopython/biopython.git > git remote add official git at github.com:biopython/biopython.git > > My vote is for the "official" naming which is a little more > specific. Well, both "official" and "upstream" have merit. I don't mind which, but it does make sense to be consistent. Peter From biopython at maubp.freeserve.co.uk Wed Sep 16 13:48:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 14:48:39 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> <184931.66541.qm@web62403.mail.re1.yahoo.com> <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> Message-ID: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com> On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote: > On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote: >> On Tue, 9/8/09, Peter wrote: >>> Did you have any ideas for a better name than >>> Bio.SeqIO.indexed_dict()? >> >> Is indexed_dict a function? If so, I suggest we use a verb instead >> of a noun. Maybe just "index"? > > Bio.SeqIO.indexed_dict() is a function which returns a dictionary like > object. 
So yes, a verb would be better, and "index" is short and sweet. Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict() to Bio.SeqIO.index() for the next release. Thinking ahead, in addition to the current code (indexing a file, keeping the index in memory) we might in future add want to something like Bio.SeqIO.sqlite_index() where the index is kept in a database etc. Peter From bugzilla-daemon at portal.open-bio.org Wed Sep 16 22:00:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 16 Sep 2009 18:00:59 -0400 Subject: [Biopython-dev] [Bug 2904] Interface for Novoalign In-Reply-To: Message-ID: <200909162200.n8GM0x7d006226@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2904 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from chapmanb at 50mail.com 2009-09-16 18:00 EST ------- Osvaldo; Thanks much for the submission. This is committed and lives in: Bio/Sequencing/Applications to create a namespace for future sequencing related commandlines. You can import with: from Bio.Sequencing.Applications import NovoalignCommandline It would be great if you wanted to add a cookbook example of using it (http://biopython.org/wiki/Category:Cookbook) based on a simple pipeline. Perhaps something involving downstream parsing of the novoalign format, or converted to SAM as you suggested in Bug 2905. Thanks, Brad -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Wed Sep 16 22:53:31 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 16 Sep 2009 23:53:31 +0100 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com> References: <320fb6e00909150651w3fd6c32waca18e575021365f@mail.gmail.com> <8b34ec180909150759o6e09f712wc93758a002fa5f50@mail.gmail.com> <3f6baf360909150838g638c4ed6le0a7020aa3213c58@mail.gmail.com> <8b34ec180909151245p2a51130fn89f034b351704665@mail.gmail.com> <320fb6e00909151325v3e7a9becm5138ddb7f5880f82@mail.gmail.com> Message-ID: <6d941f120909161553l3f9bae6u5ba45e6cde9b33e3@mail.gmail.com> Hi, > I knew Tiago has a lump of code ready to go, and as we have > just discussed, as he would prefer to check that in via CVS. I just tested my stuff on Windows. It worked at first attempt. Strange... I actually have a few tests (18 to be precise). They all passed at first. Murphy's laws took a once-in-a-life vacation. I still have a minor problem. I will not have time to update the Tutorial before Tuesday. All is written in http://biopython.org/wiki/PopGen_dev_Genepop , which it will mostly become tutorial. But I simply don't have time until Tuesday to transpose. Code and tests will be committed today. Tiago From krother at rubor.de Thu Sep 17 08:40:28 2009 From: krother at rubor.de (Kristian Rother) Date: Thu, 17 Sep 2009 10:40:28 +0200 Subject: [Biopython-dev] Another Biopython release? Message-ID: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> Hi Peter, I could prepare 2-3 exemplary modules for parsing secondary structures + tests for the Bio.RNA package. As I've been using GIT so far, it would be most convenient to stick with it and contribute when the main archive has migrated. Or is it easy to "jump" to CVS on the last possible occasion? Best, Kristian From biopython at maubp.freeserve.co.uk Thu Sep 17 09:17:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 10:17:37 +0100 Subject: [Biopython-dev] Another Biopython release? 
In-Reply-To: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> References: <03de31722722ff2babeb218a011a5d8f-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWllYR15dWgA=-webmailer2@server01.webmailer.hosteurope.de> Message-ID: <320fb6e00909170217j24bab86eqae45440f72ed415e@mail.gmail.com> On Thu, Sep 17, 2009 at 9:40 AM, Kristian Rother wrote: > > Hi Peter, > > I could prepare 2-3 exemplary modules for parsing secondary structures + > tests for the Bio.RNA package. As I've been using GIT so far, it would be > most convenient to stick with it and contribute when the main archive has > migrated. Or is it easy to "jump" to CVS on the last possible occasion? > > Best, > ? Kristian My plan for this "quick release" was to mark an end to the CVS era, and not to include any of the really new stuff (like your code), but to wait until we are on git before looking at it. So keep it in git for now - this should also make the merge easier. Peter From biopython at maubp.freeserve.co.uk Thu Sep 17 11:27:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 12:27:24 +0100 Subject: [Biopython-dev] Indexing (large) sequence files with Bio.SeqIO In-Reply-To: <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com> References: <320fb6e00909080622g520caf07gb4a16d6e3faf1b61@mail.gmail.com> <184931.66541.qm@web62403.mail.re1.yahoo.com> <320fb6e00909080653v19015309wdda039ce0ef2d0a6@mail.gmail.com> <320fb6e00909160648n2affffa9sf291fe54088a7b88@mail.gmail.com> Message-ID: <320fb6e00909170427o37813aa7kd86464d9c8e81b36@mail.gmail.com> On Wed, Sep 16, 2009 at 2:48 PM, Peter wrote: > On Tue, Sep 8, 2009 at 2:53 PM, Peter wrote: >> On Tue, Sep 8, 2009 at 2:30 PM, Michiel de Hoon wrote: >>> ?On Tue, 9/8/09, Peter wrote: >>>> Did you have any ideas for a better name than >>>> Bio.SeqIO.indexed_dict()? >>> >>> Is indexed_dict a function? 
If so, I suggest we use a verb instead
>>> of a noun. Maybe just "index"?
>>
>> Bio.SeqIO.indexed_dict() is a function which returns a dictionary like
>> object. So yes, a verb would be better, and "index" is short and sweet.
>
> Any other comments? Otherwise I'll switch Bio.SeqIO.indexed_dict()
> to Bio.SeqIO.index() for the next release.

Done in CVS.

> Thinking ahead, in addition to the current code (indexing a file, keeping
> the index in memory) we might in future want to add something like
> Bio.SeqIO.sqlite_index() where the index is kept in a database etc.

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 12:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 13:02:18 +0100
Subject: [Biopython-dev] Using PendingDeprecation for obsolete modules
Message-ID: <320fb6e00909170502m14b4e599l66c778bfe67f3625@mail.gmail.com>

Hi all,

Right now we have a deprecation process which usually looks like this:

(1) Label as obsolete in docstrings
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

See: http://biopython.org/wiki/Deprecation_policy

I've relatively recently noticed the PendingDeprecationWarning (added in Python 2.3), which is by default silent, but the user can choose to enable it with the python command line switch -W. For example,

$ python
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
>>>

So, by default, no warning message. But if you ask for them:

$ python -W all
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> warnings.warn("X is obsolete", PendingDeprecationWarning)
__main__:1: PendingDeprecationWarning: X is obsolete
>>>

So, I am thinking that what we should be doing for deprecating modules is:

(1) Label as obsolete in docstrings, issue PendingDeprecationWarning
(2) Label as deprecated in docstrings, issue DeprecationWarning
(3) Remove code

I guess very few people know about pending deprecation warnings, and so are unlikely to even try using the warning switch. Therefore I have little inclination to go through all the current modules tagged as "obsolete" just to add this silent warning. However, if we simply start doing this in future, it really isn't any more work.

Any thoughts?

Peter

From winda002 at student.otago.ac.nz Fri Sep 18 03:52:11 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 18 Sep 2009 15:52:11 +1200
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AA58F3C.6080200@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz>
Message-ID: <4AB303EB.1010208@student.otago.ac.nz>

>>
>> Do you fancy writing a short unit test? e.g. test_Emboss_phylip.py
>> based on test_Emboss.py? Continuing on the github branch is fine.
>>

Well, it didn't end up being very short but there is a test on my "phylo" branch (http://github.com/dwinter/biopython/tree/phylo) in test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that I'd welcome comments on.

Writing them actually exposed a bug in the code already in CVS: the FProtParsCommandline option "-intreefile" isn't mandatory, so "is_required" should be set to 0 rather than 1.
In my defence the emboss documentation has it listed as being both mandatory and optional.

One possibly foolish thing I did was use TreeIO to test that the trees that came out of these programs made sense, thinking that module would be part of the next release. If the plan is for a new release soon and having a test for these wrappers is important, the tests could be done with Nexus.Trees, but I found that was difficult to use for files with multiple newick trees.

Cheers,
David

From biopython at maubp.freeserve.co.uk Fri Sep 18 09:26:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 10:26:59 +0100
Subject: [Biopython-dev] Tests for Emboss Phylip wrappers
In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz>
References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz>
Message-ID: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com>

On Fri, Sep 18, 2009 at 4:52 AM, David Winter wrote:
>
>>>
>>> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py
>>> based on test_Emboss.py? Continuing on the github branch is fine.
>>>
>
> Well, it didn't end up being very short but there is a test on my "phylo"
> branch (http://github.com/dwinter/biopython/tree/phylo) in
> test_PhylipNew.phy (which uses a couple of new files in Tests/Phylip) that
> I'd welcome comments on.

Cool - I'll take a look and try and get (some of) it merged into CVS for this release.

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1.
In my defence the emboss documentation has > it listed as being both mandatory and optional. How odd. Maybe EMBOSS switched it at some point? > One possibly foolish thing I did was use TreeIO to test the trees that came > out of these programs made sense, thinking that module would be part of the > next release. If the plan is for a new release soon and having a test for > these wrappers is important the tests could be done with Nexus.Trees but I > found that was difficult to use for files with multiple newick trees. Hmm. In the short term we can either comment out those bits of the test pending the inclusion of TreeIO in the next release, or add a quick tiny parser in the test itself to load the trees, split them on the ";" and pass them one by one to Bio.Nexus.Trees for parsing. Peter From biopython at maubp.freeserve.co.uk Fri Sep 18 11:09:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Sep 2009 12:09:24 +0100 Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug? Message-ID: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com> Hi Michiel (et al), I've been trying to get an example working using the Entrez history for ELink. Strangely here the URL doesn't use history=y but instead cmd=neighbor_history (while the default is cmd=neighbor). However, this appears to show a bug in the Bio.Entrez parser. Consider: from Bio import Entrez pmid = "14630660" print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history").read() This gives: pubmed 14630660 pmc pubmed_pmc_refs 1 NCID_1_2657216_130.14.18.53_9001_1253271778 The XML looks reasonable by eye - although quite different from the non-history version. 
Now if instead of printing that, I try and parse it:

>>> data = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs", from_uid=pmid, cmd="neighbor_history"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 210, in endElement
    current[name] = value
TypeError: 'str' object does not support item assignment

I can file a Biopython bug if you like, but my initial guess is the problem lies in the XML itself versus the eLink_020511.dtd file, which does not mention the LinkSetDbHistory element at all. Do you agree that this looks like an NCBI problem?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 18 11:40:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 18 Sep 2009 12:40:06 +0100
Subject: [Biopython-dev] Entrez ELink history - XML/DTD or Biopython bug?
In-Reply-To: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
References: <320fb6e00909180409s6fef5938u94731f00f6fd1d0b@mail.gmail.com>
Message-ID: <320fb6e00909180440p701d3f5ejd22a605f171989eb@mail.gmail.com>

On Fri, Sep 18, 2009 at 12:09 PM, Peter wrote:
> Hi Michiel (et al),
>
> I've been trying to get an example working using the Entrez history
> for ELink. Strangely here the URL doesn't use history=y but instead
> cmd=neighbor_history (while the default is cmd=neighbor).
>
> However, this appears to show a bug in the Bio.Entrez parser.
Consider:
>
> from Bio import Entrez
> pmid = "14630660"
> print Entrez.elink(dbfrom="pubmed", db="pmc", LinkName="pubmed_pmc_refs",
> from_uid=pmid, cmd="neighbor_history").read()
>
> This gives:
>
>
>  "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
>
>
>         pubmed
>
>                 14630660
>
>
>                 pmc
>                 pubmed_pmc_refs
>                 1
>
>         NCID_1_2657216_130.14.18.53_9001_1253271778
>
>
>
> The XML looks reasonable by eye - although quite different from
> the non-history version... but my initial guess is
> the problem lies in the XML itself versus the eLink_020511.dtd
> file, which does not mention the LinkSetDbHistory element at
> all. Do you agree that this looks like an NCBI problem?

I should have done this earlier - but two different XML validators both agree that the "history" version of the NCBI's ELink XML is invalid, while the default is fine.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor_history

versus

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660&cmd=neighbor

or:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pmc&dbfrom=pubmed&LinkName=pubmed_pmc_refs&id=14630660

I will get in touch with the NCBI...
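
In the meantime, the history values can be pulled out of the raw ELink XML with the standard library alone. A minimal sketch, not a fix for Bio.Entrez: apart from LinkSetDbHistory, the tag names below (eLinkResult, LinkSet, DbTo, LinkName, QueryKey, WebEnv) are assumed from the usual E-utilities naming because the tags did not survive in the output shown above, and extract_history is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample response for illustration. Only LinkSetDbHistory is named in the
# thread; every other tag name here is an ASSUMPTION about the ELink output.
SAMPLE = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>14630660</Id></IdList>
    <LinkSetDbHistory>
      <DbTo>pmc</DbTo>
      <LinkName>pubmed_pmc_refs</LinkName>
      <QueryKey>1</QueryKey>
    </LinkSetDbHistory>
    <WebEnv>NCID_1_2657216_130.14.18.53_9001_1253271778</WebEnv>
  </LinkSet>
</eLinkResult>"""


def extract_history(xml_text):
    """Return the history details from one ELink neighbor_history reply."""
    root = ET.fromstring(xml_text)
    link_set = root.find("LinkSet")
    history = link_set.find("LinkSetDbHistory")
    return {
        "WebEnv": link_set.findtext("WebEnv"),
        "QueryKey": history.findtext("QueryKey"),
        "DbTo": history.findtext("DbTo"),
        "LinkName": history.findtext("LinkName"),
    }


info = extract_history(SAMPLE)
print(info["WebEnv"], info["QueryKey"])
```

The WebEnv and QueryKey values are what a follow-up history-based request would need; the sketch only handles a single LinkSet with a single LinkSetDbHistory element.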
Peter From eric.talevich at gmail.com Fri Sep 18 14:08:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 18 Sep 2009 10:08:40 -0400 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909180226v49073526i65e1b3074ec30ef4@mail.gmail.com> Message-ID: <3f6baf360909180708w2d06c775w18922106bba003e@mail.gmail.com> On Fri, Sep 18, 2009 at 5:26 AM, Peter wrote: > On Fri, Sep 18, 2009 at 4:52 AM, David Winter > wrote: > > > One possibly foolish thing I did was use TreeIO to test the trees that > came > > out of these programs made sense, thinking that module would be part of > the > > next release. If the plan is for a new release soon and having a test for > > these wrappers is important the tests could be done with Nexus.Trees but > I > > found that was difficult to use for files with multiple newick trees. > > Hmm. In the short term we can either comment out those bits of the test > pending the inclusion of TreeIO in the next release, or add a quick tiny > parser in the test itself to load the trees, split them on the ";" and pass > them one by one to Bio.Nexus.Trees for parsing. > > That's all TreeIO does. 
The relevant loop is in NewickIO.parse(), if you'd like to copy it verbatim: http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py -Eric From biopython at maubp.freeserve.co.uk Sun Sep 20 11:20:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 20 Sep 2009 12:20:43 +0100 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <4AB303EB.1010208@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> Message-ID: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> On Fri, Sep 18, 2009 at 4:52 AM, David Winter wrote: >>> >>> Do you fancy writing a short unit tests? e.g. test_Emboss_phylip.py >>> based on test_Emboss.py? Continuing on the github branch is fine. >>> > > Well, it didn't end up being very short but there is a test on my "phylo" > branch (http://github.com/dwinter/biopython/tree/phylo) in > ?test_PhylipNew.phy ?(which uses a couple of new files in Tests/Phylip) that > I'd welcome comments on. I've checked in something based on the current version from github. I added a few checks for missing input files (I was getting cryptic errors), but then decided we had enough input files in the test suite already, and that it might be more useful to try writing alignments to the PHYLIP tools via stdin with AlignIO. Certainly at least one example should try this, assuming it works. I haven't done this yet - feel free to try. 
Note that the stdout from the PHYLIPNEW tools isn't clean, so we can't avoid having temp output files:
http://lists.open-bio.org/pipermail/emboss-dev/2009-September/000632.html

> Writing them actually exposed a bug in the code already in CVS, the
> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required"
> should be set to 0 rather than 1. In my defence the emboss
> documentation has it listed as being both mandatory and optional.

Fixed in CVS - does this affect any of the other tools using this argument?

> One possibly foolish thing I did was use TreeIO to test the trees that came
> out of these programs made sense, thinking that module would be part of the
> next release. If the plan is for a new release soon and having a test for
> these wrappers is important the tests could be done with Nexus.Trees but I
> found that was difficult to use for files with multiple newick trees.

I put a quick crude helper function into the unit test as discussed.

The unit test is working nicely on Linux with EMBOSS PHYLIP from CVS. I presume you are testing against an official release? If you could check that the CVS code works fine on your setup before the release, that would be great. There is a bit more time as I won't be able to do the release on Monday, but it should be Tuesday or Wednesday... and fingers crossed getting PHYLIPNEW installed on my Windows machine will be easy.

We can look at adding some more of your example input files, and uncommenting their tests later (especially for cases where we can't generate the input from Biopython directly). I did add the horses.tree file BTW.

Thank you David :)

Peter

From winda002 at student.otago.ac.nz Mon Sep 21 05:13:24 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:13:24 +1200
Subject: [Biopython-dev] draft release announcement
Message-ID: <4AB70B74.1040308@student.otago.ac.nz>

Hi guys,

A draft release announcement for 1.52 for you to look at and comment on.
This is written with the idea that there will be a blog post describing the convert and indexed_dict() methods for SeqIO which can be linked to so the announcement itself is pretty brief. I didn't mention the movement from CVS to git in the announcement which might be something worth adding? +++ We are pleased to announce the availability of Biopython 1.52, a new stable release of the Biopython library. It may only have been one month since the last release but in that time we've added enough useful features to warrant a new release. Biopython 1.52 will be of particular interest to people using next generation sequencing - new functions added to the AlignIO and SeqIO tools speed up the way very large sequence files can be dealt with and you can now write phd files like those created by Phred and used in 454 sequencing. SeqIO and AlignIO both now have a helper function called convert() that allows for simple, optimized conversion between file formats while SeqIO gets a new method called indexed_dict() which allows random access to sequences in a file without reading every record in that file into memory. The new release also adds command line wrappers for the EMBOSS versions of the phylip phylogeny programs and squashes a few minor bugs reported since 1.51 was released. Sources and a Windows Installer are available from the downloads page. Thanks to the Biopython development team and to everyone who has reported bugs since our last release ++++ From tiagoantao at gmail.com Mon Sep 21 05:17:39 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 21 Sep 2009 06:17:39 +0100 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> Message-ID: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> There is a big update to the PopGen module, which is now able to do frequentist statistics and tests through GenePop. 
I can draft one paragraph about the subject. I would imagine it is one of the biggest changes and probably the one that adds most functionality.

On Mon, Sep 21, 2009 at 6:13 AM, David Winter wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.
>
> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?
>
> +++
> We are pleased to announce the availability of Biopython 1.52, a new stable
> release of the Biopython library.
>
> It may only have been one month since the last release but in that time
> we've added enough useful features to warrant a new release. Biopython 1.52
> will be of particular interest to people using next generation sequencing -
> new functions added to the AlignIO and SeqIO tools speed up the way very
> large sequence files can be dealt with and you can now write phd files like
> those created by Phred and used in 454 sequencing.
>
> SeqIO and AlignIO both now have a helper function called convert() that
> allows for simple, optimized conversion between file formats while SeqIO
> gets a new method called indexed_dict() which allows random access to
> sequences in a file without reading every record in that file into memory.
>
> The new release also adds command line wrappers for the EMBOSS versions of
> the phylip phylogeny programs and squashes a few minor bugs reported since
> 1.51 was released.
>
> Sources and a Windows Installer are available from the downloads page.
> > Thanks to the Biopython development team and to everyone who has reported > bugs since our last release > > ++++ > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- " It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard From winda002 at student.otago.ac.nz Mon Sep 21 05:30:44 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 21 Sep 2009 17:30:44 +1200 Subject: [Biopython-dev] draft blog post for 1.52 stuff Message-ID: <4AB70F84.6000709@student.otago.ac.nz> As I mentioned in the draft release announcement it might be useful to have a a blog post up explaining how the new functions for SeqIO and AlignIO work (thanks to Peter for this idea). I've written a draft for a post that looks at the convert function that could do with a little more detail and ignores the indexed_dict() function entirely because I just don't have a good enough idea of how it works. Again, any comments are welcome. Is it a good idea to have a post like this or should we just extend the release announcement to include a little bit more detail? ++ It's only been a month since we released Biopython 1.51 but in that time the CVS server has stacked up enough cool new features that we are going to put together a new release soon. As ever the new functions will be documented in the official tutorial and cookbook but we thought we'd show off a few of these tools here Simple, optimized format conversion with SeqIO and AlignIO No one has ever complained that bioinformatics just doesn't have enough file formats - you probably frequently find yourself converting sequence files to suit particular applications with SeqIO. 
At the moment this is usually a two step process, something like this:

>>> records = SeqIO.parse(in_handle, "genbank")
>>> SeqIO.write(records, out_handle, "fasta")

As of Biopython 1.52 you'll be able to achieve the same result in a single step:

>>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")

Adding the convert function to SeqIO will make your scripts more readable and might even save you a couple of lines of code, but more importantly it allows the conversion process to be optimized for the two formats being used. In the above example we are moving from a genbank file, which might include multiple features for each sequence, to a fasta file, which doesn't include features. If we used the two step process above we'd be spending time reading each sequence's features into memory just to skip them when they get passed to the write function. SeqIO.convert() knows that the sequences in the input file are destined to be written to a fasta file so it can skip over the features and save a bit of time in doing the conversion.

Obviously, the optimization in SeqIO.convert() is most powerful when it's used on very large files like those produced in next generation sequencing projects. When converting between each of the FASTQ file format's variants with the "SeqIO two step" a significant amount of time is taken creating SeqRecord objects for each record in the input file, but none of the attributes or methods of the SeqRecord object are required to do the conversion. For this reason SeqIO.convert() deals with each record as two simple strings, one for the record's sequence, the other for its ID.

[some information on just how much time that saves on a big file should probably go here!]
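
To make the "plain strings instead of objects" idea concrete, here is a minimal sketch of a FASTQ to FASTA conversion that never builds a record object. It only illustrates the technique and is not the code behind SeqIO.convert(); the function name fastq_to_fasta is made up for this example, and it assumes simple four-line FASTQ records with no line wrapping.

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Copy FASTQ records to FASTA using plain strings only.

    Illustration only (hypothetical helper, not Biopython's code).
    Assumes each record is exactly four lines: @title, sequence, +, quality.
    Returns the number of records written.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        in_handle.readline()  # quality line, not needed for FASTA output
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        # Only the title and sequence strings are ever touched.
        out_handle.write(">%s\n%s\n" % (title[1:].rstrip(), seq.rstrip()))
        count += 1
    return count


fastq = io.StringIO("@r1 first read\nACGT\n+\nIIII\n@r2\nTTGA\n+\n!!!!\n")
fasta = io.StringIO()
print(fastq_to_fasta(fastq, fasta))
print(fasta.getvalue())
```

Because no per-record object is created, the cost per read is a handful of string operations, which is where the saving on multi-million read files would come from.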
+++

From winda002 at student.otago.ac.nz Mon Sep 21 05:45:34 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Mon, 21 Sep 2009 17:45:34 +1200
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
References: <4AB70B74.1040308@student.otago.ac.nz> <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com>
Message-ID: <4AB712FE.2060304@student.otago.ac.nz>

Tiago Antão wrote:
> There is a big update to the PopGen module, which is now able to do
> frequentist statistics and tests through GenePop. I can draft one
> paragraph about the subject. I would imagine it is one of the biggest
> changes and probably the one that adds most functionality.
>

Cool, I see now that I should've read the original thread about the new release more closely.

A paragraph from you on your PopGen code would be really helpful.

Cheers,
David

From tiagoantao at gmail.com Mon Sep 21 07:23:24 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 21 Sep 2009 08:23:24 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB712FE.2060304@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz> <6d941f120909202217j2b0bb818q78b01dffa6d02b2@mail.gmail.com> <4AB712FE.2060304@student.otago.ac.nz>
Message-ID: <6d941f120909210023v5dc91079s6ec54a04ad8385e7@mail.gmail.com>

Something along the lines of:

The Population Genetics module now allows the calculation of several tests and statistical estimators via a wrapper to GenePop. Supported are tests for Hardy-Weinberg equilibrium, linkage disequilibrium and estimates for various F statistics (Cockerham and Weir Fst and Fis, Robertson and Hill Fis, ...), null allele frequencies and number of migrants among many others. Isolation By Distance (IBD) functionality is also supported.

I suppose the changes to PopGen are the biggest going into this Biopython version and probably one of the highlights.
I should update the documentation ASAP. I intend to announce this version to some population genetics and evolutionary biology communities (something I have never done in the past).

On Mon, Sep 21, 2009 at 6:45 AM, David Winter wrote:
> Tiago Antão wrote:
>>
>> There is a big update to the PopGen module, which is now able to do
>> frequentist statistics and tests through GenePop. I can draft one
>> paragraph about the subject. I would imagine it is one of the biggest
>> changes and probably the one that adds most functionality.
>>
>
> Cool, I see now that I should've read the original thread about the new
> release more closely
>
> A paragraph from you on your PopGen code would be really helpful.
>
> Cheers,
> David
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

--
" It always takes ideology to consummate massive error." - Ambrose Evans-Pritchard

From biopython at maubp.freeserve.co.uk Mon Sep 21 09:01:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 21 Sep 2009 10:01:10 +0100
Subject: [Biopython-dev] draft release announcement
In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz>
References: <4AB70B74.1040308@student.otago.ac.nz>
Message-ID: <320fb6e00909210201u3d9032e5vf64ba2953d83938d@mail.gmail.com>

On Mon, Sep 21, 2009 at 6:13 AM, David Winter wrote:
> Hi guys,
>
> A draft release announcement for 1.52 for you to look at and comment on.
> This is written with the idea that there will be a blog post describing the
> convert and indexed_dict() methods for SeqIO which can be linked to so the
> announcement itself is pretty brief.

I switched indexed_dict() to just index() after discussion on the list.

> I didn't mention the movement from CVS to git in the announcement which
> might be something worth adding?
I think that would warrant a one line paragraph (near the end) :) Peter From biopython at maubp.freeserve.co.uk Mon Sep 21 09:11:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Sep 2009 10:11:17 +0100 Subject: [Biopython-dev] draft blog post for 1.52 stuff In-Reply-To: <4AB70F84.6000709@student.otago.ac.nz> References: <4AB70F84.6000709@student.otago.ac.nz> Message-ID: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> On Mon, Sep 21, 2009 at 6:30 AM, David Winter wrote: > As I mentioned in the draft release announcement it might be useful to have > a blog post up explaining how the new functions for SeqIO and AlignIO work > (thanks to Peter for this idea). > > I've written a draft for a post that looks at the convert function that > could do with a little more detail and ignores the indexed_dict() function > entirely because I just don't have a good enough idea of how it works. Great job - thanks for doing this. I'll tackle an indexing introduction blog post since you've done a nice job for convert :) It would also be worth mentioning that the convert function will also take filenames (not just handles), which also helps simplify simple conversion tasks. I should be able to provide some timings for things like FASTQ conversion, or FASTQ to FASTA on multi-million read files (there are probably some on the dev list already...). > Again, any comments are welcome. Is it a good idea to have a post like > this or should we just extend the release announcement to include a little > bit more detail? Well, as I mentioned the idea to David directly, I think these little motivational examples on the blog are worth trying out. What does everyone else think? 
Peter From biopython at maubp.freeserve.co.uk Mon Sep 21 17:41:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Sep 2009 18:41:40 +0100 Subject: [Biopython-dev] draft blog post for 1.52 stuff In-Reply-To: <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> References: <4AB70F84.6000709@student.otago.ac.nz> <320fb6e00909210211u1a02b142vb31ca2b0d995bb59@mail.gmail.com> Message-ID: <320fb6e00909211041n6378595cx39f2d395aee0ec7c@mail.gmail.com> On Mon, Sep 21, 2009 at 10:11 AM, Peter wrote: > On Mon, Sep 21, 2009 at 6:30 AM, David Winter > wrote: >> As I mentioned in the draft release announcement it might be useful to have >> a blog post up explaining how the new functions for SeqIO and AlignIO work >> (thanks to Peter for this idea). >> >> I've written a draft for a post that looks at the convert function that >> could do with a little more detail and ignores the indexed_dict() function >> entirely because I just don't have a good enough idea of how it works. > > Great job - thanks for doing this. I'll tackle an indexing introduction > blog post since you've done a nice job for convert :) Done, and up online - hopefully without typos: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ Peter From winda002 at student.otago.ac.nz Tue Sep 22 05:05:31 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 22 Sep 2009 17:05:31 +1200 Subject: [Biopython-dev] draft release announcement In-Reply-To: <4AB70B74.1040308@student.otago.ac.nz> References: <4AB70B74.1040308@student.otago.ac.nz> Message-ID: <4AB85B1B.2000704@student.otago.ac.nz> David Winter wrote: > Hi guys, > > A draft release announcement for 1.52 for you to look at and comment > on. This is written with the idea that there will be a blog post > describing the convert and indexed_dict() methods for SeqIO which can > be linked to so the > announcement itself is pretty brief. 
Thanks to Peter and Tiago for their suggestions, there is now a marked up version of this draft with those suggestions ready and waiting on to go on the blog. Still time for suggestions from anyone else. David From winda002 at student.otago.ac.nz Tue Sep 22 05:14:07 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 22 Sep 2009 17:14:07 +1200 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> Message-ID: <4AB85D1F.7010901@student.otago.ac.nz> Peter wrote: > > > >> Writing them actually exposed a bug in the code already in CVS, the >> FProtParsCommandline option "-intreefile" isn't mandatory so "is_required" >> should be set to 0 rather than 1. In my defence the emboss >> documentation has it listed as being both mandatory and optional. >> > > Fixed in CVS - does this affect any of the other tools using this argument? > Nope, I only slipped on this one ;) > >> One possibly foolish thing I did was use TreeIO to test the trees that came >> out of these programs made sense, thinking that module would be part of the >> next release. If the plan is for a new release soon and having a test for >> these wrappers is important the tests could be done with Nexus.Trees but I >> found that was difficult to use for files with multiple newick trees. >> > > I put a quick crude helper function into the unit test as discussed. 
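
For reference, the kind of helper discussed in this thread, splitting a file of several newick trees on ";" so that each tree can be handed to Bio.Nexus.Trees on its own, might look roughly like this. It is a sketch only, not the function that went into the unit test; split_newick is a hypothetical name, and a plain split like this would break on a ";" inside a quoted label.

```python
def split_newick(text):
    """Split text holding several newick trees into one string per tree.

    Hypothetical helper for illustration. Trees may span several lines;
    each returned string is a single line ending in ';'.
    """
    trees = []
    for chunk in text.split(";"):
        # Re-join a tree that was wrapped over several lines.
        tree = "".join(line.strip() for line in chunk.splitlines())
        if tree:
            trees.append(tree + ";")
    return trees


example = "(A,B);\n(A,(B,\nC));\n"
for tree in split_newick(example):
    print(tree)
```

Each returned string holds exactly one tree, which sidesteps the multiple-tree difficulty mentioned above.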
> > The unit test is working nicely on Linux with EMBOSS PHYLIP > from CVS, I presume you are testing against an official release? > If you could the CVS code works fine on your setup before the > release that would be great. Finally got in front of the right computer to do this. The tests in the (Biopython) CVS work fine with the official EMBOSS 6.1.0 release (on ubuntu if that helps). I'd offer to try it out on windows but I don't have EMBOSS, a compiler or and of the libraries that I'd need to do that! Cheers, David From biopython at maubp.freeserve.co.uk Tue Sep 22 09:23:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 10:23:10 +0100 Subject: [Biopython-dev] Tests for Emboss Phylip wrappers In-Reply-To: <4AB85D1F.7010901@student.otago.ac.nz> References: <20090901181704.17333c3ldldjyyxs@www.studentmail.otago.ac.nz> <320fb6e00909010221t3c199933u9002145557243190@mail.gmail.com> <20090902103804.12465tikhonlk764@www.studentmail.otago.ac.nz> <320fb6e00909040833v16c97a3bxdb80eebfa6b4fc91@mail.gmail.com> <320fb6e00909071034s1126b135l2b53f6d902748460@mail.gmail.com> <4AA58F3C.6080200@student.otago.ac.nz> <4AB303EB.1010208@student.otago.ac.nz> <320fb6e00909200420r44fd8848y36cb5182090c7243@mail.gmail.com> <4AB85D1F.7010901@student.otago.ac.nz> Message-ID: <320fb6e00909220223q6f079a39o74916d20291c3400@mail.gmail.com> On Tue, Sep 22, 2009 at 6:14 AM, David Winter wrote: > Peter wrote: >>> >>> Writing them actually exposed a bug in the code already in CVS, >>> the FProtParsCommandline option "-intreefile" isn't mandatory so >>> "is_required" should be set to 0 rather than 1. In my defence the >>> emboss documentation has it listed as being both mandatory and >>> optional. >> >> Fixed in CVS - does this affect any of the other tools using this >> argument? > > Nope, I only slipped on this one ;) Great. 
It looks like the tests have been useful already :) >> The unit test is working nicely on Linux with EMBOSS PHYLIP >> from CVS, I presume you are testing against an official release? >> If you could the CVS code works fine on your setup before the >> release that would be great. > > Finally got in front of the right computer to do this. The tests in the > (Biopython) CVS work fine with the official EMBOSS 6.1.0 release > (on ubuntu if that helps). Great - thank you. > I'd offer to try it out on windows but I don't > have EMBOSS, a compiler or and of the libraries that I'd need to > do that! Hmm - EMBOSS only provide a Windows installer for the core EMBOSS suite, not the extras like PHYLIP. I do have a C compiler and cygwin setup on my Windows machine, so it may work. We'll see... Peter From mjldehoon at yahoo.com Tue Sep 22 10:12:37 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 22 Sep 2009 03:12:37 -0700 (PDT) Subject: [Biopython-dev] Blast records Message-ID: <230712.78074.qm@web62406.mail.re1.yahoo.com> Hi everybody, I was looking at an older bug report about the plain-text and XML Blast parsers in Biopython: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 When I was checking the current behavior of Biopython's blast parsers, I noticed that the plain-text parser and the XML parser give different results when parsing psi-blast output. The plain-text parser returns a Blast.Record.PSIBlast object, whereas the XML parser returns Blast.Record.Blast objects. In addition, the XML parser misinterprets the psi-blast XML output (creating a separate Blast record for each psi-blast iteration), whereas the plain-text parser fails on psi-blast output of the current blast program. To fix this, I guess the first step is to decide whether a psi-blast parser should return a Blast.Record.Blast object or a Blast.Record.PSIBlast object. In theory having a Blast.Record.PSIBlast record seems more appropriate. 
However, this complicates the parser (it's not clear until halfway through the Blast output if it's Blast or Psi-Blast, which means the user has to tell the parser whether it's Blast or Psi-Blast), and the format of the XML output generated for Blast and Psi-Blast is the same. I would therefore suggest having one Blast.Record class that can contain both Blast and Psi-Blast output. Any other opinions, comments, suggestions? --Michiel. From biopython at maubp.freeserve.co.uk Tue Sep 22 11:40:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 12:40:46 +0100 Subject: [Biopython-dev] Blast records In-Reply-To: <230712.78074.qm@web62406.mail.re1.yahoo.com> References: <230712.78074.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> On Tue, Sep 22, 2009 at 11:12 AM, Michiel de Hoon wrote: > Hi everybody, > > When I was checking the current behavior of Biopython's blast parsers, > I noticed that the plain-text parser and the XML parser give different > results when parsing psi-blast output. The plain-text parser returns a > Blast.Record.PSIBlast object, whereas the XML parser returns > Blast.Record.Blast objects. ... > > Any other opinions, comments, suggestions? As I recall (backed up by what I wrote in the tutorial), when I last checked, the plain text PSI-BLAST output (i.e. from the command line tool blastpgp) included a lot of information missing in the XML output. Perhaps this has improved? If it hasn't, I am inclined to leave things as they are. If the current PSI-BLAST outputs more details in the XML we may be able to do a better job. The next bit is my recollection of some of the background to this: Classic BLAST (and also RPS-BLAST) allow multiple queries and use the "iteration" block in the XML file for each query.
This was an odd choice of naming, but I think the XML tag was originally only intended for the PSI-BLAST output where each "iteration" block in the XML corresponds to each step of the algorithm. You may recall early versions of BLAST would output "concatenated" XML files for multiple queries - which were not true XML files. I guess they fixed this by reusing the existing "iteration" structure for multiple queries (rather than adding new XML tags). With this in mind the current parsing of the XML from PSI-BLAST makes sense. [In any case, I plan to do Biopython 1.52 this afternoon, with the PSI BLAST parsing left as it is]. Peter From biopython at maubp.freeserve.co.uk Tue Sep 22 13:29:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 14:29:10 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 Message-ID: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> Hi all, As previously announced, I'm going to try and get Biopython 1.52 done this afternoon - and am now declaring a CVS freeze. If all goes to plan, once I've done the release CVS will remain "frozen", and we'll probably get it made read only on the server. Instead, we're going to try and switch over to git (initially on github with a backup on the OBF servers). Stay tuned for further announcements... Peter From p.j.a.cock at googlemail.com Tue Sep 22 16:38:21 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 17:38:21 +0100 Subject: [Biopython-dev] Biopython 1.52 released Message-ID: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> Dear all, Those of you who signed up to our newsfeed will know this already, but we are pleased to announce the release of Biopython 1.52: http://news.open-bio.org/news/2009/09/biopython-release-152/ Thank you to all our developers, including David Winter for drafting the release announcement, and everyone else who has contributed with feedback, bug reports etc.
Could I also take this opportunity to remind you all we have an application note out in the OUP journal Bioinformatics: http://news.open-bio.org/news/2009/03/biopython-paper-published/ http://dx.doi.org/10.1093/bioinformatics/btp163 In any scientific publication using Biopython, we kindly request you cite this, or another appropriate publication from this list: http://biopython.org/wiki/Documentation#Papers Thank you, Peter From biopython at maubp.freeserve.co.uk Tue Sep 22 16:42:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 17:42:49 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> Message-ID: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> On Tue, Sep 22, 2009 at 2:29 PM, Peter wrote: > Hi all, > > As previously announced, I'm going to try and get Biopython 1.52 > done this afternoon - and am now declaring a CVS freeze. > > If all goes to plan, once I've done the release CVS will remain > "frozen", and we'll probably get it made read only on the server. > Instead, we're going to try and switch over to git (initially on > github with a backup on the OBF servers). > > Stay tuned for further announcements... OK, the release is done. Let's leave things as they are for a day or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate with Bartek about the timings for the git transition. I am considering adding a warning message to setup.py and the readme file as the final commit to CVS, pointing out that we will be moving future development to a git repository. One of the first commit to git would be to remove that warning. Does that make sense? 
Peter From bartek at rezolwenta.eu.org Tue Sep 22 19:46:20 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 22 Sep 2009 21:46:20 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> Message-ID: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote: > > OK, the release is done. Let's leave things as they are for a day > or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate > with Bartek about the timings for the git transition. > > I am considering adding a warning message to setup.py and the > readme file as the final commit to CVS, pointing out that we will > be moving future development to a git repository. One of the first > commit to git would be to remove that warning. Does that make > sense? It seems OK to me. Let me know when you make the last commit, so that I turn off the scripts pushing CVS changes to github, which would be the only technical thing to do to make the transition. From then on, we should commit only to git. Bartek. From biopython at maubp.freeserve.co.uk Tue Sep 22 20:18:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Sep 2009 21:18:12 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> Message-ID: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> On Tue, Sep 22, 2009 at 8:46 PM, Bartek Wilczynski wrote: > On Tue, Sep 22, 2009 at 6:42 PM, Peter wrote: >> >> OK, the release is done. 
Let's leave things as they are for a day >> or so (NO MORE CVS CHECKINS PLEASE), then I will co-ordinate >> with Bartek about the timings for the git transition. >> >> I am considering adding a warning message to setup.py and the >> readme file as the final commit to CVS, pointing out that we will >> be moving future development to a git repository. One of the first >> commit to git would be to remove that warning. Does that make >> sense? > > It seems OK to me. Great. > Let me know when you make the last commit, so that I turn off > the scripts pushing CVS changes to github, ... Will do - I'll give it a day or so just in case we need to do a re-release for anything critical. > ... which would be the only technical thing to do to make the > transition. From then on, we should commit only to git. Yep - although I'll ask the OBF admins to make CVS read only as a precaution. Peter From p.j.a.cock at googlemail.com Tue Sep 22 20:20:54 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 21:20:54 +0100 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> Message-ID: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> > Dear all, > > Those of you who signed up to our newsfeed will know this already, > but we are pleased to announce the release of Biopython 1.52: > > http://news.open-bio.org/news/2009/09/biopython-release-152/ > > Thank you to all our developers, including David Winter for drafting > the release announcement, and everyone else who as contributed > with feedback, bug reports etc. Brad - if everything looks fine, can you do the PyPi upload now? 
Thanks, Peter From chapmanb at 50mail.com Tue Sep 22 20:42:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 22 Sep 2009 16:42:26 -0400 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> Message-ID: <20090922204226.GA13500@sobchak.mgh.harvard.edu> Hi Peter; Congrats to everyone on the release. Peter, thanks as always for all the hard work. > Brad - if everything looks fine, can you do the PyPi upload now? No problem, all set: http://pypi.python.org/pypi/biopython/ I am tempted to secretly commit something to CVS and then vehemently deny doing it to mess with everyone's head. Wait, so then how did the README file get changed? A mystery... Seriously, looking forward to the Git transition, Brad From p.j.a.cock at googlemail.com Tue Sep 22 21:24:11 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 22 Sep 2009 22:24:11 +0100 Subject: [Biopython-dev] Biopython 1.52 released In-Reply-To: <20090922204226.GA13500@sobchak.mgh.harvard.edu> References: <320fb6e00909220938u4a6528f0m2e602b821eb7b00b@mail.gmail.com> <320fb6e00909221320t75425d0aua5cc817eeba3604d@mail.gmail.com> <20090922204226.GA13500@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909221424t2cd67249pc1555c382c4f5597@mail.gmail.com> On Tue, Sep 22, 2009 at 9:42 PM, Brad Chapman wrote: > Hi Peter; > Congrats to everyone on the release. Peter, thanks as always for all > the hard work. > >> Brad - if everything looks fine, can you do the PyPi upload now? > > No problem, all set: > > http://pypi.python.org/pypi/biopython/ Lovely :) > I am tempted to secretly commit something to CVS and then vehemently > deny doing it to mess with everyone's head. Wait, so then how did the > README file get changed? A mystery... 
Well, unless you have another CVS account that we don't know about, it wouldn't be much of a mystery would it? Grin. > Seriously, looking forward to the Git transition, May you live in interesting times? But yeah - should be good. Peter From biopython at maubp.freeserve.co.uk Wed Sep 23 10:28:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 11:28:35 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> Message-ID: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote: > Bartek wrote: >> Let me know when you make the last commit, so that I turn off >> the scripts pushing CVS changes to github, ... > > Will do - I'll give it a day or so just in case we need to do a > re-release for anything critical. Hi Bartek, OK - I think that's it for final commits to CVS (a few notes about git, and finally adding the warning in setup.py). Not all of these changes have made it to github yet. We also need to 1.52 tag ("biopython-152") to get copied over. Once that is done, could you turn off your CVS to github script, and let us know by email? 
Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Sep 23 14:34:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 15:34:42 +0100 Subject: [Biopython-dev] Blast records In-Reply-To: <154350.7800.qm@web62402.mail.re1.yahoo.com> References: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> <154350.7800.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com> On Wed, Sep 23, 2009 at 2:51 PM, Michiel de Hoon wrote: > > --- On Tue, 9/22/09, Peter wrote: >> As I recall (backed up by what I wrote in the tutorial), >> when I last checked, the plain text PSI-BLAST output >> (i.e. from the command line tool blastpgp) included a >> lot of information missing in the XML output. Perhaps >> this has improved? If it hasn't, I am inclined to leave >> things as they are. If the current PSI-BLAST outputs >> more details in the XML we may be able to do a better job. > > As far as I can tell, the XML contains the same information > as the plain-text psiblast output, but the XML parser doesn't > parse it correctly, since it assumes it is dealing with regular > blast rather than psi-blast. It sounds like the NCBI have changed the PSI BLAST XML output then. >> The next bit is my recollection of some of the background >> to this: >> Classic BLAST (and also RPS-BLAST) allow multiple queries >> and use the "iterator" block in the XML file for each query. >> This was an odd choice of naming, but I think the XML tag was >> originally only intended for the PSI-BLAST outout where each >> "iteration" block in the XML corresponds to each step of the >> algorithm. You may recall early versions of BLAST would output >> "concatenated" XML files for multiple queries - which were not >> true XML files. > > That is correct. 
To make things more complex, if you run > psi-blast with multiple queries you get concatenated XML > files again, with the iteration blocks corresponding to the > psi-blast iterations for each query. Odd - and arguably a bug, since it isn't valid XML. >> I guess they fixed this by reusing the existing "iteration" >> structure for multiple queries (rather than adding new XML >> tags). With this in mind the current parsing of the XML from >> PSI-BLAST makes sense. > > I don't know if it really makes sense. For a single psi-blast > query, we're getting multiple Blast records. For multiple > psi-blast queries, we're iterating over the iteration blocks > while ignoring the fact that they can come from different > queries. Is a single Blast record object for each PSI-BLAST iteration such a bad thing? > Ideally, we should be able to see from the XML whether > it was regular blast with multiple queries, or psi-blast with > a single query. Right now that is possible by looking at > the query-def lines, but I wonder if NCBI is considering > a better solution for this. I'll write an email to them to find out. Certainly clarification from the NCBI sounds useful. Peter From mjldehoon at yahoo.com Wed Sep 23 13:51:04 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 23 Sep 2009 06:51:04 -0700 (PDT) Subject: [Biopython-dev] Blast records In-Reply-To: <320fb6e00909220440q338d9d78xf63903b7fc4603dc@mail.gmail.com> Message-ID: <154350.7800.qm@web62402.mail.re1.yahoo.com> --- On Tue, 9/22/09, Peter wrote: > As I recall (backed up by what I wrote in the tutorial), > when I last checked, the plain text PSI-BLAST output > (i.e. from the command line tool blastpgp) included a > lot of information missing in the XML output. Perhaps > this has improved? If it hasn't, I am inclined to leave > things as they are. If the current PSI-BLAST outputs > more details in the XML we may be able to do a better job. 
As far as I can tell, the XML contains the same information as the plain-text psiblast output, but the XML parser doesn't parse it correctly, since it assumes it is dealing with regular blast rather than psi-blast. > The next bit is my recollection of some of the background > to this: > Classic BLAST (and also RPS-BLAST) allow multiple queries > and use the "iterator" block in the XML file for each query. > This was an odd choice of naming, but I think the XML tag was > originally only intended for the PSI-BLAST outout where each > "iteration" block in the XML corresponds to each step of the > algorithm. You may recall early versions of BLAST would output > "concatenated" XML files for multiple queries - which were not > true XML files. That is correct. To make things more complex, if you run psi-blast with multiple queries you get concatenated XML files again, with the iteration blocks corresponding to the psi-blast iterations for each query. > I guess they fixed this by reusing the existing "iteration" > structure for multiple queries (rather than adding new XML > tags). With this in mind the current parsing of the XML from > PSI-BLAST makes sense. I don't know if it really makes sense. For a single psi-blast query, we're getting multiple Blast records. For multiple psi-blast queries, we're iterating over the iteration blocks while ignoring the fact that they can come from different queries. Ideally, we should be able to see from the XML whether it was regular blast with multiple queries, or psi-blast with a single query. Right now that is possible by looking at the query-def lines, but I wonder if NCBI is considering a better solution for this. I'll write an email to them to find out.
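The query-def check described above can be sketched roughly as follows. This is only an illustration, assuming well-formed XML that uses the Iteration, Iteration_iter-num and Iteration_query-def elements of the NCBI BLAST XML format. The helper names make_output and group_iterations_by_query and the sample data are made up for this sketch, and this is not how the Biopython XML parser works.

```python
import xml.etree.ElementTree as ET


def make_output(entries):
    """Build a tiny BLAST-like XML document from (iter_num, query_def) pairs.

    Hypothetical helper, used here only to create sample input.
    """
    body = "".join(
        "<Iteration>"
        "<Iteration_iter-num>%d</Iteration_iter-num>"
        "<Iteration_query-def>%s</Iteration_query-def>"
        "</Iteration>" % entry
        for entry in entries
    )
    return (
        "<BlastOutput><BlastOutput_iterations>%s"
        "</BlastOutput_iterations></BlastOutput>" % body
    )


def group_iterations_by_query(xml_text):
    """Map each query-def to the list of iteration numbers that carry it."""
    root = ET.fromstring(xml_text)
    groups = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        number = int(iteration.findtext("Iteration_iter-num"))
        groups.setdefault(query, []).append(number)
    return groups


# Assumption: a query-def repeated across iterations suggests PSI-BLAST
# rounds, while distinct query-defs suggest plain BLAST with several queries.
psi_groups = group_iterations_by_query(
    make_output([(1, "query_A"), (2, "query_A")])
)
multi_groups = group_iterations_by_query(
    make_output([(1, "query_A"), (2, "query_B")])
)

psi_like = any(len(nums) > 1 for nums in psi_groups.values())
multi_like = any(len(nums) > 1 for nums in multi_groups.values())
print(psi_groups, psi_like)
print(multi_groups, multi_like)
```

The heuristic would break down if two different queries shared the same definition line, which is one reason clarification from the NCBI would still be useful.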
--Michiel From bugzilla-daemon at portal.open-bio.org Wed Sep 23 14:47:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 23 Sep 2009 10:47:16 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909231447.n8NElGi8003751@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-23 10:47 EST ------- I've looked at PDB file 13GS in more detail, and this doesn't look like a bug in Biopython, but rather just another odd PDB file. Chains C and D are only three residue peptides, e.g.

ATOM 3301 N GLU D 1 16.854 13.061 10.252 1.00 65.68 N
ATOM 3302 CA GLU D 1 17.100 13.860 9.018 1.00 66.23 C
ATOM 3303 C GLU D 1 17.937 15.095 9.363 1.00 65.02 C
ATOM 3304 O GLU D 1 18.510 15.724 8.439 1.00 56.86 O
ATOM 3305 CB GLU D 1 15.764 14.279 8.389 1.00 66.35 C
ATOM 3306 CG GLU D 1 15.913 14.994 7.062 1.00 67.41 C
ATOM 3307 CD GLU D 1 14.584 15.456 6.508 1.00 68.72 C
ATOM 3308 OE1 GLU D 1 13.547 15.340 7.163 1.00 69.08 O
ATOM 3309 OXT GLU D 1 17.998 15.420 10.569 1.00 66.12 O
ATOM 3310 N CYS D 2 14.618 15.966 5.283 1.00 69.97 N
ATOM 3311 CA CYS D 2 13.431 16.483 4.614 1.00 70.18 C
ATOM 3312 C CYS D 2 13.374 15.898 3.213 1.00 69.53 C
ATOM 3313 O CYS D 2 14.409 15.625 2.610 1.00 65.61 O
ATOM 3314 CB CYS D 2 13.502 18.008 4.507 1.00 73.18 C
ATOM 3315 SG CYS D 2 14.485 18.841 5.796 1.00 76.47 S
ATOM 3316 N GLY D 3 12.166 15.713 2.693 1.00 71.49 N
ATOM 3317 CA GLY D 3 12.023 15.155 1.360 1.00 75.33 C
ATOM 3318 C GLY D 3 11.489 13.733 1.399 1.00 78.72 C
ATOM 3319 O GLY D 3 10.840 13.313 0.413 1.00 79.95 O
ATOM 3320 OXT GLY D 3 11.717 13.031 2.412 1.00 80.37 O
TER 3321 GLY D 3

Look at the C-alpha distances, (17.100, 13.860, 9.018) to (13.431, 16.483, 4.614) to (12.023, 15.155, 1.360) giving distances of 6.3 and 3.8:

>>> from math import sqrt
>>> import numpy
>>> a = numpy.array((17.100, 13.860, 9.018))
>>> b = numpy.array((13.431, 16.483, 4.614))
>>> c = numpy.array((12.023, 15.155, 1.360))
>>> sqrt(sum((a-b)**2))
6.3037215991825049
>>> sqrt(sum((b-c)**2))
3.7861014249488876

Clearly the first two residues in this "peptide" are very far apart, regardless of if you do a simple C-alpha distance (as here), or look at the backbone's N to C bonds. The "problem" for 13GS goes away if you relax the default distance threshold, e.g. use PPBuilder(10.0) instead of PPBuilder(). However, whatever affects 1A2D seems to be a different issue... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Wed Sep 23 15:10:32 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 23 Sep 2009 17:10:32 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> Message-ID: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> On Wed, Sep 23, 2009 at 12:28 PM, Peter wrote: > On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote: > OK - I think that's it for final commits to CVS (a few notes about > git, and finally adding the warning in setup.py). Not all of these > changes have made it to github yet. > > We also need to 1.52 tag ("biopython-152") to get copied over. > > Once that is done, could you turn off your CVS to github > script, and let us know by email? Ta-da! We are no longer synchronizing from CVS!
Please do not commit any changes to the CVS because they are not going to be transferred to git, which is now _the_ repository for biopython. Everyone with biopython CVS accounts is welcome to send their github logins (off the list) to me or Peter to get them added as biopython collaborators. cheers Bartek From biopython at maubp.freeserve.co.uk Wed Sep 23 15:16:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 16:16:19 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> Message-ID: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote: > On Wed, Sep 23, 2009 at 12:28 PM, Peter wrote: >> On Tue, Sep 22, 2009 at 9:18 PM, Peter wrote: >> OK - I think that's it for final commits to CVS (a few notes about >> git, and finally adding the warning in setup.py). Not all of these >> changes have made it to github yet. >> >> We also need to 1.52 tag ("biopython-152") to get copied over. >> >> Once that is done, could you turn off your CVS to github >> script, and let us know by email? > > Ta-da! We are no longer synchronizing from CVS! Lovely... but could you double check the last few commits made it? i.e. The final commit should be: setup.py CVS revision 1.174 date: 2009/09/23 10:06:08; author: peterc; state: Exp; lines: +8 -0 Adding a warning about CVS/git to setup.py (which we will remove once we switch to git) so people know they are using an out of date repository.
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Wed Sep 23 15:40:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 23 Sep 2009 11:40:00 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909231540.n8NFe0iU005670@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-23 11:39 EST ------- I think the problem with PDB file 1A2D is due to the atypical PYX residue,

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import is_aa
structure = PDBParser().get_structure('tmp', '1A2D.pdb')
for model in structure :
    for chain in model :
        for res in chain :
            if "CA" in res.child_dict and not is_aa(res) :
                print chain, res

The polypeptide code only looks at residues that pass the is_aa test, which means we can ignore things like water atoms associated with a chain. In this PDB file there are two residues which fail this test: According to the SEQADV and MODRES lines, these are modified CYS residues. Comparing this to the PDB provided FASTA file, a "C" is used (CYS). This leads me to believe the fix is to add the PYX -> C mapping to Biopython. [The dictionary used, to_one_letter_code, is actually defined in file Bio/SCOP/RAF.py for some historical reason.] Consulting the PDB documentation suggests that there are potentially many more examples like this of unknown HETATM entries which are modified amino acid residues... see: ftp://ftp.wwpdb.org/pub/pdb/data/monomers/ Christian - did you find any other problem PDB files? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
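A minimal sketch of the fix proposed in comment #3, using a much reduced stand-in for the three-letter to one-letter lookup. The names THREE_TO_ONE, looks_like_amino_acid and chain_sequence are hypothetical and are not Biopython's API (the real dictionary is to_one_letter_code and the real test is is_aa). The sketch also assumes that an unrecognised residue simply ends the current peptide fragment.

```python
# Hypothetical, much reduced three-letter -> one-letter table.
THREE_TO_ONE = {"GLU": "E", "CYS": "C", "GLY": "G"}


def looks_like_amino_acid(resname, table):
    """Return True if the residue name is a known amino acid code."""
    return resname.strip().upper() in table


def chain_sequence(resnames, table):
    """Build one-letter fragments, breaking at every unrecognised residue."""
    fragments, current = [], []
    for name in resnames:
        if looks_like_amino_acid(name, table):
            current.append(table[name.strip().upper()])
        elif current:
            # Unknown residue: close the fragment built so far.
            fragments.append("".join(current))
            current = []
    if current:
        fragments.append("".join(current))
    return fragments


# Made-up residue names for illustration, with PYX as the modified cysteine.
residues = ["GLU", "PYX", "GLY"]

# Without the mapping, the modified cysteine splits the peptide in two.
before = chain_sequence(residues, THREE_TO_ONE)

# Adding PYX -> C lets the residue pass the test and keeps the chain whole.
patched = dict(THREE_TO_ONE, PYX="C")
after = chain_sequence(residues, patched)

print(before)
print(after)
```

Copying the table before patching it keeps the original lookup untouched, which makes it easy to compare the sequences with and without the extra mapping.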
From bugzilla-daemon at portal.open-bio.org Wed Sep 23 15:47:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 23 Sep 2009 11:47:19 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909231547.n8NFlJ39005869@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #4 from schafer at rostlab.org 2009-09-23 11:47 EST ------- Peter, yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, I'll take a look at it and post them here. It's easy to do this. What I did is, I parsed the structures through the dssp structure assignment tool and compared the obtained sequence with that obtained from the Bio.PDB parser. Background: I wanted to map the sequence that dssp sees to atomic coordinates. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bartek at rezolwenta.eu.org Wed Sep 23 15:56:42 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 23 Sep 2009 17:56:42 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> Message-ID: <8b34ec180909230856u235a17ah437e578e02d5e6d3@mail.gmail.com> On Wed, Sep 23, 2009 at 5:16 PM, Peter wrote: > > Lovely... 
but could you double check the last few commits made it? Sure, your commit didn't make it to github at first, because it was just two minutes after the last scheduled synchronization. Now it's in github. cheers Bartek From biopython at maubp.freeserve.co.uk Wed Sep 23 16:04:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 17:04:30 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> Message-ID: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> On Wed, Sep 23, 2009 at 4:16 PM, Peter wrote: > On Wed, Sep 23, 2009 at 4:10 PM, Bartek Wilczynski wrote: >> >> Ta-da! We are no longer synchronizing from CVS! >> > > Lovely... but could you double check the last few commits made it? > i.e. The final commit should be: > > setup.py CVS revision 1.174 > date: 2009/09/23 10:06:08; author: peterc; state: Exp; lines: +8 -0 > Adding a warning about CVS/git to setup.py (which we will remove > once we switch to git) so people know they are using an out of date > repository. It has just shown up in the last few minutes :) I'm ready to make the first commit directly to github (removing the new warning from setup.py), assuming everything is fine on your end Bartek?
Peter From biopython at maubp.freeserve.co.uk Wed Sep 23 16:34:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Sep 2009 17:34:12 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> Message-ID: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> On Wed, Sep 23, 2009 at 5:04 PM, Peter wrote: > > I'm ready to make the first commit directly to github (removing the > new warning from setup.py), assuming everything is fine on your > end Bartek? OK - that's done now. Thank you Bartek. Ladies and Gentlemen, we are now running Biopython development with git :) Remember - CVS remains frozen (and I'll ask the OBF admins to make it read only to prevent any accidents). Now, let's make sure all the documentation and the wiki etc. are up to date, and make an official announcement on the news server. Those of you who already had CVS access, once you think you are happy with using git (i.e. you've had a play with your own local repository, and also ideally tried pushing changes to a personal repository on github), please ask for collaborator status on github. 
Peter From eric.talevich at gmail.com Thu Sep 24 03:48:49 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 23 Sep 2009 23:48:49 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch Message-ID: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> Folks, I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO modules and I'd like your opinion on what else should be done before merging this into the mainline. First, the wiki documentation for PhyloXML has an example pipeline showing how to build a phylogeny in Biopython, from a raw protein sequence to a lightly annotated phyloXML file. http://biopython.org/wiki/PhyloXML#Example_pipeline Does this look right? I copied the first few steps from the official docs. The source code, for your review, is here: http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/ http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/ http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py Discussion: *TreeIO* The read, parse, write and convert functions work essentially the same as in SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues: (1) 'phyloxml' uses a different object representation than the other two, so converting between those formats is not possible until Nexus.Trees is ported over to Bio.Tree. (2) NexusIO.write() just doesn't seem to work. I don't understand how to make the original Nexus module write out trees that it didn't parse itself. Help? *Tree* The BaseTree module is meant to be the basis for Newick trees eventually, so I'd like to get the design right with the minimum number of public methods: (1) The find() function, named after the Unix utility that does the same thing for directory trees, seems capable of all the iteration and filtering necessary for locating data and automatically adding annotations to a tree. 
There's a 'terminal' argument for selecting internal nodes, external nodes, or both, and I think this means get_leaf_nodes() is unnecessary. I'm going to remove it if no one protests. (2) Should find() be based on depth_first_search or breadth_first_search (not checked in yet)? DFS would potentially find a leaf node faster, but BFS seems more common in phylogenetics. Note that iteration can easily be reversed with the standard reversed() function, so we don't need extra functions for those cases. (3) I left room in each Node for the left and right indexes used by BioSQL's nested-set representation. Now I'm doubting the utility of that -- any Biopython function that uses those indexes would need to ensure that the index is up to date, which seems tricky. Shall I remove all mention of the nested-set representation, or try to support it fully? (4) There's some mention in the literature of a relationship-matrix representation for phylogenies. Does anyone here know how to work with this representation, or know if it would let us perform complex calculations with blinding speed behind the scenes? If so, should there be a function in Bio.Tree.Utils to export a tree to a NumPy array represented this way? If not, I'll forget about it. *Graphics* I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even usable. Plus, the nodes are now a pretty shade of blue. Still, it would be nice to have a Reportlab-based module in Bio.Graphics to print phylogenies in the way biologists are used to seeing them. Does anyone know of existing code that could be borrowed for this? I looked at ETE (announced on the main biopython list last week) and liked the examples, but it uses PyQt4 and a standalone GUI for display, which is a substantial departure from the Biopython way of doing things. 
Best regards, Eric From mjldehoon at yahoo.com Thu Sep 24 09:33:22 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 24 Sep 2009 02:33:22 -0700 (PDT) Subject: [Biopython-dev] Blast records In-Reply-To: <320fb6e00909230734k612c142cse6888a10c0de01b5@mail.gmail.com> Message-ID: <888743.69260.qm@web62408.mail.re1.yahoo.com> --- On Wed, 9/23/09, Peter wrote: > --- Michiel wrote: > > For a single psi-blast query, we're getting multiple Blast > > records. For multiple psi-blast queries, we're iterating over > > the iteration blocks while ignoring the fact that they can come > from different queries. > > Is a single Blast record object for each PSI-BLAST > iteration such a bad thing? > Well the plain-text PSI-BLAST parser returns a single Record.PSIBlast object containing all of the PSI-BLAST iterations, whereas the XML parser returns multiple Record.Blast objects. Ideally, the plain-text parser and the XML parser should return the same thing. --Michiel. From biopython at maubp.freeserve.co.uk Thu Sep 24 09:57:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 10:57:12 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> Message-ID: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> On Thu, Sep 24, 2009 at 4:48 AM, Eric Talevich wrote: > Discussion: > > *TreeIO* > The read, parse, write and convert functions work essentially the same as in > SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues: Great. One minor point - the docstring for Bio.TreeIO.parse() says: "This is only supported for formats that can represent multiple phylogenetic trees in a single file". Is that true, and if so why? For SeqIO and AlignIO you can use parse on a file with one entry, the iterator just returns one entry. Easy. 
This is important for allowing generic code (e.g. a loop) regardless of how many entries there are (one, many, or even zero). On a more general note, you seem to be recreating the file/handle logic in each of the individual parsers. I think it would make much more sense to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read() and Bio.TreeIO.write() functions *only* and have the underlying format specific code just use handles. This avoids the code duplication. [In fact, as I have said before, I prefer the simplicity of just allowing handles - and we should make TreeIO and SeqIO/AlignIO consistent] > (1) 'phyloxml' uses a different object representation than the other two, so > converting between those formats is not possible until Nexus.Trees is ported > over to Bio.Tree. I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming that phyloxml allows very minimal trees, the reverse as well). It does look like the best plan is to use the same tree objects for all three (updating Bio.Nexus if possible). Note that Bio.Nexus.Trees still has some useful methods you don't appear to support, like finding the last common ancestor and distances between nodes. > (2) NexusIO.write() just doesn't seem to work. I don't understand how to > make the original Nexus module write out trees that it didn't parse itself. > Help? To get the Newick tree, you can just call str(tree), which is basically what you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be more complicated. You'll need to create a minimal Nexus file - have a look at the Bio.AlignIO.NexusIO code. An alternative is to look at is having a hard coded nexus template, and just insert the tree as a Newick string (and insert the list of taxa?). Perhaps Frank or Cymon can advise us. 
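The hard coded template idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name newick_to_nexus, its arguments and the exact block layout are hypothetical and not taken from Bio.Nexus or Bio.AlignIO.NexusIO, and taxon labels are assumed to need no quoting (no spaces or punctuation).

```python
def newick_to_nexus(newick, taxa, tree_name="tree1"):
    """Wrap a Newick string in a minimal NEXUS file (hypothetical helper).

    newick    - tree as a Newick string, with or without the final ';'
    taxa      - list of taxon labels, assumed safe to write unquoted
    tree_name - label used for the tree inside the Trees block
    """
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    lines = [
        "#NEXUS",
        "Begin Taxa;",
        " Dimensions NTax=%i;" % len(taxa),
        " TaxLabels %s;" % " ".join(taxa),
        "End;",
        "Begin Trees;",
        " Tree %s = %s" % (tree_name, newick),
        "End;",
    ]
    return "\n".join(lines) + "\n"


nexus_text = newick_to_nexus("(A,(B,C))", ["A", "B", "C"])
```

A real writer would also have to quote awkward taxon names and might want a translate table, which is where the existing Nexus code would be the better guide.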
> *Tree > *The BaseTree module is meant to be the basis for Newick trees eventually, > so I'd like to get the design right with the minimum number of public > methods: > > (1) The find() function, named after the Unix utility that does the same > thing for directory trees, seems capable of all the iteration and filtering > necessary for locating data and automatically adding annotations to a tree. > There's a 'terminal' argument for selecting internal nodes, external nodes, > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going > to remove it if no one protests. I'm in two minds - iterating over the leaves (taxa) seems like a very common operation, and having an explicit method for this might be clearer than calling find with special arguments. > (2) Should find() be based on depth_first_search or breadth_first_search > (not checked in yet)? DFS would potentially find a leaf node faster, but BFS > seems more common in phylogenetics. Note that iteration can easily be > reversed with the standard reversed() function, so we don't need extra > functions for those cases. You could do both, either via an argument or having two methods, say depth_first_search and breadth_first_search instead of find. > (3) I left room in each Node for the left and right indexes used by BioSQL's > nested-set representation. Now I'm doubting the utility of that -- any > Biopython function that uses those indexes would need to ensure that the > index is up to date, which seems tricky. Shall I remove all mention of the > nested-set representation, or try to support it fully? A partial implementation doesn't seem helpful, and wastes memory allocating unused properties. I would remove it from the base Node, but a full implementation might be useful for something (would it be possible via a subclass?). On a related point, do you think a BioSQL TaxonTree subclass is possible? i.e. 
Something mimicking the new Tree objects (as a subclass), but which loads data on demand from the taxon tables in a BioSQL database? This would provide a nice way to work with the NCBI taxonomy (once loaded into BioSQL), which is a very large tree. For an example use case, I might want to extract just the bacteria as a subtree, and save that to a file. > (4) There's some mention in the literature of a relationship-matrix > representation for phylogenies. Does anyone here know how to work with this > representation, or know if it would let us perform complex calculations with > blinding speed behind the scenes? If so, should there be a function in > Bio.Tree.Utils to export a tree to a NumPy array represented this way? If > not, I'll forget about it. I don't know. > *Graphics* > I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even > usable. Plus, the nodes are now a pretty shade of blue. Still, it would be > nice to have a Reportlab-based module in Bio.Graphics to print phylogenies > in the way biologists are used to seeing them. Does anyone know of existing > code that could be borrowed for this? I looked at ETE (announced on the main > biopython list last week) and liked the examples, but it uses PyQt4 and a > standalone GUI for display, which is a substantial departure from the > Biopython way of doing things. I still haven't tracked down my old report lab code, but it wasn't object orientated and would need a lot of work to bring up to standard... 
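On the depth-first versus breadth-first question, the two orders differ only in whether pending nodes are kept on a stack or in a queue, so offering both behind one argument is cheap. The sketch below uses a stand-in Node class with a children list; the class, the function names and the order argument are assumptions for illustration, not the Bio.Tree API in the phyloxml branch.

```python
from collections import deque


class Node(object):
    """Stand-in tree node: a name plus a list of child nodes."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])


def depth_first_search(root):
    """Yield nodes in pre-order (parent first, then each subtree in turn)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))


def breadth_first_search(root):
    """Yield nodes level by level, starting from the root."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def find(root, order="depth"):
    """Iterate over the tree in the requested order ('depth' or 'breadth')."""
    if order == "depth":
        return depth_first_search(root)
    if order == "breadth":
        return breadth_first_search(root)
    raise ValueError("order must be 'depth' or 'breadth'")


tree = Node("R", [Node("X", [Node("A"), Node("B")]), Node("C")])
dfs_names = [n.name for n in find(tree, "depth")]
bfs_names = [n.name for n in find(tree, "breadth")]
```

For this small tree the depth-first order is R, X, A, B, C and the breadth-first order is R, X, C, A, B.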
Peter From biopython at maubp.freeserve.co.uk Thu Sep 24 10:23:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 11:23:34 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> Message-ID: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> On Wed, Sep 23, 2009 at 5:34 PM, Peter wrote: > > Now, let's make sure all the documentation and the wiki etc is up to date, > and make an official announcement on the news server. > How does this look for a draft news post (with links to wiki pages etc): The release of Biopython 1.52 earlier this week marked the end of an era: it was our last release using CVS for source code control. As of now, Biopython is using a git repository, hosted on github.com, who kindly provide git hosting for open source projects free of charge. The BioRuby project have been using github for some time now, so we are in good company. The existing OBF hosted CVS repository will be maintained in the short to medium term as a backup, but will not be updated. Although many people have been involved in this move, we'd like to thank Bartek Wilczynski in particular for handling the CVS to git conversion, and for mirroring our CVS updates to git during the transition period of the last few months. 
In the next few weeks hopefully we'll get our git usage wiki pages perfected, as we start using git for real. Peter From jhuerta at crg.es Thu Sep 24 10:45:21 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Thu, 24 Sep 2009 12:45:21 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: Hi, ( I'm the developer of ETE. ) I agree that PyQt4 is a significant dependency. I chose it because the Qt4 QGraphicsScene environment offers many possibilities like openGL rendering, unlimited image size, performance, and good bindings to python. However, I am working on my code to allow the rendering algorithm to use any other graphical library. So, you could render the same tree images using different backends. If you think this is useful for you, please let me know and we can think about how to integrate it with biopython. Regarding the GUI, it is not a standalone application but one more method within the Tree objects. The GUI can be started at any point of the execution and the main program will continue after you close it. I did it like this because I think it is quite useful for working within interactive python sessions. I develop a lot of code around tree handling, so if you think I can help, please tell me. jaime. > > *Graphics* > > I finally fixed the networkx/graphviz/matplotlib drawing to leave > unlabeled > > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps > even > > usable. Plus, the nodes are now a pretty shade of blue. Still, it would > be > > nice to have a Reportlab-based module in Bio.Graphics to print > phylogenies > > in the way biologists are used to seeing them. Does anyone know of > existing > > code that could be borrowed for this? 
I looked at ETE (announced on the > main > > biopython list last week) and liked the examples, but it uses PyQt4 and a > > standalone GUI for display, which is a substantial departure from the > > Biopython way of doing things. > > I still haven't tracked down my old report lab code, but it wasn't object > orientated and would need a lot of work to bring up to standard... > > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From bugzilla-daemon at portal.open-bio.org Thu Sep 24 11:14:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 07:14:37 -0400 Subject: [Biopython-dev] [Bug 2910] Bio.PDB build_peptides sometimes gives shorter peptide sequences than expected In-Reply-To: Message-ID: <200909241114.n8OBEbKH005629@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2910 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 07:14 EST ------- (In reply to comment #4) > Peter, > > yes, indeed, I had a couple of problematic pdb ids. As soon as I find the time, > I'll take a look at it and post them here. It's easy to do this. What I did is, > I parsed the structures through the dssp structure assignment tool and compared > the obtained sequence with that obtained from the Bio.PDB parser. Background: I > wanted to map the sequence that dssp sees to atomic coordinates. > If you can give us some more examples that would be very helpful, thank you. 
I have committed a partial fix which means any unknown modified amino acids (based on the presence of an alpha carbon) will be treated as an amino acid for building the peptide (and given the default sequence letter of X). This will also issue a warning. Any such previously unknown modified amino acid (like PYX) needs to be added to our hard coded lookup table with the appropriate single letter symbol as used by the PDB in their FASTA files (in this case, PYX -> C for cysteine). I suspect that some of your other problem PDB files still have (currently) undefined modified amino acids in them... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Sep 24 11:39:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 12:39:59 +0100 Subject: [Biopython-dev] Committing to github... Message-ID: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> Hi all, My last couple of commits to github have been from a local clone of the *official* repository: http://github.com/biopython/biopython/ This is a nice and simple work flow for small changes, and the history and github network graph are easy to understand: http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch This seems like the easiest way to work for people used to CVS, and you don't need to bother with your own Biopython cloned repository on github (you just need a github account and collaborator status). I'll probably continue to do this in the short term. -- However, prior to that I did a couple of commits via a local clone of *my* personal github repository, http://github.com/peterjc/biopython/ I had kept the master branch on *my* repository identical to the official master. 
However, while I was only pushing a tiny change, git did this as a merge - resulting in a flurry of RSS entries and a complicated looking git network diagram. I think it is probably just down to the way we've been using the repositories during the migration? With this backlog of merges done, I expect future commits by this route will look much cleaner... Peter From chapmanb at 50mail.com Thu Sep 24 12:08:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 24 Sep 2009 08:08:00 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <20090924120800.GJ13500@sobchak.mgh.harvard.edu> Eric and Peter; Looking forward to seeing the PhyloXML work merged into the main branch. Eric, thanks for posting the summary of where things are at. > > (1) 'phyloxml' uses a different object representation than the other two, so > > converting between those formats is not possible until Nexus.Trees is ported > > over to Bio.Tree. > > I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would > actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming > that phyloxml allows very minimal trees, the reverse as well). It does look > like the best plan is to use the same tree objects for all three (updating > Bio.Nexus if possible). Agreed that this would be nice to have, but I'm not sure why it's blocking getting the base TreeIO framework and all of PhyloXML into the main branch. That's a major step forward from the format specific phylogenetic code we had before and gets us a portion of the way there. Next up should be moving over Bio.Nexus to the new framework and then conversions, but this is another project. I think we should take this one step at a time. 
> Note that Bio.Nexus.Trees still has some useful methods you don't > appear to support, like finding the last common ancestor and distances > between nodes. Agreed. As we move Nexus over, we should be sure to keep current functionality. > > (1) The find() function, named after the Unix utility that does the same > > thing for directory trees, seems capable of all the iteration and filtering > > necessary for locating data and automatically adding annotations to a tree. > > There's a 'terminal' argument for selecting internal nodes, external nodes, > > or both, and I think this means get_leaf_nodes() is unnecessary. I'm going > > to remove it if no one protests. > > I'm in two minds - iterating over the leaves (taxa) seems like a very > common operation, and having an explicit method for this might be > clearer than calling find with special arguments. I'm for keeping it as well, and just having the underlying implementation of get_leaf_nodes call find with the right arguments. This seems like an operation that should be dead obvious to do. > > (3) I left room in each Node for the left and right indexes used by BioSQL's > > nested-set representation. Now I'm doubting the utility of that -- any > > Biopython function that uses those indexes would need to ensure that the > > index is up to date, which seems tricky. Shall I remove all mention of the > > nested-set representation, or try to support it fully? Again I agree with Peter here -- this would be best supported as a subclass that is database aware with an identical API, similar to how the Seq objects and BioSQL Seq objects work. This avoids any overhead for the in-memory case, which will be more common, but gives you a point to implement the useful database representation code in the future. If you don't have time to work on all of this right now, I'd leave the nested-set stuff out and keep it in mind as a future addition. 
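Keeping get_leaf_nodes() as a thin wrapper around find() might look like the following. This is a minimal sketch with a stand-in Node class: the meaning given to the 'terminal' argument here (True for leaves only, False for internal nodes only, None for both) and the helper names are assumptions, since the actual find() signature in the phyloxml branch is not shown in this thread.

```python
class Node(object):
    """Stand-in tree node: a name plus a list of child nodes."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        """A node with no children is a leaf."""
        return not self.children

    def _walk(self):
        """Yield this node and all its descendants, depth first."""
        yield self
        for child in self.children:
            for node in child._walk():
                yield node

    def find(self, terminal=None):
        """Iterate over matching nodes.

        terminal=True gives leaves only, terminal=False gives internal
        nodes only, terminal=None gives every node.
        """
        for node in self._walk():
            if terminal is None or node.is_terminal() == terminal:
                yield node

    def get_leaf_nodes(self):
        """Explicit, easy to discover shortcut for the common case."""
        return self.find(terminal=True)


tree = Node("R", [Node("X", [Node("A"), Node("B")]), Node("C")])
leaf_names = [n.name for n in tree.get_leaf_nodes()]
```

The wrapper adds one public method but no new logic, so the two cannot drift apart.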
Brad From biopython at maubp.freeserve.co.uk Thu Sep 24 12:48:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Sep 2009 13:48:37 +0100 Subject: [Biopython-dev] Git documentation on wiki Message-ID: <320fb6e00909240548q4db8dfc1l83be8408d3b8718f@mail.gmail.com> Hi all, I think I have updated the relevant wiki pages about the CVS to git migration. I have also made the "git" page redirect to the "Source Code" page, which is the main access point. This now has a quick summary with the basic links here for anyone wanting to grab the latest code: http://biopython.org/wiki/SourceCode If anyone spots any errors or typos, feel free to fix them or raise them here for discussion as needed. Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Sep 24 14:42:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:42:08 -0400 Subject: [Biopython-dev] [Bug 2894] Jython List difference causes failed assertion in CondonTable Fix+Patch In-Reply-To: Message-ID: <200909241442.n8OEg8Xo012359@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2894 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 10:42 EST ------- I've actually installed Jython 2.5.0 and checked this. A further fix was required, but this now works with the latest Biopython now in git. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 14:46:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:38 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkc1w012533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 10:46 EST ------- Testing with Jython 2.5.0 shows my fix didn't work. Reopening... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 14:46:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:49 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909241446.n8OEknEX012555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 Bug 2895 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 14:46:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:53 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkrFK012570@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 Bug 2893 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 14:46:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 10:46:55 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200909241446.n8OEkt93012582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 Bug 2892 depends on bug 2891, which changed state. Bug 2891 Summary: Jython test_NCBITextParser fix+patch http://bugzilla.open-bio.org/show_bug.cgi?id=2891 What |Old Value |New Value ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 15:11:22 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:22 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBM3q013469@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn|2890 | OtherBugsDependingO|2892, 2893, 2895 | nThis| | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-24 11:11 EST ------- Removing dependencies on other Jython bugs - they don't block each other. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 15:11:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:25 -0400 Subject: [Biopython-dev] [Bug 2890] Getting setup.py to work in Jython In-Reply-To: Message-ID: <200909241511.n8OFBPYu013482@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2890 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO|2891 | nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 15:11:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:40 -0400 Subject: [Biopython-dev] [Bug 2895] Bio.Restriction.Restriction_Dictionary Jython Error Fix+Patch In-Reply-To: Message-ID: <200909241511.n8OFBeug013513@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2895 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn|2891 | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 15:11:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:42 -0400 Subject: [Biopython-dev] [Bug 2893] Jython test_prosite fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBgcU013525@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2893 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn|2891 | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Sep 24 15:11:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 11:11:45 -0400 Subject: [Biopython-dev] [Bug 2892] Jython MatrixInfo.py fix+patch In-Reply-To: Message-ID: <200909241511.n8OFBj1e013540@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2892 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn|2891 | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Sep 24 16:10:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Sep 2009 12:10:30 -0400 Subject: [Biopython-dev] [Bug 2918] New: Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2918 Summary: Entrez parser fails on Jython - XMLParser lacks SetParamEntityParsing Product: Biopython Version: 1.52 Platform: All URL: http://bugs.jython.org/issue1447 OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk CC: kellrott at ucsd.edu I'm filing this as a bug report so we can track it, but the underlying issue is a known Jython bug, http://bugs.jython.org/issue1447 (thanks Kyle for reporting this already). It can be shown just by running our unit test: ~/jython2.5.0/jython run_tests.py test_Entrez.py test_Entrez ... 
FAIL
======================================================================
ERROR: Test parsing XML returned by EFetch, Journals database
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pjcock/repositories/biopython/Tests/test_Entrez.py", line 3443, in test_journals
    record = Entrez.read(input)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Users/pjcock/repositories/biopython/Bio/Entrez/Parser.py", line 85, in run
    self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
AttributeError: 'XMLParser' object has no attribute 'SetParamEntityParsing'
...

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Thu Sep 24 17:59:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 24 Sep 2009 18:59:06 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <20090924120800.GJ13500@sobchak.mgh.harvard.edu>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
    <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
    <20090924120800.GJ13500@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00909241059yfa43889w82c76cd7f2365dee@mail.gmail.com>

On Thu, Sep 24, 2009 at 1:08 PM, Brad Chapman wrote:
> Eric and Peter;
> Looking forward to seeing the PhyloXML work merged into the main
> branch. Eric, thanks for posting the summary of where things are at.
>
>> > (1) 'phyloxml' uses a different object representation than the other two, so
>> > converting between those formats is not possible until Nexus.Trees is ported
>> > over to Bio.Tree.
>> >> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would >> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming >> that phyloxml allows very minimal trees, the reverse as well). It does look >> like the best plan is to use the same tree objects for all three (updating >> Bio.Nexus if possible). > > Agreed that this would be nice to have, but I'm not sure why it's > blocking getting the base TreeIO framework and all of PhyloXML into > the main branch. That's a major step forward from the format > specific phylogenetic code we had before and gets us a portion of > the way there. If the Newick/Nexus TreeIO parsers return one object type while the PhyloXML TreeIO parser returns another *incompatible* object type, then we don't have a unified tree input/output framework. Furthermore, if you did release this and then later standardise on a single tree object, you'd break backwards compatibility. All in all, best avoided. > Next up should be moving over Bio.Nexus to the new framework and > then conversions, but this is another project. I think we should > take this one step at a time. What we could do in the short term is ignore Bio.Nexus.Trees, and just leave it as is. Instead of having the Newick/Nexus TreeIO code calling the old Bio.Nexus.Trees code, we just write some new code (possibly based on old code) which will use Eric's new objects. We could then (gradually, perhaps by adding a runtime option to the Nexus parsing API) move Bio.Nexus over from using the old Bio.Nexus.Trees code to the new TreeIO, and eventually deprecate and then remove Bio.Nexus.Trees. 
Peter From eric.talevich at gmail.com Fri Sep 25 03:54:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 24 Sep 2009 23:54:05 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Hello, Jaime, Sorry I didn't respond directly to your earlier post -- I wrote half of an e-mail, then realized I had no good suggestions on what to do so I scrapped it. My Tree and TreeIO code is basically a complete parser for the phyloXML format, plus a few base classes extracted out in hopes of eventually creating a unified set of format-independent objects, as in SeqIO and AlignIO. Your code for working with trees looks much more complete than mine, so if some of it can be incorporated into Biopython, I think that would be great. I see these issues with integration: 1. It's GPL, while Biopython uses a more permissive custom license resembling the BSD and MIT licenses. Would you be willing and able to relicense parts of your work for Biopython? 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will require some compatibility fixes -- not a huge problem. 3. Scipy and numpy dependencies: Numpy is considered a semi-optional dependency in Biopython, so if it can be imported on the fly by just the functions that need it (hopefully no core ones), that would be best. If not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so it would be better to make that an optional, on-the-fly import, too. 4. PyQt4 is a big package and I'm not sure it's as common in scientists' Python installations as numpy and scipy, so if the underlying algorithms for tree layout could be ported to Reportlab, matplotlib or PIL, that would be ideal. 
I personally would like to be able to pair sequence snippets with the leaves of a standard phylogram, so if you need me to do some additional work to get this section ported to Biopython, I'd consider it time well spent. 5. Presumably, the tree object type in ETE is different from Bio.Tree or Bio.Nexus, so porting the core tree manipulation code to Biopython would require a substantial effort somewhere. 6. The PhylomeDB connector is cool, and browsing the source, looks like it wouldn't require much effort at all to drop into Biopython. Thanks for letting us know about this. Cheers, Eric On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas wrote: > Hi, > > ( I'm the developer of ETE. ) > I agree that PyQt4 is an important dependence. I chose it because > Qt4-QGraphicsScene environment offers many possibilities like openGL > rendering, unlimited image size, performance, and good bindings to python. > However, I am working on my code to allow the rendering algorithm to use any > other graphical library. So, you could render the same tree images using > different backends. If you think this is useful for you, please let me know > and we can think how to integrat it with biopython. > Regarding the GUI, it is not a standalone application but one more method > within the Tree objects. The GUI can be started at any point of the > execution and the main program will continue after you close it. I did it > like this because I think is quite useful for working within interactive > python sessions. > > I develop a lot of code around tree handling, so if you think I can help, > please tell me. > jaime. > > > >> > *Graphics* >> > I finally fixed the networkx/graphviz/matplotlib drawing to leave >> unlabeled >> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps >> even >> > usable. Plus, the nodes are now a pretty shade of blue. 
Still, it would >> be >> > nice to have a Reportlab-based module in Bio.Graphics to print >> phylogenies >> > in the way biologists are used to seeing them. Does anyone know of >> existing >> > code that could be borrowed for this? I looked at ETE (announced on the >> main >> > biopython list last week) and liked the examples, but it uses PyQt4 and >> a >> > standalone GUI for display, which is a substantial departure from the >> > Biopython way of doing things. >> >> I still haven't tracked down my old report lab code, but it wasn't object >> orientated and would need a lot of work to bring up to standard... >> >> > > > > > > > >> Peter >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > > > -- > ========================= > Jaime Huerta-Cepas, Ph.D. > CRG-Centre for Genomic Regulation > Doctor Aiguader, 88 > PRBB Building > 08003 Barcelona, Spain > http://www.crg.es/comparative_genomics > ========================= > > From eric.talevich at gmail.com Fri Sep 25 04:34:17 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 25 Sep 2009 00:34:17 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> Message-ID: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> Hi Peter, Thanks for the feedback. On Thu, Sep 24, 2009 at 5:57 AM, Peter wrote: > > One minor point - the docstring for Bio.TreeIO.parse() says: "This is only > supported for formats that can represent multiple phylogenetic trees in a > single file". Is that true, and if so why? For SeqIO and AlignIO you can > use parse on a file with one entry, the iterator just returns one entry. > Easy. 
> This is important for allowing generic code (e.g. a loop) regardless of
> how many entries there are (one, many, or even zero).
>

I'll delete that sentence. I don't know why it's there -- you're right, it's easy to return an iterable regardless of what the format itself supports.

> On a more general note, you seem to be recreating the file/handle logic
> in each of the individual parsers. I think it would make much more sense
> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
> and Bio.TreeIO.write() functions *only* and have the underlying format
> specific code just use handles. This avoids the code duplication.
>

I did the handle management case-by-case because some of the underlying libraries already do filename-to-handle conversion -- ElementTree and Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of ad-hoc handle management, but of course I can move it all to the top if you think it's best. One day, perhaps we'll have a context manager that we can reuse everywhere to make magic easy:

with maybe_open(file) as handle:
    tree = FooIO.parse(handle)

Not today, though.

> > (1) 'phyloxml' uses a different object representation than the other two, so
> > converting between those formats is not possible until Nexus.Trees is ported
> > over to Bio.Tree.
>
> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it would
> actually let you do phyloxml -> newick, and phyloxml -> nexus (and assuming
> that phyloxml allows very minimal trees, the reverse as well). It does look
> like the best plan is to use the same tree objects for all three (updating
> Bio.Nexus if possible).
>

I could comment out the 'nexus' and 'newick' lines from the supported_formats dict. That would disable the top-level functions but leave the direct NexusIO and NewickIO equivalents intact until the port is complete.
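[Editor's sketch] The maybe_open context manager floated earlier in this message could look roughly like the following. This is only an illustration written for current Python, not code from the phyloxml branch: maybe_open is the hypothetical name used in the message, and the rule "a string is a filename, anything else is already a handle" is an assumption.

```python
import contextlib
import io


@contextlib.contextmanager
def maybe_open(source, mode="r"):
    """Yield a handle for `source` (hypothetical helper, not a Biopython API).

    Assumption: a str is a filename that we open and close ourselves;
    anything else is treated as an already open handle owned by the caller.
    """
    if isinstance(source, str):
        handle = open(source, mode)
        try:
            yield handle
        finally:
            # We opened the file, so we are responsible for closing it.
            handle.close()
    else:
        # Caller supplied the handle; leave closing to the caller.
        yield source


# Usage with an in-memory handle standing in for a tree file.
buffer = io.StringIO("(A,(B,C));")
with maybe_open(buffer) as handle:
    newick_text = handle.read()
```

With one helper like this at the top level, the format-specific parsers would only ever see handles, which is the single layer of filename/handle conversion discussed in this thread.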
> Note that Bio.Nexus.Trees still has some useful methods you don't
> appear to support, like finding the last common ancestor and distances
> between nodes.
>

That's intentional, I was just going to port those methods directly from Bio.Nexus.Trees rather than invent a new API myself.

Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are combined parsers and object representations. My goal is to chop out the pure-object parts and merge them into Bio.Tree, and let the remaining parsers return objects built from the new Bio.Tree classes. This looks like it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be done.

For backward compatibility, I'll leave some wrappers that trigger DeprecationWarnings in the original places. Nexus.Trees can probably be reduced to:

import warnings
warnings.warn("Use Bio.Tree and Bio.TreeIO instead", DeprecationWarning)
from Bio.Tree.Newick import *
from Bio.TreeIO.NewickIO import *

(more or less)

> > (2) NexusIO.write() just doesn't seem to work. I don't understand how to
> > make the original Nexus module write out trees that it didn't parse itself.
> > Help?
>
> To get the Newick tree, you can just call str(tree), which is basically what
> you are doing in Bio.TreeIO.NewickIO. To get a Nexus file is going to be
> more complicated. You'll need to create a minimal Nexus file - have a
> look at the Bio.AlignIO.NexusIO code. An alternative is to look at having
> a hard coded nexus template, and just insert the tree as a Newick string
> (and insert the list of taxa?). Perhaps Frank or Cymon can advise us.
>

OK, thanks, I'll give it a shot. I see some default Nexus template stuff in Bio.Nexus.Nexus already.
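[Editor's sketch] The "hard coded nexus template" suggestion quoted above could be tried along these lines. This is not the template in Bio.Nexus.Nexus; NEXUS_TEMPLATE and write_nexus_trees are hypothetical names, and the caller is assumed to supply the taxon labels and the Newick strings (for example from str(tree)).

```python
import io

# Minimal NEXUS skeleton: one Taxa block and one Trees block.
NEXUS_TEMPLATE = """\
#NEXUS
Begin Taxa;
 Dimensions NTax=%(count)d;
 TaxLabels %(labels)s;
End;
Begin Trees;
%(trees)s
End;
"""


def write_nexus_trees(newick_trees, taxa, handle):
    """Write Newick strings into a minimal NEXUS file (hypothetical helper).

    Returns the number of trees written.
    """
    tree_lines = []
    for number, newick in enumerate(newick_trees, start=1):
        # Drop any trailing semicolon so exactly one is emitted per tree.
        body = newick.strip().rstrip(";")
        tree_lines.append(" Tree tree%d = %s;" % (number, body))
    handle.write(NEXUS_TEMPLATE % {
        "count": len(taxa),
        "labels": " ".join(taxa),
        "trees": "\n".join(tree_lines),
    })
    return len(tree_lines)


# Usage: write one three-taxon tree to an in-memory handle.
output = io.StringIO()
written = write_nexus_trees(["(A,(B,C));"], ["A", "B", "C"], output)
nexus_text = output.getvalue()
```

Taxon labels containing spaces or punctuation would need quoting, which this sketch does not attempt.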
> > *Tree > > *The BaseTree module is meant to be the basis for Newick trees > eventually, > > so I'd like to get the design right with the minimum number of public > > methods: > > > > (1) The find() function, named after the Unix utility that does the same > > thing for directory trees, seems capable of all the iteration and > filtering > > necessary for locating data and automatically adding annotations to a > tree. > > There's a 'terminal' argument for selecting internal nodes, external > nodes, > > or both, and I think this means get_leaf_nodes() is unnecessary. I'm > going > > to remove it if no one protests. > > I'm in two minds - iterating over the leaves (taxa) seems like a very > common operation, and having an explicit method for this might be > clearer than calling find with special arguments. > I think .find(terminal=True) will do the right thing and looks reasonably simple, but as Brad said, this is a ridiculously common operation so finding it in the API should be ridiculously easy. I'll rename this function to get_leaves() and rename find() to findall() (to match ElementTree and make it clear that it returns an iterable). > > (3) I left room in each Node for the left and right indexes used by > BioSQL's > > nested-set representation. Now I'm doubting the utility of that -- any > > Biopython function that uses those indexes would need to ensure that the > > index is up to date, which seems tricky. Shall I remove all mention of > the > > nested-set representation, or try to support it fully? > > A partial implementation doesn't seem helpful, and wastes memory > allocating unused properties. I would remove it from the base Node, > but a full implementation might be useful for something (would it be > possible via a subclass?). > > On a related point, do you think a BioSQL TaxonTree subclass is possible? > i.e. Something mimicking the new Tree objects (as a subclass), but which > loads data on demand from the taxon tables in a BioSQL database? 
This > would provide a nice way to work with the NCBI taxonomy (once loaded > into BioSQL), which is a very large tree. For an example use case, I might > want to extract just the bacteria as a subtree, and save that to a file. > > Doing BioSQL integration was on the original roadmap, but research hasn't taken me back there lately. I would like to do it eventually... anyway, that would solve the indexing issue nicely. I'll drop the extra attributes -- I get the impression they're not meant to be accessed directly in BioSQL either, so there's no use for them in Biopython. Cheers, Eric From biopython at maubp.freeserve.co.uk Fri Sep 25 09:59:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 10:59:08 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> Message-ID: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich wrote: >> >> On a related point, do you think a BioSQL TaxonTree subclass is possible? >> i.e. Something mimicking the new Tree objects (as a subclass), but which >> loads data on demand from the taxon tables in a BioSQL database? This >> would provide a nice way to work with the NCBI taxonomy (once loaded >> into BioSQL), which is a very large tree. For an example use case, I might >> want to extract just the bacteria as a subtree, and save that to a file. >> > > Doing BioSQL integration was on the original roadmap, but research hasn't > taken me back there lately. I would like to do it eventually... anyway, that > would solve the indexing issue nicely. 
I'll drop the extra attributes -- I
> get the impression they're not meant to be accessed directly in BioSQL
> either, so there's no use for them in Biopython.

As things stand, there is no usage of the left/right index fields in Biopython. The current Biopython BioSQL code focusses on the database variants of the Seq and SeqRecord objects. The only interaction with the taxon tables is to load/retrieve the species annotations, and for this we don't need the complications of the left/right index. We leave them empty if we populate the taxonomy via Entrez (recalculating the left/right values is computationally expensive). However, any "DBTaxonTree" object (or whatever we call it) could potentially offer us a way to (a) populate and (b) use these alternative indexes as a way to speed up various subtree operations.

Peter

From biopython at maubp.freeserve.co.uk Fri Sep 25 10:08:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Sep 2009 11:08:56 +0100
Subject: [Biopython-dev] Code review request for phyloxml branch
In-Reply-To: <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com>
    <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com>
    <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com>
Message-ID: <320fb6e00909250308s35a286e7x67a7bb3fec6a0673@mail.gmail.com>

On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich wrote:
>> One minor point - the docstring for Bio.TreeIO.parse() says: "This is only
>> supported for formats that can represent multiple phylogenetic trees in a
>> single file". Is that true, and if so why? For SeqIO and AlignIO you can
>> use parse on a file with one entry, the iterator just returns one entry.
>> This is important for allowing generic code (e.g. a loop) regardless of
>> how many entries there are (one, many, or even zero).
>
> I'll delete that sentence.
I don't know why it's there -- you're right, it's
> easy to return an iterable regardless of what the format itself supports.

OK.

>> On a more general note, you seem to be recreating the file/handle logic
>> in each of the individual parsers. I think it would make much more sense
>> to put this logic in the top level Bio.TreeIO.parse(), Bio.TreeIO.read()
>> and Bio.TreeIO.write() functions *only* and have the underlying format
>> specific code just use handles. This avoids the code duplication.
>
> I did the handle management case-by-case because some of the underlying
> libraries already do filename-to-handle conversion -- ElementTree and
> Bio.Nexus, specifically. It seemed non-kosher to have multiple layers of
> ad-hoc handle management, but of course I can move it all to the top if you
> think it's best.

Having a single layer of handle/filename conversion in Bio.TreeIO does seem cleanest to me (even if some of the back ends allow either) and will ensure our code is consistent.

> One day, perhaps we'll have a context manager that we can
> reuse everywhere to make magic easy:
>
> with maybe_open(file) as handle:
>     tree = FooIO.parse(handle)
>
> Not today, though.

Not yet, no. For one thing we'll have to phase out Python 2.4 support.

>>> (1) 'phyloxml' uses a different object representation than the other two,
>>> so converting between those formats is not possible until Nexus.Trees
>>> is ported over to Bio.Tree.
>>
>> I think that is a blocker - I wouldn't want to release Bio.TreeIO until it
>> would actually let you do phyloxml -> newick, and phyloxml -> nexus
>> (and assuming that phyloxml allows very minimal trees, the reverse
>> as well). It does look like the best plan is to use the same tree objects
>> for all three (updating Bio.Nexus if possible).
>
> I could comment out the 'nexus' and 'newick' lines from the
> supported_formats dict.
That would disable the top-level functions > but leave the direct NexusIO and NewickIO equivalents intact until > the port is complete. I guess shipping a "phyloxml" only Bio.TreeIO would work, but it would be rather less useful. We could certainly start with just that on the trunk (i.e. initially no Bio.TreeIO.NewickIO and also no Bio.TreeIO.NexusIO modules - initially have just a single backend). >> Note that Bio.Nexus.Trees still has some useful methods you don't >> appear to support, like finding the last common ancestor and >> distances between nodes. > > That's intentional, I was just going to port those methods directly from > Bio.Nexus.Trees rather than invent a new API myself. OK - sounds good. > Currently, the Bio.Nexus.Nexus.Nexus and Nexus.Trees.Tree classes are > combined parsers and object representations. My goal is to chop out the > pure-object parts and merge them into Bio.Tree, and let the remaining > parsers return objects built from the new Bio.Tree classes. This looks like > it will be easier for Nexus.Trees than for Nexus.Nexus, but both should be > done. Sounds good - as with Bio.SeqIO and Bio.AlignIO, one of the goals has been to separate the data object from the (many possible) parsers. > For backward compatibility, I'll leave some wrappers that trigger > DeprecationWarnings in the original places. Nexus.Trees can > probably be reduced ... Something like that, sure. >>> (1) The find() function, named after the Unix utility that does the >>> same thing for directory trees, seems capable of all the iteration >>> and filtering necessary for locating data and automatically adding >>> annotations to a tree. There's a 'terminal' argument for selecting >>> internal nodes, external nodes, or both, and I think this means >>> get_leaf_nodes() is unnecessary. I'm going to remove it if no one >>> protests. 
>> >> I'm in two minds - iterating over the leaves (taxa) seems like a very >> common operation, and having an explicit method for this might be >> clearer than calling find with special arguments. > > I think .find(terminal=True) will do the right thing and looks reasonably > simple, but as Brad said, this is a ridiculously common operation so > finding it in the API should be ridiculously easy. I'll rename this function > to get_leaves() and rename find() to findall() (to match ElementTree > and make it clear that it returns an iterable). OK. Peter From hlapp at gmx.net Fri Sep 25 11:39:03 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 25 Sep 2009 07:39:03 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242134t48273951pd648d4b1a3ef38bb@mail.gmail.com> <320fb6e00909250259o1df2e763w42a64d3f1646c8d@mail.gmail.com> Message-ID: On Sep 25, 2009, at 5:59 AM, Peter wrote: > On Fri, Sep 25, 2009 at 5:34 AM, Eric Talevich > wrote: >>> >>> On a related point, do you think a BioSQL TaxonTree subclass is >>> possible? >>> i.e. Something mimicking the new Tree objects (as a subclass), but >>> which >>> loads data on demand from the taxon tables in a BioSQL database? >>> This >>> would provide a nice way to work with the NCBI taxonomy (once loaded >>> into BioSQL), which is a very large tree. For an example use case, >>> I might >>> want to extract just the bacteria as a subtree, and save that to a >>> file. >>> >> >> Doing BioSQL integration was on the original roadmap, but research >> hasn't >> taken me back there lately. I would like to do it eventually... >> anyway, that >> would solve the indexing issue nicely. 
I'll drop the extra >> attributes -- I >> get the impression they're not meant to be accessed directly in >> BioSQL >> either, so there's no use for them in Biopython. > > As things stand, there is no usage of the left/right index fields in > Biopython. The left/right fields are really a crutch for doing hierarchical (recursive) queries in SQL more efficiently. SQL doesn't have native support for recursive queries, and the left/right index values allow you to rewrite an otherwise recursive query as a single-hit set. Within an object-oriented programming language that supports recursion these values are of no use - they don't let you traverse a tree faster than you would already be able to do through recursing up or down your tree data structure. If there's a natural order of nodes, you can speed up finding nodes through binary search. But for pulling out lineages or subtrees I doubt that this will help at all - it'll have to be your data structure (such as having double links) that makes those operations efficient. 
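[Editor's sketch] To illustrate the point above about left/right (nested-set) values: the numbering lets a database pick out a whole subtree with one range condition, while an in-memory tree gets the same answer by plain recursion. The dict-of-children tree and all function names below are hypothetical and have nothing to do with the BioSQL or Bio.Tree code.

```python
def assign_nested_set(children, root):
    """Return {node: (left, right)} from a depth-first numbering."""
    index = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter          # number assigned on the way down
        counter += 1
        for child in children.get(node, []):
            visit(child)
        index[node] = (left, counter)  # number assigned on the way back up
        counter += 1

    visit(root)
    return index


def subtree_by_index(index, node):
    """Select a subtree with a single range test, as an SQL query would."""
    left, right = index[node]
    return sorted(n for n, (lo, hi) in index.items()
                  if left <= lo and hi <= right)


def subtree_by_recursion(children, node):
    """Select the same subtree by walking the data structure."""
    found = [node]
    for child in children.get(node, []):
        found.extend(subtree_by_recursion(children, child))
    return sorted(found)


# A toy taxonomy: parent -> list of children.
children = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["E. coli", "B. subtilis"],
    "Eukaryota": ["Human"],
}
index = assign_nested_set(children, "root")
via_index = subtree_by_index(index, "Bacteria")
via_recursion = subtree_by_recursion(children, "Bacteria")
```

Both calls return the same three nodes. The range test scans every indexed node, whereas the recursion only touches the subtree itself, which matches the remark that the index values buy nothing once the tree is already an in-memory data structure.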
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri Sep 25 12:26:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 13:26:38 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.52 In-Reply-To: <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> References: <320fb6e00909220629o5b37514aw6dd640765aa7bedc@mail.gmail.com> <320fb6e00909220942s4ee69fecm643576962411f72d@mail.gmail.com> <8b34ec180909221246ua6b31adr2122172ca386f2cd@mail.gmail.com> <320fb6e00909221318web5780dkd348f674f9dae6e4@mail.gmail.com> <320fb6e00909230328r2fb86399l81a937fcb6523212@mail.gmail.com> <8b34ec180909230810o3ea28193l8b0eb247db8b9922@mail.gmail.com> <320fb6e00909230816y22d86037i1366d419b006e6c4@mail.gmail.com> <320fb6e00909230904m3ed5773cmf23dd2afe1ab88fd@mail.gmail.com> <320fb6e00909230934n67cfedd9hc9ceafca0a40eaab@mail.gmail.com> <320fb6e00909240323o40c4b180naa7f28654149232d@mail.gmail.com> Message-ID: <320fb6e00909250526s294eee65ubbc508136f26f48a@mail.gmail.com> On Thu, Sep 24, 2009 at 11:23 AM, Peter wrote: > On Wed, Sep 23, 2009 at 5:34 PM, Peter wrote: >> >> Now, let's make sure all the documentation and the wiki etc is up to date, >> and make an official announcement on the news server. > > How does this look for a draft news post (with links to wiki pages etc): > > The release of Biopython 1.52 earlier this week marked the end of an > era, it was our last release using CVS for source code control. ... 
I went ahead and posted something based on that draft: http://news.open-bio.org/news/2009/09/biopython-cvs-to-git-migration/ Nice to see several more people have started following the github repository already :) Peter From jhuerta at crg.es Fri Sep 25 15:28:36 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Fri, 25 Sep 2009 17:28:36 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: Hi Eric, Thanks for your comments, I really see a lot of potential parts in ETE that could be used from biopython, however, for the moment, we would rather prefer not to modify current ETE's GPL license. As far as I know, the main difference between GPL and BSD-like licenses is that, with the second, you could relicense the code at any moment under any other policy, including private and close licenses. GPL includes a protection for this by ensuring that any code based on GPL sources must be always GPL compatible, and that's why we have chosen it. Moreover, the use of a BSD-like license would prevent us to use a lot of great GPL code out there. It is not my purpose to open a debate about licenses. I just wonder if biopython could provide any way to link/bind external software, perhaps as addons or plugins. This would be great, since many extra features (not only from ETE but from other sources) could be added on specific demands. This would also mitigate the problem of very specific dependencies, since many of them would be optional. From my side, I could work for providing bindings between biopython and ETE's tree graphical rendering features, inline visualization GUI, extended newick support, tree manipulation and the methods within the ETE package. 
I will be out of the office for several weeks, but if you see any way to collaborate I will be happy to discuss this a bit more in detail... Cheers! Jaime On Fri, Sep 25, 2009 at 5:54 AM, Eric Talevich wrote: > Hello, Jaime, > > Sorry I didn't respond directly to your earlier post -- I wrote half of an > e-mail, then realized I had no good suggestions on what to do so I scrapped > it. > > My Tree and TreeIO code is basically a complete parser for the phyloXML > format, plus a few base classes extracted out in hopes of eventually > creating a unified set of format-independent objects, as in SeqIO and > AlignIO. Your code for working with trees looks much more complete than > mine, so if some of it can be incorporated into Biopython, I think that > would be great. > > I see these issues with integration: > 1. It's GPL, while Biopython uses a more permissive custom license > resembling the BSD and MIT licenses. Would you be willing and able to > relicense parts of your work for Biopython? > > 2. Python 2.5 dependency: Biopython still supports Py2.4, so this will > require some compatibility fixes -- not a huge problem. > > 3. Scipy and numpy dependencies: Numpy is considered a semi-optional > dependency in Biopython, so if it can be imported on the fly by just the > functions that need it (hopefully no core ones), that would be best. If > not... we can discuss. Scipy isn't used anywhere else in Biopython yet, so > it would be better to make that an optional, on-the-fly import, too. > > 4. PyQt4 is a big package and I'm not sure it's as common in scientists' > Python installations as numpy and scipy, so if the underlying algorithms for > tree layout could be ported to Reportlab, matplotlib or PIL, that would be > ideal. I personally would like to be able to pair sequence snippets with the > leaves of a standard phylogram, so if you need me to do some additional work > to get this section ported to Biopython, I'd consider it time well spent. > > 5. 
Presumably, the tree object type in ETE is different from Bio.Tree or > Bio.Nexus, so porting the core tree manipulation code to Biopython would > require a substantial effort somewhere. > > 6. The PhylomeDB connector is cool, and browsing the source, looks like it > wouldn't require much effort at all to drop into Biopython. > > Thanks for letting us know about this. > > Cheers, > Eric > > > > On Thu, Sep 24, 2009 at 6:45 AM, Jaime Huerta Cepas wrote: > >> Hi, >> >> ( I'm the developer of ETE. ) >> I agree that PyQt4 is an important dependence. I chose it because >> Qt4-QGraphicsScene environment offers many possibilities like openGL >> rendering, unlimited image size, performance, and good bindings to python. >> However, I am working on my code to allow the rendering algorithm to use any >> other graphical library. So, you could render the same tree images using >> different backends. If you think this is useful for you, please let me know >> and we can think how to integrat it with biopython. >> Regarding the GUI, it is not a standalone application but one more method >> within the Tree objects. The GUI can be started at any point of the >> execution and the main program will continue after you close it. I did it >> like this because I think is quite useful for working within interactive >> python sessions. >> >> I develop a lot of code around tree handling, so if you think I can help, >> please tell me. >> jaime. >> >> >> >>> > *Graphics* >>> > I finally fixed the networkx/graphviz/matplotlib drawing to leave >>> unlabeled >>> > nodes inconspicuous, so the resulting graphic is much cleaner, perhaps >>> even >>> > usable. Plus, the nodes are now a pretty shade of blue. Still, it would >>> be >>> > nice to have a Reportlab-based module in Bio.Graphics to print >>> phylogenies >>> > in the way biologists are used to seeing them. Does anyone know of >>> existing >>> > code that could be borrowed for this? 
I looked at ETE (announced on the >>> main >>> > biopython list last week) and liked the examples, but it uses PyQt4 and >>> a >>> > standalone GUI for display, which is a substantial departure from the >>> > Biopython way of doing things. >>> >>> I still haven't tracked down my old report lab code, but it wasn't object >>> orientated and would need a lot of work to bring up to standard... >>> >>> >> >> >> >> >> >> >> >>> Peter >>> >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> >> >> -- >> ========================= >> Jaime Huerta-Cepas, Ph.D. >> CRG-Centre for Genomic Regulation >> Doctor Aiguader, 88 >> PRBB Building >> 08003 Barcelona, Spain >> http://www.crg.es/comparative_genomics >> ========================= >> >> > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From eric.talevich at gmail.com Fri Sep 25 15:51:15 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 25 Sep 2009 11:51:15 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> Hi Jaime, Just working on bindings would certainly be easier. The best way to transfer tree information from Biopython to ETE would be serializing the trees in phyloXML format (to preserve the annotations) and loading that file in ETE. 
I see that ETE allows rich annotation of tree objects, but I don't see phyloXML or NeXML listed as supported file formats -- is there another standard format you're using to store this information? If not, I think ETE would benefit from a phyloXML parser. Since Biopython license is GPL-compatible (I believe), you could borrow Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes. Beyond that, some support for BioSQL to store sequences etc. would also help link ETE to any of the other Bio* projects. There's some example code in Biopython's top-level BioSQL directory, if you're interested. Cheers, Eric From jhuerta at crg.es Fri Sep 25 16:13:44 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Fri, 25 Sep 2009 18:13:44 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> Message-ID: Hi, > Just working on bindings would certainly be easier. The best way to > transfer tree information from Biopython to ETE would be serializing the > trees in phyloXML format (to preserve the annotations) and loading that file > in ETE. I see that ETE allows rich annotation of tree objects, but I don't > see phyloXML or NeXML listed as supported file formats -- is there another > standard format you're using to store this information? Extended newick (http://www.phylosoft.org/NHX/) is the only rich format currently supported by ETE, however only text string representation of tree node annotations are allowed by this standard. Beyond this, you should use a cpickle approach to save complex annotated trees. I'm certainly interested in PhyloXML and NexML support, so, for sure, this could be a nice starting point. If not, I think ETE would benefit from a phyloXML parser. Since Biopython > license is GPL-compatible (I believe), you could borrow > Bio.TreeIO.PhyloXMLIO directly and just port the Phylogeny and Clade classes > to ETE's base classes instead of Bio.Tree.BaseTree's Tree and Node classes. > I think there is no problem in using BSD license from GPL sources, the problem would be in the other way around.
Then I will take a look at your phyloxml code to find the best way to bind both packages through phyloXML serialization. > Beyond that, some support for BioSQL to store sequences etc. would also > help link ETE to any of the other Bio* projects. There's some example code > in Biopython's top-level BioSQL directory, if you're interested. > Ok. I'll take a look also. Thanks. cheers, Jaime. -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From biopython at maubp.freeserve.co.uk Fri Sep 25 16:22:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Sep 2009 17:22:40 +0100 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <3f6baf360909250851r34a65287wa9680653f1698755@mail.gmail.com> Message-ID: <320fb6e00909250922y858c172xf1ee51f7673a4fe2@mail.gmail.com> On Fri, Sep 25, 2009 at 5:13 PM, Jaime Huerta Cepas wrote: > > I think there is no problem in using BSD license from GPL sources, the > problem would be in the other way around. > Yes, that way round is fine from a license point of view (taking Biopython's BSD/MIT style licensed code and using it in a GPL project). But we can't take your GPL code into Biopython unless you re-license it more liberally. I can see the appeal of the (L)GPL for forcing the code to stay open, but Biopython (like Python) went for the other option of basically letting anyone use the code in anyway they like. Peter From hlapp at gmx.net Fri Sep 25 20:58:36 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 25 Sep 2009 16:58:36 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: On Sep 25, 2009, at 11:28 AM, Jaime Huerta Cepas wrote: > As far as I know, the main difference between GPL and BSD-like > licenses is that, with the second, you could relicense the code at > any moment under any other policy, including private and close > licenses.
This is not true. None of the open-source licenses that I'm aware of allows anyone to relicense code under a license that is less liberal, or to relicense code at all. It is the copyright owner who can relicense code, not the distributor. One of the differences between GPL and BSD is that GPL is viral. Specifically, code that links to GPL-licensed code must also be GPL- licensed *when it is distributed.* (It is a common misconception that GPL is unconditionally viral. I can take GPL code and link to it and keep my code closed source for as long as I please if I never redistribute it. GPL was written with software vendors in mind, whose business consists of distributing software for commercial gain. GPL has therefore sometimes been called anti-commercial. This is wrong, too, but I won't go into the details here.) Biopython can freely utilize GPL-licensed (or closed source, for that matter) software if it doesn't link to it. IANAL but I think it can also redistribute GPL-licensed code along with Biopython so long as Biopython doesn't link to it, and it is made clear that some of the distribution falls under a different license than BSD. (Linux distributions mix BSD and GPL software, too.) As for ETE itself, a BSD/MIT style license seems to be the by far most widely used license for Python modules. If you want to facilitate adoption of the software as a library by other programmers, GPL is going to stand in the way of that. Also, really all that you are accomplishing with GPL is that a software company can't take advantage of ETE. Is that your chief concern? GPL won't prevent any scientific lab from writing closed source code that builds on ETE and publishing the results, so long as they don't distribute their closed source code. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From chapmanb at 50mail.com Fri Sep 25 21:48:00 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 25 Sep 2009 17:48:00 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: <20090925214800.GE29829@sobchak.mgh.harvard.edu> Hi all; Hilmar -- thanks for writing up a nice summary of the license details. Jaime, I think it's a shame we would let these issues prevent working together. It sounds like you and Eric have some shared goals and it would be great to see that evolve into some useful functionality in Biopython. Generally, the BSD-like license which Biopython uses encourages cooperation and keeps people at both academia and industry happy. As scientists, our goal should be to avoid letting these types of issues preventing collaboration. Truthfully, there is very little opportunity for exploitation of bioinformatics software; the economics are just not there for companies to sell code. > (It is a common misconception that GPL is unconditionally viral. I can > take GPL code and link to it and keep my code closed source for as > long as I please if I never redistribute it. GPL was written with > software vendors in mind, whose business consists of distributing > software for commercial gain. GPL has therefore sometimes been called > anti-commercial. This is wrong, too, but I won't go into the details > here.) I agree 100%, but in practical terms it is very difficult to have this argument at a company. 
Speaking from experience, GPL creates all kinds of nasty thoughts in people's heads which prevents adoption of code in corporate environments. For Biopython and other bioinformatics projects, we should be actively encouraging contributions from companies as well as academia. > Biopython can freely utilize GPL-licensed (or closed source, for that > matter) software if it doesn't link to it. IANAL but I think it can > also redistribute GPL-licensed code along with Biopython so long as > Biopython doesn't link to it, and it is made clear that some of the > distribution falls under a different license than BSD. (Linux > distributions mix BSD and GPL software, too.) Yes, but this complication is bad. Let's keep it simple, Brad From bugzilla-daemon at portal.open-bio.org Fri Sep 25 22:48:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Sep 2009 18:48:13 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909252248.n8PMmDa9028782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1214 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-25 18:48 EST ------- (From update of attachment 1214) Checked into git, leaving this bug open until we've run some more tests on this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Sep 26 11:36:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 26 Sep 2009 07:36:45 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909261136.n8QBajsI014127@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-26 07:36 EST ------- We'll also need to update the SeqIO GenBank output to record the CONTIG string if present. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hlapp at gmx.net Sat Sep 26 15:25:41 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 26 Sep 2009 11:25:41 -0400 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <20090925214800.GE29829@sobchak.mgh.harvard.edu> Message-ID: On Sep 25, 2009, at 5:48 PM, Brad Chapman wrote: > I agree 100%, but in practical terms it is very difficult to have this > argument at a company. Yes, I know. > For Biopython and other bioinformatics projects, we should be > actively encouraging contributions from companies as well as academia. Having worked in commercial and private sector for almost a decade, I couldn't agree more. There is a huge amount of open-source code development contributed by people working in the private sector, and which is hence sponsored by companies. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jhuerta at crg.es Sat Sep 26 17:12:59 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Sat, 26 Sep 2009 19:12:59 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> Message-ID: Hey! Sorry, It was not my intention to open a flame about licences nor to sound rude. I apologize if I did. > As far as I know, the main difference between GPL and BSD-like licenses is >> that, with the second, you could relicense the code at any moment under any >> other policy, including private and close licenses. >> > > > This is not true. None of the open-source licenses that I'm aware of allows > anyone to relicense code under a license that is less liberal, or to > relicense code at all. It is the copyright owner who can relicense code, not > the distributor. > > I'm not an expert on software licences, so I can not enter into this issue very deeply. What I said in my previous email is what I could understand from these info: http://www.gnu.org/philosophy/license-list.html, http://www.gnu.org/philosophy/categories.html#Non-CopyleftedFreeSoftware If I was wrong and modified BSD-like sources cannot be relicensed under other less liberal licenses, then we will kindly consider a change of the ETE license in the future. > One of the differences between GPL and BSD is that GPL is viral. > Specifically, code that links to GPL-licensed code must also be GPL-licensed > *when it is distributed.* > > (It is a common misconception that GPL is unconditionally viral. 
I can take > GPL code and link to it and keep my code closed source for as long as I > please if I never redistribute it. GPL was written with software vendors in > mind, whose business consists of distributing software for commercial gain. > GPL has therefore sometimes been called anti-commercial. This is wrong, too, > but I won't go into the details here.) > I see, so the only problem is about distribution... > Biopython can freely utilize GPL-licensed (or closed source, for that > matter) software if it doesn't link to it. IANAL but I think it can also > redistribute GPL-licensed code along with Biopython so long as Biopython > doesn't link to it, and it is made clear that some of the distribution falls > under a different license than BSD. (Linux distributions mix BSD and GPL > software, too.) > Yes, I agree. This is what I meant by Biopython addons. With this in mind, Biopython could be aware of much other software out there and benefit from it. Is there any work on this in Biopython? > As for ETE itself, a BSD/MIT style license seems to be the by far most > widely used license for Python modules. If you want to facilitate adoption > of the software as a library by other programmers, GPL is going to stand in > the way of that. Also, really all that you are accomplishing with GPL is > that a software company can't take advantage of ETE. Is that your chief > concern? Well, our intention was that code based on ETE sources (other tools or improvements) would also be distributed/published as free software. We also wanted to leave the door open to using other GPL software from ETE. > GPL won't prevent any scientific lab from writing closed source code that > builds on ETE and publishing the results, so long as they don't distribute > their closed source code. Yes. You are right. We don't want to prevent this. In any case, thanks for your comments. I will try to get more info about what you say and, if we have to modify something, we will do it.
:) cheers, Jaime > > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From jhuerta at crg.es Sat Sep 26 17:28:02 2009 From: jhuerta at crg.es (Jaime Huerta Cepas) Date: Sat, 26 Sep 2009 19:28:02 +0200 Subject: [Biopython-dev] Code review request for phyloxml branch In-Reply-To: <20090925214800.GE29829@sobchak.mgh.harvard.edu> References: <3f6baf360909232048u54a63ce5q2adbd0e18ebd7036@mail.gmail.com> <320fb6e00909240257w764b36d2h9f1691fa6eed4f1d@mail.gmail.com> <3f6baf360909242054w643346bamb566610c81e1c7f3@mail.gmail.com> <20090925214800.GE29829@sobchak.mgh.harvard.edu> Message-ID: Hi Brad, Jaime, I think it's a shame we would let these issues > prevent working together. It sounds like you and Eric have some > shared goals and it would be great to see that evolve into some > useful functionality in Biopython. > Sure!! My only intention was to find the best way to contribute! However, the choice of a "viral" GPL license was specifically chosen for exactly this reason: encouraging free software and academic scientific resources. We have a lot shared goals, so I trust we will find a happy way to colaborate. Jaime. -- ========================= Jaime Huerta-Cepas, Ph.D. CRG-Centre for Genomic Regulation Doctor Aiguader, 88 PRBB Building 08003 Barcelona, Spain http://www.crg.es/comparative_genomics ========================= From jblanca at btc.upv.es Mon Sep 28 11:36:14 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 28 Sep 2009 13:36:14 +0200 Subject: [Biopython-dev] fpc and gff Message-ID: <200909281336.14794.jblanca@btc.upv.es> Sorry for the previous incomplete mail. 
:( Hi: I'm interested in parsing an fpc physical map and writing a gff3 file from it. That's done by the fpc people in bioperl and they go from fpc to gff2. I would like to do it in python. I've written the fpc parser looking at the bioperl one. You can take a look at: http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py Now I have to create the gff structure and writer. I've been reading Brad's code regarding the GFF parser and writer. I would like to integrate my fpc work as much as posible with biopython and if you like it we could add the fpc to Biopython in the future. But I have not a clear idea on the relation between GFF and SeqFeature. The main problem is the subfeature and the gff feature hierarchy. My take on that at the moment is to write a GFFfeature class similar to the gff feature with seqid, source, type, start, end, score, etc. and go from the fpc to GFFFeature objects. I know that this would not integrate nicely with BioPython. Could you give some hint on how to do it in a proper way? Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From jblanca at btc.upv.es Mon Sep 28 11:28:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 28 Sep 2009 13:28:06 +0200 Subject: [Biopython-dev] fpc and gff Message-ID: <200909281328.06817.jblanca@btc.upv.es> Hi: I'm interested in parsing an fpc physical map and writing a gff3 file from it. That's done by the fpc people in bioperl and they go from fpc to gff2. I would like to do it in python. I've written the fpc parser looking at the bioperl one. You can take a look at: -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Mon Sep 28 11:52:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 12:52:56 +0100 Subject: [Biopython-dev] fpc and gff In-Reply-To: <200909281336.14794.jblanca@btc.upv.es> References: <200909281336.14794.jblanca@btc.upv.es> Message-ID: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> On Mon, Sep 28, 2009 at 12:36 PM, Jose Blanca wrote: > Sorry for the previous incomplete mail. :( > > Hi: > I'm interested in parsing an fpc physical map and writing a gff3 file from it. > That's done by the fpc people in bioperl and they go from fpc to gff2. I > would like to do it in python. > I've written the fpc parser looking at the bioperl one. You can take a look > at: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/fpc.py > > Now I have to create the gff structure and writer. I've been reading Brad's > code regarding the GFF parser and writer. I would like to integrate my fpc > work as much as posible with biopython and if you like it we could add the > fpc to Biopython in the future. > But I have not a clear idea on the relation between GFF and SeqFeature. The > main problem is the subfeature and the gff feature hierarchy. My take on that > at the moment is to write a GFFfeature class similar to the gff feature with > seqid, source, type, start, end, score, etc. and go from the fpc to > GFFFeature objects. I know that this would not integrate nicely with > BioPython. Could you give some hint on how to do it in a proper way? > Best regards, Right now there isn't a "proper way" as Brad's GFF code hasn't been integrated into Biopython yet. 
I think Brad was thinking of using the SeqFeature object "as is" to hold GFF features, with the sub-features list used for the hierarchy. Michiel and I had suggested a simpler structure more faithful to the GFF model might be useful - even if it was just a standardised tuple of the start, end, strand, id, etc, and an annotation dictionary). For the SeqIO interface, these GFF features would have to be turned into normal SeqFeature objects of course. Peter From chapmanb at 50mail.com Mon Sep 28 12:52:38 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 28 Sep 2009 08:52:38 -0400 Subject: [Biopython-dev] fpc and gff In-Reply-To: <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> References: <200909281336.14794.jblanca@btc.upv.es> <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> Message-ID: <20090928125238.GG29829@sobchak.mgh.harvard.edu> Jose; Glad you're interested in working on this. I'm happy to get the GFF3 writing up to speed for this task. > > I'm interested in parsing an fpc physical map and writing a gff3 file from it. [...] > > But I have not a clear idea on the relation between GFF and SeqFeature. The > > main problem is the subfeature and the gff feature hierarchy. My take on that > > at the moment is to write a GFFfeature class similar to the gff feature with > > seqid, source, type, start, end, score, etc. and go from the fpc to > > GFFFeature objects. > Right now there isn't a "proper way" as Brad's GFF code hasn't > been integrated into Biopython yet. Yes, we still have some flexibility here since it hasn't been merged into Biopython yet, so let's talk about what works best. > I think Brad was thinking of using the SeqFeature object "as is" to hold > GFF features, with the sub-features list used for the hierarchy. 
What exists now takes an iterator of SeqRecord objects, and writes each
SeqFeature as a GFF3 line:

seqid -- SeqRecord ID
source -- Feature qualifier with key "source"
type -- Feature type attribute
start, end -- The Feature Location
score -- Feature qualifier with key "score"
strand -- Feature strand attribute
phase -- Feature qualifier with key "phase"

The remaining qualifiers become the key/value pairs of the final
(attributes) column. The hierarchy is represented as sub_features of
the parent feature. This handles any arbitrarily deep nesting of parent
and child features. There is some really basic code on the
documentation page:

http://biopython.org/wiki/GFF_Parsing#Writing_GFF3

> Michiel and I had suggested a simpler structure more faithful to the
> GFF model might be useful - even if it was just a standardised tuple
> of the start, end, strand, id, etc, and an annotation dictionary). For
> the SeqIO interface, these GFF features would have to be turned
> into normal SeqFeature objects of course.

This could also be useful for a more lightweight representation. I
would rather see this type of representation with primary Python
types, as opposed to a GFFFeature specific class. The current
SeqRecord/SeqFeature implementation is relatively close to what
a GFF specific class would be so there would be a lot of duplication
without saving much in terms of speed or memory.

Jose, let me know if you'd rather go with a SeqRecord approach or a
lightweight approach. If you provide a couple of examples of the
features you want to store, we can work through how to best represent
those in the GFF hierarchy and then the details of prepping them for
writing.
Brad From biopython at maubp.freeserve.co.uk Mon Sep 28 13:10:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 14:10:22 +0100 Subject: [Biopython-dev] fpc and gff In-Reply-To: <20090928125238.GG29829@sobchak.mgh.harvard.edu> References: <200909281336.14794.jblanca@btc.upv.es> <320fb6e00909280452j75ebf714ne2e92dc2eb990f43@mail.gmail.com> <20090928125238.GG29829@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00909280610q75f7bf4eqae49a1fb6d7eae38@mail.gmail.com> On Mon, Sep 28, 2009 at 1:52 PM, Brad Chapman wrote: > >> Michiel and I had suggested a simpler structure more faithful to the >> GFF model might be useful - even if it was just a standardised tuple >> of the start, end, strand, id, etc, and an annotation dictionary). For >> the SeqIO interface, these GFF features would have to be turned >> into normal SeqFeature objects of course. > > This could also be useful for a more lightweight representation. I > would rather see this type of representation with primary Python > types, as opposed to a GFFFeature specific class. The current > SeqRecord/SeqFeature implementations is relatively close to what > a GFF specific class would be so there would be a lot of duplication > without saving much in terms of speed or memory. Indeed. Which is why I quite like the idea of a simple tuple of ints, strings and a dict for the annotation (the final column of a GFF file). This should also be fast for people dealing with big GFF files. The other plus point here is we can get this (GFF parsing/writing using basic Python objects) into Biopython first, and then look at the SeqIO side of things more carefully as a second merge. I may be overly cautious but I want the resulting GFF <-> SeqRecord <-> GenBank/EMBL/etc mapping to try and follow established practice as closely as possible, which will need lots of testing and probably some tweaking of this mapping. i.e. 
To me there is a natural break between the basics of GFF parsing/writing, and the transformation into our existing object models. [This applies to all file formats in principle, but most are so simple that it isn't really an issue worth worrying about.] Peter From bugzilla-daemon at portal.open-bio.org Mon Sep 28 19:37:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 28 Sep 2009 15:37:21 -0400 Subject: [Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file In-Reply-To: Message-ID: <200909281937.n8SJbLYq012300@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2745 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-28 15:37 EST ------- (In reply to comment #6) > We'll also need to update the SeqIO GenBank output to record the CONTIG > string if present. Done, marking as fixed. Assuming there are no objections to the whole approach (treating the CONTIG data as a string) that is... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Sep 28 20:09:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Sep 2009 21:09:12 +0100 Subject: [Biopython-dev] Committing to github... 
In-Reply-To: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com> Message-ID: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com> On Thu, Sep 24, 2009 at 12:39 PM, Peter wrote: > Hi all, > > My last couple of commits to github have been from a local clone > of the *official* repository: http://github.com/biopython/biopython/ > > This is a nice and simple work flow for small changes, and the > history and github network graph are easy to understand: > http://biopython.org/wiki/GitUsage#Commiting_changes_to_main_branch > > This seems like the easiest way to work for people used to CVS, > and you don't need to bother with your own Biopython cloned > repository on github (you just need a github account and > collaborator status). I'll probably continue to do this in the short > term. This way of working (described above) is what I have been using for the last week. If there are multiple developers working (or in this case, developers using multiple machines), you can still get interesting mini-branches and merges even like this. Have a look at the Biopython github network diagram for today for a nice simple example (which was accidental - but serves as a nice illustration). [I know for some of you the following discussion isn't needed, but I think it is worth trying to explain - even if just for me, to make sure it is clear in my head what git is doing.] In words, the main trunk was split, with a (trivial) change to the tutorial done on one branch (me at work) and then two separate commits on a separate branch (unit tests tweak, and GenBank bug fix), again by me, but on my home computer. The two branches were then merged into one. Why did this happen? I was working on a local and very slightly out of date copy of the repository at home, and make these local commits. I then tried to push them to github. 
At that point git gave me an error saying something else had been
committed in the meantime (in fact by me but on a different computer)
so my local repository was out of date. So I pulled and merged the
latest code from github (the tutorial change), and then pushed this
to github. Done.

The merge was 100% automatic because the files changed were
independent. Back on CVS, as these changes were on separate files,
there wouldn't have been any issue about merging.

Does it matter? No. But we can reduce the likelihood of these
baby branches and merges by getting into the habit of pulling
the latest code from github *before* making any local commits
(a sensible thing to do anyway).

[Did that make sense? On the one hand this is very simple, but
on the other hand, it is rather different to how I used to think
about the code history under CVS.]

Peter

From eric.talevich at gmail.com Mon Sep 28 20:47:38 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 28 Sep 2009 16:47:38 -0400
Subject: [Biopython-dev] Committing to github...
In-Reply-To: <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
References: <320fb6e00909240439j2fe85d58w6bf9bd043e1c8569@mail.gmail.com>
 <320fb6e00909281309v64c6ef25s1c6c13357277f1c6@mail.gmail.com>
Message-ID: <3f6baf360909281347r32c39918s4a2c8a64cff44622@mail.gmail.com>

On Mon, Sep 28, 2009 at 4:09 PM, Peter wrote:

>
> Does it matter? No. But we can reduce the likelihood of these
> baby branches and merges by getting into the habit of pulling
> the latest code from github *before* making any local commits
> (a sensible thing to do anyway).
>
>
If you've committed local changes while your repository is out of date
and want to avoid a baby branch, you can also use "git rebase
origin/master" to fix the history. (But probably, most developers will
find it easier and safer to leave the baby branches there.)
Extended example:

    git checkout dev        # a development branch
    # hack hack
    git commit -a
    # oops, we're out of sync
    git checkout master     # a clean copy of upstream
    git pull origin master  # updating like we should have earlier
    git rebase master dev   # replays dev on top of master, leaves us on dev
    git checkout master     # switch back before merging
    git merge dev           # Should be fast-forward
    git push

Cheers,
Eric

From bugzilla-daemon at portal.open-bio.org Mon Sep 28 21:01:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:01:08 -0400
Subject: [Biopython-dev] [Bug 2919] New: Writing SeqFeature qualifiers
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

           Summary: Writing SeqFeature qualifiers
           Product: Biopython
           Version: 1.51
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: estrain at gmail.com

When writing SeqFeature qualifier key-value pairs, the output contains
one line for each character in the value, rather than simply printing
the string. The sample code at the bottom produces a genbank sequence
file that illustrates the problem. If I create a qualifiers dictionary
using "qualDict = dict(gene="geneA")", the genbank output contains

     gene            1..6
                     /gene="g"
                     /gene="e"
                     /gene="n"
                     /gene="e"
                     /gene="A"

The offending code appears to be in the InsdcIO.py file, lines 482-483.
If I change

482:        for value in values :
483:            self.write_feature_qualifier(key,value)

to

            self.write_feature_qualifier(key,values)

then the function appears to work correctly.
     gene            1..6
                     /gene="geneA"

###########################################################
## Sample code
###########################################################

from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

qualDict = dict(gene="geneA")
my_seq = SeqRecord(Seq("ATGATC",IUPAC.ambiguous_dna),id="seq1")
my_seq.features.append((SeqFeature(FeatureLocation(0,6),type="gene",qualifiers=qualDict)))
out_handle = open("test.gbk","w")
SeqIO.write([my_seq],out_handle,"genbank")

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Sep 28 21:22:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Sep 2009 17:22:32 -0400
Subject: [Biopython-dev] [Bug 2919] Writing SeqFeature qualifiers
In-Reply-To:
Message-ID: <200909282122.n8SLMW8w014482@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2919

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-28 17:22 EST -------
It was working as intended - for consistency with the GenBank (and
other) parsers, you were expected to use a list of strings as each
feature qualifier dictionary value (not just a string). However, a
similar request was made on the mailing list recently, and a fix
checked in (after Biopython 1.52 was released):

http://lists.open-bio.org/pipermail/biopython/2009-September/005585.html

Marking as fixed.
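
For anyone puzzled by the one-line-per-character output in the report above:
looping over a Python string yields its characters, so a writer that expects a
list of strings will happily treat "geneA" as five separate values. The sketch
below reproduces that behaviour in plain Python and shows one way a writer
could accept both forms. The function names naive_lines and qualifier_lines
are hypothetical, and this is not claimed to be the fix that was actually
checked in to Biopython.

```python
def naive_lines(qualifiers):
    """Write one line per item in each value; assumes lists of strings."""
    for key, values in qualifiers.items():
        for value in values:  # a bare string is iterated character by character
            yield '/%s="%s"' % (key, value)


def qualifier_lines(qualifiers):
    """Same as above, but treat a bare string as a single value."""
    for key, values in qualifiers.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            yield '/%s="%s"' % (key, value)


as_string = dict(gene="geneA")   # what the bug report used
as_list = dict(gene=["geneA"])   # what the parsers produce

print(list(naive_lines(as_string)))      # one entry per character
print(list(qualifier_lines(as_string)))  # a single /gene="geneA" entry
print(list(qualifier_lines(as_list)))    # the same single entry
```

Storing the values as lists from the start (as the parsers do) avoids the
problem without relying on the writer to guess.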
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Sep 29 16:41:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 29 Sep 2009 12:41:08 -0400 Subject: [Biopython-dev] [Bug 2891] Jython test_NCBITextParser fix+patch In-Reply-To: Message-ID: <200909291641.n8TGf8HE011375@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2891 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-09-29 12:41 EST ------- Really fixed this time, tested on Jython 2.5.0 and 2.5.1rc3 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Sep 30 15:27:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Sep 2009 16:27:03 +0100 Subject: [Biopython-dev] [Biopython] SeqRecord reverse complement method? Message-ID: <320fb6e00909300827p441d6096u67bc85e1762e7c52@mail.gmail.com> Hi all, A few months back on the main mailing list, Cedar and I were talking about taking a SeqRecord, and how to write out its reverse complement to a file. The thread is archived here: http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html Cedar - I cc'd you, as I am not sure if you are on the dev list. I expect this could get technical pretty quickly, so I wanted to float this idea on the dev list first... 
-----------------------------------------------------------------

So, the background to this discussion:

Unless there is some complicated annotation to transfer, using
Biopython as is, making a new SeqRecord using the reverse
complement sequence of the old SeqRecord isn't very hard, see:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement

This has meant that generally the current status quo isn't a
problem (at least for me).

However, what prompted me to work on this issue was a real
world example. We have a draft genome where after doing
a basic annotation, it would make sense to flip the strands.
I want to be able to load our current GenBank file, apply the
reverse complement, and have all the annotated features
recalculated to match. With more and more sequencing
projects, this isn't such an odd thing to want to do.

Dealing with the details of potentially complex locations
in SeqFeature objects isn't very nice, so I think it would be
useful to have this particular functionality built into Biopython.
It is also a small step towards making the SeqRecord more
Seq like (which in general seems a good idea).

On Thu, Jun 25, 2009 at 12:20 AM, Peter wrote:
>
> What you are doing is fine - although personally I might wrap up the
> first line as a function, as done in the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement
>
> While we could add a reverse_complement() method to the SeqRecord
> (and other Seq methods, like translate etc), there is one big problem:
> What to do with the annotation. If your record used to have a name
> based on an accession or a GI number, then this really does not apply
> to the reverse complement (or a translation etc). We could do something
> arbitrary like adding an "rc_" prefix (or variants) but I think the only safe
> answer is to make the user think about this and do what is appropriate
> in their context.
> And as you have demonstrated, this can still be done
> in one line :)
>
> I make a habit of using this as a justification, but I feel the zen of
> Python "Explicit is better than implicit" applies quite well here.

I've been thinking about this on and off since then, and I still
maintain that for much of the annotation there is no easy answer.

For the sequence itself, the behaviour is well defined. For all
the annotation, there are three possible actions:
(a) User supplies a new value
(b) Reuse the old value
(c) No annotation (the default for a new SeqRecord)

We can do something sensible with the features (if present)
and it will probably make sense to copy but reverse any
per-letter annotation (if present).

On a github branch I have posted some experimental code
which adds a reverse_complement() method to the SeqRecord.

I propose to give the new reverse_complement() a set of
optional arguments (id, name, etc) following the same names
as the existing attributes (and __init__ arguments), allowing the
user to choose between these three actions. Assuming the
general scheme is popular, I'm quite open to discussing changing
these defaults. But for the first implementation this is what I picked:

For the id, name and description I still lean towards making
the user decide this, and therefore the default is (c). Likewise
for the annotations dictionary and the database cross refs.

For the features and per-letter-annotation, I would opt to make
the default behaviour be to reuse the old data, option (b) above.
For the per-letter-annotation (the restricted dictionary,
letter_annotations) this just means reversing each entry.
For the features, this means reversing the order of the features,
switching their strands (if set), and calculating the new
coordinates (taking care of all the possible fuzzy locations
and sub-features).

The code is here if anyone wants to look at the technical details:
http://github.com/peterjc/biopython/commits/seqrecords

Peter
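
As a rough illustration of the feature and per-letter handling described
above, the sketch below works on plain tuples rather than SeqFeature objects.
It assumes exact, Python-style coordinates (zero-based start, exclusive end)
and strands of +1, -1 or None; fuzzy locations and sub-features are
deliberately left out, and the function names are made up for this example
rather than taken from the branch.

```python
def reverse_complement_feature(start, end, strand, length):
    """Map one (start, end, strand) feature onto the reverse complement.

    Coordinates are zero-based with an exclusive end, so a feature
    covering [start, end) on a sequence of the given length ends up
    covering [length - end, length - start) on the other strand.
    """
    new_strand = None if strand is None else -strand  # only flip if set
    return (length - end, length - start, new_strand)


def reverse_complement_annotation(features, letter_annotations, length):
    """Return new features (in reversed order) and reversed per-letter data."""
    new_features = [reverse_complement_feature(start, end, strand, length)
                    for (start, end, strand) in reversed(features)]
    new_letters = {key: values[::-1]
                   for key, values in letter_annotations.items()}
    return new_features, new_letters


# A ten letter sequence with two features and some quality scores.
features = [(0, 3, +1), (6, 10, -1)]
letters = {"phred_quality": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]}

new_features, new_letters = reverse_complement_annotation(features, letters, 10)
print(new_features)
print(new_letters["phred_quality"])
```

Reversing the feature order keeps the list sorted by start position on the
new strand, which is what a GenBank writer would normally expect.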