From k.d.murray.91 at gmail.com Tue Apr 2 04:52:59 2013
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Tue, 2 Apr 2013 19:52:59 +1100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References: Message-ID:

Hi All,

Peter and I have discussed
<https://github.com/kdmurray91/biopython/commit/e6343ebae50e4ff0633476a5761b47aa5ecacec4#commitcomment-2905033>
including the SamBam parser he has worked on in the master branch. I've
offered to help with test coverage, missing features and testing.

The performance is very good; reading all reads sequentially from an 11 MB
(540k reads) BAM file took:

CPython with Kevin's pure-Python parser: 0m17.531s real, 0m17.452s user
CPython with Peter's pure-Python parser: 0m5.589s real, 0m5.560s user
CPython with pysam: 4m29.240s real, 4m25.576s user
PyPy 1.9 with Kevin's pure-Python parser: 0m6.125s real, 0m6.056s user
PyPy 1.9 with Peter's pure-Python parser: 0m1.716s real, 0m1.624s user

What are everyone's thoughts on including this in the master branch
(with a BiopythonExperimentalWarning)?

Regards,
Kevin

On 2 April 2013 19:09, Peter Cock wrote:
> We're experimenting with including experimental code in the stable
> releases (modules with an experimental warning on import), so possibly yes.
> I was never entirely sure if it would be useful given pysam exists (and may
> well be faster on many jobs), but as this is pure Python it works on PyPy
> which is interesting (and Jython, but on Java you have access to Picard).
> Shall we take this discussion to the dev mailing list?

From tiagoantao at gmail.com Tue Apr 2 04:59:43 2013
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 2 Apr 2013 09:59:43 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

Regarding the performance comparison to pysam: wow! Fantastic!

On Tue, Apr 2, 2013 at 9:52 AM, Kevin Murray wrote:
> Hi All,
>
> Peter and I have discussed including the SamBam parser he has worked on
> into the master branch. I've offered to help with test coverage/missing
> features/testing.
>
> The performance is very good; reading sequentially all reads from a 11mb
> (540k reads) Bam file took:
> CPython with Kevin's Pure-python parser: 0m17.531s real, 0m17.452s user
> CPython with Peter's Pure-python parser: 0m5.589s real, 0m5.560s user
> CPython with pysam: 4m29.240s real, 4m25.576s user
> Pypy1.9 with Kevin's Pure-python parser: 0m6.125s real, 0m6.056s user
> Pypy1.9 with Peter's Pure-python parser: 0m1.716s real, 0m1.624s user
>
> What are everyone's thoughts on including this into the master branch?
> (with a BiopythonExperimentalWarning)
>
> Regards,
> Kevin
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

--
"Grant me chastity and continence, but not yet" - St Augustine

From k.d.murray.91 at gmail.com Tue Apr 2 05:09:03 2013
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Tue, 2 Apr 2013 20:09:03 +1100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

Hi Tiago,

It is indeed impressive, which makes me suspect I've screwed something up in
my benchmarks. I'll whack them up onto GitHub for closer inspection sometime
tomorrow (Aussie time).

However, in general the benchmark code was:

    bam = BamParser("path")  # open the BAM file with the parser under test
    print next(bam)          # Python 2 print statement: show the first record
    for mapping in bam:      # then iterate over all remaining records
        pass

Regards
Kevin Murray

On 2 April 2013 19:59, Tiago Antão wrote:
> Regarding the performance comparison to pysam: wow! Fantastic!

From p.j.a.cock at googlemail.com Tue Apr 2 05:32:22 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Apr 2013 10:32:22 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Tue, Apr 2, 2013 at 10:09 AM, Kevin Murray wrote:
> Hi Tiago,
> It is indeed impressive, which makes me suspect I've screwed something up in
> my benchmarks. I'll whack them up onto github for closer inspection sometime
> tomorrow (Aussie time).
>
> However, in general code:
>
> bam = BamParser("path")
> print next(bam)
> for mapping in bam:
>     pass
>
> Regards
> Kevin Murray

Those benchmark numbers are surprising - I suspect this is not a fair
comparison. The different parsers likely have very different __str__ output
for a BAM record (for mine this gives a SAM format string; pysam does
something close to SAM but without the reference name).

Something like BAM to SAM and then SAM to BAM would be better for profiling
the basic parsing and writing performance. After that, random access, and
maybe something where lazy loading might have a chance to shine - perhaps
counting the number of reads mapped to the reverse strand (i.e. iterate and
look at the FLAG only).
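
The FLAG-only benchmark suggested above could look roughly like the
following minimal sketch. It assumes the FLAG values are already available as
plain integers (how a particular parser exposes them is not shown here) and
uses the SAM specification's bits 0x4 (unmapped) and 0x10 (reverse strand).
The function name count_reverse_strand is made up for this illustration and
is not part of any of the parsers discussed.

```python
FLAG_UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
FLAG_REVERSE = 0x10   # SAM FLAG bit: read maps to the reverse strand


def count_reverse_strand(flags):
    """Return (reverse_mapped, mapped) counts for an iterable of FLAG ints.

    Hypothetical helper: only the FLAG field is inspected, which is the
    case where a lazily loading parser could skip decoding everything else.
    """
    reverse = 0
    mapped = 0
    for flag in flags:
        if flag & FLAG_UNMAPPED:
            continue  # unmapped reads have no meaningful strand
        mapped += 1
        if flag & FLAG_REVERSE:
            reverse += 1
    return reverse, mapped


# Example FLAG values invented for the illustration.
example_flags = [0, 16, 4, 83, 99, 20]
print(count_reverse_strand(example_flags))
```

With a real parser the list would be replaced by a generator over the
records' FLAG attribute, so that the timing covers iteration plus one
integer field per read.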
Peter

From tiagoantao at gmail.com Tue Apr 2 06:01:00 2013
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 2 Apr 2013 11:01:00 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

I did a small test, just getting rec.rname and rec.pos (using Peter's
parser). This is something I actually need to do, to calculate basic
statistics.

Indeed, for 1M reads, samtools takes 3s whereas the pure Python parser takes 20s.

Tiago

--
"Grant me chastity and continence, but not yet" - St Augustine

From p.j.a.cock at googlemail.com Tue Apr 2 12:07:33 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Apr 2013 17:07:33 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Tue, Apr 2, 2013 at 11:01 AM, Tiago Antão wrote:
> I did a small test, just getting rec.rname and rec.pos (using Peter's
> parser). This is something I actually need to do, to calculate basic
> statistics.
>
> Indeed for 1M reads, samtools is 3s whereas the pure Python parser takes 20s.
>
> Tiago

Those numbers are more believable. Was that using SAM or BAM?
Which Python?

Note that the rname (name of the reference a read is mapped to) is an
interesting one, given explicitly as a string in SAM but as an integer
offset in BAM. The pysam parser gives the low level index when parsing BAM,
while mine is consistent and returns the ref name as a string for both SAM
and BAM. This was a design choice to make the BAM reads self contained and
avoid some of the rough edges with pysam where you must manage the reference
indexes manually sometimes.

Regards,

Peter

From tiagoantao at gmail.com Tue Apr 2 14:28:35 2013
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 2 Apr 2013 19:28:35 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Tue, Apr 2, 2013 at 5:07 PM, Peter Cock wrote:
> Those numbers are more believable. Was that using SAM or BAM?
> Which Python?

Bam. CPython 2.7.3

T

From p.j.a.cock at googlemail.com Tue Apr 2 18:52:24 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Apr 2013 23:52:24 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Tue, Apr 2, 2013 at 7:28 PM, Tiago Antão wrote:
> On Tue, Apr 2, 2013 at 5:07 PM, Peter Cock wrote:
>> Those numbers are more believable. Was that using SAM or BAM?
>> Which Python?
>
> Bam. CPython 2.7.3
>
> T

Here's a quick test script which I presume does something close to yours
(any major differences would be interesting), which tests BAM iteration and
accessing the FLAG, RNAME and POS fields only:

https://github.com/peterjc/picobio/blob/master/sambam/profile/bench_iter.py

I grabbed three test BAM files for this, all from here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/data/NA12878/alignment/

NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam - 2.5 GB
NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam - 12 GB
NA12878.chrom1.SOLID.corona.SRP000032.2009_08.bam - 8.6 GB

I'm using pysam 0.7.4, and my SamBam2013 branch as of this commit:
https://github.com/peterjc/biopython/commit/316125a41f0284198e1a445486b307948f8c9cd9

Then using C Python 2.7.2 (on Mac OS X),

Using NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam (2.5 GB)
Peter's pure Python BAM iterator.
 - 389.1s giving 15126090/15126090 mapped
PySam's Samfile as BAM iterator.
 - 86.8s giving 15126090/15126090 mapped

I've not run this many times (should pick a smaller test set?) but it looks
like here my BAM iterator takes 4x to 5x longer than pysam, which is better
than Tiago's earlier figure of about 6x to 7x slower, but in the same ball
park. I'll run this again tomorrow hopefully, and include the two larger
files too.

Note I've not really tried to optimise this branch for speed - there is
likely some low-hanging fruit like extra assert statements etc.

One of the fun things to try would be a multi-threaded BGZF parser which
simply reads a few blocks ahead and delegates block decompression to worker
threads.
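
The worker-thread idea above might be sketched as follows. This is only an
illustration under stated assumptions, not the code in the SamBam2013 branch:
it does not parse real BGZF block headers, it assumes the compressed blocks
have already been split out, and it uses ordinary gzip members as stand-ins
for BGZF blocks (which are gzip-compatible). The names decompress_block and
decompress_blocks are invented for the example. Whether threads actually help
depends on the decompression call releasing the GIL.

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_block(block):
    """Decompress one gzip-format block (wbits=31 means gzip header/trailer)."""
    return zlib.decompress(block, 31)


def decompress_blocks(blocks, workers=4):
    """Decompress blocks in worker threads, returning results in input order.

    Hypothetical helper: a real read-ahead parser would submit only a few
    blocks at a time rather than the whole file.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decompress_block, blocks))


# Stand-in data: three independently compressed chunks.
chunks = [b"ACGT" * 1000, b"TTGA" * 1000, b"read3"]
blocks = [gzip.compress(chunk) for chunk in chunks]
assert decompress_blocks(blocks) == chunks
```

Because pool.map preserves input order, the decompressed blocks can be handed
to the record parser sequentially even though they were decompressed
concurrently.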
Peter

From tiagoantao at gmail.com Wed Apr 3 05:09:54 2013
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Apr 2013 10:09:54 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

> One of the fun things to try would be a multi-threaded BGZF
> parser which simply reads a few blocks ahead and delegates
> block decompression to worker threads.

Wouldn't the GIL bite here and deny any kind of advantage?
(At least in CPython)

--
"Grant me chastity and continence, but not yet" - St Augustine

From p.j.a.cock at googlemail.com Wed Apr 3 05:30:53 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Apr 2013 10:30:53 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Wed, Apr 3, 2013 at 10:09 AM, Tiago Antão wrote:
>> One of the fun things to try would be a multi-threaded BGZF
>> parser which simply reads a few blocks ahead and delegates
>> block decompression to worker threads.
>
> Wouldn't the GIL bite here and deny any kind of advantage?
> (At least in CPython)

The BGZF code does the basic IO, reading in a compressed block as a string,
then passes that to the gzip/zlib library to decompress. That happens in C,
so could/should avoid the GIL. See also:
http://www.dalkescientific.com/writings/diary/archive/2012/01/19/concurrent.futures.html

Note that last time I looked at this, a year ago or so, PyPy was quite slow
calling zlib - passing large byte strings from PyPy to C and back wasn't
optimised. That may have improved. See this thread (I didn't get deep enough
into PyPy to fix this myself):
http://mail.python.org/pipermail/pypy-dev/2012-March/009623.html

Peter

From p.j.a.cock at googlemail.com Wed Apr 3 13:10:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Apr 2013 18:10:16 +0100
Subject: [Biopython-dev] [biopython] added unstable Bam parser class (e6343eb)
In-Reply-To: References:

Message-ID:

On Tue, Apr 2, 2013 at 11:52 PM, Peter Cock wrote:
> I've not run this many times (should pick a smaller test set?)
> but it looks like here my BAM iterator takes 4x to 5x longer than
> pysam, which is better than Tiago's early figure of about 6x to
> 7x slower, but in the same ball park. I'll run this again tomorrow
> hopefully, and include the two larger files too.
>
> Note I've not really tried to optimise this branch for speed - there
> are likely some low hanging fruit like extra assert statements etc.

Same machine, also using PyPy 1.9, and with the larger BAM files tested too:

Using NA12878.chrom1.454.ssaha2.SRP000032.2009_10.bam (2.5GB)

C Python 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 401.2s giving 15126090/15126090 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 146.7s giving 15126090/15126090 mapped
C Python 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator.
 - 85.3s giving 15126090/15126090 mapped

Using NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam (12GB)

C Python 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 4706.8s giving 196354464/201240699 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 1248.5s giving 196354464/201240699 mapped
C Python 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator.
 - 795.7s giving 196354464/201240699 mapped

Using NA12878.chrom1.SOLID.corona.SRP000032.2009_08.bam (8.6GB)

C Python 2.7.2 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 3445.7s giving 145879316/145879316 mapped
PyPy 1.9 (on Mac OS X)
Peter's pure Python BAM iterator.
 - 875.9s giving 145879316/145879316 mapped
C Python 2.7.2 (on Mac OS X)
PySam's Samfile as BAM iterator.
 - 602.1s giving 145879316/145879316 mapped

Using PyPy the run times are approaching the speed of pysam - and might
perhaps match and exceed it with some more time looking at profiling? I
should try the PyPy 2.0 beta as well...

Anyway, right now on C Python this code is not speed competitive with pysam
for parsing large BAM files. That doesn't mean it isn't useful though, but
makes it harder to justify including in Biopython.

Peter

From p.j.a.cock at googlemail.com Mon Apr 8 11:28:07 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Apr 2013 16:28:07 +0100
Subject: [Biopython-dev] Bio/SearchIO/BlastIO/blast_xml.py output under Python 2.7.4
Message-ID:

Hi Bow,

I've just updated the 64bit Linux buildslave from Python 2.7.3 to 2.7.4 and
there is a new unicode failure with the BLAST XML writing code in SearchIO
(below).

Can you reproduce this?
Keep in mind this could be a new bug in Python itself ;)

Separately there are some BGZF issues, see also this thread:
http://lists.open-bio.org/pipermail/biopython/2013-April/008490.html

Thanks,

Peter

======================================================================
ERROR: test_write_multiple_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries (xml_2226_blastp_001.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write
    writer.write_file(qresults)
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file
    xml.startDocument()
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument
    self.write('\n'
  File "/home_local/buildslave/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

======================================================================
ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query (xml_2226_blastp_004.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write
    writer.write_file(qresults)
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file
    xml.startDocument()
  File "/home_local/buildslave/repositories/biopython/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument
    self.write('\n'
  File "/home_local/buildslave/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

From kai.blin at biotech.uni-tuebingen.de Mon Apr 8 11:10:49 2013
From: kai.blin at biotech.uni-tuebingen.de (Kai Blin)
Date: Mon, 08 Apr 2013 17:10:49 +0200
Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172)
In-Reply-To: References:

Message-ID: <5162DDF9.1030508@biotech.uni-tuebingen.de>

On 2013-03-29 12:54, Peter Cock wrote:
> Hi Brad,
>
> This sounds sensible in principle - it just needs some hands on testing
> on various systems - any volunteers who use PIP and virtual envs?

Sure, I've got a zoo of Ubuntu systems to try this on, and I'd actually
be very interested in this. I'll give it a go.

Cheers,
Kai

--
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-Universität Tübingen
Auf der Morgenstelle 28        Phone : ++49 7071 29-78841
D-72076 Tübingen               Fax : ++49 7071 29-5979
Germany
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben

From bow at bow.web.id Mon Apr 8 13:02:39 2013
From: bow at bow.web.id (Wibowo Arindrarto)
Date: Mon, 8 Apr 2013 19:02:39 +0200
Subject: [Biopython-dev] Bio/SearchIO/BlastIO/blast_xml.py output under Python 2.7.4
In-Reply-To: References: Message-ID:

Hi Peter, everyone,

This is reproducible on my machine (which just happened to update its
default Python 2.x to 2.7.4). It looks like they changed the underlying SAX
XML writer (xml.sax.saxutils.XMLGenerator, which the SearchIO blast_xml
writer subclasses). The new write() method calls the
io.TextIOWrapper.write() function, which expects unicode only.

See the comparison here:
Python 2.7.3 (http://hg.python.org/cpython/file/70274d53c1dd/Lib/xml/sax/saxutils.py#l84)
vs Python 2.7.4 (http://hg.python.org/cpython/file/026ee0057e2d/Lib/xml/sax/saxutils.py#l109).

Also, here's the io module doc page for reference:
http://docs.python.org/2/library/io.html

I'll write up a Python <=2.7.3-compatible patch to fix this soon, if
everybody's OK with it :).

Cheers,
Bow

From p.j.a.cock at googlemail.com Mon Apr 8 13:08:57 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Apr 2013 18:08:57 +0100
Subject: [Biopython-dev] Bio/SearchIO/BlastIO/blast_xml.py output under Python 2.7.4
In-Reply-To: References: Message-ID:

On Mon, Apr 8, 2013 at 6:02 PM, Wibowo Arindrarto wrote:
> I'll write up a Python <=2.7.3-compatible patch to fix this soon, if
> everybody's ok with it :).
>
> Cheers,
> Bow

Sounds good - with the combination of TravisCI and the buildbot we should
have the major platforms covered.

Thanks,

Peter

From natemsutton at yahoo.com Mon Apr 8 17:17:05 2013
From: natemsutton at yahoo.com (Nate Sutton)
Date: Mon, 8 Apr 2013 14:17:05 -0700 (PDT)
Subject: [Biopython-dev] Work note
Message-ID: <1365455825.76591.YahooMailNeo@web122604.mail.ne1.yahoo.com>

Hi,

I wanted to post a quick note to keep efforts organized: I am working on, and
have made some progress with, https://redmine.open-bio.org/issues/3336 .

-Nate

From p.j.a.cock at googlemail.com Tue Apr 9 06:20:43 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 9 Apr 2013 11:20:43 +0100
Subject: [Biopython-dev] OBF not accepted for GSoC 2013
Message-ID:

Dear all,

Unfortunately this year we have not been accepted on the Google Summer of
Code scheme:

I'm sure the rest of the OBF board and the other Bio* developers will join
me in thanking Pjotr Prins for his efforts as the OBF GSoC administrator
co-ordinating our application this year, as well as last year's
administrator Rob Bruels and the other mentors for their efforts.

For those of you not subscribed to the OBF's GSoC mailing list, I am
forwarding Pjotr's email from last night (also below):
http://lists.open-bio.org/pipermail/gsoc/2013/000211.html

In all 177 organisations were accepted (about the same as the last few
years), and they will be listed here (once they have filled out their
profile information):
https://google-melange.appspot.com/gsoc/accepted_orgs/google/gsoc2013

To potential students this summer, the good news is that some related
organisations have been accepted, such as NESCent, the National Resource for
Network Biology (NRNB - known for Cytoscape), SciRuby (Ruby Science
Foundation), so there is still some scope for doing a bioinformatics related
project in GSoC 2013, perhaps even with a Bio* developer as a co-mentor.

Thank you all,

Peter
(Biopython developer, OBF board member)

---------- Forwarded message ----------
From: Pjotr Prins
Date: Mon, Apr 8, 2013 at 9:13 PM
Subject: Re: GSoC 2013 is ON
To: Pjotr Prins
Cc: ..., OBF GSoC

Sadly, our application got rejected by GSoC this year. I am not sure what
the reason was, but I am convinced our application was similar to that of
other years. Maybe the project ideas could have been better presented. I am
not sure at this stage. I'll make a list of successful projects to see if we
can digest some truths.

The upside is that FOSS is going strong! And that the field is getting
increasingly competitive. As an open source geezer I can only be happy, even
if it hurts our own application.

Sorry everyone, and many thanks for the trouble you took getting projects
written up. Let's not feel discouraged for next year.

Pj.

From p.j.a.cock at googlemail.com Tue Apr 9 06:41:26 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 9 Apr 2013 11:41:26 +0100 Subject: [Biopython-dev] OBF not accepted for GSoC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 9, 2013 at 11:20 AM, Peter Cock wrote: > Dear all, > > Unfortunately this year we have not been accepted on the Google > Summer of Code scheme: > > I'm sure the rest of the OBF board and the other Bio* developers > will join me in thanking Pjotr Prins for his efforts as the OBF > GSoC administrator co-ordinating our application this year, as > well as last year's administrator Rob Bruels and the other mentors > for their efforts. Ahem. Rob Buels was the previous OBF GSoC co-ordinator, sorry. > For those of you not subscribed to the OBF's GSoC mailing list, > I am forwarding Pjotr's email from last night (also below): > http://lists.open-bio.org/pipermail/gsoc/2013/000211.html > > In all 177 organisations were accepted (about the same as the > last few years), and they will be listed here (once they have filled > out their profile information): > https://google-melange.appspot.com/gsoc/accepted_orgs/google/gsoc2013 > > To potential students this summer, the good news is that some > related organisations have been accepted, such as NESCent, > the National Resource for Network Biology (NRNB - known for > Cytoscape), SciRuby (Ruby Science Foundation), so there is > still some scope for doing a bioinformatics related project in > GSoC 2013, perhaps even with a Bio* developer as a co-mentor. > > Thank you all, > > Peter > (Biopython developer, OBF board member) For any Biopython based projects, I think our best bet is to talk to NESCent (with whom we've had a couple of GSoC students prior to the OBF applying directly). 
That seems like a good fit with the phylogenetic ideas Eric suggested: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 Looking over the list of other accepted applications, the Python Software Foundation (PSF) was also accepted and they're doing this as an umbrella organisation - so also worth talking to: http://wiki.python.org/moin/SummerOfCode/2013 Regards, Peter From p.j.a.cock at googlemail.com Tue Apr 9 07:08:10 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 9 Apr 2013 12:08:10 +0100 Subject: [Biopython-dev] OBF not accepted for GSoC 2013 In-Reply-To: References:

Message-ID: On Tue, Apr 9, 2013 at 11:41 AM, Peter Cock wrote: > For any Biopython based projects, I think our best bet is to > talk to NESCent (with whom we've had a couple of GSoC > students prior to the OBF applying directly). That seems > like a good fit with the phylogenetic ideas Eric suggested: > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 > > Looking over the list of other accepted applications, the > Python Software Foundation (PSF) was also accepted > and they're doing this as an umbrella organisation - so > also worth talking to: > http://wiki.python.org/moin/SummerOfCode/2013 Not all the GSoC ideas we'd come up with for 2013 would fit under NEScent's interests, so I've emailed the PSF contact person listed on that page to see if they are still willing to add more projects under their umbrella. Regards, Peter From p.j.a.cock at googlemail.com Tue Apr 9 07:15:48 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 9 Apr 2013 12:15:48 +0100 Subject: [Biopython-dev] Bio/SearchIO/BlastIO/blast_xml.py output under Python 2.7.4 In-Reply-To: References:

Message-ID: On Mon, Apr 8, 2013 at 6:08 PM, Peter Cock wrote: > On Mon, Apr 8, 2013 at 6:02 PM, Wibowo Arindrarto wrote: >> Hi Peter, everyone, >> >> This is reproducible in my machine (which just happen to update its default >> Python2.x to 2.7.4 actually). Looks like they changed the underlying SAX XML >> writer (xml.sax.saxutils.XMLGenerator ~ which the SearchIO blast_xml writer >> subclasses). The new write() method calls the io.TextIOWrapper.write() >> function, which expects unicode only. >> >> See the comparison here: >> Python2.7.3 >> (http://hg.python.org/cpython/file/70274d53c1dd/Lib/xml/sax/saxutils.py#l84) >> vs Python2.7.4 >> (http://hg.python.org/cpython/file/026ee0057e2d/Lib/xml/sax/saxutils.py#l109). >> Also, here's the io module doc page for reference: >> http://docs.python.org/2/library/io.html >> >> I'll write up a Python <=2.7.3-compatible patch to fix this soon, if >> everybody's ok with it :). >> >> Cheers, >> Bow > > Sounds good - with the combination of TravisCI and the buildbot we should > have the major platforms covered. > > Thanks, > > Peter Pull request merged, thanks Bow https://github.com/biopython/biopython/pull/176 Peter From p.j.a.cock at googlemail.com Tue Apr 9 09:36:27 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 9 Apr 2013 14:36:27 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 Message-ID: Hi all, First a general reminder that the BOSC 2013 abstract deadline is this coming Friday 12 April, http://www.open-bio.org/wiki/BOSC_2013 Second, we need to prepare an abstract for the traditional Biopython Project update. Both Brad and I should be there for BOSC 2013, and the preceding mini hackathon/codefest to which everyone is welcome: http://www.open-bio.org/wiki/Codefest_2013 Who else is planning to be at BOSC 2013 (and the Codefest), and is there anyone keen to present the update? 
I'm happy to give the talk, but this is a nice chance for someone else to present and give a more rounded impression of the Biopython developers ;) According to http://biopython.org/wiki/Documentation#Presentations the recent Biopython presenters at BOSC have been: BOSC 2013, Berlin, Germany - ??? BOSC 2012, Long beach, USA - Eric BOSC 2011, Vienna, Austria - Peter BOSC 2010, Boston, USA - Brad BOSC 2009, Stockholm, Sweden - Peter BOSC 2008, Toronto, Canada - Tiago BOSC 2007, Vienna, Austria - Peter BOSC 2006, Fortaleza, Brazil - No one ... So, any volunteers from our development team? I know from personal experience that giving a talk can be very useful for securing travel funding - and we can if need be ask the BOSC committee to waive the registration fee (in a new initiative for this year at BOSC). Thanks, Peter (Disclaimer: I'm also on the BOSC committee) From dalloliogm at gmail.com Tue Apr 9 11:10:30 2013 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 9 Apr 2013 17:10:30 +0200 Subject: [Biopython-dev] OBF not accepted for GSoC 2013 In-Reply-To: References:

Message-ID: Sorry to hear this news. Hopefully you will be able to fit some of the projects under NESCent, or find other solutions. On Tue, Apr 9, 2013 at 1:08 PM, Peter Cock wrote: > On Tue, Apr 9, 2013 at 11:41 AM, Peter Cock > wrote: > > For any Biopython based projects, I think our best bet is to > > talk to NESCent (with whom we've had a couple of GSoC > > students prior to the OBF applying directly). That seems > > like a good fit with the phylogenetic ideas Eric suggested: > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 > > > > Looking over the list of other accepted applications, the > > Python Software Foundation (PSF) was also accepted > > and they're doing this as an umbrella organisation - so > > also worth talking to: > > http://wiki.python.org/moin/SummerOfCode/2013 > > Not all the GSoC ideas we'd come up with for 2013 would > fit under NEScent's interests, so I've emailed the PSF > contact person listed on that page to see if they are still > willing to add more projects under their umbrella. > > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Giovanni Dall'Olio, phd student IBE, Institut de Biologia Evolutiva, CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From p.j.a.cock at googlemail.com Wed Apr 10 07:17:34 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Apr 2013 12:17:34 +0100 Subject: [Biopython-dev] Debugging the warning filters in the unit tests Message-ID: Hi all, I've suspected for some time that there is some subtle bug in run_tests.py or a limitation in the Python warnings module - although our code tries to reset the filters between tests this doesn't seem to be working all the time. The obvious effect of this is getting loads of ResourceWarning messages under Python 3.3, e.g. 
http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/130/steps/shell/logs/stdio Sometimes these warnings can manifest as real bugs, in particular on Windows where you cannot delete a file when there is still an open handle to it which will not be closed until the garbage collection runs, i.e. bugs which tend to only show under Windows with PyPy or Jython, e.g. https://github.com/biopython/biopython/commit/943fffe2835dca4a996a6a171f026f6374ecaaa9 After fixing most of the handle leaks (which is a good thing to do anyway), the remainder being shown are in the documentation tests (where fixing them can make the examples harder to follow): http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/140/steps/shell/logs/stdio I'm hoping someone might have some insight into the current (not entirely successful) attempts to handle warning filters in run_tests.py Note this is complicated because many of the individual test_XXX.py files manipulate the (global) warnings filters, either to silence harmless expected warnings, or to verify that desired warnings appear. Thanks, Peter P.S. One of the first things I want to do once we drop support for Python 2.5 (currently we're planning just one more release supporting it) is update the doctests to use context managers for file handles (i.e. use the with statement). Currently we can't easily do this in plain doctests without including this line explicitly: from __future__ import with_statement We could do that automatically in test_Tutorial.py though as part of the code which extracts doctest examples from the LaTeX and runs them. From rozanski.andrei at gmail.com Wed Apr 10 12:16:02 2013 From: rozanski.andrei at gmail.com (Andrei Rozanski) Date: Wed, 10 Apr 2013 13:16:02 -0300 Subject: [Biopython-dev] Intro Message-ID: Hi all, As a PhD student in bioinformatics, I found in Biopython a useful tool. 
Actually, I work on NGS, differential gene expression, alternative splicing and conservation analysis of transposable elements. Therefore, I would like to introduce myself and say that I would like to contribute. Best regards, -- Andrei R From w.arindrarto at gmail.com Wed Apr 10 12:30:58 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 10 Apr 2013 18:30:58 +0200 Subject: [Biopython-dev] Debugging the warning filters in the unit tests In-Reply-To: References: Message-ID: Hi Peter, Could we apply a simple warning filter prior to the doctest run? Something like:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore', ResourceWarning)
        suite.run(result)

I see that the run_tests.py script already can differentiate between doctests and non-doctests. The filter can then be applied before the 'else' clause from L375 ends (consequently, the previous 'if' should do its own unfiltered `suite.run(result)` invocation). I tried this locally, and it does work for the ResourceWarning from the doctests. It does not silence all ResourceWarnings, though (test_Tutorial isn't really a doctest as seen by `run_tests.py`, and I saw a separate warning raised from L431 during GC cleanup). However, maybe a related issue is whether we want to silence these warnings at all? After digging a bit into the tests and code, the warnings do seem to be generated by real leaky handles (so not a Python bug). For example, the SearchIO and SeqIO warnings here: http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/140/steps/shell/logs/stdio seem to be caused by the `index` or `index_db` function not closing its open file handles. When the doctest suite resets the global environment for each test (doctest.py:L1439), Python discovers that these handles are still open and raises a warning. I haven't checked if the same issue is the root cause of other ResourceWarning occurrences. Perhaps we need to look deeper into this? 
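To illustrate the scoping behaviour this relies on, here is a minimal sketch, assuming Python 3 (ResourceWarning does not exist on Python 2). The names run_isolated and noisy_test are hypothetical and are not functions in run_tests.py; this is not a patch, only a demonstration that filter changes made inside warnings.catch_warnings() are discarded when the block exits:

```python
import warnings


def run_isolated(test_func):
    """Run test_func with a private copy of the warning filters.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        # Changes to the filters made in this block (by us or by the
        # test itself) are thrown away when the block exits.
        warnings.simplefilter("always", ResourceWarning)
        return test_func()


def noisy_test():
    # Stand-in for a test that leaves a global filter behind.
    warnings.simplefilter("ignore")
    return "done"


filters_before = list(warnings.filters)
outcome = run_isolated(noisy_test)
filters_after = list(warnings.filters)
```

If every test were wrapped this way, a test that forgets to restore the filters could not affect the tests that run after it, which might help narrow down where the reset is coming from.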
Cheers, Bow On Wed, Apr 10, 2013 at 1:17 PM, Peter Cock wrote: > Hi all, > > I've suspected for a some time that there is some subtle bug > in run_tests.py or a limitation in the Python warnings module - > although our code tries to reset the filters between tests this > doesn't seem to be working all the time. > > The obvious effect of this is getting loads of ResourceWarning > messages under Python 3.3, e.g. > > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/130/steps/shell/logs/stdio > > Sometimes theses warnings can manifest as real bugs, in > particular on Windows where you cannot delete a file when > there is still an open handle to it which will not be closed > until the garbage collection runs. i.e. Bugs which tend to > only show under Windows with PyPy or Jython. e.g. > > https://github.com/biopython/biopython/commit/943fffe2835dca4a996a6a171f026f6374ecaaa9 > > After fixing most of the handle leaks (which is a good thing > to do anyway), the remainder of being shown are in the > documentation tests (where fixing them can make the > examples harder to follow): > > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/140/steps/shell/logs/stdio > > I'm hoping someone might have some insight into the > current (not entirely successful) attempts to handle > warning filters in run_tests.py > > Note this is complicated because many of the > individual test_XXX.py files manipulate the (global) > warnings filters, either to silence harmless expected > warnings, or to verify that desired warnings appear. > > Thanks, > > Peter > > P.S. > > One of the first things I want to do once we drop support > for Python 2.5 (currently we're planning just one more > release support it) is update the doctests to use context > managers for file handles (i.e. use the with statement). 
> Currently we can't easily do this in plain doctests without > including this line explicitly: > > from __future__ import with_statement > > We could do that automatically in test_Tutorial.py though > as part of the code which extracts doctest examples form > the LaTeX and runs them. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Wed Apr 10 15:44:25 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 10 Apr 2013 15:44:25 -0400 Subject: [Biopython-dev] Work note In-Reply-To: <1365455825.76591.YahooMailNeo@web122604.mail.ne1.yahoo.com> References: <1365455825.76591.YahooMailNeo@web122604.mail.ne1.yahoo.com> Message-ID: Hi Nate, Cool. Which aspects of Phylo.draw are you working on? Is there a branch we could watch on GitHub? Thanks, Eric On Mon, Apr 8, 2013 at 5:17 PM, Nate Sutton wrote: > Hi, > I wanted to post a quick note keep efforts organized, I am working on and > have some progress with https://redmine.open-bio.org/issues/3336 . > -Nate > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chris.mit7 at gmail.com Wed Apr 10 22:57:41 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Wed, 10 Apr 2013 22:57:41 -0400 Subject: [Biopython-dev] samtools threaded daemon Message-ID: Hi everyone, I've been doing a ton of mpileup work recently with samtools so I made a python daemon to parallelize the process. Is there any interest in a generic SamTools package for BioPython? I know pysam exists, but it'd be an added dependency as well as not threaded. In my experience, for querying a ton of positions threading mpileup is the best way to go (much faster than -l bed_file in my use cases). 
If there's interest, I'll package it as a general SamTools command line wrapper with the added bonuses that for certain operations you can input a list and thread those parts. Chris From christian at brueffer.de Thu Apr 11 02:10:06 2013 From: christian at brueffer.de (Christian Brueffer) Date: Thu, 11 Apr 2013 08:10:06 +0200 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: Message-ID: <516653BE.8060509@brueffer.de> On 4/11/13 4:57 , Chris Mitchell wrote: > Hi everyone, > > I've been doing a ton of mpileup work recently with samtools so I made a > python daemon to parallelize the process. Is there any interest in a > generic SamTools package for BioPython? I know pysam exists, but it'd be > an added dependency as well as not threaded. In my experience, for > querying a ton of positions threading mpileup is the best way to go (much > faster than -l bed_file in my use cases). If there's interest, I'll > package it as a general SamTools command line wrapper with the added > bonuses that for certain operations you can input a list and thread those > parts. > Hi Chris, sounds great! I use samtools/pysam a lot, so I'd appreciate another option. My colleague uses mpileup with pysam a lot as well, I'm sure he wouldn't mind some speedup in that area. Cheers, Chris From p.j.a.cock at googlemail.com Thu Apr 11 05:55:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Apr 2013 10:55:24 +0100 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: <516653BE.8060509@brueffer.de> References: <516653BE.8060509@brueffer.de> Message-ID: On Thu, Apr 11, 2013 at 7:10 AM, Christian Brueffer wrote: > On 4/11/13 4:57 , Chris Mitchell wrote: >> Hi everyone, >> >> I've been doing a ton of mpileup work recently with samtools so I made a >> python daemon to parallelize the process. Is there any interest in a >> generic SamTools package for BioPython? I know pysam exists, but it'd be >> an added dependency as well as not threaded. 
In my experience, for >> querying a ton of positions threading mpileup is the best way to go (much >> faster than -l bed_file in my use cases). If there's interest, I'll >> package it as a general SamTools command line wrapper with the added >> bonuses that for certain operations you can input a list and thread those >> parts. >> > > Hi Chris, > > sounds great! I use samtools/pysam a lot, so I'd appreciate another > option. My collegue uses mpileup with pysam a lot as well, I'm sure he > wouldn't mind some speedup in that area. > > Cheers, > > Chris A samtools command line wrapper sounds useful in itself. Saket has done a bwa wrapper I need to merge: https://github.com/biopython/biopython/pull/167 I think he was planning to do samtools next: https://github.com/saketkc/biopython/tree/samtools_wrapper What did you have in mind for threading? Automatically calling multiple independent samtools processes in the background? Peter P.S. See also this thread: http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html From p.j.a.cock at googlemail.com Thu Apr 11 06:12:48 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Apr 2013 11:12:48 +0100 Subject: [Biopython-dev] Debugging the warning filters in the unit tests In-Reply-To: References:

Message-ID: On Wed, Apr 10, 2013 at 5:30 PM, Wibowo Arindrarto wrote: > Hi Peter, > > Could we apply a simple warning filter prior to the doctest run? Something > like: > > with warnings.catch_warnings(): > warnings.simplefilter('ignore', ResourceWarning) > suite.run(result) > The ResourceWarning is already silenced by default, unless running Python in debug mode. The issue is that something is resetting the warning filters - most likely one of our unit tests doing something slightly wrong with warning filters. > However, maybe a related issue is whether we want to silence this > warnings at all? True, there is something to be said for deliberately showing any ResourceWarning messages from our test suite. > After digging a bit into the tests and code, the warnings do seem to be > generated by real leaky handles (so not a Python bug). For example, the > SearchIO and SeqIO warnings here: > http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/140/steps/shell/logs/stdio > seems to be caused by the `index` or `index_db` function not closing its > open file handles. When the doctest suite resets the global environment for > each test (doctest.py:L1439), Python discovers that these handles are still > open and raises a warning. The dictionary objects from SeqIO/SearchIO do have close methods. The SQLite backend had this for a while; I only added this to the in-memory dictionary to close the sequence file handle this week: https://github.com/biopython/biopython/commit/5822f83444d9ccf028bb32b5978208594b4d9c07 The SQLite handles had to be closed in order for the test suite to delete the index files after usage - which is why I added that close method much earlier. So we can now explicitly close those handles in the doctests, with the downside of making the example a little more complex. > I haven't checked if the same issue is the root cause of other > ResourceWarning occurences. Perhaps we need to look deeper > into this? 
One common bad pattern was unit test code which did things like this,

    data = open(filename).read()

rather than:

    with open(filename) as handle:
        data = handle.read()

or on pre-Python 2.5:

    handle = open(filename)
    data = handle.read()
    handle.close()

The one-line version is still very widely used but only safe on CPython where the garbage collection is predictable and will close the handle when it goes out of scope. Peter From p.j.a.cock at googlemail.com Thu Apr 11 06:17:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Apr 2013 11:17:52 +0100 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: <5162DDF9.1030508@biotech.uni-tuebingen.de> References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> Message-ID: On Mon, Apr 8, 2013 at 4:10 PM, Kai Blin wrote: > On 2013-03-29 12:54, Peter Cock wrote: >> >> Hi Brad, >> >> This sounds sensible in principle - it just needs some hands on testing >> on various systems - any volunteers who use PIP and virtual envs? > > > Sure, I've got a zoo of Ubuntu systems to try this on, and I'd actually be > very interested in this. I'll give it a go. > > Cheers, > Kai A whole zoo of Ubuntu systems? Sounds like an excellent test set :) How did it go? Peter From chris.mit7 at gmail.com Thu Apr 11 09:46:41 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Thu, 11 Apr 2013 09:46:41 -0400 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: For threading I am currently using map_async and subprocess to handle the threading and calling of samtools. There are some other details like using a generator to reduce the memory overhead (since map_async by itself runs the entire list and puts the return values into memory before sending them to you...obviously a terrible idea for 10000+ pileups, a generator gets around this by chunking the input, which reduces the gains from threading as it waits until all jobs are finished before submitting the next batch). If anyone knows of a way to have map_async or similar methods return values to the callback as threads finish, that would be good to know. From a user perspective, this is a simple example of how it works now:

    st = SamTools(bamSource,binary=sTools,threads=30)
    st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in xrange(2000001,2001001)],callback=processPileup)
    print st.mpileup(f=hg19,r='chr1:2000000-2000010')

For things that make sense to have multiple copies, like bamfiles, bed files, or positions, if a list is provided as that keyword, it will thread it. This will put 30 threads together, and call processPileup with the output. 
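On the question of getting results back as jobs finish: one option in the standard library is imap_unordered, which yields each result in completion order. Below is a minimal sketch, not the code in the branch. The names stream_results and mpileup_region and the file names reads.bam and ref.fa are placeholders invented for illustration, and the demo uses a stand-in worker so it runs without samtools installed:

```python
import subprocess
from multiprocessing.pool import ThreadPool


def mpileup_region(region, bam="reads.bam", ref="ref.fa", binary="samtools"):
    """Run one samtools mpileup job (placeholder file names, not called below)."""
    cmd = [binary, "mpileup", "-f", ref, "-r", region, bam]
    return subprocess.check_output(cmd)


def stream_results(worker, regions, callback, threads=4):
    """Call callback(region, output) as each job finishes, not at the end."""
    def job(region):
        return region, worker(region)

    pool = ThreadPool(threads)
    try:
        # imap_unordered yields results one at a time, in completion
        # order, so finished output does not pile up in one big list.
        for region, output in pool.imap_unordered(job, regions):
            callback(region, output)
    finally:
        pool.close()
        pool.join()


# Stand-in worker so the sketch runs without samtools installed.
seen = []
regions = ("chr1:%d-%d" % (i, i + 1) for i in range(2000001, 2000011))
stream_results(lambda region: region.upper(), regions,
               lambda region, output: seen.append((region, output)))
```

Threads are enough here because the real work happens in the samtools child processes. Passing mpileup_region as the worker would give the real behaviour, assuming samtools is on the PATH and the BAM file is indexed.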
Since there seems to be some interest, I'm going to look at the existing command line wrappers to make it consistent with BioPython's approach. Also, if a binary can't be found, having it fallback to the future BioPython parser seems like it might be a good idea (provided it has similar functionality like creating pileups, does it?). Chris On Thu, Apr 11, 2013 at 5:55 AM, Peter Cock wrote: > On Thu, Apr 11, 2013 at 7:10 AM, Christian Brueffer > wrote: > > On 4/11/13 4:57 , Chris Mitchell wrote: > >> Hi everyone, > >> > >> I've been doing a ton of mpileup work recently with samtools so I made a > >> python daemon to parallelize the process. Is there any interest in a > >> generic SamTools package for BioPython? I know pysam exists, but it'd > be > >> an added dependency as well as not threaded. In my experience, for > >> querying a ton of positions threading mpileup is the best way to go > (much > >> faster than -l bed_file in my use cases). If there's interest, I'll > >> package it as a general SamTools command line wrapper with the added > >> bonuses that for certain operations you can input a list and thread > those > >> parts. > >> > > > > Hi Chris, > > > > sounds great! I use samtools/pysam a lot, so I'd appreciate another > > option. My collegue uses mpileup with pysam a lot as well, I'm sure he > > wouldn't mind some speedup in that area. > > > > Cheers, > > > > Chris > > A samtools command line wrapper sounds useful in itself. Saket has > done a bwa wrapper I need to merge: > https://github.com/biopython/biopython/pull/167 > > I think he was planning to do samtools next: > https://github.com/saketkc/biopython/tree/samtools_wrapper > > What did you have in mind for threading? Automatically calling > multiple independent samtools processes in the background? > > Peter > > P.S. 
See also this thread: > http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html > From p.j.a.cock at googlemail.com Thu Apr 11 09:54:55 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Apr 2013 14:54:55 +0100 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell wrote: > Also, if a binary can't be found, having it fallback to the future > BioPython parser seems like it might be a good idea (provided it has > similar functionality like creating pileups, does it?). It has the low level random access via the BAI index done, but does not yet have a reimplementation of the mpileup code, no. (Would that be useful compared to calling samtools and parsing its output?) Peter From chris.mit7 at gmail.com Thu Apr 11 10:04:15 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Thu, 11 Apr 2013 10:04:15 -0400 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de>

Message-ID: Given that we'd be chasing after the samtools development cycle, I think it's just easier to implement command line wrappers that are dynamic enough to handle future versions. For instance, some of the code doesn't seem too set in stone and appears empirical (the BAQ computation comes to mind) and therefore probable to change in future versions. I can package in my existing pileup parser, but in general I think most people will be using a callback routine to handle it themselves since use cases of the final output sort of vary project by project. Chris On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: > On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell > wrote: > > Also, if a binary can't be found, having it fallback to the future > > BioPython parser seems like it might be a good idea (provided it has > > similar functionality like creating pileups, does it?). > > It has the low level random access via the BAI index done, but > does not yet have a reimplementation of the mpileup code, no. > (Would that be useful compared to calling samtools and parsing > its output?) > > Peter > From chris.mit7 at gmail.com Thu Apr 11 14:21:36 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Thu, 11 Apr 2013 14:21:36 -0400 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de>

Message-ID: Here's the branch I'm starting with, including a working mpileup daemon for those who want to use it: https://github.com/chrismit/biopython/tree/samtools sample usage:

    from Bio.SamTools import SamTools

    sTools = '/home/chris/bin/samtools'
    hg19 = '/media/chris/ChrisSSD/ref/human/hg19.fa'
    bamSource = '/media/chris/ChrisSSD/TH1Alignment/NK/accepted_hits.bam'
    st = SamTools(bamSource,binary=sTools,threads=30)

    #now with a callback, which is advisable to use to process data as it is generated
    def processPileup(pileup):
        print 'to process',pileup

    #st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in xrange(2000001,2001001)],callback=processPileup) #with callback
    #print st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in xrange(2000001,2000101)]) #will just return as a list

On Thu, Apr 11, 2013 at 10:04 AM, Chris Mitchell wrote: > Given that we'd be chasing after the samtools development cycle, I think > it's just easier to implement command line wrappers that are dynamic enough > to handle future versions. For instance, some of the code doesn't seem too > set in stone and appears empirical (the BAQ computation comes to mind) and > therefore probable to change in future versions. I can package in my > existing pileup parser, but in general I think most people will be using a > callback routine to handle it themselves since use cases of the final > output sort of vary project by project. > > Chris > > > On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: > >> On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell >> wrote: >> > Also, if a binary can't be found, having it fallback to the future >> > BioPython parser seems like it might be a good idea (provided it has >> > similar functionality like creating pileups, does it?). >> >> It has the low level random access via the BAI index done, but >> does not yet have a reimplementation of the mpileup code, no. >> (Would that be useful compared to calling samtools and parsing >> its output?) 
>> >> Peter >> > > From p.j.a.cock at googlemail.com Sun Apr 14 16:10:02 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 14 Apr 2013 21:10:02 +0100 Subject: [Biopython-dev] Level of Python 3 support? Message-ID: Hi all, We've have Python 3 coverage on the Travis continuous integration tests & the nightly builtbot tests for quite some time now. http://travis-ci.org/biopython/biopython http://testing.open-bio.org/biopython/tgrid There have been a few Python 3 specific issues reported (which have been fixed), but I think we should probably go ahead and say we support Python 3.1 or later (with the exception of some minor C code which has not been ported). Does this seem like a good move for the next release? We've already said that this (Biopython 1.62) will be the final release to support Python 2.5, which will help make some bits of the 2/3 compatibility code much simpler. Regards, Peter From p.j.a.cock at googlemail.com Sun Apr 14 17:30:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 14 Apr 2013 22:30:00 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 9, 2013 at 2:36 PM, Peter Cock wrote: > Hi all, > > First a general reminder that the BOSC 2013 abstract deadline is > this coming Friday 12 April, http://www.open-bio.org/wiki/BOSC_2013 > > Second, we need to prepare an abstract for the traditional Biopython > Project update. Both Brad and I should be there for BOSC 2013, > and the preceding mini hackathon/codefest to which everyone is > welcome: http://www.open-bio.org/wiki/Codefest_2013 > > Who else is planning to be at BOSC 2013 (and the Codefest), and > is there anyone keen to present the update? 
I'm happy to give the > talk, but this is a nice chance for someone else to present and give > a more rounded impression of the Biopython developers ;) > > According to http://biopython.org/wiki/Documentation#Presentations > the recent Biopython presenters at BOSC have been: > > BOSC 2013, Berlin, Germany - ??? > BOSC 2012, Long beach, USA - Eric > BOSC 2011, Vienna, Austria - Peter > BOSC 2010, Boston, USA - Brad > BOSC 2009, Stockholm, Sweden - Peter > BOSC 2008, Toronto, Canada - Tiago > BOSC 2007, Vienna, Austria - Peter > BOSC 2006, Fortaleza, Brazil - No one > ... > > So, any volunteers from our development team? I know from > personal experience that giving a talk can be very useful for > securing travel funding - and we can if need be ask the BOSC > committee to waive the registration fee (in a new initiative for this > year at BOSC). > > Thanks, > > Peter > (Disclaimer: I'm also on the BOSC committee) Looks like it will be me giving the talk then (although its not too late to change that if needed). I've attached the abstract as it stands, assuming we get a talk slot then we'll be able to make some revisions during the review process (the BOSC review panel may have comments to address). Feedback from you all is welcome too of course. Thanks, Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2013.pdf Type: application/pdf Size: 250680 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Biopython_BOSC_abstract_2013.tex Type: application/x-tex Size: 4439 bytes Desc: not available URL: From eric.talevich at gmail.com Sun Apr 14 21:18:58 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 14 Apr 2013 19:18:58 -0600 Subject: [Biopython-dev] Level of Python 3 support? 
In-Reply-To: References: Message-ID: On Sun, Apr 14, 2013 at 2:10 PM, Peter Cock wrote: > Hi all, > > We've have Python 3 coverage on the Travis continuous > integration tests & the nightly builtbot tests for quite some > time now. > http://travis-ci.org/biopython/biopython > http://testing.open-bio.org/biopython/tgrid > > There have been a few Python 3 specific issues reported > (which have been fixed), but I think we should probably > go ahead and say we support Python 3.1 or later (with > the exception of some minor C code which has not been > ported). > > Does this seem like a good move for the next release? > > Sounds good to me. Looking at the changes between 3.1 and 3.2: http://docs.python.org/3.2/whatsnew/3.2.html I don't see any features unique to 3.2+ that would justify us skipping version 3.1 support. The main changes that might affect compatibility between 3.1 and 3.2 are: - More attention was paid to strings vs. bytes, potentially fixing some bugs and making other functions in the standard library pickier about which types they accept. - Similarly for context management, the __enter__ and __exit__ methods were implemented on more resource types so the "with" statement will work. - ElementTree (xml.etree) got a long-awaited update. - In the gzip module, functions compress() and decompress() were added and GzipFile grew a peek() method. Would you want to use any of those? - There are some new unittest functions in both 3.2 and 2.7 that would probably be handy. From p.j.a.cock at googlemail.com Mon Apr 15 04:15:46 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 15 Apr 2013 09:15:46 +0100 Subject: [Biopython-dev] Level of Python 3 support? In-Reply-To: References:

Message-ID: On Monday, April 15, 2013, Eric Talevich wrote: > On Sun, Apr 14, 2013 at 2:10 PM, Peter Cock > > wrote: > >> Hi all, >> >> We've have Python 3 coverage on the Travis continuous >> integration tests & the nightly builtbot tests for quite some >> time now. >> http://travis-ci.org/biopython/biopython >> http://testing.open-bio.org/biopython/tgrid >> >> There have been a few Python 3 specific issues reported >> (which have been fixed), but I think we should probably >> go ahead and say we support Python 3.1 or later (with >> the exception of some minor C code which has not been >> ported). >> >> Does this seem like a good move for the next release? >> >> > Sounds good to me. Looking at the changes between 3.1 and 3.2: > http://docs.python.org/3.2/whatsnew/3.2.html > > I don't see any features unique to 3.2+ that would justify us skipping > version 3.1 support. The main changes that might affect compatibility > between 3.1 and 3.2 are: > > - More attention was paid to strings vs. bytes, potentially fixing some > bugs and making other functions in the standard library pickier about which > types they accept. > > - Similarly for context management, the __enter__ and __exit__ methods > were implemented on more resource types so the "with" statement will work. > > - ElementTree (xml.etree) got a long-awaited update. > > - In the gzip module, functions compress() and decompress() were added and > GzipFile grew a peek() method. Would you want to use any of those? > > - There are some new unittest functions in both 3.2 and 2.7 that would > probably be handy. > > All nice things. Unicode literals are back in Python 3.3 which would potentially make a joint codebase realistic (without needing 2to3), so there are reasons to set a more ambitious minimum version of Python 3 than 3.1 - but first lets drop Python 2.5 and work from there :) As a point of reference, the new ReportLab version is targeting Python 2.7 and 3.3 onwards only (IIRC). 
Peter From kai.blin at biotech.uni-tuebingen.de Mon Apr 15 07:58:22 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 15 Apr 2013 13:58:22 +0200 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> Message-ID: <516BEB5E.5030404@biotech.uni-tuebingen.de> On 2013-04-11 12:17, Peter Cock wrote: Hi folks, >> Sure, I've got a zoo of Ubuntu systems to try this on, and I'd actually be >> very interested in this. I'll give it a go. > > A whole zoo of Ubuntu systems? Sounds like an excellent test set :) > So, my tests were run like this: - set up an Ubuntu server install of the given version. - installed build-essential python-dev git-core python-virtualenv - ran: $ git clone git://github.com/chapmanb/biopython.git $ virtualenv test $ source test/bin/activate $ pushd biopython $ pip install . $ popd $ python -c "import Bio; print Bio.__version__" Current results are: 10.04: works great 12.04: fails, telling me that numpy is required 12.10: fails, telling me that numpy is required Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Mon Apr 15 09:08:54 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 15 Apr 2013 14:08:54 +0100 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: <516BEB5E.5030404@biotech.uni-tuebingen.de> References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> <516BEB5E.5030404@biotech.uni-tuebingen.de> Message-ID: On Mon, Apr 15, 2013 at 12:58 PM, Kai Blin wrote: > On 2013-04-11 12:17, Peter Cock wrote: > > Hi folks, > >>> Sure, I've got a zoo of Ubuntu systems to try this on, and I'd actually >>> be >>> very interested in this. I'll give it a go. >> >> >> A whole zoo of Ubuntu systems? Sounds like an excellent test set :) >> > > So, my tests were run like this: > > - set up an Ubuntu server install of the given version. > - installed build-essential python-dev git-core python-virtualenv > - ran: > $ git clone git://github.com/chapmanb/biopython.git > $ virtualenv test > $ source test/bin/activate > $ pushd biopython > $ pip install . > $ popd > $ python -c "import Bio; print Bio.__version__" > > Current results are: > 10.04: works great > 12.04: fails, telling me that numpy is required > 12.10: fails, telling me that numpy is required That sounds like what Brad intended, so that's good. Does this seem like an improvement to you? Peter From chapmanb at 50mail.com Mon Apr 15 09:02:55 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 15 Apr 2013 09:02:55 -0400 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: <516BEB5E.5030404@biotech.uni-tuebingen.de> References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> <516BEB5E.5030404@biotech.uni-tuebingen.de> Message-ID: <87ehec7xq8.fsf@fastmail.fm> Kai; Thanks for testing this out. For your failures on 12.04 and 12.10: - What version of pip are you using? (pip --version) - Is it failing during the install process, or at some other point? I'll work on reproducing this here. Thanks again, Brad > On 2013-04-11 12:17, Peter Cock wrote: > > Hi folks, > >>> Sure, I've got a zoo of Ubuntu systems to try this on, and I'd actually be >>> very interested in this. I'll give it a go. >> >> A whole zoo of Ubuntu systems? Sounds like an excellent test set :) >> > > So, my tests were run like this: > > - set up an Ubuntu server install of the given version. > - installed build-essential python-dev git-core python-virtualenv > - ran: > $ git clone git://github.com/chapmanb/biopython.git > $ virtualenv test > $ source test/bin/activate > $ pushd biopython > $ pip install . > $ popd > $ python -c "import Bio; print Bio.__version__" > > Current results are: > 10.04: works great > 12.04: fails, telling me that numpy is required > 12.10: fails, telling me that numpy is required > > Cheers, > Kai > > -- > Dipl.-Inform. 
Kai Blin kai.blin at biotech.uni-tuebingen.de > Institute for Microbiology and Infection Medicine > Division of Microbiology/Biotechnology > Eberhard-Karls-Universität Tübingen > Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 > D-72076 Tübingen Fax : ++49 7071 29-5979 > Germany > Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Mon Apr 15 10:29:45 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 15 Apr 2013 14:29:45 +0000 Subject: [Biopython-dev] [Biopython - Feature #3427] (New) Cache paths in Bio.Phylo trees for later use Message-ID: Issue #3427 has been reported by Ben Morris. ---------------------------------------- Feature #3427: Cache paths in Bio.Phylo trees for later use https://redmine.open-bio.org/issues/3427 Author: Ben Morris Status: New Priority: Normal Assignee: Category: Target version: URL: I'm doing some analyses using Bio.Phylo in which I need to find many distances between pairs of taxa, and have found that this can be quite slow as currently implemented. My current solution is to extend Newick.Tree objects and cache the result of the get_path function. This way, after finding the distance between species A and B, finding the distance between A and C doesn't require recomputing the path from A to the root. Example:

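from Bio import Phylo as bp  # assumed meaning of the "bp" alias used below

class CachingTree(bp.Newick.Tree):
    def __init__(self, tree):
        self.__dict__.update(tree.__dict__)
        # Per-instance cache; a class-level dict would be shared by every tree
        self._paths = {}

    def get_path(self, target, **kwargs):
        # Note: the cache key ignores any extra keyword arguments
        if target not in self._paths:
            self._paths[target] = bp.Newick.Tree.get_path(self, target=target,
                                                          **kwargs)
        return self._paths[target]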
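
The caching idea in the class above can also be sketched without Biopython, which may make the saving easier to see. The block below is only an illustration under stated assumptions: Node, PathCache, path_to_root and distance are hypothetical names invented here, not part of Bio.Phylo, and the tree is a toy one built from parent pointers. Each taxon's path to the root is computed once and reused for every later pairwise distance.

```python
class Node:
    """Hypothetical toy tree node with a parent pointer (not a Bio.Phylo class)."""

    def __init__(self, name, parent=None, branch_length=0.0):
        self.name = name
        self.parent = parent
        self.branch_length = branch_length  # length of the branch above this node


class PathCache:
    """Caches each node's path to the root so it is walked only once."""

    def __init__(self):
        self._paths = {}   # per-instance cache: node -> list of nodes up to the root
        self.walks = 0     # how many times a path was actually computed

    def path_to_root(self, node):
        if node not in self._paths:
            self.walks += 1
            path = []
            current = node
            while current is not None:
                path.append(current)
                current = current.parent
            self._paths[node] = path
        return self._paths[node]

    def distance(self, a, b):
        path_a = self.path_to_root(a)
        path_b = self.path_to_root(b)
        shared = set(path_a) & set(path_b)
        total = 0.0
        # Add branch lengths on each side until the first shared ancestor is reached
        for path in (path_a, path_b):
            for node in path:
                if node in shared:
                    break
                total += node.branch_length
        return total


root = Node("root")
inner = Node("inner", parent=root, branch_length=1.0)
taxon_a = Node("A", parent=inner, branch_length=2.0)
taxon_b = Node("B", parent=inner, branch_length=3.0)
taxon_c = Node("C", parent=root, branch_length=4.0)

cache = PathCache()
print(cache.distance(taxon_a, taxon_b))  # A-B via "inner"
print(cache.distance(taxon_a, taxon_c))  # A-C via "root"; the path for A is reused
print(cache.walks)                       # one walk per distinct taxon
```

With three taxa and two distance queries, only three root walks happen, because the path for A is served from the cache on the second query. Any change to the tree topology would invalidate such a cache, which is one design question for putting this into a base class.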

Should this functionality be incorporated into the BaseTree class itself? ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From kai.blin at biotech.uni-tuebingen.de Mon Apr 15 11:33:02 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 15 Apr 2013 17:33:02 +0200 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: <87ehec7xq8.fsf@fastmail.fm> References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> <516BEB5E.5030404@biotech.uni-tuebingen.de> <87ehec7xq8.fsf@fastmail.fm> Message-ID: <516C1DAE.4020800@biotech.uni-tuebingen.de> On 2013-04-15 15:02, Brad Chapman wrote: Hi Brad, > Thanks for testing this out. For your failures on 12.04 and 12.10: > > - What version of pip are you using? (pip --version) 12.04 has pip 1.1 installed, with python 2.7. If I upgrade to current pip 1.3.1, the biopython install succeeds. 12.10 also ships with pip 1.1, so the same workaround applies. > - Is it failing during the install process, or at some other point? Oh, sorry. The pip install command returns 255 and prints the blurb about installing NumPy. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From chapmanb at 50mail.com Mon Apr 15 13:39:37 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 15 Apr 2013 13:39:37 -0400 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: <516C1DAE.4020800@biotech.uni-tuebingen.de> References:

<5162DDF9.1030508@biotech.uni-tuebingen.de> <516BEB5E.5030404@biotech.uni-tuebingen.de> <87ehec7xq8.fsf@fastmail.fm> <516C1DAE.4020800@biotech.uni-tuebingen.de> Message-ID: <87bo9f8zhi.fsf@fastmail.fm> Kai; >> - What version of pip are you using? (pip --version) > > 12.04 has pip 1.1 installed, with python 2.7. > If I upgrade to current pip 1.3.1, the biopython install succeeds. > > 12.10 also ships with pip 1.1, so the same workaround applies. Perfect, thanks for the version information. I reproduced this and checked in a fix so it'll work out of the box with pip 1.1 as well. That part is a bit ugly because there isn't a clean way to know what tool is installing it, so I need to inspect the commandline and infer. Thanks again for testing. Let us know if you run into any other issues, Brad From eric.talevich at gmail.com Mon Apr 15 20:43:12 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 15 Apr 2013 18:43:12 -0600 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References:

Message-ID: On Sun, Apr 14, 2013 at 3:30 PM, Peter Cock wrote: > On Tue, Apr 9, 2013 at 2:36 PM, Peter Cock > wrote: > > Hi all, > > > > First a general reminder that the BOSC 2013 abstract deadline is > > this coming Friday 12 April, http://www.open-bio.org/wiki/BOSC_2013 > > > > Second, we need to prepare an abstract for the traditional Biopython > > Project update. Both Brad and I should be there for BOSC 2013, > > and the preceding mini hackathon/codefest to which everyone is > > welcome: http://www.open-bio.org/wiki/Codefest_2013 > > > > Who else is planning to be at BOSC 2013 (and the Codefest), and > > is there anyone keen to present the update? I'm happy to give the > > talk, but this is a nice chance for someone else to present and give > > a more rounded impression of the Biopython developers ;) > > > > According to http://biopython.org/wiki/Documentation#Presentations > > the recent Biopython presenters at BOSC have been: > > > > BOSC 2013, Berlin, Germany - ??? > > BOSC 2012, Long beach, USA - Eric > > BOSC 2011, Vienna, Austria - Peter > > BOSC 2010, Boston, USA - Brad > > BOSC 2009, Stockholm, Sweden - Peter > > BOSC 2008, Toronto, Canada - Tiago > > BOSC 2007, Vienna, Austria - Peter > > BOSC 2006, Fortaleza, Brazil - No one > > ... > > > > So, any volunteers from our development team? I know from > > personal experience that giving a talk can be very useful for > > securing travel funding - and we can if need be ask the BOSC > > committee to waive the registration fee (in a new initiative for this > > year at BOSC). > > > > Thanks, > > > > Peter > > (Disclaimer: I'm also on the BOSC committee) > > Looks like it will be me giving the talk then (although > its not too late to change that if needed). I've attached > the abstract as it stands, assuming we get a talk slot > then we'll be able to make some revisions during the > review process (the BOSC review panel may have > comments to address). > > Feedback from you all is welcome too of course. 
> > Thanks, > > Peter > The abstract looks good to me. Which release was the first to include SearchIO, was that 1.61? If so, maybe it would be good to note that in addition to the smaller improvements, SearchIO specifically was (one of?) the new module(s) that introduced the beta designation. From p.j.a.cock at googlemail.com Tue Apr 16 04:47:01 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Apr 2013 09:47:01 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References:

Message-ID: On Tue, Apr 16, 2013 at 1:43 AM, Eric Talevich wrote: > > The abstract looks good to me. Which release was the first to include > SearchIO, was that 1.61? If so, maybe it would be good to note that in > addition to the smaller improvements, SearchIO specifically was (one of?) > the new module(s) that introduced the beta designation. > Yes, SearchIO was included in Biopython 1.61, but you're right that could be made a bit clearer. Thanks, Peter From redmine at redmine.open-bio.org Tue Apr 16 09:48:14 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 16 Apr 2013 13:48:14 +0000 Subject: [Biopython-dev] [Biopython - Bug #2235] SeqRecord from Bio.SwissProt.SProt lacks annotation information References: Message-ID: Issue #2235 has been updated by Peter Cock. Assignee changed from Peter Cock to Biopython Dev Mailing List ---------------------------------------- Bug #2235: SeqRecord from Bio.SwissProt.SProt lacks annotation information https://redmine.open-bio.org/issues/2235 Author: Peter Cock Status: In Progress Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: This also affects Bio.SeqIO as it calls Bio.SwissProt.SProt to parse "swiss" format files into SeqRecord objects: The _SequenceConsumer called by the SequenceParser (both defined in Bio/SwissProt/SProt.py) records only the bare minimum of sequence, id, name and description. It should implement methods for more of the consumer events (i.e. Swiss-Prot line types), storing the data in a format as close as possible to the SeqRecord objects produced by Bio.GenBank I plan to tackle this after the next release of Biopython, together with a more detailed set of tests for Bio.SeqIO that look at the annotation and database cross references of SeqRecord objects. -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From albl500 at york.ac.uk Fri Apr 19 07:29:28 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Fri, 19 Apr 2013 12:29:28 +0100 Subject: [Biopython-dev] Parsing fastq files with SeqIO.parser(handle) Message-ID: Dear BioPython Devs, Probably a strange request, but I was wondering if it might be a good idea to make the fasta parser raise an error when it is asked to parse incorrectly formatted files. I ask, because a while ago, I made a simple command line utility to convert sequence files to/from various formats, using SeqIO.parser. It's attached if anyone's interested. My supervisor's now using it to filter fastq formatted sequences by length, but keeps forgetting to add a '-format fastq' option. The script by default assumes fasta formatted sequences, which, like SeqIO.parser is by design, but the problem is that the parser doesn't mind at all when a fastq file doesn't contain a single ">" character. Are there any interfaces to make the fasta parser stricter? This error is completely silent until picked up by external programs; hmmer, in this instance. Ideally, an error would be raised much earlier in the process, especially as the department's NFS servers take ages to retrieve and convert an IonTorrent dataset. (I've got him using /var/tmp for the converted files, but he keeps the original fastq's in an NFS home folder, which is sloooooow). The department's using BioPython 1.57 btw. Thanks for your time. Kind regards, Alex p.s. Don't suppose there's any plans to implement any parsers as C-extensions? --- Alex Leach. BSc, MRes Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm -------------- next part -------------- A non-text attachment was scrubbed... 
Name: seqDB.py Type: application/octet-stream Size: 10674 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Apr 19 09:24:39 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 19 Apr 2013 14:24:39 +0100 Subject: [Biopython-dev] Parsing fastq files with SeqIO.parser(handle) In-Reply-To: References: Message-ID: Hi Alex, Are you talking about FASTA or FASTQ or both? On Fri, Apr 19, 2013 at 12:29 PM, Alex Leach wrote: > Dear BioPython Devs, > > Probably a strange request, but I was wondering if it might be a good idea > to make the fasta parser raise an error when it is asked to parse > incorrectly formatted files. The FASTA parser (based on the older versions in Biopython) tolerates random stuff before the first record. This has some practical uses like the fact it can be used to parse the FASTA block at the end of a GFF file. Changing that might cause trouble... > I ask, because a while ago, I made a simple command line utility to convert > sequence files to/from various formats, using SeqIO.parser. It's attached if > anyone's interested. For simple jobs, Bio.SeqIO.convert is nice and often faster too. I see you're making heavy use of the SeqRecord's format method. That is not a good idea for speed (as the docstring and documentation tries to explain), as it makes a StringIO handle and calls SeqIO.write internally - just call SeqIO.write directly with your file handle. > My supervisor's now using it to filter fastq formatted sequences by length, > but keeps forgetting to add a '-format fastq' option. The script by default > assumes fasta formatted sequences, which, like SeqIO.parser is by design, > but the problem is that the parser doesn't mind at all when a fastq file > doesn't contain a single ">" character. > > Are there any interfaces to make the fasta parser stricter? This error is > completely silent until picked up by external programs; hmmer, in this > instance. 
Ideally, an error would be raised much earlier in the process, > especially as the department's NFS servers take ages to retrieve and convert > an IonTorrent dataset. (I've got him using /var/tmp for the converted files, > but he keeps the original fastq's in an NFS home folder, which is > sloooooow). In your shoes I'd write some sanity testing into the script, for example check the filesize for hints of truncation or being empty. Also I'd try parsing it just enough to check the first few reads are sane. > The department's using BioPython 1.57 btw. > That's quite elderly now, released two years ago. > Thanks for your time. > Kind regards, > Alex > > p.s. Don't suppose there's any plans to implement any parsers as > C-extensions? I don't have any such plans. It would be possible and likely faster for the special case where you can push the file handle stuff all to the C level, but then it won't cope with general Python handles (e.g. StringIO, network handles, decompressed files, etc). And it won't work under Jython, and becomes more complex for PyPy. Also from the benchmarking I've done, even with FASTA and FASTQ one of the major time sinks is building the Python objects. If you stick with strings using for example then even parsing in pure Python is much quicker. See: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ from Bio.SeqIO.QualityIO import FastqGeneralIterator from Bio.SeqIO.FastaIO import SimpleFastaParser There are other ideas about, e.g. lazy loading which I suggested as a possible GSoC project this year: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html From albl500 at york.ac.uk Fri Apr 19 12:04:28 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Fri, 19 Apr 2013 17:04:28 +0100 Subject: [Biopython-dev] Parsing fastq files with SeqIO.parser(handle) In-Reply-To: References:

Message-ID: Hi Peter, On Fri, 19 Apr 2013 14:24:39 +0100, Peter Cock wrote: >> >> p.s. Don't suppose there's any plans to implement any parsers as >> C-extensions? > > I don't have any such plans. It would be possible and likely faster > for the special case where you can push the file handle stuff all > to the C level, but then it won't cope with general Python handles > (e.g. StringIO, network handles, decompressed files, etc). And > it won't work under Jython, and becomes more complex for PyPy. I haven't done any in-depth reading into Jython or PyPy, but don't all file-like objects aim to provide the same API? At least on the Python side, the only difference should be whether a file-like object is seek-able or not, I would have thought... Bearing this in mind, I would have thought that even when using odd handles, like StringIO objects, the C-Python Object Protocol functions[1] could be used to interface with them. [1] - http://docs.python.org/2/c-api/object.html e.g.

void write_to_obj_handle(PyObject * obj_handle, PyObject * sequence_string) {
    // Get the "write" method on "obj_handle"
    // (PyObject_GetAttrString takes the attribute name as a C string)
    PyObject * handle_writer = PyObject_GetAttrString(obj_handle, "write");
    // Call "obj_handle.write(sequence_string)"
    PyObject_CallFunctionObjArgs(handle_writer, sequence_string, NULL);
    ...

Still, it would be a horrendous amount of work to get anything nearly as flexible as the current parsers, which as you say, might not get much faster anyway. > > Also from the benchmarking I've done, even with FASTA and > FASTQ one of the major time sinks is building the Python objects. > If you stick with strings using for example then even parsing in > pure Python is much quicker. See: Yes, creating millions of PyObject's does add a lot of overhead... I've been investigating various extension module techniques and it seems the "best" ones, like e.g. numpy arrays, use a dedicated container object (MultipleSeqAlignment?) which holds C-type records.
Only when an individual item or slice is requested by Python is it converted into a PyObject (SeqRecord?), but functions written in C, Fortran and C++ can still use the raw C-Level API. For heavy computations on big matrices, this essentially removes all Python-related overhead in PyObject's excessive usage of the heap. > > http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ > from Bio.SeqIO.QualityIO import FastqGeneralIterator > from Bio.SeqIO.FastaIO import SimpleFastaParser Thanks, I've now read the article. Some good info in there! I haven't really done too much with different fastq formats, but that day will no doubt soon come... > > There are other ideas about, e.g. lazy loading which I suggested > as a possible GSoC project this year: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html Just re-read your email and looked back through my old ArbIO.py module[2]. Now I know why I didn't say anything! Looks like I wrote that a looong time ago, and the code is a bit of a mess. It was designed to work only with Arb Silva's almost unique way of formatting fasta files. It worked very well for that use case. Your ideas regarding BAM, tabix and BioSQL sound a much better idea than using pickle dumps to save indexes, though. Shame the GSOC proposals got bounced... Kind regards, Alex [2] - http://code.google.com/p/ssummo/source/browse/trunk/ssummo/lib/ArbIO.py?r=5 -- --- Alex Leach. 
BSc, MRes Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm From jan.kosinski at gmail.com Tue Apr 23 04:09:30 2013 From: jan.kosinski at gmail.com (Jan Kosinski) Date: Tue, 23 Apr 2013 10:09:30 +0200 Subject: [Biopython-dev] Bug: Bio.PDB.DSSP fails on PDB files with no header Message-ID: The following script:

import sys
import Bio
from Bio.PDB.DSSP import DSSP
from Bio.PDB.PDBParser import PDBParser

pdb_filename = sys.argv[1]
p = PDBParser()
structure = p.get_structure('chupacabra', pdb_filename)
model = structure[0]

biodssp = DSSP(model, pdb_filename, dssp="dssp")

will fail on files without HEADER with traceback:

Traceback (most recent call last):
  File "testdssp_fail.py", line 13, in
    biodssp = DSSP(model, pdb_filename, dssp="dssp")
  File "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", line 200, in __init__
    dssp_dict, dssp_keys = dssp_dict_from_pdb_file(pdb_file, dssp)
  File "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", line 101, in dssp_dict_from_pdb_file
    out_dict, keys = make_dssp_dict(out_file.name)
  File "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", line 121, in make_dssp_dict
    if sl[1] == "RESIDUE":
IndexError: list index out of range

This is because in newer dssp versions the header of the DSSP output has empty lines if the PDB file had no HEADER, so this part of the make_dssp_dict function fails:

    for l in handle.readlines():
        sl = l.split()
        if sl[1] == "RESIDUE":

Changing it to something like:

        if len(sl) > 1 and sl[1] == "RESIDUE":

fixes the problem.
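
To illustrate why the length check matters, here is a minimal standalone sketch, not the actual Bio.PDB.DSSP code. The function name find_residue_table_start and the abbreviated sample text are made up for this example; only the guard itself comes from the suggestion above. A blank line splits to an empty list, so indexing element 1 without the check raises IndexError.

```python
import io


def find_residue_table_start(handle):
    """Return the index of the line whose second field is RESIDUE, or None.

    Hypothetical helper for illustration; it mirrors the guarded test
    suggested for make_dssp_dict but is not Biopython code.
    """
    for index, line in enumerate(handle):
        fields = line.split()
        # A blank line gives an empty list, so check the length before indexing
        if len(fields) > 1 and fields[1] == "RESIDUE":
            return index
    return None


# Made-up, heavily abbreviated stand-in for DSSP output with an empty header line
sample = io.StringIO(
    "==== Secondary Structure Definition ====\n"
    "REFERENCE W. KABSCH AND C.SANDER\n"
    "\n"
    "COMPND    2 MOLECULE: EXAMPLE\n"
    "  #  RESIDUE AA STRUCTURE\n"
    "    1    1 A M\n"
)
print(find_residue_table_start(sample))  # index of the "#  RESIDUE" line
```

The same loop without the length check would stop with IndexError on the third line of the sample, which matches the failure in the traceback.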
Cheers, Jan From p.j.a.cock at googlemail.com Tue Apr 23 04:14:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Apr 2013 09:14:23 +0100 Subject: [Biopython-dev] Bug: Bio.PDB.DSSP fails on PDB files with no header In-Reply-To: References: Message-ID: On Tue, Apr 23, 2013 at 9:09 AM, Jan Kosinski wrote: > The following script: > > import sys > import Bio > from Bio.PDB.DSSP import DSSP > from Bio.PDB.PDBParser import PDBParser > > pdb_filename = sys.argv[1] > p = PDBParser() > structure = p.get_structure('chupacabra', pdb_filename) > model = structure[0] > > biodssp = DSSP(model, pdb_filename, dssp="dssp") > > will fail on files without HEADER with traceback: > Traceback (most recent call last): > File "testdssp_fail.py", line 13, in > biodssp = DSSP(model, pdb_filename, dssp="dssp") > File > "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", > line 200, in __init__ > dssp_dict, dssp_keys = dssp_dict_from_pdb_file(pdb_file, dssp) > File > "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", > line 101, in dssp_dict_from_pdb_file > out_dict, keys = make_dssp_dict(out_file.name) > File > "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py", > line 121, in make_dssp_dict > if sl[1] == "RESIDUE": > IndexError: list index out of range > > This is because in newer dssp versions the header of DSSP output will have > empty lines if the PDB file had no HEADER and this part of make_dssp_dict > function will fail > for l in handle.readlines(): > sl = l.split() > if sl[1] == "RESIDUE: > > Changing it to sth like: > if len(sl) > 1 and sl[1] == "RESIDUE": > fixes the problem. > > Cheers, > Jan Hi Jan, Do you have any (small) sample output we could use for a test case to include with Biopython? Do you know which versions of DSSP are affected? 
Thanks, Peter From jan.kosinski at gmail.com Tue Apr 23 04:42:49 2013 From: jan.kosinski at gmail.com (Jan Kosinski) Date: Tue, 23 Apr 2013 10:42:49 +0200 Subject: [Biopython-dev] Bug: Bio.PDB.DSSP fails on PDB files with no header In-Reply-To: References:

Message-ID:

Just get any and remove the header:

wget http://www.rcsb.org/pdb/files/1X9Z.pdb
grep -v HEADER 1X9Z.pdb > 1X9Z.no_header.pdb

The current versions of dssp (2.0.4 or 2.1.0 from http://swift.cmbi.ru.nl/gv/dssp/) work but give:

  .
  REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983) 2577-2637 .
  .
  COMPND 2 MOLECULE: DNA MISMATCH REPAIR PROTEIN MUTL;

Note the empty line between REFERENCE and COMPND.

Older versions of dssp (e.g. CMBI version by Elmar.Krieger at cmbi.kun.nl / November 18,2002 ) were giving:

  !!! HEADER-card missing !!!
  !!! COMPOUND-card missing !!!
  !!! SOURCE-card missing !!!
  !!! AUTHOR-card missing !!!

in place of this empty line.

On Tue, Apr 23, 2013 at 10:14 AM, Peter Cock wrote:
> On Tue, Apr 23, 2013 at 9:09 AM, Jan Kosinski
> wrote:
> > The following script:
> >
> > import sys
> > import Bio
> > from Bio.PDB.DSSP import DSSP
> > from Bio.PDB.PDBParser import PDBParser
> >
> > pdb_filename = sys.argv[1]
> > p = PDBParser()
> > structure = p.get_structure('chupacabra', pdb_filename)
> > model = structure[0]
> >
> > biodssp = DSSP(model, pdb_filename, dssp="dssp")
> >
> > will fail on files without HEADER with traceback:
> > Traceback (most recent call last):
> >   File "testdssp_fail.py", line 13, in
> >     biodssp = DSSP(model, pdb_filename, dssp="dssp")
> >   File
> >
> "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py",
> > line 200, in __init__
> >     dssp_dict, dssp_keys = dssp_dict_from_pdb_file(pdb_file, dssp)
> >   File
> >
> "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py",
> > line 101, in dssp_dict_from_pdb_file
> >     out_dict, keys = make_dssp_dict(out_file.name)
> >   File
> >
> "/home/modorama_dev/modorama/ENV_INI/lib/python2.7/site-packages/Bio/PDB/DSSP.py",
> > line 121, in make_dssp_dict
> >     if sl[1] == "RESIDUE":
> > IndexError: list index out of range
> >
> > This is because in newer dssp versions the header of DSSP output will
> have
> > empty lines if the PDB file had no HEADER and this part of make_dssp_dict
> > function will fail
> > for l in handle.readlines():
> >     sl = l.split()
> >     if sl[1] == "RESIDUE":
> >
> > Changing it to sth like:
> >     if len(sl) > 1 and sl[1] == "RESIDUE":
> > fixes the problem.
> >
> > Cheers,
> > Jan
>
> Hi Jan,
>
> Do you have any (small) sample output we could use for a test case
> to include with Biopython?
>
> Do you know which versions of DSSP are affected?
>
> Thanks,
>
> Peter
>

From p.j.a.cock at googlemail.com Wed Apr 24 15:19:08 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 24 Apr 2013 20:19:08 +0100
Subject: [Biopython-dev] OBF not accepted for GSoC 2013
In-Reply-To:
References:

Message-ID: On Tue, Apr 9, 2013 at 12:08 PM, Peter Cock wrote: > On Tue, Apr 9, 2013 at 11:41 AM, Peter Cock wrote: >> For any Biopython based projects, I think our best bet is to >> talk to NESCent (with whom we've had a couple of GSoC >> students prior to the OBF applying directly). That seems >> like a good fit with the phylogenetic ideas Eric suggested: >> http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 >> >> Looking over the list of other accepted applications, the >> Python Software Foundation (PSF) was also accepted >> and they're doing this as an umbrella organisation - so >> also worth talking to: >> http://wiki.python.org/moin/SummerOfCode/2013 > > Not all the GSoC ideas we'd come up with for 2013 would > fit under NEScent's interests, so I've emailed the PSF > contact person listed on that page to see if they are still > willing to add more projects under their umbrella. NESCent seem OK with the current GSoC ideas, so being able to participate in GSoC 2013 that way is great. Thanks Eric for sorting that out and updating the wiki page. The PSF would also have been possible, but the timing didn't work out - I think there were a lot of other Python projects making contact at the same time and responding to us all was a challenge. So we'll want to plan this well in advance next year - we weren't expecting the OBF application to be turned down, but Google seems to want to rotate the slots and bring in new organisations as well. 
I'll send out a general email in a moment, as we've not had much public interest from potential students yet :( Peter From p.j.a.cock at googlemail.com Wed Apr 24 15:19:48 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Apr 2013 20:19:48 +0100 Subject: [Biopython-dev] Biopython GSoC 2013 applications via NESCent Message-ID: To all the Biopythoneers, For the last few years Biopython has participated in the Google Summer of Code (GSoC) program under the umbrella of the Open Bioinformatics Foundation (OBF): https://developers.google.com/open-source/soc/ https://github.com/OBF/GSoC Unfortunately, like quite a few previously accepted organisations, this year the OBF was not accepted. Google has kept the total about the same year on year, so this is probably simply a slot rotation to get some new organisations involved. The good news (for those not following the Biopython-dev mailing list) is we have an alternative option agreed with the good people at NESCent, as we did back in 2009: http://biopython.org/wiki/Google_Summer_of_Code http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 I'd like to thank Eric for co-ordinating this, and encourage any interested potential students to sign up to the Biopython development list and NESCent's Google+ group as soon as possible (if you haven't done so already): http://lists.open-bio.org/mailman/listinfo/biopython-dev https://plus.google.com/communities/105828320619238393015 Google are already accepting student applications, and the deadline is Friday 3 May. That doesn't leave very long for asking for feedback and talking to potential mentors - which is essential for a competitive proposal. 
Thank you for your interest, Peter From natemsutton at yahoo.com Fri Apr 26 19:11:38 2013 From: natemsutton at yahoo.com (Nate Sutton) Date: Fri, 26 Apr 2013 16:11:38 -0700 (PDT) Subject: [Biopython-dev] Work note In-Reply-To: References: <1365455825.76591.YahooMailNeo@web122604.mail.ne1.yahoo.com> Message-ID: <1367017898.85568.YahooMailNeo@web122602.mail.ne1.yahoo.com> Thanks, I apologize for the delayed response, I am getting a different computer so my computer access was limited. I created a branch for this: https://github.com/nmsutton/biopython/tree/PhyloDrawAdditions So far I have been able to create code to pass kwargs to pyplot (bottom of utils.py). It looks like you accomplished allowing formatting options for confidence/support values (good work :) ). The last goal is to return a mapping of clade objects and I will work on that next. ________________________________ From: Eric Talevich To: Nate Sutton Cc: "biopython-dev at lists.open-bio.org" Sent: Wednesday, April 10, 2013 12:44 PM Subject: Re: [Biopython-dev] Work note Hi Nate, Cool. Which aspects of Phylo.draw are you working on? Is there a branch we could watch on GitHub? Thanks, Eric On Mon, Apr 8, 2013 at 5:17 PM, Nate Sutton wrote: Hi, >I wanted to post a quick note keep efforts organized, I am working on and have some progress with https://redmine.open-bio.org/issues/3336 . >-Nate >_______________________________________________ >Biopython-dev mailing list >Biopython-dev at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython-dev > From zhigang.wu at email.ucr.edu Fri Apr 26 20:52:19 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Fri, 26 Apr 2013 17:52:19 -0700 Subject: [Biopython-dev] Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Peter, I am interested in implementing the lazy-loading sequence parsers. I know the time is pretty tight for me to write a proposal on it. 
But even I cannot contribute under the umbrella of GSoC and assuming no body is implemented, I am still interested in implementing this (I just wanna have something nice on my CV and while contributing to Open source software community as well). While at this moment, I don't have very clear picture on how to do it. Can you point me to somewhere where I can start to get a sense how this can be implemented. As far as I know, samtools (view) may have similar techniques in them. Thanks. Zhigang On Wed, Apr 24, 2013 at 12:19 PM, Peter Cock wrote: > To all the Biopythoneers, > > For the last few years Biopython has participated in the > Google Summer of Code (GSoC) program under the umbrella > of the Open Bioinformatics Foundation (OBF): > https://developers.google.com/open-source/soc/ > https://github.com/OBF/GSoC > > Unfortunately like quite a few previously accepted organisations, > this year the OBF not accepted. Google has kept the total about > the same year on year, so this is probably simply a slot rotation > to get some new organisations involved. > > The good news (for those not following the Biopython-dev > mailing list) is we have an alternative option agreed with > the good people at NESCent, as we did back in 2009: > > http://biopython.org/wiki/Google_Summer_of_Code > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 > > I'd like to thank Eric for co-ordinating this, and encourage > any interested potential students to sign up to the Biopython > development list and NESCent's Google+ group as soon as > possible (if you haven't done so already): > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > https://plus.google.com/communities/105828320619238393015 > > Google are already accepting student applications, and the > deadline is Friday 3 May. That doesn't leave very long for > asking feedback and talking to potential mentors - which > is essential for a competitive proposal. 
> > Thank you for your interest, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From yeyanbo289 at gmail.com Fri Apr 26 23:16:22 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Sat, 27 Apr 2013 11:16:22 +0800 Subject: [Biopython-dev] Self Introduction Message-ID: Hi guys, My name is Yanbo Ye. I'm a third-year master's student in bioinformatics (mainly phylogenetics) in China. My master's project is about the genome evolution of baculovirus and the development of a bioinformatics tool. The tool is a Java program named BlastGraph that can be used for sequence clustering, genome tree construction and gene gain and loss estimation. This involves a range of commonly used phylogenetics algorithms. So based on my experience, I'd like to contribute to the Phylo package of Biopython by implementing some missing methods during the Google Summer of Code this year, which will be under the direction of Eric Talevich. I have been using Python and Biopython for nearly a year to do some sequence file parsing and tree manipulation. But I must say I'm still new to Python in terms of design conventions, popular Python packages, etc. Anyway, I'm a fast learner and would like to improve my Python programming skills through this project. I'd like to discuss with you during this project and hope you could help me. Thanks, Yanbo -- Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From rz1991 at foxmail.com Fri Apr 26 22:12:11 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Sat, 27 Apr 2013 10:12:11 +0800 Subject: [Biopython-dev] GSoC Application Message-ID: Hi Biopython developers, I hope it's not too late to inform you of my interest in Biopython GSoC, but I talked to Eric Talevich several weeks ago and told him that I would like to apply for the Codon Alignment proposal. 
As a simple trial, I've already got a simple implementation of a codon alignment script, which can be found at https://github.com/zruan/CodonAlignment. It doesn't account for frame shifts and mismatches between protein sequences and nucleotide sequences. But at least it works if the protein sequences can be exactly translated from the nucleotide sequences. Hopefully I will finish my proposal this week. Are there any suggestions? Sincerely, Ruan From rz1991 at foxmail.com Sat Apr 27 04:18:56 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Sat, 27 Apr 2013 16:18:56 +0800 Subject: [Biopython-dev] Fw: GSoC Application Message-ID: Sorry, I forgot to introduce myself. My name is Zheng Ruan, currently a first-year graduate student at the University of Georgia. I have a background in biology and am now pursuing a PhD in Bioinformatics. I taught myself many programming languages and finally Python became my favorite. I use Biopython a lot for everyday sequence manipulation and format conversion. For example, I use the Bio.Phylo module for the tree rerooting and visualization functionality in the server http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php, which has just been accepted by NAR. I really hope to contribute to the open source project and GSoC seems to be a good starting point. I think the codon alignment proposal is suitable for me because it requires medium programming skills and I've already finished Computational Molecular Evolution by Ziheng Yang and also have a phylogenetics course this spring semester. Thanks to all. Best, Zheng Ruan ------------------ Original ------------------ From: "Zheng Ruan"; Date: Apr 27, 2013 To: "biopython-dev"; Subject: [Biopython-dev] GSoC Application Hi Biopython developers, I hope it's not too late to inform you my interest in Biopython GSoC, but I talked to Eric Talevich several weeks before that I would like to apply the Codon Alignment proposal. 
As a simple trial, I've already got a simple implementation of codon alignment script as can be found in https://github.com/zruan/CodonAlignment. It doesn't account for frame shift and mismatches between protein sequences and nucleotide sequences. But as least it works if the protein sequences can be exactly translated by nucleotide sequences. Hopefully I will finish my proposal this week. Is there any suggestions? Sincerely, Ruan From p.j.a.cock at googlemail.com Sat Apr 27 06:55:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 11:55:41 +0100 Subject: [Biopython-dev] Self Introduction In-Reply-To: References: Message-ID: On Sat, Apr 27, 2013 at 4:16 AM, Yanbo Ye wrote: > Hi guys, > > My name is Yanbo Ye. I'm a third-year master student in > bioinformatics(mainly phylogenetics) in China. My master project is about > the genome evolution of baculovirus and a bioinformatics tool development. Hi Yanbo, and welcome. > So based on my experice, I'd like to contribute to the Phylo package of > Biopython for some unimplemented methods during the Google Summer of Code > this year, which will be under the direction of Eric Talevich. To apply for GSoC you need to write a project proposal (including a workplan for the summer), which will then be judged by the NESCent GSoC mentors in competition with the other applicants. We encourage students to share their draft proposals for feedback - here on the biopython-dev list and/or the NESCent Phyloinformatics Summer of Code community on Google Plus. Please also introduce yourself there: https://plus.google.com/communities/105828320619238393015 > I have been using python and Biopython for nearly a year to do some > sequence file parsing and tree manipulation. But I must say I'm still new > to python about the design convention or any popular python packages, etc. > Anyway, I'm a fast learner and would like to improve my python programming > skills through this project. 
I'd like to discuss with you during this > project and hope you could help me. Are you interested in one of the ideas we put on the wiki? http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 http://biopython.org/wiki/Google_Summer_of_Code These are suggestions which can be modified, or provided there is a suitable mentor you can suggest something different. Do you have something specific you'd like to work on? Regards, Peter From p.j.a.cock at googlemail.com Sat Apr 27 07:20:57 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 12:20:57 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent Message-ID: On Sat, Apr 27, 2013 at 1:52 AM, Zhigang Wu wrote: > Hi Peter, > > I am interested in implementing the lazy-loading sequence parsers. > I know the time is pretty tight for me to write an proposal on it. But even > I cannot contribute under the umbrella of GSoC and assuming no body is > implemented, I am still interested in implementing this (I just wanna have > something nice on my CV and while contributing to Open source software > community as well). While at this moment, I don't have very clear picture on > how to do it. Can you point me to somewhere where I can start to get a sense > how this can be implemented. As far as I know, samtools (view) may have > similar techniques in them. Thanks. > > > Zhigang Hi Zhigang, It isn't too late to write up a proposal for GSoC 2013, but please also introduce yourself on the NESCent Phyloinformatics Summer of Code community on Google Plus: https://plus.google.com/communities/105828320619238393015 The GSoC program is a great chance to spend a few months focussed just on one programming project - which can be really fun. However, the fact that you're interested in making contributions outside of GSoC is great. 
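The lazy-parsing idea discussed in this thread (keep each record's raw text and decode a field only when it is first used) can be sketched roughly as follows. This is an illustration under stated assumptions, not the experimental SAM/BAM or FASTQ code mentioned in the thread: the class name `LazyFastqRecord`, its attributes and the `decode_count` counter are all hypothetical, and the only format fact relied on is that Sanger-style FASTQ qualities are ASCII characters offset by 33.

```python
class LazyFastqRecord:
    """Hypothetical record that defers decoding of its quality string."""

    def __init__(self, name, sequence, quality_text, offset=33):
        self.name = name
        self.sequence = sequence
        self._quality_text = quality_text  # raw ASCII-encoded qualities
        self._offset = offset              # 33 for Sanger-style FASTQ
        self._qualities = None             # filled in on first access
        self.decode_count = 0              # illustration only: counts decodes

    @property
    def qualities(self):
        # Decode once, on demand, then reuse the cached list.
        if self._qualities is None:
            self.decode_count += 1
            self._qualities = [ord(c) - self._offset
                               for c in self._quality_text]
        return self._qualities


record = LazyFastqRecord("read1", "ACGT", "II5!")
print(record.decode_count)  # 0: nothing has been decoded yet
print(record.qualities)     # [40, 40, 20, 0]
print(record.decode_count)  # 1: decoded exactly once
```

Whether this pays off depends on how often the deferred field is actually used; if every record's qualities are read anyway, the bookkeeping is pure overhead.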
I wrote some more about the lazy-loading sequence parsers and indexing idea on the biopython-dev mailing list last month: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html However, lazy-parsing can also be done separately from the indexing. This is something I was trying in my experimental SAM/BAM parser mentioned on this thread: http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html The basic idea here was that the raw data for each record was loaded into memory as a (bytes) string, but not all of it was parsed into the individual fields right away. For example, the tags get turned into a dictionary only if the user tried to use the tag values. Similarly for many of the BAM fields, the binary string was only decoded if needed. I once tried something similar with the FASTQ parser. I wrote a subclass to preserve the normal SeqRecord interface, but only decode the ASCII encoded quality scores into a list of integers if needed. This worked but that attempt did not seem to make things any faster. An example where I think there would be clear benefits to a lazy parsing approach is EMBL/GenBank files where parsing the features could be delayed (both the complex feature location, and their dictionary of annotations). However, for this to be a successful GSoC project, you would need to have a good understanding of Python and how our existing parsers work to have a realistic chance of completing it. It should be quite a technically exciting project, with the hope of being able to show big speedups via benchmarks. Does that help? Is there a particular file format you'd be interested in - perhaps something you are already using in your projects or work? Regards, Peter From p.j.a.cock at googlemail.com Sat Apr 27 07:24:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 12:24:13 +0100 Subject: [Biopython-dev] Fw: GSoC Application In-Reply-To: References: Message-ID: Hi Zheng, No it isn't too late to apply. 
It's great that you've already been talking to Eric, but please also make sure you sign up to the NESCent Phyloinformatics Summer of Code community on Google Plus and also introduce yourself there: https://plus.google.com/communities/105828320619238393015 The sooner you share a draft proposal with the community (ideally here and on the NESCent group), the sooner you can get feedback to make it into a strong proposal. Thanks for your interest, Peter On Sat, Apr 27, 2013 at 9:18 AM, Zheng Ruan wrote: > Sorry I forget to introduce myself. > My name is Zheng Ruan, currently a first year graduate students at the > University of Georiga. I have a background in biology and now pursue a PhD > in Bioinformatics. I taught myself many programming language and finally > python became my favorite. I use biopython a lot for everyday sequence > manipulation and format conversion. For example, I use Bio.Phylo class for > reroot tree and visualization functionality in the sever > http://bioinformatics.publichealth.uga.edu/SpeciesTreeAnalysis/index.php, > which has just been accepted by NAR. I really hope to contribute the the > Open Source Project and GSoC seems to be a good start point. I think the > codon alignment proposal suitable to me because it requires medium > programming skills and I've already finished the Computational Molecular > Evolution by Ziheng Yang and also have a phylogenetics course this spring > semester. Thanks for all. > > > Best, > Zheng Ruan > > > ------------------ Original ------------------ > From: "Zheng Ruan"; > Date: Apr 27, 2013 > To: "biopython-dev"; > > Subject: [Biopython-dev] GSoC Application > > > > Hi Biopython developers, > > I hope it's not too late to inform you my interest in Biopython GSoC, but > I talked to Eric Talevich several weeks before that I would like to apply > the Codon Alignment proposal. As a simple trial, I've already got a simple > implementation of codon alignment script as can be found in > https://github.com/zruan/CodonAlignment. 
It doesn't account for frame shift > and mismatches between protein sequences and nucleotide sequences. But as > least it works if the protein sequences can be exactly translated by > nucleotide sequences. Hopefully I will finish my proposal this week. Is > there any suggestions? > > > Sincerely, > Ruan > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From yeyanbo289 at gmail.com Sat Apr 27 08:33:18 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Sat, 27 Apr 2013 20:33:18 +0800 Subject: [Biopython-dev] Self Introduction In-Reply-To: References:

Message-ID: Hi, Peter. Thanks. I'm interested in the project of "Phylogenetics in Biopython: Filling in the gaps" and have already submitted a draft proposal. I previously wanted to apply the project of "Discovering links to ToLWeb content from a tree in the Open Tree of Life's software system" when the biopython projects were not added to the list. I finally choose the first one because it is more suitable for me, based on my experience. Here is my proposal draft on github. I'm still revising it based on Eric Talevich's advice. Hope you can also give me some help. https://github.com/lijax/gsoc/blob/master/proposal_biopython.md Best regards, Yanbo On Sat, Apr 27, 2013 at 6:55 PM, Peter Cock wrote: > On Sat, Apr 27, 2013 at 4:16 AM, Yanbo Ye wrote: > > Hi guys, > > > > My name is Yanbo Ye. I'm a third-year master student in > > bioinformatics(mainly phylogenetics) in China. My master project is about > > the genome evolution of baculovirus and a bioinformatics tool > development. > > Hi Yanbo, and welcome. > > > So based on my experice, I'd like to contribute to the Phylo package of > > Biopython for some unimplemented methods during the Google Summer of Code > > this year, which will be under the direction of Eric Talevich. > > To apply for GSoC you need to write a project proposal > (including a workplan for the summer), which will then > be judged by the NESCent GSoC mentors in competition > with the other applicants. We encourage students to > share their draft proposals for feedback - here on the > biopython-dev list and/or the NESCent Phyloinformatics > Summer of Code community on Google Plus. Please > also introduce yourself there: > https://plus.google.com/communities/105828320619238393015 > > > I have been using python and Biopython for nearly a year to do some > > sequence file parsing and tree manipulation. But I must say I'm still new > > to python about the design convention or any popular python packages, > etc. 
> > Anyway, I'm a fast learner and would like to improve my python > programming > > skills through this project. I'd like to discuss with you during this > > project and hope you could help me. > > Are you interested in one of the ideas we put on the wiki? > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 > http://biopython.org/wiki/Google_Summer_of_Code > > These are suggestions which can be modified, or provided > there is a suitable mentor you can suggest something different. > > Do you have something specific you'd like to work on? > > Regards, > > Peter > -- Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From zhigangwu.bgi at gmail.com Sat Apr 27 15:22:07 2013 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Sat, 27 Apr 2013 12:22:07 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Peter, Thanks for the detailed explanation. It's very helpful. I am not quite sure about the goal of the lazy-loading parser. Let me try to summarize the goals of lazy-loading and how it would work. Please correct me if necessary. Below I use a fasta/fastq file as an example. The idea should generally apply to other formats such as GenBank/EMBL, as you mentioned. Lazy-loading is useful under the assumption that, given a large file, we are interested in only part of the information in it, not all of it. For example, if a fasta file contains the Arabidopsis genome, we may only be interested in the sequence of chr5 from index positions 2000-3000. Rather than parsing the whole file and storing each record in memory as most parsers do, during the indexing step a lazy-loading parser will only store a small amount of position information, such as access positions (readily usable for seek) for all chromosomes (chr1, chr2, chr3, chr4, chr5, ...) and maybe position index information such as the access positions for every 1000 bp within each sequence in the given file. After indexing, we store this information in a dictionary like the following: {'chr1':{0:access_pos, 1000:access_pos, 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, 2000:access_pos,}, 'chr3'...}. Compared to the usual parser, which tends to parse the whole file, we gain three benefits: speed, lower memory usage and random access. Speed is gained because we skip a lot during the parsing step. Going back to my example, once we have the dictionary, we can just seek to the access position of chr5:2000 and start reading and parsing from there. Memory usage is lower because we only store the access positions for each record as a dictionary in memory. Best, Zhigang On Sat, Apr 27, 2013 at 4:20 AM, Peter Cock wrote: > On Sat, Apr 27, 2013 at 1:52 AM, Zhigang Wu wrote: >> Hi Peter, >> >> I am interested in implementing the lazy-loading sequence parsers. >> I know the time is pretty tight for me to write an proposal on it. But even >> I cannot contribute under the umbrella of GSoC and assuming no body is >> implemented, I am still interested in implementing this (I just wanna have >> something nice on my CV and while contributing to Open source software >> community as well). While at this moment, I don't have very clear picture on >> how to do it. Can you point me to somewhere where I can start to get a sense >> how this can be implemented. As far as I know, samtools (view) may have >> similar techniques in them. Thanks. >> >> >> Zhigang > > Hi Zhigang, > > It isn't too late to write up a proposal for GSoC 2013, but please > also introduce yourself on the NESCent Phyloinformatics > Summer of Code community on Google Plus: > https://plus.google.com/communities/105828320619238393015 > > The GSoC program is a great chance to spend a few months > focussed just on one programming project - which can be > really fun. 
However, the fact that you're interested in making > contributions outside of GSoC is great. > > I wrote some more about the lazy-loading sequence parsers > and indexing idea on the biopython-dev mailing list last month: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > However, lazy-parsing can also be done separately from the > indexing. This is something I was trying in my experimental > SAM/BAM parser mentioned on this thread: > http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010492.html > > The basic idea here was that the raw data for each record was > loaded into memory as a (bytes) string, but not all of it was > parsed into the individual fields right away. For example, the > tags get turned into a dictionary only if the user tried to use > the tag values. Similarly for many of the BAM fields, the binary > string was only decoded if needed. > > I once tried something similar with the FASTQ parser. I wrote > a subclass to preserve the normal SeqRecord interface, but > only decode the ASCII encoded quality scores into a list of > integers if needed. This worked but that attempt did not seem > to make things any faster. > > An example where I think there would be clear benefits to a > lazy parsing approach is EMBL/GenBank files where parsing > the features could be delayed (both the complex feature > location, and their dictionary of annotations). > > However, for this to be a successful GSoC project, you > would need to have a good understanding of Python and > how our existing parsers work to have a realistic chance > of completing it. I should be quite a technically exciting > project, with the hope of being able to show big speedups > via benchmarks. > > Does that help? Is there a particular file format you'd be > interested in - perhaps something you are already using > in your projects or work? 
>
> Regards,
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From redmine at redmine.open-bio.org Sat Apr 27 15:46:51 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 27 Apr 2013 19:46:51 +0000
Subject: [Biopython-dev] [Biopython - Bug #3430] (New) Error parsing PubMedCentral XML files
Message-ID:

Issue #3430 has been reported by Paulo Nuin.

----------------------------------------
Bug #3430: Error parsing PubMedCentral XML files
https://redmine.open-bio.org/issues/3430

Author: Paulo Nuin
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

It seems that there is an error parsing locally downloaded PubMedCentral xml (extension nxml) files. Using the code

@
from Bio import Entrez
handle = open('nihms83342.nxml')
records = Entrez.parse(handle)
for record in records:
    print record
@

the following error occurs (copied from iPython), even though the XML header contains the declaration

---------------------------------------------------------------------------
NotXMLError                               Traceback (most recent call last)
 in ()
      2 handle = open('nihms83342.nxml')
      3 records = Entrez.parse(handle)
----> 4 for record in records:
      5     print record

/Library/Python/2.7/site-packages/Bio/Entrez/Parser.pyc in parse(self, handle)
    229 # We did not see the initial
    231 raise NotXMLError("XML declaration not found")
    232 self.parser.Parse("", True)
    233 self.parser = None

NotXMLError: Failed to parse the XML data (XML declaration not found). Please make sure that the input data are in XML format.

The XML file in question is attached.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Sat Apr 27 15:57:38 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 20:57:38 +0100 Subject: [Biopython-dev] Fwd: Questions about Codon Alignment Proposal In-Reply-To: References: Message-ID: Oops - biopython-dev != python-dev ;) (I'll reply shortly) Peter ---------- Forwarded message ---------- From: ?? Date: Sat, Apr 27, 2013 at 6:23 PM Subject: Questions about Codon Alignment Proposal To: Eric Talevich , "p.j.a.cock" Cc: python-dev Hi Eric and Peter, I'm preparing the proposal for the codon alignment project. Two things I may want to hear your advice. 1) In the biopython wiki page, you mentioned "model selection" in the Approach & Goals. I'm not sure if there are any advantages to use codon alignment for model selection. Could you give me some references? Another thing is that model selection involves estimation of tree topology as well as branch lengthes and parameters across many substitution models. Will it be too computationally intensive for a python implementation? 2) You also mentioned the "validation (testing for frame shift)". Is there a test for frame shift? Or I can simply detect it by comparing amino acid sequences and nucleotide sequences. Best, Zheng Ruan From p.j.a.cock at googlemail.com Sat Apr 27 16:11:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 21:11:23 +0100 Subject: [Biopython-dev] Questions about Codon Alignment Proposal In-Reply-To: References:

Message-ID: On Sat, Apr 27, 2013 at 6:23 PM, ?? wrote: > Hi Eric and Peter, > > I'm preparing the proposal for the codon alignment project. Two things I may > want to hear your advice. > > 1) In the biopython wiki page, you mentioned "model selection" in the > Approach & Goals. I'm not sure if there are any advantages to use codon > alignment for model selection. Could you give me some references? Another > thing is that model selection involves estimation of tree topology as well > as branch lengthes and parameters across many substitution models. Will it > be too computationally intensive for a python implementation? I'm not sure what Eric had in mind, but one option is to wrap an existing specialised tool specifically for model selection. One example I can think of is the graphical tool Topali2 (written by some of my current work colleagues) which I believe initially called Modelgenerator to do this, but now calls PhyML: http://www.topali.org http://bioinf.nuim.ie/modelgenerator/ > 2) You also mentioned the "validation (testing for frame shift)". Is there a > test for frame shift? Or I can simply detect it by comparing amino acid > sequences and nucleotide sequences. > > Best, > Zheng Ruan Again, you'd have to ask Eric exactly what he had in mind. Both of these questions would probably be ideal to ask on the NESCent Google Group - there should be some phylogenetic experts able to give a much more detailed answer than I can :) https://plus.google.com/communities/105828320619238393015 Peter From p.j.a.cock at googlemail.com Sat Apr 27 16:40:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 27 Apr 2013 21:40:52 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References:

Message-ID: On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu wrote: > Peter, > > Thanks for the detailed explanation. It's very helpful. I am not quite > sure about the goal of the lazy-loading parser. > Let me try to summarize what are the goals of lazy-loading and how > lazy-loading would work. Please correct me if necessary. Below I use > fasta/fastq file as an example. The idea should generally applies to > other format such as GenBank/EMBL as you mentioned. > > Lazy-loading is useful under the assumption that given a large file, > we are interested in partial information of it but not all of them. > For example a fasta file contains Arabidopsis genome, we only > interested in the sequence of chr5 from index position from 2000-3000. > Rather than parsing the whole file and storing each record in memory > as most parsers will do, during the indexing step, lazy loading > parser will only store a few position information, such as access > positions (readily usable for seek) for all chromosomes (chr1, chr2, > chr3, chr4, chr5, ...) and may be position index information such as > the access positions for every 1000bp positions for each sequence in > the given file. After indexing, we store these information in a > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, > 2000:access_pos,}, 'chr3'...}. > > Compared to the usual parser which tends to parsing the whole file, we > gain two benefits: speed, less memory usage and random access. Speed > is gained because we skipped a lot during the parsing step. Go back to > my example, once we have the dictionary, we can just seek to the > access position of chr5:2000 and start reading and parsing from there. > Less memory usage is due to we only stores access positions for each > record as a dictionary in memory. > > > Best, > > Zhigang Hi Zhigang, Yes - that's the basic idea of a disk based lazy loader. 
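A minimal sketch of the offset index summarised above, assuming an uncompressed FASTA file. The names build_offset_index and fetch_record are hypothetical and are not part of Bio.SeqIO, and the sketch keeps only one byte offset per record rather than the per-1000bp checkpoints suggested in the quoted text:

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The record ID is the first word after '>'
                    index[fields[0].decode("ascii")] = offset
            offset += len(line)
    return index


def fetch_record(path, index, name):
    """Seek straight to one record and return its sequence as a string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip())
    return b"".join(parts).decode("ascii")


def demo():
    """Write a tiny two-record FASTA file and fetch the second record."""
    with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
        tmp.write(b">chr1 demo\nACGT\nACGT\n>chr5 demo\nTTGA\nCC\n")
        path = tmp.name
    try:
        index = build_offset_index(path)
        return index, fetch_record(path, index, "chr5")
    finally:
        os.remove(path)
```

Only the dictionary of offsets stays in memory; the sequence data is read from disk when a record is requested. Bio.SeqIO.index() already gives record-level random access along these lines, so the sketch is only meant to illustrate the bookkeeping.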
Here the data stays on the disk until needed, so generally this is very low memory but can be slow as it needs to read from the disk. An existing example already in Biopython is our BioSQL bindings, which present a SeqRecord subclass that only retrieves values from the database on demand. Note in the case of FASTA, we might want to use the existing FAI index files from Heng Li's faidx tool (or another existing index scheme). That relies on each record using a consistent line wrapping length, so that seek offsets can be easily calculated. An alternative idea is to load the data into memory (so that the file is not touched again, useful for stream processing where you cannot seek within the input data) but it is only parsed into Python objects on demand. This would use a lot more memory, but should be faster as there is no disk seeking and reading (other than the one initial read). For FASTA this wouldn't help much but it might work for EMBL/GenBank. Something to beware of with any lazy loading / lazy parsing is what happens if the user tries to edit the record. Do you want to allow this (it makes the code more complex) or not (simpler and still very useful)? In terms of usage examples, for things like raw NGS data this is (currently) made up of lots and lots of short sequences (under 1000bp). Lazy loading here is unlikely to be very helpful - unless perhaps you can make the FASTQ parser faster this way? (Once the reads are assembled or mapped to a reference, random access to look up reads by their mapped location is very very important, thus the BAI indexing of BAM files). In terms of this project, I was thinking about a SeqRecord style interface extending Bio.SeqIO (but you can suggest something different for your project). What I saw as the main use case here is large datasets like whole chromosomes in FASTA format or richly annotated formats like EMBL, GenBank or GFF3. 
Right now if I am doing something with (for example) the annotated human chromosomes, loading these as GenBank files is quite slow (it takes a fair amount of memory too, but that isn't my main worry). A lazy loading approach should let me 'load' the GenBank files almost instantly, and delay reading specific features or sequence from the disk until needed. For example, I might have a list of genes for which I wish to extract the annotation or sequence - and there is no need to load all the other features or the rest of the genome. (Note we can already do this by loading GenBank files into a BioSQL database, and accessing them that way) Regards, Peter From eric.talevich at gmail.com Sat Apr 27 18:25:33 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 27 Apr 2013 18:25:33 -0400 Subject: [Biopython-dev] Questions about Codon Alignment Proposal In-Reply-To: References:

Message-ID: On Sat, Apr 27, 2013 at 4:11 PM, Peter Cock wrote: > On Sat, Apr 27, 2013 at 6:23 PM, ?? wrote: > > Hi Eric and Peter, > > > > I'm preparing the proposal for the codon alignment project. Two things I > may > > want to hear your advice. > > > > 1) In the biopython wiki page, you mentioned "model selection" in the > > Approach & Goals. I'm not sure if there are any advantages to use codon > > alignment for model selection. Could you give me some references? Another > > thing is that model selection involves estimation of tree topology as > well > > as branch lengthes and parameters across many substitution models. Will > it > > be too computationally intensive for a python implementation? > > I'm not sure what Eric had in mind, but one option is to wrap an > existing specialised tool specifically for model selection. One > example I can think of is the graphical tool Topali2 (written by > some of my current work colleagues) which I believe initially > called Modelgenerator to do this, but now calls PhyML: > > http://www.topali.org > http://bioinf.nuim.ie/modelgenerator/ > Actually, I added "model selection" to the project description because Peter mentioned it earlier. :) I didn't have any particular function in mind, but calling out to external tools sounds like a reasonable approach. We do have some primitive functionality for likelihood ratio testing through the PAML module, so maybe that could come into play here. > 2) You also mentioned the "validation (testing for frame shift)". Is > there a > > test for frame shift? Or I can simply detect it by comparing amino acid > > sequences and nucleotide sequences. > > > > Best, > > Zheng Ruan > > Again, you'd have to ask Eric exactly what he had in mind. > Yes, it's just a matter of testing for unexpected cases in which the protein and nucleotide sequences don't quite match, and handling them appropriately. 
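A minimal sketch of such a consistency check, assuming the standard genetic code, ungapped sequences, and no trailing stop codon. The function name find_mismatches is hypothetical and not an existing Biopython API; a real implementation would also need to decide how to treat ambiguous bases and alternative codon tables:

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_mismatches(protein, nucleotide):
    """Return a list of human-readable problems; empty if the pair agrees."""
    problems = []
    nucleotide = nucleotide.upper()
    remainder = len(nucleotide) % 3
    if remainder:
        problems.append(
            "nucleotide length %d is not a multiple of 3" % len(nucleotide)
        )
    # Split into complete codons, ignoring any trailing partial codon
    usable = len(nucleotide) - remainder
    codons = [nucleotide[i:i + 3] for i in range(0, usable, 3)]
    if len(codons) != len(protein):
        problems.append(
            "%d complete codons for %d residues" % (len(codons), len(protein))
        )
    # Compare residue by residue; unknown codons translate to 'X'
    for pos, (aa, codon) in enumerate(zip(protein, codons)):
        translated = CODON_TABLE.get(codon, "X")
        if translated != aa:
            problems.append(
                "residue %d: %s translates to %s, expected %s"
                % (pos + 1, codon, translated, aa)
            )
    return problems
```

A single-base deletion shows up here as a length problem, and usually also as a run of residue mismatches downstream of the deletion, which is one simple signal of a possible frame shift.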
It would be nice to see a description of these possible edge cases and your treatment of them in the detailed weekly schedule for the project. > Both these questions would probably be idea to ask on the > NESCent Google Group - there should be some phylogenetic > experts able to give a much more detailed answer than I can :) > https://plus.google.com/communities/105828320619238393015 > > From eric.talevich at gmail.com Sat Apr 27 18:36:01 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 27 Apr 2013 18:36:01 -0400 Subject: [Biopython-dev] Self Introduction In-Reply-To: References:

Message-ID: On Sat, Apr 27, 2013 at 8:33 AM, Yanbo Ye wrote: > Hi, Peter. Thanks. > > I'm interested in the project of "Phylogenetics in Biopython: Filling in > the gaps" and have already submitted a draft proposal. I > previously wanted to apply the project of "Discovering links to ToLWeb > content from a tree in the Open Tree of Life's software system" when the > biopython projects were not added to the list. I finally choose the first > one because it is more suitable for me, based on my experience. > > Here is my proposal draft on github. I'm still revising it based on Eric > Talevich's advice. Hope you can also give me some help. > https://github.com/lijax/gsoc/blob/master/proposal_biopython.md > > Best regards, > > Yanbo > > Hi Yanbo, thanks for introducing yourself to the Biopython community. I saw you started your GSoC application in Melange, too. For reference, here is the original discussion we had on the PhyloSoC G+ page: https://plus.google.com/u/0/105784663295110007862/posts/Mw1aG12N67b From saketkc at gmail.com Sun Apr 28 09:06:28 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 28 Apr 2013 18:36:28 +0530 Subject: [Biopython-dev] Questions about Codon Alignment Proposal In-Reply-To: References:

Message-ID: Hi Peter and Eric, So do we also implement the alignment algorithm as well ? or do we use clustalw or any other alignment software ? Thanks Saket On 28 April 2013 03:55, Eric Talevich wrote: > On Sat, Apr 27, 2013 at 4:11 PM, Peter Cock wrote: > >> On Sat, Apr 27, 2013 at 6:23 PM, ?? wrote: >> > Hi Eric and Peter, >> > >> > I'm preparing the proposal for the codon alignment project. Two things I >> may >> > want to hear your advice. >> > >> > 1) In the biopython wiki page, you mentioned "model selection" in the >> > Approach & Goals. I'm not sure if there are any advantages to use codon >> > alignment for model selection. Could you give me some references? Another >> > thing is that model selection involves estimation of tree topology as >> well >> > as branch lengthes and parameters across many substitution models. Will >> it >> > be too computationally intensive for a python implementation? >> >> I'm not sure what Eric had in mind, but one option is to wrap an >> existing specialised tool specifically for model selection. One >> example I can think of is the graphical tool Topali2 (written by >> some of my current work colleagues) which I believe initially >> called Modelgenerator to do this, but now calls PhyML: >> >> http://www.topali.org >> http://bioinf.nuim.ie/modelgenerator/ >> > > Actually, I added "model selection" to the project description because > Peter mentioned it earlier. :) > > I didn't have any particular function in mind, but calling out to external > tools sounds like a reasonable approach. We do have some primitive > functionality for likelihood ratio testing through the PAML module, so > maybe that could come into play here. > >> 2) You also mentioned the "validation (testing for frame shift)". Is >> there a >> > test for frame shift? Or I can simply detect it by comparing amino acid >> > sequences and nucleotide sequences. >> > >> > Best, >> > Zheng Ruan >> >> Again, you'd have to ask Eric exactly what he had in mind. 
>> > > Yes, it's just a matter of testing for unexpected cases in which the > protein and nucleotide sequences don't quite match, and handling them > appropriately. It would be nice to see a description of these possible edge > cases and your treatment of them in the detailed weekly schedule for the > project. > > > >> Both these questions would probably be idea to ask on the >> NESCent Google Group - there should be some phylogenetic >> experts able to give a much more detailed answer than I can :) >> https://plus.google.com/communities/105828320619238393015 >> >> > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From saketkc at gmail.com Sun Apr 28 10:13:43 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 28 Apr 2013 19:43:43 +0530 Subject: [Biopython-dev] [GSoC2013] Codon alignment and analysis in Biopython Message-ID: Hi All, I am a Fourth Year Undergraduate studying Chemical Engineering at IIT Bombay, India. As a warmup, I have been working upon NGS analysis this sem. Besides playing around with NGS pipelines, scripting one on my own, I have lately been plying around with the PSSM approach for BWA . https://github.com/pkerpedjiev/bwa-pssm/ . So far my contributions to this project have been only at testing level, I havent really contributed to the code as such, just reported some bugs. I have been using BioPython for my Computational Biology Course that I am doing this semester. I have also been using BIoPython for NGS analysis, modifying fastq formats and all. I recently contributed to the BWA wrapper for BioPython (https://github.com/biopython/biopython/pull/167). I am also working on the samtools wrapper. 
I was a Google Summer of Code 2012 student for Connexions (http://cnx.org) and contributed to the SlideImporter module (Pyramid/Python), which allows the user to upload slideshows to SlideShare and Google Drive and makes a CNXML module out of them (https://github.com/oerpub/oerpub.rhaptoslabs.slideimporter). I am still working for the same organisation and contribute regularly (https://github.com/oerpub/oerpub.rhaptoslabs.swordpushweb/, check out the branches starting with 'saket'). I have a deep interest in Biology and have studied Molecular and Cell Biology and the computational aspects of it. BioPython is like the best amalgamation for me: Bio + Python. I would like to participate in GSoC 2013 as a student. I would like to contribute to "Codon Alignment". Having worked on the bwa-pssm codebase for a while has made me acquainted with alignment algorithms and I think I can contribute to this. I have written the first draft of my proposal at https://docs.google.com/document/d/1xlRSJ74POxi51xn7Web08ACZt4r7KSFIrzRAawTL2ds/edit#heading=h.j7p6z4a3ssy3 Being a first draft it may have a lot of ambiguity; for example, I may be trying to implement a function that already exists in Biopython, or I may have misinterpreted the problem completely. You can post your comments on the doc itself and I will try to improve upon it. Thanks Saket https://github.com/saketkc From p.j.a.cock at googlemail.com Sun Apr 28 10:28:58 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 28 Apr 2013 15:28:58 +0100 Subject: [Biopython-dev] Questions about Codon Alignment Proposal In-Reply-To: References:

Message-ID: On Sun, Apr 28, 2013 at 2:06 PM, Saket Choudhary wrote: > Hi Peter and Eric, > > So do we also implement the alignment algorithm as well ? > or do we use clustalw or any other alignment software ? > > Thanks > Saket Hi Saket, Currently Biopython has no built-in MSA algorithms, but has instead called out to tools like ClustalW or MUSCLE. BioPerl does the same. BioJava however does try to include this sort of thing. Peter From rz1991 at foxmail.com Sun Apr 28 14:59:33 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Mon, 29 Apr 2013 02:59:33 +0800 Subject: [Biopython-dev] Questions about Codon Alignment Proposal Message-ID: Thanks Peter and Eric, I just finished my draft proposal for Codon Alignment Project. It can be found at https://github.com/zruan/CodonAlignment/blob/master/Proposal/proposal.md I didn't include model selection into my time line as it seems a little beyond the usage scope of codon alignment. I'm looking forward to hearing your comments and suggestions. Thanks again. Best, Zheng Ruan ------------------ Original ------------------ From: "Eric Talevich"; Date: Apr 28, 2013 To: "????"; "Peter Cock"; Cc: "Biopython-Dev Mailing List"; Subject: Re: Questions about Codon Alignment Proposal On Sat, Apr 27, 2013 at 4:11 PM, Peter Cock wrote: On Sat, Apr 27, 2013 at 6:23 PM, ???? wrote: > Hi Eric and Peter, > > I'm preparing the proposal for the codon alignment project. Two things I may > want to hear your advice. > > 1) In the biopython wiki page, you mentioned "model selection" in the > Approach & Goals. I'm not sure if there are any advantages to use codon > alignment for model selection. Could you give me some references? Another > thing is that model selection involves estimation of tree topology as well > as branch lengthes and parameters across many substitution models. Will it > be too computationally intensive for a python implementation? 
I'm not sure what Eric had in mind, but one option is to wrap an existing specialised tool specifically for model selection. One example I can think of is the graphical tool Topali2 (written by some of my current work colleagues) which I believe initially called Modelgenerator to do this, but now calls PhyML: http://www.topali.org http://bioinf.nuim.ie/modelgenerator/ Actually, I added "model selection" to the project description because Peter mentioned it earlier. :) I didn't have any particular function in mind, but calling out to external tools sounds like a reasonable approach. We do have some primitive functionality for likelihood ratio testing through the PAML module, so maybe that could come into play here. > 2) You also mentioned the "validation (testing for frame shift)". Is there a > test for frame shift? Or I can simply detect it by comparing amino acid > sequences and nucleotide sequences. > > Best, > Zheng Ruan Again, you'd have to ask Eric exactly what he had in mind. Yes, it's just a matter of testing for unexpected cases in which the protein and nucleotide sequences don't quite match, and handling them appropriately. It would be nice to see a description of these possible edge cases and your treatment of them in the detailed weekly schedule for the project. Both these questions would probably be idea to ask on the NESCent Google Group - there should be some phylogenetic experts able to give a much more detailed answer than I can :) https://plus.google.com/communities/105828320619238393015 From redmine at redmine.open-bio.org Mon Apr 29 05:54:24 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 29 Apr 2013 09:54:24 +0000 Subject: [Biopython-dev] [Biopython - Bug #3430] Error parsing PubMedCentral XML files References: Message-ID: Issue #3430 has been updated by Michiel de Hoon. > NotXMLError: Failed to parse the XML data (XML declaration not found). Please make sure that the input data are in XML format. 
The error message is correct: The XML file does not start with the XML declaration
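One way to confirm this for a saved file is to look at its first bytes before handing it to Entrez.parse. A minimal sketch, assuming an ASCII or UTF-8 encoded file; the helper name is hypothetical, and this is not a description of how Bio.Entrez performs its own check:

```python
def starts_with_xml_declaration(data):
    """Return True if the given bytes begin with an XML declaration."""
    # Tolerate a UTF-8 byte order mark in front of the declaration
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # '<?xml' must be followed by whitespace; this rules out other
    # processing instructions such as '<?xml-stylesheet ...?>'
    return data.startswith(b"<?xml") and data[5:6].isspace()


# Example usage with the file name from the bug report:
# with open("nihms83342.nxml", "rb") as handle:
#     print(starts_with_xml_declaration(handle.read(64)))
```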

Either the XML file returned by Entrez is broken, or something went wrong when saving the file. ---------------------------------------- Bug #3430: Error parsing PubMedCentral XML files https://redmine.open-bio.org/issues/3430 Author: Paulo Nuin Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: It seems that there is an error parsing locally downloaded PubMedCentral xml (extension nxml) files. Using the code @ from Bio import Entrez handle = open('nihms83342.nxml') records = Entrez.parse(handle) for record in records: print record @ the following error occurs (copied from iPython), even though the XML header contains the declaration --------------------------------------------------------------------------- NotXMLError Traceback (most recent call last)