From biopython at maubp.freeserve.co.uk Mon Dec 1 05:22:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Dec 2008 10:22:27 +0000
Subject: [BioPython] will BioSQL work with psycopg2?
In-Reply-To:
References:
Message-ID: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com>

On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote:
> Hi there,
>
> I am maintaining the Fedora package for Biopython and we are doing a
> complete rebuild of all Python packages for Python 2.6.

Excellent. Biopython 1.49 onwards should be fine on Python 2.6, but
please do let us know if you find anything amiss or any deprecations
we've missed.

Slightly off topic, but if you want any clarification on things like
the deprecation of Martel (with its dependency on mxTextTools), or the
switch from Numeric to NumPy, please ask.

> Currently I have a dependency on psycopg (version 1.1.21) but since
> that is so old psycopg won't rebuild against the new mx, meaning that
> I can't rebuild Biopython because the dependencies aren't there.
>
> So my question is, will the Biopython BioSQL work with the newer
> psycopg2 (currently version 2.0.8)? See:
> http://www.initd.org/pub/software/psycopg/

Yes, psycopg2 should work with Biopython 1.49 onwards (including
Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616:
http://bugzilla.open-bio.org/show_bug.cgi?id=2616

> Does it require the 1.x API or will it work with 2.x? The BioSQL page:
> http://biopython.org/wiki/BioSQL
> isn't clear on this.

I'm not sure, having not used psycopg or psycopg2 myself. Hopefully
Cymon can clarify this (CC'd).

Peter

From lueck at ipk-gatersleben.de Mon Dec 1 08:13:21 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 1 Dec 2008 14:13:21 +0100
Subject: [BioPython] Emboss eprimer3-Product Size Range
Message-ID: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de>

Hi!

I'm working with Emboss eprimer3 and I have a short question:

How can I enter the parameter "product size range"? Usually it's
something like 500-1000 450-500 etc.

If I add this in Python, nothing happens:

from Bio import Fasta
from Bio.Emboss.Applications import Primer3Commandline
from Bio.Application import generic_run
from Bio.Emboss import Primer3

primer_cl = Primer3Commandline()
primer_cl.set_parameter("-sequence", "in.txt")
primer_cl.set_parameter("-outfile", "out.pr3")
primer_cl.set_parameter("-productsizerange", "100-200")
primer_cl.set_parameter("-target", "%s,%s" % (50, 100))
result, messages, errors = generic_run(primer_cl)

Whereas

primer_cl.set_parameter("-productsizerange", "100")

gives a primer output.

Thanks in advance!
Stefanie

From biopython at maubp.freeserve.co.uk Mon Dec 1 10:08:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Dec 2008 15:08:51 +0000
Subject: [BioPython] Emboss eprimer3-Product Size Range
In-Reply-To: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de>
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com>

On Mon, Dec 1, 2008 at 1:13 PM, Stefanie Lück wrote:
> Hi!
>
> I'm working with Emboss eprimer3 and I have a short question:
>
> How can I enter the parameter "product size range"? Usually it's
> something like 500-1000 450-500 etc.

According to the eprimer3 tool itself (try "eprimer3 --help" at the
command line) you seem to be using productsizerange correctly in the
following code snippet.

> from Bio import Fasta

You don't seem to be using Bio.Fasta, and this module is now obsolete.

> from Bio.Emboss.Applications import Primer3Commandline
> from Bio.Application import generic_run
> from Bio.Emboss import Primer3
>
> primer_cl = Primer3Commandline()
> primer_cl.set_parameter("-sequence", "in.txt")
> primer_cl.set_parameter("-outfile", "out.pr3")
> primer_cl.set_parameter("-productsizerange", "100-200")
> primer_cl.set_parameter("-target", "%s,%s" % (50, 100))
> result, messages, errors = generic_run(primer_cl)

So this fails - what exactly goes wrong, i.e. what does this give:

print "Command line:"
print primer_cl
print "Return code:"
print result.return_code
print "Errors:"
print errors.read()
print "Messages:"
print messages.read()

And my usual question in this sort of situation: what happens if you
run this command by hand at the command prompt?

If you email me your input file (in.txt) then I can try running this
on my machine (don't send it to the mailing list unless it's very small
and you don't mind sharing it with the world).

Peter

From lueck at ipk-gatersleben.de Mon Dec 1 11:31:22 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 1 Dec 2008 17:31:22 +0100
Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com>
Message-ID: <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de>

I'm sorry! I made a typo!

primer_cl.set_parameter("-productsizerange", "100-200 250-300")

causes no output, and not

primer_cl.set_parameter("-productsizerange", "100-200")

as I wrote!

> You don't seem to be using Bio.Fasta, and this module is now obsolete.

Sorry, I just forgot to delete it. I showed only a part of the code.

The error printout, both from the script and at the command prompt, gives:

Command line:
eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 -productsizerange 100-200 250-300
Return code:
1
Errors:
Error: Argument '250-300' : Too many parameters 3/2
Messages:

But it's described in the eprimer3 help file:

"-productsizerange range [100-300] ...If one desires PCR products in
either the range from 100 to 150 bases or in the range from 200 to 250
bases then one would set this parameter to 100-150 200-250. EPrimer3
favors ranges to the left side of the parameter string..."

My input file is a normal multiple fasta file.

Thanks and sorry for the mistake!

From biopython at maubp.freeserve.co.uk Mon Dec 1 12:07:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Dec 2008 17:07:51 +0000
Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range
In-Reply-To: <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de>
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com>

> primer_cl.set_parameter("-productsizerange", "100-200 250-300")
> Causes no output and not
> primer_cl.set_parameter("-productsizerange", "100-200")
> as I wrote!

OK - that helps :)

This will fail at the command line:

eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 -productsizerange 100-200 250-300

Based on my experience of unix command lines and how arguments are
parsed, this should work:

eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 -productsizerange "100-200 250-300"

If so, then in python you need to include the quotes yourself, e.g.

primer_cl.set_parameter('-productsizerange', '"100-200 250-300"')

That is single quotes to delimit the string in python, with double
quotes as part of the string itself.

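To illustrate why the quotes matter, here is a minimal sketch using the shlex module from the Python standard library, which splits a string roughly the way a POSIX shell splits a command line. It is only an illustration of the argument splitting, not what eprimer3 or Biopython do internally; the variable names are made up for this example.

```python
import shlex

# The command as it was generated without quotes around the ranges.
unquoted = ("eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 "
            "-productsizerange 100-200 250-300")

# The same command with the two ranges wrapped in double quotes.
quoted = ('eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 '
          '-productsizerange "100-200 250-300"')

args_unquoted = shlex.split(unquoted)
args_quoted = shlex.split(quoted)

# Without quotes the two ranges arrive as two separate arguments, so the
# tool sees an unexpected extra parameter after -productsizerange.
print(args_unquoted[-2:])

# With quotes they arrive as one argument containing a space.
print(args_quoted[-1:])
```

The first print shows two list items and the second shows a single item, which matches the "Too many parameters" error reported for the unquoted form.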
You could also use double quotes by escaping them with a backslash:

primer_cl.set_parameter("-productsizerange", "\"100-200 250-300\"")

To try and explain the python syntax here, try the following examples
at the python prompt:

>>> print "100-200 250-300"
100-200 250-300
>>> print '100-200 250-300'
100-200 250-300
>>> print '"100-200 250-300"'
"100-200 250-300"
>>> print "\"100-200 250-300\""
"100-200 250-300"

Peter

From cy at cymon.org Mon Dec 1 15:53:06 2008
From: cy at cymon.org (Cymon Cox)
Date: Mon, 1 Dec 2008 20:53:06 +0000
Subject: [BioPython] will BioSQL work with psycopg2?
In-Reply-To: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com>
References: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com>
Message-ID: <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com>

Hi Peter and Alex,

2008/12/1 Peter
> On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote:
> > Currently I have a dependency on psycopg (version 1.1.21) but since
> > that is so old psycopg won't rebuild against the new mx, meaning that
> > I can't rebuild Biopython because the dependencies aren't there.
> >
> > So my question is, will the Biopython BioSQL work with the newer
> > psycopg2 (currently version 2.0.8)? See:
> > http://www.initd.org/pub/software/psycopg/
>
> Yes, psycopg2 should work with Biopython 1.49 onwards (including
> Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616:

To confirm: I'm currently using psycopg2 vers. 2.0.8

> > Does it require the 1.x API or will it work with 2.x? The BioSQL page:
> > http://biopython.org/wiki/BioSQL
> > isn't clear on this.
>
> I'm not sure, having not used psycopg or psycopg2 myself. Hopefully
> Cymon can clarify this (CC'd).

Sorry, but I'm not sure what the question is here...

Cheers, C.

--
____________________________________________________________________
Cymon J. Cox

From bartek at rezolwenta.eu.org Mon Dec 1 15:53:59 2008
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 1 Dec 2008 21:53:59 +0100
Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code
In-Reply-To: <492ACE38.1090301@gmail.com>
References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> <492ACE38.1090301@gmail.com>
Message-ID: <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com>

Hi all,

I've done some work regarding the motif analysis in Biopython. I've
done the following stuff:

- refactored the Bio.AlignAce and Bio.MEME to use one common motif object
- Put all of the refactored code in the Bio.Motif directory
- Added more code (from my attic) to do motif comparisons and computing
  thresholds (this was actually written by my colleague Norbert Dojer,
  but I adapted it and I have his permission to contribute the code)
- written a short tutorial on the usage of Bio.Motif (that's where I'd
  put it).
- Written a basic test suite for the new motif.

I haven't added it to cvs yet, but posted it as an attachment to the
enhancement proposal in bugzilla:
http://bugzilla.open-bio.org/show_bug.cgi?id=2694

I have cvs access, so I can commit the changes myself, but I'd like to
wait for an "OK" from someone more involved in the release process.

Since Giovanni and Bruce have responded to my previous call for
comments, I'll try to answer them below:

On Mon, Nov 24, 2008 at 4:54 PM, Bruce Southey wrote:
>
> Actually I am not that thrilled with the licenses for these packages and
> similar packages because these are free only for academic use. To me this
> clashes with the spirit of an open-sourced project especially a BSD-licensed
> one. But if there is a need for such modules then these modules should be
> included.
>

I have similar feelings about the "academic-use-only" licenses. On the
other hand, since most of the biopython users are in academia, then I
don't see it as a big problem.
Also, since I don't have any truly open and free replacement for these programs, I think it's better to keep them. In fact the new Bio.Motif package provides some methods for motif comparisons, which at least to some extent can be used as a replacement for the respective functions of CompareACE and MAST. As a side note, I think that there is no point in providing parsers for every single motif finder that comes out, and I don't think that AlignAce and MEME are the best or the most representative ones. It just happened that these parsers were written "to scratch someone's itch". I think that the other functionality (motif searching, comparisons,weblogo) might be more useful to people. > While it is only free for academic use, have you seen TAMO? > *TAMO: a flexible, object-oriented framework for analyzing transcriptional > regulation using DNA-sequence motifs. * > Bioinformatics. 2005 Jul 15;21(14):3164-5. > > > http://fraenkel.mit.edu/TAMO/ Yes, I've seen it and I've even recommended it on the biopython mailing list when there was no replacement in biopython. However, their library is free only for academia and AFAIK it's not using biopython datastructures, so needs some work to integrate with TAMO if you are using Biopython. Bio.Motif is meant to provide free software for Motif analysis. > Well, I am not sure how many used Bio.AlignAce given the Parser.py bug :-) > Based on the CVS, both have been untouched for about three years. > Well, I've not used it myself for a while... I'm no longer doing de-novo motif discovery. However, it still works so it's potentially useful. I think this is largely due to the lack of documentation for the Bio.AlignAce and Bio.MEME tools (partially my fault). Hopefully people will start using this if they read the tutorial. > Also, what species are these used for? > One of the papers of AlignAce indicate that the base composition was set for > yeast. 
> They're both general purpose, you can set the gc content for alignAce and even an HMM for MEME. > > Personally I would be interested in a general protein motif finding module > because of my current research. However, I do have a different view with > respect to the Biopython community as indicated above with the licenses. Both MEME and AlignAce can be used to find motifs in proteins, but it has not so much to do with Bio.Motif, since it does not provide any motif-finnding capabilities by itself. In general Bio.Motif should be able to deal with protein motifs, but I've never tested it (I'm mostly using it for DNA motifs), so I'll be happy to help if you find bugs. On Mon, Nov 24, 2008 at 4:25 PM, Giovanni Marco Dall'Olio wrote: > > I would just like to tell you that I have tried the TAMO framework you > suggested me, and found it very useful. Yes, I remember, but the problem is with the TAMO license. I think that the Motif object might be still useful since it is free, allows to read motifs from databases like JASPAR to scan sequences and/or compare them with "your" motifs. > I am not using it anymore because I don't need it, but I remember that I liked: > - the methods to represent motifs as matrixes of frequencies/occurrencies etc.. done > - the fact that it was easy to create a motif from an alignment of sequences depending on your definition of easy, it's there > - the integration it had with this website: > http://weblogo.berkeley.edu/logo.cgi. done > I would suggest you to provide integration with this other web > service, which enable to plot the difference between two sequence > logos: http://www.twosamplelogo.org/examples.html. This I haven't done yet, but I'll try to provide functionality for that (shouldn't take too long). 
-- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From dalloliogm at gmail.com Mon Dec 1 16:07:08 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 1 Dec 2008 22:07:08 +0100 Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code In-Reply-To: <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com> References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> <492ACE38.1090301@gmail.com> <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com> Message-ID: <5aa3b3570812011307q710cab78q2fbae061f5dd5eff@mail.gmail.com> On Mon, Dec 1, 2008 at 9:53 PM, Bartek Wilczynski wrote: > On Mon, Nov 24, 2008 at 4:25 PM, Giovanni Marco Dall'Olio > wrote: >> >> I would just like to tell you that I have tried the TAMO framework you >> suggested me, and found it very useful. > > Yes, I remember, but the problem is with the TAMO license. I think > that the Motif object might be still > useful since it is free, allows to read motifs from databases like > JASPAR to scan sequences and/or > compare them with "your" motifs. Thanks for all these changes. I remember that I wrote a mail to TAMO's authors when I was using it. They seemed to be interested in integrating the code with biopython, so maybe the license issue could be superated. It's up to you, whether you want to reimplement all the functions they have or not. -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Dec 1 16:09:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Dec 2008 21:09:33 +0000 Subject: [BioPython] will BioSQL work with psycopg2? 
In-Reply-To: <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com> References: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com> <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com> Message-ID: <320fb6e00812011309l104fb07as67a77be59778fa54@mail.gmail.com> 2008/12/1 Cymon Cox wrote: > 2008/12/1 Peter > >> On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote: >> > Currently I have a dependency on psycopg (version 1.1.21) but since >> > that is so old pyscopg won't rebuild against the new mx, meaning that >> > I can't rebuild Biopython because the dependencies aren't there. >> > >> > So my question is, will the Biopython BioSQL work with the newer >> > psycopg2 (currently version 2.0.8)? See: >> > http://www.initd.org/pub/software/psycopg/ >> >> Yes, psycopg2 should work with Biopython 1.49 onwards (including >> Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616: > > To confirm: I'm currently using psycopg2 vers. 2.0.8 > >> > Does it require the 1.x API or will it work with 2.x? The BioSQL page: >> > http://biopython.org/wiki/BioSQL >> > isn't clear on this. >> >> I'm not sure, having not used psycopg or psycopg2 myself. Hopefully >> Cymon can clarify this (CC'd). > > Sorry, but I'm not sure what the question is here... On reflection, I think Alex was asking if Biopython's BioSQL interface would work with or require psycopg 1.x or psycopg 2.x - and the answer to that is as of Biopython 1.49 we support either (but don't require either - you could use MySQL instead of PostgreSQL for example). Older versions of Biopython don't support psycopg 2.x. Peter From lueck at ipk-gatersleben.de Tue Dec 2 02:37:30 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 2 Dec 2008 08:37:30 +0100 Subject: [BioPython] [Correction!] 
Emboss eprimer3-Product Size Range
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com>
Message-ID: <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de>

Thanks Peter, that was it!
Like always you solved my problem ;-)
I wish everybody a nice Christmas Time!
Stefanie

From biopython at maubp.freeserve.co.uk Tue Dec 2 05:25:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Dec 2008 10:25:47 +0000
Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range
In-Reply-To: <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de>
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com> <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00812020225m4806fce1p4b1316e9a497c212@mail.gmail.com>

On Tue, Dec 2, 2008 at 7:37 AM, Stefanie Lück wrote:
> Thanks Peter, that was it!
> Like always you solved my problem ;-)
> I wish everybody a nice Christmas Time!
> Stefanie

Great! I'm glad to be of help.

Peter

P.S. It's already snowing here in Scotland!

From aloraine at uncc.edu Tue Dec 2 10:48:00 2008
From: aloraine at uncc.edu (Loraine, Ann)
Date: Tue, 2 Dec 2008 10:48:00 -0500
Subject: [BioPython] blat (psl) parser?
Message-ID:

Dear all,

Does BioPython include a blat (psl format) parser?

If yes, I would be grateful for pointers to documentation or tutorials
describing how to use it.

I appreciate your help!

-Ann Loraine From dalloliogm at gmail.com Wed Dec 3 14:03:16 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 3 Dec 2008 20:03:16 +0100 Subject: [BioPython] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <5aa3b3570812031103m53050429lf3d517ccf6142bd7@mail.gmail.com> On 10/23/08, Giovanni Marco Dall'Olio wrote: > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. Just in case someone else interested. I have found this plugin for elixir (which is an extension for sqlalchemy itself) which does version control and seems very easy to use. - http://elixir.ematia.de/apidocs/elixir.ext.versioned.html - http://elixir.ematia.de/trac/browser/elixir/trunk/tests/test_versioning.py It has things like automated versioning and reverting, but id doesn't seem to have commit messages. Of course it doesn't seem feasible to use it on a very big database, but it is good to know it exists. The three of them, elixir, sqlalchemy, and this plugin, seems very useful instruments to anyone wishing to use database, in my opinion :). > Do you know how to do this with databases? Does MySQL provide support for > revision control? > Thanks :) > > (sorry for cross-posting :( ) > > -- > ----------------------------------------------------------- > > My Blog on Bioinformatics (italian): http://bioinfoblog.it > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From pmmagic at gmail.com Thu Dec 4 00:56:32 2008 From: pmmagic at gmail.com (paul m) Date: Thu, 4 Dec 2008 00:56:32 -0500 Subject: [BioPython] blat (psl) parser? 
In-Reply-To:
References:
Message-ID: <991e7bc10812032156v3de1913bj3051d0f7a8435870@mail.gmail.com>

Ann,

I don't think BioPython has a blat parser (or at least it didn't last
time I looked), but I've written one that I use. Nothing fancy but it
works. I'd be happy to send it to you via email.

Cheers,
Paul

From anaryin at gmail.com Thu Dec 4 12:49:53 2008
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 4 Dec 2008 18:49:53 +0100
Subject: [BioPython] AlignIO: Sequences of different length
Message-ID:

Hello all! I'm running BioPython 1.49 on my Linux machine and it's been
working rather fine until now.

I'm submitting two sequences for a pairwise alignment, using the EMBOSS
webservices. The results I get are good and the file is nicely
formatted, so there is no problem with the needle output (check here:
http://pastebin.com/m12ab3b2b ).

However, the format it comes in is not handy for what I want to do next,
so I thought of using Biopython to convert the alignment format into
something more useful, such as pir or fasta. And that's when I hit a
problem.

The code I'm running is here: http://pastebin.com/m509fd88f
When executed, it gives me this error:

Traceback (most recent call last):
  File "needle.py", line 50, in
    alignments = AlignIO.read(open('alignment.results'), "emboss")
  File "/usr/lib/python2.5/site-packages/PIL/__init__.py", line 375, in read
  File "/home/joao/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/AlignIO/EmbossIO.py", line 197, in next
SyntaxError: Error parsing alignment - sequences of different length?

Which is, to say the least, weird. First, the PIL __init__.py it calls
is completely empty. Then, the second thing it mentions is the file in
my Desktop folder, which doesn't exist anymore. Third, if I use
AlignIO.parse() instead of read(), it runs ok. But as soon as I try to
actually _do_ something with it, it gives me this very same error.

So, is this a bug or is it me and my nasty coding abilities :) ?

Thanks in advance!

João Rodrigues
http://doeidoei.wordpress.com
Utrecht University
Netherlands

From biopython at maubp.freeserve.co.uk Thu Dec 4 12:56:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:56:27 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References:
Message-ID: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com>

On Thu, Dec 4, 2008 at 5:49 PM, João Rodrigues wrote:
> Hello all! I'm running BioPython 1.49 in my Linux machine and it's been
> working rather fine until now.
>
> I'm submitting two sequences for a pairwise alignment, using the EMBOSS
> webservices. The results I get are good and the file is nicely formatted, so
> there is no problem with the needle output (check here:
> http://pastebin.com/m12ab3b2b ).
>
> However, the format it comes is not handy for what I want to do next, so I
> thought of using Biopython to convert the alignment format into something
> more useful, such as pir or fasta. And that's when I hit a problem.
> > The code I'm running is here: http://pastebin.com/m509fd88f > When executed, it gives me this error: > > Traceback (most recent call last): > File "needle.py", line 50, in > alignments = AlignIO.read(open('alignment.results'), "emboss") > File "/usr/lib/python2.5/site-packages/PIL/__init__.py", line 375, in read > > File > "/home/joao/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/AlignIO/EmbossIO.py", > line 197, in next > SyntaxError: Error parsing alignment - sequences of different length? > > Which is, to say the least, weird. First, the PIL __init__.py it calls is > completely empty. Then, the second thing he mentions is the file on my > Desktop folder, which doesn't exist anymore. Third, if I use AlignIO.parse() > instead of read(), it runs ok. But as soon as I try to actually _do_ > something with it, it gives me this very same error. The bit about PIL in the stack trace is odd. > So, is this a bug or is it me and my nasty coding abilities :) ? > > Thanks in advance! > > Jo?o Rodrigues I don't know if its good news or bad news, but its a bug in the Biopython "emboss" parser not your code. I get the same error message here on my machine using your sample output. I'll take a look at the code get back to you shortly... Can we include your sample output as a unit test in Biopython please? Thanks Peter From biopython at maubp.freeserve.co.uk Thu Dec 4 13:10:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 18:10:58 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> Message-ID: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> On Thu, Dec 4, 2008 at 6:02 PM, Jo?o Rodrigues wrote: > Well, bad news, I'd rather have it be a problem with my code :D No problem > at all to include my output. Thanks. 
For anyone wanting to try this at home, working backwards from the
answer, the first input sequence is:

>E1
MSSDRQRSDDESPSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLSSKTTAKLS
TSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSSDY
PFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADPLV
GSIATQYLTNRAEHDRIARQWTKRYAT

And the second:

>E2
GMSDDDSRASTSSSSSSSSNQQTEKETNTPKKKESKVSMSKNSKLLSTSAKRIQKELADI
TLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTPEYPFKPPKVTFRTRI
YHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADPLVGSIATQYMTNRAE
HDRMARQWTKRYAT

I've assumed default needle parameters are being used.

It's the start of the alignment which is causing the problem, i.e. this
bit of your file:

E1                 1 MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT     49
                          ..|||:| .||||.||.:          ..|:..|.:.|.:||.:
E2                 1      GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES     35

This is easier to see with a fixed width font, but compare it to what I
get using EMBOSS 6.0.1 on my local machine:

E1                 1 MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT     49
                          ..|||:| .||||.||.:          ..|:..|.:.|.:||.:
E2                 1 -----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES     35

Note that here the second sequence, E2, has five leading gap
characters. These are missing in your file, where spaces have been
used, and the Biopython parser was not expecting this.

What URL are you using for the EMBOSS webservice? I'd like to try this
myself, and if possible see what version of EMBOSS they are using on
the server.

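To make the length mismatch concrete, here is a minimal sketch in plain Python. It is not the actual Bio.AlignIO.EmbossIO code; the helper pad_leading_gaps and the variable names are hypothetical, and the two rows are simply the E1 and E2 alignment rows from the first block of the output discussed above.

```python
def pad_leading_gaps(seq_field, gap="-"):
    """Turn leading spaces in an alignment row into gap characters."""
    stripped = seq_field.lstrip(" ")
    return gap * (len(seq_field) - len(stripped)) + stripped

# First block of the alignment, sequence columns only.
row_e1 = "MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT"
# E2 as written by the web version: spaces where leading gaps belong.
row_e2_web = "     GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES"

# A parser that strips whitespace ends up with rows of different length.
naive = row_e2_web.strip()
print(len(row_e1), len(naive))

# Treating the leading spaces as gaps restores equal lengths.
fixed = pad_leading_gaps(row_e2_web)
print(fixed)
print(len(row_e1) == len(fixed))
```

A real fix would also need to know the column where the sequence starts on each line, which this sketch sidesteps by working on the sequence columns directly.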
Peter

From anaryin at gmail.com Thu Dec 4 13:19:43 2008
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 4 Dec 2008 19:19:43 +0100
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
Message-ID:

I believe the script I gave you had the needle function in it :x It's
just a simple WSDL file provided by EBI being used by the SOAPpy module
to access the webservice. The parameters are default as well, so gapopen
10.0 and gapextend 0.5.

The page of the service is:
http://www.ebi.ac.uk/Tools/webservices/services/emboss

I kind of noticed that nonsense gap, but that comes with the format
unfortunately. I ran the Web version of the program, not the
webservice, and the outcome was the same (regarding the gap):

http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html

João Rodrigues
http://doeidoei.wordpress.com

From biopython at maubp.freeserve.co.uk Thu Dec 4 13:57:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:57:57 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
Message-ID: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com>

On Thu, Dec 4, 2008 at 6:19 PM, João Rodrigues wrote:
> I believe the script I gave you had the needle function on it :x It's just a
> simple WSDL file provided by EBI being used by the SOAPpy module to access
> the webservice. The parameters are default as well so, gapopen 10.0 and
> gapextend 0.5.

Oh yes - I see it now on pastebin, previously I'd only looked at the
output file.

> The page of the service is: > http://www.ebi.ac.uk/Tools/webservices/services/emboss > > I kind of noticed that non-sense gap, but that comes with the format > unfortunately. I ran the Web version of the program, not the webservice, and > the outcome was the same (regarding the gap): > > http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html > So the web versions of EMBOSS are using spaces for leading gaps (program version unknown), while the standalone version of EMBOSS (up to version 6.0.1) are using dashes (minus signs). Biopython 1.49 expects the leading dashes. I suspect that the EBI are running a more recent not-yet-released version of the EMBOSS tools to power their webservices. I'm not familiar enough with their code to know where to look... I suggest you email the webservice people and ask them why the needle output is different to the command line version (tell them parsers such as Biopython may be broken by this change). If this is a forthcoming change to the EMBOSS standalone tools, then I guess we'll have to fix the parser anyway. I may find time to look at this over the weekend - we'll see. Regards, Peter From anaryin at gmail.com Fri Dec 5 06:34:33 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 5 Dec 2008 12:34:33 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> Message-ID: I got a reply from the EBI support team saying that the webserver they provide is outdated, when compared to the versions of NEEDLE we (me on the web and Peter on his local machine) used. So, BioPython is nice and up-to-date, it's their server that is quite outdated. 
" Actually the WSEmboss web service uses an older version of EMBOSS (2.9.0), which exibits this behaviour. I suggest you contact the BioPython folks and let them know that older versions of EMBOSS behave differently. If you want to use the latest version of EMBOSS I suggest looking at using the Soaplab services (see http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) instead. All the best, Support at EBI" Jo?o Rodrigues http://doeidoei.wordpress.com On Thu, Dec 4, 2008 at 7:57 PM, Peter wrote: > On Thu, Dec 4, 2008 at 6:19 PM, Jo?o Rodrigues wrote: > > I believe the script I gave you had the needle function on it :x It's > just a > > simple WSDL file provided by EBI being used by the SOAPpy module to > access > > the webservice. The parameters are default as well so, gapopen 10.0 and > > gapextend 0.5. > > Oh yes - I see it now on pastebin, previously I'd only looked at the > output file. > > > The page of the service is: > > http://www.ebi.ac.uk/Tools/webservices/services/emboss > > > > I kind of noticed that non-sense gap, but that comes with the format > > unfortunately. I ran the Web version of the program, not the webservice, > and > > the outcome was the same (regarding the gap): > > > > > http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html > > > > So the web versions of EMBOSS are using spaces for leading gaps > (program version unknown), while the standalone version of EMBOSS (up > to version 6.0.1) are using dashes (minus signs). Biopython 1.49 > expects the leading dashes. > > I suspect that the EBI are running a more recent not-yet-released > version of the EMBOSS tools to power their webservices. I'm not > familiar enough with their code to know where to look... > > I suggest you email the webservice people and ask them why the needle > output is different to the command line version (tell them parsers > such as Biopython may be broken by this change). 
> > If this is a forthcoming change to the EMBOSS standalone tools, then I > guess we'll have to fix the parser anyway. I may find time to look at > this over the weekend - we'll see. > > Regards, > > Peter > From biopython at maubp.freeserve.co.uk Fri Dec 5 07:18:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Dec 2008 12:18:50 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> Message-ID: <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> On Fri, Dec 5, 2008 at 11:34 AM, Jo?o Rodrigues wrote: > I got a reply from the EBI support team saying that the webserver they > provide is outdated, when compared to the versions of NEEDLE we (me on the > web and Peter on his local machine) used. So, BioPython is nice and > up-to-date, it's their server that is quite outdated. > > " Actually the WSEmboss web service uses an older version of EMBOSS (2.9.0), > which exibits this behaviour. I suggest you contact the BioPython folks and let > them know that older versions of EMBOSS behave differently. > > If you want to use the latest version of EMBOSS I suggest looking at using > the Soaplab services (see http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) > instead. > > All the best, > > Support at EBI" > > Jo?o Rodrigues Thanks the update :) Are you OK using the more up to date SOAP needle, or perhaps standalone needle? 
Does thos From anaryin at gmail.com Fri Dec 5 08:59:25 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 5 Dec 2008 13:59:25 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: Well... My VISTA partition just erased my Linux one, don't know how, so I can't answer that right now :x As soon as I get linux again, as soon as I get my script written again, I'll give an update here :) But I had solved the problem by changing the alignment output format to markx10 and "parsing" it my own way. Cheers and thanks for the help :) Jo?o Rodrigues http://doeidoei.wordpress.com On Fri, Dec 5, 2008 at 12:18 PM, Peter wrote: > On Fri, Dec 5, 2008 at 11:34 AM, Jo?o Rodrigues wrote: > > I got a reply from the EBI support team saying that the webserver they > > provide is outdated, when compared to the versions of NEEDLE we (me on > the > > web and Peter on his local machine) used. So, BioPython is nice and > > up-to-date, it's their server that is quite outdated. > > > > " Actually the WSEmboss web service uses an older version of EMBOSS > (2.9.0), > > which exibits this behaviour. I suggest you contact the BioPython folks > and let > > them know that older versions of EMBOSS behave differently. > > > > If you want to use the latest version of EMBOSS I suggest looking at > using > > the Soaplab services (see > http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) > > instead. > > > > All the best, > > > > Support at EBI" > > > > Jo?o Rodrigues > > Thanks the update :) > > Are you OK using the more up to date SOAP needle, or perhaps standalone > needle? 
> > Does thos > From anaryin at gmail.com Mon Dec 8 18:26:36 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 9 Dec 2008 00:26:36 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: Well, as promised, here goes update. I didn't try with soaplab2 because it was too complicated to get it to work. I didn't want to lose more than 10 minutes either so... However, with standalone needle, which EBI claim to be the same version as the soaplab2 service, it works flawlessly :) Code here: http://pastebin.com/f29ff12d6 Console output here: http://pastebin.com/f5bbc5593 It's not a bug then, it's just an old version :) Using the web versions, there may be some workarounds. If you convert the format to one of the others, you may get a usable one for Biopython. I tried markx1 I believe, and it was "almost" parsable, it just didn't get the correct sequences (if you deleted everything BUT the sequences, it would work). So, I think there should at least be a warning somewhere for the users so that they don't get nuts or reporting bugs :) Thanks for all the help! Regards! Jo?o Rodrigues http://doeidoei.wordpress.com On Fri, Dec 5, 2008 at 2:59 PM, Jo?o Rodrigues wrote: > Well... My VISTA partition just erased my Linux one, don't know how, so I > can't answer that right now :x As soon as I get linux again, as soon as I > get my script written again, I'll give an update here :) But I had solved > the problem by changing the alignment output format to markx10 and "parsing" > it my own way. 
> > Cheers and thanks for the help :) > > Jo?o Rodrigues > http://doeidoei.wordpress.com > > > On Fri, Dec 5, 2008 at 12:18 PM, Peter wrote: > >> On Fri, Dec 5, 2008 at 11:34 AM, Jo?o Rodrigues >> wrote: >> > I got a reply from the EBI support team saying that the webserver they >> > provide is outdated, when compared to the versions of NEEDLE we (me on >> the >> > web and Peter on his local machine) used. So, BioPython is nice and >> > up-to-date, it's their server that is quite outdated. >> > >> > " Actually the WSEmboss web service uses an older version of EMBOSS >> (2.9.0), >> > which exibits this behaviour. I suggest you contact the BioPython folks >> and let >> > them know that older versions of EMBOSS behave differently. >> > >> > If you want to use the latest version of EMBOSS I suggest looking at >> using >> > the Soaplab services (see >> http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) >> > instead. >> > >> > All the best, >> > >> > Support at EBI" >> > >> > Jo?o Rodrigues >> >> Thanks the update :) >> >> Are you OK using the more up to date SOAP needle, or perhaps standalone >> needle? >> >> Does thos >> > > From biopython at maubp.freeserve.co.uk Tue Dec 9 05:17:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Dec 2008 10:17:40 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> On Mon, Dec 8, 2008 at 11:26 PM, Jo?o Rodrigues wrote: > Well, as promised, here goes update. I didn't try with soaplab2 because it > was too complicated to get it to work. I didn't want to lose more than 10 > minutes either so... 
> However, with standalone needle, which EBI claim to be
> the same version as the soaplab2 service, it works flawlessly :)
>
> Code here: http://pastebin.com/f29ff12d6
>
> Console output here: http://pastebin.com/f5bbc5593
>
> It's not a bug then, it's just an old version :)

Well, arguably it would be nice if Biopython could parse old versions
of the EMBOSS pairs/simple output too, but it's not so important.

> Using the web versions, there may be some workarounds. If you convert
> the format to one of the others, you may get a usable one for Biopython.

If you just want the alignment itself, using FASTA as the output
format from needle is very simple.

e.g.

$ needle one.fasta two.fasta --auto --filter -aformat fasta
>E1
MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLS-SKTTAK
LSTSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSS
DYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP
LVGSIATQYLTNRAEHDRIARQWTKRYAT
>E2
-----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKESKVSMSKNSKL
LSTSAKRIQKELADITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTP
EYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP
LVGSIATQYMTNRAEHDRMARQWTKRYAT

> I tried markx1 I believe, and it was "almost" parsable, it just didn't get the
> correct sequences (if you deleted everything BUT the sequences, it would
> work).

How were you trying to parse the markx1 output?

Note that the EMBOSS markx10 output is similar to, but differs from,
the FASTA -m 10 output (which Biopython can parse as the "fasta-m10"
format in Bio.AlignIO).

> So, I think there should at least be a warning somewhere for the
> users so that they don't get nuts or reporting bugs :)

Do you mean a warning about trying to use Bio.AlignIO with the
"emboss" format to read output from old versions of the EMBOSS needle
tool?

Peter From anaryin at gmail.com Tue Dec 9 06:25:37 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 9 Dec 2008 12:25:37 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> References: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> Message-ID: > > > Using the web versions, there may be some workarounds. If you convert > > the format to one of the others, you may get a usable one for Biopython. > > If you just want the alignment itself, using FASTA as the output > format from needle is very simple. > > e.g. > > $ needle one.fasta two.fasta --auto --filter -aformat fasta > >E1 > MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLS-SKTTAK > LSTSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSS > DYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP > LVGSIATQYLTNRAEHDRIARQWTKRYAT > >E2 > -----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKESKVSMSKNSKL > LSTSAKRIQKELADITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTP > EYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP > LVGSIATQYMTNRAEHDRMARQWTKRYAT > Yep, but in the web version such format does not exist.. don't know why. > > > I tried markx1 I believe, and it was "almost" parsable, it just didn't > get the > > correct sequences (if you deleted everything BUT the sequences, it would > > work). > > How were you trying to parse the markx1 output? > > Note that the EMBOSS markx10 output is similar to, but differs from, > the FASTA -m 10 output (which Biopython can parse as the "fasta-m10" > format in Bio.AlignIO). > I tried with FASTA as the argument for the parser, because the description said: "This is the standard default output format used by Bill Pearson's suite of FASTA programs." 
And btw, it was the markx0, not the 1. Typo yesterday night..

>
> > So, I think there should at least be a warning somewhere for the
> > users so that they don't get nuts or reporting bugs :)
>
> Do you mean a warning about trying to use Bio.AlignIO with the
> "emboss" format to read output from old versions of EMBOSS needle
> tool?

Well, it may be frustrating for someone who's using that webservice
to try and parse it and it gives that error. It might be useful, for
example, to mention, when such an error occurs, that it might be
happening due to use of the web version. Just a small appendix to the
error message, for example.

Regards,
João

From biopython at maubp.freeserve.co.uk Tue Dec 9 06:42:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Dec 2008 11:42:34 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
 <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com>
 <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com>
 <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com>
Message-ID: <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>

On Tue, Dec 9, 2008 at 11:25 AM, João Rodrigues wrote:
>> > Using the web versions, there may be some workarounds. If you convert
>> > the format to one of the others, you may get a usable one for Biopython.
>>
>> If you just want the alignment itself, using FASTA as the output
>> format from needle is very simple.
>>
>> e.g.
>>
>> $ needle one.fasta two.fasta --auto --filter -aformat fasta
>> ...
>
> Yep, but in the web version such format does not exist.. don't know why.

A strange omission on their part.

>> > I tried markx1 I believe, and it was "almost" parsable, it just didn't
>> > get the correct sequences (if you deleted everything BUT the
>> > sequences, it would work).
>>
>> How were you trying to parse the markx1 output?
>> >> Note that the EMBOSS markx10 output is similar to, but differs from, >> the FASTA -m 10 output (which Biopython can parse as the "fasta-m10" >> format in Bio.AlignIO). > > I tried with FASTA as the argument for the parser, because the description > said: > "This is the standard default output format used by Bill Pearson's suite of > FASTA programs." > > And btw, it was the markx0, not the 1. Typo yesterday night.. The various EMBOSS output formats are described here, http://emboss.sourceforge.net/docs/themes/AlignFormats.html The outputs markx0, markx1, ..., markx10 are EMBOSS *imitations* of the FASTA tool's output formats (but with the addition of EMBOSS style header/footers). Right now, Biopython doesn't parse any of these. In Biopython's Bio.AlignIO, "fasta" refers to the FASTA input file format (the simple file format using greater than signs for each new sequence). The only FASTA output format we support is "fasta-m10" which is how we refer to the output from FASTA's -m 10 command line argument. Right now, the Biopython FASTA m10 parser can't cope with the EMBOSS markx10 format. It might be nice if it did, but given we can parse EMBOSS's default output this doesn't seem like a big issue. >> > So, I think there should at least be a warning somewhere for the >> > users so that they don't get nuts or reporting bugs :) >> >> Do you mean a warning about trying to use Bio.AlignIO with the >> "emboss" format to read output from old versions of EMBOSS needle >> tool? > > Well, it may be frustrating for someone who's using that webservice to try > and parse it and it gives that error. It might be useful for example, to > mention, when such error occurs, that it might be happening due to use of > web version. Just a small appendix to the error message f example. So instead of "Error parsing alignment - sequences of different length?" it could say "Error parsing alignment - sequences of different length? Possibly you are using an old version of EMBOSS." 
That should help.

As an aside, do you mind me asking why you are using needle via a
webservice? If you expect to do lots of alignments, surely running it
locally is faster and more reliable (no network issues to worry
about)?

Peter

From biopython at maubp.freeserve.co.uk Tue Dec 9 07:05:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Dec 2008 12:05:30 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To: <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>
References: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com>
 <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com>
 <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com>
 <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>
Message-ID: <320fb6e00812090405i5a23f32ar3c2f7cd535b67b64@mail.gmail.com>

On Tue, Dec 9, 2008 at 11:42 AM, Peter wrote:
>
> So instead of "Error parsing alignment - sequences of different
> length?" it could say "Error parsing alignment - sequences of
> different length? Possibly you are using an old version of EMBOSS."
> That should help.

I've tried to clarify this exception message in the latest code. For
anyone interested in the details, see CVS revision 1.6 of
Bio/AlignIO/EmbossIO.py which is viewable online:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/AlignIO/EmbossIO.py?cvsroot=biopython

There is no reason to update your installation, João, as this will
make no difference to you - parsing the old EMBOSS 2.9.0 needle output
will still fail.

Peter

From rjalves at igc.gulbenkian.pt Thu Dec 11 12:25:32 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Dec 2008 17:25:32 +0000
Subject: [BioPython] KEGG Gene parser
Message-ID: <49414D0C.8080509@igc.gulbenkian.pt>

Hi everyone,

Bringing back the KEGG Gene parser subject (from January 2008),
Bio.KEGG has some modules for KEGG resources but not Gene. SeqIO
doesn't seem to support KEGG either.
So my question is, have any progresses been made in this regard? Thanks, Renato. From biopython at maubp.freeserve.co.uk Fri Dec 12 06:06:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Dec 2008 11:06:07 +0000 Subject: [BioPython] KEGG Gene parser In-Reply-To: <49414D0C.8080509@igc.gulbenkian.pt> References: <49414D0C.8080509@igc.gulbenkian.pt> Message-ID: <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com> On Thu, Dec 11, 2008 at 5:25 PM, Renato Alves wrote: > Hi everyone, > > Bringing back the KEGG Gene parser subject (from january 2008), Bio.KEGG has > some modules for KEGG resources but not Gene. SeqIO doesn't seem to support > KEGG either. What are you trying to do? Do you want to parse gene files from KEGG into sequence objects? If so, could you point me at an particular example file so I have a better feel for the problem (and if it would fit into Bio.SeqIO). Thanks, Peter From jae at lmi.net Fri Dec 12 11:21:47 2008 From: jae at lmi.net (Jason Eshleman) Date: Fri, 12 Dec 2008 08:21:47 -0800 Subject: [BioPython] bioPython and STRUCTURE Message-ID: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com> Greetings. I'm curious if anyone has worked with code to operate the multi-locus pop gen. program "STRUCTURE" (http://pritch.bsd.uchicago.edu/software.html). I've got some code myself that I'd be happy to share/contribute if there's interest. I haven't been able to find any such discussions in the archives, but it could be my searching skills. The term 'structure' returns a large number of completely irrelevant hits. It does seem like bioPython is light in the pop. gen dept at this point. 
-jae

From tiagoantao at gmail.com Fri Dec 12 11:39:34 2008
From: tiagoantao at gmail.com (tiagoantao at gmail.com)
Date: Fri, 12 Dec 2008 16:39:34 +0000
Subject: [BioPython] bioPython and STRUCTURE
In-Reply-To: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
References: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
Message-ID: <6d941f120812120839i3f4b7d48gdcaa3f40a96364b6@mail.gmail.com>

Hi,

I am writing this from a mobile phone in the middle of a conference,
so I will be short.
Your effort is most welcome. As soon as I am back (next week) I will
gladly help you with putting the code in the Biopython pop gen module.
Structure is widely used and your contribution, from my part, is most
welcome.
There is actually a big chunk of updates that can be committed soon,
maybe yours can go along.

On 12/12/08, Jason Eshleman wrote:
> Greetings. I'm curious if anyone has worked with code to operate the
> multi-locus pop gen. program "STRUCTURE"
> (http://pritch.bsd.uchicago.edu/software.html). I've got some code myself
> that I'd be happy to share/contribute if there's interest. I haven't been
> able to find any such discussions in the archives, but it could be my
> searching skills. The term 'structure' returns a large number of
> completely irrelevant hits. It does seem like bioPython is light in the
> pop. gen dept at this point.
>
> -jae
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
- Matthew Simmons,
http://www.tiago.org

From dalloliogm at gmail.com Fri Dec 12 12:31:28 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 12 Dec 2008 18:31:28 +0100
Subject: [BioPython] bioPython and STRUCTURE
In-Reply-To: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
References: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
Message-ID: <5aa3b3570812120931i3a5d654ta2955f9fe9bee292@mail.gmail.com>

On 12/12/08, Jason Eshleman wrote:
> Greetings. I'm curious if anyone has worked with code to operate the
> multi-locus pop gen. program "STRUCTURE"
> (http://pritch.bsd.uchicago.edu/software.html). I've got
> some code myself that I'd be happy to share/contribute if there's interest.
>

You could have a look at this code:
- http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen

which is a merge between Tiago's code and mine to implement
population genetics things in python/biopython.
Actually I wonder whether it would be easier to use tools like waf or
scons to handle external tools, but anyway it is good to have handlers
like that in biopython.

> I haven't been able to find any such discussions in the archives, but it
> could be my searching skills.

mmmm have you tried something like "biopython structure -3d -pdb
genetics"?

> The term 'structure' returns a large number
> of completely irrelevant hits. It does seem like bioPython is light in the
> pop. gen dept at this point.

At the moment yes, it is.

> > -jae > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Dec 12 13:28:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Dec 2008 18:28:14 +0000 Subject: [BioPython] KEGG Gene parser In-Reply-To: <4942A89E.3070002@igc.gulbenkian.pt> References: <49414D0C.8080509@igc.gulbenkian.pt> <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com> <4942A89E.3070002@igc.gulbenkian.pt> Message-ID: <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> On Fri, Dec 12, 2008 at 6:08 PM, Renato Alves wrote: > At the moment I'm doing exactly that, getting the sequence out of gene files > like the one attached. When you say sequence, do you want the nucleotides or the protein (or both)? Is there a URL for where that example file came from? I'd like to have a look at similar examples etc but all I found so far on KEGG were the HTML equivalents to this data. Thanks Peter From rjalves at igc.gulbenkian.pt Fri Dec 12 14:08:47 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 12 Dec 2008 19:08:47 +0000 Subject: [BioPython] KEGG Gene parser In-Reply-To: <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> References: <49414D0C.8080509@igc.gulbenkian.pt> <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com> <4942A89E.3070002@igc.gulbenkian.pt> <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> Message-ID: <4942B6BF.8050806@igc.gulbenkian.pt> Both. I got that one via KEGG API but you can get them at ftp://ftp.genome.jp/pub/kegg/genes/ . In the organisms folder you have full genome files (*.ent) in KEGG format. 
Renato Quoting Peter on 12/12/2008 06:28 PM: > On Fri, Dec 12, 2008 at 6:08 PM, Renato Alves wrote: > >> At the moment I'm doing exactly that, getting the sequence out of gene files >> like the one attached. >> > > When you say sequence, do you want the nucleotides or the protein (or both)? > > Is there a URL for where that example file came from? I'd like to > have a look at similar examples etc but all I found so far on KEGG > were the HTML equivalents to this data. > > Thanks > > Peter > From stran104 at chapman.edu Sun Dec 14 04:38:34 2008 From: stran104 at chapman.edu (Matthew Strand) Date: Sun, 14 Dec 2008 01:38:34 -0800 Subject: [BioPython] SeqIO.write() Multiple Calls for fasta In-Reply-To: <2a63cc350812140122n2260c0c3r17b7e8088aaaeec9@mail.gmail.com> References: <2a63cc350812140122n2260c0c3r17b7e8088aaaeec9@mail.gmail.com> Message-ID: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com> Hello, I have been working with SeqIO.write() on fasta files based on some info provided in the API Documentation. It is written that SeqIO.write() should "probably" perform fine with multiple calls, but with my experience it actually does overwrite the whole file, even when the file is opened and closed immediately before and after the write. Has anyone else had this experience? I will be rewriting my code to create large arrays before adding to the file, which is easy for the example provided below. However, this will take some work to change the part of the application that runs against our local Blast databases for a few days, periodically adding sequences to files. I'd like to make sure that I'm not the only one with this issue before rewriting it. ---------BEGIN API Documentation Quote Output - Advanced ================= The effect of calling write() multiple times on a single file will vary depending on the file format, and is best avoided unless you have a strong reason to do so. Trying this for certain alignment formats (e.g. 
phylip, clustal, stockholm) would have the effect of concatenating
several multiple sequence alignments together. Such files are created
by the PHYLIP suite of programs for bootstrap analysis.

For sequential files formats (e.g. fasta, genbank) each "record block"
holds a single sequence. For these files it would probably be safe to
call write() multiple times.
---------END API Documentation Quote

---------BEGIN Code Sample to take a bunch of fasta files with multiple
species and generate individual files for each species.
for j in range(1, len(kogid)):
    name = "EXT-CLB-" + kogid[j] + ".seq"
    if os.path.exists(name):
        handle = open(name, "rU")
        records = list(SeqIO.parse(handle, "fasta"))
        for record in records:
            speciesID = record.id.split('|')[0]
            outFile = open(speciesID.split('-')[0] + ".seq", 'w')
            SeqIO.write([record], outFile, "fasta")
            outFile.close()
            print "Added a record for" + speciesID.split('-')[0]
    handle.close()
--------END Code Sample

Thank you for your responses,
-Matthew J

From mjldehoon at yahoo.com Sun Dec 14 05:53:50 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 14 Dec 2008 02:53:50 -0800 (PST)
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com>
Message-ID: <826414.74394.qm@web62404.mail.re1.yahoo.com>

> for j in range(1, len(kogid)):
>     name = "EXT-CLB-" + kogid[j] + ".seq"
>     if os.path.exists(name):
>         handle = open(name, "rU")
>         records = list(SeqIO.parse(handle, "fasta"))

You don't need the 'list' here

>         for record in records:
>             speciesID = record.id.split('|')[0]
>             outFile = open(speciesID.split('-')[0] + ".seq", 'w')
>             SeqIO.write([record], outFile, "fasta")
>             outFile.close()
>             print "Added a record for" + speciesID.split('-')[0]
>     handle.close()

The handle.close() should be inside the "if" block, so with an
additional four spaces of indentation. Though this is not important
for the problem you mentioned.

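Since every record is written to a file named after
speciesID.split('-')[0], it may help to see which records end up
sharing an output name. The following is only a minimal sketch in
plain Python: the record IDs and the helper name group_by_output_name
are made up for illustration, and only the two split() calls are
taken from the sample above.

```python
from collections import defaultdict


def group_by_output_name(record_ids):
    """Map each output file name to the record IDs that would be written to it."""
    groups = defaultdict(list)
    for record_id in record_ids:
        species_id = record_id.split("|")[0]          # same split as the sample
        out_name = species_id.split("-")[0] + ".seq"  # same file name rule
        groups[out_name].append(record_id)
    return dict(groups)


# Hypothetical IDs in a "species-number|description" style.
example_ids = ["hsa-001|KOG0001", "hsa-002|KOG0002", "mmu-001|KOG0001"]

for out_name, members in sorted(group_by_output_name(example_ids).items()):
    # If each record is written with a freshly opened 'w' handle, only the
    # last member of each group is left in the file.
    print(out_name, len(members), "record(s); last one:", members[-1])
```

Any name that collects more than one record here is a file that the
loop would open in 'w' mode more than once.
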
The only way I can see that SeqIO.write overwrites a file is if
speciesID.split('-')[0] + ".seq" results in the same file name for
more than one of the records. It's not a SeqIO.write issue; if you
comment out the SeqIO.write line, you'll probably end up with the
exact same set of output files (all of them empty though).

--Michiel

From biopython at maubp.freeserve.co.uk Sun Dec 14 08:05:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Dec 2008 13:05:09 +0000
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <826414.74394.qm@web62404.mail.re1.yahoo.com>
References: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com>
 <826414.74394.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00812140505w6f35d863n3896a2524b50d5ed@mail.gmail.com>

Matthew wrote:
> It is written that SeqIO.write() should "probably" perform fine with
> multiple calls, but with my experience it actually does overwrite
> the whole file, even when the file is opened and closed
> immediately before and after the write.

You seem to have misunderstood the documentation - are you already
familiar with working with file handles in Python? Perhaps this could
be clarified.

Using FASTA format, this is safe:

out_handle = open("example.fasta","w")
SeqIO.write(records, out_handle, "fasta")
SeqIO.write(more_records, out_handle, "fasta")
SeqIO.write(even_more_records, out_handle, "fasta")
out_handle.close()

You could also have written:

out_handle = open("example.fasta","w")
SeqIO.write(records+more_records+even_more_records, out_handle, "fasta")
out_handle.close()

I suspect what you are doing instead is akin to this:

out_handle = open("example.fasta","w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
out_handle = open("example.fasta","w")
SeqIO.write(more_records, out_handle, "fasta")
out_handle.close()
out_handle = open("example.fasta","w")
SeqIO.write(even_more_records, out_handle, "fasta")
out_handle.close()

This code will write the file once, then replace it, and again replace it. The final file contains only the third set of records. This is probably not what you intended.

Your example code seems to be trying to create one file per sequence. Perhaps you have some duplicate filenames being generated as Michiel suggested.

Peter

From biopython at maubp.freeserve.co.uk Sun Dec 14 18:58:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Dec 2008 23:58:01 +0000
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <2a63cc350812141439x31a78a7fi52da56ebb483cf67@mail.gmail.com>
References: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com> <826414.74394.qm@web62404.mail.re1.yahoo.com> <320fb6e00812140505w6f35d863n3896a2524b50d5ed@mail.gmail.com> <2a63cc350812141439x31a78a7fi52da56ebb483cf67@mail.gmail.com>
Message-ID: <320fb6e00812141558t61669d25q328f588fe93f10bd@mail.gmail.com>

Hi Matthew,

I've CC'ed your reply back to the mailing list.

On Sun, Dec 14, 2008 at 10:39 PM, Matthew Strand wrote:
> I see, you both are right, this is not a SeqIO.write() issue.
I should have > created the empty files and then used the append ('a') mode instead of the > write ('w') mode to add records to the file since the 'w' mode will > overwrite the file. I think using "a" for append will create the file if it does not already exist. Be careful if you run your script more than once - you may get multiple entries in each output file! > The way I interpreted the documentation was that it was safe to call > SeqIO.write() multiple times on the same file without overwriting it. And as > you both have shown, this is safe, as long as the right mode is used. > > Thank you for your responses and your time. I hope it helped :) Good night. Peter From dalloliogm at gmail.com Mon Dec 15 17:16:17 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 15 Dec 2008 23:16:17 +0100 Subject: [BioPython] [Popgen] a binary format for genotypes Message-ID: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com> Hi, I was reading this article: - http://www.biomedcentral.com/1471-2105/9/526/abstract The authors describe a binary format to store SNPs data in a more efficently way than flat files. One of the authors, in his blog, says that they have developed some python APIs: - http://www.mailund.dk/index.php/2008/12/11/snpfile/ I think this is interesting for our biopython Popgen module. Maybe we can ask them for collaboration, and we could use such a format to store SNP data internally or at least provide support for their format. What do you think? 
-- My blog on bioinformatics (now in English): http://bioinfoblog.it From kteague at bcgsc.ca Mon Dec 15 17:53:29 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Mon, 15 Dec 2008 14:53:29 -0800 Subject: [BioPython] [Popgen] a binary format for genotypes In-Reply-To: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com> References: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com> Message-ID: <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca> A lot of the headaches of dealing with large scale data sets in a performance optimizing manner (self-describing format, platform independant binary files) have been worked out in other fields of science who've been dealing with large scale data sets for a lot longer than the field of bioinformatics (e.g. astronomy and climatology). While I've only used it a little bit, so I can't comment if there are any other formats that are worthy contenders, the HDF5 format is well established for working with large scale data sets: http://www.hdfgroup.org/HDF5/ There are libraries for accessing this format for many languages. With Python there is PyTables, which is a very good library: http://www.pytables.org/ I haven't heard of anyone using this in bioinformatics, but I've seen it demonstrated in very high traffic financial application written in Python where performance of this library was impressive. The developer ported to PyTables after PostgreSQL became a bottle-neck and found that PyTables was an order of magnitude faster. Of course, this isn't a purely fair comparison, since PyTables gives up transactions, concurrency and referential integrity in favor of pure speed. But in most data analysis pipelines, each data set can be produced independantly of each other, so those features of a RDBMS aren't usually needed. There have been a number of other bioinformatics tools and libraries that have been using custom binary file formats to deal with the ever increasing size of bioinformatic data sets. 
From a sysadmin and developer perspective it's a big headache since these custom formats can be platform-sensitive and require compiling and installing binaries to deal with each data format. Bleh! I have yet to see a "custom bioinformatic binary file format" which had to be developed to account for short comings of an already existing binary file format ... From dalloliogm at gmail.com Mon Dec 15 18:49:29 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 16 Dec 2008 00:49:29 +0100 Subject: [BioPython] [Popgen] a binary format for genotypes In-Reply-To: <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca> References: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com> <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca> Message-ID: <5aa3b3570812151549p530f8005m9c2200712e840777@mail.gmail.com> On Mon, Dec 15, 2008 at 11:53 PM, Kevin Teague wrote: > A lot of the headaches of dealing with large scale data sets in a > performance optimizing manner (self-describing format, platform independant > binary files) have been worked out in other fields of science who've been > dealing with large scale data sets for a lot longer than the field of > bioinformatics (e.g. astronomy and climatology). > > While I've only used it a little bit, so I can't comment if there are any > other formats that are worthy contenders, the HDF5 format is well > established for working with large scale data sets: > > http://www.hdfgroup.org/HDF5/ I have already heard of this format, but for some reasons I thought that it couldn't be more efficient than a database. I have to deal with a table of ~10^7 entries, correlated with another one of 10^3, so, if I'd organize it in a certain way, it will have 10^10 entries. Do you think that this binary format would be more efficient than a database to handle all this? Does it supports relationships? (ok, I will read the documentation!! :) ). > > There are libraries for accessing this format for many languages. 
With > Python there is PyTables, which is a very good library: > > http://www.pytables.org/ Thanks for the link > I haven't heard of anyone using this in bioinformatics, but I've seen it > demonstrated in very high traffic financial application written in Python > where performance of this library was impressive. The developer ported to > PyTables after PostgreSQL became a bottle-neck and found that PyTables was > an order of magnitude faster. Of course, this isn't a purely fair > comparison, since PyTables gives up transactions, concurrency and > referential integrity in favor of pure speed. But in most data analysis > pipelines, each data set can be produced independantly of each other, so > those features of a RDBMS aren't usually needed. > > There have been a number of other bioinformatics tools and libraries that > have been using custom binary file formats to deal with the ever increasing > size of bioinformatic data sets. From a sysadmin and developer perspective > it's a big headache since these custom formats can be platform-sensitive and > require compiling and installing binaries to deal with each data format. > Bleh! > I have yet to see a "custom bioinformatic binary file format" which had to > be developed to account for short comings of an already existing binary file > format ... > > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From pzs at dcs.gla.ac.uk Thu Dec 18 08:47:11 2008 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 18 Dec 2008 13:47:11 +0000 Subject: [BioPython] Genbank LOCUS line slightly misaligned Message-ID: <494A545F.2020307@dcs.gla.ac.uk> I have a genbank file sent to my lab from a company called Genomatrix. It is slightly misformed. Specifically, the LOCUS lines have the right features, but not quite aligned; for example, the "bp" marker is not always at exactly the positions ([29:33] and [40:44]) required by _feed_first_line() in $biopythonhome/Genbank/Scanner.py. 
Have Genomatrix made an error in producing these genbank files, or should the Biopython routines accommodate these variations? Some lines just give warnings and plough on, but others report that there isn't a space in exactly the right place and fail to read the record at all. I'm having to hack the genbank file as we speak...

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Thu Dec 18 10:15:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Dec 2008 15:15:07 +0000
Subject: [BioPython] Genbank LOCUS line slightly misaligned
In-Reply-To: <494A545F.2020307@dcs.gla.ac.uk>
References: <494A545F.2020307@dcs.gla.ac.uk>
Message-ID: <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com>

On Thu, Dec 18, 2008 at 1:47 PM, Peter Saffrey wrote:
> I have a genbank file sent to my lab from a company called Genomatrix. It is
> slightly malformed.

Oh dear. Parsing malformed files is difficult as often they can be interpreted in more than one way. In general, the only safe and explicit choice here is to throw an exception - although we do tolerate some minor deviations from the spec in places.

> Specifically, the LOCUS lines have the right features, but not quite
> aligned; for example, the "bp" marker is not always at exactly the positions
> ([29:33] and [40:44]) required by _feed_first_line() in
> $biopythonhome/Genbank/Scanner.py.

The fact we allow for the "bp" (or "aa") marker in two places reflects two iterations of the GenBank standard. In theory we could remove the support for the older version but there may be third party tools still producing GenBank files using that style.

> Have Genomatrix made an error in producing these genbank files, or should
> the Biopython routines accommodate these variations?

I presume Genomatrix have made an error - try emailing them for clarification. The GenBank file format for the LOCUS line is very explicit and uses very precise column positions for the fields.
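To make the column issue concrete, here is a rough sketch, not the Biopython parser itself. It checks for a " bp " or " aa " marker at the two slices quoted above, then falls back to splitting on whitespace to recover only the identifier and length. The function names and the padded example line are assumptions made for illustration.

```python
def marker_in_expected_column(line):
    """True if a ' bp ' or ' aa ' marker sits at either quoted slice."""
    markers = (" bp ", " aa ")
    return line[29:33] in markers or line[40:44] in markers


def locus_by_whitespace(line):
    """Guess (locus_id, length) by splitting on whitespace.

    This ignores every other LOCUS field and is only a fallback.
    """
    parts = line.split()
    if len(parts) < 2 or parts[0] != "LOCUS":
        raise ValueError("Not a LOCUS line: %r" % line)
    length = None
    # Look for a number directly followed by a 'bp' or 'aa' token.
    for i in range(3, len(parts)):
        if parts[i] in ("bp", "aa") and parts[i - 1].isdigit():
            length = int(parts[i - 1])
            break
    return parts[1], length


# A line padded so that ' bp ' lands at [40:44] (assumed layout).
aligned = ("LOCUS".ljust(12) + "GXP_4216".ljust(16) + " "
           + "601".rjust(11) + " bp    DNA")
# A line with single spaces only, so the marker is out of position.
shifted = "LOCUS GXP_1485624 601 bp DNA"

for example in (aligned, shifted):
    print(marker_in_expected_column(example), locus_by_whitespace(example))
```

The whitespace fallback only works here because the identifier and length are both present; with optional fields missing it would have to guess.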
In theory we could try parsing ambiguous files using spaces to split up the fields, but as many of the fields are optional, this isn't generally possible without a little guess work. > Some lines just give warnings and plough on, but others report that > there isn't a space in exactly the right place and fail to read the record > at all. I'm having to hack the genbank file as we speak... I suspect that they (Genomatrix) are inserting a large locus identifier into the beginning of the LOCUS line which is sometimes bigger than the allocated slot, pushing the rest of the fields out of position in some of the files. I'd need to see several examples to be confident about this guess. If you don't actually need much information from the LOCUS line, you might find it easier to hack our parser to be a little more tolerant - I would suggest simply pulling out the locus ID, ignoring the rest of the LOCUS line, and printing a warning. Peter P.S. Which version of Biopython are you using? Biopython 1.48 onwards is a little less fussy than Biopython 1.47 in order to accept GenBank files produced by EMBOSS seqret. From pzs at dcs.gla.ac.uk Thu Dec 18 10:25:54 2008 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 18 Dec 2008 15:25:54 +0000 Subject: [BioPython] Genbank LOCUS line slightly misaligned In-Reply-To: <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com> References: <494A545F.2020307@dcs.gla.ac.uk> <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com> Message-ID: <494A6B82.4050906@dcs.gla.ac.uk> Thanks for your prompt reply. Peter wrote: > I suspect that they (Genomatrix) are inserting a large locus > identifier into the beginning of the LOCUS line which is sometimes > bigger than the allocated slot, pushing the rest of the fields out of > position in some of the files. I'd need to see several examples to be > confident about this guess. > That sounds about right. 
Here's a sample: $ grep LOCUS skurukutipromo.gb | head LOCUS GXP_4216 601 bp DNA LOCUS GXP_4217 601 bp DNA LOCUS GXP_4220 601 bp DNA LOCUS GXP_4226 603 bp DNA LOCUS GXP_1485624 601 bp DNA LOCUS GXP_1485625 601 bp DNA LOCUS GXP_4230 601 bp DNA LOCUS GXP_4253 640 bp DNA LOCUS GXP_648168 662 bp DNA LOCUS GXP_4281 601 bp DNA It's a bit careless on their part, but who listens to standards anyway? ;) > If you don't actually need much information from the LOCUS line, you > might find it easier to hack our parser to be a little more tolerant - > I would suggest simply pulling out the locus ID, ignoring the rest of > the LOCUS line, and printing a warning. > I already did a regex on the file itself to excise everything after the locus id, which put an end to the complaints. I'm also finding I have to manually parse the description entry, which comes out in one big lump like this: 'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo sapiens|chr=19|ctg=NC_000019|str=(-)| start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold' Has some other formatting error prevented biopython from breaking this up for me, or is this the expected behaviour? I'm using biopython1.49. It's not a big deal, I was just wondering. Cheers, Peter From biopython at maubp.freeserve.co.uk Thu Dec 18 11:01:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Dec 2008 16:01:10 +0000 Subject: [BioPython] Genbank LOCUS line slightly misaligned In-Reply-To: <494A6B82.4050906@dcs.gla.ac.uk> References: <494A545F.2020307@dcs.gla.ac.uk> <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com> <494A6B82.4050906@dcs.gla.ac.uk> Message-ID: <320fb6e00812180801u3bb2d31chddb44dae7502c2f4@mail.gmail.com> On Thu, Dec 18, 2008 at 3:25 PM, Peter Saffrey wrote: > Thanks for your prompt reply. 
> > Peter wrote: >> >> I suspect that they (Genomatrix) are inserting a large locus >> identifier into the beginning of the LOCUS line which is sometimes >> bigger than the allocated slot, pushing the rest of the fields out of >> position in some of the files. I'd need to see several examples to be >> confident about this guess. >> > > That sounds about right. Here's a sample: > > $ grep LOCUS skurukutipromo.gb | head > LOCUS GXP_4216 601 bp DNA > LOCUS GXP_4217 601 bp DNA > LOCUS GXP_4220 601 bp DNA > LOCUS GXP_4226 603 bp DNA > LOCUS GXP_1485624 601 bp DNA > LOCUS GXP_1485625 601 bp DNA > LOCUS GXP_4230 601 bp DNA > LOCUS GXP_4253 640 bp DNA > LOCUS GXP_648168 662 bp DNA > LOCUS GXP_4281 601 bp DNA > > It's a bit careless on their part, but who listens to standards anyway? ;) Writing general output to GenBank format is tricky if you have long record identifiers. >> If you don't actually need much information from the LOCUS line, you >> might find it easier to hack our parser to be a little more tolerant - >> I would suggest simply pulling out the locus ID, ignoring the rest of >> the LOCUS line, and printing a warning. > > I already did a regex on the file itself to excise everything after the > locus id, which put an end to the complaints. If you're happy, that's fine. > I'm also finding I have to manually parse the description entry, which comes > out in one big lump like this: > > 'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo > sapiens|chr=19|ctg=NC_000019|str=(-)| > start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA FLJ34771 > fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold' What did the DEFINITION lines look like? Its usually just a long string like "species name, complete genome" spanning one or more lines. Here I'm guessing Genomatrix are sticking a whole load of meta data into this field using their own convention. 
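If the field really does follow a key=value|key=value convention, as the quoted sample suggests, it can be split with a few lines of plain Python. This is only a sketch: the function name is hypothetical, and it assumes no value ever contains a "|" character, which the one sample alone cannot confirm.

```python
def split_description(description):
    """Hypothetical helper: turn 'k1=v1|k2=v2' into a dict."""
    fields = {}
    for chunk in description.split("|"):
        key, sep, value = chunk.strip().partition("=")
        if sep:  # skip chunks with no '=' at all
            fields[key.strip()] = value.strip()
    return fields


# The description string quoted earlier in this thread.
sample = (
    "loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo "
    "sapiens|chr=19|ctg=NC_000019|str=(-)| "
    "start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA "
    "FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold"
)

fields = split_description(sample)
print(fields["acc"], fields["spec"], int(fields["len"]))
```

All values come back as strings, so numeric fields such as len or start need an explicit int() where they are used.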
This is a bit odd, but I think I've also seen similar extra data dumped into the COMMENT lines by other programs.

> Has some other formatting error prevented biopython from breaking this up
> for me, or is this the expected behaviour? I'm using biopython1.49. It's not
> a big deal, I was just wondering.

I think that's the expected behaviour; the DEFINITION lines become the record's description property (a simple string).

Peter

From biopython.chen at gmail.com Mon Dec 22 11:38:45 2008
From: biopython.chen at gmail.com (Chandan Kumar)
Date: Mon, 22 Dec 2008 08:38:45 -0800
Subject: [BioPython] help for local alignment
Message-ID: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>

Dear all,
can any one provide me simple code for local alignment python code which can be applied for protein or nucleotide sequence. Please provide me the simplest code as I am new to python and from biology background.

Thanking you.

Kind regards
Chen

From biopython at maubp.freeserve.co.uk Mon Dec 22 12:47:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Dec 2008 17:47:16 +0000
Subject: [BioPython] help for local alignment
In-Reply-To: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>
References: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>
Message-ID: <320fb6e00812220947xd9444ffp636c13c684fda2c4@mail.gmail.com>

On Mon, Dec 22, 2008 at 4:38 PM, Chandan Kumar wrote:
> Dear all,
> can any one provide me simple code for local alignment
> python code which can be applied for protein or nucleotide sequence. Please
> provide me the simplest code as I am new to python and from biology
> background.
>
> Thanking you.
>
> Kind regards
> Chen

Hi Chen,

Are you wanting to do pairwise alignments (aligning two sequences to each other), or multiple sequence alignments?

For multiple sequence alignments, you might want to use a 3rd party tool like ClustalW, or MUSCLE. Biopython can parse several alignment formats including ClustalW format.
See our tutorial for examples using ClustalW. Biopython's Bio.pairwise2 can do pairwise alignments, although we only have the built in documentation for this at the moment (nothing in our tutorial). This documentation is also available online: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html For pairwise sequence alignments I personally use the EMBOSS tools "water" (Smith-Waterman algorithm for local alignment) or "needle" (Needleman-Wunsch for global alignment). Biopython's Bio.AlignIO module can parse their output. Peter From bala.biophysics at gmail.com Mon Dec 29 16:37:31 2008 From: bala.biophysics at gmail.com (Bala subramanian) Date: Mon, 29 Dec 2008 22:37:31 +0100 Subject: [BioPython] error in writting pdb file Message-ID: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> Dear Friends, When i try to write a pdb file with PDBIO, i get the following error. What could be the possible reason for the same. >>> out=PDBIO() >>> out.set_structure(s) >>> out.save("new.pdb") Traceback (most recent call last): File "", line 1, in File "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", line 150, in save File "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", line 84, in _get_atom_line TypeError: %c requires int or char Thanks in advance, Bala From biopython at maubp.freeserve.co.uk Mon Dec 29 18:24:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Dec 2008 23:24:46 +0000 Subject: [BioPython] error in writting pdb file In-Reply-To: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> Message-ID: <320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com> On Mon, Dec 29, 2008 at 9:37 PM, Bala subramanian wrote: > Dear Friends, > > When i try to write a pdb file with PDBIO, i get the following error. What > could be the possible reason for the same. 
> >>>> out=PDBIO() >>>> out.set_structure(s) >>>> out.save("new.pdb") > Traceback (most recent call last): > File "", line 1, in > File > "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", > line 150, in save > File > "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", > line 84, in _get_atom_line > TypeError: %c requires int or char > > Thanks in advance, > Bala Something in one of your atom objects isn't as expected. The _get_atom_line code is trying to construct a string for an atom line for the PDB file, but one of the strong formatting arguments isn't setup right (the TypeError about %c). Without seeing how you constructed the structure (variable s in your code) its hard to guess what is wrong. Maybe one of the required properties is set to None? Peter From srini_iyyer_bio at yahoo.com Mon Dec 29 19:39:15 2008 From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer) Date: Mon, 29 Dec 2008 16:39:15 -0800 (PST) Subject: [BioPython] blastcl3 Message-ID: <113203.41517.qm@web38105.mail.mud.yahoo.com> Dear Group, I am using netblast blastcl3 to blast my small fasta sequences to human genome. blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out Above is my command. I want to be able to parse the output which is a text based format. I used this: from Bio.Blast import NCBIWWW import Bio.Blast.Record blast_out = open('test.out','r') parser = NCBIWWW.BlastParser() blastRecord = parser.parse(blast_out) I hit error and is reported below. Instad I did the following: from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML fasta_string = open("test.fa").read() result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string) blast_records = NCBIXML.parse(result_handle) blast_records = list(blast_records) Treaceback (most recent call last): File"", line 1, in StopIteration Instead: if I say : for item in blast_records: print i I get IndexError: list index out of range. what should I do? 
could any one help me please. thanks Srini Error for :blastRecord = parser.parse(blast_out) >>> blastRecord = parser.parse(blast_out) Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 51, in parse self._scanner.feed(handle, self._consumer) File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 103, in feed has_re=re.compile(r'.?BLAST')) File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 334, in read_and_call_until line = safe_readline(uhandle) File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 410, in safe_readline raise ValueError, "Unexpected end of stream." ValueError: Unexpected end of stream. From chapmanb at 50mail.com Mon Dec 29 20:09:29 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 29 Dec 2008 20:09:29 -0500 Subject: [BioPython] blastcl3 In-Reply-To: <113203.41517.qm@web38105.mail.mud.yahoo.com> References: <113203.41517.qm@web38105.mail.mud.yahoo.com> Message-ID: <20081230010929.GA57412@kunkel> Hi Srini; > I am using netblast blastcl3 to blast my small fasta sequences to human genome. > blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out > > Above is my command. I want to be able to parse the output which is a > text based format. My first suggestion if you want to parse BLAST is to use the XML output. Based on the NCBI documentation here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/netblast.html it appears as if the parameter you want is '-m 7'. XML output is much more stable, and details on parsing it in Biopython are here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56 The error you report below makes it seem as if the output file is empty but it is a bit tough to say. If parsing the XML output does not work, you might want to double check the 'test.out' file to be sure it looks decent, and if so attach it here so we can help more. 
Hope this helps, Brad > I used this: > from Bio.Blast import NCBIWWW > import Bio.Blast.Record > blast_out = open('test.out','r') > parser = NCBIWWW.BlastParser() > blastRecord = parser.parse(blast_out) > > I hit error and is reported below. > > Instad I did the following: > > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > fasta_string = open("test.fa").read() > result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string) > blast_records = NCBIXML.parse(result_handle) > blast_records = list(blast_records) > Treaceback (most recent call last): > File"", line 1, in > StopIteration > > Instead: > > if I say : > for item in blast_records: > print i > > I get IndexError: list index out of range. > > what should I do? > could any one help me please. > thanks > Srini > > > > > > > > > > > Error for :blastRecord = parser.parse(blast_out) > > > >>> blastRecord = parser.parse(blast_out) > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 51, in parse > self._scanner.feed(handle, self._consumer) > File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 103, in feed > has_re=re.compile(r'.?BLAST')) > File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 334, in read_and_call_until > line = safe_readline(uhandle) > File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 410, in safe_readline > raise ValueError, "Unexpected end of stream." > ValueError: Unexpected end of stream. 
> > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Tue Dec 30 11:55:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Dec 2008 16:55:32 +0000 Subject: [BioPython] error in writting pdb file In-Reply-To: <288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com> References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> <320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com> <288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com> Message-ID: <320fb6e00812300855o75e17ab9wcd62c8387f20629f@mail.gmail.com> On Tue, Dec 30, 2008 at 5:47 AM, Bala subramanian wrote: > Peter, > Here is the small code i where i try to renumber the residues. > > Python 2.5.2 >>>> from Bio.PDB import PDBParser >>>> from Bio.PDB import PDBIO >>>> par=PDBParser() >>>> S=par.get_structure('cef','1CE4.pdb') >>>> seq=range(100,134+1) >>>> i=0 >>>> for residues in S.get_residues(): > ... residues.id=('',seq[i],'') > ... i += 1 > ... >>>> out=PDBIO() >>>> out.set_structure(S) >>>> out.save("new.pdb") > Traceback (most recent call last): > File "", line 1, in > File > "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", > line 150, in save > File > "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", > line 84, in _get_atom_line > TypeError: %c requires int or char For this example, the copy of 1CE4.pdb I just downloaded seems to have 700 residues - but you only created a list of 35 new identifiers. This mean the code above fails for me with an index error - easy to fix but I'm not 100% sure how you want to renumber the residues. As to the TypeError, I think the problem is you are setting the first and last parts of the ID to empty string. 
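The error message itself can be reproduced with plain string formatting, which supports the guess above. This is only an illustration: the format string below is a stand-in, not the one used in PDBIO.py, and format_flag is a made-up name.

```python
def format_flag(flag):
    """Format a one-character field with %c (stand-in format string)."""
    return "[%c]" % flag


# A single space is a valid one-character value.
print(format_flag(" "))

# An empty string is not a character, so %c refuses it.
try:
    format_flag("")
except TypeError as err:
    print("TypeError:", err)
```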
Try using a single space instead - how about:

for index, residue in enumerate(S.get_residues()) :
    residue.id = (" ", index+100, " ") #Note quoted spaces!

Notice I'm using the python enumerate function, which means index counts from 0, 1, 2, ... and I then use this to calculate the new identifier by adding 100. You may want to do something differently.

Peter

From biopython at maubp.freeserve.co.uk Tue Dec 30 12:42:14 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Dec 2008 17:42:14 +0000
Subject: [BioPython] blastcl3
In-Reply-To: <113203.41517.qm@web38105.mail.mud.yahoo.com>
References: <113203.41517.qm@web38105.mail.mud.yahoo.com>
Message-ID: <320fb6e00812300942j1943e059j5cae6fea4c9c3de@mail.gmail.com>

On Tue, Dec 30, 2008 at 12:39 AM, Srinivas Iyyer wrote:
> Dear Group,
> I am using netblast blastcl3 to blast my small fasta sequences to human genome.
>
> blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out
>
> Above is my command. I want to be able to parse the output which is a text based format.

I would urge you to tell blast to produce XML output as already described by Brad. Just to clarify:

Bio.Blast.NCBIXML includes our XML blast parser (recommended)
Bio.Blast.NCBIStandalone includes our plain text parser (discouraged)
Bio.Blast.NCBIWWW includes our deprecated HTML blast parser

The module naming reflects the historical introduction of the different BLAST tools, and is unfortunately a little misleading nowadays since both the standalone command line tool and the website can produce XML, plain text or HTML output.

> I used this:
> from Bio.Blast import NCBIWWW
> import Bio.Blast.Record
> blast_out = open('test.out','r')
> parser = NCBIWWW.BlastParser()
> blastRecord = parser.parse(blast_out)

The above code will try and parse HTML (web page) format BLAST output - but you said test.out should be in plain text format, so this won't work.
If you really want to use the plain text format, try the parser in Bio.Blast.NCBIStandalone - but it doesn't work 100% on the output from the latest version of the BLAST standalone tools. > Instad I did the following: > > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > fasta_string = open("test.fa").read() > result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string) This function runs BLAST over the internet, and it should default to XML format. You can override using the format_type argument as described in the docstring or the tutorial. You should be able to parse it using Bio.Blast.NCBIXML as you tried... However, I would assume that "gpipe/9606/all_contig" is a local database on your machine, so there is no way the NCBI's servers can use it. If you examine the results by hand it will probably be an error message, try this: print result_handle.read() Peter From biopython at maubp.freeserve.co.uk Tue Dec 30 13:00:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Dec 2008 18:00:43 +0000 Subject: [BioPython] error in writting pdb file In-Reply-To: <288df32a0812300944l743743ceo79079b697e37ef29@mail.gmail.com> References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> <320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com> <288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com> <320fb6e00812300855o75e17ab9wcd62c8387f20629f@mail.gmail.com> <288df32a0812300944l743743ceo79079b697e37ef29@mail.gmail.com> Message-ID: <320fb6e00812301000w4a6a5557nc473baaf0e58bcbc@mail.gmail.com> On Tue, Dec 30, 2008 at 5:44 PM, Bala subramanian wrote: > Dear Peter, > > Actually 1Ce4.pdb is a NMR structure and i just did the renumbering on one > model extraced from it. That would explain why you had less residues. > Now the script work fine after adjusting the quoted space. Thank you very much. Good. I'm glad we could solve this so quickly. 
> Could you please suggest me some good tutorials for Bio.PDB
>
> Bala

If you haven't already done so, please see
http://biopython.org/wiki/Documentation

First of all there is a whole chapter in the main Biopython Tutorial,
included with the Biopython source code archives, and also available
online:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html
http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf

Then there is also a separate document, which goes into more detail:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

There are also a few other examples elsewhere online, try Google.

Peter

P.S. Please CC the mailing list on your replies, so that the
discussion is open, and archived for future readers.

From sudhir.cr at gmail.com Wed Dec 31 02:49:48 2008
From: sudhir.cr at gmail.com (sudhir cr)
Date: Wed, 31 Dec 2008 02:49:48 -0500
Subject: [BioPython] How to use Bio.Kegg.Compound Module
Message-ID:

Hello,

I am a newbie to python.

Can anyone please tell me how to use the Bio.Kegg.Compound Module to
get the DBLinks from a KEGG Compound file?

Thanks in advance
Sudhir

From biopython at maubp.freeserve.co.uk Wed Dec 31 09:43:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 31 Dec 2008 14:43:39 +0000
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To:
References:
Message-ID: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>

On Wed, Dec 31, 2008 at 7:49 AM, sudhir cr wrote:
> Hello,
>
> I am a newbie to python.
>
> Can anyone please tell me how to use the Bio.Kegg.Compound Module to get the
> DBLinks from a KEGG Compound file?
>
> Thanks in advance
> Sudhir

Looking at the code, we do need to add some more to the KEGG docstrings.
However, I think you want to do something like this:

from Bio.KEGG import Compound
handle = open("my_kegg_file.txt")
for record in Compound.parse(handle) :
    print record.entry
    for database, links in record.dblinks :
        print database, links
handle.close()

Peter

From sudhir.cr at gmail.com Wed Dec 31 10:07:21 2008
From: sudhir.cr at gmail.com (sudhir cr)
Date: Wed, 31 Dec 2008 10:07:21 -0500
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
Message-ID:

Hello Peter,

Thanks for the quick reply. This code is working great.

P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS"

Thanks a lot,
Have a great New Year - 2009
Sudhir

On Wed, Dec 31, 2008 at 9:43 AM, Peter wrote:

> On Wed, Dec 31, 2008 at 7:49 AM, sudhir cr wrote:
> > Hello,
> >
> > I am a newbie to python.
> >
> > Can anyone please tell me how to use the Bio.Kegg.Compound Module to get
> the
> > DBLinks from a KEGG Compound file?
> >
> > Thanks in advance
> > Sudhir
>
> Looking at the code, we do need to add some more to the KEGG
> docstrings. However, I think you want to do something like this:
>
> from Bio.KEGG import Compound
> handle = open("my_kegg_file.txt")
> for record in Compound.parse(handle) :
>     print record.entry
>     for database, links in record.dblinks :
>         print database, links
> handle.close()
>
> Peter
>

From biopython at maubp.freeserve.co.uk Wed Dec 31 10:17:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 31 Dec 2008 15:17:39 +0000
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To:
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
Message-ID: <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>

On Wed, Dec 31, 2008 at 3:07 PM, sudhir cr wrote:
> Hello Peter,
>
> Thanks for the quick reply. This code is working great.

Great.
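
For readers who want to see roughly what a DBLINKS block looks like
underneath, here is a minimal Python 3 sketch using only the standard
library. It is not the Bio.KEGG parser. It assumes the usual KEGG flat
file layout (field name in the first 12 columns, continuation lines
indented by 12 spaces, record closed by "///"), and both the
parse_dblinks helper and the sample record are made up for
illustration.

```python
def parse_dblinks(lines):
    """Collect (database, [ids]) pairs from the DBLINKS block of one record.

    Assumption: field names occupy the first 12 columns and continuation
    lines leave those columns blank.
    """
    links = []
    in_block = False
    for line in lines:
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            # A non-blank key column starts a new field
            in_block = (key == "DBLINKS")
        if in_block and value:
            database, _, ids = value.partition(":")
            links.append((database.strip(), ids.split()))
    return links


# Hypothetical record; the databases and identifiers are placeholders
sample = [
    "ENTRY       C99999                      Compound",
    "NAME        Example compound",
    "DBLINKS     ExampleDB: 12345",
    "            OtherDB: A1 B2",
    "///",
]

dblinks = parse_dblinks(sample)
for database, ids in dblinks:
    print(database, ids)
```

If the field name in a given file differs, only the key comparison in
the helper would need to change.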
> P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS" Do you have a link for this? If we need to update our parser could you file a bug on Bugzilla please? http://bugzilla.open-bio.org/ Thanks, Peter From alexl at users.sourceforge.net Mon Dec 1 02:59:40 2008 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Sun, 30 Nov 2008 19:59:40 -0700 Subject: [BioPython] will BioSQL work with psycopg2? Message-ID: Hi there, I am maintaining the Fedora package for Biopython and we are doing a complete rebuild of all Python packages for Python 2.6. Currently I have a dependency on psycopg (version 1.1.21) but since that is so old pyscopg won't rebuild against the new mx, meaning that I can't rebuild Biopython because the dependencies aren't there. So my question is, will the Biopython BioSQL work with the newer psycopg2 (currently version 2.0.8)? See: http://www.initd.org/pub/software/psycopg/ Does it require the 1.x API or will it work with 2.x? The BioSQL page: http://biopython.org/wiki/BioSQL isn't clear on this. Thanks, Alex From biopython at maubp.freeserve.co.uk Mon Dec 1 10:22:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Dec 2008 10:22:27 +0000 Subject: [BioPython] will BioSQL work with psycopg2? In-Reply-To: References: Message-ID: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com> On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote: > Hi there, > > I am maintaining the Fedora package for Biopython and we are doing a > complete rebuild of all Python packages for Python 2.6. Excellent. Biopython 1.49 onwards should be fine on Python 2.6, but please do let us know if you find anything amiss or any deprecations we've missed. Slightly off topic, but If you want any clarification on things like the deprecation of Martel (with its dependency on mxTextTools), or the switch from Numeric to NumPy please ask. 
> Currently I have a dependency on psycopg (version 1.1.21) but since > that is so old pyscopg won't rebuild against the new mx, meaning that > I can't rebuild Biopython because the dependencies aren't there. > > So my question is, will the Biopython BioSQL work with the newer > psycopg2 (currently version 2.0.8)? See: > http://www.initd.org/pub/software/psycopg/ Yes, psycopg2 should work with Biopython 1.49 onwards (including Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616: http://bugzilla.open-bio.org/show_bug.cgi?id=2616 > Does it require the 1.x API or will it work with 2.x? The BioSQL page: > http://biopython.org/wiki/BioSQL > isn't clear on this. I'm not sure, having not used psycopg or psycopg2 myself. Hopefully Cymon can clarify this (CC'd). Peter From lueck at ipk-gatersleben.de Mon Dec 1 13:13:21 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 1 Dec 2008 14:13:21 +0100 Subject: [BioPython] Emboss eprimer3-Product Size Range Message-ID: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> Hi! I'm working with Emboss eprimer3 and I have a short question: How can I enter the paramter "product size range"? Usually it's something like 500-1000 450-500 etc. If I add this into python, nothing happens: from Bio import Fasta from Bio.Emboss.Applications import Primer3Commandline from Bio.Application import generic_run from Bio.Emboss import Primer3 primer_cl = Primer3Commandline() primer_cl.set_parameter("-sequence", "in.txt") primer_cl.set_parameter("-outfile", "out.pr3") primer_cl.set_parameter("-productsizerange", "100-200") primer_cl.set_parameter("-target", "%s,%s" % (50, 100)) result, messages, errors = generic_run(primer_cl) Whereas primer_cl.set_parameter("-productsizerange", "100") gives a primer output. Thanks in advance! 
Stefanie

From biopython at maubp.freeserve.co.uk Mon Dec 1 15:08:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Dec 2008 15:08:51 +0000
Subject: [BioPython] Emboss eprimer3-Product Size Range
In-Reply-To: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de>
References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com>

On Mon, Dec 1, 2008 at 1:13 PM, Stefanie Lück wrote:
> Hi!
>
> I'm working with Emboss eprimer3 and I have a short question:
>
> How can I enter the paramter "product size range"? Usually it's
> something like 500-1000 450-500 etc.

According to the eprimer3 tool itself (try "eprimer3 --help" at the
command line) you seem to be using productsizerange correctly in the
following code snippet.

> from Bio import Fasta

You don't seem to be using Bio.Fasta, and this module is now obsolete.

> from Bio.Emboss.Applications import Primer3Commandline
> from Bio.Application import generic_run
> from Bio.Emboss import Primer3
>
> primer_cl = Primer3Commandline()
> primer_cl.set_parameter("-sequence", "in.txt")
> primer_cl.set_parameter("-outfile", "out.pr3")
> primer_cl.set_parameter("-productsizerange", "100-200")
> primer_cl.set_parameter("-target", "%s,%s" % (50, 100))
> result, messages, errors = generic_run(primer_cl)

So this fails - what exactly goes wrong, i.e. what does this give:

print "Command line:"
print primer_cl
print "Return code:"
print result.return_code
print "Errors:"
print errors.read()
print "Messages:"
print messages.read()

And my usual question in this sort of situation: what happens if you
run this command by hand at the command prompt?

If you email me your input file (in.txt) then I can try running this
on my machine (don't send it to the mailing list unless it's very
small and you don't mind sharing it with the world).
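
The same kind of check can be sketched with nothing but the Python 3
standard library, independent of Bio.Application. This is only an
illustration: run_and_report is a made-up helper name, and the demo
command is a stand-in (a tiny Python child process that writes to
stderr and exits with status 1) so that the sketch runs without EMBOSS
installed.

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command and return (return_code, stdout, stderr) as text."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in for the real tool: prints an error and exits with status 1
demo = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Error: too many parameters\\n'); sys.exit(1)",
]

code, out, err = run_and_report(demo)
print("Command line:", " ".join(demo))
print("Return code:", code)
print("Errors:", err.strip())
print("Messages:", out.strip())
```

Because the arguments are passed as a list, no shell is involved and
each list element reaches the program as exactly one argument.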
Peter From lueck at ipk-gatersleben.de Mon Dec 1 16:31:22 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 1 Dec 2008 17:31:22 +0100 Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> Message-ID: <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> I'm sorry! I prescribed me! primer_cl.set_parameter("-productsizerange", "100-200 250-300") Causes no output and not primer_cl.set_parameter("-productsizerange", "100-200") as I wrote! >You don't seem to be using Bio.Fasta, and this module is now obsolete. Sorry I just forgot to deleted. I showed only a part of the code. The error print and on the command prompt gives: Command line: eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 -productsizerange 100-200 250-300 Return code: 1 Errors: Error: Argument '250-300' : Too many parameters 3/2 Messages: But it's described in the eprimer3 help file: "-productsizerange range [100-300] ...If one desires PCR products in either the range from 100 to 150 bases or in the range from 200 to 250 bases then one would set this parameter to 100-150 200-250. EPrimer3 favors ranges to the left side of the parameter string..." My input file is a normal multiple fasta file. Thanks and sorry for the mistake! ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Monday, December 01, 2008 4:08 PM Subject: Re: [BioPython] Emboss eprimer3-Product Size Range On Mon, Dec 1, 2008 at 1:13 PM, Stefanie L?ck wrote: > Hi! > > I'm working with Emboss eprimer3 and I have a short question: > > How can I enter the paramter "product size range"? Usually it's > something like 500-1000 450-500 etc. According the the eprimer3 tool itself (try "eprimer3 --help" at the command line) you seem to be using productsizerange correctly in the following code snippet. 
> from Bio import Fasta You don't seem to be using Bio.Fasta, and this module is now obsolete. > from Bio.Emboss.Applications import Primer3Commandline > from Bio.Application import generic_run > from Bio.Emboss import Primer3 > > primer_cl = Primer3Commandline() > primer_cl.set_parameter("-sequence", "in.txt") > primer_cl.set_parameter("-outfile", "out.pr3") > primer_cl.set_parameter("-productsizerange", "100-200") > primer_cl.set_parameter("-target", "%s,%s" % (50, 100)) > result, messages, errors = generic_run(primer_cl) So this fails - what what exactly goes wrong i.e. what does this give: print "Command line:" print primer_cl print "Return code:" print result.return_code print "Errors:" print errors.read() print "Messages": print messages.read() And my usual question in this sort of situation: what happens if you run this command by hand at the command prompt? If you email me your input file (in.txt) then I can try running this on my machine (don't send it to the mailing list unless its very small and you don't mind sharing it with the world). Peter From biopython at maubp.freeserve.co.uk Mon Dec 1 17:07:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Dec 2008 17:07:51 +0000 Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range In-Reply-To: <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com> > primer_cl.set_parameter("-productsizerange", "100-200 250-300") > Causes no output and not > primer_cl.set_parameter("-productsizerange", "100-200") > as I wrote! 
OK - that helps :)

This will fail at the command line:

eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100
-productsizerange 100-200 250-300

Based on my experience of unix command lines and how arguments are
parsed, this should work:

eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100
-productsizerange "100-200 250-300"

If so, then in python you need to include the quotes yourself, e.g.

primer_cl.set_parameter('-productsizerange', '"100-200 250-300"')

That is single quotes to delimit the string in python, with double
quotes as part of the string itself. You could also use double quotes
by escaping them with a backslash:

primer_cl.set_parameter("-productsizerange", "\"100-200 250-300\"")

To try and explain the python syntax here, try the following examples
at the python prompt:

>>> print "100-200 250-300"
100-200 250-300
>>> print '100-200 250-300'
100-200 250-300
>>> print '"100-200 250-300"'
"100-200 250-300"
>>> print "\"100-200 250-300\""
"100-200 250-300"

Peter

From cy at cymon.org Mon Dec 1 20:53:06 2008
From: cy at cymon.org (Cymon Cox)
Date: Mon, 1 Dec 2008 20:53:06 +0000
Subject: [BioPython] will BioSQL work with psycopg2?
In-Reply-To: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com>
References: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com>
Message-ID: <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com>

Hi Peter and Alex,

2008/12/1 Peter

> On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote:
> > Currently I have a dependency on psycopg (version 1.1.21) but since
> > that is so old pyscopg won't rebuild against the new mx, meaning that
> > I can't rebuild Biopython because the dependencies aren't there.
> >
> > So my question is, will the Biopython BioSQL work with the newer
> > psycopg2 (currently version 2.0.8)?
See: > > http://www.initd.org/pub/software/psycopg/ > > Yes, psycopg2 should work with Biopython 1.49 onwards (including > Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616: To confirm: I'm currently using psycopg2 vers. 2.0.8 > > Does it require the 1.x API or will it work with 2.x? The BioSQL page: > > http://biopython.org/wiki/BioSQL > > isn't clear on this. > > I'm not sure, having not used psycopg or psycopg2 myself. Hopefully > Cymon can clarify this (CC'd). Sorry, but I'm not sure what the question is here... Cheers, C. -- ____________________________________________________________________ Cymon J. Cox From bartek at rezolwenta.eu.org Mon Dec 1 20:53:59 2008 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 1 Dec 2008 21:53:59 +0100 Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code In-Reply-To: <492ACE38.1090301@gmail.com> References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> <492ACE38.1090301@gmail.com> Message-ID: <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com> Hi all, I've done some work regarding the motif analysis in Biopython. I've done the following stuff: - refactored the Bio.AlignAce and Bio.MEME to use one common motif object - Put all of the refactored code in the Bio.Motif directory - Added more code (from my attic) to do motif comparisons and computing thresholds (this was actually written by my colleague Norbert Dojer, but I adapted it and I have his permission to contribute the code) - written a short tutorial on the usage of Bio.Motif (that's where I'd put it). - Written a basic test suite for the new motif. I haven't added it to cvs yet, but posted it as an attchment to the enhancement proposal in bugzilla: http://bugzilla.open-bio.org/show_bug.cgi?id=2694 I have cvs access, so I can commit the changes myself, but I'd like to wait for an "OK" from someone more involved in the release process. 
Since Giovanni and Bruce have responded to my previous call for comments, I'll try to answer them below: On Mon, Nov 24, 2008 at 4:54 PM, Bruce Southey wrote: > > Actually I am not that thrilled with the licenses for these packages and > similar packages because these are free only for academic use. To me this > clashes with the spirit of an open-sourced project especially a BSD-licensed > one. But if there is a need for such modules then these modules should be > included. > I have similar feelings about the "academic-use-only" licenses. On the other hand, since most of the biopython users are in academia, then I don't see it as a big problem. Also, since I don't have any truly open and free replacement for these programs, I think it's better to keep them. In fact the new Bio.Motif package provides some methods for motif comparisons, which at least to some extent can be used as a replacement for the respective functions of CompareACE and MAST. As a side note, I think that there is no point in providing parsers for every single motif finder that comes out, and I don't think that AlignAce and MEME are the best or the most representative ones. It just happened that these parsers were written "to scratch someone's itch". I think that the other functionality (motif searching, comparisons,weblogo) might be more useful to people. > While it is only free for academic use, have you seen TAMO? > *TAMO: a flexible, object-oriented framework for analyzing transcriptional > regulation using DNA-sequence motifs. * > Bioinformatics. 2005 Jul 15;21(14):3164-5. > > > http://fraenkel.mit.edu/TAMO/ Yes, I've seen it and I've even recommended it on the biopython mailing list when there was no replacement in biopython. However, their library is free only for academia and AFAIK it's not using biopython datastructures, so needs some work to integrate with TAMO if you are using Biopython. Bio.Motif is meant to provide free software for Motif analysis. 
> Well, I am not sure how many used Bio.AlignAce given the Parser.py bug :-) > Based on the CVS, both have been untouched for about three years. > Well, I've not used it myself for a while... I'm no longer doing de-novo motif discovery. However, it still works so it's potentially useful. I think this is largely due to the lack of documentation for the Bio.AlignAce and Bio.MEME tools (partially my fault). Hopefully people will start using this if they read the tutorial. > Also, what species are these used for? > One of the papers of AlignAce indicate that the base composition was set for > yeast. > They're both general purpose, you can set the gc content for alignAce and even an HMM for MEME. > > Personally I would be interested in a general protein motif finding module > because of my current research. However, I do have a different view with > respect to the Biopython community as indicated above with the licenses. Both MEME and AlignAce can be used to find motifs in proteins, but it has not so much to do with Bio.Motif, since it does not provide any motif-finnding capabilities by itself. In general Bio.Motif should be able to deal with protein motifs, but I've never tested it (I'm mostly using it for DNA motifs), so I'll be happy to help if you find bugs. On Mon, Nov 24, 2008 at 4:25 PM, Giovanni Marco Dall'Olio wrote: > > I would just like to tell you that I have tried the TAMO framework you > suggested me, and found it very useful. Yes, I remember, but the problem is with the TAMO license. I think that the Motif object might be still useful since it is free, allows to read motifs from databases like JASPAR to scan sequences and/or compare them with "your" motifs. > I am not using it anymore because I don't need it, but I remember that I liked: > - the methods to represent motifs as matrixes of frequencies/occurrencies etc.. 
done > - the fact that it was easy to create a motif from an alignment of sequences depending on your definition of easy, it's there > - the integration it had with this website: > http://weblogo.berkeley.edu/logo.cgi. done > I would suggest you to provide integration with this other web > service, which enable to plot the difference between two sequence > logos: http://www.twosamplelogo.org/examples.html. This I haven't done yet, but I'll try to provide functionality for that (shouldn't take too long). -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From dalloliogm at gmail.com Mon Dec 1 21:07:08 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 1 Dec 2008 22:07:08 +0100 Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code In-Reply-To: <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com> References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> <492ACE38.1090301@gmail.com> <8b34ec180812011253p28a08a0bv43cd72369062b39b@mail.gmail.com> Message-ID: <5aa3b3570812011307q710cab78q2fbae061f5dd5eff@mail.gmail.com> On Mon, Dec 1, 2008 at 9:53 PM, Bartek Wilczynski wrote: > On Mon, Nov 24, 2008 at 4:25 PM, Giovanni Marco Dall'Olio > wrote: >> >> I would just like to tell you that I have tried the TAMO framework you >> suggested me, and found it very useful. > > Yes, I remember, but the problem is with the TAMO license. I think > that the Motif object might be still > useful since it is free, allows to read motifs from databases like > JASPAR to scan sequences and/or > compare them with "your" motifs. Thanks for all these changes. I remember that I wrote a mail to TAMO's authors when I was using it. They seemed to be interested in integrating the code with biopython, so maybe the license issue could be superated. It's up to you, whether you want to reimplement all the functions they have or not. 
-- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Dec 1 21:09:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Dec 2008 21:09:33 +0000 Subject: [BioPython] will BioSQL work with psycopg2? In-Reply-To: <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com> References: <320fb6e00812010222o5fb87c4as46c40605d87c7cb7@mail.gmail.com> <7265d4f0812011253h1f25b8a0y9a87bbf0d99fd830@mail.gmail.com> Message-ID: <320fb6e00812011309l104fb07as67a77be59778fa54@mail.gmail.com> 2008/12/1 Cymon Cox wrote: > 2008/12/1 Peter > >> On Mon, Dec 1, 2008 at 2:59 AM, Alex Lancaster wrote: >> > Currently I have a dependency on psycopg (version 1.1.21) but since >> > that is so old pyscopg won't rebuild against the new mx, meaning that >> > I can't rebuild Biopython because the dependencies aren't there. >> > >> > So my question is, will the Biopython BioSQL work with the newer >> > psycopg2 (currently version 2.0.8)? See: >> > http://www.initd.org/pub/software/psycopg/ >> >> Yes, psycopg2 should work with Biopython 1.49 onwards (including >> Biopython 1.49 beta) thanks to a patch from Cymon Cox, see Bug 2616: > > To confirm: I'm currently using psycopg2 vers. 2.0.8 > >> > Does it require the 1.x API or will it work with 2.x? The BioSQL page: >> > http://biopython.org/wiki/BioSQL >> > isn't clear on this. >> >> I'm not sure, having not used psycopg or psycopg2 myself. Hopefully >> Cymon can clarify this (CC'd). > > Sorry, but I'm not sure what the question is here... On reflection, I think Alex was asking if Biopython's BioSQL interface would work with or require psycopg 1.x or psycopg 2.x - and the answer to that is as of Biopython 1.49 we support either (but don't require either - you could use MySQL instead of PostgreSQL for example). Older versions of Biopython don't support psycopg 2.x. 
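
Since either driver is acceptable, a packaging or setup script could
simply check which one is importable. The following Python 3 sketch
uses only the standard library; first_available is a made-up helper
name, and the sketch does not show anything BioSQL does internally.

```python
import importlib.util


def first_available(candidates):
    """Return the first module name in candidates that can be imported, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Prefer the newer driver, fall back to the older one
driver = first_available(["psycopg2", "psycopg"])
print("PostgreSQL driver found:", driver)
```

The result is a module name (or None if neither is installed), which
could then be handed to whatever code opens the database connection.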
Peter From lueck at ipk-gatersleben.de Tue Dec 2 07:37:30 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 2 Dec 2008 08:37:30 +0100 Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com> Message-ID: <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de> Thanks Peter, that's was it! Like always you solved my problem ;-) I wish everybody a nice Christmas Time! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Monday, December 01, 2008 6:07 PM Subject: Re: [BioPython] [Correction!] Emboss eprimer3-Product Size Range >> primer_cl.set_parameter("-productsizerange", "100-200 250-300") >> Causes no output and not >> primer_cl.set_parameter("-productsizerange", "100-200") >> as I wrote! > > OK - that helps :) > > This will fail at the command line: > > eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 > -productsizerange 100-200 250-300 > > Based on my experience of unix command lines and how arguments are > parsed, this should work: > > eprimer3 -sequence in.txt -outfile out.pr3 -target 50,100 > -productsizerange "100-200 250-300" > > If so, then in python you need to include the quotes yourself, e.g. > > primer_cl.set_parameter('-productsizerange', '"100-200 250-300"') > > That is single quotes to delimit the string in python, with double > quotes as part of the string itself. 
You could also use double quotes > by escaping them with a slash: > > primer_cl.set_parameter("-productsizerange", "\"100-200 250-300\"") > > To try and explain the python syntax here, try the following examples > at the python prompt: > >>>> print "100-200 250-300" > 100-200 250-300 >>>> print '100-200 250-300' > 100-200 250-300 >>>> print '"100-200 250-300"' > "100-200 250-300" >>>> print "\"100-200 250-300\"" > "100-200 250-300" > > Peter > From biopython at maubp.freeserve.co.uk Tue Dec 2 10:25:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Dec 2008 10:25:47 +0000 Subject: [BioPython] [Correction!] Emboss eprimer3-Product Size Range In-Reply-To: <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de> References: <004001c953b6$95816bd0$1022a8c0@ipkgatersleben.de> <320fb6e00812010708g64046188kd9ea8c142f9f8de4@mail.gmail.com> <006201c953d2$3f6597f0$1022a8c0@ipkgatersleben.de> <320fb6e00812010907t27a5d9d5q6a9a9b3c41971a18@mail.gmail.com> <000c01c95450$d51e9d40$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00812020225m4806fce1p4b1316e9a497c212@mail.gmail.com> On Tue, Dec 2, 2008 at 7:37 AM, Stefanie L?ck wrote: > Thanks Peter, that's was it! > Like always you solved my problem ;-) > I wish everybody a nice Christmas Time! > Stefanie Great! I'm glad to be of help. Peter P.S. Its already snowing here in Scotland! From aloraine at uncc.edu Tue Dec 2 15:48:00 2008 From: aloraine at uncc.edu (Loraine, Ann) Date: Tue, 2 Dec 2008 10:48:00 -0500 Subject: [BioPython] blat (psl) parser? Message-ID: Dear all, Does BioPython include a blat (psl format) parser? If yes, I would be grateful for pointers to documentation or tutorials describing how to use it. I appreciate your help! 
-Ann Loraine From dalloliogm at gmail.com Wed Dec 3 19:03:16 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 3 Dec 2008 20:03:16 +0100 Subject: [BioPython] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <5aa3b3570812031103m53050429lf3d517ccf6142bd7@mail.gmail.com> On 10/23/08, Giovanni Marco Dall'Olio wrote: > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. Just in case someone else interested. I have found this plugin for elixir (which is an extension for sqlalchemy itself) which does version control and seems very easy to use. - http://elixir.ematia.de/apidocs/elixir.ext.versioned.html - http://elixir.ematia.de/trac/browser/elixir/trunk/tests/test_versioning.py It has things like automated versioning and reverting, but id doesn't seem to have commit messages. Of course it doesn't seem feasible to use it on a very big database, but it is good to know it exists. The three of them, elixir, sqlalchemy, and this plugin, seems very useful instruments to anyone wishing to use database, in my opinion :). > Do you know how to do this with databases? Does MySQL provide support for > revision control? > Thanks :) > > (sorry for cross-posting :( ) > > -- > ----------------------------------------------------------- > > My Blog on Bioinformatics (italian): http://bioinfoblog.it > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From pmmagic at gmail.com Thu Dec 4 05:56:32 2008 From: pmmagic at gmail.com (paul m) Date: Thu, 4 Dec 2008 00:56:32 -0500 Subject: [BioPython] blat (psl) parser? 
In-Reply-To: References: Message-ID: <991e7bc10812032156v3de1913bj3051d0f7a8435870@mail.gmail.com> Ann, I don't think BioPython has a blat parser (or at least it didn't last time I looked), but I've written one that I use. Nothing fancy but it works. I'd be happy to send it to you via email. Cheers, Paul On Tue, Dec 2, 2008 at 10:48 AM, Loraine, Ann wrote: > Dear all, > > Does BioPython include a blat (psl format) parser? > > If yes, I would be grateful for pointers to documentation or tutorials describing how to use it. > > I appreciate your help! > > -Ann Loraine > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From anaryin at gmail.com Thu Dec 4 17:49:53 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 4 Dec 2008 18:49:53 +0100 Subject: [BioPython] AlignIO: Sequences of different length Message-ID: Hello all! I'm running BioPython 1.49 in my Linux machine and it's been working rather fine until now. I'm submitting two sequences for a pairwise alignment, using the EMBOSS webservices. The results I get are good and the file is nicely formatted, so there is no problem with the needle output (check here: http://pastebin.com/m12ab3b2b ). However, the format it comes is not handy for what I want to do next, so I thought of using Biopython to convert the alignment format into something more useful, such as pir or fasta. And that's when I hit a problem. 
The code I'm running is here: http://pastebin.com/m509fd88f When executed, it gives me this error: Traceback (most recent call last): File "needle.py", line 50, in alignments = AlignIO.read(open('alignment.results'), "emboss") File "/usr/lib/python2.5/site-packages/PIL/__init__.py", line 375, in read File "/home/joao/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/AlignIO/EmbossIO.py", line 197, in next SyntaxError: Error parsing alignment - sequences of different length? Which is, to say the least, weird. First, the PIL __init__.py it calls is completely empty. Then, the second thing he mentions is the file on my Desktop folder, which doesn't exist anymore. Third, if I use AlignIO.parse() instead of read(), it runs ok. But as soon as I try to actually _do_ something with it, it gives me this very same error. So, is this a bug or is it me and my nasty coding abilities :) ? Thanks in advance! Jo?o Rodrigues http://doeidoei.wordpress.com Utrecht University Netherlands From biopython at maubp.freeserve.co.uk Thu Dec 4 17:56:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 17:56:27 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: Message-ID: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> On Thu, Dec 4, 2008 at 5:49 PM, Jo?o Rodrigues wrote: > Hello all! I'm running BioPython 1.49 in my Linux machine and it's been > working rather fine until now. > > I'm submitting two sequences for a pairwise alignment, using the EMBOSS > webservices. The results I get are good and the file is nicely formatted, so > there is no problem with the needle output (check here: > http://pastebin.com/m12ab3b2b ). > > However, the format it comes is not handy for what I want to do next, so I > thought of using Biopython to convert the alignment format into something > more useful, such as pir or fasta. And that's when I hit a problem. 
>
> The code I'm running is here: http://pastebin.com/m509fd88f
> When executed, it gives me this error:
>
> Traceback (most recent call last):
>   File "needle.py", line 50, in
>     alignments = AlignIO.read(open('alignment.results'), "emboss")
>   File "/usr/lib/python2.5/site-packages/PIL/__init__.py", line 375, in read
>
>   File
> "/home/joao/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/AlignIO/EmbossIO.py",
> line 197, in next
> SyntaxError: Error parsing alignment - sequences of different length?
>
> Which is, to say the least, weird. First, the PIL __init__.py it calls is
> completely empty. Then, the second thing it mentions is the file on my
> Desktop folder, which doesn't exist anymore. Third, if I use AlignIO.parse()
> instead of read(), it runs ok. But as soon as I try to actually _do_
> something with it, it gives me this very same error.

The bit about PIL in the stack trace is odd.

> So, is this a bug or is it me and my nasty coding abilities :) ?
>
> Thanks in advance!
>
> João Rodrigues

I don't know if it's good news or bad news, but it's a bug in the Biopython "emboss" parser, not your code. I get the same error message here on my machine using your sample output. I'll take a look at the code and get back to you shortly...

Can we include your sample output as a unit test in Biopython please?

Thanks

Peter

From biopython at maubp.freeserve.co.uk Thu Dec 4 18:10:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:10:58 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com>
Message-ID: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>

On Thu, Dec 4, 2008 at 6:02 PM, João Rodrigues wrote:
> Well, bad news, I'd rather have it be a problem with my code :D No problem
> at all to include my output.

Thanks.
For anyone wanting to try this at home, working backwards from the answer, the first input sequence is:

>E1
MSSDRQRSDDESPSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLSSKTTAKLS
TSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSSDY
PFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADPLV
GSIATQYLTNRAEHDRIARQWTKRYAT

And the second:

>E2
GMSDDDSRASTSSSSSSSSNQQTEKETNTPKKKESKVSMSKNSKLLSTSAKRIQKELADI
TLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTPEYPFKPPKVTFRTRI
YHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADPLVGSIATQYMTNRAE
HDRMARQWTKRYAT

I've assumed default needle parameters are being used.

It's the start of the alignment which is causing the problem, i.e. this bit of your file:

E1   1 MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT  49
            ..|||:| .||||.||.:          ..|:..|.:.|.:||.:
E2   1      GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES  35

This is easier to see with a fixed width font, but compare it to what I get using EMBOSS 6.0.1 on my local machine:

E1   1 MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT  49
            ..|||:| .||||.||.:          ..|:..|.:.|.:||.:
E2   1 -----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES  35

Note that here the second sequence, E2, has five leading gap characters. These are missing in your file, where spaces have been used, and the Biopython parser was not expecting this.

What URL are you using for the EMBOSS webservice? I'd like to try this myself, and if possible see what version of EMBOSS they are using on the server.
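
The mismatch described above can be illustrated with a minimal sketch. It assumes the fixed-width sequence column has already been sliced out of each alignment row, and the helper name `dash_leading_spaces` is hypothetical (it is not part of Biopython, and this is not how the "emboss" parser is implemented). It only shows why stripping the rows yields "sequences of different length", and how leading blanks could be turned back into gap characters:

```python
def dash_leading_spaces(seq_column):
    """Return the row with any leading blanks rewritten as '-' gap characters.

    Hypothetical helper for illustration only; assumes seq_column is the
    fixed-width sequence field cut from one row of an EMBOSS pair alignment.
    """
    body = seq_column.lstrip(" ")
    return "-" * (len(seq_column) - len(body)) + body


# First alignment block from the thread; E2 uses blanks for the leading gap.
row_e1 = "MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNT"
row_e2 = "     GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKES"

# A parser that strips whitespace sees rows of unequal length.
lengths_differ = len(row_e2.strip()) != len(row_e1.strip())

fixed_e2 = dash_leading_spaces(row_e2)
print(lengths_differ)
print(fixed_e2)
```

A row without leading blanks passes through unchanged, so output from newer EMBOSS versions would not be affected by such a step.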
Peter

From anaryin at gmail.com Thu Dec 4 18:19:43 2008
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Thu, 4 Dec 2008 19:19:43 +0100
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
Message-ID:

I believe the script I gave you had the needle function on it :x It's just a simple WSDL file provided by EBI being used by the SOAPpy module to access the webservice. The parameters are the defaults as well, so gapopen 10.0 and gapextend 0.5.

The page of the service is: http://www.ebi.ac.uk/Tools/webservices/services/emboss

I kind of noticed that nonsense gap, but that comes with the format unfortunately. I ran the Web version of the program, not the webservice, and the outcome was the same (regarding the gap):

http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html

João Rodrigues
http://doeidoei.wordpress.com

From biopython at maubp.freeserve.co.uk Thu Dec 4 18:57:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 18:57:57 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com>
Message-ID: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com>

On Thu, Dec 4, 2008 at 6:19 PM, João Rodrigues wrote:
> I believe the script I gave you had the needle function on it :x It's just a
> simple WSDL file provided by EBI being used by the SOAPpy module to access
> the webservice. The parameters are default as well so, gapopen 10.0 and
> gapextend 0.5.

Oh yes - I see it now on pastebin, previously I'd only looked at the output file.
> The page of the service is: > http://www.ebi.ac.uk/Tools/webservices/services/emboss > > I kind of noticed that non-sense gap, but that comes with the format > unfortunately. I ran the Web version of the program, not the webservice, and > the outcome was the same (regarding the gap): > > http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html > So the web versions of EMBOSS are using spaces for leading gaps (program version unknown), while the standalone version of EMBOSS (up to version 6.0.1) are using dashes (minus signs). Biopython 1.49 expects the leading dashes. I suspect that the EBI are running a more recent not-yet-released version of the EMBOSS tools to power their webservices. I'm not familiar enough with their code to know where to look... I suggest you email the webservice people and ask them why the needle output is different to the command line version (tell them parsers such as Biopython may be broken by this change). If this is a forthcoming change to the EMBOSS standalone tools, then I guess we'll have to fix the parser anyway. I may find time to look at this over the weekend - we'll see. Regards, Peter From anaryin at gmail.com Fri Dec 5 11:34:33 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 5 Dec 2008 12:34:33 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> Message-ID: I got a reply from the EBI support team saying that the webserver they provide is outdated, when compared to the versions of NEEDLE we (me on the web and Peter on his local machine) used. So, BioPython is nice and up-to-date, it's their server that is quite outdated. 
" Actually the WSEmboss web service uses an older version of EMBOSS (2.9.0), which exibits this behaviour. I suggest you contact the BioPython folks and let them know that older versions of EMBOSS behave differently. If you want to use the latest version of EMBOSS I suggest looking at using the Soaplab services (see http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) instead. All the best, Support at EBI" Jo?o Rodrigues http://doeidoei.wordpress.com On Thu, Dec 4, 2008 at 7:57 PM, Peter wrote: > On Thu, Dec 4, 2008 at 6:19 PM, Jo?o Rodrigues wrote: > > I believe the script I gave you had the needle function on it :x It's > just a > > simple WSDL file provided by EBI being used by the SOAPpy module to > access > > the webservice. The parameters are default as well so, gapopen 10.0 and > > gapextend 0.5. > > Oh yes - I see it now on pastebin, previously I'd only looked at the > output file. > > > The page of the service is: > > http://www.ebi.ac.uk/Tools/webservices/services/emboss > > > > I kind of noticed that non-sense gap, but that comes with the format > > unfortunately. I ran the Web version of the program, not the webservice, > and > > the outcome was the same (regarding the gap): > > > > > http://www.ebi.ac.uk/Tools/es/cgi-bin/jobresults.cgi/needle/needle-20081204-18180513899973.html > > > > So the web versions of EMBOSS are using spaces for leading gaps > (program version unknown), while the standalone version of EMBOSS (up > to version 6.0.1) are using dashes (minus signs). Biopython 1.49 > expects the leading dashes. > > I suspect that the EBI are running a more recent not-yet-released > version of the EMBOSS tools to power their webservices. I'm not > familiar enough with their code to know where to look... > > I suggest you email the webservice people and ask them why the needle > output is different to the command line version (tell them parsers > such as Biopython may be broken by this change). 
>
> If this is a forthcoming change to the EMBOSS standalone tools, then I
> guess we'll have to fix the parser anyway. I may find time to look at
> this over the weekend - we'll see.
>
> Regards,
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Dec 5 12:18:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 5 Dec 2008 12:18:50 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com>
Message-ID: <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com>

On Fri, Dec 5, 2008 at 11:34 AM, João Rodrigues wrote:
> I got a reply from the EBI support team saying that the webserver they
> provide is outdated, when compared to the versions of NEEDLE we (me on the
> web and Peter on his local machine) used. So, BioPython is nice and
> up-to-date, it's their server that is quite outdated.
>
> " Actually the WSEmboss web service uses an older version of EMBOSS (2.9.0),
> which exhibits this behaviour. I suggest you contact the BioPython folks and let
> them know that older versions of EMBOSS behave differently.
>
> If you want to use the latest version of EMBOSS I suggest looking at using
> the Soaplab services (see http://www.ebi.ac.uk/Tools/webservices/soaplab/overview)
> instead.
>
> All the best,
>
> Support at EBI"
>
> João Rodrigues

Thanks for the update :)

Are you OK using the more up to date SOAP needle, or perhaps standalone needle?
Does thos From anaryin at gmail.com Fri Dec 5 13:59:25 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 5 Dec 2008 13:59:25 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: Well... My VISTA partition just erased my Linux one, don't know how, so I can't answer that right now :x As soon as I get linux again, as soon as I get my script written again, I'll give an update here :) But I had solved the problem by changing the alignment output format to markx10 and "parsing" it my own way. Cheers and thanks for the help :) Jo?o Rodrigues http://doeidoei.wordpress.com On Fri, Dec 5, 2008 at 12:18 PM, Peter wrote: > On Fri, Dec 5, 2008 at 11:34 AM, Jo?o Rodrigues wrote: > > I got a reply from the EBI support team saying that the webserver they > > provide is outdated, when compared to the versions of NEEDLE we (me on > the > > web and Peter on his local machine) used. So, BioPython is nice and > > up-to-date, it's their server that is quite outdated. > > > > " Actually the WSEmboss web service uses an older version of EMBOSS > (2.9.0), > > which exibits this behaviour. I suggest you contact the BioPython folks > and let > > them know that older versions of EMBOSS behave differently. > > > > If you want to use the latest version of EMBOSS I suggest looking at > using > > the Soaplab services (see > http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) > > instead. > > > > All the best, > > > > Support at EBI" > > > > Jo?o Rodrigues > > Thanks the update :) > > Are you OK using the more up to date SOAP needle, or perhaps standalone > needle? 
> > Does thos > From anaryin at gmail.com Mon Dec 8 23:26:36 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 9 Dec 2008 00:26:36 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: Well, as promised, here goes update. I didn't try with soaplab2 because it was too complicated to get it to work. I didn't want to lose more than 10 minutes either so... However, with standalone needle, which EBI claim to be the same version as the soaplab2 service, it works flawlessly :) Code here: http://pastebin.com/f29ff12d6 Console output here: http://pastebin.com/f5bbc5593 It's not a bug then, it's just an old version :) Using the web versions, there may be some workarounds. If you convert the format to one of the others, you may get a usable one for Biopython. I tried markx1 I believe, and it was "almost" parsable, it just didn't get the correct sequences (if you deleted everything BUT the sequences, it would work). So, I think there should at least be a warning somewhere for the users so that they don't get nuts or reporting bugs :) Thanks for all the help! Regards! Jo?o Rodrigues http://doeidoei.wordpress.com On Fri, Dec 5, 2008 at 2:59 PM, Jo?o Rodrigues wrote: > Well... My VISTA partition just erased my Linux one, don't know how, so I > can't answer that right now :x As soon as I get linux again, as soon as I > get my script written again, I'll give an update here :) But I had solved > the problem by changing the alignment output format to markx10 and "parsing" > it my own way. 
> > Cheers and thanks for the help :) > > Jo?o Rodrigues > http://doeidoei.wordpress.com > > > On Fri, Dec 5, 2008 at 12:18 PM, Peter wrote: > >> On Fri, Dec 5, 2008 at 11:34 AM, Jo?o Rodrigues >> wrote: >> > I got a reply from the EBI support team saying that the webserver they >> > provide is outdated, when compared to the versions of NEEDLE we (me on >> the >> > web and Peter on his local machine) used. So, BioPython is nice and >> > up-to-date, it's their server that is quite outdated. >> > >> > " Actually the WSEmboss web service uses an older version of EMBOSS >> (2.9.0), >> > which exibits this behaviour. I suggest you contact the BioPython folks >> and let >> > them know that older versions of EMBOSS behave differently. >> > >> > If you want to use the latest version of EMBOSS I suggest looking at >> using >> > the Soaplab services (see >> http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) >> > instead. >> > >> > All the best, >> > >> > Support at EBI" >> > >> > Jo?o Rodrigues >> >> Thanks the update :) >> >> Are you OK using the more up to date SOAP needle, or perhaps standalone >> needle? >> >> Does thos >> > > From biopython at maubp.freeserve.co.uk Tue Dec 9 10:17:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Dec 2008 10:17:40 +0000 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: References: <320fb6e00812040956p4f5830ffr3ef2e0173e0c5bba@mail.gmail.com> <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> Message-ID: <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> On Mon, Dec 8, 2008 at 11:26 PM, Jo?o Rodrigues wrote: > Well, as promised, here goes update. I didn't try with soaplab2 because it > was too complicated to get it to work. I didn't want to lose more than 10 > minutes either so... 
> However, with standalone needle, which EBI claim to be
> the same version as the soaplab2 service, it works flawlessly :)
>
> Code here: http://pastebin.com/f29ff12d6
>
> Console output here: http://pastebin.com/f5bbc5593
>
> It's not a bug then, it's just an old version :)

Well, arguably it would be nice if Biopython could parse old versions of the EMBOSS pairs/simple output too, but it's not so important.

> Using the web versions, there may be some workarounds. If you convert
> the format to one of the others, you may get a usable one for Biopython.

If you just want the alignment itself, using FASTA as the output format from needle is very simple.

e.g.

$ needle one.fasta two.fasta --auto --filter -aformat fasta
>E1
MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLS-SKTTAK
LSTSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSS
DYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP
LVGSIATQYLTNRAEHDRIARQWTKRYAT
>E2
-----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKESKVSMSKNSKL
LSTSAKRIQKELADITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTP
EYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP
LVGSIATQYMTNRAEHDRMARQWTKRYAT

> I tried markx1 I believe, and it was "almost" parsable, it just didn't get the
> correct sequences (if you deleted everything BUT the sequences, it would
> work).

How were you trying to parse the markx1 output?

Note that the EMBOSS markx10 output is similar to, but differs from, the FASTA -m 10 output (which Biopython can parse as the "fasta-m10" format in Bio.AlignIO).

> So, I think there should at least be a warning somewhere for the
> users so that they don't get nuts or reporting bugs :)

Do you mean a warning about trying to use Bio.AlignIO with the "emboss" format to read output from old versions of EMBOSS needle tool?
Peter From anaryin at gmail.com Tue Dec 9 11:25:37 2008 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 9 Dec 2008 12:25:37 +0100 Subject: [BioPython] AlignIO: Sequences of different length In-Reply-To: <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> References: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> Message-ID: > > > Using the web versions, there may be some workarounds. If you convert > > the format to one of the others, you may get a usable one for Biopython. > > If you just want the alignment itself, using FASTA as the output > format from needle is very simple. > > e.g. > > $ needle one.fasta two.fasta --auto --filter -aformat fasta > >E1 > MSSDRQRSDDES-PSTSSGSSDADQRDPAAPEPEEQEERKPSATQQKKNTKLS-SKTTAK > LSTSAKRIQKELAEITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFSS > DYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP > LVGSIATQYLTNRAEHDRIARQWTKRYAT > >E2 > -----GMSDDDSRASTSSSSSSS----------SNQQTEKETNTPKKKESKVSMSKNSKL > LSTSAKRIQKELADITLDPPPNCSAGPKGDNIYEWRSTILGPPGSVYEGGVFFLDITFTP > EYPFKPPKVTFRTRIYHCNINSQGVICLDILKDNWSPALTISKVLLSICSLLTDCNPADP > LVGSIATQYMTNRAEHDRMARQWTKRYAT > Yep, but in the web version such format does not exist.. don't know why. > > > I tried markx1 I believe, and it was "almost" parsable, it just didn't > get the > > correct sequences (if you deleted everything BUT the sequences, it would > > work). > > How were you trying to parse the markx1 output? > > Note that the EMBOSS markx10 output is similar to, but differs from, > the FASTA -m 10 output (which Biopython can parse as the "fasta-m10" > format in Bio.AlignIO). > I tried with FASTA as the argument for the parser, because the description said: "This is the standard default output format used by Bill Pearson's suite of FASTA programs." 
And btw, it was the markx0, not the 1. Typo yesterday night..

>
> > So, I think there should at least be a warning somewhere for the
> > users so that they don't get nuts or reporting bugs :)
>
> Do you mean a warning about trying to use Bio.AlignIO with the
> "emboss" format to read output from old versions of EMBOSS needle
> tool?

Well, it may be frustrating for someone who's using that webservice to try and parse it and it gives that error. It might be useful, for example, to mention when such an error occurs that it might be due to use of the web version. Just a small appendix to the error message, for example.

Regards,

João

From biopython at maubp.freeserve.co.uk Tue Dec 9 11:42:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Dec 2008 11:42:34 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To:
References: <320fb6e00812041010w6b64db45n6ad34a429c0b9058@mail.gmail.com> <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com>
Message-ID: <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>

On Tue, Dec 9, 2008 at 11:25 AM, João Rodrigues wrote:
>> > Using the web versions, there may be some workarounds. If you convert
>> > the format to one of the others, you may get a usable one for Biopython.
>>
>> If you just want the alignment itself, using FASTA as the output
>> format from needle is very simple.
>>
>> e.g.
>>
>> $ needle one.fasta two.fasta --auto --filter -aformat fasta
>> ...
>
> Yep, but in the web version such format does not exist.. don't know why.

A strange omission on their part.

>> > I tried markx1 I believe, and it was "almost" parsable, it just didn't
>> > get the correct sequences (if you deleted everything BUT the
>> > sequences, it would work).
>>
>> How were you trying to parse the markx1 output?
>> >> Note that the EMBOSS markx10 output is similar to, but differs from, >> the FASTA -m 10 output (which Biopython can parse as the "fasta-m10" >> format in Bio.AlignIO). > > I tried with FASTA as the argument for the parser, because the description > said: > "This is the standard default output format used by Bill Pearson's suite of > FASTA programs." > > And btw, it was the markx0, not the 1. Typo yesterday night.. The various EMBOSS output formats are described here, http://emboss.sourceforge.net/docs/themes/AlignFormats.html The outputs markx0, markx1, ..., markx10 are EMBOSS *imitations* of the FASTA tool's output formats (but with the addition of EMBOSS style header/footers). Right now, Biopython doesn't parse any of these. In Biopython's Bio.AlignIO, "fasta" refers to the FASTA input file format (the simple file format using greater than signs for each new sequence). The only FASTA output format we support is "fasta-m10" which is how we refer to the output from FASTA's -m 10 command line argument. Right now, the Biopython FASTA m10 parser can't cope with the EMBOSS markx10 format. It might be nice if it did, but given we can parse EMBOSS's default output this doesn't seem like a big issue. >> > So, I think there should at least be a warning somewhere for the >> > users so that they don't get nuts or reporting bugs :) >> >> Do you mean a warning about trying to use Bio.AlignIO with the >> "emboss" format to read output from old versions of EMBOSS needle >> tool? > > Well, it may be frustrating for someone who's using that webservice to try > and parse it and it gives that error. It might be useful for example, to > mention, when such error occurs, that it might be happening due to use of > web version. Just a small appendix to the error message f example. So instead of "Error parsing alignment - sequences of different length?" it could say "Error parsing alignment - sequences of different length? Possibly you are using an old version of EMBOSS." 
That should help.

As an aside, do you mind me asking why you are using needle via a webservice? If you expect to do lots of alignments, surely running it locally is faster and more reliable (no network issues to worry about)?

Peter

From biopython at maubp.freeserve.co.uk Tue Dec 9 12:05:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Dec 2008 12:05:30 +0000
Subject: [BioPython] AlignIO: Sequences of different length
In-Reply-To: <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>
References: <320fb6e00812041057k850964cl5d6f064fd13623a3@mail.gmail.com> <320fb6e00812050418i4061a014y431e50d46ed30855@mail.gmail.com> <320fb6e00812090217u38be70b4o8046e9c177978f86@mail.gmail.com> <320fb6e00812090342n7534dd9dy9284ccc7f887209d@mail.gmail.com>
Message-ID: <320fb6e00812090405i5a23f32ar3c2f7cd535b67b64@mail.gmail.com>

On Tue, Dec 9, 2008 at 11:42 AM, Peter wrote:
>
> So instead of "Error parsing alignment - sequences of different
> length?" it could say "Error parsing alignment - sequences of
> different length? Possibly you are using an old version of EMBOSS."
> That should help.

I've tried to clarify this exception message in the latest code. For anyone interested in the details, see CVS revision 1.6 of Bio/AlignIO/EmbossIO.py which is viewable online:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/AlignIO/EmbossIO.py?cvsroot=biopython

There is no reason to update your installation, João, as this will make no difference to you - parsing the old EMBOSS 2.9.0 needle output will still fail.

Peter

From rjalves at igc.gulbenkian.pt Thu Dec 11 17:25:32 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Dec 2008 17:25:32 +0000
Subject: [BioPython] KEGG Gene parser
Message-ID: <49414D0C.8080509@igc.gulbenkian.pt>

Hi everyone,

Bringing back the KEGG Gene parser subject (from January 2008), Bio.KEGG has some modules for KEGG resources but not Gene. SeqIO doesn't seem to support KEGG either.
So my question is, has any progress been made in this regard?

Thanks,
Renato.

From biopython at maubp.freeserve.co.uk Fri Dec 12 11:06:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Dec 2008 11:06:07 +0000
Subject: [BioPython] KEGG Gene parser
In-Reply-To: <49414D0C.8080509@igc.gulbenkian.pt>
References: <49414D0C.8080509@igc.gulbenkian.pt>
Message-ID: <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com>

On Thu, Dec 11, 2008 at 5:25 PM, Renato Alves wrote:
> Hi everyone,
>
> Bringing back the KEGG Gene parser subject (from january 2008), Bio.KEGG has
> some modules for KEGG resources but not Gene. SeqIO doesn't seem to support
> KEGG either.

What are you trying to do? Do you want to parse gene files from KEGG into sequence objects? If so, could you point me at a particular example file so I have a better feel for the problem (and whether it would fit into Bio.SeqIO).

Thanks,

Peter

From jae at lmi.net Fri Dec 12 16:21:47 2008
From: jae at lmi.net (Jason Eshleman)
Date: Fri, 12 Dec 2008 08:21:47 -0800
Subject: [BioPython] bioPython and STRUCTURE
Message-ID: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>

Greetings. I'm curious if anyone has worked with code to operate the multi-locus pop gen. program "STRUCTURE" (http://pritch.bsd.uchicago.edu/software.html). I've got some code myself that I'd be happy to share/contribute if there's interest. I haven't been able to find any such discussions in the archives, but it could be my searching skills. The term 'structure' returns a large number of completely irrelevant hits. It does seem like bioPython is light in the pop. gen dept at this point.
-jae From tiagoantao at gmail.com Fri Dec 12 16:39:34 2008 From: tiagoantao at gmail.com (tiagoantao at gmail.com) Date: Fri, 12 Dec 2008 16:39:34 +0000 Subject: [BioPython] bioPython and STRUCTURE In-Reply-To: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com> References: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com> Message-ID: <6d941f120812120839i3f4b7d48gdcaa3f40a96364b6@mail.gmail.com> Hi, i am writing this from a mobile phone in a middle of a conference, so I will be short. your effort is most welcome. As soon as I am back (next week) I will gladly help you with putting the code on biopython pop gen. Structure is widely used and your contribution, from my part, is most welcome. there is actually a big chunk of updates that can be commited soon, maybe yours can go along On 12/12/08, Jason Eshleman wrote: > Greetings. I'm curious if anyone has worked with code to operate the > multi-locus pop gen. program "STRUCTURE" > (http://pritch.bsd.uchicago.edu/software.html). I've got some code myself > that I'd be happy to share/contribute if there's interest. I haven't been > able to find any such discussions in the archives, but it could be my > searching skills. The term 'structure' returns a large number of > completely irrelevant hits. It does seem like bioPython is light in the > pop. gen dept at this point. > > -jae > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." 
- Matthew Simmons,
http://www.tiago.org

From dalloliogm at gmail.com Fri Dec 12 17:31:28 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 12 Dec 2008 18:31:28 +0100
Subject: [BioPython] bioPython and STRUCTURE
In-Reply-To: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
References: <6.1.2.0.2.20081212081435.03193830@pop.att.yahoo.com>
Message-ID: <5aa3b3570812120931i3a5d654ta2955f9fe9bee292@mail.gmail.com>

On 12/12/08, Jason Eshleman wrote:
> Greetings. I'm curious if anyone has worked with code to operate the
> multi-locus pop gen. program "STRUCTURE"
> (http://pritch.bsd.uchicago.edu/software.html). I've got
> some code myself that I'd be happy to share/contribute if there's interest.
>

You could have a look at this code:
- http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen
which is a merge between Tiago's code and mine to implement population genetics things in python/biopython.

Actually I wonder whether it would be easier to use tools like waf or scons to handle external tools, but anyway it is good to have handlers like that in biopython.

> I haven't been able to find any such discussions in the archives, but it
> could be my searching skills.

mmmm have you tried something like "biopython structure -3d -pdb genetics"?

> The term 'structure' returns a large number
> of completely irrelevant hits. It does seem like bioPython is light in the
> pop. gen dept at this point.

At the moment yes, it is.
> > -jae > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Dec 12 18:28:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Dec 2008 18:28:14 +0000 Subject: [BioPython] KEGG Gene parser In-Reply-To: <4942A89E.3070002@igc.gulbenkian.pt> References: <49414D0C.8080509@igc.gulbenkian.pt> <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com> <4942A89E.3070002@igc.gulbenkian.pt> Message-ID: <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> On Fri, Dec 12, 2008 at 6:08 PM, Renato Alves wrote: > At the moment I'm doing exactly that, getting the sequence out of gene files > like the one attached. When you say sequence, do you want the nucleotides or the protein (or both)? Is there a URL for where that example file came from? I'd like to have a look at similar examples etc but all I found so far on KEGG were the HTML equivalents to this data. Thanks Peter From rjalves at igc.gulbenkian.pt Fri Dec 12 19:08:47 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 12 Dec 2008 19:08:47 +0000 Subject: [BioPython] KEGG Gene parser In-Reply-To: <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> References: <49414D0C.8080509@igc.gulbenkian.pt> <320fb6e00812120306i2e1af2cey6bcc7d402a037ffd@mail.gmail.com> <4942A89E.3070002@igc.gulbenkian.pt> <320fb6e00812121028t4546ec4esac4957bab707e4f6@mail.gmail.com> Message-ID: <4942B6BF.8050806@igc.gulbenkian.pt> Both. I got that one via KEGG API but you can get them at ftp://ftp.genome.jp/pub/kegg/genes/ . In the organisms folder you have full genome files (*.ent) in KEGG format. 

Renato

Quoting Peter on 12/12/2008 06:28 PM:
> On Fri, Dec 12, 2008 at 6:08 PM, Renato Alves wrote:
>
>> At the moment I'm doing exactly that, getting the sequence out of gene files
>> like the one attached.
>>
>
> When you say sequence, do you want the nucleotides or the protein (or both)?
>
> Is there a URL for where that example file came from? I'd like to
> have a look at similar examples etc but all I found so far on KEGG
> were the HTML equivalents to this data.
>
> Thanks
>
> Peter
>

From stran104 at chapman.edu  Sun Dec 14 09:38:34 2008
From: stran104 at chapman.edu (Matthew Strand)
Date: Sun, 14 Dec 2008 01:38:34 -0800
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <2a63cc350812140122n2260c0c3r17b7e8088aaaeec9@mail.gmail.com>
References: <2a63cc350812140122n2260c0c3r17b7e8088aaaeec9@mail.gmail.com>
Message-ID: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com>

Hello,
I have been working with SeqIO.write() on fasta files based on some info
provided in the API Documentation. It is written that SeqIO.write() should
"probably" perform fine with multiple calls, but in my experience it
actually does overwrite the whole file, even when the file is opened and
closed immediately before and after the write. Has anyone else had this
experience?

I will be rewriting my code to create large arrays before adding to the
file, which is easy for the example provided below. However, this will take
some work to change the part of the application that runs against our local
Blast databases for a few days, periodically adding sequences to files. I'd
like to make sure that I'm not the only one with this issue before rewriting
it.

---------BEGIN API Documentation Quote
Output - Advanced
=================
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong
reason to do so.

Trying this for certain alignment formats (e.g.
phylip, clustal, stockholm) would have the effect of concatenating several
multiple sequence alignments together. Such files are created by the PHYLIP
suite of programs for bootstrap analysis.

For sequential files formats (e.g. fasta, genbank) each "record block" holds
a single sequence. For these files it would probably be safe to call write()
multiple times.
---------END API Documentation Quote

---------BEGIN Code Sample to take a bunch of fasta files with multiple
species and generate individual files for each species.
for j in range(1, len(kogid)):
    name = "EXT-CLB-" + kogid[j] + ".seq"
    if os.path.exists(name):
        handle = open(name, "rU")
        records = list(SeqIO.parse(handle, "fasta"))
        for record in records:
            speciesID = record.id.split('|')[0]
            outFile = open(speciesID.split('-')[0] + ".seq", 'w')
            SeqIO.write([record], outFile, "fasta")
            outFile.close()
            print "Added a record for" + speciesID.split('-')[0]
    handle.close()
--------END Code Sample

Thank you for your responses,
-Matthew J

From mjldehoon at yahoo.com  Sun Dec 14 10:53:50 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 14 Dec 2008 02:53:50 -0800 (PST)
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com>
Message-ID: <826414.74394.qm@web62404.mail.re1.yahoo.com>

> for j in range(1, len(kogid)):
>     name = "EXT-CLB-" + kogid[j] + ".seq"
>     if os.path.exists(name):
>         handle = open(name, "rU")
>         records = list(SeqIO.parse(handle, "fasta"))

You don't need the 'list' here

>         for record in records:
>             speciesID = record.id.split('|')[0]
>             outFile = open(speciesID.split('-')[0] + ".seq", 'w')
>             SeqIO.write([record], outFile, "fasta")
>             outFile.close()
>             print "Added a record for" + speciesID.split('-')[0]
>     handle.close()

The handle.close() should be inside the "if" block, so with an additional
four spaces of indentation. Though this is not important for the problem you
mentioned.

The only way I can see that SeqIO.write overwrites a file is if
speciesID.split('-')[0] + ".seq"
results in the same file name for more than one of the records. It's not a
SeqIO.write issue; if you comment out the SeqIO.write line, you'll probably
end up with the exact same set of output files (all of them empty though).

--Michiel

From biopython at maubp.freeserve.co.uk  Sun Dec 14 13:05:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Dec 2008 13:05:09 +0000
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <826414.74394.qm@web62404.mail.re1.yahoo.com>
References: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com> <826414.74394.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00812140505w6f35d863n3896a2524b50d5ed@mail.gmail.com>

Matthew wrote:
> It is written that SeqIO.write() should "probably" perform fine with
> multiple calls, but with my experience it actually does overwrite
> the whole file, even when the file is opened and closed
> immediately before and after the write.

You seem to have misunderstood the documentation - are you already
familiar with working with file handles in python? Perhaps this could
be clarified.
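
The truncation at issue in this thread can be reproduced with the standard
library alone. What follows is only a minimal sketch: plain write() calls
stand in for SeqIO.write() (an assumption, so that it runs without
Biopython), and the helper names, record names and temporary path are made
up for illustration.

```python
import os
import tempfile


def write_records(path, records, mode):
    """Write FASTA-style text records to path using the given file mode."""
    with open(path, mode) as handle:
        for name, seq in records:
            handle.write(">%s\n%s\n" % (name, seq))


def count_records(path):
    """Count FASTA header lines in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Hypothetical scratch location, used only for this sketch.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.fasta")

# Reopening with "w" truncates the file each time, so only the
# record from the most recent open/write/close cycle survives.
write_records(path, [("seq1", "ACGT")], "w")
write_records(path, [("seq2", "GGCC")], "w")
after_w = count_records(path)

# Reopening with "a" appends, so the earlier record is kept.
write_records(path, [("seq3", "TTAA")], "a")
after_a = count_records(path)

print(after_w, after_a)
```

The same applies when several records map to one output file name: each
fresh open in "w" mode discards whatever an earlier record put there.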

Using FASTA format, this is safe:

out_handle = open("example.fasta","w")
SeqIO.write(records, out_handle, "fasta")
SeqIO.write(more_records, out_handle, "fasta")
SeqIO.write(even_records, out_handle, "fasta")
out_handle.close()

You could also have written:

out_handle = open("example.fasta","w")
SeqIO.write(records+more_records+even_records, out_handle, "fasta")
out_handle.close()

I suspect what you are doing instead is akin to this:

out_handle = open("example.fasta","w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
out_handle = open("example.fasta","w")
SeqIO.write(more_records, out_handle, "fasta")
out_handle.close()
out_handle = open("example.fasta","w")
SeqIO.write(even_records, out_handle, "fasta")
out_handle.close()

This code will write the file once, then replace it, and again replace
it. The final file contains only the third set of records. This is
probably not what you intended.

Your example code seems to be trying to create one file per sequence.
Perhaps you have some duplicate filenames being generated as Michiel
suggested.

Peter

From biopython at maubp.freeserve.co.uk  Sun Dec 14 23:58:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 14 Dec 2008 23:58:01 +0000
Subject: [BioPython] SeqIO.write() Multiple Calls for fasta
In-Reply-To: <2a63cc350812141439x31a78a7fi52da56ebb483cf67@mail.gmail.com>
References: <2a63cc350812140138g4cb85e5eo2ef6f2c97c15a533@mail.gmail.com> <826414.74394.qm@web62404.mail.re1.yahoo.com> <320fb6e00812140505w6f35d863n3896a2524b50d5ed@mail.gmail.com> <2a63cc350812141439x31a78a7fi52da56ebb483cf67@mail.gmail.com>
Message-ID: <320fb6e00812141558t61669d25q328f588fe93f10bd@mail.gmail.com>

Hi Matthew,

I've CC'ed your reply back to the mailing list.

On Sun, Dec 14, 2008 at 10:39 PM, Matthew Strand wrote:
> I see, you both are right, this is not a SeqIO.write() issue.
> I should have created the empty files and then used the append ('a')
> mode instead of the write ('w') mode to add records to the file since
> the 'w' mode will overwrite the file.

I think using "a" for append will create the file if it does not
already exist. Be careful if you run your script more than once - you
may get multiple entries in each output file!

> The way I interpreted the documentation was that it was safe to call
> SeqIO.write() multiple times on the same file without overwriting it. And as
> you both have shown, this is safe, as long as the right mode is used.
>
> Thank you for your responses and your time.

I hope it helped :)

Good night.

Peter

From dalloliogm at gmail.com  Mon Dec 15 22:16:17 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 15 Dec 2008 23:16:17 +0100
Subject: [BioPython] [Popgen] a binary format for genotypes
Message-ID: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com>

Hi,
I was reading this article:
- http://www.biomedcentral.com/1471-2105/9/526/abstract

The authors describe a binary format to store SNPs data in a more
efficient way than flat files.
One of the authors, in his blog, says that they have developed some python APIs:
- http://www.mailund.dk/index.php/2008/12/11/snpfile/

I think this is interesting for our biopython Popgen module.
Maybe we can ask them for collaboration, and we could use such a
format to store SNP data internally or at least provide support for
their format.
What do you think?

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From kteague at bcgsc.ca  Mon Dec 15 22:53:29 2008
From: kteague at bcgsc.ca (Kevin Teague)
Date: Mon, 15 Dec 2008 14:53:29 -0800
Subject: [BioPython] [Popgen] a binary format for genotypes
In-Reply-To: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com>
References: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com>
Message-ID: <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca>

A lot of the headaches of dealing with large scale data sets in a
performance optimizing manner (self-describing format, platform
independent binary files) have been worked out in other fields of
science who've been dealing with large scale data sets for a lot
longer than the field of bioinformatics (e.g. astronomy and
climatology).

While I've only used it a little bit, so I can't comment if there are
any other formats that are worthy contenders, the HDF5 format is well
established for working with large scale data sets:

http://www.hdfgroup.org/HDF5/

There are libraries for accessing this format for many languages. With
Python there is PyTables, which is a very good library:

http://www.pytables.org/

I haven't heard of anyone using this in bioinformatics, but I've seen
it demonstrated in a very high traffic financial application written in
Python where performance of this library was impressive. The developer
ported to PyTables after PostgreSQL became a bottle-neck and found
that PyTables was an order of magnitude faster. Of course, this isn't
a purely fair comparison, since PyTables gives up transactions,
concurrency and referential integrity in favor of pure speed. But in
most data analysis pipelines, each data set can be produced
independently of each other, so those features of a RDBMS aren't
usually needed.

There have been a number of other bioinformatics tools and libraries
that have been using custom binary file formats to deal with the ever
increasing size of bioinformatic data sets.
From a sysadmin and developer perspective it's a big headache since these
custom formats can be platform-sensitive and require compiling and
installing binaries to deal with each data format. Bleh!
I have yet to see a "custom bioinformatic binary file format" which
had to be developed to account for shortcomings of an already
existing binary file format ...

From dalloliogm at gmail.com  Mon Dec 15 23:49:29 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 16 Dec 2008 00:49:29 +0100
Subject: [BioPython] [Popgen] a binary format for genotypes
In-Reply-To: <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca>
References: <5aa3b3570812151416q6dc22d34tb4fc5e551329acfb@mail.gmail.com> <0D23A55E-4C8C-40DC-AF5A-0690B3B0131F@bcgsc.ca>
Message-ID: <5aa3b3570812151549p530f8005m9c2200712e840777@mail.gmail.com>

On Mon, Dec 15, 2008 at 11:53 PM, Kevin Teague wrote:
> A lot of the headaches of dealing with large scale data sets in a
> performance optimizing manner (self-describing format, platform independent
> binary files) have been worked out in other fields of science who've been
> dealing with large scale data sets for a lot longer than the field of
> bioinformatics (e.g. astronomy and climatology).
>
> While I've only used it a little bit, so I can't comment if there are any
> other formats that are worthy contenders, the HDF5 format is well
> established for working with large scale data sets:
>
> http://www.hdfgroup.org/HDF5/

I have already heard of this format, but for some reason I thought
that it couldn't be more efficient than a database.
I have to deal with a table of ~10^7 entries, correlated with another
one of 10^3, so, if I'd organize it in a certain way, it will have
10^10 entries.
Do you think that this binary format would be more efficient than a
database to handle all this? Does it support relationships? (ok, I
will read the documentation!! :) ).

>
> There are libraries for accessing this format for many languages.
> With Python there is PyTables, which is a very good library:
>
> http://www.pytables.org/

Thanks for the link

> I haven't heard of anyone using this in bioinformatics, but I've seen it
> demonstrated in a very high traffic financial application written in Python
> where performance of this library was impressive. The developer ported to
> PyTables after PostgreSQL became a bottle-neck and found that PyTables was
> an order of magnitude faster. Of course, this isn't a purely fair
> comparison, since PyTables gives up transactions, concurrency and
> referential integrity in favor of pure speed. But in most data analysis
> pipelines, each data set can be produced independently of each other, so
> those features of a RDBMS aren't usually needed.
>
> There have been a number of other bioinformatics tools and libraries that
> have been using custom binary file formats to deal with the ever increasing
> size of bioinformatic data sets. From a sysadmin and developer perspective
> it's a big headache since these custom formats can be platform-sensitive and
> require compiling and installing binaries to deal with each data format.
> Bleh!
> I have yet to see a "custom bioinformatic binary file format" which had to
> be developed to account for shortcomings of an already existing binary file
> format ...
>
>

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From pzs at dcs.gla.ac.uk  Thu Dec 18 13:47:11 2008
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Thu, 18 Dec 2008 13:47:11 +0000
Subject: [BioPython] Genbank LOCUS line slightly misaligned
Message-ID: <494A545F.2020307@dcs.gla.ac.uk>

I have a genbank file sent to my lab from a company called Genomatrix.
It is slightly misformed.

Specifically, the LOCUS lines have the right features, but not quite
aligned; for example, the "bp" marker is not always at exactly the
positions ([29:33] and [40:44]) required by _feed_first_line() in
$biopythonhome/Genbank/Scanner.py.
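
To make the column requirement concrete, here is a rough sketch of a
fixed-slice check next to a whitespace-based fallback that only pulls out
the locus ID. It is not the Biopython parser: the function names and the
build_line() helper are hypothetical, and the example lines are constructed
for this sketch (their spacing is an assumption, not copied from the real
file).

```python
MARKERS = (" bp ", " aa ")


def marker_in_expected_columns(line):
    """True if ' bp ' or ' aa ' sits in one of the two fixed slices."""
    return line[29:33] in MARKERS or line[40:44] in MARKERS


def locus_id_by_whitespace(line):
    """Fallback: take the second whitespace-separated token as the locus ID."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    return tokens[1]


def build_line(name, length):
    """Hypothetical helper: lay out a LOCUS line with a 10-column name slot.

    A name longer than the slot pushes every later field to the right.
    """
    return ("LOCUS".ljust(12) + name).ljust(22) + str(length).rjust(7) + " bp    DNA"


aligned = build_line("GXP_4216", 601)      # name fits, marker lands at [29:33]
shifted = build_line("GXP_1485624", 601)   # name overflows, marker is shifted

for line in (aligned, shifted):
    print(locus_id_by_whitespace(line), marker_in_expected_columns(line))
```

The fallback ignores everything after the ID, so it only suits cases where
the rest of the LOCUS line is not needed.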

Have Genomatrix made an error in producing these genbank files, or
should the biopython routines accommodate these variations? Some lines
just give warnings and plough on, but others report that there isn't a
space in exactly the right place and fail to read the record at all.
I'm having to hack the genbank file as we speak...

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Thu Dec 18 15:15:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Dec 2008 15:15:07 +0000
Subject: [BioPython] Genbank LOCUS line slightly misaligned
In-Reply-To: <494A545F.2020307@dcs.gla.ac.uk>
References: <494A545F.2020307@dcs.gla.ac.uk>
Message-ID: <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com>

On Thu, Dec 18, 2008 at 1:47 PM, Peter Saffrey wrote:
> I have a genbank file sent to my lab from a company called Genomatrix. It is
> slightly misformed.

Oh dear. Parsing misformed files is difficult as often they can be
interpreted in more than one way. In general, the only safe and
explicit choice here is to throw an exception - although we do
tolerate some minor deviations from the spec in places.

> Specifically, the LOCUS lines have the right features, but not quite
> aligned; for example, the "bp" marker is not always at exactly the positions
> ([29:33] and [40:44]) required by _feed_first_line() in
> $biopythonhome/Genbank/Scanner.py.

The fact we allow for the "bp" (or "aa") marker in two places reflects
two iterations of the GenBank standard. In theory we could remove the
support for the older version but there may be third party tools still
producing GenBank files using that style.

> Have Genomatrix made an error in producing these genbank files, or should
> the biopython routines accommodate these variations?

I presume Genomatrix have made an error - try emailing them for
clarification. The GenBank file format for the LOCUS line is very
explicit and uses very precise column positions for the fields.
In theory we could try parsing ambiguous files using spaces to split up
the fields, but as many of the fields are optional, this isn't
generally possible without a little guess work.

> Some lines just give warnings and plough on, but others report that
> there isn't a space in exactly the right place and fail to read the record
> at all. I'm having to hack the genbank file as we speak...

I suspect that they (Genomatrix) are inserting a large locus
identifier into the beginning of the LOCUS line which is sometimes
bigger than the allocated slot, pushing the rest of the fields out of
position in some of the files. I'd need to see several examples to be
confident about this guess.

If you don't actually need much information from the LOCUS line, you
might find it easier to hack our parser to be a little more tolerant -
I would suggest simply pulling out the locus ID, ignoring the rest of
the LOCUS line, and printing a warning.

Peter

P.S. Which version of Biopython are you using? Biopython 1.48 onwards
is a little less fussy than Biopython 1.47 in order to accept GenBank
files produced by EMBOSS seqret.

From pzs at dcs.gla.ac.uk  Thu Dec 18 15:25:54 2008
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Thu, 18 Dec 2008 15:25:54 +0000
Subject: [BioPython] Genbank LOCUS line slightly misaligned
In-Reply-To: <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com>
References: <494A545F.2020307@dcs.gla.ac.uk> <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com>
Message-ID: <494A6B82.4050906@dcs.gla.ac.uk>

Thanks for your prompt reply.

Peter wrote:
> I suspect that they (Genomatrix) are inserting a large locus
> identifier into the beginning of the LOCUS line which is sometimes
> bigger than the allocated slot, pushing the rest of the fields out of
> position in some of the files. I'd need to see several examples to be
> confident about this guess.
>

That sounds about right.
Here's a sample:

$ grep LOCUS skurukutipromo.gb | head
LOCUS GXP_4216 601 bp DNA
LOCUS GXP_4217 601 bp DNA
LOCUS GXP_4220 601 bp DNA
LOCUS GXP_4226 603 bp DNA
LOCUS GXP_1485624 601 bp DNA
LOCUS GXP_1485625 601 bp DNA
LOCUS GXP_4230 601 bp DNA
LOCUS GXP_4253 640 bp DNA
LOCUS GXP_648168 662 bp DNA
LOCUS GXP_4281 601 bp DNA

It's a bit careless on their part, but who listens to standards anyway? ;)

> If you don't actually need much information from the LOCUS line, you
> might find it easier to hack our parser to be a little more tolerant -
> I would suggest simply pulling out the locus ID, ignoring the rest of
> the LOCUS line, and printing a warning.
>

I already did a regex on the file itself to excise everything after the
locus id, which put an end to the complaints.

I'm also finding I have to manually parse the description entry, which
comes out in one big lump like this:

'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo
sapiens|chr=19|ctg=NC_000019|str=(-)|
start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA
FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'

Has some other formatting error prevented biopython from breaking this
up for me, or is this the expected behaviour? I'm using biopython1.49.
It's not a big deal, I was just wondering.

Cheers,

Peter

From biopython at maubp.freeserve.co.uk  Thu Dec 18 16:01:10 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Dec 2008 16:01:10 +0000
Subject: [BioPython] Genbank LOCUS line slightly misaligned
In-Reply-To: <494A6B82.4050906@dcs.gla.ac.uk>
References: <494A545F.2020307@dcs.gla.ac.uk> <320fb6e00812180715o6bc3fe28pe6359c62e1f2e39e@mail.gmail.com> <494A6B82.4050906@dcs.gla.ac.uk>
Message-ID: <320fb6e00812180801u3bb2d31chddb44dae7502c2f4@mail.gmail.com>

On Thu, Dec 18, 2008 at 3:25 PM, Peter Saffrey wrote:
> Thanks for your prompt reply.
>
> Peter wrote:
>>
>> I suspect that they (Genomatrix) are inserting a large locus
>> identifier into the beginning of the LOCUS line which is sometimes
>> bigger than the allocated slot, pushing the rest of the fields out of
>> position in some of the files. I'd need to see several examples to be
>> confident about this guess.
>>
>
> That sounds about right. Here's a sample:
>
> $ grep LOCUS skurukutipromo.gb | head
> LOCUS GXP_4216 601 bp DNA
> LOCUS GXP_4217 601 bp DNA
> LOCUS GXP_4220 601 bp DNA
> LOCUS GXP_4226 603 bp DNA
> LOCUS GXP_1485624 601 bp DNA
> LOCUS GXP_1485625 601 bp DNA
> LOCUS GXP_4230 601 bp DNA
> LOCUS GXP_4253 640 bp DNA
> LOCUS GXP_648168 662 bp DNA
> LOCUS GXP_4281 601 bp DNA
>
> It's a bit careless on their part, but who listens to standards anyway? ;)

Writing general output to GenBank format is tricky if you have long
record identifiers.

>> If you don't actually need much information from the LOCUS line, you
>> might find it easier to hack our parser to be a little more tolerant -
>> I would suggest simply pulling out the locus ID, ignoring the rest of
>> the LOCUS line, and printing a warning.
>
> I already did a regex on the file itself to excise everything after the
> locus id, which put an end to the complaints.

If you're happy, that's fine.

> I'm also finding I have to manually parse the description entry, which comes
> out in one big lump like this:
>
> 'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo
> sapiens|chr=19|ctg=NC_000019|str=(-)|
> start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA FLJ34771
> fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'

What did the DEFINITION lines look like? It's usually just a long
string like "species name, complete genome" spanning one or more
lines. Here I'm guessing Genomatrix are sticking a whole load of meta
data into this field using their own convention.
This is a bit odd, but I think I've also seen similar extra data dumped
into the COMMENT lines by other programs.

> Has some other formatting error prevented biopython from breaking this up
> for me, or is this the expected behaviour? I'm using biopython1.49. It's not
> a big deal, I was just wondering.

I think that's the expected behaviour, the DEFINITION lines become
the record's description property (a simple string).

Peter

From biopython.chen at gmail.com  Mon Dec 22 16:38:45 2008
From: biopython.chen at gmail.com (Chandan Kumar)
Date: Mon, 22 Dec 2008 08:38:45 -0800
Subject: [BioPython] help for local alignment
Message-ID: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>

Dear all,
can any one provide me simple code for local alignment
python code which can be applied for protein or nucleotide sequence. Please
provide me the simplest code as I am new to python and from biology
background.

Thanking you.

Kind regards
Chen

From biopython at maubp.freeserve.co.uk  Mon Dec 22 17:47:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Dec 2008 17:47:16 +0000
Subject: [BioPython] help for local alignment
In-Reply-To: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>
References: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com>
Message-ID: <320fb6e00812220947xd9444ffp636c13c684fda2c4@mail.gmail.com>

On Mon, Dec 22, 2008 at 4:38 PM, Chandan Kumar wrote:
> Dear all,
> can any one provide me simple code for local alignment
> python code which can be applied for protein or nucleotide sequence. Please
> provide me the simplest code as I am new to python and from biology
> background.
>
> Thanking you.
>
> Kind regards
> Chen

Hi Chen,

Are you wanting to do pairwise alignments (aligning two sequences to
each other), or multiple sequence alignments?

For multiple sequence alignments, you might want to use a 3rd party
tool like ClustalW, or MUSCLE. Biopython can parse several alignment
formats including ClustalW format.
See our tutorial for examples using ClustalW.

Biopython's Bio.pairwise2 can do pairwise alignments, although we only
have the built in documentation for this at the moment (nothing in our
tutorial). This documentation is also available online:
http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

For pairwise sequence alignments I personally use the EMBOSS tools
"water" (Smith-Waterman algorithm for local alignment) or "needle"
(Needleman-Wunsch for global alignment). Biopython's Bio.AlignIO
module can parse their output.

Peter

From bala.biophysics at gmail.com  Mon Dec 29 21:37:31 2008
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Mon, 29 Dec 2008 22:37:31 +0100
Subject: [BioPython] error in writting pdb file
Message-ID: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com>

Dear Friends,

When i try to write a pdb file with PDBIO, i get the following error. What
could be the possible reason for the same.

>>> out=PDBIO()
>>> out.set_structure(s)
>>> out.save("new.pdb")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", line 150, in save
  File "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py", line 84, in _get_atom_line
TypeError: %c requires int or char

Thanks in advance,
Bala

From biopython at maubp.freeserve.co.uk  Mon Dec 29 23:24:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 29 Dec 2008 23:24:46 +0000
Subject: [BioPython] error in writting pdb file
In-Reply-To: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com>
References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com>
Message-ID: <320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com>

On Mon, Dec 29, 2008 at 9:37 PM, Bala subramanian wrote:
> Dear Friends,
>
> When i try to write a pdb file with PDBIO, i get the following error. What
> could be the possible reason for the same.
>
>>>> out=PDBIO()
>>>> out.set_structure(s)
>>>> out.save("new.pdb")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py",
> line 150, in save
>   File
> "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py",
> line 84, in _get_atom_line
> TypeError: %c requires int or char
>
> Thanks in advance,
> Bala

Something in one of your atom objects isn't as expected. The
_get_atom_line code is trying to construct a string for an atom line
for the PDB file, but one of the string formatting arguments isn't
set up right (the TypeError about %c).

Without seeing how you constructed the structure (variable s in your
code) it's hard to guess what is wrong. Maybe one of the required
properties is set to None?

Peter

From srini_iyyer_bio at yahoo.com  Tue Dec 30 00:39:15 2008
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Mon, 29 Dec 2008 16:39:15 -0800 (PST)
Subject: [BioPython] blastcl3
Message-ID: <113203.41517.qm@web38105.mail.mud.yahoo.com>

Dear Group,
I am using netblast blastcl3 to blast my small fasta sequences to human genome.

blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out

Above is my command. I want to be able to parse the output which is a
text based format.

I used this:
from Bio.Blast import NCBIWWW
import Bio.Blast.Record
blast_out = open('test.out','r')
parser = NCBIWWW.BlastParser()
blastRecord = parser.parse(blast_out)

I hit error and is reported below.

Instead I did the following:

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

fasta_string = open("test.fa").read()
result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string)
blast_records = NCBIXML.parse(result_handle)
blast_records = list(blast_records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Instead:

if I say :
for item in blast_records:
    print i

I get IndexError: list index out of range.

what should I do?
could any one help me please.
thanks
Srini

Error for :blastRecord = parser.parse(blast_out)

>>> blastRecord = parser.parse(blast_out)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 51, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 103, in feed
    has_re=re.compile(r'.?BLAST'))
  File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 334, in read_and_call_until
    line = safe_readline(uhandle)
  File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 410, in safe_readline
    raise ValueError, "Unexpected end of stream."
ValueError: Unexpected end of stream.

From chapmanb at 50mail.com  Tue Dec 30 01:09:29 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 29 Dec 2008 20:09:29 -0500
Subject: [BioPython] blastcl3
In-Reply-To: <113203.41517.qm@web38105.mail.mud.yahoo.com>
References: <113203.41517.qm@web38105.mail.mud.yahoo.com>
Message-ID: <20081230010929.GA57412@kunkel>

Hi Srini;

> I am using netblast blastcl3 to blast my small fasta sequences to human genome.
> blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out
>
> Above is my command. I want to be able to parse the output which is a
> text based format.

My first suggestion if you want to parse BLAST is to use the XML
output. Based on the NCBI documentation here:

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/netblast.html

it appears as if the parameter you want is '-m 7'. XML output is much
more stable, and details on parsing it in Biopython are here:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56

The error you report below makes it seem as if the output file is
empty but it is a bit tough to say. If parsing the XML output does not
work, you might want to double check the 'test.out' file to be sure it
looks decent, and if so attach it here so we can help more.
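
As a rough illustration of the two checks above (is the output empty, and
does the XML parse), here is a sketch that runs without Biopython. It swaps
in the standard library's xml.etree.ElementTree in place of
Bio.Blast.NCBIXML, and the XML fragment is hand-written for this sketch:
the element names follow the NCBI BLAST XML layout, but the hit itself is
invented and summarize_hits() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for "-m 7" output; the hit values are invented.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>example contig (made up)</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-20</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def summarize_hits(xml_text):
    """Return (hit definition, best e-value) pairs; raise if the text is empty."""
    if not xml_text.strip():
        # An empty file is the first thing to rule out before blaming the parser.
        raise ValueError("BLAST output is empty; check the blastcl3 run first")
    root = ET.fromstring(xml_text)
    summary = []
    for hit in root.iter("Hit"):
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        best = min(evalues) if evalues else None
        summary.append((hit.findtext("Hit_def"), best))
    return summary


for definition, evalue in summarize_hits(SAMPLE_XML):
    print(definition, evalue)
```

For real work the Biopython XML parser described in the tutorial remains
the better route; this only shows the shape of the checks.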

Hope this helps,
Brad

> I used this:
> from Bio.Blast import NCBIWWW
> import Bio.Blast.Record
> blast_out = open('test.out','r')
> parser = NCBIWWW.BlastParser()
> blastRecord = parser.parse(blast_out)
>
> I hit error and is reported below.
>
> Instead I did the following:
>
> from Bio.Blast import NCBIWWW
> from Bio.Blast import NCBIXML
>
> fasta_string = open("test.fa").read()
> result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string)
> blast_records = NCBIXML.parse(result_handle)
> blast_records = list(blast_records)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> StopIteration
>
> Instead:
>
> if I say :
> for item in blast_records:
>     print i
>
> I get IndexError: list index out of range.
>
> what should I do?
> could any one help me please.
> thanks
> Srini
>
> Error for :blastRecord = parser.parse(blast_out)
>
> >>> blastRecord = parser.parse(blast_out)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 51, in parse
>     self._scanner.feed(handle, self._consumer)
>   File "/usr/local/lib/python2.5/site-packages/Bio/Blast/NCBIWWW.py", line 103, in feed
>     has_re=re.compile(r'.?BLAST'))
>   File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 334, in read_and_call_until
>     line = safe_readline(uhandle)
>   File "/usr/local/lib/python2.5/site-packages/Bio/ParserSupport.py", line 410, in safe_readline
>     raise ValueError, "Unexpected end of stream."
> ValueError: Unexpected end of stream.

From biopython at maubp.freeserve.co.uk  Tue Dec 30 16:55:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Dec 2008 16:55:32 +0000
Subject: [BioPython] error in writting pdb file
In-Reply-To: <288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com>
References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com> <320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com> <288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com>
Message-ID: <320fb6e00812300855o75e17ab9wcd62c8387f20629f@mail.gmail.com>

On Tue, Dec 30, 2008 at 5:47 AM, Bala subramanian wrote:
> Peter,
> Here is the small code i where i try to renumber the residues.
>
> Python 2.5.2
>>>> from Bio.PDB import PDBParser
>>>> from Bio.PDB import PDBIO
>>>> par=PDBParser()
>>>> S=par.get_structure('cef','1CE4.pdb')
>>>> seq=range(100,134+1)
>>>> i=0
>>>> for residues in S.get_residues():
> ...     residues.id=('',seq[i],'')
> ...     i += 1
> ...
>>>> out=PDBIO()
>>>> out.set_structure(S)
>>>> out.save("new.pdb")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py",
> line 150, in save
>   File
> "/home/bala/Desktop/biopython-1.49/build/lib.linux-i686-2.5/Bio/PDB/PDBIO.py",
> line 84, in _get_atom_line
> TypeError: %c requires int or char

For this example, the copy of 1CE4.pdb I just downloaded seems to have
700 residues - but you only created a list of 35 new identifiers.
This means the code above fails for me with an index error - easy to
fix but I'm not 100% sure how you want to renumber the residues.

As to the TypeError, I think the problem is you are setting the first
and last parts of the ID to empty string.
Try using a single space instead - how about:

for index, residue in enumerate(S.get_residues()) :
    residue.id = (" ", index+100, " ") #Note quoted spaces!

Notice I'm using the Python enumerate function, which means index counts from 0, 1, 2, ... and I then use this to calculate the new identifier by adding 100. You may want to do something different.

Peter

From biopython at maubp.freeserve.co.uk Tue Dec 30 17:42:14 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Dec 2008 17:42:14 +0000
Subject: [BioPython] blastcl3
In-Reply-To: <113203.41517.qm@web38105.mail.mud.yahoo.com>
References: <113203.41517.qm@web38105.mail.mud.yahoo.com>
Message-ID: <320fb6e00812300942j1943e059j5cae6fea4c9c3de@mail.gmail.com>

On Tue, Dec 30, 2008 at 12:39 AM, Srinivas Iyyer wrote:
> Dear Group,
> I am using netblast blastcl3 to blast my small fasta sequences to human genome.
>
> blastcl3 -p blastn -i test.fa -d gpipe/9606/all_contig -o test.out
>
> Above is my command. I want to be able to parse the output which is a text based format.

I would urge you to tell blast to produce XML output as already described by Brad. Just to clarify:

Bio.Blast.NCBIXML includes our XML blast parser (recommended)
Bio.Blast.NCBIStandalone includes our plain text parser (discouraged)
Bio.Blast.NCBIWWW includes our deprecated HTML blast parser

The module naming reflects the historical introduction of the different BLAST tools, and is unfortunately a little misleading nowadays since both the standalone command line tool and the website can produce XML, plain text or HTML output.

> I used this:
> from Bio.Blast import NCBIWWW
> import Bio.Blast.Record
> blast_out = open('test.out','r')
> parser = NCBIWWW.BlastParser()
> blastRecord = parser.parse(blast_out)

The above code will try to parse HTML (web page) format BLAST output - but you said test.out should be in plain text format, so this won't work.
If you really want to use the plain text format, try the parser in Bio.Blast.NCBIStandalone - but it doesn't fully work on the output from the latest version of the BLAST standalone tools.

> Instead I did the following:
>
> from Bio.Blast import NCBIWWW
> from Bio.Blast import NCBIXML
>
> fasta_string = open("test.fa").read()
> result_handle = NCBIWWW.qblast("blastn", "gpipe/9606/all_contig", fasta_string)

This function runs BLAST over the internet, and it should default to XML format. You can override this using the format_type argument as described in the docstring or the tutorial. You should be able to parse it using Bio.Blast.NCBIXML as you tried...

However, I would assume that "gpipe/9606/all_contig" is a local database on your machine, so there is no way the NCBI's servers can use it. If you examine the results by hand you will probably find an error message; try this:

print result_handle.read()

Peter

From biopython at maubp.freeserve.co.uk Tue Dec 30 18:00:43 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Dec 2008 18:00:43 +0000
Subject: [BioPython] error in writting pdb file
In-Reply-To: <288df32a0812300944l743743ceo79079b697e37ef29@mail.gmail.com>
References: <288df32a0812291337od644e6w64b6216213d64172@mail.gmail.com>
	<320fb6e00812291524x5a49f375r7fc262188810f988@mail.gmail.com>
	<288df32a0812292147o70f17332safcb7c86ad55d404@mail.gmail.com>
	<320fb6e00812300855o75e17ab9wcd62c8387f20629f@mail.gmail.com>
	<288df32a0812300944l743743ceo79079b697e37ef29@mail.gmail.com>
Message-ID: <320fb6e00812301000w4a6a5557nc473baaf0e58bcbc@mail.gmail.com>

On Tue, Dec 30, 2008 at 5:44 PM, Bala subramanian wrote:
> Dear Peter,
>
> Actually 1Ce4.pdb is a NMR structure and i just did the renumbering on one
> model extracted from it.

That would explain why you had fewer residues.

> Now the script work fine after adjusting the quoted space. Thank you very much.

Good. I'm glad we could solve this so quickly.
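
For readers of the archive, the TypeError from this thread can be reproduced without Bio.PDB. The sketch below is not the PDBIO code. It only assumes what the traceback suggests, namely that the first and last parts of the residue ID end up in a %c format, which needs exactly one character. FakeResidue, renumber and insertion_code_field are hypothetical names made up for the illustration, and the code is written for Python 3 rather than the Python 2.5 used in the thread.

```python
class FakeResidue:
    """Hypothetical stand-in for a Bio.PDB residue; only the id matters here."""

    def __init__(self, number):
        self.id = (" ", number, " ")


def renumber(residues, start=100, blank=" "):
    """Number residues consecutively from start, as in the enumerate loop above."""
    for index, residue in enumerate(residues):
        residue.id = (blank, index + start, blank)


def insertion_code_field(residue):
    """Format the last part of the id with %c (assumed to mirror the failing line)."""
    return "%c" % residue.id[2]


residues = [FakeResidue(n) for n in range(1, 4)]

# Single spaces: every id formats cleanly.
renumber(residues)
numbers = [r.id[1] for r in residues]
fields = [insertion_code_field(r) for r in residues]

# Empty strings: %c has no character to format and raises TypeError.
renumber(residues, blank="")
try:
    insertion_code_field(residues[0])
    empty_string_fails = False
except TypeError:
    empty_string_fails = True

print(numbers, fields, empty_string_fails)
```

Using enumerate also avoids the index error mentioned earlier, because the new numbers are computed from the position of each residue instead of being looked up in a fixed-length list.
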
> Could you please suggest me some good tutorials for Bio.PDB
>
> Bala

If you haven't already done so, please see http://biopython.org/wiki/Documentation

First of all there is a whole chapter in the main Biopython Tutorial, included with the Biopython source code archives, and also available online:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html
http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf

Then there is also a separate document, which goes into more detail:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

There are also a few other examples elsewhere online, try Google.

Peter

P.S. Please CC the mailing list on your replies, so that the discussion is open, and archived for future readers.

From sudhir.cr at gmail.com Wed Dec 31 07:49:48 2008
From: sudhir.cr at gmail.com (sudhir cr)
Date: Wed, 31 Dec 2008 02:49:48 -0500
Subject: [BioPython] How to use Bio.Kegg.Compound Module
Message-ID:

Hello,

I am a newbie to python.

Can anyone please tell me how to use the Bio.Kegg.Compound Module to get the DBLinks from a KEGG Compound file?

Thanks in advance
Sudhir

From biopython at maubp.freeserve.co.uk Wed Dec 31 14:43:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 31 Dec 2008 14:43:39 +0000
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To:
References:
Message-ID: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>

On Wed, Dec 31, 2008 at 7:49 AM, sudhir cr wrote:
> Hello,
>
> I am a newbie to python.
>
> Can anyone please tell me how to use the Bio.Kegg.Compound Module to get the
> DBLinks from a KEGG Compound file?
>
> Thanks in advance
> Sudhir

Looking at the code, we do need to add some more to the KEGG docstrings.
However, I think you want to do something like this:

from Bio.KEGG import Compound
handle = open("my_kegg_file.txt")
for record in Compound.parse(handle) :
    print record.entry
    for database, links in record.dblinks :
        print database, links
handle.close()

Peter

From sudhir.cr at gmail.com Wed Dec 31 15:07:21 2008
From: sudhir.cr at gmail.com (sudhir cr)
Date: Wed, 31 Dec 2008 10:07:21 -0500
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
Message-ID:

Hello Peter,

Thanks for the quick reply. This code is working great.

P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS"

Thanks a lot,
Have a great New Year - 2009

Sudhir

On Wed, Dec 31, 2008 at 9:43 AM, Peter wrote:
> On Wed, Dec 31, 2008 at 7:49 AM, sudhir cr wrote:
> > Hello,
> >
> > I am a newbie to python.
> >
> > Can anyone please tell me how to use the Bio.Kegg.Compound Module to get the
> > DBLinks from a KEGG Compound file?
> >
> > Thanks in advance
> > Sudhir
>
> Looking at the code, we do need to add some more to the KEGG
> docstrings. However, I think you want to do something like this:
>
> from Bio.KEGG import Compound
> handle = open("my_kegg_file.txt")
> for record in Compound.parse(handle) :
>     print record.entry
>     for database, links in record.dblinks :
>         print database, links
> handle.close()
>
> Peter
>

From biopython at maubp.freeserve.co.uk Wed Dec 31 15:17:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 31 Dec 2008 15:17:39 +0000
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To:
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
Message-ID: <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>

On Wed, Dec 31, 2008 at 3:07 PM, sudhir cr wrote:
> Hello Peter,
>
> Thanks for the quick reply. This code is working great.

Great.
> P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS"

Do you have a link for this? If we need to update our parser could you file a bug on Bugzilla please?
http://bugzilla.open-bio.org/

Thanks,

Peter
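
One way to check which field label a particular file actually uses, before filing a bug, is to scan it for the keyword directly. The sketch below is not the Bio.KEGG parser. It assumes a flat-file layout in which field keywords sit in the first 12 columns and continuation lines leave those columns blank. dblinks_from_flatfile and field are hypothetical helper names, the sample record is typed by hand for illustration rather than downloaded from KEGG, and the code is written for Python 3.

```python
def dblinks_from_flatfile(text, keyword="DBLINKS"):
    """Collect (database, [ids]) pairs listed under keyword in one record.

    Assumption: keywords occupy columns 0-11, values start at column 12,
    and continuation lines have a blank keyword column.
    """
    links = []
    inside = False
    for line in text.splitlines():
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            # A non-blank keyword column starts a new field.
            inside = (key == keyword)
        if inside and value:
            database, _, ids = value.partition(":")
            links.append((database.strip(), ids.split()))
    return links


def field(keyword, value):
    """Build one line of the hand-made sample with a 12-column keyword."""
    return keyword.ljust(12) + value


# Illustrative sample only; not copied from a KEGG download.
sample = "\n".join([
    field("ENTRY", "C00001"),
    field("NAME", "H2O;"),
    field("", "Water"),
    field("DBLINKS", "CAS: 7732-18-5"),
    field("", "PubChem: 3303"),
    field("", "ChEBI: 15377"),
    "///",
])

links = dblinks_from_flatfile(sample)
other = dblinks_from_flatfile(sample, keyword="Other DBs")

for database, ids in links:
    print(database, ids)
```

If the DBLINKS call comes back empty for a real file while the "Other DBs" call does not, that file would be a useful attachment for the bug report.
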