From lueck at ipk-gatersleben.de Mon Nov 3 05:54:25 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 3 Nov 2008 11:54:25 +0100
Subject: [BioPython] ClustalW problem upwards Biopython 1.43
References: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com>
Message-ID: <003f01c93da2$89a5cab0$1022a8c0@ipkgatersleben.de>

I'm sorry! I'm using CLUSTAL W 1.8. It's my mistake because I work on
several PCs ;-)

Well, for the full path: I put the .exe into the folder of my programs
because I use it via Biopython. So usually I have it on my USB stick
(X:\MyProgram\clustalw.exe).

I'll try the code you gave me.

Thanks,
Stefanie

----- Original Message -----
From: "Peter"
To: "Stefanie Lück"
Cc:
Sent: Tuesday, October 28, 2008 12:20 PM
Subject: Re: [BioPython] ClustalW problem upwards Biopython 1.43

> Stefanie wrote:
>>
>> >>> print str(cline)
>>
>> clustalw pb.fasta -OUTFILE=test2.aln
>>
>> I'm using CLUSTAL W 2.0.
>
> Are you sure? The Clustal W 2.0 executable is normally called
> clustalw2.exe rather than clustalw.exe - so based on the command line
> above I would have expected Clustalw 1.x to be used. Maybe you have
> both versions of ClustalW installed?
>
> Could you tell me where exactly (full paths) you have Clustalw.exe
> and/or Clustalw2.exe installed? This would be helpful for the new
> unit test I'm working on.
>
>> Under DOS everything works fine.
>
> I've been having "fun" trying to get a new unit test for this to work
> nicely on Windows - there are certainly some combinations of file name
> arguments with spaces etc which won't work on Biopython 1.48. I found
> examples where the command line string run "by hand" at the "DOS"
> prompt worked fine, but would fail when invoked in python via os.popen
> - on the bright side, using subprocess.Popen instead works much better
> (although this isn't available for python 2.3).
>
> If you want to try this new code, I would suggest you first install
> Biopython 1.48, and then backup and update
> C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py to revision
> 1.25 from CVS which you can download here (should be updated within
> the hour):
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython
>
> Thanks!
>
> Peter
>

From lueck at ipk-gatersleben.de Mon Nov 3 05:56:37 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 3 Nov 2008 11:56:37 +0100
Subject: [BioPython] Sequence graph
References: <490B3267.5020501@pingoured.fr>
Message-ID: <004a01c93da2$d7e5aec0$1022a8c0@ipkgatersleben.de>

Hi!

I would be very much interested in this too! At the moment I do it
myself but it's quite nasty...

Does anyone have experience converting Perl code to Python? This would
be an option...

Thanks in advance!
Stefanie

----- Original Message -----
From: "Pierre-Yves"
To:
Sent: Friday, October 31, 2008 5:29 PM
Subject: [BioPython] Sequence graph

> Dear list,
>
> I am sorry to come here to ask this question that must have been already
> asked in the past, but my searches have been rather unsuccessful...
>
> I would like to reproduce such a graph:
> http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even if
> bioperl is nice I would like to do it through BioPython.
>
> I have thus two questions :
> * Is that possible ?
> * Could someone point me to an example ?
>
> Thanks in advance for your help,
>
> Best regards,
>
> Pierre
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From lpritc at scri.ac.uk Mon Nov 3 06:04:00 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 03 Nov 2008 11:04:00 +0000
Subject: [BioPython] Sequence graph
In-Reply-To: <490B3267.5020501@pingoured.fr>
Message-ID:

Hi Pierre-Yves

On 31/10/2008 16:29, "Pierre-Yves" wrote:

> I would like to reproduce such a graph:
> http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even
> if bioperl is nice I would like to do it through BioPython.
>
> I have thus two questions :
> * Is that possible ?
> * Could someone point me to an example ?

As far as I am aware there is not yet an equivalent of this graphical
output in Biopython, though I agree that the facility would be nice ;)

Robert Cadena and I are working on incorporating GenomeDiagram (
http://bioinf.scri.ac.uk/lp/programs.php) into Biopython. It works
differently to the Perl code, though you could make images rendering the
same information as in the BioPerl example you link to. Even though it's
not yet part of Biopython, it does play nicely with Biopython, so you
might like to try it out.

If you are specifically looking to create a graphical representation of
BLAST output, then I have a Python script that might be useful to you.
Please get in touch if you'd like a copy.

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by
guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views
expressed by the sender are not necessarily the views of SCRI and its
subsidiaries. This email and any files transmitted with it are
confidential to the intended recipient at the e-mail address to which it
has been addressed. It may not be disclosed or used by any other than
that addressee. If you are not the intended recipient you are requested
to preserve this confidentiality and you must not use, disclose, copy,
print or rely on this e-mail in any way. Please notify postmaster at
scri.ac.uk quoting the name of the sender and delete the email from your
system.

Although SCRI has taken reasonable precautions to ensure no viruses are
present in this email, neither the Institute nor the sender accepts any
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From pingou at pingoured.fr Mon Nov 3 06:07:23 2008
From: pingou at pingoured.fr (Pierre-Yves)
Date: Mon, 03 Nov 2008 12:07:23 +0100
Subject: [BioPython] Sequence graph
In-Reply-To:
References:
Message-ID: <490EDB6B.9080704@pingoured.fr>

Hi Leighton,

Leighton Pritchard wrote:
> Robert Cadena and I are working on incorporating GenomeDiagram (
> http://bioinf.scri.ac.uk/lp/programs.php) into Biopython. It works
> differently to the Perl code, though you could make images rendering the
> same information as in the BioPerl example you link to. Even though it's
> not yet part of Biopython, it does play nicely with Biopython, so you might
> like to try it out.

Thanks for the link I will have a look at it.

>
> If you are specifically looking to create a graphical representation of
> BLAST output, then I have a Python script that might be useful to you.
> Please get in touch if you'd like a copy.

It is not exactly BLAST output, but it is close to it, so yes, I would
be interested in your script.

Thanks again,

Best regards,

P.Yves

From biopython at maubp.freeserve.co.uk Mon Nov 3 06:30:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Nov 2008 11:30:30 +0000
Subject: [BioPython] ClustalW problem upwards Biopython 1.43
In-Reply-To: <003f01c93da2$89a5cab0$1022a8c0@ipkgatersleben.de>
References: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com>
 <003f01c93da2$89a5cab0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00811030330t1f80d3d4v48c84cafbe9f9377@mail.gmail.com>

On Mon, Nov 3, 2008 at 10:54 AM, Stefanie Lück wrote:
> I'm sorry! I'm using CLUSTAL W 1.8. It's my mistake because I work on
> several PC's ;-)

Easily done - I'm not sure if it matters or not here.

> I'll try the code you gave me.

Thanks - from my testing the code in CVS should be fine (except on
Python 2.3 for some filename combinations with spaces in them). It
would be great to confirm this works for you.

Peter

From dalloliogm at gmail.com Mon Nov 3 10:37:06 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 3 Nov 2008 16:37:06 +0100
Subject: [BioPython] [Biopython-dev] Statistics in population genetics
 module - Part I
In-Reply-To: <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com>
References: <6d941f120810301658wec8678ald332abb8ddbdf80d@mail.gmail.com>
 <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com>
Message-ID: <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com>

On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão wrote:

> Hi,
>
> Statistics is the most important part of population genetics modules.
> In fact one could say that statistics were invented FOR population
> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
> When I started to work on the population genetics module I decided to
> delay the statistics module a bit, in order to get experience with the
> whole biopython project before committing to do the most important
> thing.
> Irrespective of whether it is possible to link scipy or not, now seems
> to be the time to advance, especially considering that Giovanni is
> interested in participating.
> A few points need to be made before suggesting how to put
> statistics in Bio.PopGen
>
> 1. Whatever design is put in, it should be reasonably future proof: in
> a few releases it should not be a good idea to break older code. That
> should be avoided as much as possible.

For how long do you think a biopython module should be kept compatible
with older versions, more or less?
It will take a long time to develop the module, and we will surely make
some mistakes. So, what is the best way to proceed? What if we create a
separate biopython branch where we can test all the new features?
At the moment I am working with a separate git repository for all the
popgen modules. The problem is that I didn't include all biopython
modules in the repository, so if any of my changes breaks something in
biopython, I won't know it until I merge everything with the biopython
code.
On the other hand, if I include a biopython release in my popgen
repository, I won't be able to track changes made in biopython, and my
popgen code will be compatible with that version only.
I think git provides some options to handle this kind of situation... I
am not very used to cvs, so I don't know.

p.s. When python3000 is released, it will probably be necessary to
rewrite large portions of biopython, if not to create a 'biopython 2'
version (I think they were discussing something like this on bioperl's
list).
I thought that maybe, even if we make some 'mistakes' in this version of
biopython, we will be able to fix them in a later version.

>
> 2. It goes without saying that the code should be useful to everybody
> doing population genetics and not only the authors of Bio.PopGen: all
> kinds of markers and population structures should be accommodatable in
> the future.

I think a good idea would be to start collecting use cases, to get an
idea of how many things we'll have to implement in this module.
It would be useful to talk to the authors of similar modules in other
Bio.* projects, to see if they have some good suggestions.
I sent that mail to the Open::Bio::I last week, but still haven't
received many replies... I will send a message to the various Bio.*
mailing lists in the next few days.

>
> 3. For reasons that I've partially explained on the biopython list, I
> don't think a OO model explicitly based on individuals or populations
> is good (or even necessary)
> 4. Any framework should be more pragmatic than anything else. I would
> envision a typical use case like this
> a) read data (from a certain data source)
> b) Do some basic processing (changing individuals or populations,
> converting markers)
> c) calculate statistics
> A few comments regarding each of these points:
> a) data sources, file formats: file formats in population
> genetics exist in large quantities and are essentially completely
> ad-hoc, most made in a very naive way. Good or BAD, that is what there
> is. The most used format (some kind of de facto standard, GenePop) can
> only be used for frequency-based statistics, for all the rest things
> are fragmented (although, if there is no population structure and the
> data are sequences then standard sequence based formats can be used -
> but from my experience this is a small minority)
> b) basic processing: This is the point where a OO model of
> individuals and populations would pay, but I think it is not the "meat
> of the issue"
> c) statistics: there are of every type and for every taste. If
> you want to have an idea of what is out there an interesting place to
> look at is the arlequin3 manual:
> http://cmpg.unibe.ch/software/arlequin3/arlequin31.pdf
> (part of the manual is UI description, but especially starting at page
> 89 - the table there is a good overview - there are descriptions of
> the overall panorama).

What if we create some very generic objects, like:

Population
    self._to_popgen_input    -> represents population as an input to
                                popgen ([Pop1, (Ind1, Ind2...)])
    self._to_othertool_input -> represents population as an input to
                                another tool

Thanks for the link to the arlequin3 manual, it seems very informative.

>
> With time, and after at least 3 failed attempts to think in terms of
> individuals/populations I started to crystallize around a model
> centered on types of statistics. This model ends up actually having
> implicit models of populations and individuals, and that is, in fact,
> there. It is just implicit and not unified: different kinds of
> statistics have different implicit models.
> The model that I would like to propose, centered around statistics,
> will be the subject of my next email (which I will send in the next
> couple of days - still under design and lost sleep). I might split it
> in 2 parts (concepts and suggestions for implementation).
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it

From bsouthey at gmail.com Mon Nov 3 15:50:14 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Mon, 03 Nov 2008 14:50:14 -0600
Subject: [BioPython] [Biopython-dev] Statistics in population genetics
 module - Part I
In-Reply-To: <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com>
References: <6d941f120810301658wec8678ald332abb8ddbdf80d@mail.gmail.com>
 <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com>
 <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com>
Message-ID: <490F6406.5030800@gmail.com>

Giovanni Marco Dall'Olio wrote:
> On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão wrote:
>
>
>> Hi,
>>
>> Statistics is the most important part of population genetics modules.
>> In fact one could say that statistics were invented FOR population
>> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
>> When I started to work on the population genetics module I decided to
>> delay the statistics module a bit, in order to get experience with the
>> whole biopython project before committing to do the most important
>> thing.
>> Irrespective of whether it is possible to link scipy or not, now seems
>> to be the time to advance, especially considering that Giovanni is
>> interested in participating.
>> A few points need to be made before suggesting how to put
>> statistics in Bio.PopGen
>>
>> 1. Whatever design is put in, it should be reasonably future proof: in
>> a few releases it should not be a good idea to break older code. That
>> should be avoided as much as possible.
>>
>
>
> For how long do you think a biopython module should be kept compatible
> with older versions, more or less?
> It will take a long time to develop the module, and we will surely make
> some mistakes. So, what is the best way to proceed? What if we create a
> separate biopython branch where we can test all the new features?
> At the moment I am working with a separate git repository for all the
> popgen modules. The problem is that I didn't include all biopython modules
> in the repository, so if any of my changes breaks something in biopython, I
> won't know it until I merge everything with the biopython code.
> On the other hand, if I include a biopython release in my popgen repository,
> I won't be able to track changes made in biopython, and my popgen code will
> be compatible with that version only.
> I think git provides some options to handle this kind of situation... I am
> not very used to cvs, so I don't know.
>

If you have modified a Biopython module, you should probably see whether
it is acceptable to change the main Biopython distribution (especially
if it involves an API change) or else modify your code, because I do not
think it is a good idea to have different versions of the same Biopython
module or any name clashes with Biopython. Otherwise, you just need to
check that it runs with a very recent version of Biopython (and under
the Python versions that Biopython supports).

If you have not done so, I would suggest developing unit tests that not
only ensure code accuracy but also maintain future compatibility. A
failed test will indicate some problem that needs resolving, and the
solution will mean that the code will be made compatible if necessary.

> p.s. When python3000 is released, it will probably be necessary to
> rewrite large portions of biopython, if not to create a 'biopython 2' version
> (I think they were discussing something like this on bioperl's list).
> I thought that maybe, even if we make some 'mistakes' in this version of
> biopython, we will be able to fix them in a later version.
>

Python 3 cannot be discussed until all incompatible modules like numpy
or Biopython can be used under Python 3 (rc1 is available). Further, the
advice from above (see Guido's blog
http://www.artima.com/weblogs/viewpost.jsp?thread=227041) is that the
conversion should be a direct port without any changes, especially API
ones. So correcting any major 'mistakes' in the existing module probably
will not be acceptable to the community. Further, any correction at any
time to the main distribution is not trivial, especially as you must
first get the users informed (I saw that with changing histogram in
numpy).

There is a lot of flexibility in a separate project that you will lose
when a project is widely released or included in a well-established
project like Biopython. I think that you should maintain a separate
project of some type until everything is sufficiently acceptable to the
Biopython community. This gives sufficient time to address various
concerns and enables an easy integration.

Finally, if you require dependencies beyond those currently required by
Biopython (especially something like scipy) then I think it will be very
hard or impossible for you to get any code associated with these
dependencies into Biopython.

Just my opinions on your questions,

Bruce

From dalloliogm at gmail.com Tue Nov 4 05:58:41 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 4 Nov 2008 11:58:41 +0100
Subject: [BioPython] Proposal: doctest for biopython
Message-ID: <5aa3b3570811040258h57803b7fy9a5f8b32e6f6982e@mail.gmail.com>

Hi!!
I would like to propose using doctest tests in biopython.
I found them very useful for understanding how a script should be used,
and moreover they can act as unit tests.

I have just posted a patch file that adds doctest documentation to
Bio/SeqRecordIO:
- http://bugzilla.open-bio.org/show_bug.cgi?id=2640

What do you think of it?
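
To give an idea of the pattern, here is a minimal sketch (not taken from
the patch; gc_content_example is a made-up function used only for
illustration) of a function whose docstring doubles as a test, plus a
_test() helper that runs it through doctest.testmod():

```python
import doctest


def gc_content_example(seq):
    """Return the fraction of G and C letters in a sequence string.

    Hypothetical helper, used only to demonstrate doctest.

    >>> gc_content_example("GGCC")
    1.0
    >>> gc_content_example("ATGC")
    0.5
    >>> gc_content_example("")
    0.0
    """
    if not seq:
        # Avoid dividing by zero for an empty sequence.
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / float(len(seq))


def _test():
    """Run every example found in this module's docstrings."""
    return doctest.testmod()


if __name__ == "__main__":
    _test()
```

Running the file directly prints nothing when all the examples pass;
adding -v on the command line makes doctest list each example it tried.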

Here is the main documentation for doctest:
- http://www.python.org/doc/2.5.2/lib/module-doctest.html

Usually, you add a _test() function to every module, which calls the
doctest library, and launch it under if __name__ == '__main__'.
The most significant example is added to the documentation string of
every module/function, and tested with doctest.testmod(); later, you add
more tests in a separate file, and launch them with doctest.testfile().

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopythonlist at gmail.com Fri Nov 7 11:26:03 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Fri, 7 Nov 2008 17:26:03 +0100
Subject: [BioPython] Parsing ACE files
Message-ID: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>

I don't know why but I cannot search in the mailing list using the
search link (http://search.open-bio.org/).

I've seen in the documentation that Bio.SeqIO can read ace files and
uses Bio.Sequencing.Ace.

After reading module Bio.SeqIO.AceIO it remains unclear to me how to use
it. Could anybody tell me how to parse ACE files? Is there a tutorial or
example to look at?

Thank you very much!

From biopython at maubp.freeserve.co.uk Fri Nov 7 11:33:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 Nov 2008 16:33:39 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
Message-ID: <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>

On Fri, Nov 7, 2008 at 4:26 PM, dr goettel wrote:
> I don't know why but I cannot search in the mailing list using the search
> link (http://search.open-bio.org/).

That's odd - it used to work OK...

> I've seen in the documentation that the Bio.SeqIO can read ace files and
> uses Bio.Sequencing.Ace.

Yes.
Depending on what information you want from the ACE files, you
might be better off using Bio.Sequencing.Ace directly. Using
Bio.SeqIO may not expose all the details you want (I'd have to check
the details - it's not fresh in my mind).

> After reading module Bio.SeqIO.AceIO it remains unclear for me how to use
> it. Could anybody tell me how to parse ACE files? is there a tutorial or
> example to look at?

You would typically use the Bio.SeqIO.parse() function (which will call
Bio.SeqIO.AceIO internally). See Chapter 4 of the tutorial on Bio.SeqIO,
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Or the Bio.SeqIO wiki page,
http://biopython.org/wiki/SeqIO

Peter

From biopythonlist at gmail.com Fri Nov 7 12:02:30 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Fri, 7 Nov 2008 18:02:30 +0100
Subject: [BioPython] Parsing ACE files
In-Reply-To: <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
 <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
Message-ID: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>

Thank you!

>
> Yes. Depending on what information you want from the ACE files, you
> might be better off using Bio.Sequencing.Ace directly.

Any example or tutorial for this solution?

> Using Bio.SeqIO may not expose all the details you want (I'd have to check
> the details - its not fresh in my mind).
>

You are right, it's not all the information I need.

Peter
>

Cheers!

From biopython at maubp.freeserve.co.uk Fri Nov 7 12:17:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 Nov 2008 17:17:01 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
 <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
 <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID: <320fb6e00811070917h74843b41tae82e53180ba080d@mail.gmail.com>

On Fri, Nov 7, 2008 at 5:02 PM, dr goettel wrote:
> Thank you!
>
>> Yes. Depending on what information you want from the ACE files, you
>> might be better off using Bio.Sequencing.Ace directly.
>
> Any example or tutorial for this solution?
>

The bad news is I don't think this is covered in the Biopython
Tutorial. However, there are some quite detailed built-in docstrings.
From within python, you can access the documentation via the python
help function:

>>> from Bio.Sequencing import Ace
>>> help(Ace)
...

These are also available online on our API pages (for the current
release):
http://biopython.org/DIST/docs/api/
http://biopython.org/DIST/docs/api/Bio.Sequencing.Ace-module.html

However, you'll see this is quite a low level parser and it helps to
know what the two letter line types mean (consult the ACE
documentation).

>> Using Bio.SeqIO may not expose all the details you want (I'd have to check
>> the details - its not fresh in my mind).
>
> You are right, it's not all the information I need.

I wrote the Bio.SeqIO wrapper for the ACE parser, so it might be
possible to extend this to capture more information. What in particular
do you want to extract?

> Cheers!

Sure! By the way - make sure you are using Biopython 1.48 or later, as
Bio.Sequencing.Ace was switched to a more modern python iterator style
then.

Peter

From sbassi at gmail.com Sat Nov 8 15:35:03 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 8 Nov 2008 18:35:03 -0200
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
 <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
 <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID:

On Fri, Nov 7, 2008 at 3:02 PM, dr goettel wrote:
>> Yes. Depending on what information you want from the ACE files, you
>> might be better off using Bio.Sequencing.Ace directly.
> Any example or tutorial for this solution?

Here is something I wrote some time back; I hope it still works:

from Bio.Sequencing import Ace
aceparser = Ace.ACEParser()
fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
acefilerecord = aceparser.parse(open(fn))
# For each contig:
for ctg in acefilerecord.contigs:
    print '=========================================='
    print 'Contig name: %s'%ctg.name
    print 'Bases: %s'%ctg.nbases
    print 'Reads: %s'%ctg.nreads
    print 'Segments: %s'%ctg.nsegments
    print 'Sequence: %s'%ctg.sequence
    print 'Quality: %s'%ctg.quality
    # For each read in contig:
    for read in ctg.reads:
        print 'Read name: %s'%read.rd.name
        print 'Align start: %s'%read.qa.align_clipping_start
        print 'Align end: %s'%read.qa.align_clipping_end
        print 'Read sequence: %s'%read.rd.sequence
    print '=========================================='

From rodrigo_faccioli at uol.com.br Sun Nov 9 09:18:16 2008
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Sun, 9 Nov 2008 12:18:16 -0200
Subject: [BioPython] PDB file - Validation and WebService
Message-ID: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com>

Hello,

I am a very new BioPython member and I have heard only good news about
the BioPython project.

So, I have two questions:

1. Does the Bio.PDB module check a PDB file the way
http://deposit.rcsb.org/cgi-bin/validate/adit-session-driver does?
If not, are there other software options?

2. I read about the PDB web service on the PDB website. Does the
BioPython project support it? I ask because I read
http://www.rcsb.org/robohelp/webservices/summary.htm and there is an
option with Python.

Thanks for any support.

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk Sun Nov 9 10:16:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 9 Nov 2008 15:16:50 +0000
Subject: [BioPython] Biopython 1.49 beta released
Message-ID: <320fb6e00811090716v58637d55o470246df4175464e@mail.gmail.com>

Dear Biopythoneers,

We are pleased to announce a beta release of Biopython 1.49. There have
been some significant changes since Biopython 1.48 was released two
months ago, which is why we are initially releasing a beta for wider
testing.

As previously announced, the big news is that Biopython now uses NumPy
rather than its precursor Numeric (the original Numerical Python
library).

As in the previous releases, Biopython 1.49 beta supports Python 2.3,
2.4 and 2.5 but should now also work fine on Python 2.6. Please note
that we intend to drop support for Python 2.3 in a couple of releases
time.

We also have some new functionality, starting with the basic sequence
object (the Seq class) which now has more methods. This encourages a
more object orientated coding style, and makes basic biological
operations like transcription and translation more accessible and
discoverable.

Our BioSQL interface can now optionally fetch the NCBI taxonomy on
demand when loading sequences (via Bio.Entrez) allowing you to populate
the taxon/taxon_name tables gradually.
Also, BioSQL should now work with the psycopg2 driver for PostgreSQL
(as well as the older psycopg driver).

Finally, our old parsing infrastructure (Martel and Bio.Mindy) is now
considered to be deprecated, meaning mxTextTools is no longer required
to use Biopython. This should not affect any of the typically used
parsers (e.g. Bio.SeqIO and Bio.AlignIO).

So, if you are feeling brave and know the risks, please try out
Biopython 1.49 beta, and let us know on the mailing lists if it works,
or more importantly if something doesn't. We'd also like feedback on
the updated Biopython Tutorial and Cookbook:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Source distributions and Windows installers are available from the
Biopython website:

http://biopython.org/wiki/Download

Thanks!

-Peter on behalf of the Biopython developers

P.S. Those of you subscribed to our news feed would have seen this
announcement already. For RSS links etc, see:
http://biopython.org/wiki/News

From biopython at maubp.freeserve.co.uk Sun Nov 9 10:26:02 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 9 Nov 2008 15:26:02 +0000
Subject: [BioPython] PDB file - Validation and WebService
In-Reply-To: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com>
References: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com>
Message-ID: <320fb6e00811090726k2bef0c78t99c8909d781fb12@mail.gmail.com>

On Sun, Nov 9, 2008 at 2:18 PM, Rodrigo faccioli wrote:
> Hello,
>
> I am a very new BioPython member and I have heard only good news about
> the BioPython project.
>
> So, I have two questions:
>
>
> 1. Does the Bio.PDB module check a PDB file the way
> http://deposit.rcsb.org/cgi-bin/validate/adit-session-driver does? If not,
> are there other software options?

Bio.PDB can do some validation (it has an optional strict mode for
parsing). I don't know if this checks the same things as ADIT.

Have you looked at downloading ADIT itself?
http://sw-tools.pdb.org/apps/ADIT/index.html

> 2. I read about the PDB web service on the PDB website. Does the
> BioPython project support it? I ask because I read
> http://www.rcsb.org/robohelp/webservices/summary.htm and there is an
> option with Python.

I've not looked at that before - all it seems to be at the moment is a
way to run BLAST against the PDB database. Assuming their XML BLAST
output is compatible with the NCBI's you should be able to use the
Bio.Blast.NCBIXML parser on the results.

I just tried their python example on Linux with Python 2.4.3 but it
failed - perhaps my version of SOAP is out of date? Note they have a
stray semicolon at the end of the call which shouldn't be there. See
http://www.rcsb.org/robohelp/webservices/samples/python_samples.htm

On the other hand, you could just use the NCBI's qblast (via Biopython
if you like) to run an online BLAST search against the PDB database.
So I don't see what benefit using the PDB's server offers - unless they
plan additional functionality in future.

Peter

From dalloliogm at gmail.com Mon Nov 10 06:04:10 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 10 Nov 2008 12:04:10 +0100
Subject: [BioPython] annotations in an Alignment object
Message-ID: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com>

Is there any way to store some annotations in an Alignment object??
For example: the alignment tool used, its parameters, its version, the
date, and the nature of the sequence aligned.

I am asking this because I would like to write a module to create
ldhat input files from an alignment program.
A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html)
is very similar to a fasta file; the only difference is that in its
first line, it contains three numbers, one of which can't always be
inferred from the data.

I have looked at Bio.Align.Generic's code, but I am not sure.
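
For illustration only, writing such a file by hand could look roughly
like the sketch below. It assumes the three numbers on the first line
are the number of sequences, the number of sites, and a 1/2 flag for
haplotype versus genotype data (please check this against the LDhat
instructions); write_ldhat and data_flag are hypothetical names, not
part of Biopython.

```python
import io


def write_ldhat(records, handle, data_flag=1):
    """Write (name, sequence) pairs in an LDhat-like "sites" layout.

    Assumption (not confirmed here): the header line is
    "<n_sequences> <n_sites> <flag>", followed by FASTA-style entries.
    data_flag is the value that cannot be derived from the alignment
    itself, so the caller has to supply it.
    """
    records = list(records)
    if not records:
        raise ValueError("no sequences to write")
    n_sites = len(records[0][1])
    for name, seq in records:
        # All rows of an alignment must have the same length.
        if len(seq) != n_sites:
            raise ValueError("sequence %s has length %d, expected %d"
                             % (name, len(seq), n_sites))
    handle.write("%d %d %d\n" % (len(records), n_sites, data_flag))
    for name, seq in records:
        handle.write(">%s\n%s\n" % (name, seq))


# Small usage example with two made-up sequences.
example_handle = io.StringIO()
write_ldhat([("seq1", "ACGTACGT"), ("seq2", "ACGTTCGT")],
            example_handle, data_flag=1)
ldhat_text = example_handle.getvalue()
print(ldhat_text)
```

Passing the flag in as an argument is one way round the missing
annotation: the alignment supplies the first two numbers, and the third
travels alongside it.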
--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Mon Nov 10 06:15:52 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Nov 2008 11:15:52 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To:
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
 <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
 <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID: <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com>

> Here is something I wrote some time back I hope it still works:
>
> from Bio.Sequencing import Ace
> aceparser = Ace.ACEParser()
> fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
> acefilerecord = aceparser.parse(open(fn))
> # For each contig:
> for ctg in acefilerecord.contigs:
> ....

I guess I'm the bearer of bad news - the ACEParser object (with its
iterator method) was deprecated in Biopython 1.48, in favour of a
simple function calls read and parse (the DEPRECATED file didn't
mention this, an oversight I've just rectified). Your code needs a
small update:

from Bio.Sequencing import Ace
fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
acefilerecord=Ace.read(open(fn))
# For each contig:
for ctg in acefilerecord.contigs:
    print '=========================================='
    print 'Contig name: %s'%ctg.name
    print 'Bases: %s'%ctg.nbases
    print 'Reads: %s'%ctg.nreads
    print 'Segments: %s'%ctg.nsegments
    print 'Sequence: %s'%ctg.sequence
    print 'Quality: %s'%ctg.quality
    # For each read in contig:
    for read in ctg.reads:
        print 'Read name: %s'%read.rd.name
        print 'Align start: %s'%read.qa.align_clipping_start
        print 'Align end: %s'%read.qa.align_clipping_end
        print 'Read sequence: %s'%read.rd.sequence
        print '=========================================='

If you try the old code on Biopython 1.48 or 1.49b you should get a
deprecation warning suggesting this change.
Or, you can use Ace.parse(open(fn)) to iterate over the contigs directly (assuming you don't care about the WA, CT, RT and WR tags which may be at the end of the file). Peter From biopython at maubp.freeserve.co.uk Mon Nov 10 06:28:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Nov 2008 11:28:00 +0000 Subject: [BioPython] annotations in an Alignment object In-Reply-To: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com> References: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com> Message-ID: <320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com> On Mon, Nov 10, 2008 at 11:04 AM, Giovanni Marco Dall'Olio wrote: > Is there any way to store some annotations in an Alignment object?? > For example: the alignment tool used, its parameters, its version, the > date, and the nature of the sequence aligned. Not officially, no. This is on my mental list of things to do with the alignment object (after Biopython 1.49 is done). I've CC'd the dev-mailing list which is probably a better place to discuss the details. If you look at Bio/AlignIO/StockholmIO.py or the Bio/AlignIO/FastaIO.py code you'll see I've recorded this kind of information in a private dictionary, i.e. alignment._annotations. This makes the data available if anyone really needs it, but signals that this is not part of the public API and is likely to change. As part of an alignment annotation enhancement, we should try and establish some agreed standards for naming annotation entries (and also counting systems). > I am asking this because I would like to write a module to create > ldhat input files from an alignment program. > A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html) > is very similar to a fasta file; the only difference is that in its > first line, it contains three numbers, one of which can't always be > inferred by the data. Why go to the trouble of making a new Bio.AlignIO module? 
For this example from the LDhat manual, it looks like a FASTA file with an
extra header:

4 10 1
>SampleA
TCCGC??RTT
>SampleB
TACGC??GTA
>SampleC
TC?-CTTGTA
>SampleD
TCC-CTTGTT

Rather than writing support for a whole new file format, wouldn't it
be easier to do something like this:

alignment = ...
number_a = 4
number_b = 10
number_c = 1

handle = open("example.txt","w")
handle.write("%i %i %i\n" % (number_a, number_b, number_c))
handle.write(alignment.format("fasta"))
handle.close()

Peter

From dalloliogm at gmail.com Mon Nov 10 06:42:31 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 10 Nov 2008 12:42:31 +0100
Subject: [BioPython] annotations in an Alignment object
In-Reply-To: <320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com>
References: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com>
 <320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com>
Message-ID: <5aa3b3570811100342t7c23c0fl2b101be3fd352159@mail.gmail.com>

On Mon, Nov 10, 2008 at 12:28 PM, Peter wrote:
> On Mon, Nov 10, 2008 at 11:04 AM, Giovanni Marco Dall'Olio
> wrote:
>> Is there any way to store some annotations in an Alignment object??
>> For example: the alignment tool used, its parameters, its version, the
>> date, and the nature of the sequence aligned.
>
> Not officially, no. This is on my mental list of things to do with
> the alignment object (after Biopython 1.49 is done). I've CC'd the
> dev-mailing list which is probably a better place to discuss the
> details.
>
> If you look at Bio/AlignIO/StockholmIO.py or the
> Bio/AlignIO/FastaIO.py code you'll see I've recorded this kind of
> information in a private dictionary, i.e. alignment._annotations.
> This makes the data available if anyone really needs it, but signals
> that this is not part of the public API and is likely to change.
> > As part of an alignment annotation enhancement, we should try and > establish some agreed standards for naming annotation entries (and > also counting systems). ok... I will use the private dictionary for my own implementation. Unfortunately I don't have any useful suggestion for this.. >> I am asking this because I would like to write a module to create >> ldhat input files from an alignment program. >> A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html) >> is very similar to a fasta file; the only difference is that in its >> first line, it contains three numbers, one of which can't always be >> inferred by the data. > > Why go to the trouble of making a new Bio.AlignIO module? For this > example from the LDhat manual, it looks like a FASTA file with an > extra header: Yeah.. of course :) Let's say I am simply playing with biopython's code, to better understand it. Since I am going to use this function many times, I will have to write a module for it any way. The first number in the ldhat file is the number of sequences, the second is their length, and the third should be usually one in an alignment object, I suppose. > > 4 10 1 >>SampleA > TCCGC??RTT >>SampleB > TACGC??GTA >>SampleC > TC?-CTTGTA >>SampleD > TCC-CTTGTT > > Rather than writing support for a whole new file format, wouldn't it > be easier to do something like this: > > alignment = ... > number_a = 4 > number_b = 10 > number_c = 1 > > handle = open("example.txt","w") > handle.write("%i %i %i\n" % (number_a, number_b, number_c)) > handle.write(alignment.format("fasta")) > handle.close() > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From paul at rudin.co.uk Mon Nov 10 11:59:27 2008 From: paul at rudin.co.uk (Paul Rudin) Date: Mon, 10 Nov 2008 16:59:27 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines Message-ID: <87myg75xg0.fsf@rudin.co.uk> I'm looking at the biopython KDTree class. 
It requires arrays with dtype=="float32". I can make such arrays, but on 64 bit machines it's more natural (and the default for numpy float arrays) to have "float64". From biopython at maubp.freeserve.co.uk Mon Nov 10 14:00:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Nov 2008 19:00:14 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines In-Reply-To: <87myg75xg0.fsf@rudin.co.uk> References: <87myg75xg0.fsf@rudin.co.uk> Message-ID: <320fb6e00811101100k4a5eee48w98e3c993c23d4bf9@mail.gmail.com> On Mon, Nov 10, 2008 at 4:59 PM, Paul Rudin wrote: > > I'm looking at the biopython KDTree class. It requires arrays with > dtype=="float32". I can make such arrays, but on 64 bit machines it's > more natural (and the default for numpy float arrays) to have "float64". > You're looking at the code for Biopython 1.49b or CVS right? i.e. Bio/KDTree/KDTree.py CVS revision 1.10 or 1.11 [For Biopython 1.48 and older we used Numeric, which just had "f" as the type.] Hopefully Michiel can explain if there was a particular reason for choosing "float32", but on the face of it following the numpy default would seem sensible. Would you like to file a bug on this issue? Peter From paul at rudin.co.uk Mon Nov 10 14:19:46 2008 From: paul at rudin.co.uk (Paul Rudin) Date: Mon, 10 Nov 2008 19:19:46 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines References: <87myg75xg0.fsf@rudin.co.uk> <320fb6e00811101100k4a5eee48w98e3c993c23d4bf9@mail.gmail.com> Message-ID: <874p2f5qy5.fsf@rudin.co.uk> Peter writes: > On Mon, Nov 10, 2008 at 4:59 PM, Paul Rudin wrote: >> >> I'm looking at the biopython KDTree class. It requires arrays with >> dtype=="float32". I can make such arrays, but on 64 bit machines it's >> more natural (and the default for numpy float arrays) to have "float64". >> > > You're looking at the code for Biopython 1.49b or CVS right? > i.e. 
Bio/KDTree/KDTree.py CVS revision 1.10 or 1.11 Yes - I installed with: "sudo easy_install -f http://biopython.org/DIST/biopython", which seems to be 1.49 beta at the time of writing. > > [For Biopython 1.48 and older we used Numeric, which just had "f" as the type.] > > Hopefully Michiel can explain if there was a particular reason for > choosing "float32", but on the face of it following the numpy default > would seem sensible. > Would you like to file a bug on this issue? OK, I will. From sbassi at gmail.com Tue Nov 11 06:21:50 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 11 Nov 2008 09:21:50 -0200 Subject: [BioPython] Parsing ACE files In-Reply-To: <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com> References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com> <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com> <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com> <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com> Message-ID: On Mon, Nov 10, 2008 at 9:15 AM, Peter wrote: > I guess I'm the bearer of bad news - the ACEParser object (with its > iterator method) was deprecated in Biopython 1.48, in favour of a Not at all, it is good to know that I was doing something wrong (maybe not wrong now but sure it was going to be an issue later). > simple function calls read and parse (the DEPRECATED file didn't > mention this, an oversight I've just rectified). Your code needs a At least it was worth to correct that file. Thank you for the correction to my code. Best, -- Vendo isla: http://www.genesdigitales.com/isla/ Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 "It is pitch black. You are likely to be eaten by a grue." 
-- Zork

From cy at cymon.org Tue Nov 11 07:39:05 2008
From: cy at cymon.org (Cymon Cox)
Date: Tue, 11 Nov 2008 12:39:05 +0000
Subject: [BioPython] Cannot __add__ two DBSeq objects
Message-ID: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>

Hi All,

Two DBSeq objects cannot be concatenated, although the DBSeq object inherits
__add__ from Seq. It tries to init a new DBSeq object rather than returning a
Seq object as would be expected.

>>> s1
DBSeq('CTAAGCCATTCTACGACGTAGAATGAGCGTGTCACTGTATTTACGTCTCTTTCG...GGT', DNAAlphabet())
>>> s2
DBSeq('ACTCAAGGGTGAAGTATTTCCAATCGAAATAGGTGCTTCTATACCGGAAATAAT...CAT', DNAAlphabet())
>>> s1 + s2
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.5/site-packages/Bio/Seq.py", line 144, in __add__
    return self.__class__(str(self) + str(other), a)
TypeError: __init__() takes exactly 6 arguments (3 given)

Presumably, DBSeq needs to override Seq.__add__
(Using CVS as of yesterday...)

Cheers, C.
--

From biopython at maubp.freeserve.co.uk Tue Nov 11 08:02:18 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Nov 2008 13:02:18 +0000
Subject: [BioPython] Cannot __add__ two DBSeq objects
In-Reply-To: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>
References: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>
Message-ID: <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com>

On Tue, Nov 11, 2008 at 12:39 PM, Cymon Cox wrote:
> Hi All,
>
> Two DBSeq objects cannot be concatenated, although the DBSeq object inherits
> __add__ from Seq.

Interesting point - not something I'd considered (nor anyone else until now!)

> It tries to init a new DBSeq object rather than returning a Seq object as would be expected.
> ...
> Presumably, DBSeq needs to override Seq.__add__
> (Using CVS as of yesterday...)

Clearly we can't create a new DBSeq object (there wouldn't be any
suitable sequence in the database to point to), and returning a Seq
object is sensible.
We should probably continue this discussion on the dev mailing list (CC'd). Either we have the DBSeq override the __add__ method (and __radd__), or we could make the base Seq class always use new Seq objects in __add__ etc. This would affect anyone writing their own Seq subclass... On balance, I think you're right and its DBSeq which needs to be changed. Would you like to tackle this, or should I? We'd also want to extend the BioSQL unit test to cover adding DBSeq+DBSeq, DBSeq+Seq, Seq+DBSeq, DBSeq+MutableSeq, MutableSeq+DBSeq, etc. Peter From biopython at maubp.freeserve.co.uk Tue Nov 11 09:53:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Nov 2008 14:53:32 +0000 Subject: [BioPython] Cannot __add__ two DBSeq objects In-Reply-To: <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com> References: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com> <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com> Message-ID: <320fb6e00811110653u63e85bc6k572d5fa42ede8280@mail.gmail.com> On Tue, Nov 11, 2008 at 1:02 PM, Peter wrote: > On Tue, Nov 11, 2008 at 12:39 PM, Cymon Cox wrote: >> Hi All, >> >> Two DBSeq objects cannot be concatenated, although the DBSeq object inherits >> __add__ from Seq. > > Interesting point - not something I'd considered (nor anyone else until now!) > >> It tries to init a new DBSeq object rather than returning a Seq object as would be expected. >> ... >> Presumably, DBSeq needs to overide Seq.__add__ >> (Using CVS as of yesterday...) > > Clearly we can't create a new DBSeq object (there wouldn't be any > suitable sequence in the database to point to), and returning a Seq > object is sensible. We should probably continue this discussion on > the dev mailing list (CC'd). Fixed in CVS by implementing the __add__ and __radd__ methods in the DBSeq object, and having these simply off load the work to the Seq class. 
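As a rough illustration of that delegation idea (a minimal sketch only: PlainSeq and DbBackedSeq are hypothetical stand-ins, not the actual Bio.Seq or BioSQL classes, and the real constructor arguments differ), something along these lines returns a plain sequence object from both operators:

```python
class PlainSeq(object):
    """Hypothetical stand-in for a Seq-like class (not Bio.Seq.Seq)."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def __add__(self, other):
        # Same pattern as in the traceback: the result is built through
        # self.__class__, which breaks for subclasses whose __init__
        # takes extra arguments.
        return self.__class__(str(self) + str(other))

    def __radd__(self, other):
        return self.__class__(str(other) + str(self))


class DbBackedSeq(PlainSeq):
    """Hypothetical database-backed sequence with a richer constructor."""

    def __init__(self, primary_id, adaptor, data):
        PlainSeq.__init__(self, data)
        self.primary_id = primary_id  # assumed field, for illustration only
        self.adaptor = adaptor        # assumed field, for illustration only

    def to_plain(self):
        # Copy the sequence data into an in-memory PlainSeq.
        return PlainSeq(str(self))

    def __add__(self, other):
        # Off-load the work to the base class via a plain copy.
        return self.to_plain() + other

    def __radd__(self, other):
        return other + self.to_plain()


s1 = DbBackedSeq(1, None, "ACGT")
s2 = DbBackedSeq(2, None, "TT")
combined = s1 + s2  # a PlainSeq, not a DbBackedSeq
```

Because DbBackedSeq is a subclass that overrides __radd__, Python tries that reflected method first when a PlainSeq is on the left, so mixed additions in either order also end up as a PlainSeq.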
See: BioSQL/BioSeq.py revision: 1.28 Tests/test_BioSQL.py revision: 1.26 Tests/output/test_BioSQL revision: 1.2 Peter From dalloliogm at gmail.com Wed Nov 12 11:25:47 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 12 Nov 2008 17:25:47 +0100 Subject: [BioPython] a sequence set object in biopython? Message-ID: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> Hi, I think it could be useful to add a generic SequenceSet object in biopython. Such an object would represent a generic set of sequences, and could have some useful methods like .format('fasta') or .align('alignment_tool'). Is there something similar available already? I have noticed that the actual Generic.Alignment is very similar to such an object. However, it would be better to be able to work with a separated class, because sometimes you want to deal with sequences that are not aligned. Some use cases: - a set of sequences that represents all introns in a particular gene, on which I want to calculate the conservation of the splicing regulatory sites. - all genes sequences in an organisms, which I want to convert in EMBL format - a set of seqs to be aligned or used as input for other tools etc.. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Nov 12 12:53:35 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Nov 2008 17:53:35 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> Message-ID: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio wrote: > Hi, > I think it could be useful to add a generic SequenceSet object in biopython. 
> Such an object would represent a generic set of sequences, and could > have some useful methods like .format('fasta') or > .align('alignment_tool'). > Is there something similar available already? Given your example to turn the SequenceSet into a FASTA file, then clearly you are thinking of a collection of SeqRecord objects rather than just Seq objects. For this kind of thing I personally just use a list of SeqRecord objects. If I want to turn a list of SeqRecord objects into a FASTA file, I can pass the list to the Bio.SeqIO.write() function. Once I've made a FASTA file, I can call an external tool to align them - and then load them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan to do next. > I have noticed that the actual Generic.Alignment is very similar to > such an object. However, it would be better to be able to work with a > separated class, because sometimes you want to deal with sequences > that are not aligned. Yes, the generic alignment is basically a list of SeqRecord objects plus some extra functionality like column access. > Some use cases: > - a set of sequences that represents all introns in a particular gene, > on which I want to calculate the conservation of the splicing > regulatory sites. > - all genes sequences in an organisms, which I want to convert in EMBL format > - a set of seqs to be aligned or used as input for other tools > etc.. All sensible use cases - but all seem to be covered by a simple python list of SeqRecord objects, or in some cases a list of Seq objects (e.g. the introns example, as I doube the introns have names). Peter From biopython at maubp.freeserve.co.uk Wed Nov 12 13:06:19 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Nov 2008 18:06:19 +0000 Subject: [BioPython] a sequence set object in biopython? 
In-Reply-To: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com>
 <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
Message-ID: <320fb6e00811121006mbe32efar2fca638d1a5fe2ef@mail.gmail.com>

On Wed, Nov 12, 2008 at 5:53 PM, Peter wrote:
> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio
> wrote:
>> Hi,
>> I think it could be useful to add a generic SequenceSet object in biopython.
>> Such an object would represent a generic set of sequences, and could
>> have some useful methods like .format('fasta') or
>> .align('alignment_tool').
>> Is there something similar available already?
>
> Given your example to turn the SequenceSet into a FASTA file, then
> clearly you are thinking of a collection of SeqRecord objects rather
> than just Seq objects. For this kind of thing I personally just use a
> list of SeqRecord objects.
>
> If I want to turn a list of SeqRecord objects into a FASTA file, I can
> pass the list to the Bio.SeqIO.write() function. Once I've made a
> FASTA file, I can call an external tool to align them - and then load
> them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan
> to do next.

If you really want a list like object with a format method in your
code, how about something like this:

class SeqRecordList(list) :
    """Subclass of the python list, to hold SeqRecord objects only."""
    #TODO - Override the list methods to make sure all the items
    #are indeed SeqRecord objects

    def format(self, format) :
        """Returns a string of all the records in a requested file format.

        The argument format should be any file format supported by
        the Bio.SeqIO.write() function. This must be a lower case string.
        """
        from Bio import SeqIO
        from StringIO import StringIO
        handle = StringIO()
        SeqIO.write(self, handle, format)
        handle.seek(0)
        return handle.read()

if __name__ == "__main__" :
    print "Loading records..."
    from Bio import SeqIO
    my_list = SeqRecordList(SeqIO.parse(open("ls_orchid.gbk"),"genbank"))
    print len(my_list)
    for format in ["fasta","tab"] :
        print
        print format
        print "="*len(format)
        print my_list.format(format)

Peter

From dalloliogm at gmail.com Wed Nov 12 13:17:48 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 12 Nov 2008 19:17:48 +0100
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com>
 <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
Message-ID: <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com>

On Wed, Nov 12, 2008 at 6:53 PM, Peter wrote:
> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio
> wrote:
>> Hi,
>> I think it could be useful to add a generic SequenceSet object in biopython.
>> Such an object would represent a generic set of sequences, and could
>> have some useful methods like .format('fasta') or
>> .align('alignment_tool').
>> Is there something similar available already?
>
> Given your example to turn the SequenceSet into a FASTA file, then
> clearly you are thinking of a collection of SeqRecord objects rather
> than just Seq objects. For this kind of thing I personally just use a
> list of SeqRecord objects.
>
> If I want to turn a list of SeqRecord objects into a FASTA file, I can
> pass the list to the Bio.SeqIO.write() function. Once I've made a
> FASTA file, I can call an external tool to align them - and then load
> them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan
> to do next.
>
>> Some use cases:
>> - a set of sequences that represents all introns in a particular gene,
>> on which I want to calculate the conservation of the splicing
>> regulatory sites.
>> - all genes sequences in an organisms, which I want to convert in EMBL format
>> - a set of seqs to be aligned or used as input for other tools
>> etc..
> > All sensible use cases - but all seem to be covered by a simple python > list of SeqRecord objects, or in some cases a list of Seq objects > (e.g. the introns example, as I doube the introns have names). > Not always. For example, if I have a set of genes in an organism, sometimes I would need to access to only some of them, by their id; so, a __getattribute__ method to make it work as a dictionary could also be useful. The fact is that I think that such an object would be so widely used, that maybe it would be useful to implement it in biopython. What I would do, honestly, is to create a GenericSeqRecordSet class from which to derive Alignment, specifying that in an alignment all the sequences should have the same lenght. It would not require much work and it would change the interface. very tiny little minusculus p.s. if you need help for implement such a thing or anything else I can volounteer :). > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Nov 12 13:36:11 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Nov 2008 18:36:11 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> Message-ID: <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> Giovanni Marco Dall'Olio wrote: >> All sensible use cases - but all seem to be covered by a simple python >> list of SeqRecord objects, or in some cases a list of Seq objects >> (e.g. the introns example, as I doube the introns have names). > > Not always. 
> For example, if I have a set of genes in an organism, sometimes I > would need to access to only some of them, by their id; so, a > __getattribute__ method to make it work as a dictionary could also be > useful. OK, then use a dict of SeqRecords for this, as shown in the tutorial chapter for Bio.SeqIO and the wiki. We even have a helper function Bio.SeqIO.to_dict() to do this and check for duplicate keys. If you need an order preserving dictionary, there are examples of this on the net and there is even PEP372 for adding this to python itself: http://www.python.org/dev/peps/pep-0372/ > The fact is that I think that such an object would be so widely used, > that maybe it would be useful to implement it in biopython. > What I would do, honestly, is to create a GenericSeqRecordSet class > from which to derive Alignment, specifying that in an alignment all > the sequences should have the same lenght. It would not require much > work and it would change the interface. I agree that IF we added some sort of "GenericSeqRecordSet class", it might be sensible for the alignment objects to subclass it - especially if you want it to behave list a python list primarily. Note that in python sets are not order preserving. > very tiny little minusculus p.s. if you need help for implement such a > thing or anything else I can volounteer :). That's good to hear :) However, we'd have to establish the need for this new object first - but so far we've only had two people's view so its too early to form a consensus. I don't see a strong reason for adding yet another object, when the core language provides lists, sets and dict which seem to be enough. Peter From dalloliogm at gmail.com Wed Nov 12 19:16:44 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 01:16:44 +0100 Subject: [BioPython] a sequence set object in biopython? 
In-Reply-To: <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> Message-ID: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> On Wed, Nov 12, 2008 at 7:36 PM, Peter wrote: > Giovanni Marco Dall'Olio wrote: >>> All sensible use cases - but all seem to be covered by a simple python >>> list of SeqRecord objects, or in some cases a list of Seq objects >>> (e.g. the introns example, as I doube the introns have names). >> >> Not always. >> For example, if I have a set of genes in an organism, sometimes I >> would need to access to only some of them, by their id; so, a >> __getattribute__ method to make it work as a dictionary could also be >> useful. > > OK, then use a dict of SeqRecords for this, as shown in the tutorial > chapter for Bio.SeqIO and the wiki. We even have a helper function > Bio.SeqIO.to_dict() to do this and check for duplicate keys. I would prefer a SeqRecordSet object with a to_dict method :) > If you need an order preserving dictionary, there are examples of this > on the net and there is even PEP372 for adding this to python itself: > http://www.python.org/dev/peps/pep-0372/ >> The fact is that I think that such an object would be so widely used, >> that maybe it would be useful to implement it in biopython. >> What I would do, honestly, is to create a GenericSeqRecordSet class >> from which to derive Alignment, specifying that in an alignment all >> the sequences should have the same lenght. It would not require much >> work and it would change the interface. > > I agree that IF we added some sort of "GenericSeqRecordSet class", it > might be sensible for the alignment objects to subclass it - > especially if you want it to behave list a python list primarily. 
Let's see it from another point of view.
In biopython, if you want to print a set of sequences in fasta format,
you have to do the following:

>>> s1 = SeqRecord(Seq('cacacac'))
>>> s2 = SeqRecord(Seq('cacacac'))
>>> seqs = s1, s2
>>> out = ''
>>> for seq in seqs:
...     # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
...     out += seq.format('fasta')
...
>>> print out

On the other hand, printing an alignment in fasta format is a lot simpler:

>>> al = Alignment(SingleLetterAlphabet)
>>> al.add_sequence('s1', 'cacaca')
>>> al.add_sequence('s2', 'cacaca')
>>> print al.format('fasta')

I work more often with sets of sequences than with alignments.
So why is it more difficult to print some unrelated sequences in a
certain format than aligned sequences? I would end up using Alignment
objects also for sequences that are not aligned.

I am also thinking about many format parsers. Wouldn't it be easier:

>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>> record_dict = seqs.to_dict()

than invoking SeqIO twice?

> Note that in python sets are not order preserving.
>
>> very tiny little minusculus p.s. if you need help for implement such a
>> thing or anything else I can volounteer :).
>
> That's good to hear :)
>
> However, we'd have to establish the need for this new object first -
> but so far we've only had two people's view so its too early to form a
> consensus. I don't see a strong reason for adding yet another object,
> when the core language provides lists, sets and dict which seem to be
> enough.

Take for example this code you wrote for me before:

> class SeqRecordList(list) :
>     """Subclass of the python list, to hold SeqRecord objects only."""
>     #TODO - Override the list methods to make sure all the items
>     #are indeed SeqRecord objects
>
>     def format(self, format) :
>         """Returns a string of all the records in a requested file format.
> > The argument format should be any file format supported by > the Bio.SeqIO.write() function. This must be a lower case string. > """ > from Bio import SeqIO > from StringIO import StringIO > handle = StringIO() > SeqIO.write(self, handle, format) > handle.seek(0) > return handle.read() It's very useful, but I don't think a python/biopython newbie would be able to write it. That's why I think it should be included. Last year, I was in another laboratory and I didn't have much experience with biopython, and I was missing such a kind of object. > Peter > Goodnight!! -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Thu Nov 13 04:37:35 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 10:37:35 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator Message-ID: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> I am writing a module to generate semi-random sets of haplotypes. For example, let's say you want a set of 100 sequences of 200 SNPs, in which an hotspot is located in a certain position: the module is meant to generate such datasets, mainly for testing purposes. You can find the code here: - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py Could you give me some suggestions about this? For example, which kinds of haplotype model would you think it could be useful to implement (see the function paramsGenerator)? What do you think about the way I have written this code? Would you implement it in a different way? -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From mjldehoon at yahoo.com Thu Nov 13 05:27:57 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 13 Nov 2008 02:27:57 -0800 (PST) Subject: [BioPython] a sequence set object in biopython? 
In-Reply-To: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> Message-ID: <25667.98653.qm@web62408.mail.re1.yahoo.com> Adding new classes to Biopython should be done very carefully ... once they're in, it's difficult to remove them again. In the past, removing classes that turned out to be less than ideal was a real headache. Right now I don't see a clear need for a sequence set object ... read on. --- On Wed, 11/12/08, Giovanni Marco Dall'Olio > > > > OK, then use a dict of SeqRecords for this, as shown > > in the tutorial chapter for Bio.SeqIO and the wiki. > > We even have a helper function > > Bio.SeqIO.to_dict() to do this and check for duplicate > > keys. > > I would prefer a SeqRecordSet object with a to_dict method > Wouldn't it be easier: > >>> seqs = Bio.SeqIO.parse(filehandler, > 'fasta') > >>> record_dict = seqs.to_dict() > > than invoking SeqIO twice? Maybe, yes, but it's just a matter of typing and I don't think that by itself it is a good enough reason for a SeqRecordSet class. > Let's see it from another point of view. > In biopython, if you want to print a set of sequences in > fasta format, > you have to do the following: > >>> s1 = SeqRecord(Seq('cacacac')) > >>> s2 = SeqRecord(Seq('cacacac')) > >>> seqs = s1, s2 > >>> out = '' > >>> for seq in seqs: > # a "print seq.format('fasta')" statement won't work > # properly here, because of blank lines > out += seq.format('fasta') > >>> print out I don't quite understand why "print seq.format('fasta')" won't work. > Take for example this code you wrote for me before: > > > class SeqRecordList(list) : > > def format(self, format) : > > from Bio import SeqIO > > from StringIO import StringIO > > handle = StringIO() > > SeqIO.write(self, handle, format) > > handle.seek(0) > > return handle.read() > > It's very useful, but I don't think a > python/biopython newbie would be > able to write it. I agree that this is too complicated. 
What if we redefine SeqIO.write as def write(self, handle=sys.stdout, format='fasta'): ... So by default SeqIO.write prints to the screen. Then you can do SeqIO.write(records) where records are a list of SeqRecord's. --Michiel. From tiagoantao at gmail.com Thu Nov 13 05:34:54 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 10:34:54 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> Message-ID: <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> I love the comment documentation, makes everything very easy to understand at first read. Where would you think this would fit in a PopGen hierarchy? Or to put it in another way, please complete Bio.PopGen.... Tiago On Thu, Nov 13, 2008 at 9:37 AM, Giovanni Marco Dall'Olio wrote: > I am writing a module to generate semi-random sets of haplotypes. > For example, let's say you want a set of 100 sequences of 200 SNPs, in > which an hotspot is located in a certain position: the module is meant > to generate such datasets, mainly for testing purposes. > > You can find the code here: > - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py > > Could you give me some suggestions about this? For example, which > kinds of haplotype model would you think it could be useful to > implement (see the function paramsGenerator)? > What do you think about the way I have written this code? Would you > implement it in a different way? > > -- > ----------------------------------------------------------- > > My Blog on Bioinformatics (italian): http://bioinfoblog.it > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "Data always beats theories. 
'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." ?Matthew Simmons, http://www.tiago.org From biopython at maubp.freeserve.co.uk Thu Nov 13 05:37:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 10:37:32 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com> Message-ID: <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com> On Thu, Nov 13, 2008 at 12:16 AM, Giovanni Marco Dall'Olio wrote: > > On Wed, Nov 12, 2008 at 7:36 PM, Peter wrote: >> Giovanni Marco Dall'Olio wrote: >>>> All sensible use cases - but all seem to be covered by a simple python >>>> list of SeqRecord objects, or in some cases a list of Seq objects >>>> (e.g. the introns example, as I doube the introns have names). >>> >>> Not always. >>> For example, if I have a set of genes in an organism, sometimes I >>> would need to access to only some of them, by their id; so, a >>> __getattribute__ method to make it work as a dictionary could also be >>> useful. >> >> OK, then use a dict of SeqRecords for this, as shown in the tutorial >> chapter for Bio.SeqIO and the wiki. We even have a helper function >> Bio.SeqIO.to_dict() to do this and check for duplicate keys. > > I would prefer a SeqRecordSet object with a to_dict method :) OK, that is a style choice. BTW, you're using the word "Set" hear rather than "List", which could be misleading as in python sets have no order, but lists do. 
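A small sketch of the list/set/dict distinction mentioned above. Plain tuples stand in for SeqRecord objects and the ids are made up; the remark about dict ordering applies to Python 3.7 and later, not to the Python versions discussed in this thread.

```python
# Hypothetical (id, sequence) pairs standing in for SeqRecord objects.
records = [("gene3", "ACGT"), ("gene1", "GTGC"), ("gene2", "TTAA")]

# A list keeps the order in which items were added and supports indexing.
ids_in_order = [rec_id for rec_id, _ in records]

# A set answers membership questions; its iteration order is not defined.
id_set = set(ids_in_order)

# A dict gives lookup by id. From Python 3.7 on it also keeps insertion order.
by_id = dict(records)

print(ids_in_order[0])     # first record added
print("gene1" in id_set)   # membership test
print(by_id["gene1"])      # lookup by id
```

So a name containing "Set" suggests membership without order, while the use cases in this thread (ordered output, lookup by id) look more like a list or a dict.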
>> If you need an order preserving dictionary, there are examples of this
>> on the net and there is even PEP372 for adding this to python itself:
>> http://www.python.org/dev/peps/pep-0372/
>
>>> The fact is that I think that such an object would be so widely used,
>>> that maybe it would be useful to implement it in biopython.
>>> What I would do, honestly, is to create a GenericSeqRecordSet class
>>> from which to derive Alignment, specifying that in an alignment all
>>> the sequences should have the same length. It would not require much
>>> work and it would change the interface.
>>
>> I agree that IF we added some sort of "GenericSeqRecordSet class", it
>> might be sensible for the alignment objects to subclass it -
>> especially if you want it to behave like a python list primarily.
>
> Let's see it from another point of view.
> In biopython, if you want to print a set of sequences in fasta format,
> you have to do the following:
> >>> s1 = SeqRecord(Seq('cacacac'))
> >>> s2 = SeqRecord(Seq('cacacac'))
> >>> seqs = s1, s2
> >>> out = ''
> >>> for seq in seqs:
> >>>     # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
> >>>     out += seq.format('fasta')
> >>> print out

First of all, in my opinion using variable names seq and seqs for
SeqRecord objects rather than Seq objects is confusing.

Secondly, creating SeqRecord objects without an ID is a very bad idea
if you want to output them to a file.

Thirdly, you can have as many blank lines as you like in a FASTA
format file. Your problem is that using "print" in python appends a
new line for display. For writing to a file, it is important that the
format("fasta") method include the trailing new line. i.e.
for printing on screen you could do:

    for rec in seqs:
        #seqs is a list of SeqRecord objects
        print rec.format("fasta").rstrip()
        #removing trailing new line as print adds one

Or (based on Michiel's email which arrived while I was writing mine)
use the stdout handle:

    import sys
    from Bio import SeqIO
    SeqIO.write(seqs, sys.stdout, "fasta")

> On the other side, printing an alignment in fasta format is a lot simpler:
> >>> al = Alignment(SingleLetterAlphabet)
> >>> al.add_sequence('s1', 'cacaca')
> >>> al.add_sequence('s2', 'cacaca')
> >>> print al.format('fasta')
>
> I work more often with sets of sequences rather than with alignments.
> So, why it is more difficult to print some un-related sequences in a
> certain format, than aligned sequence? I would end up using Alignment
> objects also for sequences that are not aligned.

Out of interest, why do you want to print out records to screen in a
particular file format? Why not just write them to a file?

> I am also thinking about many format parsers.
>
> Wouldn't it be easier:
> >>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
> >>> record_dict = seqs.to_dict()
>
> than invoking SeqIO twice?

You don't like this:

    from Bio import SeqIO
    record_dict = SeqIO.to_dict(SeqIO.parse(handle, format))

Well, I can live with it. We *could* make the SeqIO.parse function
always return a new object, a SeqRecordIterator which could have a
to_dict() method in addition to the iteration interface - but this is
overly complicated.

>> Note that in python sets are not order preserving.
>>
>>> very tiny little minusculus p.s. if you need help to implement such a
>>> thing or anything else I can volunteer :).
>>
>> That's good to hear :)
>>
>> However, we'd have to establish the need for this new object first -
>> but so far we've only had two people's view so it's too early to form a
>> consensus. I don't see a strong reason for adding yet another object,
>> when the core language provides lists, sets and dict which seem to be
>> enough.
> > Take for example this code you wrote for me before: > >> class SeqRecordList(list) : >> """Subclass of the python list, to hold SeqRecord objects only.""" >> #TODO - Override the list methods to make sure all the items >> #are indeed SeqRecord objects >> >> def format(self, format) : >> """Returns a string of all the records in a requested file format. >> >> The argument format should be any file format supported by >> the Bio.SeqIO.write() function. This must be a lower case string. >> """ >> from Bio import SeqIO >> from StringIO import StringIO >> handle = StringIO() >> SeqIO.write(self, handle, format) >> handle.seek(0) >> return handle.read() > > It's very useful, but I don't think a python/biopython newbie would be > able to write it. > That's why I think it should be included. > Last year, I was in another laboratory and I didn't have much > experience with biopython, and I was missing such a kind of object. A python newbie should first learn about basic python lists, sets, etc. Peter From biopython at maubp.freeserve.co.uk Thu Nov 13 06:11:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 11:11:10 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <25667.98653.qm@web62408.mail.re1.yahoo.com> References: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> <25667.98653.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00811130311t4e813a8fqeb21504fd5696bf1@mail.gmail.com> Michiel wrote: >Marco wrote: >> Take for example this code you [Peter] wrote for me before: >> >> > class SeqRecordList(list) : >> > def format(self, format) : >> > from Bio import SeqIO >> > from StringIO import StringIO >> > handle = StringIO() >> > SeqIO.write(self, handle, format) >> > handle.seek(0) >> > return handle.read() >> >> It's very useful, but I don't think a >> python/biopython newbie would be >> able to write it. > > I agree that this is too complicated. 
This wasn't aimed at a beginner, but rather for Marco if he really
wants to use this kind of object in his own code, or as a basis for
further discussion.

> What if we redefine SeqIO.write as
>
> def write(self, handle=sys.stdout, format='fasta'):
>     ...
>
> So by default SeqIO.write prints to the screen. Then you can do
>
> SeqIO.write(records)
>
> where records are a list of SeqRecord's.

We could certainly include something like this in the documentation:

    #Just an example to create some records:
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    records = [SeqRecord(Seq("ACGT"),"Alpha"), SeqRecord(Seq("GTGC"),"Beta")]

    #One way to "print" records to screen,
    import sys
    from Bio import SeqIO
    SeqIO.write(records, sys.stdout, "fasta")

I'm not so keen on making the handle default to standard out, but this
is nicer than the suggestion you made some time ago that if the handle
were omitted a string be returned (no longer an option since Bug 2628
was committed).

Any other votes for the standard out default?

Peter

From dalloliogm at gmail.com Thu Nov 13 06:51:38 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 13 Nov 2008 12:51:38 +0100
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com> <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com>
Message-ID: <5aa3b3570811130351j6051b934n2216c8595814b8fe@mail.gmail.com>

On Thu, Nov 13, 2008 at 11:37 AM, Peter wrote:
> On Thu, Nov 13, 2008 at 12:16 AM, Giovanni Marco Dall'Olio wrote:
>>
>> I would prefer a SeqRecordSet object with a to_dict method :)
>
> OK, that is a style choice.
> > BTW, you're using the word "Set" hear rather than "List", which could > be misleading as in python sets have no order, but lists do. Maybe the word 'SequencesPool' would be less misleading? I don't have much confidence with English :( The word List could also be misleading, because I was thinking about an object that could act as a dictionary as well. > Out of interest, why do you want to print out records to screen in a > particular file format? Why not just write them to a file? just for debugging purposes - I wasn't expecting the blankline in the output. > You don't like this: > > from Bio import SeqIO > record_dict = SeqIO.to_dict(SeqIO.parse(handle, format)) > > Well, I can live with it. We *could* make the SeqIO.parse function > always return a new object, a SeqRecordIterator which could have a > to_dict() method in addition to the iteration interface - but this is > overly complicated. ok.. so I understand, if it would take too much work, nevermind. I just thought it could have been an useful suggestion and sent it, because otherwise I would have forgot about it :). Cheers :) > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From hma2 at staffmail.ed.ac.uk Thu Nov 13 06:18:34 2008 From: hma2 at staffmail.ed.ac.uk (Hongwu Ma) Date: Thu, 13 Nov 2008 11:18:34 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: References: Message-ID: <491C0D0A.7060408@staffmail.ed.ac.uk> Sometimes when I parse the blast records in biopython using the following program I get the error "Your XML file was empty" but there are actually some results in the saved xml files. Anyone know what is the problem? Thanks in advance. 
Hongwu myfolder='c:/mgenome/' mydatafolder=myfolder+'alphao/' my_blast_db = myfolder+'orfsre.txt' my_blast_exe =myfolder+'blastall.exe' evalue=0.0001 my_blast_file = mydatafolder+file result_handle, error_handle = NCBIStandalone.blastall(blastcmd=my_blast_exe, program="tblastn", database=my_blast_db, infile=my_blast_file, expectation=evalue) bres=result_handle.read() save_file = open(myfolder+file[:3]+'orfre.xml', "w") save_file.write(bres) save_file.close() blast_records = NCBIXML.parse(result_handle) for blast_record in blast_records: > > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From dalloliogm at gmail.com Thu Nov 13 06:55:11 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 12:55:11 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> Message-ID: <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> On Thu, Nov 13, 2008 at 11:34 AM, Tiago Ant?o wrote: > I love the comment documentation, makes everything very easy to > understand at first read. It's doctest > Where would you think this would fit in a PopGen hierarchy? Or to put > it in another way, please complete Bio.PopGen.... Maybe we could create a module called 'Generators' where to put all similar Generators. p.s. are you using some kind of repository/rcs for you PopGen modules? I just don't want to write code that will be difficult to reimplement in the future PopGen module.. > Tiago > > On Thu, Nov 13, 2008 at 9:37 AM, Giovanni Marco Dall'Olio > wrote: >> I am writing a module to generate semi-random sets of haplotypes. 
>> For example, let's say you want a set of 100 sequences of 200 SNPs, in >> which an hotspot is located in a certain position: the module is meant >> to generate such datasets, mainly for testing purposes. >> >> You can find the code here: >> - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py >> >> Could you give me some suggestions about this? For example, which >> kinds of haplotype model would you think it could be useful to >> implement (see the function paramsGenerator)? >> What do you think about the way I have written this code? Would you >> implement it in a different way? >> >> -- >> ----------------------------------------------------------- >> >> My Blog on Bioinformatics (italian): http://bioinfoblog.it >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > "Data always beats theories. 'Look at data three times and then come > to a conclusion,' versus 'coming to a conclusion and searching for > some data.' The former will win every time." > ?Matthew Simmons, > http://www.tiago.org > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Nov 13 07:09:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 12:09:24 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: <491C0D0A.7060408@staffmail.ed.ac.uk> References: <491C0D0A.7060408@staffmail.ed.ac.uk> Message-ID: <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> On Thu, Nov 13, 2008 at 11:18 AM, Hongwu Ma wrote: > Sometimes when I parse the blast records in biopython using the following > program I get the error "Your XML file was empty" but there are actually > some results in the saved xml files. 
Anyone know what is the problem?
> Thanks in advance.
> Hongwu
>
> ...
> bres=result_handle.read()
> save_file = open(myfolder+file[:3]+'orfre.xml', "w")
> save_file.write(bres)
> save_file.close()
> blast_records = NCBIXML.parse(result_handle)
> for blast_record in blast_records:
> ...

When you do result_handle.read() it reads in all the data in the
handle - leaving it empty (pointing at the end of the file). When the
parser tries to read more data from the handle there isn't any, which
is why the parser says the file seems to be empty. You'll have to
"reset" the handle to the beginning.

One way would be to open the file you just wrote to disk:

    ...
    save_file.close()
    result_handle = open(...)
    blast_records = NCBIXML.parse(result_handle)
    ...

Peter

From hma2 at staffmail.ed.ac.uk Thu Nov 13 08:55:12 2008
From: hma2 at staffmail.ed.ac.uk (Hongwu Ma)
Date: Thu, 13 Nov 2008 13:55:12 +0000
Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file
In-Reply-To: <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com>
References: <491C0D0A.7060408@staffmail.ed.ac.uk> <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com>
Message-ID: <491C31C0.2030609@staffmail.ed.ac.uk>

Thanks, Peter. I tried to reopen the xml file and it was working.
I also tried to parse the result before reading as below:

    ...
    blast_records = NCBIXML.parse(result_handle)
    bres=result_handle.read()
    ...

I found I still got the problem, a good xml file but empty
blast_records. Why does the later read() call affect the parse before it?

> On Thu, Nov 13, 2008 at 11:18 AM, Hongwu Ma wrote:
>
>> Sometimes when I parse the blast records in biopython using the following
>> program I get the error "Your XML file was empty" but there are actually
>> some results in the saved xml files. Anyone know what is the problem?
>> Thanks in advance.
>> Hongwu
>>
>> ...
>> bres=result_handle.read() >> save_file = open(myfolder+file[:3]+'orfre.xml', "w") >> save_file.write(bres) >> save_file.close() >> blast_records = NCBIXML.parse(result_handle) >> for blast_record in blast_records: >> ... >> > > When you do result_handle.read() it reads in all the data in the > handle - leaving it empty (pointing at the end of the file). When the > parser tries to read more data from the handle there isn't any, which > is why the parser says the file seems to be empty. You'll have to > "reset" the handle to the beginning. > > One way would be to open the file you just wrote to disk: > > ... > save_file.close() > result_handle = open(...) > blast_records = NCBIXML.parse(result_handle) > ... > > Peter > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From tiagoantao at gmail.com Thu Nov 13 09:04:46 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 14:04:46 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> Message-ID: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> > Maybe we could create a module called 'Generators' where to put all > similar Generators. I'm ok with that. Do you envision more generators in the future? Actually simcoal could also be used to generate these things (as long as I remember to do an arlequin parser to get the results out). In the short run it would be nice to extend the tutorial also and put a few tests. 
I don't think your generator code causes a major disturbance or any
maintenance problem upstream (read: for Peter, Michiel or final
users), so, from my point of view it could be put in the main
distribution in a short time frame.

> p.s. are you using some kind of repository/rcs for your PopGen modules?
> I just don't want to write code that will be difficult to reimplement
> in the future PopGen module..

Just my hard disk, open bio CVS and now most everything is in git ;) .
For now don't worry about that, you actually have all my code (except
recent bugfixes). Let's just keep communicating.

From biopython at maubp.freeserve.co.uk Thu Nov 13 09:14:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Nov 2008 14:14:26 +0000
Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file
In-Reply-To: <491C31C0.2030609@staffmail.ed.ac.uk>
References: <491C0D0A.7060408@staffmail.ed.ac.uk> <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> <491C31C0.2030609@staffmail.ed.ac.uk>
Message-ID: <320fb6e00811130614i7eaaaec4pc774461be96b3ab4@mail.gmail.com>

On Thu, Nov 13, 2008 at 1:55 PM, Hongwu Ma wrote:
>
> Thanks, Peter. I tried to reopen the xml file and it was working.

Good.

> I also tried to parse the result before reading as below:
> ...
>
> I found I still got the problem, a good xml file but empty blast_records.
> Why the later read() function affect the parse before it?

You can only call handle.read() once - it gives you all the remaining
data in the file. Once you've called handle.read(), any further calls
to handle.read() or handle.readline() etc won't return any more data.
If you call result_handle.read() first, then when you give
result_handle to the parser, it can't get any data from it.
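To illustrate why the order of parse() and read() made no difference, here is a minimal sketch using a toy parser instead of NCBIXML. It assumes the real parser is lazy (returns an iterator and only reads from the handle once you loop over it); parse_records and the sample data are made up for the illustration, and the code is written for current Python rather than the 2.x used elsewhere in this thread.

```python
import io


def parse_records(handle):
    """Toy lazy parser: yields one record per line, reading only when iterated."""
    for line in handle:
        yield line.strip()


def read_then_parse(text):
    handle = io.StringIO(text)
    saved = handle.read()                # consumes the whole handle
    return saved, list(parse_records(handle))


def parse_then_read(text):
    handle = io.StringIO(text)
    records = parse_records(handle)      # nothing has been read yet
    saved = handle.read()                # consumes the whole handle
    return saved, list(records)          # iteration starts at end of data


def read_once_parse_copy(text):
    handle = io.StringIO(text)
    saved = handle.read()
    # Give the parser a fresh handle over the saved text.
    return saved, list(parse_records(io.StringIO(saved)))


DATA = "hit1\nhit2\n"
print(read_then_parse(DATA)[1])
print(parse_then_read(DATA)[1])
print(read_once_parse_copy(DATA)[1])
```

In both of the first two cases the saved text is complete but the parser sees nothing, because the handle is already at the end by the time the loop starts. Reopening the saved file, or wrapping the saved string in a new handle as in the third function, avoids this.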
Have a look at reading files in python which should help to explain the ideas:
http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files

Peter

From biopython at maubp.freeserve.co.uk Thu Nov 13 09:19:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Nov 2008 14:19:13 +0000
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
In-Reply-To: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com>
References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com>
Message-ID: <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com>

On Thu, Nov 13, 2008 at 2:04 PM, Tiago Antão wrote:
>> Maybe we could create a module called 'Generators' where to put all
>> similar Generators.
>
> I'm ok with that. Do you envision more generators in the future?
> Actually simcoal could also be used to generate these things (as long
> as I remember to do an arlequin parser to get the results out).
> In the short run it would be nice to extend the tutorial also and put
> a few tests.

You're talking about a module to create biological data (e.g.
haplotypes), right? I don't think using the word "generators" is a
good idea. Python itself uses this terminology in the context of
iteration (see generator functions, generator expressions, etc). That
code did not look like a python generator to me.

Peter

From lueck at ipk-gatersleben.de Thu Nov 13 09:06:13 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 13 Nov 2008 15:06:13 +0100
Subject: [BioPython] Problems with Emboss.Primer3
Message-ID: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de>

Hi!

I'm trying to generate a Primer3 file but I have some problems because
my output file is always empty. Unfortunately I don't get an error message.
Here's my code: from Bio import Fasta from Bio.Emboss.Applications import Primer3Commandline from Bio.Application import generic_run from Bio.Emboss.Primer import Primer3Parser primer_cl = Primer3Commandline() primer_cl.set_parameter("-sequence", "p3input.txt") primer_cl.set_parameter("-outfile", "out.pr3") primer_cl.set_parameter("-productsizerange", "350,10000") primer_cl.set_parameter("-target", "%s,%s" % (50, 500)) result, messages, errors = generic_run(primer_cl) p3input.txt looks like this: PRIMER_SEQUENCE_ID=HF15E08r SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200 PRIMER_OPT_TM=60.0 PRIMER_MIN_TM=58.0 PRIMER_MAX_TM=65.0 PRIMER_MAX_DIFF_TM=3.0 PRIMER_DNA_CONC=420 PRIMER_NUM_RETURN=1 = PRIMER_SEQUENCE_ID=HO05B04S SEQUENCE=GAAAACCCAATGACAGTAGGATGACAAGGGAAAACTGGTGAGCAACGTCGTAGTCGGGGTTACCACCGGCGGGAAAAAGTAGCAAAACTATGTCATGTCTTATAATCTGGAGTTGGGAACACCTTGTATTATACTCGTGTCTGGGGATCGACCGATCGGTCGCGTAGAAGAAAAACCCAAAGCGCGGAAATGGACCGCGCCAACAAAAAAAGAGGGTGCGGGTGTGGATAATATGGAGAAGAACTGTATTTTGCTTACCCCCTTGATTCTTTTGTATGTAAAATGTGGGCACTGTCAGACCTCACTGTGTGATCAAATCCTCTCTGTCCTGTCCTGTCCTGAAGGGGCCTCTCGTTCTGGATGAATAAACAGCAAATAACTTTGCGTGTGGCTGGCCCCACCTGTCGGTGATTGGTAATTAAAACGACGGTAATTGTTGTG PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200 PRIMER_OPT_TM=60.0 PRIMER_MIN_TM=58.0 PRIMER_MAX_TM=65.0 PRIMER_MAX_DIFF_TM=3.0 PRIMER_DNA_CONC=420 PRIMER_NUM_RETURN=1 ... Does someone has idea what's the problem? 
Thanks in advance, Stefanie From dalloliogm at gmail.com Thu Nov 13 09:27:50 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 15:27:50 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> Message-ID: <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> On Thu, Nov 13, 2008 at 3:19 PM, Peter wrote: > On Thu, Nov 13, 2008 at 2:04 PM, Tiago Ant?o wrote: >>> Maybe we could create a module called 'Generators' where to put all >>> similar Generators. >> >> I'm ok with that. Do you envision more generators in the future? >> Actually simcoal could also be used to generate these things (as long >> as I remember to do an arlequin parser to get the results out). >> In the short run it would be nice to extend the tutorial also and put >> a few tests. > > You're talking about a module to create biological data (e.g. > haplotypes), right? I'm don't think using the word "generators" is a > good idea. Python itself uses this terminology in the context of > iteration (see generator functions, generator expressions, etc). That > code did not look like a python generator to me. This is right: which word can I use, then? HaplotypesSampler? RandomHaplotypesSpawner? HaplotypesCreator? Anyway, I will change the module's interface sooner, it will accept different parameters. I need it will need still some amount of work.. 
> Peter
>

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From tiagoantao at gmail.com Thu Nov 13 09:32:23 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 13 Nov 2008 14:32:23 +0000
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
In-Reply-To: <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com>
References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com>
Message-ID: <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com>

> This is right: which word can I use, then?
> HaplotypesSampler? RandomHaplotypesSpawner?
> HaplotypesCreator?

Considering that this is probably a small piece of code in the long
run (correct me if I am wrong), I suggest creating
Bio.PopGen.Utils.NameToBeDecided.py

From p.j.a.cock at googlemail.com Thu Nov 13 09:43:46 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 13 Nov 2008 14:43:46 +0000
Subject: [BioPython] Problems with Emboss.Primer3
In-Reply-To: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de>
References: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com>

Stefanie Lück wrote:
> Hi!
>
> I'm trying to generate a Primer3 file but I have some problems because
> my output file is always empty.
> Unfortunately I don't get an error message.

Hi Stefanie,

I have a couple of suggestions to try and work out what is wrong here...
> Here's my code:
>
> from Bio import Fasta
> from Bio.Emboss.Applications import Primer3Commandline
> from Bio.Application import generic_run
> from Bio.Emboss.Primer import Primer3Parser

Note Bio.Emboss.Primer was deprecated for Biopython 1.49, I think you'll want to use Bio.Emboss.Primer3 instead.

> primer_cl = Primer3Commandline()
> primer_cl.set_parameter("-sequence", "p3input.txt")
> primer_cl.set_parameter("-outfile", "out.pr3")
> primer_cl.set_parameter("-productsizerange", "350,10000")
> primer_cl.set_parameter("-target", "%s,%s" % (50, 500))
> result, messages, errors = generic_run(primer_cl)
> ...

What does this give:

print "Command line:"
print primer_cl
print "Return code:"
print result.return_code
print "Errors:"
print errors.read()
print "Messages:"
print messages.read()

Also try running the command line by hand at the command prompt. I get this, which may mean a problem with the input file:

$ eprimer3 -sequence p3input.txt -outfile out.pr3 -target 50,500 -productsizerange 350,10000
Picks PCR primers and hybridization oligos
Error: Unable to read sequence 'p3input.txt'
Died: eprimer3 terminated: Bad value for '-sequence' and no prompt

Is your input file really expected to work? 
Reading the docs I would suggest trying a FASTA file as input, but I am not familiar with this tool: http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html Peter From bsouthey at gmail.com Thu Nov 13 10:29:38 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 13 Nov 2008 09:29:38 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> Message-ID: <491C47E2.60904@gmail.com> Tiago Ant?o wrote: >> This is right: which word can I use, then? >> HaplotypesSampler? RandomHaplotypesSpawner? >> HaplotypesCreator? >> > > Considering that this is probably a small piece of code in the long > run (correct me if I am wrong), I suggest creating > Bio.PopGen.Utils.NameToBeDecided.py > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, I really don't mean to be negative, but you have certain responsibilities once you release code into the Biopython community. Part of my concern is that some of this is being overlooked especially in terms of the user of the code. I do see that simulation of SNPs is useful for users so it is important that it integrated correctly. I think Michiel's recent comment in 'a sequence set object in biopython' thread is important here as well: "Adding new classes to Biopython should be done very carefully ... once they're in, it's difficult to remove them again. 
In the past, removing classes that turned out to be less than ideal was a real headache." While I have not looked at the code, my view is that must remain integrated into the PopGen module. I would expect that a user would some Biopython (PopGen) modules with some simulated SNPs. I would prefer that Biopython remains as much as possible a set of integrated tools rather than just a collection of tools. This is a clear example where if it is not totally integrated then I don't see the point in including it in Biopython. The second aspect is that it must have a very stable API, similarly to Michiel's comment is that changing APIs after a release is also a pain especially if the module has been around a long time. Based on your first post, I would argue that you are not quite at this stage yet. Bruce From dalloliogm at gmail.com Thu Nov 13 11:07:51 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 17:07:51 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C47E2.60904@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> Message-ID: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> On Thu, Nov 13, 2008 at 4:29 PM, Bruce Southey wrote: > Tiago Ant?o wrote: >>> >>> This is right: which word can I use, then? >>> HaplotypesSampler? RandomHaplotypesSpawner? >>> HaplotypesCreator? 
>>> >> >> Considering that this is probably a small piece of code in the long >> run (correct me if I am wrong), I suggest creating >> Bio.PopGen.Utils.NameToBeDecided.py >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > > Hi, > I really don't mean to be negative, but you have certain responsibilities > once you release code into the Biopython community. Part of my concern is > that some of this is being overlooked especially in terms of the user of the > code. I do see that simulation of SNPs is useful for users so it is > important that it integrated correctly. > > I think Michiel's recent comment in 'a sequence set object in biopython' > thread is important here as well: > > "Adding new classes to Biopython should be done very carefully ... once > they're in, it's difficult to remove them again. In the past, removing > classes that turned out to be less than ideal was a real headache." > > While I have not looked at the code, my view is that must remain integrated > into the PopGen module. I would expect that a user would some Biopython > (PopGen) modules with some simulated SNPs. I would prefer that Biopython > remains as much as possible a set of integrated tools rather than just a > collection of tools. This is a clear example where if it is not totally > integrated then I don't see the point in including it in Biopython. > > The second aspect is that it must have a very stable API, similarly to > Michiel's comment is that changing APIs after a release is also a pain > especially if the module has been around a long time. Based on your first > post, I would argue that you are not quite at this stage yet. ehi, wait :) I wasn't proposing to integrate this module in biopython, at least not yet!! :) This is a module to generate test sets to help the development of the other future PopGen modules. 
For example, we wanted to write a function to calculate the Fst statistics over snps data. The Fst is an index that tells you if, given two populations, they follow the same pattern of variability, and therefore can be considered as two subpopulations of the same population or not. To test such a script, you will need a module like the one I wrote here: for example, you could create two samples of 200 individuals with the same frequencies at every site, and see what your Fst script tells. Then, probably, compare the results with another tool that is already know to calculate the Fst correctly. So I was just asking for any suggestions - which models should I implement in this generator? And how? Which parameters should it accept? Should it use the random module? > Bruce > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From tiagoantao at gmail.com Thu Nov 13 11:33:18 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 16:33:18 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C47E2.60904@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> Message-ID: <6d941f120811130833o194c280fqc8707bd18472a369@mail.gmail.com> On Thu, Nov 13, 2008 at 3:29 PM, Bruce Southey wrote: > While I have not looked at the code, my view is that must remain 
integrated > into the PopGen module. I would expect that a user would some Biopython > (PopGen) modules with some simulated SNPs. I would prefer that Biopython > remains as much as possible a set of integrated tools rather than just a > collection of tools. This is a clear example where if it is not totally > integrated then I don't see the point in including it in Biopython. There are several dimensions here, and I would like to sum up my ideas on several things being floated around: 1. Support for tools with a small user base: I do think that the user base size should not be a fundamental criteria. As long as tools are maintained (which, I agree, might be a problem with some fringe applications), this should not be a issue. A good example is fdist support on PopGen: The user base seems to be increasing quite a lot for the method because of code done on top of it Bio.PopGen.FDist (something I was not expecting, to be honest). 2. Integration inside PopGen: Up to now, there has been an effort in PopGen to have a coherent module where all parts interoperate. With the exception of Simcoal output, all the rest works in a cohesive way, you can take a genepop file, and feed it to fdist, for instance as the module has provisions for interop (the same for the new LDNe code that I have). 3. Integration with the rest of biopython. I do expect things to work quite smoothly. Like SNP extraction from sequencies and feed in to fdist, ldne and (future) statistics. I see issues with microsatelllite/STR + RFLPs stuff, but that is because there might be little provision in the rest of biopython for that type of markers. 4. New code and new developers. I think that an overly stringent process will put new people off. I have no problems in accepting _non crucial code_ that does _not impose big maintenance hurdles_, though that code might be somewhat naive in the big picture (maybe this particular example should actually go to the test base, BTW). 
The truth is, an overly stringent process, while it might assure fantastic code, puts a gigantic barrier in front of new people. I am more in favor of a learning process where less fundamental code can be accepted at the beginning. I don't want to discourage new people; I think a balance between quality and encouragement can be found. > The second aspect is that it must have a very stable API, similarly to > Michiel's comment is that changing APIs after a release is also a pain > especially if the module has been around a long time. Based on your first > post, I would argue that you are not quite at this stage yet. Agreed, especially for crucial functionality (but maybe not so much for less crucial parts). That is why I have avoided committing my statistics code to Biopython (although it has existed for quite a long time and is available on GIT): the API has to be future-resilient! In fact I have a proposal to make on this front, but because I want to be sure that the API is as future-proof as possible, the proposal will not be all-encompassing for now (I still don't know how to design a future-proof API for multi-loci statistics like simple linkage disequilibrium or more modern things like EHH). But yes, to be honest I think open-bio projects err on the excessively bureaucratic side and discourage new people. 
From tiagoantao at gmail.com Thu Nov 13 11:38:21 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 16:38:21 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> Message-ID: <6d941f120811130838y5fc5e2c6qc42c37ed065bc616@mail.gmail.com> Hi, > For example, we wanted to write a function to calculate the Fst > statistics over snps data. Pigging back on Bruce comment on APIs being future proof, this is not really what we want ;) We want to be able to calculate Fst for any marker (SNPs, Microsatellites, AFLPs, sequences). 
We cannot have something like: calc_fst(put_your_snps_here) What we want is: calc_fst(put_your_marker_frequencies_here) We want to serve all, not just ourselves ;) Tiago From dalloliogm at gmail.com Thu Nov 13 12:51:30 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 18:51:30 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> Message-ID: <5aa3b3570811130951x549e29a4xa400d3df1ec9123e@mail.gmail.com> On Thu, Nov 13, 2008 at 3:04 PM, Tiago Antão wrote: >> p.s. are you using some kind of repository/rcs for you PopGen modules? >> I just don't want to write code that will be difficult to reimplement >> in the future PopGen module.. > > Just my hardisk, open bio CVS and now most everything is in git ;) . > For now don't worry with that, you actually have all my code (except > recent bugfixes). Lets just keep communicating. Well, if you want to re-use the repository on github, I think you will have to register there and then tell me your username, so that I can add you as a collaborator. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From bourbine at yahoo.de Thu Nov 13 12:55:58 2008 From: bourbine at yahoo.de (Samuel Bader) Date: Thu, 13 Nov 2008 17:55:58 +0000 (GMT) Subject: [BioPython] local uniprot search Message-ID: <898399.98502.qm@web28508.mail.ukl.yahoo.com> Hi, I'm a newbie in programming and would like to do a local uniprot search (submit an accession number and get the entry back; I would like the whole entry, not just the sequence). 
So far I always scanned through the whole uniprot file, but this take several minutes on my computer. As it is much faster on the internet, I think there is also a faster way to do that locally. Does somebody know, how this could be done? Thanks Cheers From bsouthey at gmail.com Thu Nov 13 13:57:30 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 13 Nov 2008 12:57:30 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> Message-ID: <491C789A.3070004@gmail.com> Giovanni Marco Dall'Olio wrote: > On Thu, Nov 13, 2008 at 4:29 PM, Bruce Southey wrote: > >> Tiago Ant?o wrote: >> >>>> This is right: which word can I use, then? >>>> HaplotypesSampler? RandomHaplotypesSpawner? >>>> HaplotypesCreator? 
>>>> >>>> >>> Considering that this is probably a small piece of code in the long >>> run (correct me if I am wrong), I suggest creating >>> Bio.PopGen.Utils.NameToBeDecided.py >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >>> >> Hi, >> I really don't mean to be negative, but you have certain responsibilities >> once you release code into the Biopython community. Part of my concern is >> that some of this is being overlooked especially in terms of the user of the >> code. I do see that simulation of SNPs is useful for users so it is >> important that it integrated correctly. >> >> I think Michiel's recent comment in 'a sequence set object in biopython' >> thread is important here as well: >> >> "Adding new classes to Biopython should be done very carefully ... once >> they're in, it's difficult to remove them again. In the past, removing >> classes that turned out to be less than ideal was a real headache." >> >> While I have not looked at the code, my view is that must remain integrated >> into the PopGen module. I would expect that a user would some Biopython >> (PopGen) modules with some simulated SNPs. I would prefer that Biopython >> remains as much as possible a set of integrated tools rather than just a >> collection of tools. This is a clear example where if it is not totally >> integrated then I don't see the point in including it in Biopython. >> >> The second aspect is that it must have a very stable API, similarly to >> Michiel's comment is that changing APIs after a release is also a pain >> especially if the module has been around a long time. Based on your first >> post, I would argue that you are not quite at this stage yet. >> > > ehi, wait :) I wasn't proposing to integrate this module in biopython, > at least not yet!! :) > Oh, I am on the right list? It does say Biopython... 
:-) > This is a module to generate test sets to help the development of the > other future PopGen modules. > Great! > For example, we wanted to write a function to calculate the Fst > statistics over snps data. > The Fst is an index that tells you if, given two populations, they > follow the same pattern of variability, and therefore can be > considered as two subpopulations of the same population or not. > To test such a script, you will need a module like the one I wrote > here: for example, you could create two samples of 200 individuals > with the same frequencies at every site, and see what your Fst script > tells. Then, probably, compare the results with another tool that is > already know to calculate the Fst correctly. > > So I was just asking for any suggestions - which models should I > implement in this generator? And how? Which parameters should it > accept? Should it use the random module? > > The importance is more the API than the actual implementation - as the later posts by Tiago indicate. Some coding related comments: freqs_per_site and alleles_per_site are lists. This is a problem because these could get very large, it is inflexible and you could become out of sync. While you do check for length, you should be more informative of which has a different length. Also you need to check for valid inputs (frequencies between 0 and 1, bases in ACGT). Some other comments Perhaps I misunderstood the situation but the major problem that I have is that the locations are treated as independent so your model assumes unlinked loci. I just don't find this a useful scenario. You assume that the user knows exactly which locations and frequency to change. Often you just want a random frequency and random location. In that case you need to randomly select locations and frequencies based on some function. But I do not find the mode=='random' of paramsGenerator sufficient to address this. 
Further, you might want a random sequence of some length but you may not want all locations to change. While you could set those locations to zero, a sparser form would be desirable. Also, there should be a way to limit the randomly generated frequencies to ranges other than the [0, 1) of random.random. Obviously the question is whether or not the user has to do it themselves. One particular use of generating SNPs pertains to known genes or sequences. In such cases it would be great to use a known sequence as a base for the simulation. Further, it would be very useful to be able to incorporate known SNP data, especially frequencies, from some source like Hapmap (http://www.hapmap.org/). A nice but harder problem is to do this based on a protein sequence since many diseases refer to amino acids. Perhaps my biggest 'disappointment' is the lack of ancestry control because I am also interested in families or some admixture in a population. This just generates sequences randomly assuming you are randomly selecting individuals from a homogeneous population. I do understand this usage so it is not that important to include this here. 
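The input checks suggested earlier in this message (an informative error when the two inputs differ in length, frequencies between 0 and 1, bases in ACGT) could look roughly like the sketch below. The function name validate_site_params and the input shape (one list of alleles and one list of frequencies per site) are assumptions made for illustration; they are not the interface of the module under discussion.

```python
VALID_BASES = frozenset("ACGT")


def validate_site_params(alleles_per_site, freqs_per_site):
    """Raise ValueError, naming the offending site, if the inputs are inconsistent.

    Hypothetical helper: assumes one list of alleles and one list of
    frequencies per site.
    """
    # Say which input has which length instead of only reporting a mismatch.
    if len(alleles_per_site) != len(freqs_per_site):
        raise ValueError(
            "alleles_per_site has %d sites but freqs_per_site has %d"
            % (len(alleles_per_site), len(freqs_per_site))
        )
    for site, (alleles, freqs) in enumerate(zip(alleles_per_site, freqs_per_site)):
        if len(alleles) != len(freqs):
            raise ValueError(
                "site %d: %d alleles but %d frequencies"
                % (site, len(alleles), len(freqs))
            )
        # Only single-letter DNA bases are accepted.
        bad_bases = [a for a in alleles if a not in VALID_BASES]
        if bad_bases:
            raise ValueError("site %d: invalid bases %r" % (site, bad_bases))
        # Each frequency must lie in the closed interval [0, 1].
        bad_freqs = [f for f in freqs if not 0.0 <= f <= 1.0]
        if bad_freqs:
            raise ValueError(
                "site %d: frequencies outside [0, 1]: %r" % (site, bad_freqs)
            )


# Valid input passes silently.
validate_site_params([["A", "G"], ["C", "T"]], [[0.3, 0.7], [0.5, 0.5]])
```

Whether the frequencies at a site should also be required to sum to 1 depends on how the generator interprets them, so that check is left out here.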
Bruce From dalloliogm at gmail.com Fri Nov 14 06:21:02 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 14 Nov 2008 12:21:02 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C789A.3070004@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> Message-ID: <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> On Thu, Nov 13, 2008 at 7:57 PM, Bruce Southey wrote: > > Oh, I am on the right list? It does say Biopython... :-) I added a [PopGen] tag to the subject of the mail, to indicate that it was related to the PopGen module and it s development. >> This is a module to generate test sets to help the development of the >> other future PopGen modules. >> > > Great! > >> For example, we wanted to write a function to calculate the Fst >> statistics over snps data. >> The Fst is an index that tells you if, given two populations, they >> follow the same pattern of variability, and therefore can be >> considered as two subpopulations of the same population or not. >> To test such a script, you will need a module like the one I wrote >> here: for example, you could create two samples of 200 individuals >> with the same frequencies at every site, and see what your Fst script >> tells. Then, probably, compare the results with another tool that is >> already know to calculate the Fst correctly. >> >> So I was just asking for any suggestions - which models should I >> implement in this generator? And how? 
Which parameters should it >> accept? Should it use the random module? >> >> > > The importance is more the API than the actual implementation - as the later > posts by Tiago indicate. > > Some coding related comments: > freqs_per_site and alleles_per_site are lists. > This is a problem because these could get very large, it is inflexible and > you could become out of sync. they are not required to be lists. freqs and alleles _per site can be any kind python object with a __getitem__ and a __len__method. What I would like to do now is to create two 'Freqs' and 'Alleles' objects with such methods, so I can use them as containers for these informations without having to change the actual interface. The __getitem__ function could return a background value (0.5) for any position except for those that are defined to be differently when initialized. This would save memory space also. Have a look at the new changes: - http://tinyurl.com/64tfef (http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py) > While you do check for length, you should be more informative of which has a > different length. > Also you need to check for valid inputs (frequencies between 0 and 1, bases > in ACGT). ok > Some other comments > > Perhaps I misunderstood the situation but the major problem that I have is > that the locations are treated as independent so your model assumes unlinked > loci. I just don't find this a useful scenario. This depends on which parameters you pass to the HaplotypesGenerator init function. I would prefer to create a basic module that generates sequences given the frequencies and alleles in every position, and other functions to create its parameters. 
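The 'Freqs' container mentioned earlier in this message (any object with __getitem__ and __len__ that returns a background value except at positions set explicitly) might be sketched as follows. The class name SparseFreqs, its constructor arguments, and the default background of 0.5 are assumptions for illustration only, not the code in the linked repository.

```python
class SparseFreqs:
    """Sequence-like container that stores only sites differing from a background.

    Hypothetical sketch: memory use grows with the number of overridden
    sites, not with the total number of sites.
    """

    def __init__(self, length, overrides=None, background=0.5):
        self._length = length
        self._background = background
        self._overrides = dict(overrides or {})
        for pos, freq in self._overrides.items():
            if not 0 <= pos < length:
                raise IndexError("site %d is outside 0..%d" % (pos, length - 1))
            if not 0.0 <= freq <= 1.0:
                raise ValueError("site %d: frequency %r not in [0, 1]" % (pos, freq))

    def __len__(self):
        return self._length

    def __getitem__(self, pos):
        # Support negative indices the way lists do.
        if pos < 0:
            pos += self._length
        # Raising IndexError past the end also lets plain iteration terminate.
        if not 0 <= pos < self._length:
            raise IndexError("site index out of range")
        return self._overrides.get(pos, self._background)


freqs = SparseFreqs(1000, {10: 0.1, 500: 0.9})
print(len(freqs), freqs[10], freqs[11])
```

Because the object only needs __getitem__ and __len__, code written against a plain list of frequencies should keep working when handed this container instead.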
I forgot to say it in the first mail, but if you want to use more sophisticated scenarios - like populations that have suffered a bottleneck or have a particular history - there are already better tools available to do that; we should think on how to integrate this module with them. Maybe I should rename this module as 'SimpleHaplotypesSampler'. > You assume that the user knows exactly which locations and frequency to > change. Often you just want a random frequency and random location. In that > case you need to randomly select locations and frequencies based on some > function. But I do not find the mode=='random' of paramsGenerator sufficient > to address this. Further, you might want a random sequence of some length > but you not want all locations to change. ok, but consider that these are haplotypes and not sequences, so you most likely need to have regions that are more conserved and others that change more. This is a good question, about which models to implement, but I would need to find a better way to represent frequencies first, and then think about which models to implement. > While you could set those > locations to zero, a more sparse form would be desirable. I think the idea of a Freqs_per_site object should fix this > Also, the randomly > generated frequencies should have a way to be limited in other ranges than > the [0 to 1) of random.random. Obviously the question is whether or not the > user has to do it themselves. > One particular use of generating SNPs pertains to known genes or sequences. > In such cases to would be great to use a known sequence as a base for the > simulation. > Further, it would be very useful be able incorporate known SNP > data especially frequencies from some source like Hapmap > (http://www.hapmap.org/). This is too complicated for the moment. We would need to develop a standard way to handle HapMap and in general SNPs first. 
> A nice but harder problem is to do this based on a > protein sequence since many diseases refer to amino acids. This is a good idea, but at the moment I was thinking more on genotypes than other characters. I would need to have a better way to handle all these suggestions.. too bad github doesn't provide an integrated ticketing system. > Perhaps my biggest 'disappointment' is the lack of ancestry control because > I also interested in families or some admixture in a population. This just > generates sequences randomly assuming you are randomly selecting individuals > from a homogenous population. I think simcoal can do this? > I do understand this usage so it is not that > important to include this here. > > > > Bruce > > > > > > > > > > > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From tiagoantao at gmail.com Fri Nov 14 06:49:05 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 11:49:05 +0000 Subject: [BioPython] [PopGen] HapMap Message-ID: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> On Thu, Nov 13, 2008 at 6:57 PM, Bruce Southey wrote: > One particular use of generating SNPs pertains to known genes or sequences. > In such cases to would be great to use a known sequence as a base for the > simulation. Further, it would be very useful be able incorporate known SNP > data especially frequencies from some source like Hapmap > (http://www.hapmap.org/). A nice but harder problem is to do this based on a > protein sequence since many diseases refer to amino acids. Talking about hapmap, and in a different front I have some code available to deal with HapMap. The problem is that, in order for it to be useful (performance), it injects all the data in an SQL database. 
That requires a schema for persistance, but I have been "ping-ponged" regarding where people in Biopython say that they prefer things to be on BioSQL and people on BioSQL say they don't care (and, this being voluntary work I simply don't have the patience to fight the bureaucracy). > Perhaps my biggest 'disappointment' is the lack of ancestry control because > I also interested in families or some admixture in a population. This just > generates sequences randomly assuming you are randomly selecting individuals > from a homogenous population. I do understand this usage so it is not that > important to include this here. You can use the Simcoal module to generate (coalescent based) sequences. I don't know if that helps you. The only hurdle is that simcoal churns data in the Arlequin Format and I still haven't got round to finalize one (although I could increase the priority if there is interest). From biopython at maubp.freeserve.co.uk Fri Nov 14 07:02:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Nov 2008 12:02:42 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> Message-ID: <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> On Fri, Nov 14, 2008 at 11:49 AM, Tiago Ant?o wrote: > Talking about hapmap, and in a different front I have some code > available to deal with HapMap. The problem is that, in order for it to > be useful (performance), it injects all the data in an SQL database. > That requires a schema for persistance, but I have been "ping-ponged" > regarding where people in Biopython say that they prefer things to be > on BioSQL and people on BioSQL say they don't care (and, this being > voluntary work I simply don't have the patience to fight the > bureaucracy). 
I didn't say it had to be done via BioSQL - I wanted you to check that any schema ideas wouldn't be overlapping existing work, and felt BioSQL was the obvious place to ask. If these schema ideas are not a good fit to BioSQL, then that's fine. We do have some non-BioSQL bits of Biopython using MySQL already (perhaps not as well looked after as they should be, Bio.GFF and Bio.DocSQL). The trouble with any Biopython code requiring a database is keeping that code maintained and tested is much harder - potentially only the original developer will actually be able to test it. For this particular bit of HapMap code, do you need persistence? If all you need is an on the fly database there may be other options (maybe sqlite - some versions of python ship with this). Peter From tiagoantao at gmail.com Fri Nov 14 07:16:20 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 12:16:20 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> Message-ID: <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> > For this particular bit of HapMap code, do you need persistence? If > all you need is an on the fly database there may be other options > (maybe sqlite - some versions of python ship with this). Considering that there are now more people here that seem to be interested in this, maybe this can be discussed. The HapMap is a fairly big database of SNPs taken for 3 (or 4, depends on how you count) human populations. The database is available in text format. If I recall well (this is old code and old work) there is a file per chromosome and per pop with a (big) list of SNPs. Actually there are several files, from allele counts to haplotype reconstruction. 
The problem is, if you want to search by some criterion (say SNP ID, a chunk of a chromosome, or whatever), going through the files is a painfully slow process. My (now very old) implementation (which, I think, is on Git) downloads the text files, loads them into a local SQLite database, indexes it and exposes a fast interface. The code is actually quite agile, making life quite easy when downloading and manipulating data, at least in my opinion. If there is interest here, I can pull out my code and we can discuss the approach that I followed in the past. Also, if somebody else wants to take the lead on this, go ahead (you can still use my code). To be honest I would prefer to have a shared discussion on this, rather than just submitting the code alone, with just my own reasoning to back it. From tiagoantao at gmail.com Fri Nov 14 07:41:22 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 12:41:22 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> Message-ID: <6d941f120811140441r37698743n2035cbb6d699686@mail.gmail.com> On Fri, Nov 14, 2008 at 12:16 PM, Tiago Antão wrote: > My (now very old) implementation (which, I think, is on Git) downloads > the text files, loads them into a local SQLite database, indexes it > and exposes a fast interface. The code is actually quite agile, making > life quite easy when downloading and manipulating data, at least in my > opinion. Just an update on this, the code on Git is incomplete.
I will dig my archives (other computer) and find the complete stuff From p.j.a.cock at googlemail.com Fri Nov 14 08:07:01 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Nov 2008 13:07:01 +0000 Subject: [BioPython] Problems with Emboss.Primer3 In-Reply-To: <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com> References: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de> <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com> Message-ID: <320fb6e00811140507j78bd40dbybba6ed6f1e74e5ec@mail.gmail.com> Just for anyone else with a similar issue, it turned out there was an EMBOSS setup problem on Stefanie's machine - running the command line by hand at the command line prompt didn't work either. Problem solved :) Peter From bsouthey at gmail.com Fri Nov 14 10:17:08 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 14 Nov 2008 09:17:08 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> Message-ID: <491D9674.7010807@gmail.com> [snip] > >> Some other comments >> >> Perhaps I misunderstood the situation but the major problem that I have is >> that the locations are treated as independent so your model assumes unlinked >> loci. I just don't find this a useful scenario. 
>> > > This depends on which parameters you pass to the HaplotypesGenerator > init function. > I would prefer to create a basic module that generates sequences given > the frequencies and alleles in every position, and other functions to > create its parameters. > Well this depends on your meaning for haplotype (e.g. http://en.wikipedia.org/wiki/Haplotype). I agree but you need to capture how close the positions ie linkage/ linkage disequilibrium. Simulating independent positions in a required format is useful but this is just a special case of simulating dependent positions. > I forgot to say it in the first mail, but if you want to use more > sophisticated scenarios - like populations that have suffered a > bottleneck or have a particular history - there are already better > tools available to do that; we should think on how to integrate this > module with them. > Maybe I should rename this module as 'SimpleHaplotypesSampler'. > Perhaps IndependentLociSampler. :) > >> You assume that the user knows exactly which locations and frequency to >> change. Often you just want a random frequency and random location. In that >> case you need to randomly select locations and frequencies based on some >> function. But I do not find the mode=='random' of paramsGenerator sufficient >> to address this. Further, you might want a random sequence of some length >> but you not want all locations to change. >> > > ok, but consider that these are haplotypes and not sequences, so you > most likely need to have regions that are more conserved and others > that change more. > This is a good question, about which models to implement, but I would > need to find a better way to represent frequencies first, and then > think about which models to implement. > Really the implementation requires some representation of the genetic map. After all if the positions are very close, the two loci should not change very frequently. 
I do not know a nice way to represent this even with genetic marker simulation (something I do know about). I have not used simcoal as my work has moved from genetic markers. Perhaps you need to see how simcoal and similar packages do it. I do understand the usefulness of the simulating independent loci but I also find it a very simple special case of what should be done. I think you need to develop some outline of what you want to achieve that changes as it progresses. Also, not everything needs to get done, other people can contribute if they want to but the general framework needs to be in place. Bruce From tiagoantao at gmail.com Fri Nov 14 15:27:38 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 20:27:38 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491D9674.7010807@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> <491D9674.7010807@gmail.com> Message-ID: <6d941f120811141227s42be58d9g6557b2b731e378a@mail.gmail.com> > Really the implementation requires some representation of the genetic map. > After all if the positions are very close, the two loci should not change > very frequently. I do not know a nice way to represent this even with > genetic marker simulation (something I do know about). I have not used > simcoal as my work has moved from genetic markers. Perhaps you need to see > how simcoal and similar packages do it. 
Just for the record, there is an excellent (in all respects: features, code quality, documentation, author support) forward time population genetics simulator in python: simuPOP. It is probably the best forward time simulator that I know of (probably better than the R-based "competitor" rmetasim, which doesn't have any provision for selection). It might be interesting to study how simuPOP represents a genome. Tiago From fglaser at technion.ac.il Sun Nov 16 02:27:06 2008 From: fglaser at technion.ac.il (Fabian Glaser) Date: Sun, 16 Nov 2008 09:27:06 +0200 Subject: [BioPython] disordered atoms in pdb Message-ID: <491FCB4A.9050207@technion.ac.il> Hi, I am quite new to biopython, so forgive me if I am asking trivial questions for a while... I am successfully reading and updating pdb files with biopython, with only one exception: disordered atoms. I understand they are part of a different object than regular atoms, but when I try, for example, to change their temperature factor values with the following code:

    if residue.is_disordered():
        for atom in residue:
            print residue, atom, atom.get_bfactor()
            atom.set_bfactor(0)
            print residue, atom, atom.get_bfactor()

With this code, if there is more than one option, for example A and B, only the first one is updated:

    23.48
    0
    25.38
    0

So how can I cleanly access every disordered atom? Thanks a lot in advance, Fabian -- Fabian Glaser, PhD Bioinformatics Knowledge Unit, The Lorry I.
Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology Haifa 32000, ISRAEL Web: http://bku.technion.ac.il Email: fglaser at tx.technion.ac.il Tel: +972-(0)4-8293701 Cel: +972-(0)54-4772396 From biopython at maubp.freeserve.co.uk Mon Nov 17 05:07:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 10:07:43 +0000 Subject: [BioPython] disordered atoms in pdb In-Reply-To: <491FCB4A.9050207@technion.ac.il> References: <491FCB4A.9050207@technion.ac.il> Message-ID: <320fb6e00811170207m2d22f96cg2c3d60011ae1d2d2@mail.gmail.com> Fabian Glaser wrote: > Hi, > > I am quite new to biopython, so forgive me if I am asking trivial questions > for a while... Hi Fabian, > I am successfully reading and updating pdb files with biopython, with only > one exception: disordered atoms. I understand they are part of a different > object than regular atoms, but when I am trying for example to change their > temperature factor values with the following code: > > if residue.is_disordered(): for atom in > residue: > print residue, atom, atom.get_bfactor() > atom.set_bfactor(0) > print residue, atom, atom.get_bfactor() Your indentation went funny in the email. Could you repeat the example, and add a little bit more code to load the PDB file and select this residue? (any PDB file with a disordered residue should be fine). > The code if there is more than one option, for example A and B, only the > first one is updated: > > 23.48 > 0 > 25.38 > 0 In this example atoms CB and CG seem to have both had their bfactor updated. I don't understand what is wrong. Peter From dalloliogm at gmail.com Mon Nov 17 06:32:28 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 17 Nov 2008 12:32:28 +0100 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) Message-ID: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> Hi, a general question. 
Are you used to organize your python/biopython scripts in pipelines or workflows? For example, many people use automatic build tools like 'make' to organize their scientific scripts. Let's say you want to study the structure of a protein from pdb. I would create a script to download it from pdb.org, one to parse its format, and others to do the analysis; then, I would write a Makefile to put everything together. I noticed that there are already some tools to do automated builds written in python. I have asked in some lists and, apart from scons, they suggested me these: - http://www.blueskyonmars.com/projects/paver (paver) - http://code.google.com/p/waf/ (waf) So, do you know these tools? Do you have any special recommendation to integrate them with biopython? -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Nov 17 06:53:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 11:53:04 +0000 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) In-Reply-To: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> Message-ID: <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > Hi, > a general question. > > Are you used to organize your python/biopython scripts in pipelines or > workflows? > For example, many people use automatic build tools like 'make' to > organize their scientific scripts. > Let's say you want to study the structure of a protein from pdb. I > would create a script to download it from pdb.org, one to parse its > format, and others to do the analysis; then, I would write a Makefile > to put everything together. Personally in this situation I tend to just write a wrapper python script (or sometimes a shell script or batch file) to call the sub scripts. i.e. 
the KISS principle. I really don't think Makefiles are a sensible solution to this problem - although it is possible. A Makefile lets you deal with simple dependencies (e.g. building an index file, or running a BLAST search and saving it to disk) but I prefer to just deal with this within my python scripts (e.g. if the index is missing, build it; if the BLAST output is missing, call BLAST). Why do you think you need a Makefile? Are you intending to provide the workflow to other people? Using a complicated Makefile means the project is harder for a new developer to understand (they need to learn a whole new programming language/tool). This may also hinder cross platform deployment (the average Windows machine won't have make installed). Peter From bartomas at gmail.com Mon Nov 17 07:22:33 2008 From: bartomas at gmail.com (bar tomas) Date: Mon, 17 Nov 2008 12:22:33 +0000 Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall Message-ID: Hi, I'm using BioPython to access NCBI Entrez databases. I'm doing this from behind a proxy-based firewall. Do you know how I can pass on my firewall parameters so that BioPython handles them? Thanks a lot From biopython at maubp.freeserve.co.uk Mon Nov 17 07:32:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 12:32:21 +0000 Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall In-Reply-To: References: Message-ID: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com> On Mon, Nov 17, 2008 at 12:22 PM, bar tomas wrote: > Hi, > > I'm using BioPython to access NCBI Entrez databases. > I'm doing this from behind a proxy-based firewall. > Do you know how I can pass on my firewall parameters so that BioPython > handles them? > > Thanks a lot Hi, Bio.Entrez is just using the python urllib to connect to the NCBI Entrez servers, and that should support a password-less proxy. 
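As a hedged illustration of that proxy support: Python's URL library picks up proxy settings from environment variables such as http_proxy. The proxy address below is a made-up placeholder, and the module shown is Python 3's urllib.request rather than the urllib of the Python 2.5 era.

```python
import os
import urllib.request

# Hypothetical proxy address, for illustration only; substitute your own.
os.environ["http_proxy"] = "http://proxy.example.org:8080"

# urllib consults the *_proxy environment variables when it looks up
# proxies. Set the variable before the first request is made, since the
# default opener reads the proxy settings when it is first built.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```

The same variable can equally be set in the shell before starting python, which avoids touching the script at all.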
Right now the Biopython code which uses the urllib.urlopen function doesn't directly let you specify the proxy. However, consulting the python documentation you should still be able to do this by setting an environment variable: http://www.python.org/doc/2.5.2/lib/module-urllib.html Does that work? Peter From bartomas at gmail.com Mon Nov 17 07:53:26 2008 From: bartomas at gmail.com (bar tomas) Date: Mon, 17 Nov 2008 12:53:26 +0000 Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall In-Reply-To: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com> References: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com> Message-ID: Your solution does work!! Thanks a lot. On Mon, Nov 17, 2008 at 12:32 PM, Peter wrote: > On Mon, Nov 17, 2008 at 12:22 PM, bar tomas wrote: > > Hi, > > > > I'm using BioPython to access NCBI Entrez databases. > > I'm doing this from behind a proxy-based firewall. > > Do you know how I can pass on my firewall parameters so that BioPython > > handles them? > > > > Thanks a lot > > Hi, > > Bio.Entrez is just using the python urllib to connect to the NCBI > Entrez servers, and that should support a password-less proxy. Right > now the Biopython code which uses the urllib.urlopen function doesn't > directly let you specify the proxy. However, consulting the python > documentation you should still be able to do this by setting an > environment variable: > http://www.python.org/doc/2.5.2/lib/module-urllib.html > > Does that work? > > Peter > From dalloliogm at gmail.com Mon Nov 17 11:26:21 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 17 Nov 2008 17:26:21 +0100 Subject: [BioPython] biopython integration with make-like tools (e.g. 
waf, paver) In-Reply-To: <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> Message-ID: <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> On Mon, Nov 17, 2008 at 12:53 PM, Peter wrote: > Giovanni Marco Dall'Olio wrote: >> Hi, >> a general question. >> >> Are you used to organize your python/biopython scripts in pipelines or >> workflows? >> For example, many people use automatic build tools like 'make' to >> organize their scientific scripts. >> Let's say you want to study the structure of a protein from pdb. I >> would create a script to download it from pdb.org, one to parse its >> format, and others to do the analysis; then, I would write a Makefile >> to put everything together. > > Personally in this situation I tend to just write a wrapper python > script (or sometimes a shell script or batch file) to call the sub > scripts. i.e. the KISS principle. wrapper scripts often are not the very optimal solution. - Over time, they tend to be become very complex and full of commented statements. When you complete a part of your experiment (e.g. you download your input sequences from ncbi) you will likely to comment out the statement that you used to download it. If you then discover that the sequences you have downloaded were wrong, you have to decomment-out the same statement, but here you can make some errors It is very difficult to remember which statements you commented out because they were wrong and when, and the wrapper script become messy very quickly, while it will take always much time to you to maintain. I have used wrapper scripts for a year during my master project and I think that's not really KISS. It seems very difficult to reproduce an analysis done without a pipeline. - make can have a nasty syntax, but it is a standard. 
If you type 'make help' you get help, and if you type 'make all' you will usually carry out the whole analysis, without having to worry about which scripts are run in particular. - there are other build systems than make, and some of them are written in python and/or for python. That means you won't necessarily have to learn a new programming syntax. Have a look at rake; all the examples I've seen are very clean. I'll let you know when I have learnt waf or paver. - makefile-like tools usually already support multi-threading. If I want to run a program on a cluster, the easiest thing for me is to write a makefile, and it works already. - a makefile allows you to re-execute parts of your analysis easily when your input files or your scripts change. This is very useful: I don't want to write a wrapper script that checks whether a file has been modified since the last time I used it to calculate some results, because make tools already do that. > > I really don't think Makefiles are a sensible solution to this problem > - although it is possible. A Makefile lets you deal with simple > dependencies (e.g. building an index file, or running a BLAST search > and saving it to disk) but I prefer to just deal with this within my > python scripts (e.g. if the index is missing, build it; if the BLAST > output is missing, call BLAST). Wouldn't you prefer something like: - if the blast output doesn't exist, OR it exists but is older than the script used to launch it, or older than the input sequence, then run it again? That's the kind of thing that makefile tools can already do for you, without having to write complicated python conditions. > Why do you think you need a Makefile? Are you intending to provide the > workflow to other people? Using a complicated Makefile means the > project is harder for a new developer to understand (they need to > learn a whole new programming language/tool).
The best thing would be to learn how to write workflows, like the ones from taverna and similar. But it takes time, and I think it is better if you know the two things. As I was saying before, make has the worst syntax, but maybe there are other building tools which are better. > This may also hinder > cross platform deployment (the average Windows machine won't have make > installed). > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Nov 17 12:27:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 17:27:12 +0000 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) In-Reply-To: <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> Message-ID: <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com> >> Personally in this situation I tend to just write a wrapper python >> script (or sometimes a shell script or batch file) to call the sub >> scripts. i.e. the KISS principle. > > wrapper scripts often are not the very optimal solution. > - Over time, they tend to be become very complex and full of commented > statements. That certainly can happen - but it can happen with any tool, even Makefiles. > When you complete a part of your experiment (e.g. you download your > input sequences from ncbi) you will likely to comment out the > statement that you used to download it. Personally to avoid this kind of thing, I make the download (or running BLAST, or whatever) conditional on a check to see if the output file exists (and don't just comment out the call). You could also do date checking in code too. 
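A minimal sketch of that existence-plus-date check, assuming plain files on disk. The function names needs_rebuild and run_step are made up for illustration and are not part of Biopython:

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(target, sources, action):
    """Call action() only when target is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        action()
        return True
    return False
```

A step such as a BLAST search would then be wrapped as run_step("hits.xml", ["query.fasta", "run_blast.py"], do_blast), where the file names and do_blast are placeholders. Listing the script itself among the sources gives the make-like "re-run if the script changed" behaviour.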
> If you then discover that the sequences you have downloaded were > wrong, you have to decomment-out the same statement, but here you can > make some errors In my case, I can delete the old input sequences (or the BLAST output) and re-run the script. I would agree that for more complicate multi-step analyses this requires some thought - but you can at least handle error conditions any way you like (i.e. helpful messages instead of whatever the build tool does). > It is very difficult to remember which statements you commented out > because they were wrong and when, and the wrapper script become messy > very quickly, while it will take always much time to you to maintain. > I have used wrapper scripts for a year during my master project and I > think that's not really KISS. It seems very difficult to reproduce an > analysis done without a pipeline. I guess it depends on what you mean by a pipeline - you can have a robust pipeline which is essentially one master python script. I agree there is a danger that the script will evolve over time into a horrible mess. > - make can have a nasty syntax, but it is a standard. If you type > 'make help' you get help, and if you type 'make all' usually you will > carry out the whole analysis, without having to worry on which scripts > are be run in particular. I would agree that make has a nasty syntax. Note that make isn't a completely cross platform standard (although you can get it on Windows via cygwin for example). > - there are other build system than make, some of them are written in > python and/or for python. > That means you won't have to necessarly learn a new programming > syntax. Have a look at rake, all the examples I've seen are very > clean. I'll let you know when I will have learnt waf or paver. These (and Make) all seem to be designed to solve a different problem, handling the compilation and/or installation of software with multiple dependencies. 
That doesn't mean you can't use them for a pipeline, but it may not be ideal. > - makefiles like tools usually already support multi-threading. If I > want to run a program on a cluster, the easiest thing for me is to > write a makefile, and it works already. For trivial multi-threading, yes, make can help. > - makefile allows you to re-execute parts of your analysis easily when > your input files or your scripts changes. > This is very useful, I don't want to write a wrapper script that > checks if a file has been modified since the last time I have used it > to calculate some results - because make tools already do that. If you already know how to work with make files, that this does have some advantages. i.e. Instead of writing a python wrapper script, you write a simple Makefile. I think we agree that Make is pretty complex, a language in its own right. This means if you want someone else to use your pipeline, then they have to learn how to use make too (if anything goes wrong or they want to change it). > Wouldn't you prefer something like: > - if the blast output doesn't exist, OR it exists but it is older than > the script used to launch it, or older than the input sequence, then > run it again? That sounds potentially useful for a complicated analysis pipeline. But suppose you also wanted to check the current version of BLAST installed and the version of BLAST used in the existing output file? This would probably be possible within a Makefile using some embedded shell scripts calling grep, but it wouldn't be very nice at all. Although it would still be a non-trivial bit of code, I would prefer to do this in python (maybe put the code into a library function for reuse). My point is, using some other tool like Make could make certain operations easier, but with a python script you can do this sort of thing and more. You have full control, without adding another dependency to the project. 
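As a rough sketch of the BLAST version check, assuming the plain-text report starts with a header line such as "BLASTP 2.2.18 [Mar-02-2008]" (other releases may format this differently). The helper names are hypothetical, and how the installed version string is obtained is left to the caller:

```python
import re

# Assumed header shape: program name, then a dotted version number,
# e.g. "BLASTP 2.2.18 [Mar-02-2008]". Other BLAST releases may differ.
_HEADER = re.compile(r"^\s*(T?BLAST[NPX])\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def blast_version_from_report(first_line):
    """Return (program, version) parsed from a report's first line, or None."""
    match = _HEADER.match(first_line)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2)


def report_is_current(report_first_line, installed_version):
    """True if the report header names the given BLAST version string."""
    parsed = blast_version_from_report(report_first_line)
    return parsed is not None and parsed[1] == installed_version
```

A check like this could sit alongside the file date test in a wrapper script, so that a report is regenerated when either its inputs or the BLAST version have changed.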
> that's the kind of things that makefile tools can do for you already, > without having to write complicated python conditions. True - but as I have tried to illustrate above, even Make has its limitations. > The best thing would be to learn how to write workflows, like the ones > from taverna and similar. > But it takes time, and I think it is better if you know the two things. > As I was saying before, make has the worst syntax, but maybe there are > other building tools which are better. I certainly wouldn't be keen on make itself, but there might be a python library out there that would be a good compromise (making the common file existence/date based tasks easy, but allowing arbitrary extension - e.g. my BLAST version check requirement). Peter From dalloliogm at gmail.com Mon Nov 17 13:19:04 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 17 Nov 2008 19:19:04 +0100 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) In-Reply-To: <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com> Message-ID: <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com> > > That sounds potentially useful for a complicated analysis pipeline. > But suppose you also wanted to check the current version of BLAST > installed and the version of BLAST used in the existing output file? > This would probably be possible within a Makefile using some embedded > shell scripts calling grep, but it wouldn't be very nice at all. > Although it would still be a non-trivial bit of code, I would prefer > to do this in python (maybe put the code into a library function for > reuse). well, in principle you would check for blast's executable last modification date. 
If the blast executable has a modification date which is younger than the results file, you will have to calculate them again. Other build tools can also check for md5 modification to prerequisite files, or can be integrated with subversion/other rcs systems. You can do this in python, but it takes a lot of time, and would mean re-writing existing code. I am sure there is should be something specific for bioinformaticians already :). Well, I'll write some workflows with the tools I linked before (and also with scons) and let you know. > My point is, using some other tool like Make could make certain > operations easier, but with a python script you can do this sort of > thing and more. You have full control, without adding another > dependency to the project. > >> that's the kind of things that makefile tools can do for you already, >> without having to write complicated python conditions. > > True - but as I have tried to illustrate above, even Make has its limitations. > >> The best thing would be to learn how to write workflows, like the ones >> from taverna and similar. >> But it takes time, and I think it is better if you know the two things. >> As I was saying before, make has the worst syntax, but maybe there are >> other building tools which are better. > > I certainly wouldn't be keen on make itself, but there might be a > python library out there that would be a good compromise (making the > common file existence/date based tasks easy, but allowing arbitrary > extension - e.g. my BLAST version check requirement). > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Nov 17 13:33:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 18:33:26 +0000 Subject: [BioPython] biopython integration with make-like tools (e.g. 
waf, paver) In-Reply-To: <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com> <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com> Message-ID: <320fb6e00811171033j6eb7e806o8ea70aa863a59eb2@mail.gmail.com> >> That sounds potentially useful for a complicated analysis pipeline. >> But suppose you also wanted to check the current version of BLAST >> installed and the version of BLAST used in the existing output file? >> This would probably be possible within a Makefile using some embedded >> shell scripts calling grep, but it wouldn't be very nice at all. >> Although it would still be a non-trivial bit of code, I would prefer >> to do this in python (maybe put the code into a library function for >> reuse). > > well, in principle you would check for blast's executable last > modification date. > If the blast executable has a modification date which is younger than > the results file, you will have to calculate them again. That might work, but is a slightly different check. Just because the executable is "newer" doesn't mean its a different version. > Other build tools can also check for md5 modification to prerequisite > files, or can be integrated with subversion/other rcs systems. > You can do this in python, but it takes a lot of time, and would mean > re-writing existing code. I am sure there is should be something > specific for bioinformaticians already :). There might be, but I don't see this kind of thing as specific to bioinformatics. Data analysis pipelines could be applied to any scientific data analysis, e.g. meteorological data analysis. > Well, I'll write some workflows with the tools I linked before (and > also with scons) and let you know. 
I guess the best way to evaluate the tools is to try using them :) Good luck, Peter From bsouthey at gmail.com Wed Nov 19 12:46:19 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 19 Nov 2008 11:46:19 -0600 Subject: [BioPython] Anyone using Affy module? Message-ID: <492450EB.7010801@gmail.com> Hi, If there is anyone who uses the Affy module, could you please let me know? If so, I would also like to know which version of Affy chips are being used. I know that Version 4 is binary and this is not supported. But with version 3 CEL files, the code provides the transpose of the rows and columns. Also the code does not read in the outliers or masks sections. Thanks, Bruce From bsouthey at gmail.com Wed Nov 19 15:40:52 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 19 Nov 2008 14:40:52 -0600 Subject: [BioPython] Usage of Bio.NMR Message-ID: <492479D4.3050205@gmail.com> Hi, Does anyone use the Bio.NMR module? If so, could you let me know? Also, I would appreciate a sample .xpk peaklist file that could be used for testing purposes. From the code # xpktools.py: A python module containing function definitions and classes # useful for manipulating data from nmrview .xpk peaklist files I presume this uses the NMRview software: http://www.onemoonscientific.com/nmrview/summary.html Thanks Bruce From biopython at maubp.freeserve.co.uk Thu Nov 20 05:38:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Nov 2008 10:38:54 +0000 Subject: [BioPython] Usage of Bio.NMR In-Reply-To: <492479D4.3050205@gmail.com> References: <492479D4.3050205@gmail.com> Message-ID: <320fb6e00811200238l607a5fbaq9107abf8bf8a305a@mail.gmail.com> On Wed, Nov 19, 2008 at 8:40 PM, Bruce Southey wrote: > Hi, > Does anyone use the Bio.NMR module? > If so, could you let me know? > Also, I would appreciate a sample .xpk peaklist file that could be used for > testing purposes. Have you tried emailing the Bio.NMR author, Robert G. Bussel? 
There is a www.med.cornell.edu email address in the source code which
might still be live.

Peter

From bsouthey at gmail.com Thu Nov 20 15:46:34 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 20 Nov 2008 14:46:34 -0600
Subject: [BioPython] Does anyone use EZRetrieve?
Message-ID: <4925CCAA.2040809@gmail.com>

Hi,
Does anyone use EZRetrieve
(http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ?
This allows a user to retrieve a human, mouse or rat genome nucleic
sequence based on a valid identifier.

I think that most of the functionality of Bio.EZRetrieve is already
present in Biopython and the genome sources appear to be 5 years old.
For example, it uses LocusLink, which was discontinued in March 2005.

If so could you please let me know?

Thanks
Bruce

From biopython at maubp.freeserve.co.uk Thu Nov 20 15:53:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Nov 2008 20:53:37 +0000
Subject: [BioPython] Does anyone use EZRetrieve?
In-Reply-To: <4925CCAA.2040809@gmail.com>
References: <4925CCAA.2040809@gmail.com>
Message-ID: <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com>

On Thu, Nov 20, 2008 at 8:46 PM, Bruce Southey wrote:
> Hi,
> Does anyone use EZRetrieve
> (http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ?
> This allows a user to retrieve a human, mouse or rat genome nucleic sequence
> based on a valid identifier.
>
> I think that most of the functionality of Bio.EZRetrieve is already present
> in Biopython and the genome sources appear to be 5 years old. For example,
> it uses LocusLink, which was discontinued in March 2005.
>
> If so could you please let me know?

Actually - could you let the whole mailing list know? ;)

Given the nature of the database and the limited functionality this
python code offers, if no-one is using Bio.EZRetrieve then it could be
considered for deprecation.
Thanks, Peter From biopython at maubp.freeserve.co.uk Fri Nov 21 11:59:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Nov 2008 16:59:08 +0000 Subject: [BioPython] Biopython 1.49 released Message-ID: <320fb6e00811210859n2d128fd6nc21ad1012e1d93bf@mail.gmail.com> Dear Biopythoneers, We are pleased to announce the release of Biopython 1.49. There have been some significant changes since Biopython 1.48 was released a few months ago, which is why we initially released a beta for wider testing. Thank you to all those who tried this and reported the minor problems uncovered. As previously announced, the big news is that Biopython now uses NumPy rather than its precursor Numeric (the original Numerical Python library). As in the previous releases, Biopython 1.49 supports Python 2.3, 2.4 and 2.5 but should now also work fine on Python 2.6. Please note that we intend to drop support for Python 2.3 in a couple of releases time. We also have some new functionality, starting with the basic sequence object (the Seq class) which now has more methods. This encourages a more object orientated coding style, and makes basic biological operations like transcription and translation more accessible and discoverable. Our BioSQL interface can now optionally fetch the NCBI taxonomy on demand when loading sequences (via Bio.Entrez) allowing you to populate the taxon/taxon_name tables gradually. Also, BioSQL should now work with the psycopg2 driver for PostgreSQL (as well as the older psycopg driver), and the handling of feature locations has also been improved. We've also updated the Biopython Tutorial and Cookbook (also available in PDF). http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Finally, our old parsing infrastructure (Martel and Bio.Mindy) is now considered to be deprecated, meaning mxTextTools is no longer required to use Biopython. This should not affect any of the typically used parsers (e.g. 
Bio.SeqIO and Bio.AlignIO).

Given there have been more changes than in recent Biopython releases,
please do check your old scripts still work fine, and let us know on the
mailing list or file a bug if there is anything wrong.

Source distributions and Windows installers are available from the
Biopython website:
http://biopython.org/wiki/Download

Thanks!

-Peter on behalf of the Biopython developers

P.S. You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News

From lueck at ipk-gatersleben.de Sun Nov 23 08:21:52 2008
From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de)
Date: Sun, 23 Nov 2008 14:21:52 +0100
Subject: [BioPython] ClustalW Multiple Alignment
Message-ID: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>

Hi!

I want to align several sequences while checking whether any of them
are reverse complements.

That means I align all sequences against a reference sequence, then
reverse-complement them and align them again. After that I want to
compare the scores and choose the better one.

Now my question:

How can I get the score? Unfortunately it's not in the dnd file. In an
old message on this mailing list, it was written that it's in the log
file. Has this been removed?

Thanks in advance!
Stefanie

From biopython at maubp.freeserve.co.uk Sun Nov 23 08:32:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 23 Nov 2008 13:32:23 +0000
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
Message-ID: <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com>

On Sun, Nov 23, 2008 at 1:21 PM, wrote:
> Hi!
>
> I want to align several sequences while checking whether any of them
> are reverse complements.
>
> That means I align all sequences against a reference sequence, then
> reverse-complement them and align them again.
> After that I want to compare the scores and choose the better one.

If I have understood you correctly, you have one reference nucleotide
sequence, and many nucleotide sequences of unknown orientation. For
each sequence you want to do a pairwise alignment against the
reference to decide which orientation matches best (forwards, or
reverse complement).

So you want to do lots of pairwise alignments. Perhaps ClustalW is not
the best choice - maybe use EMBOSS needle? You could also try
Biopython's Bio.pairwise2 module.

> Now my question:
>
> How can I get the score?

What score exactly are you looking for?

> Unfortunately it's not in the dnd file.

The dnd file from clustalw is just a tree, there is no score.

> In an old message on this mailing list, it was written that it's
> in the log file. Has this been removed?

What log file? I didn't think clustalw wrote a log file. It could be
in the standard output printed to screen...

What old message on the mailing list are you referring to? Could you
link to it in the archive maybe?
http://lists.open-bio.org/pipermail/biopython/

Peter

From pmmagic at gmail.com Sun Nov 23 10:42:34 2008
From: pmmagic at gmail.com (paul m)
Date: Sun, 23 Nov 2008 10:42:34 -0500
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
Message-ID: <991e7bc10811230742r38ae28d2j78a5bb87171e0a4d@mail.gmail.com>

On Sun, Nov 23, 2008 at 8:21 AM, wrote:
> Hi!
>
> I want to align several sequences while checking whether any of them
> are reverse complements.
>
> That means I align all sequences against a reference sequence, then
> reverse-complement them and align them again. After that I want to compare
> the scores and choose the better one.
> Now my question:
>
> How can I get the score? Unfortunately it's not in the dnd file.
> In an old message on this mailing list, it was written that it's in the
> log file. Has this been removed?

I think Thomas Mailund's ClustalW package will allow you to get the scores:

See: http://www.daimi.au.dk/~mailund/clustalw_wrapper.html

--Paul

From lueck at ipk-gatersleben.de Mon Nov 24 06:23:45 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 24 Nov 2008 12:23:45 +0100
Subject: [BioPython] ClustalW Multiple Alignment
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
 <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com>
Message-ID: <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>

Yes, first I want to make a pairwise alignment to find the right
orientation of all sequences, and after that I'll align everything in a
multiple alignment.

I meant this message:
http://lists.open-bio.org/pipermail/biopython/2005-May/002656.html

So you're right, they talk about the printed output.
I'll try to get the score.
Thanks for the link!
Stefanie

----- Original Message -----
From: "Peter"
To:
Cc:
Sent: Sunday, November 23, 2008 2:32 PM
Subject: Re: [BioPython] ClustalW Multiple Alignment

> On Sun, Nov 23, 2008 at 1:21 PM, wrote:
>> Hi!
>>
>> I want to align several sequences while checking whether any of them
>> are reverse complements.
>>
>> That means I align all sequences against a reference sequence, then
>> reverse-complement them and align them again. After that I want to
>> compare the scores and choose the better one.
>
> If I have understood you correctly, you have one reference nucleotide
> sequence, and many nucleotide sequences of unknown orientation. For
> each sequence you want to do a pairwise alignment against the
> reference to decide which orientation matches best (forwards, or
> reverse complement).
>
> So you want to do lots of pairwise alignments. Perhaps ClustalW is not
> the best choice - maybe use EMBOSS needle? You could also try
> Biopython's Bio.pairwise2 module.
>
>> Now my question:
>>
>> How can I get the score?
>
> What score exactly are you looking for?
>
>> Unfortunately it's not in the dnd file.
>
> The dnd file from clustalw is just a tree, there is no score.
>
>> In an old message on this mailing list, it was written that it's
>> in the log file. Has this been removed?
>
> What log file? I didn't think clustalw wrote a log file. It could be
> in the standard output printed to screen...
>
> What old message on the mailing list are you referring to? Could you
> link to it in the archive maybe?
> http://lists.open-bio.org/pipermail/biopython/
>
> Peter
>

From david.moreira at u-psud.fr Mon Nov 24 09:24:59 2008
From: david.moreira at u-psud.fr (David Moreira)
Date: Mon, 24 Nov 2008 14:24:59 +0000
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
 <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com>
 <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>
Message-ID: <492AB93B.9050100@u-psud.fr>

Dear Stefanie,

For a similar case, what I do is first use BLAST to determine the
orientation of the sequence. I have a small sequence database with
sequences in the correct orientation and I BLAST against it; BLAST tells
you whether the retrieved result is "plus/plus" or "minus/plus", i.e.,
whether your query was in the same or different orientation. You just
have to parse the BLAST output to retrieve that information. BLAST is
extremely rapid, so you can retrieve the orientation of hundreds of
sequences in a few minutes. Then, you can reverse-complement the
sequences in "minus" orientation and construct your multiple sequence
alignment. It is very easy to have a single script doing all the work.

David

Stefanie Lück wrote:
> Yes, first I want to make a pairwise alignment to find the right
> orientation of all sequences, and after that I'll align everything in a
> multiple alignment.
> > I meant this message: > http://lists.open-bio.org/pipermail/biopython/2005-May/002656.html > > So you're right they talk about the printed output. > I try to get the score. > Thanks for the link! > Stefanie > > ----- Original Message ----- From: "Peter" > > To: > Cc: > Sent: Sunday, November 23, 2008 2:32 PM > Subject: Re: [BioPython] ClustalW Multiple Alignment > > >> On Sun, Nov 23, 2008 at 1:21 PM, wrote: >>> Hi! >>> >>> I want to align several sequences under the allowance to check >>> whether there >>> are reversed complement. >>> >>> That means I align all sequences against a reference sequence and then >>> reversed complement them and align them again. After that I want to >>> compare >>> the score and choose the better once. >> >> If I have understood you correctly, you have one reference nucleotide >> sequence, and many nucleotide sequences of unknown orientation. For >> each sequence you want to do a pairwise alignment against the >> reference to decide which orientation matches best (forwards, or >> reverse complement). >> >> So you want to lots of pairwise alignments. Perhaps ClustalW is not >> the best choice - maybe use EMBOSS needle? You could also try >> Biopython's Bio.pairwise2 module. >> >>> Now my question: >>> >>> How can I get the score? >> >> What score exactly are you looking for? >> >>> Unfortunately it's not in the dnd file. >> >> The dnd file from clustalw is just a tree, there is no score. >> >>> In an old message of this mailing list, it's was written that it's >>> in the log file. Does this has been removed? >> >> What log file? I didn't think clustalw wrote a log file. It could be >> in the standard output printed to screen... >> >> What old message on the mailing list are you refering to? Could you >> link to it in the archive maybe? 
>> http://lists.open-bio.org/pipermail/biopython/ >> >> Peter >> > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartek at rezolwenta.eu.org Mon Nov 24 09:51:12 2008 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 24 Nov 2008 15:51:12 +0100 Subject: [BioPython] Refactoring motif analysis code Message-ID: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> Hello All, Currently, there are two packages dealing with motif analysis in biopython : Bio.AlignAce (written by me) and Bio.MEME (written by Jason Hackney). Both of them are quite old and they were developed independently so the functionality is largely overlapping. Particularly the files AlignAce/Motif.py and MEME/Motif.py contain almost identical functionality useful for anyone interested in motif analysis of writing a parser for yet another motif searching tool. I'd like to change this and create a new library called Bio.Motif, which would contain: -Motif class for all general functionality concerning motif objects: i/o, comparisons, sequence scanning -AlignAce Parser -MEME Parser When this is completed, we could deprecate the AlignAce and MEME modules. For AlignAce I have most of the code already written, I need to rewrite portions of MEME parser to work with different motif implementation (not a major pain). Then I just need to polish it a bit and provide tests and a short tutorial. After this rather long intro I'd like to ask about several things: - Are there many Bio.AlignAce or Bio.MEME users who would be unhappy about deprecating them? 
- Are there any features which people would find valuable in Bio.Motif?
- Both MEME and AlignAce are DNA-oriented, I've never worked on
Protein motifs myself, but I'd like to know whether anyone is
interested in using Bio.Motif for that

Any comments/ideas are welcome

cheers
Bartek

--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433

From dalloliogm at gmail.com Mon Nov 24 10:25:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 24 Nov 2008 16:25:23 +0100
Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code
In-Reply-To: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
Message-ID: <5aa3b3570811240725n54f7f624oc1db5fe0b88e3f5a@mail.gmail.com>

On Mon, Nov 24, 2008 at 3:51 PM, Bartek Wilczynski wrote:
> Hello All,
>
> Currently, there are two packages dealing with motif analysis in biopython :
> Bio.AlignAce (written by me) and Bio.MEME (written by Jason Hackney).

Hi,
I asked a question about motifs one year ago on this list.
Here is the thread:
- http://lists.open-bio.org/pipermail/biopython/2007-September/003727.html

I would just like to tell you that I have tried the TAMO framework you
suggested to me, and found it very useful.
I am not using it anymore because I don't need it, but I remember that I liked:
- the methods to represent motifs as matrices of frequencies/occurrences etc.
- the fact that it was easy to create a motif from an alignment of sequences
- the integration it had with this website: http://weblogo.berkeley.edu/logo.cgi.

I would suggest providing integration with this other web service,
which can plot the difference between two sequence logos:
http://www.twosamplelogo.org/examples.html.
Maybe you should contact TAMO's author to ask him if he wants to contribute, because I remember that its framework was really complete. > > Both of them are quite old and they were developed independently so > the functionality is largely overlapping. > Particularly the files AlignAce/Motif.py and MEME/Motif.py contain > almost identical functionality useful for > anyone interested in motif analysis of writing a parser for yet > another motif searching tool. > > I'd like to change this and create a new library called Bio.Motif, > which would contain: > -Motif class for all general functionality concerning motif objects: > i/o, comparisons, sequence scanning > -AlignAce Parser > -MEME Parser > > When this is completed, we could deprecate the AlignAce and MEME > modules. For AlignAce I have most of the code > already written, I need to rewrite portions of MEME parser to work > with different motif implementation (not a major pain). > Then I just need to polish it a bit and provide tests and a short tutorial. > > After this rather long intro I'd like to ask about several things: > - Are there many Bio.AlignAce or Bio.MEME users who would be unhappy > about deprecating them? 
> - Are there any features which people would find valuable in Bio.Motif?
> - Both MEME and AlignAce are DNA-oriented, I've never worked on
> Protein motifs myself, but I'd like to know whether anyone is
> interested in using Bio.Motif for that
>
> Any comments/ideas are welcome
>
> cheers
> Bartek
>
> --
> Bartek Wilczynski
> ==================
> Postdoctoral fellow
> EMBL, Furlong group
> Meyerhoffstrasse 1,
> 69012 Heidelberg,
> Germany
> tel: +49 6221 387 8433
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it

From T.Gkikopoulos at dundee.ac.uk Mon Nov 24 09:58:35 2008
From: T.Gkikopoulos at dundee.ac.uk (Triantafyllos Gkikopoulos)
Date: Mon, 24 Nov 2008 14:58:35 +0000
Subject: [BioPython] averaging ATG based signal from chip on chip data
Message-ID: <492AC11B020000B3000010E2@gw-out.dundee.ac.uk>

Hi all,

I am new to python and have started learning on my own. I want to do
some microarray analysis, and one of the things I would like to do is to
average the signal from a tiled array by aligning a set of coordinates,
say ORF ATGs, and plotting the average signal for a region of fixed
length, say from the ATG to 300 bp downstream. Obviously this should be
the same if I have a bed file and want to do the same analysis based on
either the start or the stop of all fragments included in the bed file.

I can use the csv component to import my bed file and signal file, but I
am not sure what the best way to store such data is: a dictionary, an
array, or just a list.
Appreciate any help cheers Dr Triantafyllos Gkikopoulos The University of Dundee is a registered Scottish charity, No: SC015096 From bsouthey at gmail.com Mon Nov 24 10:54:32 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 24 Nov 2008 09:54:32 -0600 Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code In-Reply-To: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com> Message-ID: <492ACE38.1090301@gmail.com> Bartek Wilczynski wrote: > Hello All, > > Currently, there are two packages dealing with motif analysis in biopython : > Bio.AlignAce (written by me) and Bio.MEME (written by Jason Hackney). > Actually I am not that thrilled with the licenses for these packages and similar packages because these are free only for academic use. To me this clashes with the spirit of an open-sourced project especially a BSD-licensed one. But if there is a need for such modules then these modules should be included. > Both of them are quite old and they were developed independently so > the functionality is largely overlapping. > Particularly the files AlignAce/Motif.py and MEME/Motif.py contain > almost identical functionality useful for > anyone interested in motif analysis of writing a parser for yet > another motif searching tool. > > I'd like to change this and create a new library called Bio.Motif, > which would contain: > -Motif class for all general functionality concerning motif objects: > i/o, comparisons, sequence scanning > -AlignAce Parser > -MEME Parser > > While it is only free for academic use, have you seen TAMO? *TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. * Bioinformatics. 2005 Jul 15;21(14):3164-5. http://fraenkel.mit.edu/TAMO/ > When this is completed, we could deprecate the AlignAce and MEME > modules. 
For AlignAce I have most of the code > already written, I need to rewrite portions of MEME parser to work > with different motif implementation (not a major pain). > Then I just need to polish it a bit and provide tests and a short tutorial. > > After this rather long intro I'd like to ask about several things: > - Are there many Bio.AlignAce or Bio.MEME users who would be unhappy > about deprecating them? > Well, I am not sure how many used Bio.AlignAce given the Parser.py bug :-) Based on the CVS, both have been untouched for about three years. Also, what species are these used for? One of the papers of AlignAce indicate that the base composition was set for yeast. > - Are there any features which people would find valuable in Bio.Motif > - Both MEME and AlignAce are DNA-oriented, I've never worked on > Protein motifs myself, but I'd like to know whether anyone is > interested in using Bio.Motif for that > > Any comments/ideas are welcome > > cheers > Bartek > > Personally I would be interested in a general protein motif finding module because of my current research. However, I do have a different view with respect to the Biopython community as indicated above with the licenses. Bruce From cjauvin at gmail.com Mon Nov 24 17:18:29 2008 From: cjauvin at gmail.com (Christian Jauvin) Date: Mon, 24 Nov 2008 17:18:29 -0500 Subject: [BioPython] PubMed find_related Message-ID: Hi, I'd like to use the PubMed find_related function, but the doc says that it's deprecated and that I should use the one in the Bio.Entrez module: "Find related articles in PubMed, returns an ID list (DEPRECATED). Please use Bio.Entrez instead as described in the Biopython Tutorial." The problem is that I can't find the equivalent in the Bio.Entrez module.. 
(I'm using latest version 1.49)

Thanks,

Christian

From mjldehoon at yahoo.com Mon Nov 24 23:05:01 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 24 Nov 2008 20:05:01 -0800 (PST)
Subject: [BioPython] PubMed find_related
In-Reply-To:
Message-ID: <580790.81356.qm@web62404.mail.re1.yahoo.com>

>>> from Bio import Entrez
>>> handle = Entrez.elink(dbfrom='pubmed',id=12345)
>>> record = Entrez.read(handle)

Feel free to write a section about Entrez.elink for the Biopython
documentation :-). Currently, this section is almost empty.

--Michiel.

--- On Mon, 11/24/08, Christian Jauvin wrote:

> From: Christian Jauvin
> Subject: [BioPython] PubMed find_related
> To: biopython at biopython.org
> Date: Monday, November 24, 2008, 5:18 PM
> Hi,
>
> I'd like to use the PubMed find_related function, but the doc says
> that it's deprecated and that I should use the one in the Bio.Entrez
> module:
>
> "Find related articles in PubMed, returns an ID list (DEPRECATED).
> Please use Bio.Entrez instead as described in the Biopython Tutorial."
>
> The problem is that I can't find the equivalent in the Bio.Entrez
> module.. (I'm using latest version 1.49)
>
> Thanks,
>
> Christian

From rjalves at igc.gulbenkian.pt Thu Nov 27 10:17:38 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 27 Nov 2008 15:17:38 +0000
Subject: [BioPython] Cladograms
Message-ID: <492EBA12.2080408@igc.gulbenkian.pt>

Hi everyone,

I've been searching the web for python modules to do cladograms but
the only relevant stuff I found was related to dendrograms and
hierarchical clustering which will give the representation I need.
My goal is something that resembles a heatmap[1] but where the trees
will be cladograms[2] instead of the result of clustering steps.
I know that probably I won't find modules doing exactly what I want, which is why I'm searching for tools to do each step separately and try to glue them somehow. For the heatmap I have something already that will probably do the job, but for the cladograms I couldn't find any decent module. Do you happen to know any dark alley in BioPython or any other external module that would allow me to do the cladogram? Thanks, Renato [1] - http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png [2] - http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif From rjalves at igc.gulbenkian.pt Thu Nov 27 11:11:01 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Thu, 27 Nov 2008 16:11:01 +0000 Subject: [BioPython] Cladograms In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> Message-ID: <492EC695.7020102@igc.gulbenkian.pt> Where "which will give the representation I need" is read should be "which will *not* give the representation I need". Sorry for that. Renato Quoting Renato Alves on 11/27/2008 03:17 PM: > Hi everyone, > > I've been searching the web for python modules to do cladograms but > the only relevant stuff I found was relative to dendograms and > hierarchical clustering which will give the representation I need. > My goal is something that resembles a heatmap[1] but where the trees > will be cladograms[2] instead of the result of clustering steps. > I know that probably I won't find modules doing exactly what I want, > which is why I'm searching for tools to do each step separately and > try to glue them somehow. For the heatmap I have something already > that will probably do the job, but for the cladograms I couldn't find > any decent module. > Do you happen to know any dark alley in BioPython or any other > external module that would allow me to do the cladogram? 
> > Thanks, > Renato > > [1] - > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png > > [2] - > http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From dalloliogm at gmail.com Thu Nov 27 11:45:37 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 27 Nov 2008 17:45:37 +0100 Subject: [BioPython] Cladograms In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> Message-ID: <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves wrote: > Hi everyone, > > I've been searching the web for python modules to do cladograms but the only > relevant stuff I found was relative to dendograms and hierarchical > clustering which will give the representation I need. > My goal is something that resembles a heatmap[1] Let me premise that I am not able to help you :). But this seems to be the kind of things that R does. Have you had a look at it? > but where the trees will be > cladograms[2] instead of the result of clustering steps. > I know that probably I won't find modules doing exactly what I want, which > is why I'm searching for tools to do each step separately and try to glue > them somehow. For the heatmap I have something already that will probably do > the job, but for the cladograms I couldn't find any decent module. > Do you happen to know any dark alley in BioPython or any other external > module that would allow me to do the cladogram? But from which kind of data? Do you have to align sequences, or are they aligned already? do you already have the cladograms, or do you have to calculate them? 
> > Thanks, > Renato > > [1] - > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png > [2] - > http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From rjalves at igc.gulbenkian.pt Thu Nov 27 13:50:36 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Thu, 27 Nov 2008 18:50:36 +0000 Subject: [BioPython] Cladograms In-Reply-To: <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> References: <492EBA12.2080408@igc.gulbenkian.pt> <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> Message-ID: <492EEBFC.4050201@igc.gulbenkian.pt> Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM: > On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves wrote: > >> Hi everyone, >> >> I've been searching the web for python modules to do cladograms but the only >> relevant stuff I found was relative to dendograms and hierarchical >> clustering which will give the representation I need. >> My goal is something that resembles a heatmap[1] >> > Let me premise that I am not able to help you :). > But this seems to be the kind of things that R does. Have you had a look at it? > For the heatmap I did have a look and seems easy to use, for the cladograms I couldn't find much using the name with help.search(). Also google doesn't give the best results when searching for R . Still I would like to remain in python as much as possible. I used rpy before to do some R from within python and it worked but the code is far from "maintainable". >> but where the trees will be >> cladograms[2] instead of the result of clustering steps. 
>> I know that probably I won't find modules doing exactly what I want, which >> is why I'm searching for tools to do each step separately and try to glue >> them somehow. For the heatmap I have something already that will probably do >> the job, but for the cladograms I couldn't find any decent module. >> Do you happen to know any dark alley in BioPython or any other external >> module that would allow me to do the cladogram? >> > But from which kind of data? > Do you have to align sequences, or are they aligned already? do you > already have the cladograms, or do you have to calculate them? > The data will be mostly taxonomic and in some cases genes grouped by properties. I only need to find a simple way to turn it into an image. Thanks >> Thanks, >> Renato >> >> [1] - >> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png >> [2] - >> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> From dalloliogm at gmail.com Thu Nov 27 15:48:07 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 27 Nov 2008 21:48:07 +0100 Subject: [BioPython] Cladograms In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> <492EEBFC.4050201@igc.gulbenkian.pt> Message-ID: <5aa3b3570811271248u5fcde149md8741c943a6b62ef@mail.gmail.com> On Thu, Nov 27, 2008 at 7:50 PM, Renato Alves wrote: > Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM: >> >> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves >> wrote: >> >>> >>> Hi everyone, >>> >>> I've been searching the web for python modules to do cladograms but the >>> only >>> relevant stuff I found was relative to dendograms and hierarchical >>> clustering which will give the 
representation I need. >>> My goal is something that resembles a heatmap[1] >>> >> >> Let me premise that I am not able to help you :). >> But this seems to be the kind of things that R does. Have you had a look >> at it? >> > > For the heatmap I did have a look and seems easy to use, for the cladograms > I couldn't find much using the name with help.search(). Also google doesn't > give the best results when searching for R . Ask to the R users mailing list. Be careful on how you write your message there, because it is a mailing list with a lot of users. If you find someting interesting in R, please don't forget us :), don't forget that python is cool :). > Still I would like to remain in python as much as possible. A population genetics module is under development at the moment, but it doesn't implement anything like that. I am sorry I am not aware of any module capable of doing this. > Thanks >>> >>> Thanks, >>> Renato >>> >>> [1] - >>> >>> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png >>> [2] - >>> >>> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From winda002 at student.otago.ac.nz Thu Nov 27 16:52:08 2008 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 28 Nov 2008 10:52:08 +1300 Subject: [BioPython] Cladograms In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> <492EEBFC.4050201@igc.gulbenkian.pt> Message-ID: <492F1688.7060505@student.otago.ac.nz> Renato Alves wrote: > Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM: >> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves >> wrote: >> >>> Hi everyone, >>> >>> 
I've been searching the web for python modules to do cladograms but >>> the only >>> relevant stuff I found was relative to dendograms and hierarchical >>> clustering which will give the representation I need. >>> My goal is something that resembles a heatmap[1] >>> >> Let me premise that I am not able to help you :). >> But this seems to be the kind of things that R does. Have you had a >> look at it? >> > For the heatmap I did have a look and seems easy to use, for the > cladograms I couldn't find much using the name with help.search(). [snip] Hi renato, Have you looked into the R package ape (analysis of phylogentics and evolution, install.packages("ape")) - it has object classes for phylip and nexus trees. I don't know how easy it is to kludge things together from different packages but it might be worth looking into? There is a python module to integrate with R if you want to stay pure ;) -- PhD Student Allan Wilson Centre Department of Zoology University of Otago, PO Box 56, Dunedin 9054 ph: +64-3-4778459 mob: +64-27-3326815 e: winda002 at student.otago.ac.nz From p.j.a.cock at googlemail.com Fri Nov 28 06:29:44 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Nov 2008 11:29:44 +0000 Subject: [BioPython] Cladograms In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> Message-ID: <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com> On Thu, Nov 27, 2008 at 3:17 PM, Renato Alves wrote: > Hi everyone, > > I've been searching the web for python modules to do cladograms but the only > relevant stuff I found was relative to dendograms and hierarchical > clustering which will give the representation I need. Hi Renato, > My goal is something that resembles a heatmap[1] but where the trees will be > cladograms[2] instead of the result of clustering steps. 
In my heatmap example you cited (using R), you can in principle supply the tree to be used (instead of it defaulting to doing a hierarchical clustering). If you like the images from the R heatmap function, I would suggest you look at loading phylogenetic trees into R and passing them to the heatmap function. I have not tried this myself. > I know that probably I won't find modules doing exactly what I want, which > is why I'm searching for tools to do each step separately and try to glue > them somehow. I can understand the idea here, but in practice "gluing" the images together may not be trivial. You'll have to make sure that they are drawn using vector image formats (allowing you to scale the images to match), or if using bitmaps you'll need to be able to specify things pixel perfect. You will also need to hope that the tree is drawn with equal vertical spacing between leaves, otherwise it won't match the grid of the heatmap. That said, there are a lot of tree drawing packages out there, and this could work. > For the heatmap I have something already that will probably do > the job, but for the cladograms I couldn't find any decent module. > Do you happen to know any dark alley in BioPython or any other external > module that would allow me to do the cladogram? You could in principle use python and a package like reportlab to draw both the tree and the heatmap - but you'd end up writing a lot of your own code. For example, I have used python and reportlab to draw colourful PDF trees with aligned columns of data, e.g. Supplementary figures 1 and 2 from: http://dx.doi.org/10.1099/mic.0.2007/013672-0 The script that drew these trees is actually rather complicated (partly due to showing two sets of bootstrap values). If I recall correctly, it also used Thomas Mailund's Newick Tree module to parse the tree files, and not Biopython.
See http://www.daimi.au.dk/~mailund/newick.html Next time I need to draw a customised tree, I'll try to look at writing something more general purpose to go in Biopython under Bio.Graphics (using Bio.Nexus to load tree files). For now, I would suggest you explore the R heatmap function and its arguments (and perhaps call this via python if you need to - it would be simpler just to use R directly). Peter From rjalves at igc.gulbenkian.pt Fri Nov 28 06:54:19 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 28 Nov 2008 11:54:19 +0000 Subject: [BioPython] Cladograms In-Reply-To: <492F1688.7060505@student.otago.ac.nz> References: <492EBA12.2080408@igc.gulbenkian.pt> <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> <492EEBFC.4050201@igc.gulbenkian.pt> <492F1688.7060505@student.otago.ac.nz> Message-ID: <492FDBEB.20001@igc.gulbenkian.pt> @David and Giovanni Thank you both for your feedback. I guess I will give R a more decent try. In the meanwhile I guess I found myself a python small side project :) . I will be sure to contribute with the code if I get to any decent result. Renato Quoting David Winter on 11/27/2008 09:52 PM: > Renato Alves wrote: >> Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM: >>> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves >>> wrote: >>> >>>> Hi everyone, >>>> >>>> I've been searching the web for python modules to do cladograms but >>>> the only >>>> relevant stuff I found was relative to dendograms and hierarchical >>>> clustering which will give the representation I need. >>>> My goal is something that resembles a heatmap[1] >>>> >>> Let me premise that I am not able to help you :). >>> But this seems to be the kind of things that R does. Have you had a >>> look at it? >>> >> For the heatmap I did have a look and seems easy to use, for the >> cladograms I couldn't find much using the name with help.search(). 
> [snip] > > Hi renato, > > Have you looked into the R package ape (analysis of phylogentics and > evolution, install.packages("ape")) - it has object classes for phylip > and nexus trees. I don't know how easy it is to kludge things together > from different packages but it might be worth looking into? There is a > python module to integrate with R if you want to stay pure ;) > From rjalves at igc.gulbenkian.pt Fri Nov 28 12:03:48 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 28 Nov 2008 17:03:48 +0000 Subject: [BioPython] Cladograms In-Reply-To: <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com> References: <492EBA12.2080408@igc.gulbenkian.pt> <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com> Message-ID: <49302474.4090907@igc.gulbenkian.pt> Quoting Peter Cock on 11/28/2008 11:29 AM: > Hi Renato, > Hi Peter, >> My goal is something that resembles a heatmap[1] but where the trees will be >> cladograms[2] instead of the result of clustering steps. >> > In my heatmap example you cited (using R), you can in principle supply > the tree to be used (instead of it defaulting to doing a hierarchical > clustering). If you like the images from the R heatmap function, I > would suggest you look at loading phylogenetic trees into R and > passing them to the heatmap function. I have not tried this myself. > R's heatmap.2 is giving me what I need so far, it even reorders the columns/rows according to a pre-calculated dendrogram. The only thing that I don't like that much is how the dendrograms are plotted when multiple branches are at the same levels, but I can live with it :) >> I know that probably I won't find modules doing exactly what I want, which >> is why I'm searching for tools to do each step separately and try to glue >> them somehow. >> > I can understand the idea here, but in practice the "glueing" the > images together may not be trivial. 
You'll have to make sure that > they are drawn using vector image formats (allowing you to scale the > images to match), or if using bitmaps you'll need to be able to > specify things pixel perfect. You will also need to hope that the > tree is drawn with equal vertical spacing between leaves, otherwise it > won't match the grid of the heatmap. That said, there a lot of tree > drawing packages out there, and this could work. > Well I was thinking of a naive approach such as "glue by hand" (shame on me). But for a real thing I would probably use matplotlib. Although given my current knowledge on the library it would take a while... >> For the heatmap I have something already that will probably do >> the job, but for the cladograms I couldn't find any decent module. >> Do you happen to know any dark alley in BioPython or any other external >> module that would allow me to do the cladogram? >> > You could in principle use python and a package like reportlab to draw > both the tree and the heatmap - but you'd end up writing a lot of your > own code. For example, I have used python and reportlab to draw > colourful PDF trees with aligned columns of data, e.g. Supplementary > figures 1 and 2 from: http://dx.doi.org/10.1099/mic.0.2007/013672-0 > Never used reportlab directly, only via other tools that did all the job. But it's good to know it does a good job at what it does. The only time I messed with PDF libraries using python I ended up with pyPdf - http://pybrary.net/pyPdf/ - It was the only library that had the "limited" edit capabilities I needed. > The script that drew these trees is actually rather complicated > (partly due to showing two sets of bootstrap values). If I recall > correctly, it also used Thomas Mailund's Newick Tree module to parse > the tree files, and not Biopython. See > http://www.daimi.au.dk/~mailund/newick.html > That's also my problem when writing R code using rpy. Not very pythonic (mine at least), hard to read and reuse. 
Sometimes I end up writing code in the original language, dumping data to files and launching it with os.system/subprocess.call than using rpy. I hope this changes a bit with rpy2... > Next time I need to draw a customised tree, I'll try to look at > writing something more general purpose to go in Biopython under > Bio.Graphics (using Bio.Nexus to load tree files). > > For now, I would suggest you explore the R heatmap function and its > arguments (and perhaps call this via python if you need to - it would > be simpler just to use R directly). > I'm going straight to R for now. But I think this one should be simple and elegant to do in rpy. > Peter > Thanks a bunch for the nice tips and feedback. Renato From p.j.a.cock at googlemail.com Fri Nov 28 13:14:22 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Nov 2008 18:14:22 +0000 Subject: [BioPython] Cladograms In-Reply-To: <49302474.4090907@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com> <49302474.4090907@igc.gulbenkian.pt> Message-ID: <320fb6e00811281014j4d52da75r5dcc4d2152d5819c@mail.gmail.com> >> In my heatmap example you cited (using R), you can in principle supply >> the tree to be used (instead of it defaulting to doing a hierarchical >> clustering). If you like the images from the R heatmap function, I >> would suggest you look at loading phylogenetic trees into R and >> passing them to the heatmap function. I have not tried this myself. > > R's heatmap.2 is giving me what I need so far, it even reorders the > columns/rows according to a pre-calculated dendrogram. > The only thing that I don't like that much is how the dendrograms are > plotted when multiple branches are at the same levels, but I can live with > it :) I'm glad that worked out OK. 
If you haven't signed up to the rpy mailing list, I suggest you do so: https://lists.sourceforge.net/lists/listinfo/rpy-list Peter From pingou at pingoured.fr Sun Nov 30 04:41:13 2008 From: pingou at pingoured.fr (Pierre-Yves) Date: Sun, 30 Nov 2008 10:41:13 +0100 Subject: [BioPython] Cladograms In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt> References: <492EBA12.2080408@igc.gulbenkian.pt> <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com> <492EEBFC.4050201@igc.gulbenkian.pt> Message-ID: <49325FB9.2080005@pingoured.fr> Renato Alves wrote: > For the heatmap I did have a look and seems easy to use, for the > cladograms I couldn't find much using the name with help.search(). Also > google doesn't give the best results when searching for R . If you are looking for something related to R, I would recommend using http://www.rseek.org instead of our friend google. Rseek might give better results ;) Regards, Pierre PS. Sorry Renato, but I just realized that I forgot to send the mail to the list, and I thought other people might be interested too. From alexl at users.sourceforge.net Sun Nov 30 21:59:40 2008 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Sun, 30 Nov 2008 19:59:40 -0700 Subject: [BioPython] will BioSQL work with psycopg2? Message-ID: Hi there, I am maintaining the Fedora package for Biopython and we are doing a complete rebuild of all Python packages for Python 2.6. Currently I have a dependency on psycopg (version 1.1.21) but since that is so old psycopg won't rebuild against the new mx, meaning that I can't rebuild Biopython because the dependencies aren't there. So my question is, will the Biopython BioSQL work with the newer psycopg2 (currently version 2.0.8)? See: http://www.initd.org/pub/software/psycopg/ Does it require the 1.x API or will it work with 2.x? The BioSQL page: http://biopython.org/wiki/BioSQL isn't clear on this.
Thanks, Alex From lpritc at scri.ac.uk Mon Nov 3 11:04:00 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 03 Nov 2008 11:04:00 +0000 Subject: [BioPython] Sequence graph In-Reply-To: <490B3267.5020501@pingoured.fr> Message-ID: Hi Pierre-Yves On 31/10/2008 16:29, "Pierre-Yves" wrote: > I would like to reproduce such graph: > http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even > if bioperl is nice I would like to do it through BioPython. > > I have thus two questions : > * Is that possible ? > * Could someone point me to an example ? As far as I am aware there is not yet an equivalent of this graphical output in Biopython, though I agree that the facility would be nice ;) Robert Cadena and I are working on incorporating GenomeDiagram ( http://bioinf.scri.ac.uk/lp/programs.php) into Biopython. It works differently to the Perl code, though you could make images rendering the same information as in the BioPerl example you link to. Even though it's not yet part of Biopython, it does play nicely with Biopython, so you might like to try it out. If you are specifically looking to create a graphical representation of BLAST output, then I have a Python script that might be useful to you. Please get in touch if you'd like a copy. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 From pingou at pingoured.fr Mon Nov 3 11:07:23 2008 From: pingou at pingoured.fr (Pierre-Yves) Date: Mon, 03 Nov 2008 12:07:23 +0100 Subject: [BioPython] Sequence graph In-Reply-To: References: Message-ID: <490EDB6B.9080704@pingoured.fr> Hi Leighton, Leighton Pritchard wrote: > Robert Cadena and I are working on incorporating GenomeDiagram ( > http://bioinf.scri.ac.uk/lp/programs.php) into Biopython. It works > differently to the Perl code, though you could make images rendering the > same information as in the BioPerl example you link to. Even though it's > not yet part of Biopython, it does play nicely with Biopython, so you might > like to try it out. Thanks for the link I will have a look at it. > > If you are specifically looking to create a graphical representation of > BLAST output, then I have a Python script that might be useful to you. > Please get in touch if you'd like a copy.
It is not exactly a BLAST output, but it is close to one, so yes, I would actually be interested in your script. Thanks again, Best regards, P.Yves From biopython at maubp.freeserve.co.uk Mon Nov 3 11:30:30 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Nov 2008 11:30:30 +0000 Subject: [BioPython] ClustalW problem upwards Biopython 1.43 In-Reply-To: <003f01c93da2$89a5cab0$1022a8c0@ipkgatersleben.de> References: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com> <003f01c93da2$89a5cab0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00811030330t1f80d3d4v48c84cafbe9f9377@mail.gmail.com> On Mon, Nov 3, 2008 at 10:54 AM, Stefanie Lück wrote: > I'm sorry! I'm using CLUSTAL W 1.8. It's my mistake because I work on > several PC's ;-) Easily done - I'm not sure if it matters or not here. > I'll try the code you gave me. Thanks - from my testing the code in CVS should be fine (except on Python 2.3 for some filename combinations with spaces in them). It would be great to confirm this works for you. Peter From dalloliogm at gmail.com Mon Nov 3 15:37:06 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 3 Nov 2008 16:37:06 +0100 Subject: [BioPython] [Biopython-dev] Statistics in population genetics module - Part I In-Reply-To: <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com> References: <6d941f120810301658wec8678ald332abb8ddbdf80d@mail.gmail.com> <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com> Message-ID: <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com> On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão wrote: > Hi, > > Statistics is the most important part of population genetics modules. > In fact one could say that statistics were invented FOR population > genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
> When I started to work on the population genetics module I decided to > delay the statistics module a bit, in order to get experience with the > whole biopython project before committing to do the most important > thing. > Irrespective of whether it is possible to link scipy or not, now seems > to be the time to advance, especially considering that Giovanni is > interested in participating. > A few points need to be made before suggesting how to put > statistics in Bio.PopGen > > 1. Whatever design is put in, it should be reasonably future proof: in > a few releases it should not be a good idea to break older code. That > should be avoided as much as possible. For how long do you think a biopython module should be kept compatible with older versions, more or less? It will take a long time to develop the module, and we will certainly make some mistakes. So, what is the best way to proceed? What if we create a separate biopython branch where we can test all the new features? At the moment I am working with a separate git repository for all the popgen modules. The problem is that I didn't include all biopython modules in the repository, so, if any of my changes breaks something in biopython, I won't know it until I merge everything with the biopython code. On the other hand, if I include a biopython release in my popgen repository, I won't be able to track changes made in biopython, and my popgen code will be compatible with that version only. I think git provides some options to handle this kind of situation... I am not very used to cvs, so I don't know. p.s. When Python 3000 is released, it will probably be necessary to rewrite large portions of biopython, if not to create a 'biopython 2' version (I think they were discussing something like this on bioperl's list). I thought that maybe, even if we make some 'mistakes' in this version of biopython, we will be able to fix them in a later version. > > 2.
It goes without saying that the code should be useful to everybody > doing population genetics and not only the authors of Bio.PopGen: all > kinds of markers and population structures should be accommodatable in > the future. I think that a good idea would be to start collecting use cases to have an idea how many things we'll have to implement in this module. It would be useful to talk to the authors of similar modules in other Bio.* projects, to see if they have some good suggestions. I sent that mail to the Open::Bio::I last week, but still haven't received many replies... I will send a message to the various Bio.* mailing lists in the next few days. > > 3. For reasons that I've partially explained on the biopython list, I > don't think an OO model explicitly based on individuals or populations > is good (or even necessary) > 4. Any framework should be more pragmatic than anything else. I would > envision a typical use case like this > a) read data (from a certain data source) > b) Do some basic processing (changing individuals or populations, > converting markers) > c) calculate statistics > A few comments regarding each of these points: > a) data sources, file formats: file formats in population > genetics exist in large quantities and are essentially completely > ad-hoc, most made in a very naive way. Good or BAD, that is what there > is. The most used format (some kind of de facto standard, GenePop) can > only be used for frequency-based statistics, for all the rest things > are fragmented (although, if there is no population structure and the > data is sequences then standard sequence-based formats can be used - > but from my experience this is a small minority) > b) basic processing: This is the point where an OO model of > individuals and populations would pay, but I think it is not the "meat > of the issue" > c) statistics: there are statistics of every type and for every taste.
If > you want to have an idea of what is out there an interesting place to > look at is the arlequin3 manual: > http://cmpg.unibe.ch/software/arlequin3/arlequin31.pdf > (part of the manual is UI description, but especially starting at page > 89 - the table there is a good overview - there are descriptions of > the overall panorama). What if we create some very-generic objects, like: Population self._to_popgen_input -> represents population as an input to popgen ([Pop1, (Ind1, Ind2...)]) self._to_othertool_input -> represents population as an input to another tool Thanks for the link to the arlequin3 manual, it seems very informative. > > With time, and after at least 3 failed attempts to think in terms of > individuals/populations I started to crystallize around a model > centered on types of statistics. This model ends up actually having > implicit models of populations and individuals, and that is, in fact, > there. It is just implicit and not unified: different kinds of > statistics have different implicit models. > The model that I would like to propose, centered around statistics, > will be the subject of my next email (which I will send in the next > couple of days - still under design and lost sleep). I might split it > in 2 parts (concepts and suggestions for implementation).
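[Editor's sketch] The generic Population object floated in the reply above could look roughly like the following. This is only a minimal illustration under stated assumptions: the class layout, the method names to_popgen_input and to_table_rows, and the row format for the "other tool" are all hypothetical and are not part of Bio.PopGen; only the [Pop1, (Ind1, Ind2...)] shape comes from the message itself.

```python
class Population(object):
    """Hypothetical container: a population name plus its individuals."""

    def __init__(self, name, individuals):
        self.name = name
        self.individuals = list(individuals)

    def to_popgen_input(self):
        # Mirrors the [Pop1, (Ind1, Ind2...)] layout mentioned in the message.
        return [self.name, tuple(self.individuals)]

    def to_table_rows(self):
        # Assumed layout for some other tool: one (population, individual)
        # pair per individual. Purely illustrative.
        return [(self.name, individual) for individual in self.individuals]


pop = Population("Pop1", ["Ind1", "Ind2"])
print(pop.to_popgen_input())
print(pop.to_table_rows())
```

Keeping each output format behind its own method would let the same object feed different statistics code without committing to a single model of individuals, which seems to be the concern raised in the quoted design notes.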
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From bsouthey at gmail.com Mon Nov 3 20:50:14 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 03 Nov 2008 14:50:14 -0600 Subject: [BioPython] [Biopython-dev] Statistics in population genetics module - Part I In-Reply-To: <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com> References: <6d941f120810301658wec8678ald332abb8ddbdf80d@mail.gmail.com> <5aa3b3570811030736g7d7a0893x759777252c8d1828@mail.gmail.com> <5aa3b3570811030737x620ff5f3vd35cdca0d4373769@mail.gmail.com> Message-ID: <490F6406.5030800@gmail.com> Giovanni Marco Dall'Olio wrote: > On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão wrote: > > >> Hi, >> >> Statistics is the most important part of population genetics modules. >> In fact one could say that statistics were invented FOR population >> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ). >> When I started to work on the population genetics module I decided to >> delay the statistics module a bit, in order to get experience with the >> whole biopython project before committing to do the most important >> thing. >> Irrespective of whether it is possible to link scipy or not, now seems >> to be the time to advance, especially considering that Giovanni is >> interested in participating. >> A few points need to be made before suggesting how to put >> statistics in Bio.PopGen >> >> 1. Whatever design is put in, it should be reasonably future proof: in >> a few releases it should not be a good idea to break older code. That >> should be avoided as much as possible. >> > > > For how long do you think a biopython module should be kept compatible > with older versions, more or less?
> It will take a long time to develop the module, and it is sure that we will > make some mistakes. So, what is the best way to proceed? What if we create a > separated biopython branch where we can test all the new features? > At the moment I am working with a separated git repository for all the > popgen modules. The problem is that I didn't include all biopython modules > in the repository, so, if any of my changes breaks something in biopython, I > won't know it until I'll merge everything with biopython code. > On the other side, if I include a biopython release in my popgen repository, > I won't be able to track changes made in biopython, and my popgen code will > be compatible with that version only. > I think git provides some options to handle this kind of situations... I am > not very used to cvs, so I don't know. > If you have modified a Biopython module you probably see if it is acceptable to change the main Biopython distribution especially if it involves an API change or modify your code because I do not think it is good idea to have different versions of the same Biopython module or any name clashes with Biopython. Otherwise, you just need to check that it runs with a very recent version of Biopython (and under the Biopython supported Python versions). If you have not done so, I would suggest developing unit tests that not only ensure code accuracy but also maintain future compatibility. A failed test will indicate some problem that needs resolving and the solution will mean that the code will be made compatible if necessary. > p.s. When python3000 will be released, it will be probably necessary to > rewrite large portions of biopython, if not creating a 'biopython 2' version > (I think they were discussing something like this in bioperl's list). > I thought that maybe, even if we make some 'mistakes' in this version of > biopython, we will be able to fix them in a later version. 
> Python 3 can not be discussed until all incompatible modules like numpy or Biopython can be used under Python 3 (rc1 is available). Further, the advice from above (see Guido's blog http://www.artima.com/weblogs/viewpost.jsp?thread=227041) is that the conversion should be a direct port without any changes especially API ones. So correcting any major 'mistakes' in the existing module probably will not be acceptable to the community. Further any correction at any time to the main distribution is not trivial especially as you must first get the users informed (I saw that with changing histogram in numpy). There is a lot of flexibility in a separate project that you will lose when a project is widely released or included in an well established project like Biopython. I think that you should maintain a separate project of some type until everything is sufficiently acceptable to the Biopython community. This gives sufficient time to address various concerns and enables an easy integration. Finally, if you require additional dependencies than those currently required by Biopython (especially something like scipy) then I think it will be very hard or impossible for you to get any code associated with these dependencies into Biopython. Just my opinions on your questions, Bruce From dalloliogm at gmail.com Tue Nov 4 10:58:41 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 4 Nov 2008 11:58:41 +0100 Subject: [BioPython] Proposal: doctest for biopython Message-ID: <5aa3b3570811040258h57803b7fy9a5f8b32e6f6982e@mail.gmail.com> Hi!! I would like to propose to use doctest tests in biopython. I found them very useful to understand how a script should be used, and moreover they can act as test units. I have just posted a patch file to that adds doctest documentation to Bio/SeqRecordIO: - http://bugzilla.open-bio.org/show_bug.cgi?id=2640 What do you think of it? 
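A minimal sketch of what a doctest looks like in practice, for readers who have not used the module. The `gc_fraction` function is a made-up illustration (an assumption, not taken from the patch in bug 2640 or from Biopython itself); the examples in its docstring double as documentation and as tests.

```python
import doctest


def gc_fraction(sequence):
    """Return the fraction of G and C letters in a sequence string.

    Hypothetical helper, used only to illustrate doctest.

    >>> gc_fraction("GGCC")
    1.0
    >>> gc_fraction("ATGC")
    0.5
    >>> gc_fraction("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / float(len(upper))


def _test():
    """Run every example found in this module's docstrings."""
    return doctest.testmod()


if __name__ == "__main__":
    _test()
```

Running the file directly executes the docstring examples and reports only the ones whose output differs from what is written; a silent run means they all passed.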
Here is the main documentation for doctest:
- http://www.python.org/doc/2.5.2/lib/module-doctest.html

Usually, you add a _test() function to every module, which calls the
doctest library, and launch it when __name__ == '__main__'.
The most significant example is added to the documentation string of
every module/function, and tested with doctest.testmod(); later, you add
more tests in a separate file, and launch them with doctest.testfile().

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopythonlist at gmail.com  Fri Nov  7 16:26:03 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Fri, 7 Nov 2008 17:26:03 +0100
Subject: [BioPython] Parsing ACE files
Message-ID: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>

I don't know why but I cannot search in the mailing list using the search
link (http://search.open-bio.org/).

I've seen in the documentation that the Bio.SeqIO can read ace files and
uses Bio.Sequencing.Ace.

After reading module Bio.SeqIO.AceIO it remains unclear to me how to use
it. Could anybody tell me how to parse ACE files? Is there a tutorial or
example to look at?

Thank you very much!

From biopython at maubp.freeserve.co.uk  Fri Nov  7 16:33:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 Nov 2008 16:33:39 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
Message-ID: <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>

On Fri, Nov 7, 2008 at 4:26 PM, dr goettel wrote:
> I don't know why but I cannot search in the mailing list using the search
> link (http://search.open-bio.org/).

That's odd - it used to work OK...

> I've seen in the documentation that the Bio.SeqIO can read ace files and
> uses Bio.Sequencing.Ace.

Yes.
Depending on what information you want from the ACE files, you might be better off using Bio.Sequencing.Ace directly. Using Bio.SeqIO may not expose all the details you want (I'd have to check the details - its not fresh in my mind). > After reading module Bio.SeqIO.AceIO it remains unclear for me how to use > it. Could anybody tell me how to parse ACE files? is there a tutorial or > example to look at? You would typically use the Bio.SeqIO.parse() function (which will call Bio.SeqIO.AceIO internally). See Chapter 4 of the tutorial on Bio.SeqIO, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Or the Bio.SeqIO wiki page, http://biopython.org/wiki/SeqIO Peter From biopythonlist at gmail.com Fri Nov 7 17:02:30 2008 From: biopythonlist at gmail.com (dr goettel) Date: Fri, 7 Nov 2008 18:02:30 +0100 Subject: [BioPython] Parsing ACE files In-Reply-To: <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com> References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com> <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com> Message-ID: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com> Thank you! > > Yes. Depending on what information you want from the ACE files, you > might be better off using Bio.Sequencing.Ace directly. Any example or tutorial for this solution? > Using Bio.SeqIO may not expose all the details you want (I'd have to check > the details - its not fresh in my mind). > You are right, it's not all the information I need. Peter > Cheers! 
From biopython at maubp.freeserve.co.uk  Fri Nov  7 17:17:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 Nov 2008 17:17:01 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
	<320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
	<9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID: <320fb6e00811070917h74843b41tae82e53180ba080d@mail.gmail.com>

On Fri, Nov 7, 2008 at 5:02 PM, dr goettel wrote:
> Thank you!
>
>> Yes. Depending on what information you want from the ACE files, you
>> might be better off using Bio.Sequencing.Ace directly.
>
> Any example or tutorial for this solution?
>

The bad news is I don't think this is covered in the Biopython
Tutorial. However, there are some quite detailed built-in docstrings.
From within python, you can access the documentation via the python
help function:

>>> from Bio.Sequencing import Ace
>>> help(Ace)
...

These are also available online on our API pages (for the current release):
http://biopython.org/DIST/docs/api/
http://biopython.org/DIST/docs/api/Bio.Sequencing.Ace-module.html

However, you'll see this is quite a low-level parser and it helps to
know what the two-letter line types mean (consult the ACE
documentation).

>> Using Bio.SeqIO may not expose all the details you want (I'd have to check
>> the details - its not fresh in my mind).
>
> You are right, it's not all the information I need.

I wrote the Bio.SeqIO wrapper for the ACE parser, so it might be
possible to extend this to capture more information. What in
particular do you want to extract?

> Cheers!

Sure! By the way - make sure you are using Biopython 1.48 or later, as
Bio.Sequencing.Ace was switched to a more modern python iterator style
then.
Peter

From sbassi at gmail.com  Sat Nov  8 20:35:03 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat, 8 Nov 2008 18:35:03 -0200
Subject: [BioPython] Parsing ACE files
In-Reply-To: <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
	<320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
	<9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID:

On Fri, Nov 7, 2008 at 3:02 PM, dr goettel wrote:
>> Yes. Depending on what information you want from the ACE files, you
>> might be better off using Bio.Sequencing.Ace directly.
> Any example or tutorial for this solution?

Here is something I wrote some time back; I hope it still works:

    from Bio.Sequencing import Ace
    aceparser = Ace.ACEParser()
    fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
    acefilerecord = aceparser.parse(open(fn))
    # For each contig:
    for ctg in acefilerecord.contigs:
        print '=========================================='
        print 'Contig name: %s'%ctg.name
        print 'Bases: %s'%ctg.nbases
        print 'Reads: %s'%ctg.nreads
        print 'Segments: %s'%ctg.nsegments
        print 'Sequence: %s'%ctg.sequence
        print 'Quality: %s'%ctg.quality
        # For each read in contig:
        for read in ctg.reads:
            print 'Read name: %s'%read.rd.name
            print 'Align start: %s'%read.qa.align_clipping_start
            print 'Align end: %s'%read.qa.align_clipping_end
            print 'Read sequence: %s'%read.rd.sequence
    print '=========================================='

From rodrigo_faccioli at uol.com.br  Sun Nov  9 14:18:16 2008
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Sun, 9 Nov 2008 12:18:16 -0200
Subject: [BioPython] PDB file - Validation and WebService
Message-ID: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com>

Hello,

I am a very new BioPython member and I have heard only good news about
the BioPython project.

So, I have two questions:

1. Does the module Bio.PDB check a PDB file the way
http://deposit.rcsb.org/cgi-bin/validate/adit-session-driver does?
If not, are there others possibilities in software ? 2. I read about PDB webservice from pdb website. The BioPython project is there supports for it? Because I read in http://www.rcsb.org/robohelp/webservices/summary.htm and there is a option with Python. Thanks for any support. -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From biopython at maubp.freeserve.co.uk Sun Nov 9 15:16:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 9 Nov 2008 15:16:50 +0000 Subject: [BioPython] Biopython 1.49 beta released Message-ID: <320fb6e00811090716v58637d55o470246df4175464e@mail.gmail.com> Dear Biopythoneers, We are pleased to announce a beta release of Biopython 1.49. There are been some significant changes since Biopython 1.48 was released two months ago, which is why we are initially releasing a beta for wider testing. As previously announced, the big news is that Biopython now uses NumPy rather than its precursor Numeric (the original Numerical Python library). As in the previous releases, Biopython 1.49 beta supports Python 2.3, 2.4 and 2.5 but should now also work fine on Python 2.6. Please note that we intend to drop support for Python 2.3 in a couple of releases time. We also have some new functionality, starting with the basic sequence object (the Seq class) which now has more methods. This encourages a more object orientated coding style, and makes basic biological operations like transcription and translation more accessible and discoverable. Our BioSQL interface can now optionally fetch the NCBI taxonomy on demand when loading sequences (via Bio.Entrez) allowing you to populate the taxon/taxon_name tables gradually. 
Also, BioSQL should now work with the psycopg2 driver for PostgreSQL (as well as the older psycopg driver). Finally, our old parsing infrastructure (Martel and Bio.Mindy) is now considered to be deprecated, meaning mxTextTools is no longer required to use Biopython. This should not affect any of the typically used parsers (e.g. Bio.SeqIO and Bio.AlignIO). So, if you are feeling brave and know the risks, please try out Biopython 1.49 beta, and let us know on the mailing lists if it works, or more importantly if something doesn't. We'd also like feedback on the updated Biopython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Source distributions and Windows installers are available from the Biopython website: http://biopython.org/wiki/Download Thanks! -Peter on behalf of the Biopython developers P.S. Those of you subscribed to our news feed would have seen this announcement already. For RSS links etc, see: http://biopython.org/wiki/News From biopython at maubp.freeserve.co.uk Sun Nov 9 15:26:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 9 Nov 2008 15:26:02 +0000 Subject: [BioPython] PDB file - Validation and WebService In-Reply-To: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com> References: <3715adb70811090618i3ad46099y7fd1ff55be28ac23@mail.gmail.com> Message-ID: <320fb6e00811090726k2bef0c78t99c8909d781fb12@mail.gmail.com> On Sun, Nov 9, 2008 at 2:18 PM, Rodrigo faccioli wrote: > Hello, > > I am a very new BioPython member and I have listened only good news about > BioPython project. > > So, I have two doubts: > > > 1. The module Bio.PDB checks a PDB file like > http://deposit.rcsb.org/cgi-bin/validate/adit-session-driver . If not, > are there others possibilities in software ? Bio.PDB can do some validation (it has an optional strict mode for parsing). I don't know if this checks the same things as ADIT. Have you looked at downloading ADIT itself? 
http://sw-tools.pdb.org/apps/ADIT/index.html > 2. I read about PDB webservice from pdb website. The BioPython project is > there supports for it? Because I read in > http://www.rcsb.org/robohelp/webservices/summary.htm and there is a > option with Python. I've not looked at that before - all it seems to be at the moment is a way to run BLAST against the PDB database. Assuming their XML BLAST output is compatible with the NCBI's you should be able to use the Bio.Blast.NCBIXML parser on the results. I just tried their python example on Linux with Python 2.4.3 but it failed - perhaps my version of SOAP is out of date? Note they have a stray semi colon at the end of the call which shouldn't be there. See http://www.rcsb.org/robohelp/webservices/samples/python_samples.htm On the other hand, you could just use the NCBI's qblast (via Biopython if you like) to run an online BLAST search against the PDB database. So I don't see what benefit using the PDB's server offers - unless they plan additional functionality in future. Peter From dalloliogm at gmail.com Mon Nov 10 11:04:10 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 10 Nov 2008 12:04:10 +0100 Subject: [BioPython] annotations in an Alignment object Message-ID: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com> Is there any way to store some annotations in an Alignment object?? For example: the alignment tool used, its parameters, its version, the date, and the nature of the sequence aligned. I am asking this because I would like to write a module to create ldhat input files from an alignment program. A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html) is very similar to a fasta file; the only difference is that in its first line, it contains three numbers, one of which can't always be inferred by the data. I have looked at Bio.Align.Generic's code, but I am not sure. 
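A minimal sketch of the file layout described above: a header line of three numbers followed by ordinary FASTA records. It uses plain (name, sequence) tuples rather than a Biopython alignment object, and it assumes that the first two numbers are the sequence count and the common sequence length; the third number is left to the caller because it cannot be inferred from the data. The function name `format_ldhat` and its parameters are made up for illustration.

```python
def format_ldhat(records, third_number=1, line_width=60):
    """Return LDhat-style text: a three-number header plus FASTA records.

    `records` is a list of (name, sequence) tuples of equal length.
    `third_number` is written as given; it is not derived from the data.
    """
    if not records:
        raise ValueError("at least one sequence is required")
    length = len(records[0][1])
    for name, seq in records:
        if len(seq) != length:
            raise ValueError("sequence %r has a different length" % name)
    # Header: number of sequences, sequence length, caller-supplied value.
    lines = ["%i %i %i" % (len(records), length, third_number)]
    for name, seq in records:
        lines.append(">%s" % name)
        # Wrap long sequences, as FASTA writers usually do.
        for start in range(0, length, line_width):
            lines.append(seq[start:start + line_width])
    return "\n".join(lines) + "\n"


example = [
    ("SampleA", "TCCGC??RTT"),
    ("SampleB", "TACGC??GTA"),
]

if __name__ == "__main__":
    print(format_ldhat(example))
```

Keeping the header separate from the FASTA body means the records themselves could equally come from any existing FASTA writer; only the first line is specific to this format.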
--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Mon Nov 10 11:15:52 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Nov 2008 11:15:52 +0000
Subject: [BioPython] Parsing ACE files
In-Reply-To:
References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com>
	<320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com>
	<9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com>
Message-ID: <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com>

> Here is something I wrote some time back I hope it still works:
>
> from Bio.Sequencing import Ace
> aceparser = Ace.ACEParser()
> fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
> acefilerecord = aceparser.parse(open(fn))
> # For each contig:
> for ctg in acefilerecord.contigs:
> ....

I guess I'm the bearer of bad news - the ACEParser object (with its
iterator method) was deprecated in Biopython 1.48, in favour of the
simple function calls read and parse (the DEPRECATED file didn't
mention this, an oversight I've just rectified). Your code needs a
small update:

    from Bio.Sequencing import Ace
    fn = '/mnt/hda2/bio/836CLEAN-100.fasta.cap.ace'
    acefilerecord=Ace.read(open(fn))
    # For each contig:
    for ctg in acefilerecord.contigs:
        print '=========================================='
        print 'Contig name: %s'%ctg.name
        print 'Bases: %s'%ctg.nbases
        print 'Reads: %s'%ctg.nreads
        print 'Segments: %s'%ctg.nsegments
        print 'Sequence: %s'%ctg.sequence
        print 'Quality: %s'%ctg.quality
        # For each read in contig:
        for read in ctg.reads:
            print 'Read name: %s'%read.rd.name
            print 'Align start: %s'%read.qa.align_clipping_start
            print 'Align end: %s'%read.qa.align_clipping_end
            print 'Read sequence: %s'%read.rd.sequence
    print '=========================================='

If you try the old code on Biopython 1.48 or 1.49b you should get a
deprecation warning suggesting this change.
Or, you can use Ace.parse(open(fn)) to iterate over the contigs directly (assuming you don't care about the WA, CT, RT and WR tags which may be at the end of the file). Peter From biopython at maubp.freeserve.co.uk Mon Nov 10 11:28:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Nov 2008 11:28:00 +0000 Subject: [BioPython] annotations in an Alignment object In-Reply-To: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com> References: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com> Message-ID: <320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com> On Mon, Nov 10, 2008 at 11:04 AM, Giovanni Marco Dall'Olio wrote: > Is there any way to store some annotations in an Alignment object?? > For example: the alignment tool used, its parameters, its version, the > date, and the nature of the sequence aligned. Not officially, no. This is on my mental list of things to do with the alignment object (after Biopython 1.49 is done). I've CC'd the dev-mailing list which is probably a better place to discuss the details. If you look at Bio/AlignIO/StockholmIO.py or the Bio/AlignIO/FastaIO.py code you'll see I've recorded this kind of information in a private dictionary, i.e. alignment._annotations. This makes the data available if anyone really needs it, but signals that this is not part of the public API and is likely to change. As part of an alignment annotation enhancement, we should try and establish some agreed standards for naming annotation entries (and also counting systems). > I am asking this because I would like to write a module to create > ldhat input files from an alignment program. > A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html) > is very similar to a fasta file; the only difference is that in its > first line, it contains three numbers, one of which can't always be > inferred by the data. Why go to the trouble of making a new Bio.AlignIO module? 
For this example from the LDhat manual, it looks like a FASTA file with an
extra header:

4 10 1
>SampleA
TCCGC??RTT
>SampleB
TACGC??GTA
>SampleC
TC?-CTTGTA
>SampleD
TCC-CTTGTT

Rather than writing support for a whole new file format, wouldn't it
be easier to do something like this:

    alignment = ...
    number_a = 4
    number_b = 10
    number_c = 1

    handle = open("example.txt","w")
    handle.write("%i %i %i\n" % (number_a, number_b, number_c))
    handle.write(alignment.format("fasta"))
    handle.close()

Peter

From dalloliogm at gmail.com  Mon Nov 10 11:42:31 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 10 Nov 2008 12:42:31 +0100
Subject: [BioPython] annotations in an Alignment object
In-Reply-To: <320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com>
References: <5aa3b3570811100304o4655fe60o4ecabf41e054c211@mail.gmail.com>
	<320fb6e00811100328j1a565c36t7f3522344e7c95c0@mail.gmail.com>
Message-ID: <5aa3b3570811100342t7c23c0fl2b101be3fd352159@mail.gmail.com>

On Mon, Nov 10, 2008 at 12:28 PM, Peter wrote:
> On Mon, Nov 10, 2008 at 11:04 AM, Giovanni Marco Dall'Olio
> wrote:
>> Is there any way to store some annotations in an Alignment object??
>> For example: the alignment tool used, its parameters, its version, the
>> date, and the nature of the sequence aligned.
>
> Not officially, no. This is on my mental list of things to do with
> the alignment object (after Biopython 1.49 is done). I've CC'd the
> dev-mailing list which is probably a better place to discuss the
> details.
>
> If you look at Bio/AlignIO/StockholmIO.py or the
> Bio/AlignIO/FastaIO.py code you'll see I've recorded this kind of
> information in a private dictionary, i.e. alignment._annotations.
> This makes the data available if anyone really needs it, but signals
> that this is not part of the public API and is likely to change.
> > As part of an alignment annotation enhancement, we should try and > establish some agreed standards for naming annotation entries (and > also counting systems). ok... I will use the private dictionary for my own implementation. Unfortunately I don't have any useful suggestion for this.. >> I am asking this because I would like to write a module to create >> ldhat input files from an alignment program. >> A ldhat file (http://www.stats.ox.ac.uk/~mcvean/LDhat/instructions.html) >> is very similar to a fasta file; the only difference is that in its >> first line, it contains three numbers, one of which can't always be >> inferred by the data. > > Why go to the trouble of making a new Bio.AlignIO module? For this > example from the LDhat manual, it looks like a FASTA file with an > extra header: Yeah.. of course :) Let's say I am simply playing with biopython's code, to better understand it. Since I am going to use this function many times, I will have to write a module for it any way. The first number in the ldhat file is the number of sequences, the second is their length, and the third should be usually one in an alignment object, I suppose. > > 4 10 1 >>SampleA > TCCGC??RTT >>SampleB > TACGC??GTA >>SampleC > TC?-CTTGTA >>SampleD > TCC-CTTGTT > > Rather than writing support for a whole new file format, wouldn't it > be easier to do something like this: > > alignment = ... > number_a = 4 > number_b = 10 > number_c = 1 > > handle = open("example.txt","w") > handle.write("%i %i %i\n" % (number_a, number_b, number_c)) > handle.write(alignment.format("fasta")) > handle.close() > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From paul at rudin.co.uk Mon Nov 10 16:59:27 2008 From: paul at rudin.co.uk (Paul Rudin) Date: Mon, 10 Nov 2008 16:59:27 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines Message-ID: <87myg75xg0.fsf@rudin.co.uk> I'm looking at the biopython KDTree class. 
It requires arrays with dtype=="float32". I can make such arrays, but on 64 bit machines it's more natural (and the default for numpy float arrays) to have "float64". From biopython at maubp.freeserve.co.uk Mon Nov 10 19:00:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Nov 2008 19:00:14 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines In-Reply-To: <87myg75xg0.fsf@rudin.co.uk> References: <87myg75xg0.fsf@rudin.co.uk> Message-ID: <320fb6e00811101100k4a5eee48w98e3c993c23d4bf9@mail.gmail.com> On Mon, Nov 10, 2008 at 4:59 PM, Paul Rudin wrote: > > I'm looking at the biopython KDTree class. It requires arrays with > dtype=="float32". I can make such arrays, but on 64 bit machines it's > more natural (and the default for numpy float arrays) to have "float64". > You're looking at the code for Biopython 1.49b or CVS right? i.e. Bio/KDTree/KDTree.py CVS revision 1.10 or 1.11 [For Biopython 1.48 and older we used Numeric, which just had "f" as the type.] Hopefully Michiel can explain if there was a particular reason for choosing "float32", but on the face of it following the numpy default would seem sensible. Would you like to file a bug on this issue? Peter From paul at rudin.co.uk Mon Nov 10 19:19:46 2008 From: paul at rudin.co.uk (Paul Rudin) Date: Mon, 10 Nov 2008 19:19:46 +0000 Subject: [BioPython] Bio.KDTree on 64 bit machines References: <87myg75xg0.fsf@rudin.co.uk> <320fb6e00811101100k4a5eee48w98e3c993c23d4bf9@mail.gmail.com> Message-ID: <874p2f5qy5.fsf@rudin.co.uk> Peter writes: > On Mon, Nov 10, 2008 at 4:59 PM, Paul Rudin wrote: >> >> I'm looking at the biopython KDTree class. It requires arrays with >> dtype=="float32". I can make such arrays, but on 64 bit machines it's >> more natural (and the default for numpy float arrays) to have "float64". >> > > You're looking at the code for Biopython 1.49b or CVS right? > i.e. 
Bio/KDTree/KDTree.py CVS revision 1.10 or 1.11 Yes - I installed with: "sudo easy_install -f http://biopython.org/DIST/biopython", which seems to be 1.49 beta at the time of writing. > > [For Biopython 1.48 and older we used Numeric, which just had "f" as the type.] > > Hopefully Michiel can explain if there was a particular reason for > choosing "float32", but on the face of it following the numpy default > would seem sensible. > Would you like to file a bug on this issue? OK, I will. From sbassi at gmail.com Tue Nov 11 11:21:50 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 11 Nov 2008 09:21:50 -0200 Subject: [BioPython] Parsing ACE files In-Reply-To: <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com> References: <9b15d9f30811070826w732a0e0m8305cb440ceba605@mail.gmail.com> <320fb6e00811070833h360845d5p74538abeb049ab6c@mail.gmail.com> <9b15d9f30811070902j672e68c4uddc433f87cdf0853@mail.gmail.com> <320fb6e00811100315s654e49d8i6ac208b033d4f024@mail.gmail.com> Message-ID: On Mon, Nov 10, 2008 at 9:15 AM, Peter wrote: > I guess I'm the bearer of bad news - the ACEParser object (with its > iterator method) was deprecated in Biopython 1.48, in favour of a Not at all, it is good to know that I was doing something wrong (maybe not wrong now but sure it was going to be an issue later). > simple function calls read and parse (the DEPRECATED file didn't > mention this, an oversight I've just rectified). Your code needs a At least it was worth to correct that file. Thank you for the correction to my code. Best, -- Vendo isla: http://www.genesdigitales.com/isla/ Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 "It is pitch black. You are likely to be eaten by a grue." 
-- Zork

From cy at cymon.org  Tue Nov 11 12:39:05 2008
From: cy at cymon.org (Cymon Cox)
Date: Tue, 11 Nov 2008 12:39:05 +0000
Subject: [BioPython] Cannot __add__ two DBSeq objects
Message-ID: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>

Hi All,

Two DBSeq objects cannot be concatenated, although the DBSeq object inherits
__add__ from Seq. It tries to init a new DBSeq object rather than returning
a Seq object as would be expected.

>>> s1
DBSeq('CTAAGCCATTCTACGACGTAGAATGAGCGTGTCACTGTATTTACGTCTCTTTCG...GGT', DNAAlphabet())
>>> s2
DBSeq('ACTCAAGGGTGAAGTATTTCCAATCGAAATAGGTGCTTCTATACCGGAAATAAT...CAT', DNAAlphabet())
>>> s1 + s2
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.5/site-packages/Bio/Seq.py", line 144, in __add__
    return self.__class__(str(self) + str(other), a)
TypeError: __init__() takes exactly 6 arguments (3 given)

Presumably, DBSeq needs to override Seq.__add__
(Using CVS as of yesterday...)

Cheers, C.
--

From biopython at maubp.freeserve.co.uk  Tue Nov 11 13:02:18 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Nov 2008 13:02:18 +0000
Subject: [BioPython] Cannot __add__ two DBSeq objects
In-Reply-To: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>
References: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com>
Message-ID: <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com>

On Tue, Nov 11, 2008 at 12:39 PM, Cymon Cox wrote:
> Hi All,
>
> Two DBSeq objects cannot be concatenated, although the DBSeq object inherits
> __add__ from Seq.

Interesting point - not something I'd considered (nor anyone else until now!)

> It tries to init a new DBSeq object rather than returning a Seq object as would be expected.
> ...
> Presumably, DBSeq needs to override Seq.__add__
> (Using CVS as of yesterday...)

Clearly we can't create a new DBSeq object (there wouldn't be any
suitable sequence in the database to point to), and returning a Seq
object is sensible.
We should probably continue this discussion on the dev mailing list (CC'd). Either we have the DBSeq override the __add__ method (and __radd__), or we could make the base Seq class always use new Seq objects in __add__ etc. This would affect anyone writing their own Seq subclass... On balance, I think you're right and its DBSeq which needs to be changed. Would you like to tackle this, or should I? We'd also want to extend the BioSQL unit test to cover adding DBSeq+DBSeq, DBSeq+Seq, Seq+DBSeq, DBSeq+MutableSeq, MutableSeq+DBSeq, etc. Peter From biopython at maubp.freeserve.co.uk Tue Nov 11 14:53:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Nov 2008 14:53:32 +0000 Subject: [BioPython] Cannot __add__ two DBSeq objects In-Reply-To: <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com> References: <7265d4f0811110439h6c18e111te97d23070565cca2@mail.gmail.com> <320fb6e00811110502y624cf6c1r52c316d61a1f7228@mail.gmail.com> Message-ID: <320fb6e00811110653u63e85bc6k572d5fa42ede8280@mail.gmail.com> On Tue, Nov 11, 2008 at 1:02 PM, Peter wrote: > On Tue, Nov 11, 2008 at 12:39 PM, Cymon Cox wrote: >> Hi All, >> >> Two DBSeq objects cannot be concatenated, although the DBSeq object inherits >> __add__ from Seq. > > Interesting point - not something I'd considered (nor anyone else until now!) > >> It tries to init a new DBSeq object rather than returning a Seq object as would be expected. >> ... >> Presumably, DBSeq needs to overide Seq.__add__ >> (Using CVS as of yesterday...) > > Clearly we can't create a new DBSeq object (there wouldn't be any > suitable sequence in the database to point to), and returning a Seq > object is sensible. We should probably continue this discussion on > the dev mailing list (CC'd). Fixed in CVS by implementing the __add__ and __radd__ methods in the DBSeq object, and having these simply off load the work to the Seq class. 
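The approach described in this thread can be sketched with toy classes. These are stand-ins written for illustration only (an assumption, not Biopython's Seq or DBSeq and not the code committed to CVS): the subclass constructor takes extra arguments, so the inherited `self.__class__(...)` call would fail, and the subclass instead hands the work to a plain `Seq`.

```python
class Seq(object):
    """Toy stand-in for a sequence class (not Biopython's Seq)."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def __add__(self, other):
        # Builds the result with the class of the left operand, which is
        # what breaks when a subclass changes the __init__ signature.
        return self.__class__(str(self) + str(other))

    def __radd__(self, other):
        return self.__class__(str(other) + str(self))


class DBSeq(Seq):
    """Toy database-backed sequence whose constructor needs extra arguments."""

    def __init__(self, primary_id, adaptor, data):
        Seq.__init__(self, data)
        self.primary_id = primary_id
        self.adaptor = adaptor

    def toseq(self):
        """Return an in-memory Seq copy of this sequence."""
        return Seq(str(self))

    def __add__(self, other):
        # Off-load to the plain Seq class so the result is a Seq.
        return self.toseq() + other

    def __radd__(self, other):
        return other + self.toseq()


if __name__ == "__main__":
    s1 = DBSeq(1, None, "CTAAG")
    s2 = DBSeq(2, None, "ACTCA")
    print(type(s1 + s2).__name__)
    print(str(s1 + s2))
```

Defining `__radd__` on the subclass matters as well: when the right operand is an instance of a subclass that overrides the reflected method, Python tries that method first, so `Seq + DBSeq` also ends up as a plain `Seq`.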
See: BioSQL/BioSeq.py revision: 1.28 Tests/test_BioSQL.py revision: 1.26 Tests/output/test_BioSQL revision: 1.2 Peter From dalloliogm at gmail.com Wed Nov 12 16:25:47 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 12 Nov 2008 17:25:47 +0100 Subject: [BioPython] a sequence set object in biopython? Message-ID: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> Hi, I think it could be useful to add a generic SequenceSet object in biopython. Such an object would represent a generic set of sequences, and could have some useful methods like .format('fasta') or .align('alignment_tool'). Is there something similar available already? I have noticed that the actual Generic.Alignment is very similar to such an object. However, it would be better to be able to work with a separated class, because sometimes you want to deal with sequences that are not aligned. Some use cases: - a set of sequences that represents all introns in a particular gene, on which I want to calculate the conservation of the splicing regulatory sites. - all genes sequences in an organisms, which I want to convert in EMBL format - a set of seqs to be aligned or used as input for other tools etc.. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Nov 12 17:53:35 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Nov 2008 17:53:35 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> Message-ID: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio wrote: > Hi, > I think it could be useful to add a generic SequenceSet object in biopython. 
> Such an object would represent a generic set of sequences, and could > have some useful methods like .format('fasta') or > .align('alignment_tool'). > Is there something similar available already? Given your example to turn the SequenceSet into a FASTA file, then clearly you are thinking of a collection of SeqRecord objects rather than just Seq objects. For this kind of thing I personally just use a list of SeqRecord objects. If I want to turn a list of SeqRecord objects into a FASTA file, I can pass the list to the Bio.SeqIO.write() function. Once I've made a FASTA file, I can call an external tool to align them - and then load them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan to do next. > I have noticed that the actual Generic.Alignment is very similar to > such an object. However, it would be better to be able to work with a > separated class, because sometimes you want to deal with sequences > that are not aligned. Yes, the generic alignment is basically a list of SeqRecord objects plus some extra functionality like column access. > Some use cases: > - a set of sequences that represents all introns in a particular gene, > on which I want to calculate the conservation of the splicing > regulatory sites. > - all genes sequences in an organisms, which I want to convert in EMBL format > - a set of seqs to be aligned or used as input for other tools > etc.. All sensible use cases - but all seem to be covered by a simple python list of SeqRecord objects, or in some cases a list of Seq objects (e.g. the introns example, as I doube the introns have names). Peter From biopython at maubp.freeserve.co.uk Wed Nov 12 18:06:19 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 Nov 2008 18:06:19 +0000 Subject: [BioPython] a sequence set object in biopython? 
In-Reply-To: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
Message-ID: <320fb6e00811121006mbe32efar2fca638d1a5fe2ef@mail.gmail.com>

On Wed, Nov 12, 2008 at 5:53 PM, Peter wrote:
> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio
> wrote:
>> Hi,
>> I think it could be useful to add a generic SequenceSet object in biopython.
>> Such an object would represent a generic set of sequences, and could
>> have some useful methods like .format('fasta') or
>> .align('alignment_tool').
>> Is there something similar available already?
>
> Given your example to turn the SequenceSet into a FASTA file, then
> clearly you are thinking of a collection of SeqRecord objects rather
> than just Seq objects. For this kind of thing I personally just use a
> list of SeqRecord objects.
>
> If I want to turn a list of SeqRecord objects into a FASTA file, I can
> pass the list to the Bio.SeqIO.write() function. Once I've made a
> FASTA file, I can call an external tool to align them - and then load
> them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan
> to do next.

If you really want a list-like object with a format method in your
code, how about something like this:

class SeqRecordList(list) :
    """Subclass of the python list, to hold SeqRecord objects only."""
    #TODO - Override the list methods to make sure all the items
    #are indeed SeqRecord objects

    def format(self, format) :
        """Returns a string of all the records in a requested file format.

        The argument format should be any file format supported by
        the Bio.SeqIO.write() function. This must be a lower case string.
        """
        from Bio import SeqIO
        from StringIO import StringIO
        handle = StringIO()
        SeqIO.write(self, handle, format)
        handle.seek(0)
        return handle.read()

if __name__ == "__main__" :
    print "Loading records..."
    from Bio import SeqIO
    my_list = SeqRecordList(SeqIO.parse(open("ls_orchid.gbk"),"genbank"))
    print len(my_list)
    for format in ["fasta","tab"] :
        print
        print format
        print "="*len(format)
        print my_list.format(format)

Peter

From dalloliogm at gmail.com Wed Nov 12 18:17:48 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 12 Nov 2008 19:17:48 +0100
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com>
Message-ID: <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com>

On Wed, Nov 12, 2008 at 6:53 PM, Peter wrote:
> On Wed, Nov 12, 2008 at 4:25 PM, Giovanni Marco Dall'Olio
> wrote:
>> Hi,
>> I think it could be useful to add a generic SequenceSet object in biopython.
>> Such an object would represent a generic set of sequences, and could
>> have some useful methods like .format('fasta') or
>> .align('alignment_tool').
>> Is there something similar available already?
>
> Given your example to turn the SequenceSet into a FASTA file, then
> clearly you are thinking of a collection of SeqRecord objects rather
> than just Seq objects. For this kind of thing I personally just use a
> list of SeqRecord objects.
>
> If I want to turn a list of SeqRecord objects into a FASTA file, I can
> pass the list to the Bio.SeqIO.write() function. Once I've made a
> FASTA file, I can call an external tool to align them - and then load
> them in again using Bio.AlignIO or Bio.SeqIO depending on what I plan
> to do next.
>
>> Some use cases:
>> - a set of sequences that represents all introns in a particular gene,
>> on which I want to calculate the conservation of the splicing
>> regulatory sites.
>> - all genes sequences in an organisms, which I want to convert in EMBL format
>> - a set of seqs to be aligned or used as input for other tools
>> etc..
>
> All sensible use cases - but all seem to be covered by a simple python
> list of SeqRecord objects, or in some cases a list of Seq objects
> (e.g. the introns example, as I doubt the introns have names).
>

Not always.
For example, if I have a set of genes in an organism, sometimes I
would need to access only some of them, by their id; so, a
__getattribute__ method to make it work as a dictionary could also be
useful.

The fact is that I think that such an object would be so widely used,
that maybe it would be useful to implement it in biopython.
What I would do, honestly, is to create a GenericSeqRecordSet class
from which to derive Alignment, specifying that in an alignment all
the sequences should have the same length. It would not require much
work and it would change the interface.

very tiny little minusculus p.s. if you need help to implement such a
thing or anything else I can volunteer :).

> Peter
>

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Wed Nov 12 18:36:11 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Nov 2008 18:36:11 +0000
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com>
Message-ID: <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com>

Giovanni Marco Dall'Olio wrote:
>> All sensible use cases - but all seem to be covered by a simple python
>> list of SeqRecord objects, or in some cases a list of Seq objects
>> (e.g. the introns example, as I doubt the introns have names).
>
> Not always.
> For example, if I have a set of genes in an organism, sometimes I
> would need to access only some of them, by their id; so, a
> __getattribute__ method to make it work as a dictionary could also be
> useful.

OK, then use a dict of SeqRecords for this, as shown in the tutorial
chapter for Bio.SeqIO and the wiki. We even have a helper function
Bio.SeqIO.to_dict() to do this and check for duplicate keys.

If you need an order preserving dictionary, there are examples of this
on the net and there is even PEP372 for adding this to python itself:
http://www.python.org/dev/peps/pep-0372/

> The fact is that I think that such an object would be so widely used,
> that maybe it would be useful to implement it in biopython.
> What I would do, honestly, is to create a GenericSeqRecordSet class
> from which to derive Alignment, specifying that in an alignment all
> the sequences should have the same length. It would not require much
> work and it would change the interface.

I agree that IF we added some sort of "GenericSeqRecordSet class", it
might be sensible for the alignment objects to subclass it -
especially if you want it to behave like a python list primarily.

Note that in python sets are not order preserving.

> very tiny little minusculus p.s. if you need help to implement such a
> thing or anything else I can volunteer :).

That's good to hear :)

However, we'd have to establish the need for this new object first -
but so far we've only had two people's views so it's too early to form a
consensus. I don't see a strong reason for adding yet another object,
when the core language provides lists, sets and dicts which seem to be
enough.

Peter

From dalloliogm at gmail.com Thu Nov 13 00:16:44 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 13 Nov 2008 01:16:44 +0100
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> Message-ID: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> On Wed, Nov 12, 2008 at 7:36 PM, Peter wrote: > Giovanni Marco Dall'Olio wrote: >>> All sensible use cases - but all seem to be covered by a simple python >>> list of SeqRecord objects, or in some cases a list of Seq objects >>> (e.g. the introns example, as I doube the introns have names). >> >> Not always. >> For example, if I have a set of genes in an organism, sometimes I >> would need to access to only some of them, by their id; so, a >> __getattribute__ method to make it work as a dictionary could also be >> useful. > > OK, then use a dict of SeqRecords for this, as shown in the tutorial > chapter for Bio.SeqIO and the wiki. We even have a helper function > Bio.SeqIO.to_dict() to do this and check for duplicate keys. I would prefer a SeqRecordSet object with a to_dict method :) > If you need an order preserving dictionary, there are examples of this > on the net and there is even PEP372 for adding this to python itself: > http://www.python.org/dev/peps/pep-0372/ >> The fact is that I think that such an object would be so widely used, >> that maybe it would be useful to implement it in biopython. >> What I would do, honestly, is to create a GenericSeqRecordSet class >> from which to derive Alignment, specifying that in an alignment all >> the sequences should have the same lenght. It would not require much >> work and it would change the interface. > > I agree that IF we added some sort of "GenericSeqRecordSet class", it > might be sensible for the alignment objects to subclass it - > especially if you want it to behave list a python list primarily. 
Let's see it from another point of view.
In biopython, if you want to print a set of sequences in fasta format,
you have to do the following:
>>> s1 = SeqRecord(Seq('cacacac'))
>>> s2 = SeqRecord(Seq('cacacac'))
>>> seqs = s1, s2
>>> out = ''
>>> for seq in seqs:
...     # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
...     out += seq.format('fasta')
...
>>> print out

On the other side, printing an alignment in fasta format is a lot simpler:
>>> al = Alignment(SingleLetterAlphabet)
>>> al.add_sequence('s1', 'cacaca')
>>> al.add_sequence('s2', 'cacaca')
>>> print al.format('fasta')

I work more often with sets of sequences rather than with alignments.
So why is it more difficult to print some unrelated sequences in a
certain format than aligned sequences? I would end up using Alignment
objects also for sequences that are not aligned.

I am also thinking about many format parsers.

Wouldn't it be easier:
>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>> record_dict = seqs.to_dict()

than invoking SeqIO twice?

> Note that in python sets are not order preserving.
>
>> very tiny little minusculus p.s. if you need help for implement such a
>> thing or anything else I can volounteer :).
>
> That's good to hear :)
>
> However, we'd have to establish the need for this new object first -
> but so far we've only had two people's view so its too early to form a
> consensus. I don't see a strong reason for adding yet another object,
> when the core language provides lists, sets and dict which seem to be
> enough.

Take for example this code you wrote for me before:

> class SeqRecordList(list) :
>     """Subclass of the python list, to hold SeqRecord objects only."""
>     #TODO - Override the list methods to make sure all the items
>     #are indeed SeqRecord objects
>
>     def format(self, format) :
>         """Returns a string of all the records in a requested file format.
>
>         The argument format should be any file format supported by
>         the Bio.SeqIO.write() function. This must be a lower case string.
>         """
>         from Bio import SeqIO
>         from StringIO import StringIO
>         handle = StringIO()
>         SeqIO.write(self, handle, format)
>         handle.seek(0)
>         return handle.read()

It's very useful, but I don't think a python/biopython newbie would be
able to write it.
That's why I think it should be included.
Last year, I was in another laboratory and I didn't have much
experience with biopython, and I missed having this kind of object.

> Peter
>

Goodnight!!

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From dalloliogm at gmail.com Thu Nov 13 09:37:35 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 13 Nov 2008 10:37:35 +0100
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
Message-ID: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com>

I am writing a module to generate semi-random sets of haplotypes.
For example, let's say you want a set of 100 sequences of 200 SNPs, in
which a hotspot is located in a certain position: the module is meant
to generate such datasets, mainly for testing purposes.

You can find the code here:
- http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py

Could you give me some suggestions about this? For example, which
kinds of haplotype model would you think it could be useful to
implement (see the function paramsGenerator)?
What do you think about the way I have written this code? Would you
implement it in a different way?

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From mjldehoon at yahoo.com Thu Nov 13 10:27:57 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 13 Nov 2008 02:27:57 -0800 (PST)
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> Message-ID: <25667.98653.qm@web62408.mail.re1.yahoo.com> Adding new classes to Biopython should be done very carefully ... once they're in, it's difficult to remove them again. In the past, removing classes that turned out to be less than ideal was a real headache. Right now I don't see a clear need for a sequence set object ... read on. --- On Wed, 11/12/08, Giovanni Marco Dall'Olio > > > > OK, then use a dict of SeqRecords for this, as shown > > in the tutorial chapter for Bio.SeqIO and the wiki. > > We even have a helper function > > Bio.SeqIO.to_dict() to do this and check for duplicate > > keys. > > I would prefer a SeqRecordSet object with a to_dict method > Wouldn't it be easier: > >>> seqs = Bio.SeqIO.parse(filehandler, > 'fasta') > >>> record_dict = seqs.to_dict() > > than invoking SeqIO twice? Maybe, yes, but it's just a matter of typing and I don't think that by itself it is a good enough reason for a SeqRecordSet class. > Let's see it from another point of view. > In biopython, if you want to print a set of sequences in > fasta format, > you have to do the following: > >>> s1 = SeqRecord(Seq('cacacac')) > >>> s2 = SeqRecord(Seq('cacacac')) > >>> seqs = s1, s2 > >>> out = '' > >>> for seq in seqs: > # a "print seq.format('fasta')" statement won't work > # properly here, because of blank lines > out += seq.format('fasta') > >>> print out I don't quite understand why "print seq.format('fasta')" won't work. > Take for example this code you wrote for me before: > > > class SeqRecordList(list) : > > def format(self, format) : > > from Bio import SeqIO > > from StringIO import StringIO > > handle = StringIO() > > SeqIO.write(self, handle, format) > > handle.seek(0) > > return handle.read() > > It's very useful, but I don't think a > python/biopython newbie would be > able to write it. I agree that this is too complicated. 
What if we redefine SeqIO.write as

def write(self, handle=sys.stdout, format='fasta'):
    ...

So by default SeqIO.write prints to the screen. Then you can do

SeqIO.write(records)

where records is a list of SeqRecords.

--Michiel.

From tiagoantao at gmail.com Thu Nov 13 10:34:54 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 13 Nov 2008 10:34:54 +0000
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
In-Reply-To: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com>
References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com>
Message-ID: <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com>

I love the comment documentation, makes everything very easy to
understand at first read.
Where would you think this would fit in a PopGen hierarchy? Or to put
it in another way, please complete Bio.PopGen....

Tiago

On Thu, Nov 13, 2008 at 9:37 AM, Giovanni Marco Dall'Olio wrote:
> I am writing a module to generate semi-random sets of haplotypes.
> For example, let's say you want a set of 100 sequences of 200 SNPs, in
> which a hotspot is located in a certain position: the module is meant
> to generate such datasets, mainly for testing purposes.
>
> You can find the code here:
> - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py
>
> Could you give me some suggestions about this? For example, which
> kinds of haplotype model would you think it could be useful to
> implement (see the function paramsGenerator)?
> What do you think about the way I have written this code? Would you
> implement it in a different way?
>
> --
> -----------------------------------------------------------
>
> My Blog on Bioinformatics (italian): http://bioinfoblog.it
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
- Matthew Simmons,
http://www.tiago.org

From biopython at maubp.freeserve.co.uk Thu Nov 13 10:37:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Nov 2008 10:37:32 +0000
Subject: [BioPython] a sequence set object in biopython?
In-Reply-To: <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com>
References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com>
Message-ID: <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com>

On Thu, Nov 13, 2008 at 12:16 AM, Giovanni Marco Dall'Olio wrote:
>
> On Wed, Nov 12, 2008 at 7:36 PM, Peter wrote:
>> Giovanni Marco Dall'Olio wrote:
>>>> All sensible use cases - but all seem to be covered by a simple python
>>>> list of SeqRecord objects, or in some cases a list of Seq objects
>>>> (e.g. the introns example, as I doubt the introns have names).
>>>
>>> Not always.
>>> For example, if I have a set of genes in an organism, sometimes I
>>> would need to access to only some of them, by their id; so, a
>>> __getattribute__ method to make it work as a dictionary could also be
>>> useful.
>>
>> OK, then use a dict of SeqRecords for this, as shown in the tutorial
>> chapter for Bio.SeqIO and the wiki. We even have a helper function
>> Bio.SeqIO.to_dict() to do this and check for duplicate keys.
>
> I would prefer a SeqRecordSet object with a to_dict method :)

OK, that is a style choice.

BTW, you're using the word "Set" here rather than "List", which could
be misleading as in python sets have no order, but lists do.
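
To make the list / set / dict distinction in this thread concrete, here is a small sketch that needs no Biopython. Plain (id, sequence) tuples stand in for SeqRecord objects, and index_by_id is a hypothetical helper that only imitates the duplicate-key check attributed above to Bio.SeqIO.to_dict(); neither is part of any library.

```python
# Plain tuples stand in for SeqRecord objects (an assumption for illustration).
records = [("gene3", "ACGT"), ("gene1", "GGCC"), ("gene2", "TTAA")]

# A list keeps the records in the order they were loaded.
ids_in_order = [rec_id for rec_id, _ in records]


def index_by_id(pairs):
    """Hypothetical helper: build {id: sequence}, refusing duplicate ids."""
    indexed = {}
    for rec_id, sequence in pairs:
        if rec_id in indexed:
            raise ValueError("Duplicate key %r" % rec_id)
        indexed[rec_id] = sequence
    return indexed


by_id = index_by_id(records)  # lookup by id, as with a dict of records
as_set = set(records)         # membership tests only; no positions, no indexing
```

So a list covers "all records, in file order", a dict covers "the record with this id", and a set covers neither ordering nor lookup by key.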
>> If you need an order preserving dictionary, there are examples of this
>> on the net and there is even PEP372 for adding this to python itself:
>> http://www.python.org/dev/peps/pep-0372/
>
>>> The fact is that I think that such an object would be so widely used,
>>> that maybe it would be useful to implement it in biopython.
>>> What I would do, honestly, is to create a GenericSeqRecordSet class
>>> from which to derive Alignment, specifying that in an alignment all
>>> the sequences should have the same length. It would not require much
>>> work and it would change the interface.
>>
>> I agree that IF we added some sort of "GenericSeqRecordSet class", it
>> might be sensible for the alignment objects to subclass it -
>> especially if you want it to behave like a python list primarily.
>
> Let's see it from another point of view.
> In biopython, if you want to print a set of sequences in fasta format,
> you have to do the following:
>>>> s1 = SeqRecord(Seq('cacacac'))
>>>> s2 = SeqRecord(Seq('cacacac'))
>>>> seqs = s1, s2
>>>> out = ''
>>>> for seq in seqs:
>>>> # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
>>>> out += seq.format('fasta')
>>>> print out

First of all, in my opinion using variable names seq and seqs for
SeqRecord objects rather than Seq objects is confusing.

Secondly, creating SeqRecord objects without an ID is a very bad idea
if you want to output them to a file.

Thirdly, you can have as many blank lines as you like in a FASTA
format file. Your problem is that using "print" in python appends a
new line for display. For writing to a file, it is important that the
format("fasta") method include the trailing new line. So for printing
on screen you could do:

for rec in seqs: #seqs is a list of SeqRecord objects
    print rec.format("fasta").rstrip() #removing trailing new line as print adds one

Or (based on Michiel's email which arrived while I was writing mine)
use the stdout handle:

import sys
from Bio import SeqIO
SeqIO.write(seqs, sys.stdout, "fasta")

> On the other side, printing an alignment in fasta format is a lot simpler:
>>>> al = Alignment(SingleLetterAlphabet)
>>>> al.add_sequence('s1', 'cacaca')
>>>> al.add_sequence('s2', 'cacaca')
>>>> print al.format('fasta')
>
> I work more often with sets of sequences rather than with alignments.
> So, why it is more difficult to print some un-related sequences in a
> certain format, than aligned sequence? I would end up using Alignment
> objects also for sequences that are not aligned.

Out of interest, why do you want to print out records to screen in a
particular file format? Why not just write them to a file?

> I am also thinking about many format parsers.
>
> Wouldn't it be easier:
>>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>>> record_dict = seqs.to_dict()
>
> than invoking SeqIO twice?

You don't like this:

from Bio import SeqIO
record_dict = SeqIO.to_dict(SeqIO.parse(handle, format))

Well, I can live with it. We *could* make the SeqIO.parse function
always return a new object, a SeqRecordIterator which could have a
to_dict() method in addition to the iteration interface - but this is
overly complicated.

>> Note that in python sets are not order preserving.
>>
>>> very tiny little minusculus p.s. if you need help for implement such a
>>> thing or anything else I can volounteer :).
>>
>> That's good to hear :)
>>
>> However, we'd have to establish the need for this new object first -
>> but so far we've only had two people's view so its too early to form a
>> consensus. I don't see a strong reason for adding yet another object,
>> when the core language provides lists, sets and dict which seem to be
>> enough.
> > Take for example this code you wrote for me before: > >> class SeqRecordList(list) : >> """Subclass of the python list, to hold SeqRecord objects only.""" >> #TODO - Override the list methods to make sure all the items >> #are indeed SeqRecord objects >> >> def format(self, format) : >> """Returns a string of all the records in a requested file format. >> >> The argument format should be any file format supported by >> the Bio.SeqIO.write() function. This must be a lower case string. >> """ >> from Bio import SeqIO >> from StringIO import StringIO >> handle = StringIO() >> SeqIO.write(self, handle, format) >> handle.seek(0) >> return handle.read() > > It's very useful, but I don't think a python/biopython newbie would be > able to write it. > That's why I think it should be included. > Last year, I was in another laboratory and I didn't have much > experience with biopython, and I was missing such a kind of object. A python newbie should first learn about basic python lists, sets, etc. Peter From biopython at maubp.freeserve.co.uk Thu Nov 13 11:11:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 11:11:10 +0000 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <25667.98653.qm@web62408.mail.re1.yahoo.com> References: <5aa3b3570811121616u5f95cc8du9f0d91e4743f067f@mail.gmail.com> <25667.98653.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00811130311t4e813a8fqeb21504fd5696bf1@mail.gmail.com> Michiel wrote: >Marco wrote: >> Take for example this code you [Peter] wrote for me before: >> >> > class SeqRecordList(list) : >> > def format(self, format) : >> > from Bio import SeqIO >> > from StringIO import StringIO >> > handle = StringIO() >> > SeqIO.write(self, handle, format) >> > handle.seek(0) >> > return handle.read() >> >> It's very useful, but I don't think a >> python/biopython newbie would be >> able to write it. > > I agree that this is too complicated. 
This wasn't aimed at a beginner, but rather for Marco if he really wants to use this kind of object in his own code, or as a basis for further discussion. > What if we redefine SeqIO.write as > > def write(self, handle=sys.stdout, format='fasta'): > ... > > So by default SeqIO.write prints to the screen. Then you can do > > SeqIO.write(records) > > where records are a list of SeqRecord's. We could certainly include something like this in the documentation: #Just an example to create some records: from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord records = [SeqRecord(Seq("ACGT"),"Alpha"), SeqRecord(Seq("GTGC"),"Beta")] #One way to "print" records to screen, import sys from Bio import SeqIO SeqIO.write(records, sys.stdout, "fasta") I'm not so keen on making the handle default to standard out, but this is nicer than the suggestion you made some time ago that if the handle were omitted a string be returned (no longer an option since Bug 2628 was committed). Any other votes for the standard out default? Peter From dalloliogm at gmail.com Thu Nov 13 11:51:38 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 12:51:38 +0100 Subject: [BioPython] a sequence set object in biopython? In-Reply-To: <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com> References: <5aa3b3570811120825y6ed11c00y384751e8f0f7adff@mail.gmail.com> <320fb6e00811120953t57c206e7nd0c8151b92361d5a@mail.gmail.com> <5aa3b3570811121017u72eb7552v94275368cb23cf48@mail.gmail.com> <320fb6e00811121036w17e0d2acv6723c751350f1893@mail.gmail.com> <5aa3b3570811121616n65e3cc38mc7def11e3cd90b04@mail.gmail.com> <320fb6e00811130237y623df295o53043d87bf239b83@mail.gmail.com> Message-ID: <5aa3b3570811130351j6051b934n2216c8595814b8fe@mail.gmail.com> On Thu, Nov 13, 2008 at 11:37 AM, Peter wrote: > On Thu, Nov 13, 2008 at 12:16 AM, Giovanni Marco Dall'Olio wrote: >> >> I would prefer a SeqRecordSet object with a to_dict method :) > > OK, that is a style choice. 
> > BTW, you're using the word "Set" hear rather than "List", which could > be misleading as in python sets have no order, but lists do. Maybe the word 'SequencesPool' would be less misleading? I don't have much confidence with English :( The word List could also be misleading, because I was thinking about an object that could act as a dictionary as well. > Out of interest, why do you want to print out records to screen in a > particular file format? Why not just write them to a file? just for debugging purposes - I wasn't expecting the blankline in the output. > You don't like this: > > from Bio import SeqIO > record_dict = SeqIO.to_dict(SeqIO.parse(handle, format)) > > Well, I can live with it. We *could* make the SeqIO.parse function > always return a new object, a SeqRecordIterator which could have a > to_dict() method in addition to the iteration interface - but this is > overly complicated. ok.. so I understand, if it would take too much work, nevermind. I just thought it could have been an useful suggestion and sent it, because otherwise I would have forgot about it :). Cheers :) > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From hma2 at staffmail.ed.ac.uk Thu Nov 13 11:18:34 2008 From: hma2 at staffmail.ed.ac.uk (Hongwu Ma) Date: Thu, 13 Nov 2008 11:18:34 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: References: Message-ID: <491C0D0A.7060408@staffmail.ed.ac.uk> Sometimes when I parse the blast records in biopython using the following program I get the error "Your XML file was empty" but there are actually some results in the saved xml files. Anyone know what is the problem? Thanks in advance. 
Hongwu

myfolder = 'c:/mgenome/'
mydatafolder = myfolder + 'alphao/'
my_blast_db = myfolder + 'orfsre.txt'
my_blast_exe = myfolder + 'blastall.exe'
evalue = 0.0001
my_blast_file = mydatafolder + file
result_handle, error_handle = NCBIStandalone.blastall(blastcmd=my_blast_exe,
    program="tblastn", database=my_blast_db, infile=my_blast_file,
    expectation=evalue)
bres = result_handle.read()
save_file = open(myfolder + file[:3] + 'orfre.xml', "w")
save_file.write(bres)
save_file.close()
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:

>
>
>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From dalloliogm at gmail.com Thu Nov 13 11:55:11 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 13 Nov 2008 12:55:11 +0100
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
In-Reply-To: <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com>
References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com>
Message-ID: <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com>

On Thu, Nov 13, 2008 at 11:34 AM, Tiago Antão wrote:
> I love the comment documentation, makes everything very easy to
> understand at first read.

It's doctest

> Where would you think this would fit in a PopGen hierarchy? Or to put
> it in another way, please complete Bio.PopGen....

Maybe we could create a module called 'Generators' where to put all similar Generators.

p.s. are you using some kind of repository/rcs for your PopGen modules? I just don't want to write code that will be difficult to reimplement in the future PopGen module..

> Tiago
>
> On Thu, Nov 13, 2008 at 9:37 AM, Giovanni Marco Dall'Olio
> wrote:
>> I am writing a module to generate semi-random sets of haplotypes.
>> For example, let's say you want a set of 100 sequences of 200 SNPs, in >> which an hotspot is located in a certain position: the module is meant >> to generate such datasets, mainly for testing purposes. >> >> You can find the code here: >> - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py >> >> Could you give me some suggestions about this? For example, which >> kinds of haplotype model would you think it could be useful to >> implement (see the function paramsGenerator)? >> What do you think about the way I have written this code? Would you >> implement it in a different way? >> >> -- >> ----------------------------------------------------------- >> >> My Blog on Bioinformatics (italian): http://bioinfoblog.it >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > "Data always beats theories. 'Look at data three times and then come > to a conclusion,' versus 'coming to a conclusion and searching for > some data.' The former will win every time." > ?Matthew Simmons, > http://www.tiago.org > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Nov 13 12:09:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 12:09:24 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: <491C0D0A.7060408@staffmail.ed.ac.uk> References: <491C0D0A.7060408@staffmail.ed.ac.uk> Message-ID: <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> On Thu, Nov 13, 2008 at 11:18 AM, Hongwu Ma wrote: > Sometimes when I parse the blast records in biopython using the following > program I get the error "Your XML file was empty" but there are actually > some results in the saved xml files. 
Anyone know what is the problem? > Thanks in advance. > Hongwu > > ... > bres=result_handle.read() > save_file = open(myfolder+file[:3]+'orfre.xml', "w") > save_file.write(bres) > save_file.close() > blast_records = NCBIXML.parse(result_handle) > for blast_record in blast_records: > ... When you do result_handle.read() it reads in all the data in the handle - leaving it empty (pointing at the end of the file). When the parser tries to read more data from the handle there isn't any, which is why the parser says the file seems to be empty. You'll have to "reset" the handle to the beginning. One way would be to open the file you just wrote to disk: ... save_file.close() result_handle = open(...) blast_records = NCBIXML.parse(result_handle) ... Peter From hma2 at staffmail.ed.ac.uk Thu Nov 13 13:55:12 2008 From: hma2 at staffmail.ed.ac.uk (Hongwu Ma) Date: Thu, 13 Nov 2008 13:55:12 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> References: <491C0D0A.7060408@staffmail.ed.ac.uk> <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> Message-ID: <491C31C0.2030609@staffmail.ed.ac.uk> Thanks, Peter. I tried to reopen the xml file and it was working. I also tried to parse the result before reading as below: ... blast_records = NCBIXML.parse(result_handle) bres=result_handle.read() ... I found I still got the problem, a good xml file but empty blast_records. Why the later read() function affect the parse before it? > On Thu, Nov 13, 2008 at 11:18 AM, Hongwu Ma wrote: > >> Sometimes when I parse the blast records in biopython using the following >> program I get the error "Your XML file was empty" but there are actually >> some results in the saved xml files. Anyone know what is the problem? >> Thanks in advance. >> Hongwu >> >> ... 
>> bres=result_handle.read() >> save_file = open(myfolder+file[:3]+'orfre.xml', "w") >> save_file.write(bres) >> save_file.close() >> blast_records = NCBIXML.parse(result_handle) >> for blast_record in blast_records: >> ... >> > > When you do result_handle.read() it reads in all the data in the > handle - leaving it empty (pointing at the end of the file). When the > parser tries to read more data from the handle there isn't any, which > is why the parser says the file seems to be empty. You'll have to > "reset" the handle to the beginning. > > One way would be to open the file you just wrote to disk: > > ... > save_file.close() > result_handle = open(...) > blast_records = NCBIXML.parse(result_handle) > ... > > Peter > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From tiagoantao at gmail.com Thu Nov 13 14:04:46 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 14:04:46 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> Message-ID: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> > Maybe we could create a module called 'Generators' where to put all > similar Generators. I'm ok with that. Do you envision more generators in the future? Actually simcoal could also be used to generate these things (as long as I remember to do an arlequin parser to get the results out). In the short run it would be nice to extend the tutorial also and put a few tests. 
I don't think your generator code causes a major disturbance or any maintenance problem upstream (read: for Peter, Michiel or final users), so, from my point of view it could be put in the main distribution in a short time frame. > p.s. are you using some kind of repository/rcs for you PopGen modules? > I just don't want to write code that will be difficult to reimplement > in the future PopGen module.. Just my hardisk, open bio CVS and now most everything is in git ;) . For now don't worry with that, you actually have all my code (except recent bugfixes). Lets just keep communicating. From biopython at maubp.freeserve.co.uk Thu Nov 13 14:14:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Nov 2008 14:14:26 +0000 Subject: [BioPython] said Your XML file was empty in parsing blast records but can see the results in the saved xml file In-Reply-To: <491C31C0.2030609@staffmail.ed.ac.uk> References: <491C0D0A.7060408@staffmail.ed.ac.uk> <320fb6e00811130409s488e98e1sbbdff52189255e2e@mail.gmail.com> <491C31C0.2030609@staffmail.ed.ac.uk> Message-ID: <320fb6e00811130614i7eaaaec4pc774461be96b3ab4@mail.gmail.com> On Thu, Nov 13, 2008 at 1:55 PM, Hongwu Ma wrote: > > Thanks, Peter. I tried to reopen the xml file and it was working. Good. > I also tried to parse the result before reading as below: > ... > > I found I still got the problem, a good xml file but empty blast_records. > Why the later read() function affect the parse before it? You can only call handle.read() once - it gives you all the remaining data in the file. Once you've called handle.read(), any further calls to handle.read() or handle.readline() etc won't return any more data. If you call result_handle.read() first, then when you give result_handle to the parser, it can't get any data from it. 
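The behaviour described here can be reproduced without BLAST at all. The following is a minimal sketch, not Biopython code: it uses an in-memory io.StringIO handle, and toy_parse is a hypothetical stand-in for a parser that reads lazily (that NCBIXML.parse behaves this way is an assumption). It is written for current Python 3 rather than the Python 2 used elsewhere in this thread.

```python
import io


def toy_parse(handle):
    """Hypothetical stand-in for a lazy parser: yields one stripped line at a time."""
    for line in handle:
        yield line.strip()


handle = io.StringIO("record1\nrecord2\n")

# Creating the generator does not read anything from the handle yet.
records = toy_parse(handle)

# read() consumes everything and leaves the handle positioned at the end.
saved = handle.read()
second_read = handle.read()  # nothing is left, so this is an empty string

# The generator only starts reading now, and finds the handle exhausted.
parsed_after_read = list(records)

# Rewinding (or reopening a saved file) makes the data available again.
handle.seek(0)
parsed_after_seek = list(toy_parse(handle))

print(repr(second_read))
print(parsed_after_read)
print(parsed_after_seek)
```

This would also account for the "parse first, read later" case: if the parser does no reading until it is iterated, a read() issued in between still empties the handle. A pipe from an external program usually cannot be rewound with seek(), so reopening the saved file is the safer route there.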
Have a look at reading files in python which should help to explain the ideas:
http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files

Peter

From biopython at maubp.freeserve.co.uk Thu Nov 13 14:19:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Nov 2008 14:19:13 +0000
Subject: [BioPython] [PopGen] a random Haplotype Sets generator
In-Reply-To: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com>
References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com>
Message-ID: <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com>

On Thu, Nov 13, 2008 at 2:04 PM, Tiago Antão wrote:
>> Maybe we could create a module called 'Generators' where to put all
>> similar Generators.
>
> I'm ok with that. Do you envision more generators in the future?
> Actually simcoal could also be used to generate these things (as long
> as I remember to do an arlequin parser to get the results out).
> In the short run it would be nice to extend the tutorial also and put
> a few tests.

You're talking about a module to create biological data (e.g. haplotypes), right? I don't think using the word "generators" is a good idea. Python itself uses this terminology in the context of iteration (see generator functions, generator expressions, etc). That code did not look like a python generator to me.

Peter

From lueck at ipk-gatersleben.de Thu Nov 13 14:06:13 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 13 Nov 2008 15:06:13 +0100
Subject: [BioPython] Problems with Emboss.Primer3
Message-ID: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de>

Hi!

I'm trying to generate a Primer3 file but I have some problems because my output file is always empty. Unfortunately I don't get an error message.
Here's my code:

from Bio import Fasta
from Bio.Emboss.Applications import Primer3Commandline
from Bio.Application import generic_run
from Bio.Emboss.Primer import Primer3Parser

primer_cl = Primer3Commandline()
primer_cl.set_parameter("-sequence", "p3input.txt")
primer_cl.set_parameter("-outfile", "out.pr3")
primer_cl.set_parameter("-productsizerange", "350,10000")
primer_cl.set_parameter("-target", "%s,%s" % (50, 500))
result, messages, errors = generic_run(primer_cl)

p3input.txt looks like this:

PRIMER_SEQUENCE_ID=HF15E08r
SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT
PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200
PRIMER_OPT_TM=60.0
PRIMER_MIN_TM=58.0
PRIMER_MAX_TM=65.0
PRIMER_MAX_DIFF_TM=3.0
PRIMER_DNA_CONC=420
PRIMER_NUM_RETURN=1
=
PRIMER_SEQUENCE_ID=HO05B04S
SEQUENCE=GAAAACCCAATGACAGTAGGATGACAAGGGAAAACTGGTGAGCAACGTCGTAGTCGGGGTTACCACCGGCGGGAAAAAGTAGCAAAACTATGTCATGTCTTATAATCTGGAGTTGGGAACACCTTGTATTATACTCGTGTCTGGGGATCGACCGATCGGTCGCGTAGAAGAAAAACCCAAAGCGCGGAAATGGACCGCGCCAACAAAAAAAGAGGGTGCGGGTGTGGATAATATGGAGAAGAACTGTATTTTGCTTACCCCCTTGATTCTTTTGTATGTAAAATGTGGGCACTGTCAGACCTCACTGTGTGATCAAATCCTCTCTGTCCTGTCCTGTCCTGAAGGGGCCTCTCGTTCTGGATGAATAAACAGCAAATAACTTTGCGTGTGGCTGGCCCCACCTGTCGGTGATTGGTAATTAAAACGACGGTAATTGTTGTG
PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200
PRIMER_OPT_TM=60.0
PRIMER_MIN_TM=58.0
PRIMER_MAX_TM=65.0
PRIMER_MAX_DIFF_TM=3.0
PRIMER_DNA_CONC=420
PRIMER_NUM_RETURN=1
...

Does someone have an idea what the problem is?
Thanks in advance, Stefanie From dalloliogm at gmail.com Thu Nov 13 14:27:50 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 15:27:50 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> Message-ID: <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> On Thu, Nov 13, 2008 at 3:19 PM, Peter wrote: > On Thu, Nov 13, 2008 at 2:04 PM, Tiago Ant?o wrote: >>> Maybe we could create a module called 'Generators' where to put all >>> similar Generators. >> >> I'm ok with that. Do you envision more generators in the future? >> Actually simcoal could also be used to generate these things (as long >> as I remember to do an arlequin parser to get the results out). >> In the short run it would be nice to extend the tutorial also and put >> a few tests. > > You're talking about a module to create biological data (e.g. > haplotypes), right? I'm don't think using the word "generators" is a > good idea. Python itself uses this terminology in the context of > iteration (see generator functions, generator expressions, etc). That > code did not look like a python generator to me. This is right: which word can I use, then? HaplotypesSampler? RandomHaplotypesSpawner? HaplotypesCreator? Anyway, I will change the module's interface sooner, it will accept different parameters. I need it will need still some amount of work.. 
> Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From tiagoantao at gmail.com Thu Nov 13 14:32:23 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 14:32:23 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> Message-ID: <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> > This is right: which word can I use, then? > HaplotypesSampler? RandomHaplotypesSpawner? > HaplotypesCreator? Considering that this is probably a small piece of code in the long run (correct me if I am wrong), I suggest creating Bio.PopGen.Utils.NameToBeDecided.py From p.j.a.cock at googlemail.com Thu Nov 13 14:43:46 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 13 Nov 2008 14:43:46 +0000 Subject: [BioPython] Problems with Emboss.Primer3 In-Reply-To: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de> References: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com> Stefanie L?ck wrote: > Hi! > > I'm trying to generate a Primer3 file but I have some problems because > my output file is allways empty. > Unfortunatly I don't get an error message. Hi Stefanie, I have a couple of suggestions to try and work out what is wrong here... 
> Here's my code:
>
> from Bio import Fasta
> from Bio.Emboss.Applications import Primer3Commandline
> from Bio.Application import generic_run
> from Bio.Emboss.Primer import Primer3Parser

Note Bio.Emboss.Primer was deprecated for Biopython 1.49, I think you'll want to use Bio.Emboss.Primer3 instead.

> primer_cl = Primer3Commandline()
> primer_cl.set_parameter("-sequence", "p3input.txt")
> primer_cl.set_parameter("-outfile", "out.pr3")
> primer_cl.set_parameter("-productsizerange", "350,10000")
> primer_cl.set_parameter("-target", "%s,%s" % (50, 500))
> result, messages, errors = generic_run(primer_cl)
> ...

What does this give:

print "Command line:"
print primer_cl
print "Return code:"
print result.return_code
print "Errors:"
print errors.read()
print "Messages:"
print messages.read()

Also try running the command line by hand at the command prompt. I get this, which may mean a problem with the input file:

$ eprimer3 -sequence p3input.txt -outfile out.pr3 -target 50,500 -productsizerange 350,10000
Picks PCR primers and hybridization oligos
Error: Unable to read sequence 'p3input.txt'
Died: eprimer3 terminated: Bad value for '-sequence' and no prompt

Is your input file really expected to work?
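If the input format is the suspect, one quick check is to hand eprimer3 a plain FASTA file instead. Below is a minimal sketch, assuming the input really is laid out as TAG=VALUE lines with a lone "=" closing each record, as in the posted p3input.txt. The function names boulder_to_fasta and _flush and the line-wrap width are illustrative choices, not part of Biopython or EMBOSS, and the code targets Python 3.

```python
def _flush(record, fasta_lines, width):
    """Append one FASTA entry for a finished record, if it has a sequence."""
    seq = record.get("SEQUENCE")
    if not seq:
        return
    name = record.get("PRIMER_SEQUENCE_ID", "unnamed")
    fasta_lines.append(">" + name)
    for start in range(0, len(seq), width):
        fasta_lines.append(seq[start:start + width])


def boulder_to_fasta(text, width=60):
    """Convert TAG=VALUE records (closed by a lone '=') into FASTA text.

    Only PRIMER_SEQUENCE_ID and SEQUENCE are used; other tags are ignored.
    """
    fasta_lines = []
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "=":
            _flush(record, fasta_lines, width)
            record = {}
            continue
        tag, _, value = line.partition("=")
        record[tag] = value
    _flush(record, fasta_lines, width)  # the last record may lack a terminator
    if not fasta_lines:
        return ""
    return "\n".join(fasta_lines) + "\n"


example = """PRIMER_SEQUENCE_ID=seq1
SEQUENCE=ACGTACGTAC
PRIMER_OPT_TM=60.0
=
PRIMER_SEQUENCE_ID=seq2
SEQUENCE=GGCC
=
"""
fasta = boulder_to_fasta(example, width=8)
print(fasta)
```

The resulting text can be written to a file and passed to -sequence in place of p3input.txt. The per-record PRIMER_* settings are dropped by this conversion and would have to be supplied as eprimer3 command line options.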
Reading the docs I would suggest trying a FASTA file as input, but I am not familiar with this tool: http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html Peter From bsouthey at gmail.com Thu Nov 13 15:29:38 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 13 Nov 2008 09:29:38 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> Message-ID: <491C47E2.60904@gmail.com> Tiago Ant?o wrote: >> This is right: which word can I use, then? >> HaplotypesSampler? RandomHaplotypesSpawner? >> HaplotypesCreator? >> > > Considering that this is probably a small piece of code in the long > run (correct me if I am wrong), I suggest creating > Bio.PopGen.Utils.NameToBeDecided.py > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, I really don't mean to be negative, but you have certain responsibilities once you release code into the Biopython community. Part of my concern is that some of this is being overlooked especially in terms of the user of the code. I do see that simulation of SNPs is useful for users so it is important that it integrated correctly. I think Michiel's recent comment in 'a sequence set object in biopython' thread is important here as well: "Adding new classes to Biopython should be done very carefully ... once they're in, it's difficult to remove them again. 
In the past, removing classes that turned out to be less than ideal was a real headache." While I have not looked at the code, my view is that must remain integrated into the PopGen module. I would expect that a user would some Biopython (PopGen) modules with some simulated SNPs. I would prefer that Biopython remains as much as possible a set of integrated tools rather than just a collection of tools. This is a clear example where if it is not totally integrated then I don't see the point in including it in Biopython. The second aspect is that it must have a very stable API, similarly to Michiel's comment is that changing APIs after a release is also a pain especially if the module has been around a long time. Based on your first post, I would argue that you are not quite at this stage yet. Bruce From dalloliogm at gmail.com Thu Nov 13 16:07:51 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 17:07:51 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C47E2.60904@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> Message-ID: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> On Thu, Nov 13, 2008 at 4:29 PM, Bruce Southey wrote: > Tiago Ant?o wrote: >>> >>> This is right: which word can I use, then? >>> HaplotypesSampler? RandomHaplotypesSpawner? >>> HaplotypesCreator? 
>>> >> >> Considering that this is probably a small piece of code in the long >> run (correct me if I am wrong), I suggest creating >> Bio.PopGen.Utils.NameToBeDecided.py >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > > Hi, > I really don't mean to be negative, but you have certain responsibilities > once you release code into the Biopython community. Part of my concern is > that some of this is being overlooked especially in terms of the user of the > code. I do see that simulation of SNPs is useful for users so it is > important that it integrated correctly. > > I think Michiel's recent comment in 'a sequence set object in biopython' > thread is important here as well: > > "Adding new classes to Biopython should be done very carefully ... once > they're in, it's difficult to remove them again. In the past, removing > classes that turned out to be less than ideal was a real headache." > > While I have not looked at the code, my view is that must remain integrated > into the PopGen module. I would expect that a user would some Biopython > (PopGen) modules with some simulated SNPs. I would prefer that Biopython > remains as much as possible a set of integrated tools rather than just a > collection of tools. This is a clear example where if it is not totally > integrated then I don't see the point in including it in Biopython. > > The second aspect is that it must have a very stable API, similarly to > Michiel's comment is that changing APIs after a release is also a pain > especially if the module has been around a long time. Based on your first > post, I would argue that you are not quite at this stage yet. ehi, wait :) I wasn't proposing to integrate this module in biopython, at least not yet!! :) This is a module to generate test sets to help the development of the other future PopGen modules. 
For example, we wanted to write a function to calculate the Fst statistics over snps data. The Fst is an index that tells you if, given two populations, they follow the same pattern of variability, and therefore can be considered as two subpopulations of the same population or not. To test such a script, you will need a module like the one I wrote here: for example, you could create two samples of 200 individuals with the same frequencies at every site, and see what your Fst script tells. Then, probably, compare the results with another tool that is already know to calculate the Fst correctly. So I was just asking for any suggestions - which models should I implement in this generator? And how? Which parameters should it accept? Should it use the random module? > Bruce > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From tiagoantao at gmail.com Thu Nov 13 16:33:18 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 16:33:18 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C47E2.60904@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> Message-ID: <6d941f120811130833o194c280fqc8707bd18472a369@mail.gmail.com> On Thu, Nov 13, 2008 at 3:29 PM, Bruce Southey wrote: > While I have not looked at the code, my view is that must remain 
integrated
> into the PopGen module. I would expect that a user would some Biopython
> (PopGen) modules with some simulated SNPs. I would prefer that Biopython
> remains as much as possible a set of integrated tools rather than just a
> collection of tools. This is a clear example where if it is not totally
> integrated then I don't see the point in including it in Biopython.

There are several dimensions here, and I would like to sum up my ideas on several things being floated around:

1. Support for tools with a small user base: I do think that the size of the user base should not be a fundamental criterion. As long as tools are maintained (which, I agree, might be a problem with some fringe applications), this should not be an issue. A good example is fdist support in PopGen: the user base for the method seems to be increasing quite a lot because of code built on top of Bio.PopGen.FDist (something I was not expecting, to be honest).

2. Integration inside PopGen: Up to now, there has been an effort in PopGen to have a coherent module where all parts interoperate. With the exception of Simcoal output, all the rest works in a cohesive way: you can take a genepop file and feed it to fdist, for instance, as the module has provisions for interop (the same goes for the new LDNe code that I have).

3. Integration with the rest of biopython. I do expect things to work quite smoothly, like extracting SNPs from sequences and feeding them to fdist, ldne and (future) statistics. I see issues with microsatellite/STR + RFLP stuff, but that is because there might be little provision in the rest of biopython for that type of marker.

4. New code and new developers. I think that an overly stringent process will put new people off. I have no problems in accepting _non crucial code_ that does _not impose big maintenance hurdles_, though that code might be somewhat naive in the big picture (maybe this particular example should actually go to the test base, BTW). The truth is, an overly stringent process, while it might assure fantastic code, puts up a gigantic barrier for new people. I am more in favor of a learning process where less fundamental code can be accepted at the beginning. I don't want to discourage new people; I think a balance between quality and encouragement can be struck.

> The second aspect is that it must have a very stable API, similarly to
> Michiel's comment is that changing APIs after a release is also a pain
> especially if the module has been around a long time. Based on your first
> post, I would argue that you are not quite at this stage yet.

Agree, especially with crucial functionality (but maybe not so much with less crucial parts). That is why I have avoided committing my statistics code to Biopython (although it has existed for quite a long time and is available on git): the API has to be future-resilient! In fact I have a proposal to make on this front, but because I want the API to be as future-proof as possible, the proposal will not be all-encompassing for now (I still don't know how to design a future-proof API for multi-locus statistics like simple linkage disequilibrium or more modern things like EHH).

But yes, to be honest I think open-bio projects err on the excessively bureaucratic side and discourage new people.
From tiagoantao at gmail.com Thu Nov 13 16:38:21 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 13 Nov 2008 16:38:21 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> Message-ID: <6d941f120811130838y5fc5e2c6qc42c37ed065bc616@mail.gmail.com> Hi, > For example, we wanted to write a function to calculate the Fst > statistics over snps data. Pigging back on Bruce comment on APIs being future proof, this is not really what we want ;) We want to be able to calculate Fst for any marker (SNPs, Microsatellites, AFLPs, sequences). 
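One way to keep such a function marker-agnostic is to accept allele frequencies per population rather than SNP calls. The sketch below is only an illustration, not Bio.PopGen code: calc_fst and its input layout are hypothetical, populations are weighted equally, and the estimator is the simple single-locus heterozygosity ratio (H_T - H_S) / H_T with no sample-size correction. It targets Python 3.

```python
def expected_heterozygosity(freqs):
    """Return 1 - sum(p_i^2) for a dict mapping allele -> frequency."""
    return 1.0 - sum(p * p for p in freqs.values())


def calc_fst(populations):
    """Single-locus Fst from a list of allele-frequency dicts.

    Hypothetical helper: alleles can be any hashable label (a SNP base,
    a microsatellite length, an AFLP band), and every population gets
    the same weight.
    """
    if len(populations) < 2:
        raise ValueError("need at least two populations")
    n = len(populations)
    # Mean within-population expected heterozygosity (H_S).
    h_s = sum(expected_heterozygosity(f) for f in populations) / n
    # Average the allele frequencies, then compute total heterozygosity (H_T).
    alleles = set()
    for f in populations:
        alleles.update(f)
    pooled = {a: sum(f.get(a, 0.0) for f in populations) / n for a in alleles}
    h_t = expected_heterozygosity(pooled)
    if h_t == 0.0:
        return 0.0  # monomorphic everywhere: no differentiation to measure
    return (h_t - h_s) / h_t


# Two samples with identical frequencies, as in the test case suggested earlier.
same = calc_fst([{"A": 0.3, "G": 0.7}, {"A": 0.3, "G": 0.7}])
# Two populations fixed for different microsatellite alleles.
fixed = calc_fst([{120: 1.0}, {122: 1.0}])
print(same, fixed)
```

Because the input is just labelled frequencies, the same call works for SNPs and for microsatellite lengths. Multi-locus data and unequal sample sizes are deliberately left out of this sketch.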
We cannot have something like: calc_fst(put_your_snps_here) What we want is: calc_fst(put_your_marker_frequencies_here) We want to serve all, not just ourselves ;) Tiago From dalloliogm at gmail.com Thu Nov 13 17:51:30 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 13 Nov 2008 18:51:30 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> Message-ID: <5aa3b3570811130951x549e29a4xa400d3df1ec9123e@mail.gmail.com> On Thu, Nov 13, 2008 at 3:04 PM, Tiago Ant?o wrote: >> p.s. are you using some kind of repository/rcs for you PopGen modules? >> I just don't want to write code that will be difficult to reimplement >> in the future PopGen module.. > > Just my hardisk, open bio CVS and now most everything is in git ;) . > For now don't worry with that, you actually have all my code (except > recent bugfixes). Lets just keep communicating. well, if you want to re-use the repository on github, I think you will have to register there and then tell me your username, so I will be able to add you as a collaborator. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From bourbine at yahoo.de Thu Nov 13 17:55:58 2008 From: bourbine at yahoo.de (Samuel Bader) Date: Thu, 13 Nov 2008 17:55:58 +0000 (GMT) Subject: [BioPython] local uniprot search Message-ID: <898399.98502.qm@web28508.mail.ukl.yahoo.com> Hi, I?m a newbe in programming and like to make a local uniprot search (submit accession number and return the entry, I like the whole entry not just the sequence). 
So far I have always scanned through the whole UniProt file, but this takes several minutes on my computer. As it is much faster on the internet, I think there must also be a faster way to do it locally. Does somebody know how this could be done? Thanks Cheers From bsouthey at gmail.com Thu Nov 13 18:57:30 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 13 Nov 2008 12:57:30 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> Message-ID: <491C789A.3070004@gmail.com> Giovanni Marco Dall'Olio wrote: > On Thu, Nov 13, 2008 at 4:29 PM, Bruce Southey wrote: > >> Tiago Antão wrote: >> >>>> This is right: which word can I use, then? >>>> HaplotypesSampler? RandomHaplotypesSpawner? >>>> HaplotypesCreator? 
>>>> >>>> >>> Considering that this is probably a small piece of code in the long >>> run (correct me if I am wrong), I suggest creating >>> Bio.PopGen.Utils.NameToBeDecided.py >>> >>> >>> >> Hi, >> I really don't mean to be negative, but you have certain responsibilities >> once you release code into the Biopython community. Part of my concern is >> that some of this is being overlooked, especially in terms of the user of the >> code. I do see that simulation of SNPs is useful for users so it is >> important that it is integrated correctly. >> >> I think Michiel's recent comment in the 'a sequence set object in biopython' >> thread is important here as well: >> >> "Adding new classes to Biopython should be done very carefully ... once >> they're in, it's difficult to remove them again. In the past, removing >> classes that turned out to be less than ideal was a real headache." >> >> While I have not looked at the code, my view is that it must remain integrated >> into the PopGen module. I would expect that a user would use some Biopython >> (PopGen) modules with some simulated SNPs. I would prefer that Biopython >> remains as much as possible a set of integrated tools rather than just a >> collection of tools. This is a clear example where if it is not totally >> integrated then I don't see the point in including it in Biopython. >> >> The second aspect is that it must have a very stable API. Similarly to >> Michiel's comment, changing APIs after a release is also a pain, >> especially if the module has been around a long time. Based on your first >> post, I would argue that you are not quite at this stage yet. >> > > ehi, wait :) I wasn't proposing to integrate this module in biopython, > at least not yet!! :) > Oh, I am on the right list? It does say Biopython... 
:-) > This is a module to generate test sets to help the development of the > other future PopGen modules. > Great! > For example, we wanted to write a function to calculate the Fst > statistics over snps data. > The Fst is an index that tells you if, given two populations, they > follow the same pattern of variability, and therefore can be > considered as two subpopulations of the same population or not. > To test such a script, you will need a module like the one I wrote > here: for example, you could create two samples of 200 individuals > with the same frequencies at every site, and see what your Fst script > tells. Then, probably, compare the results with another tool that is > already know to calculate the Fst correctly. > > So I was just asking for any suggestions - which models should I > implement in this generator? And how? Which parameters should it > accept? Should it use the random module? > > The importance is more the API than the actual implementation - as the later posts by Tiago indicate. Some coding related comments: freqs_per_site and alleles_per_site are lists. This is a problem because these could get very large, it is inflexible and you could become out of sync. While you do check for length, you should be more informative of which has a different length. Also you need to check for valid inputs (frequencies between 0 and 1, bases in ACGT). Some other comments Perhaps I misunderstood the situation but the major problem that I have is that the locations are treated as independent so your model assumes unlinked loci. I just don't find this a useful scenario. You assume that the user knows exactly which locations and frequency to change. Often you just want a random frequency and random location. In that case you need to randomly select locations and frequencies based on some function. But I do not find the mode=='random' of paramsGenerator sufficient to address this. 
Further, you might want a random sequence of some length but you may not want all locations to change. While you could set those locations to zero, a sparser form would be desirable. Also, the randomly generated frequencies should have a way to be limited to ranges other than the [0, 1) of random.random. Obviously the question is whether or not the user has to do it themselves. One particular use of generating SNPs pertains to known genes or sequences. In such cases it would be great to use a known sequence as a base for the simulation. Further, it would be very useful to be able to incorporate known SNP data, especially frequencies, from some source like HapMap (http://www.hapmap.org/). A nice but harder problem is to do this based on a protein sequence, since many diseases refer to amino acids. Perhaps my biggest 'disappointment' is the lack of ancestry control, because I am also interested in families or some admixture in a population. This just generates sequences randomly, assuming you are randomly selecting individuals from a homogeneous population. I do understand this usage so it is not that important to include this here.
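A minimal sketch of the input checks suggested above (reporting which list has which length, frequencies within [0, 1], bases in ACGT). The function name `validate_site_params` and the data layout (one frequency plus a tuple of alleles per site) are assumptions made for illustration only; they are not taken from the actual HaplotypesGenerator code.

```python
VALID_BASES = frozenset("ACGT")


def validate_site_params(freqs_per_site, alleles_per_site):
    """Raise ValueError with an informative message if the inputs are unusable.

    Hypothetical layout: freqs_per_site[i] is a single frequency for site i,
    alleles_per_site[i] is a sequence of bases (e.g. ("A", "C")) for site i.
    """
    # Report both lengths, so the caller can see which input is off.
    if len(freqs_per_site) != len(alleles_per_site):
        raise ValueError(
            "freqs_per_site has %d sites but alleles_per_site has %d"
            % (len(freqs_per_site), len(alleles_per_site))
        )
    for site, (freq, alleles) in enumerate(zip(freqs_per_site, alleles_per_site)):
        # Frequencies must be valid probabilities.
        if not 0.0 <= freq <= 1.0:
            raise ValueError(
                "site %d: frequency %r is outside [0, 1]" % (site, freq)
            )
        # Only nucleotide bases are accepted in this sketch.
        bad = [a for a in alleles if a not in VALID_BASES]
        if bad:
            raise ValueError("site %d: invalid allele(s) %r" % (site, bad))
```

Calling `validate_site_params([0.5, 0.1], [("A", "C"), ("G", "T")])` returns silently, while mismatched lengths, a frequency such as 1.5, or an allele such as "X" raise a `ValueError` naming the offending site.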
Bruce From dalloliogm at gmail.com Fri Nov 14 11:21:02 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 14 Nov 2008 12:21:02 +0100 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491C789A.3070004@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> Message-ID: <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> On Thu, Nov 13, 2008 at 7:57 PM, Bruce Southey wrote: > > Oh, I am on the right list? It does say Biopython... :-) I added a [PopGen] tag to the subject of the mail, to indicate that it was related to the PopGen module and it s development. >> This is a module to generate test sets to help the development of the >> other future PopGen modules. >> > > Great! > >> For example, we wanted to write a function to calculate the Fst >> statistics over snps data. >> The Fst is an index that tells you if, given two populations, they >> follow the same pattern of variability, and therefore can be >> considered as two subpopulations of the same population or not. >> To test such a script, you will need a module like the one I wrote >> here: for example, you could create two samples of 200 individuals >> with the same frequencies at every site, and see what your Fst script >> tells. Then, probably, compare the results with another tool that is >> already know to calculate the Fst correctly. >> >> So I was just asking for any suggestions - which models should I >> implement in this generator? And how? 
Which parameters should it >> accept? Should it use the random module? >> >> > > The importance is more the API than the actual implementation - as the later > posts by Tiago indicate. > > Some coding related comments: > freqs_per_site and alleles_per_site are lists. > This is a problem because these could get very large, it is inflexible and > you could become out of sync. They are not required to be lists. freqs_per_site and alleles_per_site can be any kind of Python object with a __getitem__ and a __len__ method. What I would like to do now is to create two 'Freqs' and 'Alleles' objects with such methods, so I can use them as containers for this information without having to change the actual interface. The __getitem__ function could return a background value (0.5) for any position except for those that are defined differently at initialization. This would also save memory. Have a look at the new changes: - http://tinyurl.com/64tfef (http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/HaplotypesGenerator.py) > While you do check for length, you should be more informative of which has a > different length. > Also you need to check for valid inputs (frequencies between 0 and 1, bases > in ACGT). ok > Some other comments > > Perhaps I misunderstood the situation but the major problem that I have is > that the locations are treated as independent so your model assumes unlinked > loci. I just don't find this a useful scenario. This depends on which parameters you pass to the HaplotypesGenerator init function. I would prefer to create a basic module that generates sequences given the frequencies and alleles in every position, and other functions to create its parameters.
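A minimal sketch of the container idea described above: an object with `__getitem__` and `__len__` that returns a background value (0.5) everywhere except at positions set at initialization. The class name `SparseSiteFreqs` and its constructor arguments are hypothetical and not part of the linked module; slicing and negative indices are deliberately left out.

```python
class SparseSiteFreqs(object):
    """Per-site frequencies stored sparsely over a shared background value."""

    def __init__(self, length, overrides=None, background=0.5):
        self._length = length
        self._background = background
        # Only the sites that differ from the background are stored.
        self._overrides = dict(overrides or {})
        for pos in self._overrides:
            if not 0 <= pos < length:
                raise IndexError("site %d is outside 0..%d" % (pos, length - 1))

    def __len__(self):
        return self._length

    def __getitem__(self, pos):
        # Raising IndexError past the end also lets plain iteration terminate.
        if not 0 <= pos < self._length:
            raise IndexError(pos)
        return self._overrides.get(pos, self._background)


freqs = SparseSiteFreqs(5, {2: 0.9})
print(len(freqs), freqs[0], freqs[2])
```

Because only the overridden sites are kept in a dictionary, memory use grows with the number of non-background sites rather than with the haplotype length, while code that indexes or iterates over the object can treat it like a list.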
I forgot to say it in the first mail, but if you want to use more sophisticated scenarios - like populations that have suffered a bottleneck or have a particular history - there are already better tools available to do that; we should think on how to integrate this module with them. Maybe I should rename this module as 'SimpleHaplotypesSampler'. > You assume that the user knows exactly which locations and frequency to > change. Often you just want a random frequency and random location. In that > case you need to randomly select locations and frequencies based on some > function. But I do not find the mode=='random' of paramsGenerator sufficient > to address this. Further, you might want a random sequence of some length > but you not want all locations to change. ok, but consider that these are haplotypes and not sequences, so you most likely need to have regions that are more conserved and others that change more. This is a good question, about which models to implement, but I would need to find a better way to represent frequencies first, and then think about which models to implement. > While you could set those > locations to zero, a more sparse form would be desirable. I think the idea of a Freqs_per_site object should fix this > Also, the randomly > generated frequencies should have a way to be limited in other ranges than > the [0 to 1) of random.random. Obviously the question is whether or not the > user has to do it themselves. > One particular use of generating SNPs pertains to known genes or sequences. > In such cases to would be great to use a known sequence as a base for the > simulation. > Further, it would be very useful be able incorporate known SNP > data especially frequencies from some source like Hapmap > (http://www.hapmap.org/). This is too complicated for the moment. We would need to develop a standard way to handle HapMap and in general SNPs first. 
> A nice but harder problem is to do this based on a > protein sequence since many diseases refer to amino acids. This is a good idea, but at the moment I was thinking more about genotypes than other characters. I would need to have a better way to handle all these suggestions.. too bad github doesn't provide an integrated ticketing system. > Perhaps my biggest 'disappointment' is the lack of ancestry control because > I also interested in families or some admixture in a population. This just > generates sequences randomly assuming you are randomly selecting individuals > from a homogenous population. I think simcoal can do this? > I do understand this usage so it is not that > important to include this here. > > Bruce -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From tiagoantao at gmail.com Fri Nov 14 11:49:05 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 11:49:05 +0000 Subject: [BioPython] [PopGen] HapMap Message-ID: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> On Thu, Nov 13, 2008 at 6:57 PM, Bruce Southey wrote: > One particular use of generating SNPs pertains to known genes or sequences. > In such cases to would be great to use a known sequence as a base for the > simulation. Further, it would be very useful be able incorporate known SNP > data especially frequencies from some source like Hapmap > (http://www.hapmap.org/). A nice but harder problem is to do this based on a > protein sequence since many diseases refer to amino acids. Talking about HapMap, and on a different front, I have some code available to deal with HapMap. The problem is that, in order for it to be useful (performance), it injects all the data into an SQL database.
That requires a schema for persistance, but I have been "ping-ponged" regarding where people in Biopython say that they prefer things to be on BioSQL and people on BioSQL say they don't care (and, this being voluntary work I simply don't have the patience to fight the bureaucracy). > Perhaps my biggest 'disappointment' is the lack of ancestry control because > I also interested in families or some admixture in a population. This just > generates sequences randomly assuming you are randomly selecting individuals > from a homogenous population. I do understand this usage so it is not that > important to include this here. You can use the Simcoal module to generate (coalescent based) sequences. I don't know if that helps you. The only hurdle is that simcoal churns data in the Arlequin Format and I still haven't got round to finalize one (although I could increase the priority if there is interest). From biopython at maubp.freeserve.co.uk Fri Nov 14 12:02:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Nov 2008 12:02:42 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> Message-ID: <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> On Fri, Nov 14, 2008 at 11:49 AM, Tiago Ant?o wrote: > Talking about hapmap, and in a different front I have some code > available to deal with HapMap. The problem is that, in order for it to > be useful (performance), it injects all the data in an SQL database. > That requires a schema for persistance, but I have been "ping-ponged" > regarding where people in Biopython say that they prefer things to be > on BioSQL and people on BioSQL say they don't care (and, this being > voluntary work I simply don't have the patience to fight the > bureaucracy). 
I didn't say it had to be done via BioSQL - I wanted you to check that any schema ideas wouldn't be overlapping existing work, and felt BioSQL was the obvious place to ask. If these schema ideas are not a good fit to BioSQL, then that's fine. We do have some non-BioSQL bits of Biopython using MySQL already (perhaps not as well looked after as they should be, Bio.GFF and Bio.DocSQL). The trouble with any Biopython code requiring a database is keeping that code maintained and tested is much harder - potentially only the original developer will actually be able to test it. For this particular bit of HapMap code, do you need persistence? If all you need is an on the fly database there may be other options (maybe sqlite - some versions of python ship with this). Peter From tiagoantao at gmail.com Fri Nov 14 12:16:20 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 12:16:20 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> Message-ID: <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> > For this particular bit of HapMap code, do you need persistence? If > all you need is an on the fly database there may be other options > (maybe sqlite - some versions of python ship with this). Considering that there are now more people here that seem to be interested in this, maybe this can be discussed. The HapMap is a fairly big database of SNPs taken for 3 (or 4, depends on how you count) human populations. The database is available in text format. If I recall well (this is old code and old work) there is a file per chromosome and per pop with a (big) list of SNPs. Actually there are several files, from allele counts to haplotype reconstruction. 
The problem is, if you want to search by a certain criterion (say SNP ID, a chunk of a chromosome, or whatever), going through the files is a painfully slow process. My (now very old) implementation (which, I think, is on GIT) downloads the text files, loads them into a local SQLite database, indexes it and exposes a fast interface. The code is actually quite agile, making life quite easy when downloading and manipulating data, at least in my opinion. If there is interest here, I can pull out my code and we can discuss the approach that I followed in the past. Also, if somebody else wants to take the lead on this, go ahead (you can still use my code). To be honest I would prefer to have a shared discussion on this, rather than just submitting the code alone, with just my own reasoning to back it. From tiagoantao at gmail.com Fri Nov 14 12:41:22 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 12:41:22 +0000 Subject: [BioPython] [PopGen] HapMap In-Reply-To: <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> References: <6d941f120811140349o1ef11304p86b000cfc174b697@mail.gmail.com> <320fb6e00811140402o301ee86aqc118493b143137c4@mail.gmail.com> <6d941f120811140416l7369bb0fl75188bde6c49141e@mail.gmail.com> Message-ID: <6d941f120811140441r37698743n2035cbb6d699686@mail.gmail.com> On Fri, Nov 14, 2008 at 12:16 PM, Tiago Antão wrote: > My (now very old) implementation (which, I think is on GIT), downloads > the text files, uploads then on a local sqllite database, indexes it > and exposes a fast interface. The code is actually quite agile, making > life quite easy on downloading and manipulating data, at least in my > opinion. Just an update on this, the code on GIT is incomplete.
I will dig my archives (other computer) and find the complete stuff From p.j.a.cock at googlemail.com Fri Nov 14 13:07:01 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Nov 2008 13:07:01 +0000 Subject: [BioPython] Problems with Emboss.Primer3 In-Reply-To: <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com> References: <000801c94598$fd183f20$1022a8c0@ipkgatersleben.de> <320fb6e00811130643p357092f6y8e6d983a11909003@mail.gmail.com> Message-ID: <320fb6e00811140507j78bd40dbybba6ed6f1e74e5ec@mail.gmail.com> Just for anyone else with a similar issue, it turned out there was an EMBOSS setup problem on Stefanie's machine - running the command line by hand at the command line prompt didn't work either. Problem solved :) Peter From bsouthey at gmail.com Fri Nov 14 15:17:08 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 14 Nov 2008 09:17:08 -0600 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130234y6a365610oaa695dc09ad7495d@mail.gmail.com> <5aa3b3570811130355h58ad9el81a875295923228d@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> Message-ID: <491D9674.7010807@gmail.com> [snip] > >> Some other comments >> >> Perhaps I misunderstood the situation but the major problem that I have is >> that the locations are treated as independent so your model assumes unlinked >> loci. I just don't find this a useful scenario. 
>> > > This depends on which parameters you pass to the HaplotypesGenerator > init function. > I would prefer to create a basic module that generates sequences given > the frequencies and alleles in every position, and other functions to > create its parameters. > Well this depends on your meaning for haplotype (e.g. http://en.wikipedia.org/wiki/Haplotype). I agree but you need to capture how close the positions ie linkage/ linkage disequilibrium. Simulating independent positions in a required format is useful but this is just a special case of simulating dependent positions. > I forgot to say it in the first mail, but if you want to use more > sophisticated scenarios - like populations that have suffered a > bottleneck or have a particular history - there are already better > tools available to do that; we should think on how to integrate this > module with them. > Maybe I should rename this module as 'SimpleHaplotypesSampler'. > Perhaps IndependentLociSampler. :) > >> You assume that the user knows exactly which locations and frequency to >> change. Often you just want a random frequency and random location. In that >> case you need to randomly select locations and frequencies based on some >> function. But I do not find the mode=='random' of paramsGenerator sufficient >> to address this. Further, you might want a random sequence of some length >> but you not want all locations to change. >> > > ok, but consider that these are haplotypes and not sequences, so you > most likely need to have regions that are more conserved and others > that change more. > This is a good question, about which models to implement, but I would > need to find a better way to represent frequencies first, and then > think about which models to implement. > Really the implementation requires some representation of the genetic map. After all if the positions are very close, the two loci should not change very frequently. 
I do not know a nice way to represent this even with genetic marker simulation (something I do know about). I have not used simcoal as my work has moved from genetic markers. Perhaps you need to see how simcoal and similar packages do it. I do understand the usefulness of the simulating independent loci but I also find it a very simple special case of what should be done. I think you need to develop some outline of what you want to achieve that changes as it progresses. Also, not everything needs to get done, other people can contribute if they want to but the general framework needs to be in place. Bruce From tiagoantao at gmail.com Fri Nov 14 20:27:38 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 14 Nov 2008 20:27:38 +0000 Subject: [BioPython] [PopGen] a random Haplotype Sets generator In-Reply-To: <491D9674.7010807@gmail.com> References: <5aa3b3570811130137o2965711p73358ea655b1daef@mail.gmail.com> <6d941f120811130604h29c09fffq7c2d79e0349ae70f@mail.gmail.com> <320fb6e00811130619r72998d29u1c7b46c1e0fa181b@mail.gmail.com> <5aa3b3570811130627q4708265dl983d9abcf9930cb9@mail.gmail.com> <6d941f120811130632r5e979322l2133ca25f86f422@mail.gmail.com> <491C47E2.60904@gmail.com> <5aa3b3570811130807n5be2c8b7w7d05e10daf354c91@mail.gmail.com> <491C789A.3070004@gmail.com> <5aa3b3570811140321i74921288kb22f83be2d41a175@mail.gmail.com> <491D9674.7010807@gmail.com> Message-ID: <6d941f120811141227s42be58d9g6557b2b731e378a@mail.gmail.com> > Really the implementation requires some representation of the genetic map. > After all if the positions are very close, the two loci should not change > very frequently. I do not know a nice way to represent this even with > genetic marker simulation (something I do know about). I have not used > simcoal as my work has moved from genetic markers. Perhaps you need to see > how simcoal and similar packages do it. 
Just for the record, there is an excellent (in all respects: features, code quality, documentation, author support) forward time population genetics simulator in python: simuPOP. It is probably the best forward time simulator that I know of (probably better than the R-based "competitor" rmetasim, which doesn't have any provision for selection). It might be interesting to study how simuPOP represents a genome. Tiago From fglaser at technion.ac.il Sun Nov 16 07:27:06 2008 From: fglaser at technion.ac.il (Fabian Glaser) Date: Sun, 16 Nov 2008 09:27:06 +0200 Subject: [BioPython] disordered atoms in pdb Message-ID: <491FCB4A.9050207@technion.ac.il> Hi, I am quite new to biopython, so forgive me if I am asking trivial questions for a while... I am successfully reading and updating pdb files with biopython, with only one exception: disordered atoms. I understand they are part of a different object than regular atoms, but when I try, for example, to change their temperature factor values with the following code: if residue.is_disordered(): for atom in residue: print residue, atom, atom.get_bfactor() atom.set_bfactor(0) print residue, atom, atom.get_bfactor() The code runs, but if there is more than one option, for example A and B, only the first one is updated: 23.48 0 25.38 0 So how can I cleanly access every disordered atom? Thanks a lot in advance, Fabian -- Fabian Glaser, PhD Bioinformatics Knowledge Unit, The Lorry I.
Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology Haifa 32000, ISRAEL Web: http://bku.technion.ac.il Email: fglaser at tx.technion.ac.il Tel: +972-(0)4-8293701 Cel: +972-(0)54-4772396 From biopython at maubp.freeserve.co.uk Mon Nov 17 10:07:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 10:07:43 +0000 Subject: [BioPython] disordered atoms in pdb In-Reply-To: <491FCB4A.9050207@technion.ac.il> References: <491FCB4A.9050207@technion.ac.il> Message-ID: <320fb6e00811170207m2d22f96cg2c3d60011ae1d2d2@mail.gmail.com> Fabian Glaser wrote: > Hi, > > I am quite new to biopython, so forgive me if I am asking trivial questions > for a while... Hi Fabian, > I am successfully reading and updating pdb files with biopython, with only > one exception: disordered atoms. I understand they are part of a different > object than regular atoms, but when I am trying for example to change their > temperature factor values with the following code: > > if residue.is_disordered(): for atom in > residue: > print residue, atom, atom.get_bfactor() > atom.set_bfactor(0) > print residue, atom, atom.get_bfactor() Your indentation went funny in the email. Could you repeat the example, and add a little bit more code to load the PDB file and select this residue? (any PDB file with a disordered residue should be fine). > The code if there is more than one option, for example A and B, only the > first one is updated: > > 23.48 > 0 > 25.38 > 0 In this example atoms CB and CG seem to have both had their bfactor updated. I don't understand what is wrong. Peter From dalloliogm at gmail.com Mon Nov 17 11:32:28 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 17 Nov 2008 12:32:28 +0100 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) Message-ID: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> Hi, a general question. 
Are you used to organize your python/biopython scripts in pipelines or workflows? For example, many people use automatic build tools like 'make' to organize their scientific scripts. Let's say you want to study the structure of a protein from pdb. I would create a script to download it from pdb.org, one to parse its format, and others to do the analysis; then, I would write a Makefile to put everything together. I noticed that there are already some tools to do automated builds written in python. I have asked in some lists and, apart from scons, they suggested me these: - http://www.blueskyonmars.com/projects/paver (paver) - http://code.google.com/p/waf/ (waf) So, do you know these tools? Do you have any special recommendation to integrate them with biopython? -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Nov 17 11:53:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 11:53:04 +0000 Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver) In-Reply-To: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> Message-ID: <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > Hi, > a general question. > > Are you used to organize your python/biopython scripts in pipelines or > workflows? > For example, many people use automatic build tools like 'make' to > organize their scientific scripts. > Let's say you want to study the structure of a protein from pdb. I > would create a script to download it from pdb.org, one to parse its > format, and others to do the analysis; then, I would write a Makefile > to put everything together. Personally in this situation I tend to just write a wrapper python script (or sometimes a shell script or batch file) to call the sub scripts. i.e. 
the KISS principle. I really don't think Makefiles are a sensible solution to this problem - although it is possible. A Makefile lets you deal with simple dependencies (e.g. building an index file, or running a BLAST search and saving it to disk) but I prefer to just deal with this within my python scripts (e.g. if the index is missing, build it; if the BLAST output is missing, call BLAST). Why do you think you need a Makefile? Are you intending to provide the workflow to other people? Using a complicated Makefile means the project is harder for a new developer to understand (they need to learn a whole new programming language/tool). This may also hinder cross platform deployment (the average Windows machine won't have make installed). Peter From bartomas at gmail.com Mon Nov 17 12:22:33 2008 From: bartomas at gmail.com (bar tomas) Date: Mon, 17 Nov 2008 12:22:33 +0000 Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall Message-ID: Hi, I'm using BioPython to access NCBI Entrez databases. I'm doing this from behind a proxy-based firewall. Do you know how I can pass on my firewall parameters so that BioPython handles them? Thanks a lot From biopython at maubp.freeserve.co.uk Mon Nov 17 12:32:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Nov 2008 12:32:21 +0000 Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall In-Reply-To: References: Message-ID: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com> On Mon, Nov 17, 2008 at 12:22 PM, bar tomas wrote: > Hi, > > I'm using BioPython to access NCBI Entrez databases. > I'm doing this from behind a proxy-based firewall. > Do you know how I can pass on my firewall parameters so that BioPython > handles them? > > Thanks a lot Hi, Bio.Entrez is just using the python urllib to connect to the NCBI Entrez servers, and that should support a password-less proxy. 
Right now the Biopython code which uses the urllib.urlopen function doesn't directly let you specify the proxy. However, consulting the python documentation you should still be able to do this by setting an environment variable:
http://www.python.org/doc/2.5.2/lib/module-urllib.html

Does that work?

Peter

From bartomas at gmail.com Mon Nov 17 12:53:26 2008
From: bartomas at gmail.com (bar tomas)
Date: Mon, 17 Nov 2008 12:53:26 +0000
Subject: [BioPython] Using BioPython Entrez from behind proxy-based firewall
In-Reply-To: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com>
References: <320fb6e00811170432h32e30f2cx7aaf1bfb8a5a5e45@mail.gmail.com>
Message-ID:

Your solution does work!!
Thanks a lot.

On Mon, Nov 17, 2008 at 12:32 PM, Peter wrote:
> On Mon, Nov 17, 2008 at 12:22 PM, bar tomas wrote:
> > Hi,
> >
> > I'm using BioPython to access NCBI Entrez databases.
> > I'm doing this from behind a proxy-based firewall.
> > Do you know how I can pass on my firewall parameters so that BioPython
> > handles them?
> >
> > Thanks a lot
>
> Hi,
>
> Bio.Entrez is just using the python urllib to connect to the NCBI
> Entrez servers, and that should support a password-less proxy. Right
> now the Biopython code which uses the urllib.urlopen function doesn't
> directly let you specify the proxy. However, consulting the python
> documentation you should still be able to do this by setting an
> environment variable:
> http://www.python.org/doc/2.5.2/lib/module-urllib.html
>
> Does that work?
>
> Peter
>

From dalloliogm at gmail.com Mon Nov 17 16:26:21 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 17 Nov 2008 17:26:21 +0100
Subject: [BioPython] biopython integration with make-like tools (e.g.
waf, paver)
In-Reply-To: <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com>
References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com>
Message-ID: <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com>

On Mon, Nov 17, 2008 at 12:53 PM, Peter wrote:
> Giovanni Marco Dall'Olio wrote:
>> Hi,
>> a general question.
>>
>> Are you used to organize your python/biopython scripts in pipelines or
>> workflows?
>> For example, many people use automatic build tools like 'make' to
>> organize their scientific scripts.
>> Let's say you want to study the structure of a protein from pdb. I
>> would create a script to download it from pdb.org, one to parse its
>> format, and others to do the analysis; then, I would write a Makefile
>> to put everything together.
>
> Personally in this situation I tend to just write a wrapper python
> script (or sometimes a shell script or batch file) to call the sub
> scripts. i.e. the KISS principle.

Wrapper scripts often are not the optimal solution.
- Over time, they tend to become very complex and full of commented statements.
When you complete a part of your experiment (e.g. you download your input sequences from ncbi) you will likely comment out the statement that you used to download it.
If you then discover that the sequences you have downloaded were wrong, you have to uncomment the same statement, but here you can make some errors.
It is very difficult to remember which statements you commented out because they were wrong and when, and the wrapper script becomes messy very quickly, while it will always take you much time to maintain.
I have used wrapper scripts for a year during my master project and I think that's not really KISS. It seems very difficult to reproduce an analysis done without a pipeline.

- make can have a nasty syntax, but it is a standard.
If you type 'make help' you get help, and if you type 'make all' usually you will carry out the whole analysis, without having to worry about which scripts are run in particular.

- there are other build systems than make, some of them are written in python and/or for python.
That means you won't necessarily have to learn a new programming syntax. Have a look at rake, all the examples I've seen are very clean. I'll let you know when I have learnt waf or paver.

- makefile-like tools usually already support multi-threading. If I want to run a program on a cluster, the easiest thing for me is to write a makefile, and it works already.

- makefile allows you to re-execute parts of your analysis easily when your input files or your scripts change.
This is very useful, I don't want to write a wrapper script that checks if a file has been modified since the last time I have used it to calculate some results - because make tools already do that.

>
> I really don't think Makefiles are a sensible solution to this problem
> - although it is possible. A Makefile lets you deal with simple
> dependencies (e.g. building an index file, or running a BLAST search
> and saving it to disk) but I prefer to just deal with this within my
> python scripts (e.g. if the index is missing, build it; if the BLAST
> output is missing, call BLAST).

Wouldn't you prefer something like:
- if the blast output doesn't exist, OR it exists but it is older than the script used to launch it, or older than the input sequence, then run it again?

That's the kind of thing that makefile tools can do for you already, without having to write complicated python conditions.

> Why do you think you need a Makefile? Are you intending to provide the
> workflow to other people? Using a complicated Makefile means the
> project is harder for a new developer to understand (they need to
> learn a whole new programming language/tool).
The best thing would be to learn how to write workflows, like the ones from taverna and similar.
But it takes time, and I think it is better if you know the two things.
As I was saying before, make has the worst syntax, but maybe there are other building tools which are better.

> This may also hinder
> cross platform deployment (the average Windows machine won't have make
> installed).
>
> Peter
>

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Mon Nov 17 17:27:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Nov 2008 17:27:12 +0000
Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver)
In-Reply-To: <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com>
References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com>
Message-ID: <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com>

>> Personally in this situation I tend to just write a wrapper python
>> script (or sometimes a shell script or batch file) to call the sub
>> scripts. i.e. the KISS principle.
>
> wrapper scripts often are not the very optimal solution.
> - Over time, they tend to be become very complex and full of commented
> statements.

That certainly can happen - but it can happen with any tool, even Makefiles.

> When you complete a part of your experiment (e.g. you download your
> input sequences from ncbi) you will likely to comment out the
> statement that you used to download it.

Personally to avoid this kind of thing, I make the download (or running BLAST, or whatever) conditional on a check to see if the output file exists (and don't just comment out the call). You could also do date checking in code too.
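
The existence-and-date check described above could be sketched roughly as follows. This is only an illustration, assuming the inputs and outputs are plain files on disk; the names is_stale and run_step are hypothetical and are not part of Biopython.

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Same rule make applies: rebuild when any prerequisite is newer.
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(target, sources, action):
    """Call action() only if target needs rebuilding.

    action is any callable that (re)creates target, e.g. a function
    that downloads sequences or runs BLAST. Returns True if it ran.
    """
    if is_stale(target, sources):
        action()
        return True
    return False
```

A wrapper script would then call something like run_step("blast.xml", ["query.fasta"], run_blast) for each stage (file and function names here are placeholders), instead of commenting calls in and out.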
> If you then discover that the sequences you have downloaded were
> wrong, you have to decomment-out the same statement, but here you can
> make some errors

In my case, I can delete the old input sequences (or the BLAST output) and re-run the script. I would agree that for more complicated multi-step analyses this requires some thought - but you can at least handle error conditions any way you like (i.e. helpful messages instead of whatever the build tool does).

> It is very difficult to remember which statements you commented out
> because they were wrong and when, and the wrapper script become messy
> very quickly, while it will take always much time to you to maintain.
> I have used wrapper scripts for a year during my master project and I
> think that's not really KISS. It seems very difficult to reproduce an
> analysis done without a pipeline.

I guess it depends on what you mean by a pipeline - you can have a robust pipeline which is essentially one master python script. I agree there is a danger that the script will evolve over time into a horrible mess.

> - make can have a nasty syntax, but it is a standard. If you type
> 'make help' you get help, and if you type 'make all' usually you will
> carry out the whole analysis, without having to worry on which scripts
> are be run in particular.

I would agree that make has a nasty syntax. Note that make isn't a completely cross platform standard (although you can get it on Windows via cygwin for example).

> - there are other build system than make, some of them are written in
> python and/or for python.
> That means you won't have to necessarly learn a new programming
> syntax. Have a look at rake, all the examples I've seen are very
> clean. I'll let you know when I will have learnt waf or paver.

These (and Make) all seem to be designed to solve a different problem, handling the compilation and/or installation of software with multiple dependencies.
That doesn't mean you can't use them for a pipeline, but it may not be ideal.

> - makefiles like tools usually already support multi-threading. If I
> want to run a program on a cluster, the easiest thing for me is to
> write a makefile, and it works already.

For trivial multi-threading, yes, make can help.

> - makefile allows you to re-execute parts of your analysis easily when
> your input files or your scripts changes.
> This is very useful, I don't want to write a wrapper script that
> checks if a file has been modified since the last time I have used it
> to calculate some results - because make tools already do that.

If you already know how to work with make files, then this does have some advantages. i.e. Instead of writing a python wrapper script, you write a simple Makefile. I think we agree that Make is pretty complex, a language in its own right. This means if you want someone else to use your pipeline, then they have to learn how to use make too (if anything goes wrong or they want to change it).

> Wouldn't you prefer something like:
> - if the blast output doesn't exist, OR it exists but it is older than
> the script used to launch it, or older than the input sequence, then
> run it again?

That sounds potentially useful for a complicated analysis pipeline. But suppose you also wanted to check the current version of BLAST installed and the version of BLAST used in the existing output file? This would probably be possible within a Makefile using some embedded shell scripts calling grep, but it wouldn't be very nice at all. Although it would still be a non-trivial bit of code, I would prefer to do this in python (maybe put the code into a library function for reuse).

My point is, using some other tool like Make could make certain operations easier, but with a python script you can do this sort of thing and more. You have full control, without adding another dependency to the project.
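
One way the version check mentioned above might look in python is sketched below. It is a sketch under stated assumptions: the version string is supplied by the caller (how to obtain it from a given BLAST installation is deliberately left open), and the ".version" sidecar file is a hypothetical convention, not something BLAST or Biopython writes.

```python
import os


def stamp_path(output):
    """Path of the (hypothetical) sidecar file holding the tool version."""
    return output + ".version"


def output_is_current(output, tool_version):
    """Return True if output exists and was recorded as made by tool_version."""
    stamp = stamp_path(output)
    if not (os.path.exists(output) and os.path.exists(stamp)):
        return False
    with open(stamp) as handle:
        return handle.read().strip() == tool_version


def record_version(output, tool_version):
    """Remember which tool version produced output."""
    with open(stamp_path(output), "w") as handle:
        handle.write(tool_version + "\n")
```

The pipeline script would call record_version() right after a successful run, and re-run the step whenever output_is_current() returns False.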
> that's the kind of things that makefile tools can do for you already,
> without having to write complicated python conditions.

True - but as I have tried to illustrate above, even Make has its limitations.

> The best thing would be to learn how to write workflows, like the ones
> from taverna and similar.
> But it takes time, and I think it is better if you know the two things.
> As I was saying before, make has the worst syntax, but maybe there are
> other building tools which are better.

I certainly wouldn't be keen on make itself, but there might be a python library out there that would be a good compromise (making the common file existence/date based tasks easy, but allowing arbitrary extension - e.g. my BLAST version check requirement).

Peter

From dalloliogm at gmail.com Mon Nov 17 18:19:04 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 17 Nov 2008 19:19:04 +0100
Subject: [BioPython] biopython integration with make-like tools (e.g. waf, paver)
In-Reply-To: <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com>
References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com>
Message-ID: <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com>

>
> That sounds potentially useful for a complicated analysis pipeline.
> But suppose you also wanted to check the current version of BLAST
> installed and the version of BLAST used in the existing output file?
> This would probably be possible within a Makefile using some embedded
> shell scripts calling grep, but it wouldn't be very nice at all.
> Although it would still be a non-trivial bit of code, I would prefer
> to do this in python (maybe put the code into a library function for
> reuse).

well, in principle you would check for blast's executable last modification date.
If the blast executable has a modification date which is younger than the results file, you will have to calculate them again.

Other build tools can also check for md5 modification to prerequisite files, or can be integrated with subversion/other rcs systems.
You can do this in python, but it takes a lot of time, and would mean re-writing existing code. I am sure there should be something specific for bioinformaticians already :).

Well, I'll write some workflows with the tools I linked before (and also with scons) and let you know.

> My point is, using some other tool like Make could make certain
> operations easier, but with a python script you can do this sort of
> thing and more. You have full control, without adding another
> dependency to the project.
>
>> that's the kind of things that makefile tools can do for you already,
>> without having to write complicated python conditions.
>
> True - but as I have tried to illustrate above, even Make has its limitations.
>
>> The best thing would be to learn how to write workflows, like the ones
>> from taverna and similar.
>> But it takes time, and I think it is better if you know the two things.
>> As I was saying before, make has the worst syntax, but maybe there are
>> other building tools which are better.
>
> I certainly wouldn't be keen on make itself, but there might be a
> python library out there that would be a good compromise (making the
> common file existence/date based tasks easy, but allowing arbitrary
> extension - e.g. my BLAST version check requirement).
>
> Peter
>

--
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Mon Nov 17 18:33:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Nov 2008 18:33:26 +0000
Subject: [BioPython] biopython integration with make-like tools (e.g.
waf, paver)
In-Reply-To: <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com>
References: <5aa3b3570811170332s35f6ba62v748073efdce33a40@mail.gmail.com> <320fb6e00811170353l19a5674bv8640f06aa71c1320@mail.gmail.com> <5aa3b3570811170826m3039ebd8t17615302dca69c19@mail.gmail.com> <320fb6e00811170927v585ff192p85887d007efcd978@mail.gmail.com> <5aa3b3570811171019g2b4815bby5f1cd8b7af482928@mail.gmail.com>
Message-ID: <320fb6e00811171033j6eb7e806o8ea70aa863a59eb2@mail.gmail.com>

>> That sounds potentially useful for a complicated analysis pipeline.
>> But suppose you also wanted to check the current version of BLAST
>> installed and the version of BLAST used in the existing output file?
>> This would probably be possible within a Makefile using some embedded
>> shell scripts calling grep, but it wouldn't be very nice at all.
>> Although it would still be a non-trivial bit of code, I would prefer
>> to do this in python (maybe put the code into a library function for
>> reuse).
>
> well, in principle you would check for blast's executable last
> modification date.
> If the blast executable has a modification date which is younger than
> the results file, you will have to calculate them again.

That might work, but is a slightly different check. Just because the executable is "newer" doesn't mean it's a different version.

> Other build tools can also check for md5 modification to prerequisite
> files, or can be integrated with subversion/other rcs systems.
> You can do this in python, but it takes a lot of time, and would mean
> re-writing existing code. I am sure there is should be something
> specific for bioinformaticians already :).

There might be, but I don't see this kind of thing as specific to bioinformatics. Data analysis pipelines could be applied to any scientific data analysis, e.g. meteorological data analysis.

> Well, I'll write some workflows with the tools I linked before (and
> also with scons) and let you know.
I guess the best way to evaluate the tools is to try using them :)

Good luck,

Peter

From bsouthey at gmail.com Wed Nov 19 17:46:19 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 19 Nov 2008 11:46:19 -0600
Subject: [BioPython] Anyone using Affy module?
Message-ID: <492450EB.7010801@gmail.com>

Hi,
If there is anyone who uses the Affy module, could you please let me know?

If so, I would also like to know which version of Affy chips are being used. I know that Version 4 is binary and this is not supported. But with version 3 CEL files, the code provides the transpose of the rows and columns. Also the code does not read in the outliers or masks sections.

Thanks,
Bruce

From bsouthey at gmail.com Wed Nov 19 20:40:52 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 19 Nov 2008 14:40:52 -0600
Subject: [BioPython] Usage of Bio.NMR
Message-ID: <492479D4.3050205@gmail.com>

Hi,
Does anyone use the Bio.NMR module?
If so, could you let me know?
Also, I would appreciate a sample .xpk peaklist file that could be used for testing purposes.

From the code
# xpktools.py: A python module containing function definitions and classes
# useful for manipulating data from nmrview .xpk peaklist files

I presume this uses the NMRview software:
http://www.onemoonscientific.com/nmrview/summary.html

Thanks
Bruce

From biopython at maubp.freeserve.co.uk Thu Nov 20 10:38:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Nov 2008 10:38:54 +0000
Subject: [BioPython] Usage of Bio.NMR
In-Reply-To: <492479D4.3050205@gmail.com>
References: <492479D4.3050205@gmail.com>
Message-ID: <320fb6e00811200238l607a5fbaq9107abf8bf8a305a@mail.gmail.com>

On Wed, Nov 19, 2008 at 8:40 PM, Bruce Southey wrote:
> Hi,
> Does anyone use the Bio.NMR module?
> If so, could you let me know?
> Also, I would appreciate a sample .xpk peaklist file that could be used for
> testing purposes.

Have you tried emailing the Bio.NMR author, Robert G. Bussel?
There is a www.med.cornell.edu email address in the source code which might still be live.

Peter

From bsouthey at gmail.com Thu Nov 20 20:46:34 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 20 Nov 2008 14:46:34 -0600
Subject: [BioPython] Does anyone use EZRetrieve?
Message-ID: <4925CCAA.2040809@gmail.com>

Hi,
Does anyone use EZRetrieve (http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ?
This allows a user to retrieve a human, mouse or rat genome nucleic sequence based on a valid identifier.

I think that most of the functionality of Bio.EZRetrieve is already present in Biopython and the genome sources appear to be 5 years old. For example, it uses LocusLink that was discontinued March 2005.

If so could you please let me know?

Thanks
Bruce

From biopython at maubp.freeserve.co.uk Thu Nov 20 20:53:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Nov 2008 20:53:37 +0000
Subject: [BioPython] Does anyone use EZRetrieve?
In-Reply-To: <4925CCAA.2040809@gmail.com>
References: <4925CCAA.2040809@gmail.com>
Message-ID: <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com>

On Thu, Nov 20, 2008 at 8:46 PM, Bruce Southey wrote:
> Hi,
> Does anyone use EZRetrieve
> (http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ?
> This allows a user to retrieve a human, mouse or rat genome nucleic sequence
> based on an valid identifier.
>
> I think that most of the functionality of Bio.EZRetrieve is already present
> in Biopython and the genome sources appear to be 5 years old. For example,
> it uses LocusLink that was discontinued March 2005.
>
> If so could you please let me know?

Actually - could you let the whole mailing list know? ;)

Given the nature of the database and the limited functionality this python code offers, if no-one is using Bio.EZRetrieve then it could be considered for deprecation.
Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Nov 21 16:59:08 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 21 Nov 2008 16:59:08 +0000
Subject: [BioPython] Biopython 1.49 released
Message-ID: <320fb6e00811210859n2d128fd6nc21ad1012e1d93bf@mail.gmail.com>

Dear Biopythoneers,

We are pleased to announce the release of Biopython 1.49.

There have been some significant changes since Biopython 1.48 was released a few months ago, which is why we initially released a beta for wider testing. Thank you to all those who tried this and reported the minor problems uncovered.

As previously announced, the big news is that Biopython now uses NumPy rather than its precursor Numeric (the original Numerical Python library).

As in the previous releases, Biopython 1.49 supports Python 2.3, 2.4 and 2.5 but should now also work fine on Python 2.6. Please note that we intend to drop support for Python 2.3 in a couple of releases time.

We also have some new functionality, starting with the basic sequence object (the Seq class) which now has more methods. This encourages a more object orientated coding style, and makes basic biological operations like transcription and translation more accessible and discoverable.

Our BioSQL interface can now optionally fetch the NCBI taxonomy on demand when loading sequences (via Bio.Entrez) allowing you to populate the taxon/taxon_name tables gradually. Also, BioSQL should now work with the psycopg2 driver for PostgreSQL (as well as the older psycopg driver), and the handling of feature locations has also been improved.

We've also updated the Biopython Tutorial and Cookbook (also available in PDF).
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Finally, our old parsing infrastructure (Martel and Bio.Mindy) is now considered to be deprecated, meaning mxTextTools is no longer required to use Biopython. This should not affect any of the typically used parsers (e.g.
Bio.SeqIO and Bio.AlignIO).

Given there have been more changes than in recent Biopython releases, please do check your old scripts still work fine, and let us know on the mailing list or file a bug if there is anything wrong.

Source distributions and Windows installers are available from the Biopython website:
http://biopython.org/wiki/Download

Thanks!

-Peter on behalf of the Biopython developers

P.S. You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News

From lueck at ipk-gatersleben.de Sun Nov 23 13:21:52 2008
From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de)
Date: Sun, 23 Nov 2008 14:21:52 +0100
Subject: [BioPython] ClustalW Multiple Alignment
Message-ID: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>

Hi!

I want to align several sequences, allowing for the possibility that some of them are reverse complemented.

That means I align all sequences against a reference sequence and then reverse complement them and align them again. After that I want to compare the scores and choose the better ones.
Now my question:

How can I get the score? Unfortunately it's not in the dnd file. In an old message of this mailing list, it was written that it's in the log file. Has this been removed?

Thanks in advance!
Stefanie

From biopython at maubp.freeserve.co.uk Sun Nov 23 13:32:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 23 Nov 2008 13:32:23 +0000
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
Message-ID: <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com>

On Sun, Nov 23, 2008 at 1:21 PM, wrote:
> Hi!
>
> I want to align several sequences under the allowance to check whether there
> are reversed complement.
>
> That means I align all sequences against a reference sequence and then
> reversed complement them and align them again.
After that I want to compare
> the score and choose the better once.

If I have understood you correctly, you have one reference nucleotide sequence, and many nucleotide sequences of unknown orientation. For each sequence you want to do a pairwise alignment against the reference to decide which orientation matches best (forwards, or reverse complement).

So you want to do lots of pairwise alignments. Perhaps ClustalW is not the best choice - maybe use EMBOSS needle? You could also try Biopython's Bio.pairwise2 module.

> Now my question:
>
> How can I get the score?

What score exactly are you looking for?

> Unfortunately it's not in the dnd file.

The dnd file from clustalw is just a tree, there is no score.

> In an old message of this mailing list, it's was written that it's
> in the log file. Does this has been removed?

What log file? I didn't think clustalw wrote a log file. It could be in the standard output printed to screen...

What old message on the mailing list are you referring to? Could you link to it in the archive maybe?
http://lists.open-bio.org/pipermail/biopython/

Peter

From pmmagic at gmail.com Sun Nov 23 15:42:34 2008
From: pmmagic at gmail.com (paul m)
Date: Sun, 23 Nov 2008 10:42:34 -0500
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de>
Message-ID: <991e7bc10811230742r38ae28d2j78a5bb87171e0a4d@mail.gmail.com>

On Sun, Nov 23, 2008 at 8:21 AM, wrote:
> Hi!
>
> I want to align several sequences under the allowance to check whether there
> are reversed complement.
>
> That means I align all sequences against a reference sequence and then
> reversed complement them and align them again. After that I want to compare
> the score and choose the better once.
> Now my question:
>
> How can I get the score? Unfortunately it's not in the dnd file.
In an old
> message of this mailing list, it's was written that it's in the log file.
> Does this has been removed?

I think Thomas Mailund's ClustalW package will allow you to get the scores:
See: http://www.daimi.au.dk/~mailund/clustalw_wrapper.html

--Paul

From lueck at ipk-gatersleben.de Mon Nov 24 11:23:45 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Mon, 24 Nov 2008 12:23:45 +0100
Subject: [BioPython] ClustalW Multiple Alignment
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de> <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com>
Message-ID: <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>

Yes, first I want to make a pairwise alignment to find the right orientation of all sequences and after that I'll align everything in a multiple alignment.

I meant this message:
http://lists.open-bio.org/pipermail/biopython/2005-May/002656.html

So you're right, they talk about the printed output.
I'll try to get the score.
Thanks for the link!
Stefanie

----- Original Message -----
From: "Peter"
To:
Cc:
Sent: Sunday, November 23, 2008 2:32 PM
Subject: Re: [BioPython] ClustalW Multiple Alignment

> On Sun, Nov 23, 2008 at 1:21 PM, wrote:
>> Hi!
>>
>> I want to align several sequences under the allowance to check whether
>> there
>> are reversed complement.
>>
>> That means I align all sequences against a reference sequence and then
>> reversed complement them and align them again. After that I want to
>> compare
>> the score and choose the better once.
>
> If I have understood you correctly, you have one reference nucleotide
> sequence, and many nucleotide sequences of unknown orientation. For
> each sequence you want to do a pairwise alignment against the
> reference to decide which orientation matches best (forwards, or
> reverse complement).
>
> So you want to lots of pairwise alignments. Perhaps ClustalW is not
> the best choice - maybe use EMBOSS needle? You could also try
> Biopython's Bio.pairwise2 module.
>
>> Now my question:
>>
>> How can I get the score?
>
> What score exactly are you looking for?
>
>> Unfortunately it's not in the dnd file.
>
> The dnd file from clustalw is just a tree, there is no score.
>
>> In an old message of this mailing list, it's was written that it's
>> in the log file. Does this has been removed?
>
> What log file? I didn't think clustalw wrote a log file. It could be
> in the standard output printed to screen...
>
> What old message on the mailing list are you refering to? Could you
> link to it in the archive maybe?
> http://lists.open-bio.org/pipermail/biopython/
>
> Peter
>

From david.moreira at u-psud.fr Mon Nov 24 14:24:59 2008
From: david.moreira at u-psud.fr (David Moreira)
Date: Mon, 24 Nov 2008 14:24:59 +0000
Subject: [BioPython] ClustalW Multiple Alignment
In-Reply-To: <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>
References: <20081123142152.7uhz72t937u0oow4@webmail.ipk-gatersleben.de> <320fb6e00811230532u7e8d1586ia87e8adecc6618b4@mail.gmail.com> <001e01c94e27$1d236600$1022a8c0@ipkgatersleben.de>
Message-ID: <492AB93B.9050100@u-psud.fr>

Dear Stefanie,

For a similar case, what I do is to use first BLAST to know the orientation of the sequence. I have a small sequence database with sequences in the correct orientation and I BLAST against it. BLAST tells you whether the retrieved result is "plus/plus" or "minus/plus", i.e., whether your query was in the same or different orientation. You just have to parse the BLAST output to retrieve that information. BLAST is extremely rapid, so you can retrieve the orientation of hundreds of sequences in a few minutes. Then, you can reverse-complement the sequences in "minus" orientation and construct your multiple sequence alignment. It is very easy to have a single script doing all the work.

David

Stefanie Lück wrote:
> Yes, first I want to make a pairwise alignment to find the right
> orientation of all sequences and after it's I'll align everything in a
> multiple alignment.
>
> I meant this message:
> http://lists.open-bio.org/pipermail/biopython/2005-May/002656.html
>
> So you're right they talk about the printed output.
> I try to get the score.
> Thanks for the link!
> Stefanie
>
> ----- Original Message ----- From: "Peter"
>
> To:
> Cc:
> Sent: Sunday, November 23, 2008 2:32 PM
> Subject: Re: [BioPython] ClustalW Multiple Alignment
>
>
>> On Sun, Nov 23, 2008 at 1:21 PM, wrote:
>>> Hi!
>>>
>>> I want to align several sequences under the allowance to check
>>> whether there
>>> are reversed complement.
>>>
>>> That means I align all sequences against a reference sequence and then
>>> reversed complement them and align them again. After that I want to
>>> compare
>>> the score and choose the better once.
>>
>> If I have understood you correctly, you have one reference nucleotide
>> sequence, and many nucleotide sequences of unknown orientation. For
>> each sequence you want to do a pairwise alignment against the
>> reference to decide which orientation matches best (forwards, or
>> reverse complement).
>>
>> So you want to lots of pairwise alignments. Perhaps ClustalW is not
>> the best choice - maybe use EMBOSS needle? You could also try
>> Biopython's Bio.pairwise2 module.
>>
>>> Now my question:
>>>
>>> How can I get the score?
>>
>> What score exactly are you looking for?
>>
>>> Unfortunately it's not in the dnd file.
>>
>> The dnd file from clustalw is just a tree, there is no score.
>>
>>> In an old message of this mailing list, it's was written that it's
>>> in the log file. Does this has been removed?
>>
>> What log file? I didn't think clustalw wrote a log file. It could be
>> in the standard output printed to screen...
>>
>> What old message on the mailing list are you refering to? Could you
>> link to it in the archive maybe?
>> http://lists.open-bio.org/pipermail/biopython/
>>
>> Peter
>>
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From bartek at rezolwenta.eu.org  Mon Nov 24 14:51:12 2008
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 24 Nov 2008 15:51:12 +0100
Subject: [BioPython] Refactoring motif analysis code
Message-ID: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>

Hello All,

Currently, there are two packages dealing with motif analysis in
Biopython: Bio.AlignAce (written by me) and Bio.MEME (written by Jason
Hackney).

Both of them are quite old and they were developed independently, so
their functionality largely overlaps. In particular, the files
AlignAce/Motif.py and MEME/Motif.py contain almost identical
functionality, useful for anyone interested in motif analysis or in
writing a parser for yet another motif searching tool.

I'd like to change this and create a new library called Bio.Motif,
which would contain:
-Motif class for all general functionality concerning motif objects:
 i/o, comparisons, sequence scanning
-AlignAce Parser
-MEME Parser

When this is completed, we could deprecate the AlignAce and MEME
modules. For AlignAce I have most of the code already written; I need
to rewrite portions of the MEME parser to work with a different motif
implementation (not a major pain). Then I just need to polish it a bit
and provide tests and a short tutorial.

After this rather long intro I'd like to ask about several things:
- Are there many Bio.AlignAce or Bio.MEME users who would be unhappy
  about deprecating them?
- Are there any features which people would find valuable in Bio.Motif?
- Both MEME and AlignAce are DNA-oriented. I've never worked on
  protein motifs myself, but I'd like to know whether anyone is
  interested in using Bio.Motif for that.

Any comments/ideas are welcome

cheers
Bartek

--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433

From dalloliogm at gmail.com  Mon Nov 24 15:25:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 24 Nov 2008 16:25:23 +0100
Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code
In-Reply-To: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
Message-ID: <5aa3b3570811240725n54f7f624oc1db5fe0b88e3f5a@mail.gmail.com>

On Mon, Nov 24, 2008 at 3:51 PM, Bartek Wilczynski wrote:
> Hello All,
>
> Currently, there are two packages dealing with motif analysis in biopython :
> Bio.AlignAce (written by me) and Bio.MEME (written by Jason Hackney).

Hi,
I asked a question about motifs one year ago on this list.
Here is the thread:
- http://lists.open-bio.org/pipermail/biopython/2007-September/003727.html

I would just like to tell you that I have tried the TAMO framework you
suggested to me, and found it very useful.
I am not using it anymore because I don't need it, but I remember that
I liked:
- the methods to represent motifs as matrices of
  frequencies/occurrences etc.
- the fact that it was easy to create a motif from an alignment of
  sequences
- the integration it had with this website:
  http://weblogo.berkeley.edu/logo.cgi

I would suggest that you provide integration with this other web
service, which lets you plot the difference between two sequence logos:
http://www.twosamplelogo.org/examples.html.
Maybe you should contact TAMO's author to ask him if he wants to
contribute, because I remember that its framework was really complete.

>
> Both of them are quite old and they were developed independently so
> the functionality is largely overlapping.
> Particularly the files AlignAce/Motif.py and MEME/Motif.py contain
> almost identical functionality useful for
> anyone interested in motif analysis or writing a parser for yet
> another motif searching tool.
>
> I'd like to change this and create a new library called Bio.Motif,
> which would contain:
> -Motif class for all general functionality concerning motif objects:
> i/o, comparisons, sequence scanning
> -AlignAce Parser
> -MEME Parser
>
> When this is completed, we could deprecate the AlignAce and MEME
> modules. For AlignAce I have most of the code
> already written, I need to rewrite portions of MEME parser to work
> with different motif implementation (not a major pain).
> Then I just need to polish it a bit and provide tests and a short tutorial.
>
> After this rather long intro I'd like to ask about several things:
> - Are there many Bio.AlignAce or Bio.MEME users who would be unhappy
> about deprecating them?
> - Are there any features which people would find valuable in Bio.Motif
> - Both MEME and AlignAce are DNA-oriented, I've never worked on
> Protein motifs myself, but I'd like to know whether anyone is
> interested in using Bio.Motif for that
>
> Any comments/ideas are welcome
>
> cheers
> Bartek
>
> --
> Bartek Wilczynski
> ==================
> Postdoctoral fellow
> EMBL, Furlong group
> Meyerhoffstrasse 1,
> 69012 Heidelberg,
> Germany
> tel: +49 6221 387 8433
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it

From T.Gkikopoulos at dundee.ac.uk  Mon Nov 24 14:58:35 2008
From: T.Gkikopoulos at dundee.ac.uk (Triantafyllos Gkikopoulos)
Date: Mon, 24 Nov 2008 14:58:35 +0000
Subject: [BioPython] averaging ATG based signal from chip on chip data
Message-ID: <492AC11B020000B3000010E2@gw-out.dundee.ac.uk>

Hi all,

I am new to python and have started learning on my own. I want to do
some microarray analysis, and one of the things I would like to do is
to average the signal from a tiled array by aligning on a set of
coordinates, say ORF ATGs, and plotting the average signal for a region
of fixed length, say from the ATG to 300 bp downstream. Obviously this
should be the same if I have a bed file and want to do the same
analysis based on either the start or the stop of all fragments
included in the bed file.

I can use the csv module to import my bed file and signal file, but I
am not sure what is the best way to hold such data: make a dictionary,
make an array, or just a list.
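
One possible starting point, as a minimal sketch only and not a tested
pipeline: keep the signal in a dictionary keyed by chromosome and
position, then average it at each offset downstream of every anchor.
Everything here is an assumption for illustration: the function names
are made up, the signal file is assumed to be tab-separated
"chrom, position, value", the BED start column is used as the anchor,
and strand is ignored.

```python
import csv
import io
from collections import defaultdict


def read_bed(handle):
    """Yield (chrom, start, end) from a tab-separated BED-like file."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the usual BED header/comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        yield row[0], int(row[1]), int(row[2])


def read_signal(handle):
    """Return {chrom: {position: value}} from 'chrom, position, value' rows."""
    signal = defaultdict(dict)
    for chrom, pos, value in csv.reader(handle, delimiter="\t"):
        signal[chrom][int(pos)] = float(value)
    return signal


def average_profile(anchors, signal, length):
    """Average the signal at offsets 0..length-1 downstream of each anchor.

    Offsets with no probe at all are reported as None.
    """
    totals = [0.0] * length
    counts = [0] * length
    for chrom, anchor in anchors:
        probes = signal.get(chrom, {})
        for offset in range(length):
            value = probes.get(anchor + offset)
            if value is not None:
                totals[offset] += value
                counts[offset] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Tiny in-memory example standing in for the real bed and signal files.
bed_text = "chr1\t10\t20\nchr1\t100\t120\n"
signal_text = "chr1\t10\t1.0\nchr1\t11\t2.0\nchr1\t100\t3.0\nchr1\t102\t5.0\n"

anchors = [(chrom, start) for chrom, start, end in read_bed(io.StringIO(bed_text))]
signal = read_signal(io.StringIO(signal_text))
profile = average_profile(anchors, signal, 3)
print(profile)
```

A dictionary of dictionaries is used because tiled-array probes are
sparse, so lookups by position stay cheap; for dense per-base data an
array per chromosome might be the better fit. To anchor on the stop
instead, the third BED column would be used when building `anchors`.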
Appreciate any help

cheers

Dr Triantafyllos Gkikopoulos

The University of Dundee is a registered Scottish charity, No: SC015096

From bsouthey at gmail.com  Mon Nov 24 15:54:32 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Mon, 24 Nov 2008 09:54:32 -0600
Subject: [BioPython] [Biopython-dev] Refactoring motif analysis code
In-Reply-To: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
References: <8b34ec180811240651k45c11563p9e3dd18ba128f0ac@mail.gmail.com>
Message-ID: <492ACE38.1090301@gmail.com>

Bartek Wilczynski wrote:
> Hello All,
>
> Currently, there are two packages dealing with motif analysis in biopython :
> Bio.AlignAce (written by me) and Bio.MEME (written by Jason Hackney).
>
Actually I am not that thrilled with the licenses for these packages
and similar packages because these are free only for academic use. To
me this clashes with the spirit of an open-source project, especially
a BSD-licensed one. But if there is a need for such modules then these
modules should be included.

> Both of them are quite old and they were developed independently so
> the functionality is largely overlapping.
> Particularly the files AlignAce/Motif.py and MEME/Motif.py contain
> almost identical functionality useful for
> anyone interested in motif analysis or writing a parser for yet
> another motif searching tool.
>
> I'd like to change this and create a new library called Bio.Motif,
> which would contain:
> -Motif class for all general functionality concerning motif objects:
> i/o, comparisons, sequence scanning
> -AlignAce Parser
> -MEME Parser
>
>
While it is only free for academic use, have you seen TAMO?
*TAMO: a flexible, object-oriented framework for analyzing
transcriptional regulation using DNA-sequence motifs. *
Bioinformatics. 2005 Jul 15;21(14):3164-5.
http://fraenkel.mit.edu/TAMO/

> When this is completed, we could deprecate the AlignAce and MEME
> modules.
> For AlignAce I have most of the code
> already written, I need to rewrite portions of MEME parser to work
> with different motif implementation (not a major pain).
> Then I just need to polish it a bit and provide tests and a short tutorial.
>
> After this rather long intro I'd like to ask about several things:
> - Are there many Bio.AlignAce or Bio.MEME users who would be unhappy
> about deprecating them?
>
Well, I am not sure how many used Bio.AlignAce given the Parser.py bug :-)
Based on the CVS, both have been untouched for about three years.

Also, what species are these used for? One of the AlignAce papers
indicates that the base composition was set for yeast.

> - Are there any features which people would find valuable in Bio.Motif
> - Both MEME and AlignAce are DNA-oriented, I've never worked on
> Protein motifs myself, but I'd like to know whether anyone is
> interested in using Bio.Motif for that
>
> Any comments/ideas are welcome
>
> cheers
> Bartek
>
>
Personally I would be interested in a general protein motif finding
module because of my current research. However, I do have a different
view with respect to the Biopython community, as indicated above
regarding the licenses.

Bruce

From cjauvin at gmail.com  Mon Nov 24 22:18:29 2008
From: cjauvin at gmail.com (Christian Jauvin)
Date: Mon, 24 Nov 2008 17:18:29 -0500
Subject: [BioPython] PubMed find_related
Message-ID:

Hi,

I'd like to use the PubMed find_related function, but the doc says
that it's deprecated and that I should use the one in the Bio.Entrez
module:

"Find related articles in PubMed, returns an ID list (DEPRECATED).
Please use Bio.Entrez instead as described in the Biopython Tutorial."

The problem is that I can't find the equivalent in the Bio.Entrez module..
(I'm using the latest version, 1.49)

Thanks,

Christian

From mjldehoon at yahoo.com  Tue Nov 25 04:05:01 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 24 Nov 2008 20:05:01 -0800 (PST)
Subject: [BioPython] PubMed find_related
In-Reply-To:
Message-ID: <580790.81356.qm@web62404.mail.re1.yahoo.com>

>>> from Bio import Entrez
>>> handle = Entrez.elink(dbfrom='pubmed',id=12345)
>>> record = Entrez.read(handle)

Feel free to write a section about Entrez.elink for the Biopython
documentation :-). Currently, this section is almost empty.

--Michiel.

--- On Mon, 11/24/08, Christian Jauvin wrote:

> From: Christian Jauvin
> Subject: [BioPython] PubMed find_related
> To: biopython at biopython.org
> Date: Monday, November 24, 2008, 5:18 PM
> Hi,
>
> I'd like to use the PubMed find_related function, but
> the doc says
> that it's deprecated and that I should use the one in
> the Bio.Entrez
> module:
>
> "Find related articles in PubMed, returns an ID list
> (DEPRECATED).
> Please use Bio.Entrez instead as described in the Biopython
> Tutorial."
>
> The problem is that I can't find the equivalent in the
> Bio.Entrez
> module.. (I'm using latest version 1.49)
>
> Thanks,
>
> Christian
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From rjalves at igc.gulbenkian.pt  Thu Nov 27 15:17:38 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 27 Nov 2008 15:17:38 +0000
Subject: [BioPython] Cladograms
Message-ID: <492EBA12.2080408@igc.gulbenkian.pt>

Hi everyone,

I've been searching the web for python modules to do cladograms but
the only relevant stuff I found was related to dendrograms and
hierarchical clustering which will give the representation I need.
My goal is something that resembles a heatmap[1] but where the trees
will be cladograms[2] instead of the result of clustering steps.
I know that I probably won't find modules doing exactly what I want,
which is why I'm searching for tools to do each step separately and
try to glue them together somehow. For the heatmap I already have
something that will probably do the job, but for the cladograms I
couldn't find any decent module.
Do you happen to know any dark alley in BioPython or any other external
module that would allow me to do the cladogram?

Thanks,
Renato

[1] - http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png
[2] - http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif

From rjalves at igc.gulbenkian.pt  Thu Nov 27 16:11:01 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 27 Nov 2008 16:11:01 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
Message-ID: <492EC695.7020102@igc.gulbenkian.pt>

Where it says "which will give the representation I need", it should
read "which will *not* give the representation I need".

Sorry for that.
Renato

Quoting Renato Alves on 11/27/2008 03:17 PM:
> Hi everyone,
>
> I've been searching the web for python modules to do cladograms but
> the only relevant stuff I found was related to dendrograms and
> hierarchical clustering which will give the representation I need.
> My goal is something that resembles a heatmap[1] but where the trees
> will be cladograms[2] instead of the result of clustering steps.
> I know that probably I won't find modules doing exactly what I want,
> which is why I'm searching for tools to do each step separately and
> try to glue them somehow. For the heatmap I have something already
> that will probably do the job, but for the cladograms I couldn't find
> any decent module.
> Do you happen to know any dark alley in BioPython or any other
> external module that would allow me to do the cladogram?
>
> Thanks,
> Renato
>
> [1] -
> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png
>
> [2] -
> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From dalloliogm at gmail.com  Thu Nov 27 16:45:37 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 27 Nov 2008 17:45:37 +0100
Subject: [BioPython] Cladograms
In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
Message-ID: <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>

On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves wrote:
> Hi everyone,
>
> I've been searching the web for python modules to do cladograms but the only
> relevant stuff I found was related to dendrograms and hierarchical
> clustering which will give the representation I need.
> My goal is something that resembles a heatmap[1]

Let me say first that I am not able to help you :).
But this seems to be the kind of thing that R does. Have you had a
look at it?

> but where the trees will be
> cladograms[2] instead of the result of clustering steps.
> I know that probably I won't find modules doing exactly what I want, which
> is why I'm searching for tools to do each step separately and try to glue
> them somehow. For the heatmap I have something already that will probably do
> the job, but for the cladograms I couldn't find any decent module.
> Do you happen to know any dark alley in BioPython or any other external
> module that would allow me to do the cladogram?

But from which kind of data?
Do you have to align sequences, or are they aligned already? Do you
already have the cladograms, or do you have to calculate them?
>
> Thanks,
> Renato
>
> [1] -
> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png
> [2] -
> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From rjalves at igc.gulbenkian.pt  Thu Nov 27 18:50:36 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 27 Nov 2008 18:50:36 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
Message-ID: <492EEBFC.4050201@igc.gulbenkian.pt>

Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM:
> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves wrote:
>
>> Hi everyone,
>>
>> I've been searching the web for python modules to do cladograms but the only
>> relevant stuff I found was related to dendrograms and hierarchical
>> clustering which will give the representation I need.
>> My goal is something that resembles a heatmap[1]
>>
> Let me premise that I am not able to help you :).
> But this seems to be the kind of things that R does. Have you had a look at it?
>
For the heatmap I did have a look and it seems easy to use; for the
cladograms I couldn't find much by searching for the name with
help.search(). Also, google doesn't give the best results when
searching for R.

Still, I would like to remain in python as much as possible. I used rpy
before to do some R from within python and it worked, but the code is
far from "maintainable".

>> but where the trees will be
>> cladograms[2] instead of the result of clustering steps.
>> I know that probably I won't find modules doing exactly what I want, which
>> is why I'm searching for tools to do each step separately and try to glue
>> them somehow. For the heatmap I have something already that will probably do
>> the job, but for the cladograms I couldn't find any decent module.
>> Do you happen to know any dark alley in BioPython or any other external
>> module that would allow me to do the cladogram?
>>
> But from which kind of data?
> Do you have to align sequences, or are they aligned already? do you
> already have the cladograms, or do you have to calculate them?
>
The data will be mostly taxonomic and in some cases genes grouped by
properties. I only need to find a simple way to turn it into an image.

Thanks

>> Thanks,
>> Renato
>>
>> [1] -
>> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png
>> [2] -
>> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif
>> _______________________________________________
>> BioPython mailing list - BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>

From dalloliogm at gmail.com  Thu Nov 27 20:48:07 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 27 Nov 2008 21:48:07 +0100
Subject: [BioPython] Cladograms
In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
	<492EEBFC.4050201@igc.gulbenkian.pt>
Message-ID: <5aa3b3570811271248u5fcde149md8741c943a6b62ef@mail.gmail.com>

On Thu, Nov 27, 2008 at 7:50 PM, Renato Alves wrote:
> Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM:
>>
>> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves
>> wrote:
>>
>>>
>>> Hi everyone,
>>>
>>> I've been searching the web for python modules to do cladograms but the
>>> only
>>> relevant stuff I found was related to dendrograms and hierarchical
>>> clustering which will give the
>>> representation I need.
>>> My goal is something that resembles a heatmap[1]
>>>
>>
>> Let me premise that I am not able to help you :).
>> But this seems to be the kind of things that R does. Have you had a look
>> at it?
>>
>
> For the heatmap I did have a look and seems easy to use, for the cladograms
> I couldn't find much using the name with help.search(). Also google doesn't
> give the best results when searching for R .

Ask on the R users mailing list. Be careful about how you write your
message there, because it is a mailing list with a lot of users.
If you find something interesting in R, please don't forget us :), and
don't forget that python is cool :).

> Still I would like to remain in python as much as possible.

A population genetics module is under development at the moment, but it
doesn't implement anything like that.
I am sorry, I am not aware of any module capable of doing this.

> Thanks
>>>
>>> Thanks,
>>> Renato
>>>
>>> [1] -
>>>
>>> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/default_heatmap.png
>>> [2] -
>>>
>>> http://www.csupomona.edu/~jcclark/classes/bot125/resource/graphics/c/cladogram.gif
>>> _______________________________________________
>>> BioPython mailing list - BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>>
>

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From winda002 at student.otago.ac.nz  Thu Nov 27 21:52:08 2008
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 28 Nov 2008 10:52:08 +1300
Subject: [BioPython] Cladograms
In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
	<492EEBFC.4050201@igc.gulbenkian.pt>
Message-ID: <492F1688.7060505@student.otago.ac.nz>

Renato Alves wrote:
> Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM:
>> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I've been searching the web for python modules to do cladograms but
>>> the only
>>> relevant stuff I found was related to dendrograms and hierarchical
>>> clustering which will give the representation I need.
>>> My goal is something that resembles a heatmap[1]
>>>
>> Let me premise that I am not able to help you :).
>> But this seems to be the kind of things that R does. Have you had a
>> look at it?
>>
> For the heatmap I did have a look and seems easy to use, for the
> cladograms I couldn't find much using the name with help.search().
[snip]

Hi Renato,

Have you looked into the R package ape (Analyses of Phylogenetics and
Evolution, install.packages("ape"))? It has object classes for phylip
and nexus trees. I don't know how easy it is to kludge things together
from different packages but it might be worth looking into. There is a
python module to integrate with R if you want to stay pure ;)

--
PhD Student
Allan Wilson Centre
Department of Zoology
University of Otago,
PO Box 56, Dunedin 9054
ph: +64-3-4778459
mob: +64-27-3326815
e: winda002 at student.otago.ac.nz

From p.j.a.cock at googlemail.com  Fri Nov 28 11:29:44 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Nov 2008 11:29:44 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <492EBA12.2080408@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
Message-ID: <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com>

On Thu, Nov 27, 2008 at 3:17 PM, Renato Alves wrote:
> Hi everyone,
>
> I've been searching the web for python modules to do cladograms but the only
> relevant stuff I found was related to dendrograms and hierarchical
> clustering which will give the representation I need.

Hi Renato,

> My goal is something that resembles a heatmap[1] but where the trees will be
> cladograms[2] instead of the result of clustering steps.

In my heatmap example you cited (using R), you can in principle supply
the tree to be used (instead of it defaulting to doing a hierarchical
clustering). If you like the images from the R heatmap function, I
would suggest you look at loading phylogenetic trees into R and
passing them to the heatmap function. I have not tried this myself.

> I know that probably I won't find modules doing exactly what I want, which
> is why I'm searching for tools to do each step separately and try to glue
> them somehow.

I can understand the idea here, but in practice "gluing" the images
together may not be trivial. You'll have to make sure that they are
drawn using vector image formats (allowing you to scale the images to
match), or if using bitmaps you'll need to be able to specify things
pixel perfect. You will also need to hope that the tree is drawn with
equal vertical spacing between leaves, otherwise it won't match the
grid of the heatmap. That said, there are a lot of tree drawing
packages out there, and this could work.

> For the heatmap I have something already that will probably do
> the job, but for the cladograms I couldn't find any decent module.
> Do you happen to know any dark alley in BioPython or any other external
> module that would allow me to do the cladogram?

You could in principle use python and a package like reportlab to draw
both the tree and the heatmap - but you'd end up writing a lot of your
own code. For example, I have used python and reportlab to draw
colourful PDF trees with aligned columns of data, e.g. Supplementary
figures 1 and 2 from: http://dx.doi.org/10.1099/mic.0.2007/013672-0

The script that drew these trees is actually rather complicated
(partly due to showing two sets of bootstrap values). If I recall
correctly, it also used Thomas Mailund's Newick Tree module to parse
the tree files, and not Biopython.
See http://www.daimi.au.dk/~mailund/newick.html

Next time I need to draw a customised tree, I'll try to look at
writing something more general purpose to go in Biopython under
Bio.Graphics (using Bio.Nexus to load tree files).

For now, I would suggest you explore the R heatmap function and its
arguments (and perhaps call this via python if you need to - it would
be simpler just to use R directly).

Peter

From rjalves at igc.gulbenkian.pt  Fri Nov 28 11:54:19 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Fri, 28 Nov 2008 11:54:19 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <492F1688.7060505@student.otago.ac.nz>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
	<492EEBFC.4050201@igc.gulbenkian.pt>
	<492F1688.7060505@student.otago.ac.nz>
Message-ID: <492FDBEB.20001@igc.gulbenkian.pt>

@David and Giovanni

Thank you both for your feedback. I guess I will give R a more decent
try.

In the meantime I guess I found myself a small python side project :).
I will be sure to contribute the code if I get any decent result.

Renato

Quoting David Winter on 11/27/2008 09:52 PM:
> Renato Alves wrote:
>> Quoting Giovanni Marco Dall'Olio on 11/27/2008 04:45 PM:
>>> On Thu, Nov 27, 2008 at 4:17 PM, Renato Alves
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I've been searching the web for python modules to do cladograms but
>>>> the only
>>>> relevant stuff I found was related to dendrograms and hierarchical
>>>> clustering which will give the representation I need.
>>>> My goal is something that resembles a heatmap[1]
>>>>
>>> Let me premise that I am not able to help you :).
>>> But this seems to be the kind of things that R does. Have you had a
>>> look at it?
>>>
>> For the heatmap I did have a look and seems easy to use, for the
>> cladograms I couldn't find much using the name with help.search().
> [snip]
>
> Hi renato,
>
> Have you looked into the R package ape (analysis of phylogenetics and
> evolution, install.packages("ape")) - it has object classes for phylip
> and nexus trees. I don't know how easy it is to kludge things together
> from different packages but it might be worth looking into? There is a
> python module to integrate with R if you want to stay pure ;)
>

From rjalves at igc.gulbenkian.pt  Fri Nov 28 17:03:48 2008
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Fri, 28 Nov 2008 17:03:48 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com>
Message-ID: <49302474.4090907@igc.gulbenkian.pt>

Quoting Peter Cock on 11/28/2008 11:29 AM:
> Hi Renato,
>
Hi Peter,

>> My goal is something that resembles a heatmap[1] but where the trees will be
>> cladograms[2] instead of the result of clustering steps.
>>
> In my heatmap example you cited (using R), you can in principle supply
> the tree to be used (instead of it defaulting to doing a hierarchical
> clustering). If you like the images from the R heatmap function, I
> would suggest you look at loading phylogenetic trees into R and
> passing them to the heatmap function. I have not tried this myself.
>
R's heatmap.2 is giving me what I need so far; it even reorders the
columns/rows according to a pre-calculated dendrogram.
The only thing that I don't like that much is how the dendrograms are
plotted when multiple branches are at the same level, but I can live
with it :)

>> I know that probably I won't find modules doing exactly what I want, which
>> is why I'm searching for tools to do each step separately and try to glue
>> them somehow.
>>
> I can understand the idea here, but in practice the "gluing" the
> images together may not be trivial.
> You'll have to make sure that
> they are drawn using vector image formats (allowing you to scale the
> images to match), or if using bitmaps you'll need to be able to
> specify things pixel perfect. You will also need to hope that the
> tree is drawn with equal vertical spacing between leaves, otherwise it
> won't match the grid of the heatmap. That said, there are a lot of tree
> drawing packages out there, and this could work.
>
Well, I was thinking of a naive approach such as "glue by hand" (shame
on me). But for the real thing I would probably use matplotlib,
although given my current knowledge of the library it would take a
while...

>> For the heatmap I have something already that will probably do
>> the job, but for the cladograms I couldn't find any decent module.
>> Do you happen to know any dark alley in BioPython or any other external
>> module that would allow me to do the cladogram?
>>
> You could in principle use python and a package like reportlab to draw
> both the tree and the heatmap - but you'd end up writing a lot of your
> own code. For example, I have used python and reportlab to draw
> colourful PDF trees with aligned columns of data, e.g. Supplementary
> figures 1 and 2 from: http://dx.doi.org/10.1099/mic.0.2007/013672-0
>
I have never used reportlab directly, only via other tools that did all
the work. But it's good to know it does a good job at what it does.
The only time I messed with PDF libraries using python I ended up with
pyPdf - http://pybrary.net/pyPdf/ - It was the only library that had
the "limited" edit capabilities I needed.

> The script that drew these trees is actually rather complicated
> (partly due to showing two sets of bootstrap values). If I recall
> correctly, it also used Thomas Mailund's Newick Tree module to parse
> the tree files, and not Biopython. See
> http://www.daimi.au.dk/~mailund/newick.html
>
That's also my problem when writing R code using rpy. Not very pythonic
(mine at least), hard to read and reuse.
Sometimes I end up writing code in the original language, dumping data
to files and launching it with os.system/subprocess.call rather than
using rpy. I hope this changes a bit with rpy2...

> Next time I need to draw a customised tree, I'll try to look at
> writing something more general purpose to go in Biopython under
> Bio.Graphics (using Bio.Nexus to load tree files).
>
> For now, I would suggest you explore the R heatmap function and its
> arguments (and perhaps call this via python if you need to - it would
> be simpler just to use R directly).
>
I'm going straight to R for now. But I think this one should be simple
and elegant to do in rpy.

> Peter
>
Thanks a bunch for the nice tips and feedback.
Renato

From p.j.a.cock at googlemail.com  Fri Nov 28 18:14:22 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 28 Nov 2008 18:14:22 +0000
Subject: [BioPython] Cladograms
In-Reply-To: <49302474.4090907@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<320fb6e00811280329x6dca1614qf5e7ba4fbb187765@mail.gmail.com>
	<49302474.4090907@igc.gulbenkian.pt>
Message-ID: <320fb6e00811281014j4d52da75r5dcc4d2152d5819c@mail.gmail.com>

>> In my heatmap example you cited (using R), you can in principle supply
>> the tree to be used (instead of it defaulting to doing a hierarchical
>> clustering). If you like the images from the R heatmap function, I
>> would suggest you look at loading phylogenetic trees into R and
>> passing them to the heatmap function. I have not tried this myself.
>
> R's heatmap.2 is giving me what I need so far, it even reorders the
> columns/rows according to a pre-calculated dendrogram.
> The only thing that I don't like that much is how the dendrograms are
> plotted when multiple branches are at the same levels, but I can live with
> it :)

I'm glad that worked out OK.
If you haven't signed up to the rpy mailing list, I suggest you do so:
https://lists.sourceforge.net/lists/listinfo/rpy-list

Peter

From pingou at pingoured.fr  Sun Nov 30 09:41:13 2008
From: pingou at pingoured.fr (Pierre-Yves)
Date: Sun, 30 Nov 2008 10:41:13 +0100
Subject: [BioPython] Cladograms
In-Reply-To: <492EEBFC.4050201@igc.gulbenkian.pt>
References: <492EBA12.2080408@igc.gulbenkian.pt>
	<5aa3b3570811270845k2a150745j34c9dc8f1ad4a8bd@mail.gmail.com>
	<492EEBFC.4050201@igc.gulbenkian.pt>
Message-ID: <49325FB9.2080005@pingoured.fr>

Renato Alves wrote:
> For the heatmap I did have a look and seems easy to use, for the
> cladograms I couldn't find much using the name with help.search(). Also
> google doesn't give the best results when searching for R .

If you are looking for something related to R, I would recommend using
http://www.rseek.org instead of our friend google.
Rseek might give better results ;)

Regards,

Pierre

PS. Sorry Renato, but I just realized that I forgot to send the mail to
the list and I thought other people might be interested to