From biopython at maubp.freeserve.co.uk Mon Aug 3 16:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Aug 2009 21:48:49 +0100 Subject: [Biopython] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> On 22 June 2009, I wrote: > ... > I'd like to officially deprecate Bio.Fasta for the next release (Biopython > 1.51), which means you can continue to use it for a couple more > releases, but at import time you will see a warning message. See also: > http://biopython.org/wiki/Deprecation_policy > > Would this cause anyone any problems? If you are still using Bio.Fasta, > it would be interesting to know if this is just some old code that hasn't > been updated, or if there is some stronger reason for still using it. No one replied, so I plan to make this change in CVS shortly, meaning that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work but will trigger a deprecation warning at import. Please speak up ASAP if this concerns you. Thanks, Peter From stran104 at chapman.edu Tue Aug 4 21:10:44 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 4 Aug 2009 18:10:44 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0908031207x187119eerc05340c49488889c@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> <65d4b7fc0908031207x187119eerc05340c49488889c@mail.gmail.com> Message-ID: <2a63cc350908041810s4583e254o99e90861a2b23f99@mail.gmail.com> I played with this a bit more and Brad is right, the Unigene database is not supported through Entrez efetch. 
The ID returned by esearch is in fact the GI, and other types of records can be retrieved with it (e.g. gene). A list of supported databases and return types can be found at: http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html As Brad already suggested, downloading the files would work and it would be fast to process locally as well. Good luck :-) On Mon, Aug 3, 2009 at 12:07 PM, Carlos Javier Borroto < carlos.borroto at gmail.com> wrote: > On Thu, Jul 30, 2009 at 8:10 PM, Matthew Strand > wrote: > > Hi Carlos, > > I did something similar to this a while ago and meant to write a cookbook > > entry for it but haven't gotten the chance yet. You could also try doing > > an efetch on the ID of the record returned by esearch. > > > > I'm not near my workstation so I can't test it but you might try:
> > handle = Entrez.efetch(db="unigene", id="141673")
> >
> > If that works then you just need to pull the ID out of the esearch result > > and do an efetch on it.
> >
> I tried that too, but no luck on my side:
>
> >>> from Bio import Entrez
> >>> from Bio import UniGene
> >>> Entrez.email = "carlos.borroto at gmail.com"
> >>> handle = Entrez.esearch(db="unigene", term="Hs.94542")
> >>> record = Entrez.read(handle)
> >>> record
> {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'],
> u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term':
> 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet':
> [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'}
> >>> record["IdList"][0]
> '141673'
> >>> handle = Entrez.efetch(db="unigene", id=record["IdList"][0])
> >>> print handle.read()
> (Outputs an HTML web page)
>
> regards,
> --
> Carlos Javier
-- Matthew Strand From biopython at maubp.freeserve.co.uk Wed Aug 5 06:29:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 11:29:45 +0100 Subject: [Biopython] Deprecating Bio.Fasta?
In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com> On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote: > On 22 June 2009, I wrote: >> ... >> I'd like to officially deprecate Bio.Fasta for the next release (Biopython >> 1.51), which means you can continue to use it for a couple more >> releases, but at import time you will see a warning message. See also: >> http://biopython.org/wiki/Deprecation_policy >> ... > > No one replied, so I plan to make this change in CVS shortly, meaning > that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work > but will trigger a deprecation warning at import. > > Please speak up ASAP if this concerns you. I've just committed the deprecation of Bio.Fasta to CVS. This could be reverted if anyone has a compelling reason (and tells us before we do the final release of Biopython 1.51). The docstring for Bio.Fasta should cover the typical situations for moving from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing list if you have a more complicated bit of old code that needs to be ported. Thanks, Peter From sbassi at gmail.com Fri Aug 7 14:15:47 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 7 Aug 2009 15:15:47 -0300 Subject: [Biopython] ANN: Python for Bioinformatics book Message-ID: Just want to announce the availability of the book "Python for Bioinformatics". It has a Biopython chapter, and I made it thanks to a lot of people on this list (and biopython-dev) who have helped me since I began programming Python about 7 years ago.
Here is the official announcement: "Python for Bioinformatics" ISBN 1584889292 Amazon: http://www.tinyurl.com/biopython Publisher: http://www.crcpress.com/product/isbn/9781584889298 This book introduces programming concepts to life science researchers, bioinformaticians, support staff, students, and everyone who is interested in applying programming to solve biologically-related problems. Python is the chosen programming language for this task because it is both powerful and easy-to-use. It begins with the basic aspects of the language (like data types and control structures) and builds up to essential skills for today's bioinformatics tasks, like building web applications, using relational database management systems, XML and version control. There is a chapter devoted to Biopython (www.biopython.org) since it can be used for most of the tasks related to bioinformatics data processing. There is a section of applications with source code, featuring sequence manipulation, filtering vector contamination, calculating DNA melting temperature, parsing a GenBank file, inferring splicing sites, and more. There are questions at the end of every chapter, and odd-numbered questions are answered in an appendix, making this text suitable for classroom use. This book can also be used as reference material as it includes Richard Gruet's Python Quick Reference, and the Python Style Guide. DVD: The included DVD features a virtual machine with a special edition of DNALinux, with all the programs and complementary files required to run the scripts commented in the book. All scripts can be tweaked to fit a particular configuration. By using a pre-configured virtual machine the reader has access to the same development environment as the author, so they can focus on learning Python.
All code is also available at http://py3.us/##, where ## is the code number, for example: http://py3.us/57 I've been working on this book for more than two years, testing the examples under different setups and working to make the code compatible with most versions of Python, Biopython and operating systems. Where there is code that only works with a particular dependency, this is clearly noted. Finally, I want to highlight that non-bioinformaticians out there can use this book as an introduction to bioinformatics by starting with the included "Diving into the Gene Pool with BioPython" (by Zachary Voase and published originally in Python Magazine). From lueck at ipk-gatersleben.de Sat Aug 8 03:03:52 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Sat, 8 Aug 2009 09:03:52 +0200 Subject: [Biopython] ANN: Python for Bioinformatics book In-Reply-To: References: Message-ID: <20090808090352.av3u0ldgd7rk88wc@webmail.ipk-gatersleben.de> Hi Sebastian! This sounds like a great book! Hopefully it will soon also be available in German bookshops, so that I can have a look at it! Kind regards Stefanie Quoting Sebastian Bassi: > Just want to announce the availability of the book "Python for > Bioinformatics". It has a Biopython chapter, and I made it thanks to a > lot of people on this list (and biopython-dev) who have helped me since I > began programming Python about 7 years ago. > > Here is the official announcement: > > "Python for Bioinformatics" > ISBN 1584889292 > Amazon: http://www.tinyurl.com/biopython > Publisher: http://www.crcpress.com/product/isbn/9781584889298 > > This book introduces programming concepts to life science researchers, > bioinformaticians, support staff, students, and everyone who is > interested in applying programming to solve biologically-related > problems. Python is the chosen programming language for this task > because it is both powerful and easy-to-use.
> > It begins with the basic aspects of the language (like data types and > control structures) and builds up to essential skills for today's bioinformatics > tasks, like building web applications, using relational database > management systems, XML and version control. There is a chapter > devoted to Biopython (www.biopython.org) since it can be used for most > of the tasks related to bioinformatics data processing. > > There is a section of applications with source code, featuring > sequence manipulation, filtering vector contamination, calculating DNA > melting temperature, parsing a GenBank file, inferring splicing sites, > and more. > > There are questions at the end of every chapter, and odd-numbered > questions are answered in an appendix, making this text suitable for > classroom use. > > This book can also be used as reference material as it includes > Richard Gruet's Python Quick Reference, and the Python Style Guide. > > DVD: The included DVD features a virtual machine with a special > edition of DNALinux, with all the programs and complementary files > required to run the scripts commented in the book. All scripts can be > tweaked to fit a particular configuration. By using a pre-configured > virtual machine the reader has access to the same development > environment as the author, so they can focus on learning Python. All > code is also available at http://py3.us/##, where ## is the code > number, for example: http://py3.us/57 > > I've been working on this book for more than two years, testing the > examples under different setups and working to make the code > compatible with most versions of Python, Biopython and operating > systems. Where there is code that only works with a particular > dependency, this is clearly noted.
> > Finally, I want to highlight that non-bioinformaticians out there can > use this book as an introduction to bioinformatics by starting with > the included "Diving into the Gene Pool with BioPython" (by Zachary > Voase and published originally in Python Magazine). > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From biopython at maubp.freeserve.co.uk Mon Aug 10 07:12:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 12:12:21 +0100 Subject: [Biopython] Trimming adaptors sequences Message-ID: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> Hi all, Brad's got an interesting blog post up on using Biopython for trimming adaptors for next gen sequencing reads, using Bio.pairwise2 for pairwise alignments between the adaptor and the reads: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ The basic idea is similar to what Giles Weaver was describing last month, although Giles was using EMBOSS needle to do a global pairwise alignment via BioPerl: http://lists.open-bio.org/pipermail/biopython/2009-July/005338.html We already had a simple FASTQ "primer trimming" example in the tutorial, which I have just extended to add a more general FASTQ "adaptor trimming" example. For this I am deliberately only looking for exact matches. This is faster of course, but it also makes the example much more easily understood as well - something important for an introductory example. A full cookbook example of how to use pairwise alignments would seem like a great idea for a cookbook entry on the wiki. It would be interesting to see which is faster - using EMBOSS needle/water or Bio.pairwise2. Both are written in C, but using EMBOSS we'd have the overhead of parsing the output file. Brad - why are you using a local alignment and not a global alignment? 
Shouldn't we be looking for the entire adaptor sequence? It looks like you don't consider the unaligned parts of the adaptor when you count the mismatches - is this a bug? I wonder if it would be simpler (and faster) to take a score-based threshold. Regards, Peter From chapmanb at 50mail.com Mon Aug 10 09:16:50 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 10 Aug 2009 09:16:50 -0400 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> Message-ID: <20090810131650.GP12604@sobchak.mgh.harvard.edu> Hi Peter; > Brad's got an interesting blog post up on using Biopython for trimming > adaptors for next gen sequencing reads, using Bio.pairwise2 for > pairwise alignments between the adaptor and the reads: > > http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ > > The basic idea is similar to what Giles Weaver was describing last > month, although Giles was using EMBOSS needle to do a global pairwise > alignment via BioPerl: > http://lists.open-bio.org/pipermail/biopython/2009-July/005338.html Yes, same idea. When I started messing with this I was thinking I could be tricky and get something that avoided doing alignments and would be faster. Unfortunately I didn't have good luck with the pure string-based approaches. > We already had a simple FASTQ "primer trimming" example in the > tutorial, which I have just extended to add a more general FASTQ > "adaptor trimming" example. For this I am deliberately only looking > for exact matches. This is faster of course, but it also makes the > example much more easily understood as well - something important for > an introductory example. Agreed. I like the examples and was thinking of this as an extension of the exact matching approach. I am definitely happy to roll this or some derivative of it into Biopython.
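[For readers following along: the exact-match strategy discussed above can be sketched in a few lines of plain Python. This is a hypothetical helper for illustration only, not the tutorial's actual code, which works on SeqRecord objects and trims the quality string too.]

```python
def trim_adaptor(read, adaptor):
    """Cut an exact adaptor match (and everything before it) off a read.

    If the adaptor is not found, the read is returned unchanged.
    Sketch of the exact-match approach only; a pairwise alignment
    would be needed to tolerate sequencing errors in the adaptor.
    """
    index = read.find(adaptor)
    if index == -1:
        return read  # adaptor absent, leave the read alone
    return read[index + len(adaptor):]

print(trim_adaptor("AAAGATCCCGT", "GATC"))  # CCGT
print(trim_adaptor("TTTT", "GATC"))  # TTTT
```

This is fast because it is a single string search, but as discussed in the thread it misses adaptors containing even one mismatch.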
> A full cookbook example of how to use pairwise alignments would seem > like a great idea for a cookbook entry on the wiki. It would be > interesting to see which is faster - using EMBOSS needle/water or > Bio.pairwise2. Both are written in C, but using EMBOSS we'd have the > overhead of parsing the output file. In terms of speed, I was thinking of this as a good target for parallelization using the multiprocessing library (http://docs.python.org/library/multiprocessing.html) but haven't had time yet to look into that. > Brad - why are you using a local alignment and not a global alignment? > Shouldn't we be looking for the entire adaptor sequence? It looks like > you don't consider the unaligned parts of the adaptor when you > count the mismatches - is this a bug? Good call -- this should consider the number of matches in the aligning region to the full adaptor to see if we've got it. This is fixed in the GitHub version now. Thanks for pointing it out. > I wonder if it would be simpler (and faster) to take a score-based threshold. Maybe, but I find comfort in being able to describe the algorithms simply: any matches to the adaptor with 2 or fewer errors. I'd imagine most of the time is being taken up doing the actual alignment work. Thanks for the feedback on this. It was really helpful, Brad From rodrigo_faccioli at uol.com.br Mon Aug 10 10:31:44 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Mon, 10 Aug 2009 11:31:44 -0300 Subject: [Biopython] qBlast Error and Entrez module Message-ID: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> Hello, I'm trying to run a BLAST search at the NCBI, using the NCBI module from Biopython. I read Chapter 7 of the Biopython Tutorial. So, my code is below.
from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast("blastn", "nr",
    "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
blast_results = result_handle.read()
save_file = open("1CRN_Blast.xml", "w")
save_file.write(blast_results)
save_file.close()

However, when I execute this code, I receive the error message: raise ValueError("No RID and no RTOE found in the 'please wait' page." I don't know what I'm doing wrong, so I'd be grateful if somebody could help me. I have one more question about the Entrez module. What is the difference between Entrez and NCBI? Can I run a protein alignment with the Entrez module? If yes, could somebody show me an example? Sorry for my English mistakes. Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From jblanca at btc.upv.es Mon Aug 10 10:39:56 2009 From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel) Date: Mon, 10 Aug 2009 16:39:56 +0200 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908100536j50cf9dacp27b93040b50623aa@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <1249905992.4a800d48a7d29@webmail.upv.es> <320fb6e00908100536j50cf9dacp27b93040b50623aa@mail.gmail.com> Message-ID: <1249915196.4a80313cf2284@webmail.upv.es> I also had the same problem and I wrote a function to do it using exonerate or blast; take a look at create_vector_striper_by_alignment in: http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/seq_cleaner.py Jose Blanca From biopython at maubp.freeserve.co.uk Mon Aug 10 11:10:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:10:50 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To:
<3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> Message-ID: <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> On Mon, Aug 10, 2009 at 3:31 PM, Rodrigo faccioli wrote: > Hello, > > I've tried to execute a blast from NCBI. In this way, I'm using the NCBI > module from Biopython. I read the Biopython Tutorial its Chapter 7. So, my > code is below.
>
> result_handle = NCBIWWW.qblast("blastn", "nr",
>     "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
> blast_results = result_handle.read()
> save_file = open( "1CRN_Blast.xml", "w")
> save_file.write(blast_results)
> save_file.close()
>
> However, when I execute this code, I receive the error message: raise > ValueError("No RID and no RTOE found in the 'please wait' page." > > I don't know what I'm doing wrong. So, if somebody can help me, I thank. I don't see anything wrong with that line, but it isn't working for me either. Odd. Perhaps the NCBI have changed something... I'll get back to you. > I have on more doubt about Entrez module. What is the difference between > Entrez and NCBI ? The NCBI is the (American) National Center for Biotechnology Information. They provide lots of online tools including Entrez and BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi http://www.ncbi.nlm.nih.gov/sites/entrez > With Entrez module can I execute a protein aligment? If > yes, could somebody inform a example for me. No, you can't run BLAST via Entrez. Entrez is like a way to search and download data from the NCBI.
Peter From biopython at maubp.freeserve.co.uk Mon Aug 10 11:15:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:15:53 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> Message-ID: <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> On Mon, Aug 10, 2009 at 4:10 PM, Peter wrote: > On Mon, Aug 10, 2009 at 3:31 PM, Rodrigo > faccioli wrote: >> Hello, >> >> I've tried to execute a blast from NCBI. In this way, I'm using the NCBI >> module from Biopython. I read the Biopython Tutorial its Chapter 7. So, my >> code is below.
>>
>> result_handle = NCBIWWW.qblast("blastn", "nr",
>>     "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
>> blast_results = result_handle.read()
>> save_file = open( "1CRN_Blast.xml", "w")
>> save_file.write(blast_results)
>> save_file.close()
>>
>> However, when I execute this code, I receive the error message: raise >> ValueError("No RID and no RTOE found in the 'please wait' page." >> >> I don't know what I'm doing wrong. So, if somebody can help me, I thank. > > I don't see anything wrong with that line, but it isn't working for me either. > Odd. Perhaps the NCBI have changed something... I'll get back to you. It is actually a simple problem: You are using a protein query but BLASTN requires a nucleotide sequence. The NCBI does actually try and tell us this, but Biopython doesn't (currently) know how to extract the error message to show you.
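[For what it's worth, the kind of error extraction Peter describes could be sketched along these lines. The regular expression below is a guess at the page format based on the error text quoted in this thread, not the real Bio.Blast.NCBIWWW code.]

```python
import re

def extract_ncbi_error(html):
    """Pull an NCBI "Message ID#..." error out of an HTML results page.

    Hypothetical helper: the pattern is an assumption about the page
    layout, not what Bio.Blast.NCBIWWW actually implements.
    """
    match = re.search(r"(Message ID#\d+ Error:[^<]*)", html)
    if match:
        return match.group(1).strip()
    return None  # no recognisable error on the page

page = ("<p>Message ID#24 Error: Failed to read the Blast query: "
        "Protein FASTA provided for nucleotide sequence</p>")
print(extract_ncbi_error(page))
```

A parser like this would let the "No RID and no RTOE" ValueError be replaced by the NCBI's own explanation of what went wrong.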
Peter From biopython at maubp.freeserve.co.uk Mon Aug 10 11:29:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:29:00 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> Message-ID: <320fb6e00908100829g7d04d29n9f35a1a65072bb18@mail.gmail.com> On Mon, Aug 10, 2009 at 4:15 PM, Peter wrote: > > It is actually a simple problem: You are using a protein query but BLASTN > requires a nucleotide sequence. The NCBI does actually try and tell us > this, but Biopython doesn't (currently) know how to extract the error > message to show you. I've updated Bio.Blast.NCBIWWW to try and report the NCBI error message, which means that in future a mistake like this will result in: ValueError: Error message from NCBI: Message ID#24 Error: Failed to read the Blast query: Protein FASTA provided for nucleotide sequence That should make life a little simpler. Thanks for telling us about this, and reminding me about this issue. Peter From biopython at maubp.freeserve.co.uk Mon Aug 10 11:58:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:58:29 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <3715adb70908100836m1a653ddco435896a2a1f7ee3c@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> <3715adb70908100836m1a653ddco435896a2a1f7ee3c@mail.gmail.com> Message-ID: <320fb6e00908100858h76349de6l55b0145d6e3330da@mail.gmail.com> On Mon, Aug 10, 2009 at 4:36 PM, Rodrigo faccioli wrote: > Sorry, my error.
This error occurred because I had built a BLAST search for a nucleotide sequence and my intention was to use the same code. Therefore, my configuration file has a parameter called BlastProgram, with the options BlastN or BlastP. > > Now, my code is working. > > Thank you for your help. I'm glad I could help. Peter P.S. Please try and keep replies CC'd to the mailing list. From dmikewilliams at gmail.com Mon Aug 10 12:59:03 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 12:59:03 -0400 Subject: [Biopython] hsp.identities Message-ID: Hi there. Been using perl since 1996, but I am new to python. I am working on some python code that was last modified in March of 2007. The code used to use NCBIStandalone; I've modified it to use NCBIXML because the Standalone package died with an exception, which I assumed was due to changes in the blast report format since the code was originally written.

blastToleranceNT = 2
blast_out = open(blast_report, "r")
b_parse = NCBIXML.parse(blast_out, debug)
for b_record in b_parse :
    for al in b_record.alignments:
        al.hsps = filter(lambda x:
                         abs(x.identities[0]-x.identities[1]) <= blastToleranceNT,
                         al.hsps)

This code generates the following error:

TypeError: 'int' object is unsubscriptable

Tried using some (slightly modified) code from: http://biobanner.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/Arabidopsis/ArabComp/LocateOligos.py?rev=1.3

for hsp in al.hsps:
    identities, length = hsp.identities

which gives the following error:

identity, length = hsp.identities
TypeError: 'int' object is not iterable

Using blast-2.2.17, python 2.6, and biopython version 1.49 on a fedora 11 system. Also tried on a fedora 10 system with python 2.5.2 and biopython 1.48 - similar results. According to the docs at: http://www.biopython.org/DIST/docs/api/Bio.Blast.Record.HSP-class.html hsp.identities is a tuple: identities Number of identities/total aligned.
tuple of (int, int) I've looked at various sites with examples of how to deal with tuples, but nothing seems to work, and the error messages always imply that identities is an int. I'm hoping my spinning my wheels on this is just the result of being new to python. I know the original version of the code *used* to work, and the rest of the program seems to work fine if I comment out the filter line. Any help would be appreciated; this one line of code is a showstopper and I have multiple deadlines this week which depend on getting this working. Thanks, Mike From dmikewilliams at gmail.com Mon Aug 10 14:23:31 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 14:23:31 -0400 Subject: [Biopython] hsp.identities (solved) Message-ID: Well, one hour after posting my question, I found the answer in the list archives: http://portal.open-bio.org/pipermail/biopython-dev/2006-April/002347.html What happens is that if the Blast output looks like this:

Identities = 28/87 (32%), Positives = 44/87 (50%), Gaps = 12/87 (13%)

then the text-based parser returns:

hsp.identities = (28, 87)
hsp.positives = (44, 87)
hsp.gaps = (12, 87)

while the XML parser returns:

hsp.identities = 28
hsp.positives = 44
hsp.gaps = 12

We can get the 87 from len(hsp.query). Cheers, Mike From biopython at maubp.freeserve.co.uk Mon Aug 10 16:43:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 21:43:30 +0100 Subject: [Biopython] hsp.identities In-Reply-To: References: Message-ID: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> On Mon, Aug 10, 2009 at 5:59 PM, Mike Williams wrote: > Hi there. Been using perl since 1996, but I am new to python. I am > working on some python code that was last modified in March of 2007.
> > The code used to use NCBIStandalone, I've modified it to use NCBIXML > because the Standalone package died with an exception, which I assumed > was due to changes in the blast report format since the code was > originally written. Quite likely - the NCBI keep changing the plain text output, so we have more or less given up that losing battle and have followed their advice and now just recommend the XML parser.
>
> blastToleranceNT = 2
> blast_out = open(blast_report, "r")
> b_parse = NCBIXML.parse(blast_out, debug)
> for b_record in b_parse :
>     for al in b_record.alignments:
>         al.hsps = filter(lambda x:
>                          abs(x.identities[0]-x.identities[1]) <= blastToleranceNT,
>                          al.hsps)
>
> This code generates the following error:
> TypeError: 'int' object is unsubscriptable
> ...
> I've looked at various sites with examples of how to deal with tuples, > but nothing seems to work, and > the error messages always imply that identities is an int. > > I'm hoping my spinning my wheels on this is just the result of being > new to python. I know the original version of the code *used* to > work, and the rest of the program seems to work fine, if I comment out > the filter line. > > Any help would be appreciated, this one line of code is a show stopper > and I have multiple deadlines this week which depend on getting this > working. This is one of the quirks of the XML parser (integer) versus the plain text parser (a tuple of two integers, the number of identities and the alignment length). In general they are interchangeable but there are a couple of accidents like this which we've left in place rather than breaking existing scripts. See Bug 2176 for more details.
http://bugzilla.open-bio.org/show_bug.cgi?id=2176 For plain text, from memory you needed this: abs(x.identities[0]-x.identities[1]) or, abs(x.identities[0]-x.align_length) For XML you'll need: abs(x.identities - x.align_length) (I think, without testing it) Peter From dmikewilliams at gmail.com Mon Aug 10 17:42:08 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 17:42:08 -0400 Subject: [Biopython] hsp.identities In-Reply-To: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> References: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> Message-ID: On Mon, Aug 10, 2009 at 4:43 PM, Peter wrote: > This is one of the quirks of the XML parser (integer) versus the > plain text parser (tuple of two integers, the number of identities > and the alignment length). In general they are interchangeable > but there are a couple of accidents like this which we've left in > place rather than breaking existing scripts. See Bug 2176 for > more details. > http://bugzilla.open-bio.org/show_bug.cgi?id=2176 > > For plain text, from memory you needed this: > abs(x.identities[0]-x.identities[1]) or, abs(x.identities[0]-x.align_length) > For XML you'll need: abs(x.identities - x.align_length) > Thanks for the reply, Peter. I actually found the solution and posted that fact a couple hours ago, although the additional information was helpful. I do think that, at least, the documentation should be changed to mention the difference between the standalone and xml parsers. If that had been done it would have saved me a lot of time. 
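[To make the two conventions Peter describes concrete, here is a stand-alone sketch using stub objects rather than real Blast records - illustrative only, not Biopython code.]

```python
class StubHSP:
    """Minimal stand-in for a Blast HSP record (illustrative only)."""
    pass

# Plain text parser convention: identities is a (count, aligned_length) tuple.
text_hsp = StubHSP()
text_hsp.identities = (28, 87)

# XML parser convention: identities is a plain int; the length is separate.
xml_hsp = StubHSP()
xml_hsp.identities = 28
xml_hsp.align_length = 87

# The mismatch count computed by each of the expressions above:
text_mismatches = text_hsp.identities[1] - text_hsp.identities[0]
xml_mismatches = xml_hsp.align_length - xml_hsp.identities

print(text_mismatches, xml_mismatches)  # 59 59
```

Both expressions give the same answer once the right attribute access is used for the parser in question, which is the essence of the fix discussed here.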
Peace, Mike From biopython at maubp.freeserve.co.uk Tue Aug 11 05:13:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 10:13:16 +0100 Subject: [Biopython] hsp.identities In-Reply-To: References: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> Message-ID: <320fb6e00908110213o339d4d2brf7c571a44359297d@mail.gmail.com> On Mon, Aug 10, 2009 at 10:42 PM, Mike Williams wrote: > > I do think that, at least, the documentation should be changed to > mention the difference between the standalone and xml parsers. If > that had been done it would have saved me a lot of time. Good point. I've attempted to clarify the HSP class docstring. Peter From biopython at maubp.freeserve.co.uk Wed Aug 12 19:21:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 00:21:36 +0100 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <20090810131650.GP12604@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> On Mon, Aug 10, 2009 at 2:16 PM, Brad Chapman wrote: > > Agreed. I like the examples and was thinking of this as an extension > of the exact matching approach. I am definitely happy to roll this > or some derivative of it into Biopython. Hi Brad, Is your aim to have a very fast pipeline, or an understandable reference implementation (a worked example)? If this is for a real pipeline, does it have to be FASTQ to FASTA? Further to your blog comment about slicing SeqRecord objects slowing things down, I agree - if you don't need the qualities, then having to slice them is a pointless overhead. As usual in programming, there are several options trading off elegant/general for speed. Personally I would want to keep the qualities for the assembly/mapping step.
While keeping things general, as you don't care about the qualities, you could do the whole operation on FASTA files which are faster to read in, and when you slice the resulting SeqRecord you don't have the overhead of slicing the qualities. However, if you just want speed AND you really want to have a FASTQ input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain strings, and handle the output yourself. Working directly with Python strings is going to be faster than using Seq and SeqRecord objects. You can even opt for outputting FASTQ files - as long as you leave the qualities as an encoded string, you can just slice that too. The downside is the code will be very specific, e.g. something along these lines:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
in_handle = open(input_fastq_filename)
out_handle = open(output_fastq_filename, "w")
for title, seq, qual in FastqGeneralIterator(in_handle) :
    #Do trim logic here on the string seq
    if trim :
        seq = seq[start:end]
        qual = qual[start:end] # kept as ASCII string!
    #Save the (possibly trimmed) FASTQ record:
    out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
out_handle.close()
in_handle.close()

Note that FastqGeneralIterator is already in Biopython 1.50 and 1.51b, but is now a bit faster in CVS/github (what will be Biopython 1.51). Peter From mattkarikomi at gmail.com Wed Aug 12 22:10:08 2009 From: mattkarikomi at gmail.com (Matt Karikomi) Date: Wed, 12 Aug 2009 22:10:08 -0400 Subject: [Biopython] biopython mashup simmilar to lasergene Message-ID: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> i use the lasergene suite to manage molecular cloning projects. in projects like this, the visual presentation of both data and workflow history is crucial. it seems like the GUI of this software suite could be recapitulated by a mashup of modules from bioperl and/or biopython while at the same time providing a rich API which will never exist in lasergene.
has there been any attempt to mask the powerful script-dependent functionality of these open-source modules in some form of GUI? i am envisioning something like the [web based] Primer3 Plus interface to the C implementation of Primer3 (obviously wider in scope). sorry if this is the wrong list (please advise). thanks matt From biopython at maubp.freeserve.co.uk Thu Aug 13 05:56:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 10:56:28 +0100 Subject: [Biopython] biopython mashup simmilar to lasergene In-Reply-To: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> References: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> Message-ID: <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> On Thu, Aug 13, 2009 at 3:10 AM, Matt Karikomi wrote: > i use the lasergene suite to manage molecular cloning projects. > in projects like this, the visual presentation of both data and > workflow history is crucial. it seems like the GUI of this software > suite could be recapitulated by a mashup of modules from bioperl > and/or biopython while at the same time providing a rich API which > will never exist in lasergene. > has there been any attempt to mask the powerful script-dependent > functionality of these open-source modules in some form of GUI? i am > envisioning something like the [web based] Primer3 Plus interface to > the C implementation of Primer3 (obviously wider in scope). sorry if > this is the wrong list (please advise). > thanks > matt It sounds a bit like you want a work flow system, something like Galaxy, which can act as a GUI to command line tools (including BioPerl and Biopython scripts).
Galaxy is actually written in Python: http://galaxy.psu.edu/ Peter From chapmanb at 50mail.com Thu Aug 13 08:44:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 13 Aug 2009 08:44:32 -0400 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> Message-ID: <20090813124432.GB90165@sobchak.mgh.harvard.edu> Hi Peter; > Is your aim to have a very fast pipeline, or an understandable > reference implementation (a worked example)? If this is for a real > pipeline, does it have to be FASTQ to FASTA? Ideally both. It is motivated by fitting into an experiment I am analyzing, but the purpose of the blog posting is to try and explain the logic and solicit feedback. In terms of the work, I don't need fastq downstream so was going the easier fasta route. But I can certainly see myself needing fastq in the future so prefer to be generalized. > Further to your blog comment about slicing SeqRecord objects slowing > things down, I agree - if you don't need the qualities, then having to > slice them is a pointless overhead. As usual in programming, there are > several options trading off elegant/general for speed. Personally I > would want to keep the qualities for the assembly/mapping step. Agreed. Unfortunately, it was unusably slow with the slicing as currently implemented: it ran for about 16 hours and was 1/3 of the way finished so was looking like a 2 day run, or about 12x slower than the reference implementation. > However, if you just want speed AND you really want to have a FASTQ > input file, try the underlying > Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain > strings, and handle the output yourself. 
Working directly with Python
> strings is going to be faster than using Seq and SeqRecord objects.
> You can even opt for outputting FASTQ files - as long as you leave the
> qualities as an encoded string, you can just slice that too. The
> downside is the code will be very specific. e.g. something along these
> lines:
>
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> in_handle = open(input_fastq_filename)
> out_handle = open(output_fastq_filename, "w")
> for title, seq, qual in FastqGeneralIterator(in_handle) :
>     #Do trim logic here on the string seq
>     if trim :
>         seq = seq[start:end]
>         qual = qual[start:end] # kept as ASCII string!
>     #Save the (possibly trimmed) FASTQ record:
>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
> out_handle.close()
> in_handle.close()

Nice -- I will have to play with this. I hadn't dug into the current SeqRecord slicing code at all but I wonder if there is a way to keep the SeqRecord interface but incorporate some of these speed ups for common cases like this FASTQ trimming. Brad From biopython at maubp.freeserve.co.uk Thu Aug 13 09:02:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 14:02:17 +0100 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <20090813124432.GB90165@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> <20090813124432.GB90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908130602n607add6fme67f7934234a5540@mail.gmail.com> On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman wrote:
>> However, if you just want speed AND you really want to have a FASTQ
>> input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator
>> parser which gives plain strings, and handle the output yourself. Working
>> directly with Python strings is going to be faster than using Seq and
>> SeqRecord objects.
You can even opt for outputting FASTQ files - as
>> long as you leave the qualities as an encoded string, you can just slice
>> that too. The downside is the code will be very specific. e.g. something
>> along these lines:
>>
>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>> in_handle = open(input_fastq_filename)
>> out_handle = open(output_fastq_filename, "w")
>> for title, seq, qual in FastqGeneralIterator(in_handle) :
>>     #Do trim logic here on the string seq
>>     if trim :
>>         seq = seq[start:end]
>>         qual = qual[start:end] # kept as ASCII string!
>>     #Save the (possibly trimmed) FASTQ record:
>>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
>> out_handle.close()
>> in_handle.close()
>
> Nice -- I will have to play with this. I hadn't dug into the current
> SeqRecord slicing code at all but I wonder if there is a way to keep
> the SeqRecord interface but incorporate some of these speed ups
> for common cases like this FASTQ trimming.

I suggest we continue this on the dev mailing list (this reply is cross posted), as it is starting to get rather technical. When you really care about speed, any object creation becomes an issue. Right now for *any* record we have at least the following objects being created: SeqRecord, Seq, two lists (for features and dbxrefs), two dicts (for annotation and the per letter annotation), and the restricted dict (for per letter annotations), and at least four strings (sequence, id, name and description). Perhaps some lazy instantiation might be worth exploring... for example make dbxref, features, annotations or letter_annotations into properties where the underlying object isn't created unless accessed. [Something to try after Biopython 1.51 is out?] I would guess (but haven't timed it) that for trimming FASTQ SeqRecords, a big part of the overhead is that we are using Python lists of integers (rather than just a string) for the scores.
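The cost being described here is easy to see in a standalone sketch: going via integer scores means decoding and re-encoding every letter, while the raw ASCII quality string can simply be sliced. This assumes Sanger FASTQ encoding (PHRED score = ASCII code minus 33); the helper names are made up for illustration, not Biopython API:

```python
def decode_quals(qual_string, offset=33):
    # Decode a Sanger FASTQ quality string into integer PHRED scores.
    return [ord(letter) - offset for letter in qual_string]

def encode_quals(scores, offset=33):
    # Re-encode integer PHRED scores as an ASCII quality string.
    return "".join(chr(score + offset) for score in scores)

qual = "IIIIHHHH!!!!"   # made-up 12 letter quality string
start, end = 0, 8       # pretend the trim logic says keep the first 8 bases

# Round-trip route: decode every letter, slice the list, re-encode.
trimmed_via_ints = encode_quals(decode_quals(qual)[start:end])

# Direct route: slice the encoded string, no per-letter work at all.
trimmed_raw = qual[start:end]
```

Both routes give the same trimmed string, but the direct slice does no per-letter work, which is the whole point of keeping the qualities encoded.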
So sticking with the current SeqRecord object as is, one speed up we could try would be to leave the FASTQ quality string as an encoded string (rather than turning into integer quality scores, and back again on output). It would be a hack, but adding this as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii", might work. We'd still need a new letter_annotations key, say "fastq_qual_ascii". This idea might work, but it does seem ugly. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 09:30:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 14:30:50 +0100 Subject: [Biopython] GFF Parsing In-Reply-To: <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> References: <3d03a61c0908140534k49b53531hd95aab478e486c56@mail.gmail.com> <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> Message-ID: <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> Hello Vipin, I think your question is probably aimed at Brad, so I will forward the attachment to him directly as well. Brad's GFF code isn't in Biopython yet, but we plan to add it later. Have you signed up to the Biopython mailing list? Once you have done this you can email biopython at lists.open-bio.org or biopython at biopython.org with questions like this. I have copied this reply to the list (without the attachment). Peter ---------- Forwarded message ---------- From: Vipin TS Date: Fri, Aug 14, 2009 at 1:47 PM Subject: GFF Parsing To: biopython-owner at lists.open-bio.org To whom it may concern, Thanks for the development of a quick parser for GFF files. It is very useful. I have a doubt, I used the GFFParser.py program to extract the genome annotation from the file attached with this mail. Please find the attached file. 
(Because of the size of the file, I included only a few lines here.) I wrote a Python script like this:

##################################################
import GFFParser
pgff = GFFParser.GFFMapReduceFeatureAdder(dict(), None)
cds_limit_info = dict(
    gff_type = ["gene","mRNA","CDS","exon"],
    gff_id = ["Chr1"]
    )
pgff.add_features('../PythonGFF/TAIR9_GFF_genes.gff3', cds_limit_info)
pgff.base["Chr1"]
final = pgff.base["Chr1"]
##################################################

By executing this script I am able to extract gene, mRNA and exon annotation from the specified GFF file. But I am unable to extract the CDS related information from the GFF file. It would be great if you could suggest an idea for including gene, mRNA, exon and CDS information in a single stretch of parsing of the GFF file. Thanks in advance, Vipin T S Scientific programmer Friedrich Miescher Laboratory of the Max Planck Society Spemannstrasse 37-39 D-72076 Tuebingen Germany From rodrigo_faccioli at uol.com.br Fri Aug 14 16:49:53 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Fri, 14 Aug 2009 17:49:53 -0300 Subject: [Biopython] SEQRES PDB module Message-ID: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> Hello, Sorry about my general question. However, I've read the source code of the PDB module and I haven't found how I can work with the SEQRES section of a PDB file. My doubt is: Is there a method such as get_SeqRes?
Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From chapmanb at 50mail.com Fri Aug 14 16:51:13 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Aug 2009 16:51:13 -0400 Subject: [Biopython] GFF Parsing In-Reply-To: <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> References: <3d03a61c0908140534k49b53531hd95aab478e486c56@mail.gmail.com> <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> Message-ID: <20090814205113.GI90165@sobchak.mgh.harvard.edu> Hi all; Peter, thanks for forwarding this along. Vipin: > By executing this script I am able to extract gene, mRNA and exon annotation > from specified GFF file. But I am unable to extract the CDS related > information from GFF file. > It will be great if you can suggest me an idea to include gene, mRNA, exon > and CDS information in a single stretch of parsing of GFF file. Sure, the CDS features are present in two places within the feature tree. The first is as sub-sub features of genes:

gene -> mRNA -> CDS

the second is as sub features of proteins:

protein -> CDS

It's a bit of a confusing way to do it, in my opinion, but this is the nesting defined in the Arabidopsis GFF file, so the parser respects it and puts them where they are supposed to be. Below is an updated script which should demonstrate where the CDS features are; you can use either way to access them as the same CDSs are present under both features. This also uses the updated API for parsing, which is much cleaner and will hopefully be what is in Biopython.
There is some initial documentation here: http://www.biopython.org/wiki/GFF_Parsing Hope this helps, Brad

import sys
from BCBio.GFF import GFFParser

in_file = sys.argv[1]
parser = GFFParser()
limit_info = dict(
    gff_type = ["protein", "gene", "mRNA", "CDS", "exon"],
    gff_id = ["Chr1"],
)
in_handle = open(in_file)
for rec in parser.parse(in_handle, limit_info=limit_info):
    print rec.id
    for feature in rec.features:
        if feature.type == "protein":
            print feature.type, feature.id
            for sub in feature.sub_features:
                if sub.type == "CDS":
                    print sub.type
        elif feature.type == "gene":
            for sub in feature.sub_features:
                if sub.type == "mRNA":
                    print sub.type, sub.id
                    for sub_sub in sub.sub_features:
                        if sub_sub.type == "CDS":
                            print sub_sub.type
in_handle.close()

From mattkarikomi at gmail.com Fri Aug 14 22:12:36 2009 From: mattkarikomi at gmail.com (Matt Karikomi) Date: Fri, 14 Aug 2009 22:12:36 -0400 Subject: [Biopython] biopython mashup simmilar to lasergene In-Reply-To: <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> References: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> Message-ID: <95e3e9cc0908141912n62b4e98aga2834a524ef15aa9@mail.gmail.com> this is exactly what i had in mind as far as a software development model and the implementation of the aforementioned modules. my current work is less genome-wide exploration (i am interested in doing more of this in the future) and more just cloning/recombineering (conventional knockouts and such). of course the extensibility of Galaxy means it can be made to handle anything in the way of analysis and manipulation. On Thu, Aug 13, 2009 at 5:56 AM, Peter wrote: > On Thu, Aug 13, 2009 at 3:10 AM, Matt Karikomi > wrote: > > i use the lasergene suite to manage molecular cloning projects. > > in projects like this, the visual presentation of both data and > > workflow history is crucial.
it seems like the GUI of this software > > suite could be recapitulated by a mashup of modules from bioperl > > and/or biopython while at the same time providing a rich API which > > will never exist in lasergene. > > has there been any attempt to mask the powerful script-dependent > > functionality of these open-source modules in some form of GUI? i am > > envisioning something like the [web based] Primer3 Plus interface to > > the C implementation of Primer3 (obviously wider in scope). sorry if > > this is the wrong list (please advise). > > thanks > > matt > > It sounds a bit like you want a work flow system, something like > Galaxy, which can act as a GUI to command line tools (including > BioPerl and Biopython scripts). Galaxy is actually written in Python: > http://galaxy.psu.edu/ > > Peter > From chapmanb at 50mail.com Mon Aug 17 07:58:22 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 07:58:22 -0400 Subject: [Biopython] Biopython 1.51 released Message-ID: <20090817115822.GA12768@sobchak.mgh.harvard.edu> Biopythonistas; We're pleased to announce the release of Biopython 1.51. This new stable release enhances version 1.50 (released in April) by extending the functionality of existing modules, adding a set of application wrappers for popular alignment programs and fixing a number of minor bugs. Sources and Windows Installer are available from the downloads page: http://biopython.org/wiki/Download In particular, the SeqIO module can now write Genbank files that include features, and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using Solexa, Sanger and Illumina variants using conventions agreed upon with the BioPerl and EMBOSS projects. Biopython 1.51 is the first stable release to include the Align.Applications module which allows users to define command line wrappers for popular alignment programs including ClustalW, Muscle and T-Coffee.
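For readers new to the idea, these wrappers essentially turn keyword parameters into a command line string that can then be run via the subprocess module. A toy illustration of the concept only - build_command and its option names are invented here, this is not the actual Bio.Align.Applications API:

```python
def build_command(program, **options):
    """Build a command line string from keyword options, wrapper-style.

    Options are emitted in sorted order as -name=value flags, roughly
    mimicking how an alignment tool wrapper assembles its command line.
    """
    parts = [program]
    for name, value in sorted(options.items()):
        parts.append("-%s=%s" % (name, value))
    return " ".join(parts)

# A ClustalW-like invocation (option names here are illustrative):
command = build_command("clustalw", infile="example.fasta", outfile="example.aln")
# The resulting string would then be executed with the subprocess module.
```

The real wrappers add per-parameter validation on top of this basic string-building idea.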
Bio.Fasta and the application tools ApplicationResult and generic_run() have been marked as deprecated - Bio.Fasta has been superseded by SeqIO's support for the Fasta format and we provide documentation for using the subprocess module from the Python Standard Library as a more flexible approach to calling applications. As always, the Tutorial and Cookbook has been updated to document all the changes: http://biopython.org/wiki/Documentation Thank you to everyone who tested our 1.51 beta or submitted bugs since our last stable release and to all our contributors. Brad From biopython at maubp.freeserve.co.uk Mon Aug 17 09:54:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 14:54:43 +0100 Subject: [Biopython] SEQRES PDB module In-Reply-To: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> References: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> Message-ID: <320fb6e00908170654j482cd07bvb35677555fa6c5d@mail.gmail.com> On Fri, Aug 14, 2009 at 9:49 PM, Rodrigo faccioli wrote: > Hello, > > Sorry about my general question. However, I've read the source-code of PDB > module and I haven't found how can I work with SEQRES section of PDB file? > > My doubt is: Is there a method such as get_SeqRes? > > Thanks, Biopython has limited support for parsing the PDB header information, and does not (currently) do anything with the SEQRES lines. You can usually infer the amino acid sequence from the 3D data itself (although this is complicated if there are gaps, for example residues whose coordinates were not resolved). What are you trying to do? It might be simplest to download the sequences from the PDB as simple FASTA files.
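In the meantime, SEQRES records are simple fixed-column lines which are easy to pull out by hand. A minimal sketch (column positions per the PDB file format specification; read_seqres is a made-up helper, not a Bio.PDB method):

```python
def read_seqres(lines):
    """Collect SEQRES residue names per chain from PDB file lines.

    Returns a dict mapping chain identifier to a list of three-letter
    residue names, in SEQRES order.
    """
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]           # column 12 holds the chain ID
            residues = line[19:].split()  # residue names start at column 20
            chains.setdefault(chain_id, []).extend(residues)
    return chains

# Demo on two hand-written SEQRES lines; a real file handle works the same,
# e.g. chains = read_seqres(open("pdb1xyz.ent")):
sample = [
    "SEQRES   1 A    8  GLY ILE VAL GLU GLN CYS CYS THR",
    "SEQRES   1 B    3  ALA ALA ALA",
]
chains = read_seqres(sample)
```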
Peter From bartomas at gmail.com Tue Aug 18 06:40:18 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 18 Aug 2009 11:40:18 +0100 Subject: [Biopython] Biogeography module Message-ID: Hi, I've been looking at the Biogeography module ( http://biopython.org/wiki/BioGeography) currently under development. It seems incredibly interesting and useful. The thing that would be really useful in the tutorial would be to show a step by step example of the commands to execute during a complete workflow of the module, from the retrieval of gbif records to the calculation of statistics of phylogenetic trees per region and generation of kml/shapefiles. In the current state of the tutorial it is hard to know how the data fed into the calculation of a tree summary object can be generated. Congratulations on the great work on this module. I look forward to using it. All the best, Tomas Bar From biopython at maubp.freeserve.co.uk Wed Aug 19 05:40:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 10:40:41 +0100 Subject: [Biopython] Fwd: Biopython package for Fedora/RedHat In-Reply-To: References: <320fb6e00908171002j6dd6da63l49f0fabd5866a332@mail.gmail.com> <74d46tptxc.fsf@allele2.eebweb.arizona.edu> <320fb6e00908180309j54c9d2a7qe06a4a804e280f65@mail.gmail.com> Message-ID: <320fb6e00908190240w69df3c8bve6bdb3ce77b60a4@mail.gmail.com> Hi all, I'd like to thank Alex Lancaster for updating the Fedora packages for Biopython 1.51 (including a patch for the flex issue, Bug 2619). http://bugzilla.open-bio.org/show_bug.cgi?id=2619 Those of you involved in testing Fedora packages, please give this a go - positive feedback will get this into stable F-10 and F-11 sooner (as Alex explains below). Thanks, Peter ---------- Forwarded message ---------- From: Alex Lancaster Date: Wed, Aug 19, 2009 at 12:30 AM Subject: Re: Biopython package for Fedora/RedHat To: Peter Hi Peter, I updated the Biopython wiki to point to that page.
https://admin.fedoraproject.org/community/?package=python-biopython#package_maintenance/package_overview [...] OK, updates in CVS done:

* Tue Aug 18 2009 Alex Lancaster - 1.51-1
- Update to upstream 1.51
- Drop mx {Build}Requires, no longer used upstream
- Remove Martel modules, no longer distributed upstream
- Add flex to BuildRequires, patch setup to build
  Bio.PDB.mmCIF.MMCIFlex as per upstream:
  http://bugzilla.open-bio.org/show_bug.cgi?id=2619

I have done builds for rawhide (although probably won't be included for a while as rawhide is frozen while a Beta for F-12 is being tested), and there are pending updates for F-10 and F-11 which will be pushed to updates-testing soon (you can add comments to the updates without being a Fedora developer): https://admin.fedoraproject.org/updates/python-biopython-1.51-1.fc10 https://admin.fedoraproject.org/updates/python-biopython-1.51-1.fc11 Once in updates-testing for a while and we get some feedback (i.e. "karma" votes), then I will push them to stable for F-10 and F-11. Feel free to forward this information to the mailing list (I am subscribed, but I read the list via GMANE to cut down on volume, so I don't always get time to read the list). [...] Alex From biopython at maubp.freeserve.co.uk Wed Aug 19 09:26:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 14:26:18 +0100 Subject: [Biopython] Paired end SFF data Message-ID: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> Hi all, We've been talking about adding Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format (used for Roche 454 data). This is a public standard documented here: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats One of the open questions was how to deal with paired end data.
The Roche 454 website has a nice summary of how this works: http://www.454.com/products-solutions/experimental-design-options/multi-span-paired-end-reads.asp Basically after sample preparation, you have DNA fragments containing the 3' end of your sequence, a known linker, and then the 5' end of your sequence. The sequencing machine doesn't need to know what the magic linker sequence is, and (I infer) after sample preparation, everything proceeds as normal for single end 454 sequencing. The upshot is the SFF file for a paired end read is exactly like any other SFF file (apparently even for the XML meta data Roche include), just most of the reads should have a "magic" linker sequence somewhere in them. I have located some publicly available Roche SFF files at the Sanger Centre which include some paired end reads (note the rules about publishing analysis of this data): http://www.sanger.ac.uk/Projects/Echinococcus/ ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/ For example, the 2008_09_03.tar.gz archive contains a single 446MB file FGGXRDY01.sff with 278801 reads. ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/2008_09_03.tar.gz This is the XML meta data from FGGXRDY01.sff, 454 FGGXRDY R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE /data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis /data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis 1.1.03 Nothing in this XML meta data says this is for paired end reads (nor can this be specified elsewhere in the SFF file format). However of the 278801 reads in FGGXRDY01.sff, about a third (108823 if you look before trimming) have a perfect match to the FLX linker: GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC. 
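For the perfect-match case, spotting and splitting on the linker needs nothing more than a substring search on the plain sequence string. A standalone sketch only - real reads need the fuzzy matching heuristics mentioned above, and non-palindromic linkers would also need a reverse complement check (split_on_linker is an invented name, not a Biopython function):

```python
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"

def split_on_linker(seq, linker=FLX_LINKER):
    """Split a read into (left mate, right mate) at an exact linker match.

    Returns None when the linker is not found. Only exact matches are
    handled here - sequencing errors in the linker region need heuristics.
    """
    index = seq.find(linker)
    if index == -1:
        return None
    return seq[:index], seq[index + len(linker):]

# Demo on a made-up read with the linker in the middle:
mates = split_on_linker("ACGTACGT" + FLX_LINKER + "TTTTGGGG")
unpaired = split_on_linker("ACGTACGTACGT")
```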
Note that the linker sequence depends on how the sample was prepared, and differs for different Roche protocols. e.g. According to the wgs-assembler documentation the known Roche 454 Titanium paired end linkers are instead: TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA See http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs This is a good news/bad news situation for Bio.SeqIO. The bad news is that identifying the paired end reads means knowing the linker sequence(s) used, and finding them within each read. If the read was sequenced perfectly, this is easy - but normally some heuristics are needed. I see this as outside the scope of basic file parsing (i.e. not something to go in Bio.SeqIO, but maybe in Bio.SeqUtils or Bio.Sequencing). The good news is that Bio.SeqIO can treat paired end SFF files just like single end reads - we don't have to worry about complicated new Seq/SeqRecord objects to hold short reads separated by an unknown region of estimated length. Peter From biopython at maubp.freeserve.co.uk Wed Aug 19 11:52:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 16:52:52 +0100 Subject: [Biopython] Paired end SFF data In-Reply-To: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> References: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> Message-ID: <320fb6e00908190852i3b2b3fe3l2e44b2aa427f4cea@mail.gmail.com> On Wed, Aug 19, 2009 at 2:26 PM, Peter wrote: > > Basically after sample preparation, you have DNA fragments containing > the 3' end of your sequence, a known linker, and then the 5' end of > your sequence. The sequencing machine doesn't need to know what the > magic linker sequence is, and (I infer) after sample preparation, > everything proceeds as normal for single end 454 sequencing. 
The > upshot is the SFF file for a paired end read is exactly like any other > SFF file (apparently even for the XML meta data Roche include), just > most of the reads should have a "magic" linker sequence somewhere in > them. > > ... FLX linker: > GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC. > > Note that the linker sequence depends on how the sample was prepared, > and differs for different Roche protocols. e.g. According to the > wgs-assembler documentation the known Roche 454 Titanium paired end > linkers are instead: > TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and > CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA > See http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs According to the MIRA documentation, local sequencing centres may also use their own linker sequences (and have been known to modify the adaptor sequences), which would make things more complicated. http://www.chevreux.org/uploads/media/mira3_faq.html#section_10 Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 05:48:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:48:53 +0100 Subject: [Biopython] Deprecating Bio.Prosite and Bio.Enzyme Message-ID: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com> Hi all, Bio.Prosite and Bio.Enzyme were declared obsolete in Release 1.50, being replaced by Bio.ExPASy.Prosite and Bio.ExPASy.Enzyme, respectively. Are there any objections to deprecating Bio.Prosite and Bio.Enzyme for the next release? Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 05:51:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:51:19 +0100 Subject: [Biopython] Deprecating Bio.EZRetrieve, NetCatch, FilteredReader and SGMLHandle Message-ID: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> Hi all, The minor modules Bio.EZRetrieve, Bio.NetCatch, Bio.File.SGMLHandle, Bio.FilteredReader were declared obsolete in Release 1.50. 
Are there any objections to us deprecating them in the next release? Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 05:57:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:57:41 +0100 Subject: [Biopython] Removing deprecated module Bio.Ndb Message-ID: <320fb6e00908200257w749c0650jd16fcc1648fb1c4b@mail.gmail.com> Hi all, The Bio.Ndb module was deprecated almost a year ago in Biopython 1.49 (Nov 2008), as the website it parsed has been redesigned. Unless there are any objections (or offers to update the code), I'd like to remove this module for the next release of Biopython. Peter From kellrott at gmail.com Thu Aug 20 14:26:33 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 11:26:33 -0700 Subject: [Biopython] SQL Alchemy based BioSQL Message-ID: I've posted a git fork of biopython with a BioSQL system based on SQL Alchemy. It can be found at git://github.com/kellrott/biopython.git It successfully completes unit tests copied from test_BioSQL and test_BioSQL_SeqIO. The unit testing runs on sqlite. But it should abstract out to any database system that SQLAlchemy supports. From the web site, the list includes: SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase, Informix, and IBM DB2. Kyle From biopython at maubp.freeserve.co.uk Thu Aug 20 16:10:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 21:10:05 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: Message-ID: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Hi Kyle, Thanks for signing up to the mailing list to talk about this work. On Thu, Aug 20, 2009 at 7:26 PM, Kyle Ellrott wrote: > I've posted a git fork of biopython with a BioSQL system based on SQL > Alchemy. It can be found at git://github.com/kellrott/biopython.git > It successfully completes unit tests copied from test_BioSQL and > test_BioSQL_SeqIO. > The unit testing runs on sqlite.
But it should abstract out to any > database system that SQLAlchemy supports. From the web site, the list > includes: SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS > Access, Sybase, Informix, and IBM DB2. Sounds interesting - but can you explain your motivation? Brad Chapman had already suggested something with BioSQL and SQLAlchemy, but I can't find the emails right now. Maybe we talked about it in person at BOSC 2009... I forget. Brad? But what I think I said then was that while I like SQLAlchemy, and have used it with BioSQL as part of a web application, I don't see that we need it for Biopython's BioSQL support. We essentially have a niche ORM for going between the BioSQL tables and the Biopython SeqRecord object. I don't see more back end databases alone as a good reason for using SQLAlchemy in Biopython's BioSQL bindings. In most (all?) cases SQLAlchemy in turn calls something like MySQLdb to do the real work. You mention lots of other back ends supported by SQLAlchemy, but very few of them have BioSQL schemas - these currently exist only for PostgreSQL, MySQL, Oracle, HSQLDB, and Apache Derby. As you know (because it is in your branch, grin), Brad has done a schema for SQLite and got this working with Biopython already, and we already support MySQL and PostgreSQL. That just leaves Biopython lacking support for the existing Oracle, HSQLDB, and Apache Derby BioSQL schemas. As long as these have a Python binding following the Python Database API Specification v2.0, this shouldn't be hard. For example, extending Biopython's BioSQL support using cx_Oracle to talk to an Oracle database seems like a useful incremental improvement. [That wasn't meant to come across as negative, I'm just wary of adding a heavyweight dependency without a good reason] Something I would be interested in is a set of SQLAlchemy model definitions for the BioSQL tables (ideally database neutral).
I've got a very preliminary, partial and minimal set done - and I think Brad has some too. This would be useful for anyone wanting to go beyond the Biopython SeqRecord based BioSQL support. Peter From kellrott at gmail.com Thu Aug 20 16:57:29 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 13:57:29 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: > Sounds interesting - but can you explain your motivation? The primary motivation is Jython compatibility (which is the main purpose of the branch). MySQLdb depends on some C extensions which make it hard to port to Jython. I don't keep track of IronPython, but I would imagine it would be a similar situation on the .Net platform. Beta SQLAlchemy 0.6 (available in SVN right now, but soon to be released) supports the MySQL Connector/Java interface, so it works with Jython. Using this combination was the only way I could get a Jython BioPython to connect to a database. As a technical note, now that this works, it means that you can use BioPython and BioJava in the same memory space. I used BioPython's SQL code to get the data, and then passed it to BioJava's Smith-Waterman alignment code to calculate alignments, all in one script. > But what I think I said then was that while I like SQLAlchemy, > and have used it with BioSQL as part of a web application, I > don't see that we need it for Biopython's BioSQL support. We > essentially have a niche ORM for going between the BioSQL > tables and the Biopython SeqRecord object. Yes, but it's an ORM that only supports one form of Python. Let somebody else worry about wrapping the details of other systems like Jython. > [That wasn't meant to come across as negative, I'm just > wary of adding a heavyweight dependency without a good > reason] It doesn't have to replace the existing system.
It can sit alongside, and not get installed if SQLAlchemy isn't available. If we leave the naming as is, it won't affect anybody's code. But if they do want to use it, it can replace the original system with a simple change to the script's imports: from BioSQL import BioSQLAlchemy as BioSeqDatabase from BioSQL import BioSeqAlchemy as BioSeq And it should work exactly the same. > Something I would be interested in is a set of SQLAlchemy > model definitions for the BioSQL tables (ideally database > neutral). I've got a very preliminary, partial and minimal > set done - and I think Brad has some too. This would be > useful for anyone wanting to go beyond the Biopython > SeqRecord based BioSQL support. Yes, the way SQLAlchemy sets up Python data structures based on the structure of the database opens up a lot of cool ways to dynamically create queries. Kyle From biopython at maubp.freeserve.co.uk Thu Aug 20 17:21:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 22:21:30 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> On Thu, Aug 20, 2009 at 9:57 PM, Kyle Ellrott wrote: > >> Sounds interesting - but can you explain your motivation? > > The primary motivation is Jython compatibility (which is the main > purpose of the branch). MySQLdb depends on some C extensions which > make it hard to port to Jython. I don't keep track of IronPython, but > I would imagine it would be a similar situation on the .Net platform. > Beta SQLAlchemy 0.6 (available in SVN right now, but soon to be > released) supports the MySQL Connector/Java interface, so it works > with Jython. > Using this combination was the only way I could get a > Jython BioPython to connect to a database. Ah. That ties in with the other changes on your github tree (to work nicely with Jython) which had seemed unrelated to me.
I guess MySQLdb etc uses C code which means it won't work under Jython. I don't know enough about Jython to say if there are any other alternatives to using SQLAlchemy. > As a technical note, now that this works, it means that you can > use BioPython and BioJava in the same memory space. I > used BioPython's SQL code to get the data, and then passed > it to BioJava's Smith-Waterman alignment code to calculate > alignments, all in one script. This might be a silly question, but why not just use BioJava to talk to BioSQL instead? Or use Biopython's pairwise alignment code. Was the point just to demonstrate things working together? Peter From kellrott at gmail.com Thu Aug 20 17:59:08 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 14:59:08 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> Message-ID: > Ah. That ties in with the other changes on your github tree (to > work nicely with Jython) which had seemed unrelated to me. > > I guess MySQLdb etc uses C code which means it won't > work under Jython. I don't know enough about Jython to say > if there are any other alternatives to using SQLAlchemy. If you aren't using a layer of abstraction like SQLAlchemy, then you can use the standard Java SQL interfaces (JDBC). But code written for that would only work within Jython and be useless for CPython. > This might be a silly question, but why not just use BioJava to > talk to BioSQL instead? Or use Biopython's pairwise alignment > code. Was the point just to demonstrate things working together? To prove that it could be done was part of the point. But there is also a 'cross training' attitude about it. BioPython seems more lightweight/easier to use, but has heavier requirements on installing external applications.
BioJava can be harder to use, but it has lots more embedded functionality (built-in dynamic programming and HMM code). If I can get both working in the same environment, then I get the best of both worlds. Kyle From biopython at maubp.freeserve.co.uk Fri Aug 21 05:27:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Aug 2009 10:27:42 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> Message-ID: <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> On Thu, Aug 20, 2009 at 10:59 PM, Kyle Ellrott wrote: > >> Ah. That ties in with the other changes on your github tree (to >> work nicely with Jython) which had seemed unrelated to me. >> >> I guess MySQLdb etc uses C code which means it won't >> work under Jython. I don't know enough about Jython to say >> if there are any other alternatives to using SQLAlchemy. > > If you aren't using a layer of abstraction like SQLAlchemy, then you > can use the standard Java SQL interfaces (JDBC). But code written for > that would only work within Jython and be useless for CPython. Still, it might be worthwhile. Assuming it can be done as a (Jython specific) modular backend to the existing BioSQL framework, it should be a less invasive change. >> This might be a silly question, but why not just use BioJava to >> talk to BioSQL instead? Or use Biopython's pairwise alignment >> code. Was the point just to demonstrate things working together? > > To prove that it could be done was part of the point. But there is > also a 'cross training' attitude about it. Fair enough. > BioPython seems more lightweight/easier to use, but has > heavier requirements on installing external applications. Biopython does often wrap external command line tools, yes.
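The pattern those wrappers follow amounts to building an argument list and handing it to the subprocess module, then capturing the output. Here is a generic sketch; a real wrapper would invoke a tool such as muscle or clustalw, but sys.executable stands in so the example runs anywhere:

```python
# Generic sketch of wrapping an external command line tool: build the
# command as an argument list, run it via subprocess, capture output.
# sys.executable is a stand-in for a real tool like muscle or clustalw.
import subprocess
import sys

cmdline = [sys.executable, "-c", "print('pretend alignment output')"]
proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE, universal_newlines=True)
stdout, stderr = proc.communicate()
print(proc.returncode)   # 0 on success
print(stdout.strip())    # pretend alignment output
```

Using an argument list rather than a shell string avoids quoting problems with filenames containing spaces.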
> BioJava can be harder to use, but it has lots more embedded > functionality (built-in dynamic programming and HMM code). Biopython does have its own HMM and pairwise alignment code written in Python (and for Bio.pairwise2 we also have a faster C code version, but you wouldn't get that under Jython). These modules are not covered in the tutorial (if anyone wants to help). > If I can get both working in the same environment, then I get > the best of both worlds. Absolutely. Peter From biopython at maubp.freeserve.co.uk Fri Aug 21 05:51:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Aug 2009 10:51:07 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> Message-ID: <320fb6e00908210251t15b7ea43sa41ed91c42db8385@mail.gmail.com> On Fri, Aug 21, 2009 at 10:27 AM, Peter wrote: > On Thu, Aug 20, 2009 at 10:59 PM, Kyle Ellrott wrote: >> >>> Ah. That ties in with the other changes on your github tree (to >>> work nicely with Jython) which had seemed unrelated to me. >>> >>> I guess MySQLdb etc uses C code which means it won't >>> work under Jython. I don't know enough about Jython to say >>> if there are any other alternatives to using SQLAlchemy. >> >> If you aren't using a layer of abstraction like SQLAlchemy, then you >> can use the standard Java SQL interfaces (JDBC). But code written for >> that would only work within Jython and be useless for CPython. > > Still, it might be worthwhile. Assuming it can be done as a > (Jython specific) modular backend to the existing BioSQL > framework, it should be a less invasive change. Would this mean using zxJDBC (included with Jython 2.1+)?
http://wiki.python.org/jython/UserGuide#database-connectivity-in-jython That sounds worth looking into to me. Peter From chapmanb at 50mail.com Fri Aug 21 08:46:14 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 21 Aug 2009 08:46:14 -0400 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: <20090821124614.GH26023@sobchak.mgh.harvard.edu> Hi all; Kyle: > > I've posted a git fork of biopython with a BioSQL system based on SQL > > Alchemy. It can be found at git://github.com/kellrott/biopython.git > > It successfully completes unit tests copied from test_BioSQL and > > test_BioSQL_SeqIO. Awesome. Peter: > Brad Chapman had already suggested something with BioSQL > and SQLAlchemy, but I can't find the emails right now. Maybe > we talked about it in person at BOSC 2009... I forget. Brad? Yup, I was floating this idea around. It's great to see someone tackling it. > But what I think I said then was that while I like SQLAlchemy, > and have used it with BioSQL as part of a web application, I > don't see that we need it for Biopython's BioSQL support. We > essentially have a niche ORM for going between the BioSQL > tables and the Biopython SeqRecord object. > > I don't see more back end databases alone as a good reason > for using SQLAlchemy in Biopython's BioSQL bindings. In > most (all?) cases SQLAlchemy in turn calls something like > MySQLdb to do the real work. SQLAlchemy is a pervasive and growing part of interacting with databases using Python. It encapsulates all of the nastiness of dealing with individual databases and has a large community resolving problems on more niche setups like Jython+MySQL. It also offers a nice object layer which is an alternative to the BioSeq interface we have built.
It's a lightweight install -- all Python and no external dependencies beyond the interfaces you would already need to have to access your database of choice. Why do we want to be learning and implementing database-specific things when there is code already taking care of these problems? Kyle implemented this so it can live beside the existing code base, which I think is a nice move. I'm +1 on including this and moving in the direction of SQLAlchemy. > Something I would be interested in is a set of SQLAlchemy > model definitions for the BioSQL tables (ideally database > neutral). I've got a very preliminary, partial and minimal > set done - and I think Brad has some too. This would be > useful for anyone wanting to go beyond the Biopython > SeqRecord based BioSQL support. Yes, this would be my only suggestion. It would be really useful to have the BioSQL tables mapped as object definitions and have the SQLAlchemy BioSQL based on these. This would open us up to other object-based implementations like Google App Engine or document database mappers. I pushed what I have so far in this direction on GitHub: http://github.com/chapmanb/bcbb/blob/master/biosql/BioSQL-SQLAlchemy_definitions.py I also implemented some of the objects in Google App Engine and replicated the current Biopython BioSQL structure for loading and retrieving objects: http://github.com/chapmanb/biosqlweb/tree/master/app/lib/python/BioSQL/GAE This is all partially finished, but please feel free to take whatever is useful. Brad From italo.maia at gmail.com Mon Aug 24 00:27:01 2009 From: italo.maia at gmail.com (Italo Maia) Date: Mon, 24 Aug 2009 01:27:01 -0300 Subject: [Biopython] How can i use muscle to align with biopython? Message-ID: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Ok, with Clustal I import MultipleAlignCL and do_alignment and do the stuff, but where are the alignment modules for MUSCLE? Using biopython 1.5.1 here with ubuntu 9. -- "Arrogance is the weapon of the weak."
=========================== Italo Moreira Campelo Maia Computer Science - UECE Web and Desktop Developer Java and Python Programmer Ubuntu User For Life! ----------------------------------------------------- http://www.italomaia.com/ http://twitter.com/italomaia/ http://eusouolobomal.blogspot.com/ =========================== From ajperry at pansapiens.com Mon Aug 24 03:16:41 2009 From: ajperry at pansapiens.com (Andrew Perry) Date: Mon, 24 Aug 2009 17:16:41 +1000 Subject: [Biopython] How can i use muscle to align with biopython? In-Reply-To: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> References: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Message-ID: On Mon, Aug 24, 2009 at 2:27 PM, Italo Maia wrote: > Ok, with Clustal I import MultipleAlignCL and do_alignment and do the > stuff, > but where are the alignment modules for MUSCLE? > Using biopython 1.5.1 here with ubuntu 9. > > According to the docs, Bio.Clustalw.MultipleAlignCL is now considered "semi-obsolete". The newer wrappers for multiple alignment programs (including MUSCLE) can be found as part of Bio.Align.Applications in Biopython 1.51. See the cookbook example here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc68 Andrew Perry Postdoctoral Fellow Whisstock Lab Department of Biochemistry and Molecular Biology Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. Mobile: +61 409 808 529 From biopython at maubp.freeserve.co.uk Mon Aug 24 05:51:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Aug 2009 10:51:14 +0100 Subject: [Biopython] How can i use muscle to align with biopython?
In-Reply-To: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> References: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Message-ID: <320fb6e00908240251j5ea5246cq1207d0e3033c3b04@mail.gmail.com> On Mon, Aug 24, 2009 at 5:27 AM, Italo Maia wrote: > Ok, with Clustal I import MultipleAlignCL and do_alignment and do the stuff, > but where are the alignment modules for MUSCLE? > Using biopython 1.5.1 here with ubuntu 9. That would be Biopython 1.51 (one, fifty-one). ;) Also Ubuntu 9 is unclear, I guess you meant Ubuntu 9.04 ("Jaunty Jackalope") which was released April 2009 (hence 9.04). Ubuntu releases are every six months, so their next release should be October 2009, and is expected to be called Ubuntu 9.10 ("Karmic Koala"). Anyway - thanks for answering about the alignments Andrew. To recap, to call the alignment tools, use Bio.Align.Applications and the Python module subprocess, and then parse the resulting alignment file with Bio.AlignIO. All in the tutorial :) Bio.Clustalw is now semi-obsolete, but can expect a gradual retirement (just like Bio.Fasta was gradually phased out) because it was widely used, and we don't want to force people to migrate their old code immediately. I wouldn't recommend using Bio.Clustalw for new scripts, try Bio.Align.Applications instead. Regards, Peter From wgheath at gmail.com Mon Aug 24 16:49:47 2009 From: wgheath at gmail.com (William Heath) Date: Mon, 24 Aug 2009 13:49:47 -0700 Subject: [Biopython] Wanting to teach a class on biopython that is particularly geared toward synthetic biology Message-ID: Hi All, I am a member of Tech Shop in Mountain View, CA and I want to teach a class on biopython that is specifically tailored toward the goals of synthetic biology. Can anyone help me to come up with a lesson plan for such a class? In particular I want to use bio bricks, and good opensource design programs for biobricks. Can anyone recommend any?
I also want to utilize any/all concepts in this training: http://www.edge.org/documents/archive/edge296.html Please let me know your ideas on such a lesson plan. -Tim From krother at rubor.de Tue Aug 25 04:58:44 2009 From: krother at rubor.de (Kristian Rother) Date: Tue, 25 Aug 2009 10:58:44 +0200 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: References: Message-ID: <4A93A7C4.1060109@rubor.de> Hi William, I have taught Python/BioPython in several courses for biologists. I wrote some individual lesson plans but they are not really readable for other people (see attachment). There is some material on-line, though: http://www.rubor.de/lehre_en.html Typically, the lessons consisted of 2h lecture + 1h exercises on language concepts + 3h exercises on a single, more biological task. The code written during the latter was reviewed and scored and the students knew about that. They had a two-week Python crash course before. Details on request. Best Regards, Kristian William Heath wrote: > Hi All, > I am a member of Tech Shop in Mountain View, CA and I want to teach a class > on biopython that is specifically tailored toward the goals of synthetic > biology. Can anyone help me to come up with a lesson plan for such a class? > In particular I want to use bio bricks, and good opensource design programs > for biobricks. Can anyone recommend any? > > I also want to utilize any/all concepts in this training: > > http://www.edge.org/documents/archive/edge296.html > > Please let me know your ideas on such a lesson plan. > > > -Tim > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -------------- next part -------------- A non-text attachment was scrubbed...
Name: Lesson2_Plan.pdf Type: application/pdf Size: 56732 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Tue Aug 25 05:56:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 10:56:42 +0100 Subject: [Biopython] Biopython 1.51 in Debian repository Message-ID: <320fb6e00908250256mfafa20rbf699ad63e618dff@mail.gmail.com> Hi all, I'd like to thank Philipp Benner for updating the Debian packages for Biopython 1.51 (including handling dropping Martel and removing mxTextTools from our dependencies). This is now in Debian unstable (sid), and will as usual progress out to Debian testing and also Ubuntu eventually. http://packages.debian.org/unstable/python/python-biopython http://packages.debian.org/unstable/python/python-biopython-sql Thanks! Peter From wgheath at gmail.com Tue Aug 25 13:06:49 2009 From: wgheath at gmail.com (William Heath) Date: Tue, 25 Aug 2009 10:06:49 -0700 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: <4A93A7C4.1060109@rubor.de> References: <4A93A7C4.1060109@rubor.de> Message-ID: This is amazing thanks! -Tim On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote: > > Hi William, > > I was teaching Python/BioPython in several courses for Biologists. I wrote > some individual lesson plans but they are rather not readable for other > people (see attachment). There is some material on-line, though: > > http://www.rubor.de/lehre_en.html > > Typically, the lessons consisted of 2h lecture + 1h exercises on language > concepts + 3h exercises on a single, more biological task. The code written > during the latter was reviewed and scored and the students knew about that. > They had a two-week Python crash course before. > > Details on request. > > Best Regards, > Kristian > > > William Heath schrieb: > >> Hi All, >> I am a member of Tech Shop in Mountain View, CA and I want to teach a >> class >> on biopython that is specifically tailored toward the goals of synthetic >> biology. 
Can anyone help me to come up with lesson plan for such a class? >> In particular I want to use bio bricks, and good opensource design >> programs >> for biobricks. Can anyone recommend any? >> >> I also want to utilize any/all concepts in this training: >> >> http://www.edge.org/documents/archive/edge296.html >> >> Please let me know your ideas on such a lesson plan. >> >> >> -Tim >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> >> > > From wgheath at gmail.com Tue Aug 25 13:08:27 2009 From: wgheath at gmail.com (William Heath) Date: Tue, 25 Aug 2009 10:08:27 -0700 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: References: <4A93A7C4.1060109@rubor.de> Message-ID: I am very interested in common bio python tasks as they relate specifically to synthetic biology. Could you give me some examples of such tasks? -Tim On Tue, Aug 25, 2009 at 10:06 AM, William Heath wrote: > This is amazing thanks! > -Tim > > > On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote: > >> >> Hi William, >> >> I was teaching Python/BioPython in several courses for Biologists. I wrote >> some individual lesson plans but they are rather not readable for other >> people (see attachment). There is some material on-line, though: >> >> http://www.rubor.de/lehre_en.html >> >> Typically, the lessons consisted of 2h lecture + 1h exercises on language >> concepts + 3h exercises on a single, more biological task. The code written >> during the latter was reviewed and scored and the students knew about that. >> They had a two-week Python crash course before. >> >> Details on request. >> >> Best Regards, >> Kristian >> >> >> William Heath schrieb: >> >>> Hi All, >>> I am a member of Tech Shop in Mountain View, CA and I want to teach a >>> class >>> on biopython that is specifically tailored toward the goals of synthetic >>> biology. 
Can anyone help me to come up with a lesson plan for such a >>> class? >>> In particular I want to use bio bricks, and good opensource design >>> programs >>> for biobricks. Can anyone recommend any? >>> >>> I also want to utilize any/all concepts in this training: >>> >>> http://www.edge.org/documents/archive/edge296.html >>> >>> Please let me know your ideas on such a lesson plan. >>> >>> >>> -Tim >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >>> >> >> > From kellrott at gmail.com Tue Aug 25 21:01:30 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 25 Aug 2009 18:01:30 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <20090821124614.GH26023@sobchak.mgh.harvard.edu> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> Message-ID: I've added a new database function lookupFeature to quickly search for sequence features without having to load all of them for any particular sequence. Because it's a non-standard function, I've taken the opportunity to play around with some more dynamic search features. Once we get the interface for these types of searches locked down on lookupFeature, a similar system could be implemented in the standard 'lookup' call. The work is posted at http://github.com/kellrott/biopython The following is an example of a working search that pulls all of the protein_ids from NC_004663.1 between 60,000 and 70,000 on the positive strand.
import sys from BioSQL import BioSQLAlchemy as BioSeqDataBase server = BioSeqDataBase.open_database( driver="mysql", user='test', host='localhost', db='testdb' ) db = server[ 'bacteria' ] seq = db.lookup( version="NC_004663.1" ) features = db.lookupFeatures( BioSeqDataBase.Column('strand') == 1, BioSeqDataBase.Column('start_pos') < 70000, BioSeqDataBase.Column('end_pos') > 60000, bioentry_id = seq._primary_id, name="protein_id" ) #print len(features) for feature in features: print feature > Kyle: >> > I've posted a git fork of biopython with a BioSQL system based on SQL >> > Alchemy. ?It can be found at git://github.com/kellrott/biopython.git >> > It successfully completes unit tests copied from test_BioSQL and >> > test_BioSQL_SeqIO. From biopython at maubp.freeserve.co.uk Wed Aug 26 07:10:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 12:10:44 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> On Wed, Aug 26, 2009 at 2:01 AM, Kyle Ellrott wrote: > I've added a new database function lookupFeature to quickly search for > sequences features without have to load all of them for any particular > sequence. > Because it's a non-standard function, I've taken the opportunity to > play around with some more dynamic search features. > Once we get the interface for these types of searches locked down on > lookupFeature, a similar system could be implemented in the standard > 'lookup' call. I'm not sure about that - all the other "lookup" functions already in BioSeqDatabase return DBSeqRecord objects don't they? See below for an alternative... > The work is posted at http://github.com/kellrott/biopython You could have posted this on the dev list, but this is debatable. If it all gets too technical we should probably move the thread... 
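In raw SQL terms, a range search like the one quoted below reduces to a query over the location table. A stdlib-only sketch, with the table cut down to the columns that matter and the rows invented for illustration:

```python
import sqlite3

# A cut-down BioSQL "location" table: just seqfeature_id, start_pos,
# end_pos and strand.  The three sample rows here are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (seqfeature_id INTEGER, "
             "start_pos INTEGER, end_pos INTEGER, strand INTEGER)")
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?)",
                 [(1, 100, 900, 1),        # outside the window
                  (2, 61000, 62500, 1),    # inside, plus strand
                  (3, 64000, 65000, -1)])  # inside, minus strand

# Features overlapping 60,000..70,000 on the plus strand:
rows = conn.execute("SELECT seqfeature_id FROM location WHERE "
                    "strand = 1 AND start_pos < 70000 "
                    "AND end_pos > 60000").fetchall()
print(rows)  # -> [(2,)]
```

How quickly this runs on millions of rows then comes down to whether start_pos/end_pos are indexed.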
> The following is an example of a working search that pulls all of the > protein_ids from NC_004663.1 between 60,000 and 70,000 on the positive > strand. > > import sys > from BioSQL import BioSQLAlchemy as BioSeqDataBase > > server = BioSeqDataBase.open_database( driver="mysql", user='test', > host='localhost', db='testdb' ) > db = server[ 'bacteria' ] > > seq = db.lookup( version="NC_004663.1" ) > > features = db.lookupFeatures( BioSeqDataBase.Column('strand') == 1, >         BioSeqDataBase.Column('start_pos') < 70000, >         BioSeqDataBase.Column('end_pos') > 60000, >         bioentry_id = seq._primary_id, name="protein_id" ) > > #print len(features) > for feature in features: >         print feature > Interesting - and potentially useful if you are interested in just part of the genome (e.g. an operon). Have you tested this on composite features (e.g. a join)? Without looking into the details of your code this isn't clear. I wonder how well this would scale with a big BioSQL database with hundreds of bioentry rows, and millions of seqfeature and location rows? You'd have to search all the location rows, filtering on the seqfeature_id linked to the bioentry_id you wanted. The performance would depend on the database server, the database software, how big the database is, and any indexing etc. Have you signed up to the BioSQL mailing list yet Kyle? It may help for discussing things like the SQL indexing. On the other hand, if all the record's features have already been loaded into memory, there would just be thousands of locations to look at - it might be quicker. This brings me to another idea for how this interface might work, via the SeqRecord - how about adding a method like this: def filtered_features(self, start=None, end=None, type=None): Note I think it would also be nice to filter on the feature type (e.g. CDS or gene). This method would return a sublist of the full feature list (i.e.
a list of those SeqFeature objects within the range given, and of the appropriate type). This could initially be implemented with a simple loop, but there would be scope for building an index or something more clever. [Note we are glossing over some potentially ambiguous cases with complex composite locations, where the "start" and "end" may differ from the "span" of the feature.] The DBSeqRecord would be able to do the same (just inherit the method), but you could try doing this via an SQL query, to get the database to tell you which seqfeature_ids are wanted, and then return those (existing) SeqFeature objects. [Note we should avoid returning new SeqFeature objects, as it could be very confusing to have multiple SeqFeature instances for the same feature in the database - as well as wasting memory, and time to build the new objects.] Peter From kellrott at gmail.com Wed Aug 26 12:40:00 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 26 Aug 2009 09:40:00 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> Message-ID: > I'm not sure about that - all the other "lookup" functions already in > BioSeqDatabase return DBSeqRecord objects don't they? See > below for an alternative... Although the example I provided didn't illustrate it, the reason I did it this way was to provide a function that could look up features without having to find their DBSeqRecords first. In my particular case, I've loaded all of the Genbank files from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want to be able to find proteins by their protein_id. lookupFeature( protein_id="NP_808921.1" ) > Have you tested this on composite features (e.g. a join)? > Without looking into the details of your code this isn't clear. That isn't supported in the current syntax.
But when using SQLAlchemy it's pretty easy to generate new queries by treating the tables and selection criteria like lists. Right now I just shove rules into an 'and' list, but one could also create an 'or' list. > I wonder how well this would scale with a big BioSQL database > with hundreds of bioentry rows, and millions of seqfeature > and location rows? You'd have to search all the location rows, > filtering on the seqfeature_id linked to the bioentry_id you > wanted. The performance would depend on the database > server, the database software, how big the database is, and > any indexing etc. Like I said, my test database is all of the published bacterial GenBank files from the NCBI ftp. It's about 1784 bioentry rows. About 27,838,905 seqfeature_qualifier_value rows. The location-based searches were pretty much instant. The only way that I've augmented the database is by adding the line Mysql: CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10)); Sqlite: CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value); So that I could look up features by their value quickly (like getting a protein by its protein_id). Note the '(10)' for MySQL, because for some reason it will only index text blobs if you define a prefix area for it to index... > On the other hand, if all the record's features have already been > loaded into memory, there would just be thousands of locations > to look at - it might be quicker. My experience so far is that pulling all the features for a single large chromosome can take a while. One of the other things that I did to speed things up (it's not actually faster, it just spreads the work out), is to build a DBSeqFeature with a lazy loader. It just stores its seqfeature_id and type until __getattr__ is hit, and only then does it bother to load the data in from the database.
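In plain Python that lazy pattern looks roughly like this; the class and its _load helper are hypothetical stand-ins for the real DBSeqFeature and its BioSQL query:

```python
# Sketch of a lazily loaded feature: only the seqfeature_id and type
# are stored up front; everything else is fetched on first attribute
# access via __getattr__.  _load() is a stand-in for the real database
# round trip keyed on _seqfeature_id.
class LazySeqFeature:
    def __init__(self, seqfeature_id, feature_type):
        self._seqfeature_id = seqfeature_id
        self.type = feature_type
        self._loaded = None  # deferred fields, fetched on demand

    def _load(self):
        # Stand-in for the real BioSQL query.
        return {"qualifiers": {"note": ["loaded on demand"]}}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so the
        # database hit happens on first access to a deferred field.
        if self._loaded is None:
            self._loaded = self._load()
        try:
            return self._loaded[name]
        except KeyError:
            raise AttributeError(name)


feat = LazySeqFeature(42, "CDS")
print(feat.type)        # stored up front, no database access
print(feat.qualifiers)  # this access triggers the (fake) load
```

Iterating over thousands of such objects is then cheap until an individual feature's details are actually needed.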
So if you bring in 2000 seqfeatures, you get the list back and read the first entry without having to first load the other 1999 entries. > This brings me to another idea for how this interface might work, > via the SeqRecord - how about adding a method like this: > > def filtered_features(self, start=None, end=None, type=None): > > Note I think it would also be nice to filter on the feature type (e.g. > CDS or gene). This method would return a sublist of the full > feature list (i.e. a list of those SeqFeature objects within the > range given, and of the appropriate type). This could initially > be implemented with a simple loop, but there would be scope > for building an index or something more clever. It may be worthwhile to just use the sqlite memory database. We store the schema in the module, and have a simple wrapper module that builds the sqlite RAM database and loads in the sequence file to the database. Something like: from Bio import SeqIODB handle = open("ls_orchid.gbk") for seq_record in SeqIODB.parse(handle, "genbank") : print seq_record.id But in the background SeqIODB would be creating a Sqlite memory database and loading ls_orchid into it, its __iter__ function would simply spit out DBSeqRecords for each of the bioentries... Kyle From biopython at maubp.freeserve.co.uk Wed Aug 26 16:24:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 21:24:34 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> Message-ID: <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> On Wed, Aug 26, 2009 at 5:40 PM, Kyle Ellrott wrote: >> I'm not sure about that - all the other "lookup" functions already in >> BioSeqDatabase return DBSeqRecord objects don't they? See >> below for an alternative...
> > Although the example I provided didn't illustrate it, the reason I did > it this way was to provide a function that could look up features > without having to find their DBSeqRecords first. In my particular > case, I've loaded all of the Genbank files from > ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want > to be able to find proteins by their protein_id. > > lookupFeature( protein_id="NP_808921.1" ) That makes sense. >> Have you tested this on composite features (e.g. a join)? >> Without looking into the details of your code this isn't clear. > > That isn't supported in the current syntax. But when using > SQLAlchemy it's pretty easy to generate new queries by > treating the tables and selection criteria like lists. Right > now I just shove rules into an 'and' list, but one could > also create an 'or' list. Even the bacterial GenBank files have joins ;) >> I wonder how well this would scale with a big BioSQL database >> with hundreds of bioentry rows, and millions of seqfeature >> and location rows? You'd have to search all the location rows, >> filtering on the seqfeature_id linked to the bioentry_id you >> wanted. The performance would depend on the database >> server, the database software, how big the database is, and >> any indexing etc. > > Like I said, my test database is all of the published bacterial > GenBank files from the NCBI ftp. It's about 1784 bioentry rows. > About 27,838,905 seqfeature_qualifier_value rows. The location-based > searches were pretty much instant. Cool. That sounds like more or less the same amount of data we have in our BioSQL database. > The only way that I've augmented the database is by adding the line > > Mysql: > CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10)); > Sqlite: > CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value); > > So that I could look up features by their value quickly (like getting > a protein by its protein_id).
Note the '(10)' for MySQL, because for > some reason it will only index text blobs if you define a prefix area > for it to index... We've done something similar here too - I'd have to check exactly what we used for the index though. >> On the other hand, if all the record's features have already been >> loaded into memory, there would just be thousands of locations >> to look at - it might be quicker. > > My experience so far is that pulling all the features for a single > large chromosome can take a while. > One of the other things that I did to speed things up (it's not > actually faster, it just spreads the work out), is to build a > DBSeqFeature with a lazy loader. It just stores its seqfeature_id > and type until __getattr__ is hit, and only then does it bother to > load the data in from the database. So if you bring in 2000 > seqfeatures, you get the list back and read the first entry without > having to first load the other 1999 entries. Yes - Leighton Pritchard has also done a DBSeqFeature object (with lazy loading of the qualifiers too). I guess your code will be similar.
We > store the schema in the module, and have a simple wrapper module that > builds the sqlite RAM database and loads in the sequence file to the > database. > Something like: > > from Bio import SeqIODB > handle = open("ls_orchid.gbk") > for seq_record in SeqIODB.parse(handle, "genbank") : > print seq_record.id > > But in the background SeqIODB would be creating a Sqlite memory > database and loading ls_orchid into it, its __iter__ function would > simply spit out DBSeqRecords for each of the bioentries... Is this idea just a shortcut for explicitly loading the GenBank file into a BioSQL database (which hopefully will include an SQLite backend option soon), and then iterating over its records? e.g. from Bio import SeqIO from BioSQL import BioSeqDatabase server = BioSeqDatabase.open_database(...) db = server["Orchids"] db.load(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) server.commit() #Now retrieve the records one by one... It would make sense to define __iter__ on the BioSQL database object (i.e. the BioSeqDatabase class) to allow iteration on ALL the records in the database (as DBSeqRecord objects). That should be a nice simple enhancement, allowing: for seq_record in db : print seq_record.id [And again, this has no SQLAlchemy dependence] Peter From kellrott at gmail.com Wed Aug 26 17:38:27 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 26 Aug 2009 14:38:27 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> Message-ID: > Yes - Leighton Pritchard has also done a DBSeqFeature object > (with lazy loading of the qualifiers too). I guess your code will > be similar.
This is something I think could well be worth merging > into BioSQL (and doesn't depend on SQLAlchemy at all). The version in my branch uses the SQLAlchemy query composition methods I wrote when porting DBSeqRecord from the original _retrieve_features function. So his code would probably be a shorter step for now. > Is this idea just a shortcut for explicitly loading the GenBank > file into a BioSQL database (which hopefully will include an > SQLite backend option soon), and then iterating over its > records? e.g. Yes, it would also have a copy of the biodb-sqlite schema stored as a string in the module, so it could build an in-RAM database on demand. Make the setup and loading automatic.
It would appear to be just like > a regular file parser. I would agree that once we have SQLite support in BioSQL officially, we can probably ship the schema within Biopython and make using it much more straightforward than the other BioSQL backends (which require the database software and schema to be installed manually). However, I would put the SQLite database on disk, not in RAM. > That way, if we start writing crazy feature > filter methods based on SQL queries, they can be easily reapplied to > file based usage. And we wouldn't have to write a feature filter for > the database objects and another for file based objects. If we are just talking about filtering the feature list (see thread http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006700.html ) then we don't need BioSQL - it seems like overkill. Peter From biopython at maubp.freeserve.co.uk Fri Aug 28 06:16:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:16:52 +0100 Subject: [Biopython] Fwd: Biopython post from kingaram@hanmail.net requires approval In-Reply-To: References: Message-ID: <320fb6e00908280316u503760c0ofb6d3e20d2378cac@mail.gmail.com> Hi all, I have just forwarded the following message to the list as it had been blocked with a "suspicious header". Could I remind people to please try and send "plain text" emails, rather than rich HTML formatting with pictures etc, as these are likely to get blocked by the mailing list. Thanks Peter ---------- Forwarded message ---------- From: "titt" To: Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) Subject: Finding protein ID using Entrez.efetch Hi all, I'm looking for the way to extract the data of protein ID numbers in the Genbank. I got my Genbank data and saved it as an xml file using this command.
from Bio import Entrez handle=Entrez.efetch(db="nuccore",id="256615878",rettype="gb") record=handle.read() save_file = open("record.xml","w") save_file.write(record) save_file.close() What I need is all the protein ID (For example: EEU21068.1) or GI number (for example: 256615878) in this Genbank page for the blast search. Could you let me know how to extract this information, save it in some format, and use it? Thank you, Aram From biopython at maubp.freeserve.co.uk Fri Aug 28 06:37:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:37:45 +0100 Subject: [Biopython] Finding protein ID using Entrez.efetch Message-ID: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> > To: > Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) > Subject: Finding protein ID using Entrez.efetch > > Hi all, > > I'm looking for the way to extract the data of protein ID numbers in > the Genbank. I got my Genbank data and saved it as an xml file using > this command. > > from Bio import Entrez > handle=Entrez.efetch(db="nuccore",id="256615878",rettype="gb") > record=handle.read() > save_file = open("record.xml","w") > save_file.write(record) > save_file.close() That did NOT save the record as XML format. You asked NCBI Entrez EFetch for a GenBank file (rettype="gb"). > What I need is all the protein ID (For example: EEU21068.1) or GI > number (for example: 256615878) in this Genbank page for the blast > search. Could you let me know how to extract this information, save > it in some format, and use it? If all you want is the accession, it is pointless to download the entire record (with its features and sequence). Instead try: >>> print Entrez.efetch(db="nuccore",id="256615878",rettype="acc", retmode="text").read() GG698814.1 Note that a nucleotide sequence doesn't have a protein ID! A gene nucleotide should have a single associated protein. A genome sequence will have many associated proteins (this seems to be what you want?).
If you really do want the GenBank file (e.g. for some other data), then first save it and then parse it using Bio.SeqIO like this: >>> from Bio import Entrez >>> net_handle = Entrez.efetch(db="nuccore",id="256615878",rettype="gb") >>> save_handle = open("record.gb", "w") >>> save_handle.write(net_handle.read()) >>> save_handle.close() >>> net_handle.close() Then, >>> from Bio import SeqIO >>> record = SeqIO.read(open("record.gb"), "gb") >>> print record.id GG698814.1 You can also look at the CDS features (proteins), and their lists of protein ID(s) and database cross references: >>> for feature in record.features : ... if feature.type != "CDS" : continue ... print feature.qualifiers.get("protein_id", []), ... print feature.qualifiers.get("db_xref", []) ... ['EEU21067.1'] ['GI:256615879'] ['EEU21068.1'] ['GI:256615880'] ['EEU21069.1'] ['GI:256615881'] ['EEU21070.1'] ['GI:256615882'] ['EEU21071.1'] ['GI:256615883'] ... However, if that is all you need, then it is a waste to download the full GenBank file. Try using NCBI Entrez ELink instead? http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 28 06:56:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:56:24 +0100 Subject: [Biopython] Finding protein ID using Entrez.efetch In-Reply-To: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> References: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> Message-ID: <320fb6e00908280356u3fc23b4bnbadf3ecebf96be82@mail.gmail.com> On Fri, Aug 28, 2009 at 11:37 AM, Peter wrote: >> To: >> Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) >> Subject: Finding protein ID using Entrez.efetch >> >> Hi all, >> >> I'm looking for the way to extract the data of protein ID numbers in >> the Genbank. ...
> > However, if that is all you need, then it is a waste to download the > full GenBank file. Try using NCBI Entrez ELink instead? > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html Try something based on this: >>> from Bio import Entrez >>> data = Entrez.read(Entrez.elink(db="protein", dbfrom="nuccore",id="256615878", retmode="xml")) >>> for db in data : ... print "Links for", db["IdList"], "from database", db["DbFrom"] ... for link in db["LinkSetDb"][0]["Link"] : print link["Id"] ... Links for ['256615878'] from database nuccore 256616663 256616662 ... 256615879 As we try to explain in the tutorial, the Entrez.read() XML parser turns the XML data into Python lists, dictionaries and strings. This reflects the deeply nested nature of the NCBI XML files - you have to dig into the hierarchy to get to the actual list of protein IDs. Peter From bartomas at gmail.com Fri Aug 28 10:17:08 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 28 Aug 2009 15:17:08 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species Message-ID: Hi, I'd like to use blast to measure homology between protein sequences of a species and polyketide sequences from a database. I've been looking in the Biopython tutorial (p.73) in the section about blast. I'd like to do something similar, like this: result_handle = NCBIWWW.qblast("blastp", "genbank", record.format("fasta")) Is it possible to add an option to restrict the search to genbank records that correspond to a given species?
Thanks very much From biopython at maubp.freeserve.co.uk Fri Aug 28 10:38:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 15:38:36 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: References: Message-ID: <320fb6e00908280738w5d5f1236ia50332157e24bf7b@mail.gmail.com> bar tomas wrote: > I'd like to do something similar, like this: > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format("fasta")) > > Is it possible to add an option to restrict the search to genbank records > that correspond to a given species? Yes, you can use a species-specific blast database or include an Entrez query, see for example: http://lists.open-bio.org/pipermail/biopython/2009-June/005215.html Peter From kelly.oakeson at utah.edu Fri Aug 28 11:19:07 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Fri, 28 Aug 2009 09:19:07 -0600 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: References: Message-ID: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Hello list, I would like to do something similar, but I would like to limit my blast search to just the microbial Taxonomic ID. Thanks, Kelly Oakeson kelly.oakeson at utah.edu On Aug 28, 2009, at 8:17 AM, bar tomas wrote: > Hi, > > I'd like to use blast to measure homology between protein sequences > of a > species and polyketide sequences from a database. > I've been looking in the Biopython tutorial (p.73) in the section about > blast. > I'd like to do something similar, like this: > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format > ("fasta")) > > Is it possible to add an option to restrict the search to genbank > records > that correspond to a given species?
> > Thanks very much > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Fri Aug 28 11:27:20 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 28 Aug 2009 16:27:20 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> References: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Message-ID: Hi, I just tried using a taxonomic id: The following both give the same result: result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string, entrez_query="house mouse[orgn]") result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string, entrez_query="txid10090[orgn]") (entrez_query needs to come after the non-keyword arguments or else the interpreter complains) On Fri, Aug 28, 2009 at 4:19 PM, Kelly F Oakeson wrote: > Hello list, > I would like to do something similar, but I would like to limit my > blast search to just the microbial Taxonomic ID. > > Thanks, > > Kelly Oakeson > kelly.oakeson at utah.edu > > > > > On Aug 28, 2009, at 8:17 AM, bar tomas wrote: > > > Hi, > > > > I'd like to use blast to measure homology between protein sequences > > of a > > species and polyketide sequences from a database. > > I've been looking in the Biopython tutorial (p.73) in the section about > > blast. > > I'd like to do something similar, like this: > > > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format > > ("fasta")) > > > > Is it possible to add an option to restrict the search to genbank > > records > > that correspond to a given species?
> > > Thanks very much > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From biopython at maubp.freeserve.co.uk Fri Aug 28 11:29:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 16:29:15 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> References: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Message-ID: <320fb6e00908280829h694bf960m7a7936569238b1c6@mail.gmail.com> On Fri, Aug 28, 2009 at 4:19 PM, Kelly F Oakeson wrote: > Hello list, > I would like to do something similar, but I would like to limit my > blast search to just the microbial Taxonomic ID. Then just change the Entrez query, e.g. Taxon ID 2 for eubacteria: from Bio.Blast import NCBIWWW fasta_string = """>Test ATGGCCAATACTCCTTCGGCCAAGAAGGCAGTGCGCAAGATCGCTGCCCGCACCGAGATCAACAAGTCCC GCCGTTCGCGCGTGCGCACTTTCGTGCGCAAGCTGGAAGACGCTCTGCTGAGCGGCGACAAGCAGGCAGC GGAAGTTGCGTTCAAGGCTGTTGAGCCTGAACTGATGCGCGCCGCCTCCAAGGGCGTGGTGCACAAGAAC ACCGCGGCCCGCAAGGTTTCGCGTCTTGCCAAGCGCGTGAAGGCTCTGAACGCCTGA """ result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="txid2[orgn]") Peter From crosvera at gmail.com Fri Aug 28 15:45:46 2009 From: crosvera at gmail.com (=?ISO-8859-1?Q?Carlos_R=EDos_Vera?=) Date: Fri, 28 Aug 2009 15:45:46 -0400 Subject: [Biopython] PBD SuperImpose Message-ID: Hello people, I'm trying to use the superimpose() method from the Bio.PDB module, but when I use the set_atoms() method with two atom lists, I got this: "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in size" So, my question is: How can I superimpose two structures with different sizes? Cheers and Thanks. PS: I attached the code that I'm using. -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.py Type: text/x-python Size: 992 bytes Desc: not available URL: From crosvera at gmail.com Fri Aug 28 15:48:18 2009 From: crosvera at gmail.com (=?ISO-8859-1?Q?Carlos_R=EDos_Vera?=) Date: Fri, 28 Aug 2009 15:48:18 -0400 Subject: [Biopython] Fwd: PBD SuperImpose In-Reply-To: References: Message-ID: Hello people, I'm trying to use the superimpose() method from the Bio.PDB module, but when I use the set_atoms() method with two atom lists, I got this: "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in size" So, my question is: How can I superimpose two structures with different sizes? Cheers and Thanks. PS: I paste the code that I'm using. -------code.py------ #!/usr/bin/env python import sys from Bio.PDB import * #get the PDB's path from user command line struct_path1 = sys.argv[1] name_struct1 = struct_path1.split('/')[-1].split('.')[0] struct_path2 = sys.argv[2] name_struct2 = struct_path2.split('/')[-1].split('.')[0] #parsing the PDBs parser = PDBParser() struct1 = parser.get_structure(name_struct1, struct_path1) struct2 = parser.get_structure(name_struct2, struct_path2) #get atoms list atoms1 = struct1.get_atoms() atoms2 = struct2.get_atoms() latoms1 = [] latoms2 = [] for a in atoms1: latoms1.append( a ) for a in atoms2: latoms2.append( a ) print latoms1 print latoms2 #SuperImpose sup = Superimposer() # Specify the atom lists # "fixed" and "moving" are lists of Atom objects # The moving atoms will be put on the fixed atoms sup.set_atoms(latoms1, latoms2) # Print rotation/translation/rmsd print "ROTRAN:", sup.rotran print "RMS:", sup.rms # Apply rotation/translation to the moving atoms sup.apply(latoms2) -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.py Type: text/x-python Size: 992 bytes Desc: not available URL: From rodrigo_faccioli at uol.com.br Sat Aug 29 09:17:25 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Sat, 29 Aug 2009 10:17:25 -0300 Subject: [Biopython] Fwd: PBD SuperImpose In-Reply-To: References: Message-ID: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com> Please, Could you inform the PDBIDs or PDB files that you are using? Cheers, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 2009/8/28 Carlos Ríos Vera > Hello people, > > I'm trying to use the superimpose() method from the Bio.PDB module, but > when > I use the set_atoms() method with two atom lists, I got this: > > "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in > size" > > So, my question is: How can I superimpose two structures with different > sizes? > > Cheers and Thanks. > > PS: I paste the code that I'm using.
> -------code.py------ > #!/usr/bin/env python > > import sys > from Bio.PDB import * > > #get the PDB's path from user command line > struct_path1 = sys.argv[1] > name_struct1 = struct_path1.split('/')[-1].split('.')[0] > > struct_path2 = sys.argv[2] > name_struct2 = struct_path2.split('/')[-1].split('.')[0] > > #parsing the PDBs > parser = PDBParser() > > struct1 = parser.get_structure(name_struct1, struct_path1) > struct2 = parser.get_structure(name_struct2, struct_path2) > > #get atoms list > atoms1 = struct1.get_atoms() > atoms2 = struct2.get_atoms() > > latoms1 = [] > latoms2 = [] > > for a in atoms1: > latoms1.append( a ) > for a in atoms2: > latoms2.append( a ) > > print latoms1 > print latoms2 > > #SuperImpose > sup = Superimposer() > # Specify the atom lists > # "fixed" and "moving" are lists of Atom objects > # The moving atoms will be put on the fixed atoms > sup.set_atoms(latoms1, latoms2) > # Print rotation/translation/rmsd > print "ROTRAN: "+ sup.rotran > print "RMS: " + sup.rms > # Apply rotation/translation to the moving atoms > sup.apply(moving) > > > > -- > http://crosvera.blogspot.com > > Carlos Ríos V. > Estudiante de Ing. (E) en Computación e Informática.
> Universidad del Bío-Bío > VIII Región, Chile > > Linux user number 425502 > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From biopython at maubp.freeserve.co.uk Sat Aug 29 09:48:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 29 Aug 2009 14:48:52 +0100 Subject: [Biopython] PBD SuperImpose In-Reply-To: References: Message-ID: <320fb6e00908290648t6b9c20f3p6d48a56cf1144ac5@mail.gmail.com> 2009/8/28 Carlos Ríos Vera : > Hello people, > > I'm trying to use the superimpose() method from the Bio.PDB module, but when > I use the set_atoms() method with two atom lists, I got this: > > "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in > size" > > So, my question is: How can I superimpose two structures with different > sizes? This sounds more like a methodology question than a Python question. You must decide how to map between the atoms of one chain and the atoms of the other. If they are different lengths, you will need to exclude some residues (e.g. peptide GHIL versus EGHILD, you would probably ignore the extra trailing/leading residues on the longer sequence). If in addition the residues are (at least in some cases) different amino acids, then you will probably only want to calculate the superposition using the backbone (or even just the C alpha atoms). One approach to this is to base the atomic mapping on a pairwise protein sequence alignment. Peter From crosvera at gmail.com Sat Aug 29 13:03:55 2009 From: crosvera at gmail.com (=?ISO-8859-1?Q?Carlos_R=EDos_Vera?=) Date: Sat, 29 Aug 2009 13:03:55 -0400 Subject: [Biopython] Fwd: PBD SuperImpose In-Reply-To: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com> References: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com> Message-ID: 2009/8/29 Rodrigo faccioli > Please, > > Could you inform the PDBIDs or PDB files that you are using?
> > trimero_r308A_600_ps2.pdb wt_600ps_trim.pdb These are the PDBs that I'm using. > Cheers, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structure Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > > > 2009/8/28 Carlos Ríos Vera > >> Hello people, >> >> I'm trying to use the superimpose() method from the Bio.PDB module, but >> when >> I use the set_atoms() method with two atom lists, I got this: >> >> "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in >> size" >> >> So, my question is: How can I superimpose two structures with different >> sizes? >> >> Cheers and Thanks. >> >> PS: I paste the code that I'm using. >> >> -------code.py------ >> #!/usr/bin/env python >> >> import sys >> from Bio.PDB import * >> >> #get the PDB's path from user command line >> struct_path1 = sys.argv[1] >> name_struct1 = struct_path1.split('/')[-1].split('.')[0] >> >> struct_path2 = sys.argv[2] >> name_struct2 = struct_path2.split('/')[-1].split('.')[0] >> >> #parsing the PDBs >> parser = PDBParser() >> >> struct1 = parser.get_structure(name_struct1, struct_path1) >> struct2 = parser.get_structure(name_struct2, struct_path2) >> >> #get atoms list >> atoms1 = struct1.get_atoms() >> atoms2 = struct2.get_atoms() >> >> latoms1 = [] >> latoms2 = [] >> >> for a in atoms1: >> latoms1.append( a ) >> for a in atoms2: >> latoms2.append( a ) >> >> print latoms1 >> print latoms2 >> >> #SuperImpose >> sup = Superimposer() >> # Specify the atom lists >> # "fixed" and "moving" are lists of Atom objects >> # The moving atoms will be put on the fixed atoms >> sup.set_atoms(latoms1, latoms2) >> # Print rotation/translation/rmsd >> print "ROTRAN: "+ sup.rotran >> print "RMS: " + sup.rms >> # Apply
rotation/translation to the moving atoms >> sup.apply(moving) >> >> >> >> -- >> http://crosvera.blogspot.com >> >> Carlos Ríos V. >> Estudiante de Ing. (E) en Computación e Informática. >> Universidad del Bío-Bío >> VIII Región, Chile >> >> Linux user number 425502 >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática. Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 From eric.talevich at gmail.com Sat Aug 29 13:12:50 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 29 Aug 2009 13:12:50 -0400 Subject: [Biopython] PBD SuperImpose Message-ID: <3f6baf360908291012y50456018pfcb58f9ab6d8463f@mail.gmail.com> > > 2009/8/28 Carlos Ríos Vera : > > Hello people, > > > > I'm trying to use the superimpose() method from the Bio.PDB module, but > when > > I use the set_atoms() method with two atom lists, I got this: > > > > "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ > in > > size" > > > > So, my question is: How can I superimpose two structures with different > size > > s? Peter : > > This sounds more like a methodology question than a Python > question. > > You must decide how to map between the atoms of one chain and > the atoms of the other. If they are different lengths, you will need > to exclude some residues (e.g. peptide GHIL versus EGHILD, you > would probably ignore the extra trailing/leading residues on the > longer sequence). > > If in addition the residues are (at least in some cases) different > amino acids, then you will probably only want to calculate the > superposition using the backbone (or even just the C alpha > atoms). > > One approach to this is to base the atomic mapping on a > pairwise protein sequence alignment.
> > Peter > > If you're trying to align the structures of two different proteins, or two different structures of the same protein, you might want to try MultiProt: http://bioinfo3d.cs.tau.ac.il/MultiProt/ It can handle more than two proteins at a time, too. (If that's overkill, then +1 for Peter's approach.) Eric From jjkk73 at gmail.com Sat Aug 29 15:10:41 2009 From: jjkk73 at gmail.com (jorma kala) Date: Sat, 29 Aug 2009 21:10:41 +0200 Subject: [Biopython] How to extract start and end positions of a sequence in blast output file Message-ID: Hi, I'm using Blast through the biopython module. Is it possible to retrieve start and end positions on the genome of an aligned sequence from a blast record object? (I've been looking at the Biopython tutorial, section 'the Blast record class', but haven't been able to find it.) Thank you very much From biopython at maubp.freeserve.co.uk Sun Aug 30 07:29:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 30 Aug 2009 12:29:08 +0100 Subject: [Biopython] How to extract start and end positions of a sequence in blast output file In-Reply-To: References: Message-ID: <320fb6e00908300429p4cd58f9dj1606df1f1242ed7b@mail.gmail.com> On Sat, Aug 29, 2009 at 8:10 PM, jorma kala wrote: > Hi, > I'm using Blast through the biopython module. > Is it possible to retrieve start and end positions on the genome of an > aligned sequence from a blast record object? Yes - see below. > (I've been looking at the Biopython tutorial, section 'the Blast record > class', but haven't been able to find it.) > Thank you very much Have you tried using the built-in help to find out more about the HSP object? e.g. >>> from Bio.Blast import NCBIXML >>> record = NCBIXML.read(open("xbt003.xml")) >>> help(record.alignments[0].hsps[0]) ... Or, have you come across the Python command dir?
This gives a listing of all the properties and methods of an object (although those starting with an underscore are special or private and should usually be ignored). e.g. >>> from Bio.Blast import NCBIXML >>> record = NCBIXML.read(open("xbt003.xml")) >>> dir(record.alignments[0].hsps[0]) ['__doc__', '__init__', '__module__', '__str__', 'align_length', 'bits', 'expect', 'frame', 'gaps', 'identities', 'match', 'num_alignments', 'positives', 'query', 'query_end', 'query_start', 'sbjct', 'sbjct_end', 'sbjct_start', 'score', 'strand'] The help text tells you this, but you could also guess from using dir - sbjct_start and sbjct_end are what you want (the start/end of the subject sequence, i.e. the database match), while query_start and query_end are those for your query sequence. Peter From chapmanb at 50mail.com Mon Aug 31 09:29:31 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 31 Aug 2009 09:29:31 -0400 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: References: <4A93A7C4.1060109@rubor.de> Message-ID: <20090831132931.GE75451@sobchak.mgh.harvard.edu> Hi Tim; I'm not aware of Python code to do BioBricks design, but a couple of repositories for synthetic biology code in Python are: SynBioPython: http://code.google.com/p/synbiopython/ My Synthetic Biology code: http://bitbucket.org/chapmanb/synbio/ Unfortunately these are more starting points than tutorial ready code. Hope this helps, Brad > I am very interested in common bio python tasks as they relate specifically > to synthetic biology. Could you give me some examples of such tasks? > -Tim > > On Tue, Aug 25, 2009 at 10:06 AM, William Heath wrote: > > > This is amazing thanks! > > -Tim > > > > > > On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote: > > > >> > >> Hi William, > >> > >> I was teaching Python/BioPython in several courses for Biologists. I wrote > >> some individual lesson plans but they are rather not readable for other > >> people (see attachment). 
There is some material on-line, though: > >> > >> http://www.rubor.de/lehre_en.html > >> > >> Typically, the lessons consisted of 2h lecture + 1h exercises on language > >> concepts + 3h exercises on a single, more biological task. The code written > >> during the latter was reviewed and scored and the students knew about that. > >> They had a two-week Python crash course before. > >> > >> Details on request. > >> > >> Best Regards, > >> Kristian > >> > >> > >> William Heath schrieb: > >> > >>> Hi All, > >>> I am a member of Tech Shop in Mountain View, CA and I want to teach a > >>> class > >>> on biopython that is specifically tailored toward the goals of synthetic > >>> biology. Can anyone help me to come up with lesson plan for such a > >>> class? > >>> In particular I want to use bio bricks, and good opensource design > >>> programs > >>> for biobricks. Can anyone recommend any? > >>> > >>> I also want to utilize any/all concepts in this training: > >>> > >>> http://www.edge.org/documents/archive/edge296.html > >>> > >>> Please let me know your ideas on such a lesson plan. > >>> > >>> > >>> -Tim > >>> _______________________________________________ > >>> Biopython mailing list - Biopython at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/biopython > >>> > >>> > >>> > >> > >> > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From wgheath at gmail.com Mon Aug 31 13:07:59 2009 From: wgheath at gmail.com (William Heath) Date: Mon, 31 Aug 2009 10:07:59 -0700 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: <20090831132931.GE75451@sobchak.mgh.harvard.edu> References: <4A93A7C4.1060109@rubor.de> <20090831132931.GE75451@sobchak.mgh.harvard.edu> Message-ID: Sounds good thanks! 
-Tim On Mon, Aug 31, 2009 at 6:29 AM, Brad Chapman wrote: > Hi Tim; > I'm not aware of Python code to do BioBricks design, but a couple of > repositories for synthetic biology code in Python are: > > SynBioPython: http://code.google.com/p/synbiopython/ > My Synthetic Biology code: http://bitbucket.org/chapmanb/synbio/ > > Unfortunately these are more starting points than tutorial ready > code. > > Hope this helps, > Brad > > > I am very interested in common bio python tasks as they relate > specifically > > to synthetic biology. Could you give me some examples of such tasks? > > -Tim > > > > On Tue, Aug 25, 2009 at 10:06 AM, William Heath > wrote: > > > > > This is amazing thanks! > > > -Tim > > > > > > > > > On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother > wrote: > > > > > >> > > >> Hi William, > > >> > > >> I was teaching Python/BioPython in several courses for Biologists. I > wrote > > >> some individual lesson plans but they are rather not readable for > other > > >> people (see attachment). There is some material on-line, though: > > >> > > >> http://www.rubor.de/lehre_en.html > > >> > > >> Typically, the lessons consisted of 2h lecture + 1h exercises on > language > > >> concepts + 3h exercises on a single, more biological task. The code > written > > >> during the latter was reviewed and scored and the students knew about > that. > > >> They had a two-week Python crash course before. > > >> > > >> Details on request. > > >> > > >> Best Regards, > > >> Kristian > > >> > > >> > > >> William Heath schrieb: > > >> > > >>> Hi All, > > >>> I am a member of Tech Shop in Mountain View, CA and I want to teach a > > >>> class > > >>> on biopython that is specifically tailored toward the goals of > synthetic > > >>> biology. Can anyone help me to come up with lesson plan for such a > > >>> class? > > >>> In particular I want to use bio bricks, and good opensource design > > >>> programs > > >>> for biobricks. Can anyone recommend any? 
> > >>> > > >>> I also want to utilize any/all concepts in this training: > > >>> > > >>> http://www.edge.org/documents/archive/edge296.html > > >>> > > >>> Please let me know your ideas on such a lesson plan. > > >>> > > >>> > > >>> -Tim > > >>> _______________________________________________ > > >>> Biopython mailing list - Biopython at lists.open-bio.org > > >>> http://lists.open-bio.org/mailman/listinfo/biopython > > >>> > > >>> > > >>> > > >> > > >> > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From xiaoa at mail.rockefeller.edu Mon Aug 31 19:45:55 2009 From: xiaoa at mail.rockefeller.edu (xiaoa) Date: Mon, 31 Aug 2009 19:45:55 -0400 Subject: [Biopython] IDLE problem Message-ID: <4A9C60B3.4040605@rockefeller.edu> Hi, I am new to python and biopython. I ran into a problem when using Entrez.esearch and efetch. My script worked fine when I used python 2.6.2 command line (console), but it returned an empty line when I ran it in IDLE. IDLE seems to be working, because I tested with 1. another python script (no Entrez modules) and 2. even Entrez.einfo--worked fine. I am using Windows Vista, 64-bit and Biopython 1.51 and Python 2.6.2. Thanks in advance, Andrew From biopython at maubp.freeserve.co.uk Mon Aug 3 20:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Aug 2009 21:48:49 +0100 Subject: [Biopython] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Message-ID: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> On 22 June 2009, I wrote: > ... > I'd like to officially deprecate Bio.Fasta for the next release (Biopython > 1.51), which means you can continue to use it for a couple more > releases, but at import time you will see a warning message. 
See also: > http://biopython.org/wiki/Deprecation_policy > > Would this cause anyone any problems? If you are still using Bio.Fasta, > it would be interesting to know if this is just some old code that hasn't > been updated, or if there is some stronger reason for still using it. No one replied, so I plan to make this change in CVS shortly, meaning that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work but will trigger a deprecation warning at import. Please speak up ASAP if this concerns you. Thanks, Peter From stran104 at chapman.edu Wed Aug 5 01:10:44 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 4 Aug 2009 18:10:44 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0908031207x187119eerc05340c49488889c@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> <65d4b7fc0908031207x187119eerc05340c49488889c@mail.gmail.com> Message-ID: <2a63cc350908041810s4583e254o99e90861a2b23f99@mail.gmail.com> I played with this a bit more and Brad is right, the Unigene database is not supported through Entrez efetch. The ID returned by esearch is in fact the GI and other types of records can be retrieved with it (e.g. gene). A list of supported databases and return types can be found at: http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html As Brad already suggested, downloading the files would work and it would be fast to process locally as well. Good luck :-) On Mon, Aug 3, 2009 at 12:07 PM, Carlos Javier Borroto < carlos.borroto at gmail.com> wrote: > On Thu, Jul 30, 2009 at 8:10 PM, Matthew Strand > wrote: > > Hi Carlos, > > I did something similar to this a while ago and meant to write a cookbook > > entry for it but haven't gotten the chance yet.
You could also try doing > an > > efetch on the ID of the record returned by esearch. > > > > I'm not near my workstation so I can't test it but you might try: > > handle = Entrez.efetch(db="unigene", id="141673") > > > > If that works then you just need to pull the ID out of the esearch result > > and do an efetch on it. > > > > I tried that too, but no luck on my side: > > >>> from Bio import Entrez > >>> from Bio import UniGene > >>> Entrez.email = "carlos.borroto at gmail.com" > >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") > >>> record = Entrez.read(handle) > >>> record > {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], > u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': > 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': > [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} > >>> record["IdList"][0] > '141673' > >>> handle = Entrez.efetch(db="unigene", id=record["IdList"][0]) > >>> print handle.read() > (Output a HTML web page) > > regards, > -- > Carlos Javier > -- Matthew Strand From biopython at maubp.freeserve.co.uk Wed Aug 5 10:29:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 11:29:45 +0100 Subject: [Biopython] Deprecating Bio.Fasta? In-Reply-To: <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> References: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> <320fb6e00908031348u64da883as875de61ddc0b1717@mail.gmail.com> Message-ID: <320fb6e00908050329m44fa2596ife06917306ae44ab@mail.gmail.com> On Mon, Aug 3, 2009 at 9:48 PM, Peter wrote: > On 22 June 2009, I wrote: >> ... >> I'd like to officially deprecate Bio.Fasta for the next release (Biopython >> 1.51), which means you can continue to use it for a couple more >> releases, but at import time you will see a warning message. See also: >> http://biopython.org/wiki/Deprecation_policy >> ... 
> > No one replied, so I plan to make this change in CVS shortly, meaning > that Bio.Fasta will be deprecated in Biopython 1.51, i.e. it will still work > but will trigger a deprecation warning at import. > > Please speak up ASAP if this concerns you. I've just committed the deprecation of Bio.Fasta to CVS. This could be reverted if anyone has a compelling reason (and tells us before we do the final release of Biopython 1.51). The docstring for Bio.Fasta should cover the typical situations for moving from Bio.Fasta to Bio.SeqIO, but please feel free to ask on the mailing list if you have a more complicated bit of old code that needs to be ported. Thanks, Peter From sbassi at gmail.com Fri Aug 7 18:15:47 2009 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 7 Aug 2009 15:15:47 -0300 Subject: [Biopython] ANN: Python for Bioinformatics book Message-ID: Just want to announce the availability of the book "Python for Bioinformatics". It has a Biopython chapter and I made it thanks to a lot of people of this list (and biopython-dev) who helped me since I began programming Python about 7 years ago. Here is the official announcement: "Python for Bioinformatics" ISBN 1584889292 Amazon: http://www.tinyurl.com/biopython Publisher: http://www.crcpress.com/product/isbn/9781584889298 This book introduces programming concepts to life science researchers, bioinformaticians, support staff, students, and everyone who is interested in applying programming to solve biologically-related problems. Python is the chosen programming language for this task because it is both powerful and easy-to-use. It begins with the basic aspects of the language (like data types and control structures) up to essential skills on today's bioinformatics tasks like building web applications, using relational database management systems, XML and version control. There is a chapter devoted to Biopython (www.biopython.org) since it can be used for most of the tasks related to bioinformatics data processing.
There is a section with applications with source code, featuring sequence manipulation, filtering vector contamination, calculating DNA melting temperature, parsing a GenBank file, inferring splicing sites, and more. There are questions at the end of every chapter and odd-numbered questions are answered in an appendix, making this text suitable for classroom use. This book can also be used as reference material as it includes Richard Gruet's Python Quick Reference, and the Python Style Guide. DVD: The included DVD features a virtual machine with a special edition of DNALinux, with all the programs and complementary files required to run the scripts commented in the book. All scripts can be tweaked to fit a particular configuration. By using a pre-configured virtual machine the reader has access to the same development environment as the author, so he can focus on learning Python. All code is also available at http://py3.us/## where ## is the code number, for example: http://py3.us/57 I've been working on this book for more than two years testing the examples under different setups and working to make the code compatible for most versions of Python, Biopython and operating systems. Where there is code that only works with a particular dependency, this is clearly noted. Finally, I want to highlight that non-bioinformaticians out there can use this book as an introduction to bioinformatics by starting with the included "Diving into the Gene Pool with BioPython" (by Zachary Voase and published originally in Python Magazine). From lueck at ipk-gatersleben.de Sat Aug 8 07:03:52 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Sat, 8 Aug 2009 09:03:52 +0200 Subject: [Biopython] ANN: Python for Bioinformatics book In-Reply-To: References: Message-ID: <20090808090352.av3u0ldgd7rk88wc@webmail.ipk-gatersleben.de> Hi Sebastian! This sounds like a great book!
Hopefully it will soon also be available in German bookshops, so that I can have a look at it! Kind regards Stefanie Quoting Sebastian Bassi : > Just want to announce the availability of the book "Python for > Bioinformatics". It has a Biopython chapter and I made it thanks to > a lot of people of this list (and biopython-dev) who helped me since I > began programming Python about 7 years ago. > > Here is the official announcement: > > "Python for Bioinformatics" > ISBN 1584889292 > Amazon: http://www.tinyurl.com/biopython > Publisher: http://www.crcpress.com/product/isbn/9781584889298 > > This book introduces programming concepts to life science researchers, > bioinformaticians, support staff, students, and everyone who is > interested in applying programming to solve biologically-related > problems. Python is the chosen programming language for this task > because it is both powerful and easy-to-use. > > It begins with the basic aspects of the language (like data types and > control structures) up to essential skills on today's bioinformatics > tasks like building web applications, using relational database > management systems, XML and version control. There is a chapter > devoted to Biopython (www.biopython.org) since it can be used for most > of the tasks related to bioinformatics data processing. > > There is a section with applications with source code, featuring > sequence manipulation, filtering vector contamination, calculating DNA > melting temperature, parsing a GenBank file, inferring splicing sites, > and more. > > There are questions at the end of every chapter and odd-numbered > questions are answered in an appendix making this text suitable for > classroom use. > > This book can be used also as a reference material as it includes > Richard Gruet's Python Quick Reference, and the Python Style Guide.
> > DVD: The included DVD features a virtual machine with a special > edition of DNALinux, with all the programs and complementary files > required to run the scripts commented in the book. All scripts can be > tweaked to fit a particular configuration. By using a pre-configured > virtual machine the reader has access to the same development > environment as the author, so he can focus on learning Python. All > code is also available at http://py3.us/## where ## is the code > number, for example: http://py3.us/57 > > I've been working on this book for more than two years testing the > examples under different setups and working to make the code > compatible for most versions of Python, Biopython and operating > systems. Where there is code that only works with a particular > dependency, this is clearly noted. > > Finally, I want to highlight that non-bioinformaticians out there can > use this book as an introduction to bioinformatics by starting with > the included "Diving into the Gene Pool with BioPython" (by Zachary > Voase and published originally in Python Magazine).
> _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From biopython at maubp.freeserve.co.uk Mon Aug 10 11:12:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 12:12:21 +0100 Subject: [Biopython] Trimming adaptors sequences Message-ID: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> Hi all, Brad's got an interesting blog post up on using Biopython for trimming adaptors for next gen sequencing reads, using Bio.pairwise2 for pairwise alignments between the adaptor and the reads: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ The basic idea is similar to what Giles Weaver was describing last month, although Giles was using EMBOSS needle to do a global pairwise alignment via BioPerl: http://lists.open-bio.org/pipermail/biopython/2009-July/005338.html We already had a simple FASTQ "primer trimming" example in the tutorial, which I have just extended to add a more general FASTQ "adaptor trimming" example. For this I am deliberately only looking for exact matches. This is faster of course, but it also makes the example much more easily understood as well - something important for an introductory example. A full cookbook example of how to use pairwise alignments would seem like a great idea for a cookbook entry on the wiki. It would be interesting to see which is faster - using EMBOSS needle/water or Bio.pairwise2. Both are written in C, but using EMBOSS we'd have the overhead of parsing the output file. Brad - why are you using a local alignment and not a global alignment? Shouldn't we be looking for the entire adaptor sequence? It looks like you don't consider the unaligned parts of the adaptor when you count the mismatches - is this a bug? I wonder if it would be simpler (and faster) to take a score-based threshold.
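For reference, the exact-match trimming approach discussed above boils down to a simple substring search. This is a rough illustrative sketch (the adaptor sequence is invented), not the tutorial's actual code:

```python
# Keep only the part of the read before the first exact adaptor match;
# reads without the adaptor are returned unchanged.
def trim_adaptor(read, adaptor):
    index = read.find(adaptor)
    return read if index == -1 else read[:index]

print(trim_adaptor("ACGTACGTAGATCGGAAG", "AGATCGG"))  # ACGTACGT
print(trim_adaptor("ACGTACGT", "AGATCGG"))            # ACGTACGT (no match)
```

Unlike the pairwise-alignment approach, this finds only perfect adaptor copies, which is why it is faster and easier to explain, but misses adaptors containing sequencing errors.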
Regards, Peter From chapmanb at 50mail.com Mon Aug 10 13:16:50 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 10 Aug 2009 09:16:50 -0400 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> Message-ID: <20090810131650.GP12604@sobchak.mgh.harvard.edu> Hi Peter; > Brad's got an interesting blog post up on using Biopython for trimming > adaptors for next gen sequencing reads, using Bio.pairwise2 for > pairwise alignments between the adaptor and the reads: > > http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ > > The basic idea is similar to what Giles Weaver was describing last > month, although Giles was using EMBOSS needle to do a global pairwise > alignment via BioPerl: > http://lists.open-bio.org/pipermail/biopython/2009-July/005338.html Yes, same idea. When I started messing with this I was thinking I could be tricky and get something that avoided doing alignments and would be faster. Unfortunately I didn't have good luck with the pure string based approaches. > We already had a simple FASTQ "primer trimming" example in the > tutorial, which I have just extended to add a more general FASTQ > "adaptor trimming" example. For this I am deliberately only looking > for exact matches. This is faster of course, but it also makes the > example much more easily understood as well - something important for > an introductory example. Agreed. I like the examples and was thinking of this as an extension of the exact matching approach. I am definitely happy to roll this or some derivative of it into Biopython. > A full cookbook example of how to use pairwise alignments would seem > like a great idea for a cookbook entry on the wiki. It would be > interesting to see which is faster - using EMBOSS needle/water or > Bio.pairwise2. 
Both are written in C, but using EMBOSS we'd have the > overhead of parsing the output file. In terms of speed, I was thinking of this as a good target for parallelization using the multiprocessing library (http://docs.python.org/library/multiprocessing.html) but didn't have time yet to look into that. > Brad - why are you using a local alignment and not a global alignment? > Shouldn't we be looking for the entire adaptor sequence? It looks like > you don't consider the unaligned parts of the adaptor when you > count the mismatches - is this a bug? Good call -- this should consider the number of matches in the aligning region to the full adaptor to see if we've got it. This is fixed in the GitHub version now. Thanks for pointing it out. > I wonder if it would be simpler (and faster) to take a score-based threshold. Maybe, but I find comfort in being able to describe the algorithms simply: any matches to the adaptor with 2 or fewer errors. I'd imagine most of the time is being taken up doing the actual alignment work. Thanks for the feedback on this. It was really helpful, Brad From rodrigo_faccioli at uol.com.br Mon Aug 10 14:31:44 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Mon, 10 Aug 2009 11:31:44 -0300 Subject: [Biopython] qBlast Error and Entrez module Message-ID: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> Hello, I've tried to execute a blast from NCBI. In this way, I'm using the NCBI module from Biopython. I read the Biopython Tutorial its Chapter 7. So, my code is below. result_handle = NCBIWWW.qblast("blastn", "nr", "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN") blast_results = result_handle.read() save_file = open( "1CRN_Blast.xml", "w") save_file.write(blast_results) save_file.close() However, when I execute this code, I receive the error message: raise ValueError("No RID and no RTOE found in the 'please wait' page." I don't know what I'm doing wrong. So, if somebody can help me, I thank.
I have one more doubt about Entrez module. What is the difference between Entrez and NCBI ? With Entrez module can I execute a protein alignment? If yes, could somebody provide an example for me. Sorry my English mistakes. Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From jblanca at btc.upv.es Mon Aug 10 14:39:56 2009 From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel) Date: Mon, 10 Aug 2009 16:39:56 +0200 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908100536j50cf9dacp27b93040b50623aa@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <1249905992.4a800d48a7d29@webmail.upv.es> <320fb6e00908100536j50cf9dacp27b93040b50623aa@mail.gmail.com> Message-ID: <1249915196.4a80313cf2284@webmail.upv.es> I had also the same problem and I wrote a function to do it using exonerate or blast, take a look at create_vector_striper_by_alignment in: http://bioinf.comav.upv.es/svn/biolib/biolib/src/biolib/seq_cleaner.py Jose Blanca From biopython at maubp.freeserve.co.uk Mon Aug 10 15:10:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:10:50 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> Message-ID: <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> On Mon, Aug 10, 2009 at 3:31 PM, Rodrigo faccioli wrote: > Hello, > > I've tried to execute a blast from NCBI. In this way, I'm using the NCBI > module from Biopython. I read the Biopython Tutorial its Chapter 7. So, my > code is below. > >
result_handle = NCBIWWW.qblast("blastn", "nr", > "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN") > blast_results = result_handle.read() > save_file = open( "1CRN_Blast.xml", "w") > save_file.write(blast_results) > save_file.close() > > However, when I execute this code, I receive the error message: raise > ValueError("No RID and no RTOE found in the 'please wait' page." > > I don't know what I'm doing wrong. So, if somebody can help me, I thank. I don't see anything wrong with that line, but it isn't working for me either. Odd. Perhaps the NCBI have changed something... I'll get back to you. > I have one more doubt about Entrez module. What is the difference between > Entrez and NCBI ? The NCBI is the (American) National Center for Biotechnology Information. They provide lots of online tools including Entrez and BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi http://www.ncbi.nlm.nih.gov/sites/entrez > With Entrez module can I execute a protein alignment? If > yes, could somebody provide an example for me. No, you can't run BLAST via Entrez. Entrez is like a way to search and download data from the NCBI. Peter From biopython at maubp.freeserve.co.uk Mon Aug 10 15:15:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:15:53 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> Message-ID: <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> On Mon, Aug 10, 2009 at 4:10 PM, Peter wrote: > On Mon, Aug 10, 2009 at 3:31 PM, Rodrigo > faccioli wrote: >> Hello, >> >> I've tried to execute a blast from NCBI. In this way, I'm using the NCBI >> module from Biopython. I read the Biopython Tutorial its Chapter 7. So, my >> code is below. >> >>
result_handle = NCBIWWW.qblast("blastn", "nr", >> "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN") >> blast_results = result_handle.read() >> save_file = open( "1CRN_Blast.xml", "w") >> save_file.write(blast_results) >> save_file.close() >> >> However, when I execute this code, I receive the error message: raise >> ValueError("No RID and no RTOE found in the 'please wait' page." >> >> I don't know what I'm doing wrong. So, if somebody can help me, I thank. > > I don't see anything wrong with that line, but it isn't working for me either. > Odd. Perhaps the NCBI have changed something... I'll get back to you. It is actually a simple problem: You are using a protein query but BLASTN requires a nucleotide sequence. The NCBI does actually try and tell us this, but Biopython doesn't (currently) know how to extract the error > message to show you.
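An illustrative pre-flight check for the protein-versus-nucleotide mix-up diagnosed above: inspect the query alphabet before calling qblast. This helper is hypothetical (not part of Biopython), and the heuristic is crude: a sequence made up only of nucleotide letters is treated as DNA/RNA, anything else as protein.

```python
# Rough sanity check that the BLAST program matches the query type.
# Only blastn/blastp are handled in this sketch.
def query_matches_program(program, sequence):
    is_nucleotide = set(sequence.upper()) <= set("ACGTUN")
    if program == "blastn":
        return is_nucleotide
    if program == "blastp":
        return not is_nucleotide
    return True  # other programs not checked here

query = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(query_matches_program("blastn", query))  # False - protein query
print(query_matches_program("blastp", query))  # True
```

Short nucleotide sequences can of course look like valid protein too, so a check like this can only catch the obvious case discussed in this thread.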
I've updated Bio.Blast.NCBIWWW to try and report the NCBI error message, this means in future a mistake like this will result in: ValueError: Error message from NCBI: Message ID#24 Error: Failed to read the Blast query: Protein FASTA provided for nucleotide sequence That should make life a little simpler. Thanks for telling us about this, and reminding me about this issue. Peter From biopython at maubp.freeserve.co.uk Mon Aug 10 15:58:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 16:58:29 +0100 Subject: [Biopython] qBlast Error and Entrez module In-Reply-To: <3715adb70908100836m1a653ddco435896a2a1f7ee3c@mail.gmail.com> References: <3715adb70908100731y57a03d2eqd20e8584539d05bb@mail.gmail.com> <320fb6e00908100810g284932b9l6297b7a41fc18289@mail.gmail.com> <320fb6e00908100815p32242292g2931d1c620697171@mail.gmail.com> <3715adb70908100836m1a653ddco435896a2a1f7ee3c@mail.gmail.com> Message-ID: <320fb6e00908100858h76349de6l55b0145d6e3330da@mail.gmail.com> On Mon, Aug 10, 2009 at 4:36 PM, Rodrigo faccioli wrote: > Sorry my error. This error occurred because I've built a blast for nucleotide > sequence and my intention was to use the same code. Therefore, my configure > file has one parameter called BlastProgram with options: BlastN or BlastP. > > Now, my code is working. > > Thank you for your help. I'm glad I could help. Peter P.S. Please try and keep replies CC'd to the mailing list. From dmikewilliams at gmail.com Mon Aug 10 16:59:03 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 12:59:03 -0400 Subject: [Biopython] hsp.identities Message-ID: Hi there. Been using perl since 1996, but I am new to python. I am working on some python code that was last modified in March of 2007. The code used to use NCBIStandalone, I've modified it to use NCBIXML because the Standalone package died with an exception, which I assumed was due to changes in the blast report format since the code was originally written.
blastToleranceNT = 2 blast_out = open(blast_report, "r") b_parse = NCBIXML.parse(blast_out, debug) for b_record in b_parse : for al in b_record.alignments: al.hsps = filter (lambda x: abs(x.identities[0]-x.identities[1]) <= blastToleranceNT, al.hsps) This code generates the following error: TypeError: 'int' object is unsubscriptable Tried using some (slightly modified) code from: http://biobanner.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/Arabidopsis/ArabComp/LocateOligos.py?rev=1.3 for hsp in al.hsps: identities, length = hsp.identities which gives the following error: identity, length = hsp.identities TypeError: 'int' object is not iterable using blast-2.2.17, python 2.6, and biopython version 1.49 on a fedora 11 system also tried on a fedora 10 system with python 2.5.2 and biopython 1.48 - similar results according to the docs at: http://www.biopython.org/DIST/docs/api/Bio.Blast.Record.HSP-class.html hsp.identities is a tuple: identities Number of identities/total aligned. tuple of (int, int) I've looked at various sites with examples of how to deal with tuples, but nothing seems to work, and the error messages always imply that identities is an int. I'm hoping my spinning my wheels on this is just the result of being new to python. I know the original version of the code *used* to work, and the rest of the program seems to work fine, if I comment out the filter line. Any help would be appreciated, this one line of code is a show stopper and I have multiple deadlines this week which depend on getting this working. 
Thanks, Mike From dmikewilliams at gmail.com Mon Aug 10 18:23:31 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 14:23:31 -0400 Subject: [Biopython] hsp.identities (solved) Message-ID: Well, one hour after posting my question, I found the answer in the list archives: http://portal.open-bio.org/pipermail/biopython-dev/2006-April/002347.html What happens is that if the Blast output looks like this: Identities = 28/87 (32%), Positives = 44/87 (50%), Gaps = 12/87 (13%) then the text-based parser returns hsp.identities = (28, 87) hsp.positives = (44, 87) hsp.gaps = (12, 87) while the XML parser returns hsp.identities = 28 hsp.positives = 44 hsp.gaps = 12; we can get the 87 from len(hsp.query). Cheers, Mike From biopython at maubp.freeserve.co.uk Mon Aug 10 20:43:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 21:43:30 +0100 Subject: [Biopython] hsp.identities In-Reply-To: References: Message-ID: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> On Mon, Aug 10, 2009 at 5:59 PM, Mike Williams wrote: > Hi there. Been using perl since 1996, but I am new to python. I am > working on some python code that was last modified in March of 2007. > > The code used to use NCBIStandalone, I've modified it to use NCBIXML > because the Standalone package died with an exception, which I assumed > was due to changes in the blast report format since the code was > originally written. Quite likely - the NCBI keep changing the plain text output, so we have more or less given up that losing battle and have followed their advice and now just recommend the XML parser. > > blastToleranceNT = 2 > blast_out = open(blast_report, "r") > b_parse = NCBIXML.parse(blast_out, debug) > for b_record in b_parse : >     for al in b_record.alignments: >         al.hsps = filter (lambda x: > abs(x.identities[0]-x.identities[1]) <= blastToleranceNT, >
al.hsps) > > > This code generates the following error: > TypeError: 'int' object is unsubscriptable > ... > I've looked at various sites with examples of how to deal with tuples, > but nothing seems to work, and > the error messages always imply that identities is an int. > > I'm hoping my spinning my wheels on this is just the result of being > new to python. I know the original version of the code *used* to > work, and the rest of the program seems to work fine, if I comment out > the filter line. > > Any help would be appreciated, this one line of code is a show stopper > and I have multiple deadlines this week which depend on getting this > working. This is one of the quirks of the XML parser (integer) versus the plain text parser (tuple of two integers, the number of identities and the alignment length). In general they are interchangeable but there are a couple of accidents like this which we've left in place rather than breaking existing scripts. See Bug 2176 for more details. http://bugzilla.open-bio.org/show_bug.cgi?id=2176 For plain text, from memory you needed this: abs(x.identities[0]-x.identities[1]) or, abs(x.identities[0]-x.align_length) For XML you'll need: abs(x.identities - x.align_length) (I think, without testing it) Peter From dmikewilliams at gmail.com Mon Aug 10 21:42:08 2009 From: dmikewilliams at gmail.com (Mike Williams) Date: Mon, 10 Aug 2009 17:42:08 -0400 Subject: [Biopython] hsp.identities In-Reply-To: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> References: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> Message-ID: On Mon, Aug 10, 2009 at 4:43 PM, Peter wrote: > This is one of the quirks of the XML parser (integer) versus the > plain text parser (tuple of two integers, the number of identities > and the alignment length). In general they are interchangeable > but there are a couple of accidents like this which we've left in > place rather than breaking existing scripts.
See Bug 2176 for > more details. > http://bugzilla.open-bio.org/show_bug.cgi?id=2176 > > For plain text, from memory you needed this: > abs(x.identities[0]-x.identities[1]) or, abs(x.identities[0]-x.align_length) > For XML you'll need: abs(x.identities - x.align_length) > Thanks for the reply, Peter. I actually found the solution and posted that fact a couple hours ago, although the additional information was helpful. I do think that, at least, the documentation should be changed to mention the difference between the standalone and xml parsers. If that had been done it would have saved me a lot of time. Peace, Mike From biopython at maubp.freeserve.co.uk Tue Aug 11 09:13:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Aug 2009 10:13:16 +0100 Subject: [Biopython] hsp.identities In-Reply-To: References: <320fb6e00908101343m4ce06704y59853624bcb830b6@mail.gmail.com> Message-ID: <320fb6e00908110213o339d4d2brf7c571a44359297d@mail.gmail.com> On Mon, Aug 10, 2009 at 10:42 PM, Mike Williams wrote: > > I do think that, at least, the documentation should be changed to > mention the difference between the standalone and xml parsers. If > that had been done it would have saved me a lot of time. Good point. I've attempted to clarify the HSP class docstring. Peter From biopython at maubp.freeserve.co.uk Wed Aug 12 23:21:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 00:21:36 +0100 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <20090810131650.GP12604@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> On Mon, Aug 10, 2009 at 2:16 PM, Brad Chapman wrote: > > Agreed. I like the examples and was thinking of this as an extension > of the exact matching approach. I am definitely happy to roll this > or some derivative of it into Biopython.
Hi Brad, Is your aim to have a very fast pipeline, or an understandable reference implementation (a worked example)? If this is for a real pipeline, does it have to be FASTQ to FASTA? Further to your blog comment about slicing SeqRecord objects slowing things down, I agree - if you don't need the qualities, then having to slice them is a pointless overhead. As usual in programming, there are several options trading off elegant/general for speed. Personally I would want to keep the qualities for the assembly/mapping step. While keeping things general, as you don't care about the qualities, you could do the whole operation on FASTA files which are faster to read in and when you slice the resulting SeqRecord you don't have the overhead of the slicing the qualities. However, if you just want speed AND you really want to have a FASTQ input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain strings, and handle the output yourself. Working directly with Python strings is going to be faster than using Seq and SeqRecord objects. You can even opt for outputting FASTQ files - as long as you leave the qualities as an encoded string, you can just slice that too. The downside is the code will be very specific. e.g. something along these lines: from Bio.SeqIO.QualityIO import FastqGeneralIterator in_handle = open(input_fastq_filename) out_handle = open(output_fastq_filename, "w") for title, seq, qual in FastqGeneralIterator(in_handle) : #Do trim logic here on the string seq if trim : seq = seq[start:end] qual = qual[start:end] # kept as ASCII string! #Save the (possibly trimmed) FASTQ record: out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) out_handle.close() in_handle.close() Note that FastqGeneralIterator is already in Biopython 1.50 and 1.51b, but is now a bit faster in CVS/github (what will be Biopython 1.51). 
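For anyone wanting to try this pattern outside Biopython, here is a self-contained sketch in modern Python 3 syntax. The parser below is a minimal stand-in for Bio.SeqIO.QualityIO.FastqGeneralIterator (it assumes simple four-line records), and the four-base adaptor is made up for illustration:

```python
# Sketch of string-based FASTQ trimming: slice the sequence and the
# ASCII quality string together, never decoding qualities to integers.
# fastq_general_iterator is a minimal stand-in for Biopython's
# FastqGeneralIterator; the adaptor sequence is hypothetical.

from io import StringIO

def fastq_general_iterator(handle):
    """Yield (title, sequence, quality) string tuples from simple FASTQ."""
    while True:
        title = handle.readline().rstrip()
        if not title:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the "+" separator line
        qual = handle.readline().rstrip()
        yield title[1:], seq, qual  # drop the leading "@"

ADAPTOR = "GATC"  # made-up adaptor to trim from the 5' end

def trim_fastq(in_handle, out_handle):
    for title, seq, qual in fastq_general_iterator(in_handle):
        if seq.startswith(ADAPTOR):
            # Keep qual as an ASCII string - slice it in step with seq
            seq = seq[len(ADAPTOR):]
            qual = qual[len(ADAPTOR):]
        out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))

record = "@read1\nGATCACGT\n+\nIIIIIIII\n"
out = StringIO()
trim_fastq(StringIO(record), out)
trimmed = out.getvalue()
print(trimmed)
```

The real parser also handles less regular FASTQ layouts; this sketch only shows why staying at the string level avoids the SeqRecord slicing overhead discussed above.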
Peter From mattkarikomi at gmail.com Thu Aug 13 02:10:08 2009 From: mattkarikomi at gmail.com (Matt Karikomi) Date: Wed, 12 Aug 2009 22:10:08 -0400 Subject: [Biopython] biopython mashup simmilar to lasergene Message-ID: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> i use the lasergene suite to manage molecular cloning projects. in projects like this, the visual presentation of both data and workflow history is crucial. it seems like the GUI of this software suite could be recapitulated by a mashup of modules from bioperl and/or biopython while at the same time providing a rich API which will never exist in lasergene. has there been any attempt to mask the powerful script-dependent functionality of these open-source modules in some form of GUI? i am envisioning something like the [web based] Primer3 Plus interface to the C implementation of Primer3 (obviously wider in scope). sorry if this is the wrong list (please advise). thanks matt From biopython at maubp.freeserve.co.uk Thu Aug 13 09:56:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 10:56:28 +0100 Subject: [Biopython] biopython mashup simmilar to lasergene In-Reply-To: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> References: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> Message-ID: <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> On Thu, Aug 13, 2009 at 3:10 AM, Matt Karikomi wrote: > i use the lasergene suite to manage molecular cloning projects. > in projects like this, the visual presentation of both data and > workflow history is crucial. it seems like the GUI of this software > suite could be recapitulated by a mashup of modules from bioperl > and/or biopython while at the same time providing a rich API which > will never exist in lasergene. > has there been any attempt to mask the powerful script-dependent > functionality of these open-source modules in some form of GUI?
i am > envisioning something like the [web based] Primer3 Plus interface to > the C implementation of Primer3 (obviously wider in scope). sorry if > this is the wrong list (please advise). > thanks > matt It sounds a bit like you want a work flow system, something like Galaxy, which can act as a GUI to command line tools (including BioPerl and Biopython scripts). Galaxy is actually written in Python: http://galaxy.psu.edu/ Peter From chapmanb at 50mail.com Thu Aug 13 12:44:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 13 Aug 2009 08:44:32 -0400 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> Message-ID: <20090813124432.GB90165@sobchak.mgh.harvard.edu> Hi Peter; > Is your aim to have a very fast pipeline, or an understandable > reference implementation (a worked example)? If this is for a real > pipeline, does it have to be FASTQ to FASTA? Ideally both. It is motivated by fitting into an experiment I am analyzing, but the purpose of the blog posting is to try and explain the logic and solicit feedback. In terms of the work, I don't need fastq downstream so was going the easier fasta route. But I can certainly see myself needing fastq in the future so prefer to be generalized. > Further to your blog comment about slicing SeqRecord objects slowing > things down, I agree - if you don't need the qualities, then having to > slice them is a pointless overhead. As usual in programming, there are > several options trading off elegant/general for speed. Personally I > would want to keep the qualities for the assembly/mapping step. Agreed.
Unfortunately, it was unusably slow with the slicing as currently implemented: it ran for about 16 hours and was 1/3 of the way finished so was looking like a 2 day run, or about 12x slower than the reference implementation. > However, if you just want speed AND you really want to have a FASTQ > input file, try the underlying > Bio.SeqIO.QualityIO.FastqGeneralIterator parser which gives plain > strings, and handle the output yourself. Working directly with Python > strings is going to be faster than using Seq and SeqRecord objects. > You can even opt for outputting FASTQ files - as long as you leave the > qualities as an encoded string, you can just slice that too. The > downside is the code will be very specific. e.g. something along these > lines: > > from Bio.SeqIO.QualityIO import FastqGeneralIterator > in_handle = open(input_fastq_filename) > out_handle = open(output_fastq_filename, "w") > for title, seq, qual in FastqGeneralIterator(in_handle) : > #Do trim logic here on the string seq > if trim : > seq = seq[start:end] > qual = qual[start:end] # kept as ASCII string! > #Save the (possibly trimmed) FASTQ record: > out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) > out_handle.close() > in_handle.close() Nice -- I will have to play with this. I hadn't dug into the current SeqRecord slicing code at all but I wonder if there is a way to keep the SeqRecord interface but incorporate some of these speed ups for common cases like this FASTQ trimming. 
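One generic way to get that kind of speed-up behind a record interface is lazy instantiation: keep the quality string as ASCII and only expand it to a list of integers on first access. A sketch with a hypothetical record class (not Biopython's actual SeqRecord):

```python
# Sketch (hypothetical LazyRecord class, not Biopython's SeqRecord):
# defer building the per-letter quality list until someone asks for it,
# so records that are only sliced never pay the list-building cost.

class LazyRecord:
    def __init__(self, seq, qual_string):
        self.seq = seq
        self._qual_string = qual_string  # cheap: stays an ASCII string
        self._qualities = None           # expensive list, built on demand

    @property
    def qualities(self):
        if self._qualities is None:
            # Sanger FASTQ encoding: PHRED score = ASCII code - 33
            self._qualities = [ord(c) - 33 for c in self._qual_string]
        return self._qualities

    def __getitem__(self, index):
        # Slicing only touches the two strings, never the integer list
        return LazyRecord(self.seq[index], self._qual_string[index])

rec = LazyRecord("ACGT", "II!I")
trimmed = rec[1:3]
print(trimmed.seq, trimmed.qualities)
```

Here slicing `rec` never decodes the original record's qualities; only the trimmed record pays that cost, and only when `qualities` is actually read.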
Brad From biopython at maubp.freeserve.co.uk Thu Aug 13 13:02:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Aug 2009 14:02:17 +0100 Subject: [Biopython] Trimming adaptors sequences In-Reply-To: <20090813124432.GB90165@sobchak.mgh.harvard.edu> References: <320fb6e00908100412s447cedd8nc993374b8a23de68@mail.gmail.com> <20090810131650.GP12604@sobchak.mgh.harvard.edu> <320fb6e00908121621m25e37f20pbd8e5e01c26b13a7@mail.gmail.com> <20090813124432.GB90165@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908130602n607add6fme67f7934234a5540@mail.gmail.com> On Thu, Aug 13, 2009 at 1:44 PM, Brad Chapman wrote: >> However, if you just want speed AND you really want to have a FASTQ >> input file, try the underlying Bio.SeqIO.QualityIO.FastqGeneralIterator >> parser which gives plain strings, and handle the output yourself. Working >> directly with Python strings is going to be faster than using Seq and >> SeqRecord objects. You can even opt for outputting FASTQ files - as >> long as you leave the qualities as an encoded string, you can just slice >> that too. The downside is the code will be very specific. e.g. something >> along these lines: >> >> from Bio.SeqIO.QualityIO import FastqGeneralIterator >> in_handle = open(input_fastq_filename) >> out_handle = open(output_fastq_filename, "w") >> for title, seq, qual in FastqGeneralIterator(in_handle) : >>     #Do trim logic here on the string seq >>     if trim : >>         seq = seq[start:end] >>         qual = qual[start:end] # kept as ASCII string! >>     #Save the (possibly trimmed) FASTQ record: >>     out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) >> out_handle.close() >> in_handle.close() > > Nice -- I will have to play with this. I hadn't dug into the current > SeqRecord slicing code at all but I wonder if there is a way to keep > the SeqRecord interface but incorporate some of these speed ups > for common cases like this FASTQ trimming.
I suggest we continue this on the dev mailing list (this reply is cross posted), as it is starting to get rather technical. When you really care about speed, any object creation becomes an issue. Right now for *any* record we have at least the following objects being created: SeqRecord, Seq, two lists (for features and dbxrefs), two dicts (for annotation and the per letter annotation), and the restricted dict (for per letter annotations), and at least four strings (sequence, id, name and description). Perhaps some lazy instantiation might be worth exploring... for example make dbxref, features, annotations or letter_annotations into properties where the underlying object isn't created unless accessed. [Something to try after Biopython 1.51 is out?] I would guess (but haven't timed it) that for trimming FASTQ SeqRecords, a big part of the overhead is that we are using Python lists of integers (rather than just a string) for the scores. So sticking with the current SeqRecord object as is, one speed up we could try would be to leave the FASTQ quality string as an encoded string (rather than turning into integer quality scores, and back again on output). It would be a hack, but adding this as another SeqIO format name, e.g. "fastq-raw" or "fastq-ascii", might work. We'd still need a new letter_annotations key, say "fastq_qual_ascii". This idea might work, but it does seem ugly. Peter From biopython at maubp.freeserve.co.uk Fri Aug 14 13:30:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Aug 2009 14:30:50 +0100 Subject: [Biopython] GFF Parsing In-Reply-To: <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> References: <3d03a61c0908140534k49b53531hd95aab478e486c56@mail.gmail.com> <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> Message-ID: <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> Hello Vipin, I think your question is probably aimed at Brad, so I will forward the attachment to him directly as well.
Brad's GFF code isn't in Biopython yet, but we plan to add it later. Have you signed up to the Biopython mailing list? Once you have done this you can email biopython at lists.open-bio.org or biopython at biopython.org with questions like this. I have copied this reply to the list (without the attachment). Peter ---------- Forwarded message ---------- From: Vipin TS Date: Fri, Aug 14, 2009 at 1:47 PM Subject: GFF Parsing To: biopython-owner at lists.open-bio.org To whom it may concern, Thanks for the development of a quick parser for GFF files. It is very useful. I have a doubt, I used the GFFParser.py program to extract the genome annotation from the file attached with this mail. Please find the attached file. (Because of the size of file here I included a few lines) I wrote a python script like this ################################################## import GFFParser pgff = GFFParser.GFFMapReduceFeatureAdder(dict(), None) cds_limit_info = dict( gff_type = ["gene","mRNA","CDS","exon"], gff_id = ["Chr1"] ) pgff.add_features('../PythonGFF/TAIR9_GFF_genes.gff3', cds_limit_info) pgff.base["Chr1"] final = pgff.base["Chr1"] ################################################## By executing this script I am able to extract gene, mRNA and exon annotation from the specified GFF file. But I am unable to extract the CDS related information from the GFF file. It will be great if you can suggest me an idea to include gene, mRNA, exon and CDS information in a single stretch of parsing of the GFF file. Thanks in advance, Vipin T S Scientific programmer Friedrich Miescher Laboratory of the Max Planck Society Spemannstrasse 37-39 D-72076 Tuebingen Germany From rodrigo_faccioli at uol.com.br Fri Aug 14 20:49:53 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Fri, 14 Aug 2009 17:49:53 -0300 Subject: [Biopython] SEQRES PDB module Message-ID: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> Hello, Sorry about my general question.
However, I've read the source-code of PDB module and I haven't found how can I work with SEQRES section of PDB file? My doubt is: Is there a method such as get_SeqRes? Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From chapmanb at 50mail.com Fri Aug 14 20:51:13 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Aug 2009 16:51:13 -0400 Subject: [Biopython] GFF Parsing In-Reply-To: <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> References: <3d03a61c0908140534k49b53531hd95aab478e486c56@mail.gmail.com> <3d03a61c0908140547s138b1275i2ffc5e9ed0a6a397@mail.gmail.com> <320fb6e00908140630h91b6055la2226aee675199c2@mail.gmail.com> Message-ID: <20090814205113.GI90165@sobchak.mgh.harvard.edu> Hi all; Peter, thanks for forwarding this along. Vipin: > By executing this script I am able to extract gene, mRNA and exon annotation > from specified GFF file. But I am unable to extract the CDS related > information from GFF file. > It will be great if you can suggest me an idea to include gene, mRNA, exon > and CDS information in a single strech of parsing of GFF file. Sure, the CDS features are present in two places within the feature tree. The first is as sub-sub features of genes: gene -> mRNA -> CDS the second is as sub features of proteins: protein -> CDS It's a bit of a confusing way to do it, in my opinion, but this is the nesting defined in the Arabidopsis GFF file, so the parser respects it and puts them where they are supposed to be. Below is an updated script which should demonstrate where the CDS features are; you can use either way to access them as the same CDSs are present under both features. 
This also uses the updated API for parsing, which is much cleaner and will hopefully be what is in Biopython. There is some initial documentation here: http://www.biopython.org/wiki/GFF_Parsing Hope this helps, Brad import sys from BCBio.GFF import GFFParser in_file = sys.argv[1] parser = GFFParser() limit_info = dict( gff_type = ["protein", "gene", "mRNA", "CDS", "exon"], gff_id = ["Chr1"], ) in_handle = open(in_file) for rec in parser.parse(in_handle, limit_info=limit_info): print rec.id for feature in rec.features: if feature.type == "protein": print feature.type, feature.id for sub in feature.sub_features: if sub.type == "CDS": print sub.type elif feature.type == "gene": for sub in feature.sub_features: if sub.type == "mRNA": print sub.type, sub.id for sub_sub in sub.sub_features: if sub_sub.type == "CDS": print sub_sub.type in_handle.close() From mattkarikomi at gmail.com Sat Aug 15 02:12:36 2009 From: mattkarikomi at gmail.com (Matt Karikomi) Date: Fri, 14 Aug 2009 22:12:36 -0400 Subject: [Biopython] biopython mashup simmilar to lasergene In-Reply-To: <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> References: <95e3e9cc0908121910o37b8a3dbwb1f4d938a5439606@mail.gmail.com> <320fb6e00908130256l698deda6ubbf498cb1313943b@mail.gmail.com> Message-ID: <95e3e9cc0908141912n62b4e98aga2834a524ef15aa9@mail.gmail.com> this is exactly what i had in mind as far as a software development model and the implementation of the aforementioned modules. my current work is less genome-wide exploration (i am interested in doing more of this in the future) and more just cloning/recombineering (conventional knockouts and such). of course the extensibility of Galaxy means it can be made to handle anything in the way of analysis and manipulation. On Thu, Aug 13, 2009 at 5:56 AM, Peter wrote: > On Thu, Aug 13, 2009 at 3:10 AM, Matt Karikomi > wrote: > > i use the lasergene suite to manage molecular cloning projects. 
> > in projects like this, the visual presentation of both data and > > workflow history is crucial. it seems like the GUI of this software > > suite could be recapitulated by a mashup of modules from bioperl > > and/or biopython while at the same time providing a rich API which > > will never exist in lasergene. > > has there been any attempt to mask the powerful script-dependent > > functionality of these open-source modules in some form of GUI? i am > > envisioning something like the [web based] Primer3 Plus interface to > > the C implementation of Primer3 (obviously wider in scope). sorry if > > this is the wrong list (please advise). > > thanks > > matt > > It sounds a bit like you want a work flow system, something like > Galaxy, which can act as a GUI to command line tools (including > BioPerl and Biopython scripts). Galaxy is actually written in Python: > http://galaxy.psu.edu/ > > Peter > From chapmanb at 50mail.com Mon Aug 17 11:58:22 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 17 Aug 2009 07:58:22 -0400 Subject: [Biopython] Biopython 1.51 released Message-ID: <20090817115822.GA12768@sobchak.mgh.harvard.edu> Biopythonistas; We're pleased to announce the release of Biopython 1.51. This new stable release enhances version 1.50 (released in April) by extending the functionality of existing modules, adding a set of application wrappers for popular alignment programs and fixing a number of minor bugs. Sources and Windows Installer are available from the downloads page: http://biopython.org/wiki/Download In particular, the SeqIO module can now write Genbank files that include features, and deal with FASTQ files created by Illumina 1.3+. Support for this format allows interconversion between FASTQ files using Solexa, Sanger and Illumina variants using conventions agreed upon with the BioPerl and EMBOSS projects.
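As background to those FASTQ variants: Sanger and Illumina 1.3+ files both store PHRED qualities and differ only in the ASCII offset used (33 versus 64), so that part of the interconversion is a simple re-encoding (Solexa scores instead need a log-odds conversion). A stand-alone sketch of the offset change, not Biopython's implementation:

```python
# Sketch: re-encoding PHRED quality strings between the Sanger
# (ASCII offset 33) and Illumina 1.3+ (ASCII offset 64) FASTQ variants.
# Bio.SeqIO does this for you; Solexa files need a log-odds score
# conversion as well, not just a different offset.

def illumina_to_sanger(qual):
    """Re-encode an Illumina 1.3+ quality string as Sanger."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)

def sanger_to_illumina(qual):
    """Re-encode a Sanger quality string as Illumina 1.3+."""
    return "".join(chr(ord(c) - 33 + 64) for c in qual)

# PHRED scores 0, 10, 40 encode as "@", "J", "h" in Illumina 1.3+
print(illumina_to_sanger("@Jh"))  # -> "!+I"
```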
Biopython 1.51 is the first stable release to include the Align.Applications module which allows users to define command line wrappers for popular alignment programs including ClustalW, Muscle and T-Coffee. Bio.Fasta and the application tools ApplicationResult and generic_run() have been marked as deprecated - Bio.Fasta has been superseded by SeqIO's support for the Fasta format and we provide documentation for using the subprocess module from the Python Standard Library as a more flexible approach to calling applications. As always, the Tutorial and Cookbook has been updated to document all the changes: http://biopython.org/wiki/Documentation Thank you to everyone who tested our 1.51 beta or submitted bugs since our last stable release and to all our contributors. Brad From biopython at maubp.freeserve.co.uk Mon Aug 17 13:54:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Aug 2009 14:54:43 +0100 Subject: [Biopython] SEQRES PDB module In-Reply-To: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> References: <3715adb70908141349t5aff9c6et1c59b44a2edd0ba5@mail.gmail.com> Message-ID: <320fb6e00908170654j482cd07bvb35677555fa6c5d@mail.gmail.com> On Fri, Aug 14, 2009 at 9:49 PM, Rodrigo faccioli wrote: > Hello, > > Sorry about my general question. However, I've read the source-code of the PDB > module and I haven't found how I can work with the SEQRES section of a PDB file? > > My doubt is: Is there a method such as get_SeqRes? > > Thanks, Biopython has limited support for parsing the PDB header information, and does not (currently) do anything with the SEQRES lines. You can usually infer the amino acid sequence from the 3D data itself (although this is complicated if there are gaps, for example residues whose coordinates were not resolved). What are you trying to do? It might be simplest to download the sequences from the PDB as simple FASTA files.
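In the meantime, pulling the residue names out of the raw SEQRES records only takes a few lines of plain text parsing. A sketch (not part of Bio.PDB, and assuming well-formed records with non-blank chain identifiers):

```python
# Sketch: collect SEQRES residue names per chain from a PDB file's text,
# since Bio.PDB does not (currently) parse these records. Relies on the
# whitespace-separated layout of SEQRES lines: record name, serial
# number, chain ID, residue count, then up to 13 residue names.

def seqres_by_chain(lines):
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            fields = line.split()
            chain_id = fields[2]
            chains.setdefault(chain_id, []).extend(fields[4:])
    return chains

# Two SEQRES records for a hypothetical 21-residue chain A
# (the insulin A-chain sequence, for a realistic example):
sample = [
    "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU\n",
    "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN\n",
]
print(seqres_by_chain(sample)["A"])
```

Old files with blank chain identifiers would need fixed-column parsing instead of `split()`, but for typical modern entries this recovers the full deposited sequence, gaps included.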
Peter From bartomas at gmail.com Tue Aug 18 10:40:18 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 18 Aug 2009 11:40:18 +0100 Subject: [Biopython] Biogeography module Message-ID: Hi, I've been looking at the Biogeography module ( http://biopython.org/wiki/BioGeography) currently under development. It seems incredibly interesting and useful. The thing that would be really useful in the tutorial would be to show a step by step example of the commands to execute during a complete workflow of the module, from the retrieval of gbif records to the calculation of statistics of phylogenetic trees per region and generation of kml/shapefiles. In the current state of the tutorial it is hard to know how the data fed into the calculation of a tree summary object can be generated. Congratulations on the great work on this module. Look forward to using it. All the best, Tomas Bar From biopython at maubp.freeserve.co.uk Wed Aug 19 09:40:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 10:40:41 +0100 Subject: [Biopython] Fwd: Biopython package for Fedora/RedHat In-Reply-To: References: <320fb6e00908171002j6dd6da63l49f0fabd5866a332@mail.gmail.com> <74d46tptxc.fsf@allele2.eebweb.arizona.edu> <320fb6e00908180309j54c9d2a7qe06a4a804e280f65@mail.gmail.com> Message-ID: <320fb6e00908190240w69df3c8bve6bdb3ce77b60a4@mail.gmail.com> Hi all, I'd like to thank Alex Lancaster for updating the Fedora packages for Biopython 1.51 (including a patch for the flex issue, Bug 2619). http://bugzilla.open-bio.org/show_bug.cgi?id=2619 Those of you involved in testing Fedora packages, please give this a go - positive feedback will get this into stable F-10 and F-11 sooner (as Alex explains below). Thanks, Peter ---------- Forwarded message ---------- From: Alex Lancaster Date: Wed, Aug 19, 2009 at 12:30 AM Subject: Re: Biopython package for Fedora/RedHat To: Peter Hi Peter, I updated the Biopython wiki to point to that page.
https://admin.fedoraproject.org/community/?package=python-biopython#package_maintenance/package_overview [...] OK, updates in CVS done: * Tue Aug 18 2009 Alex Lancaster - 1.51-1 - Update to upstream 1.51 - Drop mx {Build}Requires, no longer used upstream - Remove Martel modules, no longer distributed upstream - Add flex to BuildRequires, patch setup to build Bio.PDB.mmCIF.MMCIFlex as per upstream: http://bugzilla.open-bio.org/show_bug.cgi?id=2619 I have done builds for rawhide (although probably won't be included for a while as rawhide is frozen while a Beta for F-12 is being tested), and there are pending updates for F-10 and F-11 which will be pushed to updates-testing soon (you can add comments to the updates without being a Fedora developer): https://admin.fedoraproject.org/updates/python-biopython-1.51-1.fc10 https://admin.fedoraproject.org/updates/python-biopython-1.51-1.fc11 Once in updates-testing for a while and we get some feedback (i.e. "karma" votes), then I will push them to stable for F-10 and F-11. Feel free to forward this information to the mailing list (I am subscribed, but I read the list via GMANE to cut down on volume, so I don't always get time to read the list). [...] Alex From biopython at maubp.freeserve.co.uk Wed Aug 19 13:26:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 14:26:18 +0100 Subject: [Biopython] Paired end SFF data Message-ID: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> Hi all, We've been talking about adding Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format (used for Roche 454 data). This is a public standard documented here: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats One of the open questions was how to deal with paired end data.
The Roche 454 website has a nice summary of how this works: http://www.454.com/products-solutions/experimental-design-options/multi-span-paired-end-reads.asp Basically after sample preparation, you have DNA fragments containing the 3' end of your sequence, a known linker, and then the 5' end of your sequence. The sequencing machine doesn't need to know what the magic linker sequence is, and (I infer) after sample preparation, everything proceeds as normal for single end 454 sequencing. The upshot is the SFF file for a paired end read is exactly like any other SFF file (apparently even for the XML meta data Roche include), just most of the reads should have a "magic" linker sequence somewhere in them. I have located some publicly available Roche SFF files at the Sanger Centre which include some paired end reads (note the rules about publishing analysis of this data): http://www.sanger.ac.uk/Projects/Echinococcus/ ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/ For example, the 2008_09_03.tar.gz archive contains a single 446MB file FGGXRDY01.sff with 278801 reads. ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/2008_09_03.tar.gz This is the XML meta data from FGGXRDY01.sff, 454 FGGXRDY R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE /data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis /data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis 1.1.03 Nothing in this XML meta data says this is for paired end reads (nor can this be specified elsewhere in the SFF file format). However of the 278801 reads in FGGXRDY01.sff, about a third (108823 if you look before trimming) have a perfect match to the FLX linker: GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC. 
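Counting those exact matches boils down to a substring search. A sketch of partitioning a read on a known linker (exact matches only; real reads need approximate matching to cope with sequencing errors):

```python
# Sketch: split a paired-end 454 read on an exact copy of the FLX
# linker, as in the count above. Only perfect linker copies are found;
# in practice heuristic/approximate matching would be needed.

FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"

def split_paired_read(read, linker=FLX_LINKER):
    """Return (left, right) fragments if the linker is found, else None."""
    index = read.find(linker)
    if index == -1:
        return None
    return read[:index], read[index + len(linker):]

# Toy read: 3' fragment + linker + 5' fragment (made-up fragments)
toy = "ACGTACGT" + FLX_LINKER + "TTTTCCCC"
print(split_paired_read(toy))  # -> ('ACGTACGT', 'TTTTCCCC')
```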
Note that the linker sequence depends on how the sample was prepared, and differs for different Roche protocols. e.g. According to the wgs-assembler documentation the known Roche 454 Titanium paired end linkers are instead: TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA See http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs This is a good news/bad news situation for Bio.SeqIO. The bad news is that identifying the paired end reads means knowing the linker sequence(s) used, and finding them within each read. If the read was sequenced perfectly, this is easy - but normally some heuristics are needed. I see this as outside the scope of basic file parsing (i.e. not something to go in Bio.SeqIO, but maybe in Bio.SeqUtils or Bio.Sequencing). The good news is that Bio.SeqIO can treat paired end SFF files just like single end reads - we don't have to worry about complicated new Seq/SeqRecord objects to hold short reads separated by an unknown region of estimated length. Peter From biopython at maubp.freeserve.co.uk Wed Aug 19 15:52:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 Aug 2009 16:52:52 +0100 Subject: [Biopython] Paired end SFF data In-Reply-To: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> References: <320fb6e00908190626t1cb94264p6464ab0ddcab596c@mail.gmail.com> Message-ID: <320fb6e00908190852i3b2b3fe3l2e44b2aa427f4cea@mail.gmail.com> On Wed, Aug 19, 2009 at 2:26 PM, Peter wrote: > > Basically after sample preparation, you have DNA fragments containing > the 3' end of your sequence, a known linker, and then the 5' end of > your sequence. The sequencing machine doesn't need to know what the > magic linker sequence is, and (I infer) after sample preparation, > everything proceeds as normal for single end 454 sequencing. 
The > upshot is the SFF file for a paired end read is exactly like any other > SFF file (apparently even for the XML meta data Roche include), just > most of the reads should have a "magic" linker sequence somewhere in > them. > > ... FLX linker: > GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC. > > Note that the linker sequence depends on how the sample was prepared, > and differs for different Roche protocols. e.g. According to the > wgs-assembler documentation the known Roche 454 Titanium paired end > linkers are instead: > TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and > CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA > See http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs According to the MIRA documentation, local sequencing centres may also use their own linker sequences (and have been known to modify the adaptor sequences), which would make things more complicated. http://www.chevreux.org/uploads/media/mira3_faq.html#section_10 Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 09:48:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:48:53 +0100 Subject: [Biopython] Deprecating Bio.Prosite and Bio.Enzyme Message-ID: <320fb6e00908200248j26d20cefq6e3cf9373881d990@mail.gmail.com> Hi all, Bio.Prosite and Bio.Enzyme were declared obsolete in Release 1.50, being replaced by Bio.ExPASy.Prosite and Bio.ExPASy.Enzyme, respectively. Are there any objections to deprecating Bio.Prosite and Bio.Enzyme for the next release? Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 09:51:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:51:19 +0100 Subject: [Biopython] Deprecating Bio.EZRetrieve, NetCatch, FilteredReader and SGMLHandle Message-ID: <320fb6e00908200251g6858939ci8de9192b9af7ad98@mail.gmail.com> Hi all, The minor modules Bio.EZRetrieve, Bio.NetCatch, Bio.File.SGMLHandle, Bio.FilteredReader were declared obsolete in Release 1.50. 
Are there any objections to us deprecating them in the next release? Thanks, Peter From biopython at maubp.freeserve.co.uk Thu Aug 20 09:57:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 10:57:41 +0100 Subject: [Biopython] Removing deprecated module Bio.Ndb Message-ID: <320fb6e00908200257w749c0650jd16fcc1648fb1c4b@mail.gmail.com> Hi all, The Bio.Ndb module was deprecated almost a year ago in Biopython 1.49 (Nov 2008), as the website it parsed has been redesigned. Unless there are any objections (or offers to update the code), I'd like to remove this module for the next release of Biopython. Peter From kellrott at gmail.com Thu Aug 20 18:26:33 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 11:26:33 -0700 Subject: [Biopython] SQL Alchemy based BioSQL Message-ID: I've posted a git fork of biopython with a BioSQL system based on SQL Alchemy. It can be found at git://github.com/kellrott/biopython.git It successfully completes unit tests copied from test_BioSQL and test_BioSQL_SeqIO. The unit testing runs on sqlite. But it should abstract out to any database system that SQLAlchemy supports. From the web site, the list includes: SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase, Informix, and IBM DB2. Kyle From biopython at maubp.freeserve.co.uk Thu Aug 20 20:10:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 21:10:05 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: Message-ID: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Hi Kyle, Thanks for signing up to the mailing list to talk about this work. On Thu, Aug 20, 2009 at 7:26 PM, Kyle Ellrott wrote: > I've posted a git fork of biopython with a BioSQL system based on SQL > Alchemy. ?It can be found at git://github.com/kellrott/biopython.git > It successfully completes unit tests copied from test_BioSQL and > test_BioSQL_SeqIO. > The unit testing runs on sqlite. 
But it should abstract out to any > database system that SQLAlchemy supports. From the web site, the list > includes: SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS > Access, Sybase, Informix, and IBM DB2. Sounds interesting - but can you explain your motivation? Brad Chapman had already suggested something with BioSQL and SQLAlchemy, but I can't find the emails right now. Maybe we talked about it in person at BOSC 2009... I forget. Brad? But what I think I said then was that while I like SQLAlchemy, and have used it with BioSQL as part of a web application, I don't see that we need it for Biopython's BioSQL support. We essentially have a niche ORM for going between the BioSQL tables and the Biopython SeqRecord object. I don't see more back end databases alone as a good reason for using SQLAlchemy in Biopython's BioSQL bindings. In most (all?) cases SQLAlchemy in turn calls something like MySQLdb to do the real work. You mention lots of other back ends supported by SQLAlchemy, but very few of them have BioSQL schemas - currently schemas exist only for PostgreSQL, MySQL, Oracle, HSQLDB, and Apache Derby. As you know (because it is in your branch, grin), Brad has done a schema for SQLite and got this working with Biopython already, and we already support MySQL and PostgreSQL. That just leaves Biopython lacking support for the existing Oracle, HSQLDB, and Apache Derby BioSQL schemas. As long as these have a Python binding using the Python Database API Specification v2.0, this shouldn't be hard. For example, extending Biopython's BioSQL support using cx_Oracle to talk to an Oracle database seems like a useful incremental improvement. [That wasn't meant to come across as negative, I'm just wary of adding a heavyweight dependency without a good reason] Something I would be interested in is a set of SQLAlchemy model definitions for the BioSQL tables (ideally database neutral).
I've got a very preliminary, partial and minimal set done - and I think Brad has some too. This would be useful for anyone wanting to go beyond the Biopython SeqRecord based BioSQL support. Peter From kellrott at gmail.com Thu Aug 20 20:57:29 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 13:57:29 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: > Sounds interesting - but can you explain your motivation? The primary motivation is Jython compatibility (which is the main purpose of the branch). MySQLdb depends on some C extensions which make it hard to port to Jython. I don't keep track of IronPython, but I would imagine it would be a similar situation on the .Net platform. Beta SQLAlchemy 0.6 ( available on the SVN right now, but soon to be released ) supports the MySQL Connector/Java interface, so it works with Jython. Using this combination was the only way I could get a Jython BioPython to connect to a database. As a technical note, now that this works, it means that you can use BioPython and BioJava in the same memory space. I used BioPython's SQL code to get the data, and then passed it to BioJava's Smith-Waterman alignment code to calculate alignments, all in one script. > But what I think I said then was that while I like SQLAlchemy, > and have used it with BioSQL as part of a web application, I > don't see that we need it for Biopython's BioSQL support. We > essentially have a niche ORM for going between the BioSQL > tables and the Biopython SeqRecord object. Yes, but it's an ORM that only supports one form of Python. Let somebody else worry about wrapping the details of other systems like Jython. > [That wasn't meant to come across as negative, I'm just > wary of adding a heavyweight dependency without a good > reason] It doesn't have to replace the existing system.
It can sit alongside, and not get installed if SQLAlchemy isn't available. If we leave the naming as is, it won't affect anybody's code. But if they do want to use it, it can replace the original system in a script call:

from BioSQL import BioSQLAlchemy as BioSeqDatabase
from BioSQL import BioSeqAlchemy as BioSeq

And it should work exactly the same. > Something I would be interested in is a set of SQLAlchemy > model definitions for the BioSQL tables (ideally database > neutral). I've got a very preliminary, partial and minimal > set done - and I think Brad has some too. This would be > useful for anyone wanting to go beyond the Biopython > SeqRecord based BioSQL support. Yes, the way SQLAlchemy sets up Python data structures based on the structure of the database opens up a lot of cool ways to dynamically create queries. Kyle From biopython at maubp.freeserve.co.uk Thu Aug 20 21:21:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Aug 2009 22:21:30 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> On Thu, Aug 20, 2009 at 9:57 PM, Kyle Ellrott wrote: > >> Sounds interesting - but can you explain your motivation? > > The primary motivation is Jython compatibility (which is the main > purpose of the branch). MySQLdb depends on some C extensions which > make it hard to port to Jython. I don't keep track of IronPython, but > I would imagine it would be a similar situation on the .Net platform. > Beta SQLAlchemy 0.6 ( available on the SVN right now, but soon to be > released ) supports the MySQL Connector/Java interface, so it works > with Jython. Using this combination was the only way I could get a > Jython BioPython to connect to a database. Ah. That ties in with the other changes on your github tree (to work nicely with Jython) which had seemed unrelated to me.
I guess MySQLdb etc uses C code which means it won't work under Jython. I don't know enough about Jython to say if there are any other alternatives to using SQLAlchemy. > As a technical note, now that this works, it means that you can > use BioPython and BioJava in the same memory space. I > used BioPython's SQL code to get the data, and then passed > it to BioJava's Smith-Waterman alignment code to calculate > alignments, all in one script. This might be a silly question, but why not just use BioJava to talk to BioSQL instead? Or use Biopython's pairwise alignment code. Was the point just to demonstrate things working together? Peter From kellrott at gmail.com Thu Aug 20 21:59:08 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 20 Aug 2009 14:59:08 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> Message-ID: > Ah. That ties in with the other changes on your github tree (to > work nicely with Jython) which had seemed unrelated to me. > > I guess MySQLdb etc uses C code which means it won't > work under Jython. I don't know enough about Jython to say > if there are any other alternatives to using SQLAlchemy. If you aren't using a layer of abstraction like SQLAlchemy, then you can use the standard Java SQL interfaces (JDBC). But code written for that would only work within Jython and be useless for CPython. > This might be a silly question, but why not just use BioJava to > talk to BioSQL instead? Or use Biopython's pairwise alignment > code. Was the point just to demonstrate things working together? To prove that it could be done was part of the point. But there is also a 'cross training' attitude about it. BioPython seems more lightweight/easier to use, but has heavier requirements on installing external applications.
BioJava can be harder to use, but it has lots more embedded functionality ( built in dynamic programming and HMM code ). If I can get both working in the same environment, then I get the best of both worlds. Kyle From biopython at maubp.freeserve.co.uk Fri Aug 21 09:27:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Aug 2009 10:27:42 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> Message-ID: <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> On Thu, Aug 20, 2009 at 10:59 PM, Kyle Ellrott wrote: > >> Ah. That ties in with the other changes on your github tree (to >> work nicely with Jython) which had seemed unrelated to me. >> >> I guess MySQLdb etc uses C code which means it won't >> work under Jython. I don't know enough about Jython to say >> if there are any other alternatives to using SQLAlchemy. > > If you aren't using a layer of abstraction like SQLAlchemy, then you > can use the standard Java SQL interfaces (JDBC). But code written for > that would only work within Jython and be useless for CPython. Still, it might be worthwhile. Assuming it can be done as a (Jython specific) modular backend to the existing BioSQL framework, it should be a less invasive change. >> This might be a silly question, but why not just use BioJava to >> talk to BioSQL instead? Or use Biopython's pairwise alignment >> code. Was the point just to demonstrate things working together? > > To prove that it could be done was part of the point. But there is > also a 'cross training' attitude about it. Fair enough. > BioPython seems more lightweight/easier to use, but has > heavier requirements on installing external applications. Biopython does often wrap external command line tools, yes.
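To illustrate that kind of wrapping in miniature, here is a rough sketch that just builds a MUSCLE command line as an argv list for subprocess (the file names are hypothetical, and the -in/-out flags are MUSCLE 3.x style; Bio.Align.Applications provides proper wrapper classes for constructing such command lines):

```python
import subprocess

def muscle_align(in_fasta, out_fasta, muscle_exe="muscle"):
    """Build the argv list for a MUSCLE alignment (v3.x style -in/-out flags)."""
    return [muscle_exe, "-in", in_fasta, "-out", out_fasta]

cmd = muscle_align("unaligned.fasta", "aligned.fasta")
print(" ".join(cmd))
# To actually run it (needs muscle on your PATH):
# subprocess.check_call(cmd)
```

The output file could then be parsed with Bio.AlignIO once the alignment has run.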
> BioJava can be harder to use, but it has lots more embedded > functionality ( built in dynamic programming and HMM code ). Biopython does have its own HMM and pairwise alignment code written in Python (and for Bio.pairwise2 we also have a faster C code version, but you wouldn't get that under Jython). These modules are not covered in the tutorial (if anyone wants to help). > If I can get both working in the same environment, then I get > the best of both worlds. Absolutely. Peter From biopython at maubp.freeserve.co.uk Fri Aug 21 09:51:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Aug 2009 10:51:07 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <320fb6e00908201421w6df7dc00x2dd288ad5bafe325@mail.gmail.com> <320fb6e00908210227x37a2b7e7o890258f013196c48@mail.gmail.com> Message-ID: <320fb6e00908210251t15b7ea43sa41ed91c42db8385@mail.gmail.com> On Fri, Aug 21, 2009 at 10:27 AM, Peter wrote: > On Thu, Aug 20, 2009 at 10:59 PM, Kyle Ellrott wrote: >> >>> Ah. That ties in with the other changes on your github tree (to >>> work nicely with Jython) which had seemed unrelated to me. >>> >>> I guess MySQLdb etc uses C code which means it won't >>> work under Jython. I don't know enough about Jython to say >>> if there are any other alternatives to using SQLAlchemy. >> >> If you aren't using a layer of abstraction like SQLAlchemy, then you >> can use the standard Java SQL interfaces (JDBC). But code written for >> that would only work within Jython and be useless for CPython. > > Still, it might be worthwhile. Assuming it can be done as a > (Jython specific) modular backend to the existing BioSQL > framework, it should be a less invasive change. Would this mean using zxJDBC (included with Jython 2.1+)?
http://wiki.python.org/jython/UserGuide#database-connectivity-in-jython That sounds worth looking into to me. Peter From chapmanb at 50mail.com Fri Aug 21 12:46:14 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 21 Aug 2009 08:46:14 -0400 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> Message-ID: <20090821124614.GH26023@sobchak.mgh.harvard.edu> Hi all; Kyle: > > I've posted a git fork of biopython with a BioSQL system based on SQL > > Alchemy. It can be found at git://github.com/kellrott/biopython.git > > It successfully completes unit tests copied from test_BioSQL and > > test_BioSQL_SeqIO. Awesome. Peter: > Brad Chapman had already suggested something with BioSQL > and SQLAlchemy, but I can't find the emails right now. Maybe > we talked about it in person at BOSC 2009... I forget. Brad? Yup, I was floating this idea around. It's great to see someone tackling it. > But what I think I said then was that while I like SQLAlchemy, > and have used it with BioSQL as part of a web application, I > don't see that we need it for Biopython's BioSQL support. We > essentially have a niche ORM for going between the BioSQL > tables and the Biopython SeqRecord object. > > I don't see more back end databases alone as a good reason > for using SQLAlchemy in Biopython's BioSQL bindings. In > most (all?) cases SQLAlchemy in turn calls something like > MySQLdb to do the real work. SQLAlchemy is a pervasive and growing part of interacting with databases using Python. It encapsulates all of the nastiness of dealing with individual databases and has a large community resolving problems on more niche setups like Jython+MySQL. It also offers a nice object layer which is an alternative to the BioSeq interface we have built.
It's a lightweight install -- all Python and no external dependencies beyond the interfaces you would already need to have to access your database of choice. Why do we want to be learning and implementing database specific things when there is code already taking care of these problems? Kyle implemented this so it can live beside the existing code base, which I think is a nice move. I'm +1 on including this and moving in the direction of SQLAlchemy. > Something I would be interested in is a set of SQLAlchemy > model definitions for the BioSQL tables (ideally database > neutral). I've got a very preliminary, partial and minimal > set done - and I think Brad has some too. This would be > useful for anyone wanting to go beyond the Biopython > SeqRecord based BioSQL support. Yes, this would be my only suggestion. It would be really useful to have the BioSQL tables mapped as object definitions and have the SQLAlchemy BioSQL based on these. This would open us up to other object based implementations like Google App Engine or Document database mappers. I pushed what I have so far in this direction on GitHub: http://github.com/chapmanb/bcbb/blob/master/biosql/BioSQL-SQLAlchemy_definitions.py I also implemented some of the objects in Google App Engine and replicated the current Biopython BioSQL structure for loading and retrieving objects: http://github.com/chapmanb/biosqlweb/tree/master/app/lib/python/BioSQL/GAE This is all partially finished, but please feel free to take whatever is useful. Brad From italo.maia at gmail.com Mon Aug 24 04:27:01 2009 From: italo.maia at gmail.com (Italo Maia) Date: Mon, 24 Aug 2009 01:27:01 -0300 Subject: [Biopython] How can i use muscle to align with biopython? Message-ID: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Ok, with Clustal I import MultipleAlignCL and do_alignment and do the stuff, but where are the alignment modules for muscle? Using Biopython 1.5.1 here with Ubuntu 9. -- "Arrogance is the weapon of the weak."
=========================== Italo Moreira Campelo Maia Ciência da Computação - UECE Web and Desktop Developer Java, Python Programmer Ubuntu User For Life! ----------------------------------------------------- http://www.italomaia.com/ http://twitter.com/italomaia/ http://eusouolobomal.blogspot.com/ =========================== From ajperry at pansapiens.com Mon Aug 24 07:16:41 2009 From: ajperry at pansapiens.com (Andrew Perry) Date: Mon, 24 Aug 2009 17:16:41 +1000 Subject: [Biopython] How can i use muscle to align with biopython? In-Reply-To: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> References: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Message-ID: On Mon, Aug 24, 2009 at 2:27 PM, Italo Maia wrote: > Ok, with clustal i import MultipleAlignCL and do_alignment and do the > stuff, > but where is the alignment modules for muscle? > Using biopython 1.5.1 here with ubuntu 9. > > According to the docs, Bio.Clustalw.MultipleAlignCL is now considered "semi-obsolete". The newer wrappers for multiple alignment programs (including MUSCLE) can be found as part of Bio.Align.Applications in Biopython 1.51. See the cookbook example here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc68 Andrew Perry Postdoctoral Fellow Whisstock Lab Department of Biochemistry and Molecular Biology Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. Mobile: +61 409 808 529 From biopython at maubp.freeserve.co.uk Mon Aug 24 09:51:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Aug 2009 10:51:14 +0100 Subject: [Biopython] How can i use muscle to align with biopython?
In-Reply-To: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> References: <800166920908232127l6dd27733wf7a85470b9ad0a06@mail.gmail.com> Message-ID: <320fb6e00908240251j5ea5246cq1207d0e3033c3b04@mail.gmail.com> On Mon, Aug 24, 2009 at 5:27 AM, Italo Maia wrote: > Ok, with clustal i import MultipleAlignCL and do_alignment and do the stuff, > but where is the alignment modules for muscle? > Using biopython 1.5.1 here with ubuntu 9. That would be Biopython 1.51 (one, fifty-one). ;) Also Ubuntu 9 is unclear, I guess you meant Ubuntu 9.04 ("Jaunty Jackalope") which was released April 2009 (hence 9.04). Ubuntu releases are every six months, so their next release should be October 2009, and is expected to be called Ubuntu 9.10 ("Karmic Koala"). Anyway - thanks for answering about the alignments, Andrew. To recap, to call the alignment tools, use Bio.Align.Applications and Python module subprocess, and then parse the resulting alignment file with Bio.AlignIO. All in the tutorial :) Bio.Clustalw is now semi-obsolete, but can expect a gradual retirement (just like Bio.Fasta was gradually phased out) because it was widely used, and we don't want to force people to migrate their old code immediately. I wouldn't recommend using Bio.Clustalw for new scripts, try Bio.Align.Applications instead. Regards, Peter From wgheath at gmail.com Mon Aug 24 20:49:47 2009 From: wgheath at gmail.com (William Heath) Date: Mon, 24 Aug 2009 13:49:47 -0700 Subject: [Biopython] Wanting to teach a class on biopython that is particularly geared toward synthetic biology Message-ID: Hi All, I am a member of Tech Shop in Mountain View, CA and I want to teach a class on biopython that is specifically tailored toward the goals of synthetic biology. Can anyone help me to come up with a lesson plan for such a class? In particular I want to use bio bricks, and good opensource design programs for biobricks. Can anyone recommend any?
I also want to utilize any/all concepts in this training: http://www.edge.org/documents/archive/edge296.html Please let me know your ideas on such a lesson plan. -Tim From krother at rubor.de Tue Aug 25 08:58:44 2009 From: krother at rubor.de (Kristian Rother) Date: Tue, 25 Aug 2009 10:58:44 +0200 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: References: Message-ID: <4A93A7C4.1060109@rubor.de> Hi William, I was teaching Python/BioPython in several courses for Biologists. I wrote some individual lesson plans but they are not really readable for other people (see attachment). There is some material on-line, though: http://www.rubor.de/lehre_en.html Typically, the lessons consisted of 2h lecture + 1h exercises on language concepts + 3h exercises on a single, more biological task. The code written during the latter was reviewed and scored and the students knew about that. They had a two-week Python crash course before. Details on request. Best Regards, Kristian William Heath schrieb: > Hi All, > I am a member of Tech Shop in Mountain View, CA and I want to teach a class > on biopython that is specifically tailored toward the goals of synthetic > biology. Can anyone help me to come up with lesson plan for such a class? > In particular I want to use bio bricks, and good opensource design programs > for biobricks. Can anyone recommend any? > > I also want to utilize any/all concepts in this training: > > http://www.edge.org/documents/archive/edge296.html > > Please let me know your ideas on such a lesson plan. > > > -Tim > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -------------- next part -------------- A non-text attachment was scrubbed...
Name: Lesson2_Plan.pdf Type: application/pdf Size: 56732 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Tue Aug 25 09:56:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 10:56:42 +0100 Subject: [Biopython] Biopython 1.51 in Debian repository Message-ID: <320fb6e00908250256mfafa20rbf699ad63e618dff@mail.gmail.com> Hi all, I'd like to thank Philipp Benner for updating the Debian packages for Biopython 1.51 (including handling dropping Martel and removing mxTextTools from our dependencies). This is now in Debian unstable (sid), and will as usual progress out to Debian testing and also Ubuntu eventually. http://packages.debian.org/unstable/python/python-biopython http://packages.debian.org/unstable/python/python-biopython-sql Thanks! Peter From wgheath at gmail.com Tue Aug 25 17:06:49 2009 From: wgheath at gmail.com (William Heath) Date: Tue, 25 Aug 2009 10:06:49 -0700 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: <4A93A7C4.1060109@rubor.de> References: <4A93A7C4.1060109@rubor.de> Message-ID: This is amazing thanks! -Tim On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote: > > Hi William, > > I was teaching Python/BioPython in several courses for Biologists. I wrote > some individual lesson plans but they are rather not readable for other > people (see attachment). There is some material on-line, though: > > http://www.rubor.de/lehre_en.html > > Typically, the lessons consisted of 2h lecture + 1h exercises on language > concepts + 3h exercises on a single, more biological task. The code written > during the latter was reviewed and scored and the students knew about that. > They had a two-week Python crash course before. > > Details on request. > > Best Regards, > Kristian > > > William Heath schrieb: > >> Hi All, >> I am a member of Tech Shop in Mountain View, CA and I want to teach a >> class >> on biopython that is specifically tailored toward the goals of synthetic >> biology. 
Can anyone help me to come up with lesson plan for such a class? >> In particular I want to use bio bricks, and good opensource design >> programs >> for biobricks. Can anyone recommend any? >> >> I also want to utilize any/all concepts in this training: >> >> http://www.edge.org/documents/archive/edge296.html >> >> Please let me know your ideas on such a lesson plan. >> >> >> -Tim >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> >> > > From wgheath at gmail.com Tue Aug 25 17:08:27 2009 From: wgheath at gmail.com (William Heath) Date: Tue, 25 Aug 2009 10:08:27 -0700 Subject: [Biopython] Wanting to teach a class on biopython .. In-Reply-To: References: <4A93A7C4.1060109@rubor.de> Message-ID: I am very interested in common bio python tasks as they relate specifically to synthetic biology. Could you give me some examples of such tasks? -Tim On Tue, Aug 25, 2009 at 10:06 AM, William Heath wrote: > This is amazing thanks! > -Tim > > > On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote: > >> >> Hi William, >> >> I was teaching Python/BioPython in several courses for Biologists. I wrote >> some individual lesson plans but they are rather not readable for other >> people (see attachment). There is some material on-line, though: >> >> http://www.rubor.de/lehre_en.html >> >> Typically, the lessons consisted of 2h lecture + 1h exercises on language >> concepts + 3h exercises on a single, more biological task. The code written >> during the latter was reviewed and scored and the students knew about that. >> They had a two-week Python crash course before. >> >> Details on request. >> >> Best Regards, >> Kristian >> >> >> William Heath schrieb: >> >>> Hi All, >>> I am a member of Tech Shop in Mountain View, CA and I want to teach a >>> class >>> on biopython that is specifically tailored toward the goals of synthetic >>> biology. 
Can anyone help me to come up with lesson plan for such a >>> class? >>> In particular I want to use bio bricks, and good opensource design >>> programs >>> for biobricks. Can anyone recommend any? >>> >>> I also want to utilize any/all concepts in this training: >>> >>> http://www.edge.org/documents/archive/edge296.html >>> >>> Please let me know your ideas on such a lesson plan. >>> >>> >>> -Tim >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >>> >> >> > From kellrott at gmail.com Wed Aug 26 01:01:30 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 25 Aug 2009 18:01:30 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <20090821124614.GH26023@sobchak.mgh.harvard.edu> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> Message-ID: I've added a new database function lookupFeature to quickly search for sequence features without having to load all of them for any particular sequence. Because it's a non-standard function, I've taken the opportunity to play around with some more dynamic search features. Once we get the interface for these types of searches locked down on lookupFeature, a similar system could be implemented in the standard 'lookup' call. The work is posted at http://github.com/kellrott/biopython The following is an example of a working search that pulls all of the protein_ids from NC_004663.1 between 60,000 and 70,000 on the positive strand.
import sys
from BioSQL import BioSQLAlchemy as BioSeqDataBase

server = BioSeqDataBase.open_database( driver="mysql", user='test', host='localhost', db='testdb' )
db = server[ 'bacteria' ]

seq = db.lookup( version="NC_004663.1" )

features = db.lookupFeatures( BioSeqDataBase.Column('strand') == 1,
        BioSeqDataBase.Column('start_pos') < 70000,
        BioSeqDataBase.Column('end_pos') > 60000,
        bioentry_id = seq._primary_id, name="protein_id" )

#print len(features)
for feature in features:
        print feature

> Kyle: >> > I've posted a git fork of biopython with a BioSQL system based on SQL >> > Alchemy. It can be found at git://github.com/kellrott/biopython.git >> > It successfully completes unit tests copied from test_BioSQL and >> > test_BioSQL_SeqIO. From biopython at maubp.freeserve.co.uk Wed Aug 26 11:10:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 12:10:44 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> On Wed, Aug 26, 2009 at 2:01 AM, Kyle Ellrott wrote: > I've added a new database function lookupFeature to quickly search for > sequence features without having to load all of them for any particular > sequence. > Because it's a non-standard function, I've taken the opportunity to > play around with some more dynamic search features. > Once we get the interface for these types of searches locked down on > lookupFeature, a similar system could be implemented in the standard > 'lookup' call. I'm not sure about that - all the other "lookup" functions already in BioSeqDatabase return DBSeqRecord objects don't they? See below for an alternative... > The work is posted at http://github.com/kellrott/biopython You could have posted this on the dev list, but this is debatable. If it all gets too technical we should probably move the thread...
> The following is an example of a working search that pulls all of the
> protein_ids from NC_004663.1 between 60,000 and 70,000 on the positive
> strand.
>
> import sys
> from BioSQL import BioSQLAlchemy as BioSeqDataBase
>
> server = BioSeqDataBase.open_database( driver="mysql", user='test',
> host='localhost', db='testdb' )
> db = server[ 'bacteria' ]
>
> seq = db.lookup( version="NC_004663.1" )
>
> features = db.lookupFeatures( BioSeqDataBase.Column('strand') == 1,
>         BioSeqDataBase.Column('start_pos') < 70000,
>         BioSeqDataBase.Column('end_pos') > 60000,
>         bioentry_id = seq._primary_id, name="protein_id" )
>
> #print len(features)
> for feature in features:
>         print feature
>
Interesting - and potentially useful if you are interested in just part of the genome (e.g. an operon). Have you tested this on composite features (e.g. a join)? Without looking into the details of your code this isn't clear. I wonder how well this would scale with a big BioSQL database with hundreds of bioentry rows, and millions of seqfeature and location rows? You'd have to search all the location rows, filtering on the seqfeature_id linked to the bioentry_id you wanted. The performance would depend on the database server, the database software, how big the database is, and any indexing etc. Have you signed up to the BioSQL mailing list yet Kyle? It may help for discussing things like the SQL indexing. On the other hand, if all the record's features have already been loaded into memory, there would just be thousands of locations to look at - it might be quicker. This brings me to another idea for how this interface might work, via the SeqRecord - how about adding a method like this:

def filtered_features(self, start=None, end=None, type=None):

Note I think it would also be nice to filter on the feature type (e.g. CDS or gene). This method would return a sublist of the full feature list (i.e.
a list of those SeqFeature objects within the range given, and of the appropriate type). This could initially be implemented with a simple loop, but there would be scope for building an index or something more clever. [Note we are glossing over some potentially ambiguous cases with complex composite locations, where the "start" and "end" may differ from the "span" of the feature.] The DBSeqRecord would be able to do the same (just inherit the method), but you could try doing this via an SQL query, to get the database to tell you which seqfeature_ids are wanted, and then return those (existing) SeqFeature objects. [Note we should avoid returning new SeqFeature objects, as it could be very confusing to have multiple SeqFeature instances for the same feature in the database - as well as wasting memory, and time to build the new objects.] Peter From kellrott at gmail.com Wed Aug 26 16:40:00 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 26 Aug 2009 09:40:00 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> Message-ID: > I'm not sure about that - all the other "lookup" functions already in > BioSeqDatabase return DBSeqRecord objects don't they? See > below for an alternative... Although the example I provided didn't illustrate it, the reason I did it this way was to provide a function that could look up features without having to find their DBSeqRecords first. In my particular case, I've loaded all of the Genbank files from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want to be able to find proteins by their protein_id:

lookupFeature( protein_id="NP_808921.1" )

> Have you tested this on composite features (e.g. a join)? > Without looking into the details of your code this isn't clear. That isn't supported in the current syntax.
But when using SQLAlchemy it's pretty easy to generate new queries by treating the tables and selection criteria like lists. Right now I just shove rules into an 'and' list, but we could also create an 'or' list. > I wonder how well this would scale with a big BioSQL database > with hundreds of bioentry rows, and millions of seqfeature > and location rows? You'd have to search all the location rows, > filtering on the seqfeature_id linked to the bioentry_id you > wanted. The performance would depend on the database > server, the database software, how big the database is, and > any indexing etc. Like I said, my test database is all of the published bacterial GenBank files from the NCBI ftp. It's about 1784 bioentry rows, and about 27,838,905 seqfeature_qualifier_value rows. The location based searches were pretty much instant. The only way that I've augmented the database is by adding the lines:

MySQL:
CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10));
SQLite:
CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value);

so that I could look up features by their value quickly (like getting a protein by its protein_id). Note the '(10)' for MySQL, because for some reason it will only index text blobs if you define a prefix area for it to index... > On the other hand, if all the record's features have already been > loaded into memory, there would just be thousands of locations > to look at - it might be quicker. My experience so far is that pulling all the features for a single large chromosome can take a while. One of the other things that I did to speed things up (it's not actually faster, it just spreads the work out) is to build a DBSeqFeature with a lazy loader. It just stores its seqfeature_id and type until __getattr__ is hit, and only then does it bother to load the data in from the database.
So if you bring in 2000 seqfeatures, you get the list back and read the first entry without having to first load the other 1999 entries. > This brings me to another idea for how this interface might work, > via the SeqRecord - how about adding a method like this: > > def filtered_features(self, start=None, end=None, type=None): > > Note I think it would also be nice to filter on the feature type (e.g. > CDS or gene). This method would return a sublist of the full > feature list (i.e. a list of those SeqFeature objects within the > range given, and of the appropriate type). This could initially > be implemented with a simple loop, but there would be scope > for building an index or something more clever. It may be worthwhile to just use the sqlite memory database. We store the schema in the module, and have a simple wrapper module that builds the sqlite RAM database and loads in the sequence file to the database. Something like:

from Bio import SeqIODB
handle = open("ls_orchid.gbk")
for seq_record in SeqIODB.parse(handle, "genbank") :
    print seq_record.id

But in the background SeqIODB would be creating a Sqlite memory database and loading ls_orchid into it; its __iter__ function would simply spit out DBSeqRecords for each of the bioentries... Kyle From biopython at maubp.freeserve.co.uk Wed Aug 26 20:24:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 21:24:34 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> Message-ID: <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> On Wed, Aug 26, 2009 at 5:40 PM, Kyle Ellrott wrote: >> I'm not sure about that - all the other "lookup" functions already in >> BioSeqDatabase return DBSeqRecord objects don't they? See >> below for an alternative...
> > Although the example I provided didn't illustrate it, the reason I did > it this way was to provide a function that could look up features > without having to find their DBSeqRecords first. In my particular > case, I've loaded all of the Genbank files from > ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want > to be able to find proteins by their protein_id. > > lookupFeature( protein_id="NP_808921.1" ) That makes sense. >> Have you tested this on composite features (e.g. a join)? >> Without looking into the details of your code this isn't clear. > That isn't supported in the current syntax. But when using > SQLAlchemy it's pretty easy to generate new queries by > treating the tables and selection criteria like lists. Right > now I just shove rules into an 'and' list, but we could > also create an 'or' list. Even the bacterial GenBank files have joins ;) >> I wonder how well this would scale with a big BioSQL database >> with hundreds of bioentry rows, and millions of seqfeature >> and location rows? You'd have to search all the location rows, >> filtering on the seqfeature_id linked to the bioentry_id you >> wanted. The performance would depend on the database >> server, the database software, how big the database is, and >> any indexing etc. > Like I said, my test database is all of the published bacterial > GenBank files from the NCBI ftp. It's about 1784 bioentry rows, > and about 27,838,905 seqfeature_qualifier_value rows. The location based > searches were pretty much instant. Cool. That sounds like more or less the same amount of data we have in our BioSQL database. > The only way that I've augmented the database is by adding the lines > > MySQL: > CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10)); > SQLite: > CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value); > > so that I could look up features by their value quickly (like getting > a protein by its protein_id).
Note the '(10)' for MySQL, because for > some reason it will only index text blobs if you define a prefix area > for it to index... We've done something similar here too - I'd have to check exactly what we used for the index though. >> On the other hand, if all the record's features have already been >> loaded into memory, there would just be thousands of locations >> to look at - it might be quicker. > My experience so far is that pulling all the features for a single > large chromosome can take a while. > One of the other things that I did to speed things up (it's not > actually faster, it just spreads the work out), is to build a > DBSeqFeature with a lazy loader. It just stores its seqfeature_id > and type until __getattr__ is hit, and only then does it bother to > load the data in from the database. So if you bring in 2000 > seqfeatures, you get the list back and read the first entry without > having to first load the other 1999 entries. Yes - Leighton Pritchard has also done a DBSeqFeature object (with lazy loading of the qualifiers too). I guess your code will be similar. This is something I think could well be worth merging into BioSQL (and doesn't depend on SQLAlchemy at all). >> This brings me to another idea for how this interface might work, >> via the SeqRecord - how about adding a method like this: >> >> def filtered_features(self, start=None, end=None, type=None): >> >> Note I think it would also be nice to filter on the feature type (e.g. >> CDS or gene). This method would return a sublist of the full >> feature list (i.e. a list of those SeqFeature objects within the >> range given, and of the appropriate type). This could initially >> be implemented with a simple loop, but there would be scope >> for building an index or something more clever. > It may be worthwhile to just use the sqlite memory database.
We > store the schema in the module, and have a simple wrapper module that > builds the sqlite RAM database and loads in the sequence file to the > database. > Something like:

> from Bio import SeqIODB
> handle = open("ls_orchid.gbk")
> for seq_record in SeqIODB.parse(handle, "genbank") :
>     print seq_record.id

> But in the background SeqIODB would be creating a Sqlite memory > database and loading ls_orchid into it; its __iter__ function would > simply spit out DBSeqRecords for each of the bioentries... Is this idea just a shortcut for explicitly loading the GenBank file into a BioSQL database (which hopefully will include an SQLite backend option soon), and then iterating over its records? e.g.

from Bio import SeqIO
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(...)
db = server["Orchids"]
db.load(SeqIO.parse(open("ls_orchid.gbk"), "genbank"))
server.commit()
#Now retrieve the records one by one...

It would make sense to define __iter__ on the BioSQL database object (i.e. the BioSeqDatabase class) to allow iteration on ALL the records in the database (as DBSeqRecord objects). That should be a nice simple enhancement, allowing:

for seq_record in db :
    print seq_record.id

[And again, this has no SQLAlchemy dependence] Peter From kellrott at gmail.com Wed Aug 26 21:38:27 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 26 Aug 2009 14:38:27 -0700 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> Message-ID: > Yes - Leighton Pritchard has also done a DBSeqFeature object > (with lazy loading of the qualifiers too). I guess your code will > be similar.
This is something I think could well be worth merging > into BioSQL (and doesn't depend on SQLAlchemy at all). The version in my branch uses the SQLAlchemy query composition methods I wrote when porting DBSeqRecord from the original _retrieve_features function, so his code would probably be a shorter step for now. > Is this idea just a shortcut for explicitly loading the GenBank > file into a BioSQL database (which hopefully will include an > SQLite backend option soon), and then iterating over its > records? e.g. Yes, it would also have a copy of the biodb-sqlite schema stored as a string in the module, so it could build an in-RAM database on demand. Make the setup and loading automatic. It would appear to be just like a regular file parser. That way, if we start writing crazy feature filter methods based on SQL queries, they can be easily reapplied to file based usage. And we wouldn't have to write a feature filter for the database objects and another for file based objects. Kyle From biopython at maubp.freeserve.co.uk Wed Aug 26 21:55:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 22:55:58 +0100 Subject: [Biopython] SQL Alchemy based BioSQL In-Reply-To: References: <320fb6e00908201310y5ed972b5j6e17b81fef8e97fb@mail.gmail.com> <20090821124614.GH26023@sobchak.mgh.harvard.edu> <320fb6e00908260410i6158c332j8ef6684278ab827@mail.gmail.com> <320fb6e00908261324o3d57f9d0p249a34d638ddca93@mail.gmail.com> Message-ID: <320fb6e00908261455m332e446fjcd4c6bc2e7e985e7@mail.gmail.com> On Wed, Aug 26, 2009 at 10:38 PM, Kyle Ellrott wrote: >> Is this idea just a shortcut for explicitly loading the GenBank >> file into a BioSQL database (which hopefully will include an >> SQLite backend option soon), and then iterating over its >> records? e.g. > > Yes, it would also have a copy of the biodb-sqlite schema stored as a > string in the module, so it could build an in-RAM database on demand. > Make the setup and loading automatic.
It would appear to be just like > a regular file parser. I would agree that once we have SQLite support in BioSQL officially, we can probably ship the schema within Biopython and make using it much more straightforward than the other BioSQL backends (which require the database software and schema to be installed manually). However, I would put the SQLite database on disk, not in RAM. > That way, if we start writing crazy feature > filter methods based on SQL queries, they can be easily reapplied to > file based usage. And we wouldn't have to write a feature filter for > the database objects and another for file based objects. If we are just talking about filtering the feature list (see thread http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006700.html ) then we don't need BioSQL - it seems like overkill. Peter From biopython at maubp.freeserve.co.uk Fri Aug 28 10:16:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:16:52 +0100 Subject: [Biopython] Fwd: Biopython post from kingaram@hanmail.net requires approval In-Reply-To: References: Message-ID: <320fb6e00908280316u503760c0ofb6d3e20d2378cac@mail.gmail.com> Hi all, I have just forwarded the following message to the list as it had been blocked with a "suspicious header". Could I remind people to please try and send "plain text" emails, rather than rich HTML formatting with pictures etc, as these are likely to get blocked by the mailing list. Thanks Peter ---------- Forwarded message ---------- From: "titt" To: Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) Subject: Finding protein ID using Entrez.efetch Hi all, I'm looking for a way to extract the protein ID numbers from GenBank. I got my GenBank data and saved it as an xml file using this command.
from Bio import Entrez
handle=Entrez.efetch(db="nuccore",id="256615878",rettype="gb")
record=handle.read()
save_file = open("record.xml","w")
save_file.write(record)
save_file.close()

What I need is all the protein IDs (for example: EEU21068.1) or GI numbers (for example: 256615878) in this GenBank page for the blast search. Could you let me know how to extract this information, save it in some format, and use it? Thank you, Aram From biopython at maubp.freeserve.co.uk Fri Aug 28 10:37:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:37:45 +0100 Subject: [Biopython] Finding protein ID using Entrez.efetch Message-ID: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> > To: > Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) > Subject: Finding protein ID using Entrez.efetch > > Hi all, > > I'm looking for a way to extract the protein ID numbers from > GenBank. I got my GenBank data and saved it as an xml file using > this command. > > from Bio import Entrez > handle=Entrez.efetch(db="nuccore",id="256615878",rettype="gb") > record=handle.read() > save_file = open("record.xml","w") > save_file.write(record) > save_file.close() That did NOT save the record as XML format. You asked NCBI Entrez EFetch for a GenBank file (rettype="gb"). > What I need is all the protein IDs (for example: EEU21068.1) or GI > numbers (for example: 256615878) in this GenBank page for the blast > search. Could you let me know how to extract this information, save > it in some format, and use it? If all you want is the accession, it is pointless to download the entire record (with its features and sequence). Instead try:

>>> print Entrez.efetch(db="nuccore",id="256615878",rettype="acc", retmode="text").read()
GG698814.1

Note that a nucleotide sequence doesn't have a protein ID! A gene nucleotide should have a single associated protein. A genome sequence will have many associated proteins (this seems to be what you want?).
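[An aside, not part of the original emails: NCBI EFetch also accepts a comma-separated list of IDs, so several GI numbers or accessions can be batched into a single request rather than one request each. A minimal sketch; the helper name and the email address in the commented usage are illustrative, not from the thread.]

```python
# Sketch: NCBI EFetch takes its "id" parameter as a comma-separated
# list, so several IDs can be fetched in one call.  This helper just
# builds that parameter string.

def build_id_param(ids):
    """Join IDs (ints or strings) into the comma-separated form EFetch expects."""
    return ",".join(str(i) for i in ids)

# Illustrative usage (needs network access; the email address is a placeholder):
# from Bio import Entrez
# Entrez.email = "you@example.com"
# handle = Entrez.efetch(db="nuccore", id=build_id_param([256615878, 256615879]),
#                        rettype="acc", retmode="text")
# print handle.read()
```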
If you really do want the GenBank file (e.g. for some other data), then first save it and then parse it using Bio.SeqIO like this:

>>> from Bio import Entrez
>>> net_handle = Entrez.efetch(db="nuccore",id="256615878",rettype="gb")
>>> save_handle = open("record.gb", "w")
>>> save_handle.write(net_handle.read())
>>> save_handle.close()
>>> net_handle.close()

Then,

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("record.gb"), "gb")
>>> print record.id
GG698814.1

You can also look at the CDS features (proteins), and their lists of protein ID(s) and database cross references:

>>> for feature in record.features :
...     if feature.type != "CDS" : continue
...     print feature.qualifiers.get("protein_id", []),
...     print feature.qualifiers.get("db_xref", [])
...
['EEU21067.1'] ['GI:256615879']
['EEU21068.1'] ['GI:256615880']
['EEU21069.1'] ['GI:256615881']
['EEU21070.1'] ['GI:256615882']
['EEU21071.1'] ['GI:256615883']
...

However, if that is all you need, then it is a waste to download the full GenBank file. Try using NCBI Entrez ELink instead? http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 28 10:56:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 11:56:24 +0100 Subject: [Biopython] Finding protein ID using Entrez.efetch In-Reply-To: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> References: <320fb6e00908280337r681a8f90mff0da537a2ff2878@mail.gmail.com> Message-ID: <320fb6e00908280356u3fc23b4bnbadf3ecebf96be82@mail.gmail.com> On Fri, Aug 28, 2009 at 11:37 AM, Peter wrote: >> To: >> Date: Fri, 28 Aug 2009 09:42:03 +0900 (KST) >> Subject: Finding protein ID using Entrez.efetch >> >> Hi all, >> >> I'm looking for the way to extract the data of protein ID numbers in >> the Genbank. ... >> >> What I need is all the protein ID (For example: EEU21068.1) or GI >> number (for example: 256615878) in this Genbank page for the blast >> search. > > ...
> > However, if that is all you need, then it is a waste to download the > full GenBank file. Try using NCBI Entrez ELink instead? > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/elink_help.html Try something based on this:

>>> from Bio import Entrez
>>> data = Entrez.read(Entrez.elink(db="protein", dbfrom="nuccore", id="256615878", retmode="xml"))
>>> for db in data :
...     print "Links for", db["IdList"], "from database", db["DbFrom"]
...     for link in db["LinkSetDb"][0]["Link"] : print link["Id"]
...
Links for ['256615878'] from database nuccore
256616663
256616662
...
256615879

As we try to explain in the tutorial, the Entrez.read() XML parser turns the XML data into Python lists, dictionaries and strings. This reflects the deeply nested nature of the NCBI XML files - you have to dig into the hierarchy to get to the actual list of protein IDs. Peter From bartomas at gmail.com Fri Aug 28 14:17:08 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 28 Aug 2009 15:17:08 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species Message-ID: Hi, I'd like to use blast to measure homology between protein sequences of a species and polyketide sequences from a database. I've been looking in the Biopython tutorial (p. 73) in the section about blast. I'd like to do something similar, like this:

result_handle = NCBIWWW.qblast("blastp", "genbank", record.format("fasta"))

Is it possible to add an option to restrict the search to genbank records that correspond to a given species?
Thanks very much From biopython at maubp.freeserve.co.uk Fri Aug 28 14:38:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 15:38:36 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: References: Message-ID: <320fb6e00908280738w5d5f1236ia50332157e24bf7b@mail.gmail.com> bar tomas wrote: > I'd like to do something similar, like this: > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format("fasta")) > > Is it possible to add an option to restrict the search to genbank records > that correspond to a given species? Yes, you can use a species-specific blast database or include an Entrez query, see for example: http://lists.open-bio.org/pipermail/biopython/2009-June/005215.html Peter From kelly.oakeson at utah.edu Fri Aug 28 15:19:07 2009 From: kelly.oakeson at utah.edu (Kelly F Oakeson) Date: Fri, 28 Aug 2009 09:19:07 -0600 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: References: Message-ID: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Hello list, I would like to do something similar, but I would like to limit my blast search to just the microbial Taxonomic ID. Thanks, Kelly Oakeson kelly.oakeson at utah.edu On Aug 28, 2009, at 8:17 AM, bar tomas wrote: > Hi, > > I'd like to use blast to measure homology between protein sequences > of a > species and polyketide sequences from a database. > I've been looking in the Biopython tutorial (p. 73) in the section about > blast. > I'd like to do something similar, like this: > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format > ("fasta")) > > Is it possible to add an option to restrict the search to genbank > records > that correspond to a given species?
> Thanks very much > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Fri Aug 28 15:27:20 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 28 Aug 2009 16:27:20 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> References: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Message-ID: Hi, I just tried using a taxonomic id. The following both give the same result:

result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string, entrez_query="house mouse[orgn]")
result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string, entrez_query="txid10090[orgn]")

(entrez_query needs to be after the non-keyword arguments or else the compiler complains) On Fri, Aug 28, 2009 at 4:19 PM, Kelly F Oakeson wrote: > Hello list, > I would like to do something similar, but I would like to limit my > blast search to just the microbial Taxonomic ID. > > Thanks, > > Kelly Oakeson > kelly.oakeson at utah.edu > > > > > On Aug 28, 2009, at 8:17 AM, bar tomas wrote: > > > Hi, > > > > I'd like to use blast to measure homology between protein sequences > > of a > > species and polyketide sequences from a database. > > I've been looking in the Biopython tutorial (p. 73) in the section about > > blast. > > I'd like to do something similar, like this: > > > > result_handle = NCBIWWW.qblast("blastp", "genbank", record.format > > ("fasta")) > > > > Is it possible to add an option to restrict the search to genbank > > records > > that correspond to a given species?
> > Thanks very much > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From biopython at maubp.freeserve.co.uk Fri Aug 28 15:29:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 28 Aug 2009 16:29:15 +0100 Subject: [Biopython] How to restrict blast query to proteins of a certain species In-Reply-To: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> References: <666A24ED-F523-4BD0-99ED-B0C0046A3CEB@utah.edu> Message-ID: <320fb6e00908280829h694bf960m7a7936569238b1c6@mail.gmail.com> On Fri, Aug 28, 2009 at 4:19 PM, Kelly F Oakeson wrote: > Hello list, > I would like to do something similar, but I would like to limit my > blast search to just the microbial Taxonomic ID. Then just change the Entrez query, e.g. Taxon ID 2 for eubacteria:

from Bio.Blast import NCBIWWW
fasta_string = """>Test
ATGGCCAATACTCCTTCGGCCAAGAAGGCAGTGCGCAAGATCGCTGCCCGCACCGAGATCAACAAGTCCC
GCCGTTCGCGCGTGCGCACTTTCGTGCGCAAGCTGGAAGACGCTCTGCTGAGCGGCGACAAGCAGGCAGC
GGAAGTTGCGTTCAAGGCTGTTGAGCCTGAACTGATGCGCGCCGCCTCCAAGGGCGTGGTGCACAAGAAC
ACCGCGGCCCGCAAGGTTTCGCGTCTTGCCAAGCGCGTGAAGGCTCTGAACGCCTGA
"""
result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="txid2[orgn]")

Peter From crosvera at gmail.com Fri Aug 28 19:45:46 2009 From: crosvera at gmail.com (Carlos Ríos Vera) Date: Fri, 28 Aug 2009 15:45:46 -0400 Subject: [Biopython] PBD SuperImpose Message-ID: Hello people, I'm trying to use the superimpose() method from the Bio.PDB module, but when I use the set_atoms() method with two atom lists, I got this: "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in size" So, my question is: How can I superimpose two structures of different sizes? Cheers and Thanks. PS: I attached the code that I'm using. -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.py Type: text/x-python Size: 992 bytes Desc: not available URL: From crosvera at gmail.com Fri Aug 28 19:48:18 2009 From: crosvera at gmail.com (Carlos Ríos Vera) Date: Fri, 28 Aug 2009 15:48:18 -0400 Subject: [Biopython] Fwd: PBD SuperImpose In-Reply-To: References: Message-ID: Hello people, I'm trying to use the superimpose() method from the Bio.PDB module, but when I use the set_atoms() method with two atom lists, I got this: "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in size" So, my question is: How can I superimpose two structures of different sizes? Cheers and Thanks. PS: I paste the code that I'm using.

-------code.py------
#!/usr/bin/env python

import sys
from Bio.PDB import *

#get the PDB's path from user command line
struct_path1 = sys.argv[1]
name_struct1 = struct_path1.split('/')[-1].split('.')[0]

struct_path2 = sys.argv[2]
name_struct2 = struct_path2.split('/')[-1].split('.')[0]

#parsing the PDBs
parser = PDBParser()

struct1 = parser.get_structure(name_struct1, struct_path1)
struct2 = parser.get_structure(name_struct2, struct_path2)

#get atoms list
atoms1 = struct1.get_atoms()
atoms2 = struct2.get_atoms()

latoms1 = []
latoms2 = []

for a in atoms1:
    latoms1.append( a )
for a in atoms2:
    latoms2.append( a )

print latoms1
print latoms2

#SuperImpose
sup = Superimposer()
# Specify the atom lists
# "fixed" and "moving" are lists of Atom objects
# The moving atoms will be put on the fixed atoms
sup.set_atoms(latoms1, latoms2)
# Print rotation/translation/rmsd
print "ROTRAN: ", sup.rotran
print "RMS: ", sup.rms
# Apply rotation/translation to the moving atoms
sup.apply(latoms2)

--
http://crosvera.blogspot.com

Carlos Ríos V.
Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile

Linux user number 425502
-------------- next part --------------
A non-text attachment was scrubbed...
Name: superimpose.py
Type: text/x-python
Size: 992 bytes
Desc: not available
URL: 

From rodrigo_faccioli at uol.com.br Sat Aug 29 13:17:25 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Sat, 29 Aug 2009 10:17:25 -0300
Subject: [Biopython] Fwd: PBD SuperImpose
In-Reply-To: 
References: 
Message-ID: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com>

Please, could you tell us which PDB IDs or PDB files you are using?

Cheers,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

2009/8/28 Carlos Ríos Vera
> Hello people,
>
> I'm trying to use the Superimposer class from the Bio.PDB module, but
> when I use the set_atoms() method with two atom lists, I get this:
>
> "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ
> in size"
>
> So, my question is: how can I superimpose two structures of different
> sizes?
>
> Cheers and Thanks.
>
> PS: I paste the code that I'm using.
> [...]
>
> --
> http://crosvera.blogspot.com
>
> Carlos Ríos V.
> Estudiante de Ing. (E) en Computación e Informática.
> Universidad del Bío-Bío
> VIII Región, Chile
>
> Linux user number 425502
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Sat Aug 29 13:48:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 29 Aug 2009 14:48:52 +0100
Subject: [Biopython] PBD SuperImpose
In-Reply-To: 
References: 
Message-ID: <320fb6e00908290648t6b9c20f3p6d48a56cf1144ac5@mail.gmail.com>

2009/8/28 Carlos Ríos Vera :
> Hello people,
>
> I'm trying to use the Superimposer class from the Bio.PDB module, but when
> I use the set_atoms() method with two atom lists, I get this:
>
> "Bio.PDB.PDBExceptions.PDBException: Fixed and moving atom lists differ in
> size"
>
> So, my question is: how can I superimpose two structures of different sizes?

This sounds more like a methodology question than a Python question.

You must decide how to map between the atoms of one chain and
the atoms of the other. If they are different lengths, you will need
to exclude some residues (e.g. for peptides GHIL versus EGHILD, you
would probably ignore the extra trailing/leading residues on the
longer sequence).

If, in addition, the residues are (at least in some cases) different
amino acids, then you will probably only want to calculate the
superposition using the backbone (or even just the C alpha atoms).

One approach to this is to base the atomic mapping on a
pairwise protein sequence alignment.

Peter

From crosvera at gmail.com Sat Aug 29 17:03:55 2009
From: crosvera at gmail.com (Carlos Ríos Vera)
Date: Sat, 29 Aug 2009 13:03:55 -0400
Subject: [Biopython] Fwd: PBD SuperImpose
In-Reply-To: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com>
References: <3715adb70908290617g3d8792d9we033c1a8dfec35fc@mail.gmail.com>
Message-ID: 

2009/8/29 Rodrigo faccioli
> Please,
>
> Could you tell us which PDB IDs or PDB files you are using?
trimero_r308A_600_ps2.pdb
wt_600ps_trim.pdb

These are the PDBs that I'm using.

> Cheers,
>
> --
> Rodrigo Antonio Faccioli
> Ph.D Student in Electrical Engineering
> University of Sao Paulo - USP
> Engineering School of Sao Carlos - EESC
> Department of Electrical Engineering - SEL
> Intelligent System in Structure Bioinformatics
> http://laips.sel.eesc.usp.br
> Phone: 55 (16) 3373-9366 Ext 229
> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
>
> 2009/8/28 Carlos Ríos Vera
>
>> [...]

--
http://crosvera.blogspot.com

Carlos Ríos V.
Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile

Linux user number 425502

From eric.talevich at gmail.com Sat Aug 29 17:12:50 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 29 Aug 2009 13:12:50 -0400
Subject: [Biopython] PBD SuperImpose
Message-ID: <3f6baf360908291012y50456018pfcb58f9ab6d8463f@mail.gmail.com>

> > 2009/8/28 Carlos Ríos Vera :
> > [...]

Peter :
> > [...]
> > Peter

If you're trying to align the structures of two different proteins, or two
different structures of the same protein, you might want to try MultiProt:
http://bioinfo3d.cs.tau.ac.il/MultiProt/

It can handle more than two proteins at a time, too. (If that's overkill,
then +1 for Peter's approach.)

Eric

From jjkk73 at gmail.com Sat Aug 29 19:10:41 2009
From: jjkk73 at gmail.com (jorma kala)
Date: Sat, 29 Aug 2009 21:10:41 +0200
Subject: [Biopython] How to extract start and end positions of a sequence in blast output file
Message-ID: 

Hi,

I'm using Blast through the Biopython module. Is it possible to retrieve
the start and end positions on the genome of an aligned sequence from a
blast record object? (I've been looking at the Biopython tutorial, section
'the Blast record class', but haven't been able to find it.)

Thank you very much

From biopython at maubp.freeserve.co.uk Sun Aug 30 11:29:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 30 Aug 2009 12:29:08 +0100
Subject: [Biopython] How to extract start and end positions of a sequence in blast output file
In-Reply-To: 
References: 
Message-ID: <320fb6e00908300429p4cd58f9dj1606df1f1242ed7b@mail.gmail.com>

On Sat, Aug 29, 2009 at 8:10 PM, jorma kala wrote:
> Hi,
> I'm using Blast through the Biopython module.
> Is it possible to retrieve start and end positions on the genome of an
> aligned sequence from a blast record object?

Yes - see below.

> (I've been looking at the Biopython tutorial, section 'the Blast record
> class', but haven't been able to find it.)
> Thank you very much

Have you tried using the built-in help to find out more about the HSP
object? e.g.

>>> from Bio.Blast import NCBIXML
>>> record = NCBIXML.read(open("xbt003.xml"))
>>> help(record.alignments[0].hsps[0])
...

Or, have you come across the Python function dir()?
This gives a listing of all the properties and methods of an object
(although those starting with an underscore are special or private and
should usually be ignored). e.g.

>>> from Bio.Blast import NCBIXML
>>> record = NCBIXML.read(open("xbt003.xml"))
>>> dir(record.alignments[0].hsps[0])
['__doc__', '__init__', '__module__', '__str__', 'align_length', 'bits',
'expect', 'frame', 'gaps', 'identities', 'match', 'num_alignments',
'positives', 'query', 'query_end', 'query_start', 'sbjct', 'sbjct_end',
'sbjct_start', 'score', 'strand']

The help text tells you this, but you could also guess from using dir -
sbjct_start and sbjct_end are what you want (the start/end of the subject
sequence, i.e. the database match), while query_start and query_end are
those for your query sequence.

Peter

From chapmanb at 50mail.com Mon Aug 31 13:29:31 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 31 Aug 2009 09:29:31 -0400
Subject: [Biopython] Wanting to teach a class on biopython ..
In-Reply-To: 
References: <4A93A7C4.1060109@rubor.de>
Message-ID: <20090831132931.GE75451@sobchak.mgh.harvard.edu>

Hi Tim;

I'm not aware of Python code to do BioBricks design, but a couple of
repositories for synthetic biology code in Python are:

SynBioPython: http://code.google.com/p/synbiopython/
My Synthetic Biology code: http://bitbucket.org/chapmanb/synbio/

Unfortunately these are more starting points than tutorial-ready code.

Hope this helps,
Brad

> I am very interested in common bio python tasks as they relate specifically
> to synthetic biology. Could you give me some examples of such tasks?
> -Tim
>
> On Tue, Aug 25, 2009 at 10:06 AM, William Heath wrote:
>
> > This is amazing thanks!
> > -Tim
> >
> > On Tue, Aug 25, 2009 at 1:58 AM, Kristian Rother wrote:
> >
> >> Hi William,
> >>
> >> I was teaching Python/BioPython in several courses for biologists. I wrote
> >> some individual lesson plans but they are rather not readable for other
> >> people (see attachment).
> >> There is some material on-line, though:
> >>
> >> http://www.rubor.de/lehre_en.html
> >>
> >> Typically, the lessons consisted of 2h lecture + 1h exercises on language
> >> concepts + 3h exercises on a single, more biological task. The code written
> >> during the latter was reviewed and scored, and the students knew about that.
> >> They had a two-week Python crash course before.
> >>
> >> Details on request.
> >>
> >> Best Regards,
> >> Kristian
> >>
> >> William Heath wrote:
> >>
> >>> Hi All,
> >>> I am a member of Tech Shop in Mountain View, CA and I want to teach a
> >>> class on biopython that is specifically tailored toward the goals of
> >>> synthetic biology. Can anyone help me come up with a lesson plan for
> >>> such a class? In particular I want to use BioBricks, and good
> >>> open-source design programs for BioBricks. Can anyone recommend any?
> >>>
> >>> I also want to utilize any/all concepts in this training:
> >>>
> >>> http://www.edge.org/documents/archive/edge296.html
> >>>
> >>> Please let me know your ideas on such a lesson plan.
> >>>
> >>> -Tim
> >>> _______________________________________________
> >>> Biopython mailing list - Biopython at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From wgheath at gmail.com Mon Aug 31 17:07:59 2009
From: wgheath at gmail.com (William Heath)
Date: Mon, 31 Aug 2009 10:07:59 -0700
Subject: [Biopython] Wanting to teach a class on biopython ..
In-Reply-To: <20090831132931.GE75451@sobchak.mgh.harvard.edu>
References: <4A93A7C4.1060109@rubor.de> <20090831132931.GE75451@sobchak.mgh.harvard.edu>
Message-ID: 

Sounds good thanks!
-Tim

On Mon, Aug 31, 2009 at 6:29 AM, Brad Chapman wrote:
> Hi Tim;
> I'm not aware of Python code to do BioBricks design, but a couple of
> repositories for synthetic biology code in Python are:
>
> SynBioPython: http://code.google.com/p/synbiopython/
> My Synthetic Biology code: http://bitbucket.org/chapmanb/synbio/
>
> Unfortunately these are more starting points than tutorial-ready
> code.
>
> Hope this helps,
> Brad
>
> [...]

From xiaoa at mail.rockefeller.edu Mon Aug 31 23:45:55 2009
From: xiaoa at mail.rockefeller.edu (xiaoa)
Date: Mon, 31 Aug 2009 19:45:55 -0400
Subject: [Biopython] IDLE problem
Message-ID: <4A9C60B3.4040605@rockefeller.edu>

Hi,

I am new to Python and Biopython. I ran into a problem when using
Entrez.esearch and Entrez.efetch. My script worked fine when I used the
Python 2.6.2 command line (console), but it returned an empty line when I
ran it in IDLE. IDLE itself seems to be working, because I tested it with
1. another Python script (no Entrez modules) and 2. even Entrez.einfo --
both worked fine.

I am using Windows Vista 64-bit, Biopython 1.51 and Python 2.6.2.

Thanks in advance,
Andrew
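[Editor's note on the "PBD SuperImpose" thread above: Peter's advice was to
pair up only the residues the two chains share, then superimpose on those
paired atoms. The sketch below illustrates that idea in plain Python. It is
not Bio.PDB code: the (residue_code, atom) tuples and the paired_ca_atoms()
helper are invented for the example, and difflib.SequenceMatcher stands in
for a real pairwise sequence alignment such as Bio.pairwise2.]

```python
# Sketch: build two EQUAL-length atom lists from chains of different
# lengths, by keeping only the residues common to both sequences.
# Residues are modelled here as (one_letter_code, atom) tuples; with
# Bio.PDB you would iterate over chain.get_residues() and take res["CA"].
from difflib import SequenceMatcher

def paired_ca_atoms(chain1, chain2):
    """Return equal-length (fixed, moving) atom lists for shared residues."""
    seq1 = "".join(code for code, atom in chain1)
    seq2 = "".join(code for code, atom in chain2)
    fixed, moving = [], []
    # SequenceMatcher finds the matching residue blocks; a real pairwise
    # alignment would be more robust for diverged sequences.
    for block in SequenceMatcher(a=seq1, b=seq2).get_matching_blocks():
        for k in range(block.size):
            fixed.append(chain1[block.a + k][1])
            moving.append(chain2[block.b + k][1])
    return fixed, moving

# Peter's example: GHIL versus EGHILD -- the extra leading E and trailing D
# on the longer chain are simply ignored.
chain_a = [(c, "atomA%d" % i) for i, c in enumerate("GHIL")]
chain_b = [(c, "atomB%d" % i) for i, c in enumerate("EGHILD")]

fixed, moving = paired_ca_atoms(chain_a, chain_b)
assert len(fixed) == len(moving) == 4  # equal sizes, as set_atoms() requires
```

With real Bio.PDB Atom objects in fixed and moving, these lists could then be
passed to Superimposer().set_atoms(fixed, moving), avoiding the "Fixed and
moving atom lists differ in size" exception.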