From jgrant at email.smith.edu Fri Jan 2 13:01:45 2009 From: jgrant at email.smith.edu (Jessica Grant) Date: Fri, 2 Jan 2009 13:01:45 -0500 Subject: [BioPython] help with NCBIWWW.qblast Message-ID: I wrote a script that uses NCBIWWW.qblast and it worked last time I tried it (a few weeks ago) but this morning I get the following error message: File "tblastn.py", line 41, in tblastn result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data) File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", line 770, in qblast File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", line 837, in _parse_qblast_ref_page ValueError: A non-integer RTOE found in the 'please wait' page, '' Does this sound like an ncbi error or like something I should be able to work around? Thanks for the help! -- Jessica Grant Phone: 413-585-3750 Fax: 413-585-3786 jgrant at email.smith.edu http://www.science.smith.edu/departments/Biology/lkatz/people/jgrant From biopython at maubp.freeserve.co.uk Fri Jan 2 13:38:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jan 2009 18:38:07 +0000 Subject: [BioPython] help with NCBIWWW.qblast In-Reply-To: References: Message-ID: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com> On Fri, Jan 2, 2009 at 6:01 PM, Jessica Grant wrote: > I wrote a script that uses NCBIWWW.qblast and it worked last time I tried it > (a few weeks ago) but this morning I get the following error message: > > File "tblastn.py", line 41, in tblastn > result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data) > File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", > line 770, in qblast > File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", > line 837, in _parse_qblast_ref_page > ValueError: A non-integer RTOE found in the 'please wait' page, '' It looks like an empty string was found for RTOE (which obviously cannot be turned into an integer). 
[As an aside, there was a trivial error in the error processing in Bio/Blast/NCBIWWW.py as it should have said "No RTOE found in the 'please wait' page." instead.] This suggests the NCBI sent back an error page of some kind (or even an empty page if there was some network problem), instead of the normal "please wait" page. Unfortunately, these are not so easy to deal with automatically. Does this happen repeatedly? It would help if you could include the sequence you are trying - then we can attempt to reproduce the error for ourselves. I've tried a tblastn search from my machine and it worked OK. Also, what version of Biopython are you using (I'd guess Biopython 1.49 beta, or 1.49, based on the error message)? > Does this sound like an ncbi error or like something I should be able to > work around? Thanks for the help! Given you say this script used to work, it could be something the NCBI has changed. Have you updated your installation of Biopython since the time the script worked? Thanks, Peter From biopython at maubp.freeserve.co.uk Fri Jan 2 18:04:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jan 2009 23:04:43 +0000 Subject: [BioPython] help with NCBIWWW.qblast In-Reply-To: References: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com> Message-ID: <320fb6e00901021504u737bb09dx3242ef9d2c5f767e@mail.gmail.com> On Fri, Jan 2, 2009 at 7:50 PM, Jessica Grant wrote: > > Thanks for your response...actually I had failed to save my fasta file with > unix line breaks. This causes me so many problems and I guess over vacation > I forgot that I need to deal with this. > > All is working now! > > Thanks! > > Jessica That's interesting - but I'd be surprised if the NCBI can't cope with mixed new lines in a qblast query (after all, their webserver will get queries from Linux, Windows and Mac machines). Was there a problem stemming from parsing the FASTA file? How are you reading in the FASTA file? 
Have you tried opening the handle in universal read lines mode? e.g.

handle = open("example.fasta","rU")

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 3 07:59:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 3 Jan 2009 12:59:34 +0000
Subject: [BioPython] help with NCBIWWW.qblast
In-Reply-To:
References: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com>
 <320fb6e00901021504u737bb09dx3242ef9d2c5f767e@mail.gmail.com>
Message-ID: <320fb6e00901030459u4212b87aw2094b98f68bd0f1c@mail.gmail.com>

On 1/3/09, Jessica Grant wrote:
>
> I have never seen that ("rU") before. I will give it a try. Thanks!
>
> Jessica
>

I meant to type "universal new lines mode", but anyway it's been present since at least Python 2.3 and can be very helpful in this situation - see: http://docs.python.org/library/functions.html

I was hoping you could tell me the exact string you used as your tblastn query, because it could be useful to be able to reproduce this kind of error.

I hope you don't mind me sharing some comments on your original snippet of code, which looked like this:

result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data)

I'm assuming that the variable fas is a SeqRecord, thus fas.seq is its Seq object. Using a Seq object's data property to get the sequence as a plain string is discouraged (it hasn't been in the tutorial for some time), as the Seq object now behaves much more like a string itself. You could have just used fas.seq here.

This brings me to my next point: the NCBI qblast interface will take three kinds of queries, (1) a record identifier like a GI number, (2) a sequence, or (3) a FASTA format string. Supplying just the sequence (as in your code) means that BLAST will assign an identifier for your sequence automatically.
You might prefer to use the SeqRecord object's format method to make a fasta string (which will include the existing identifier - and that should then be present in the BLAST results): result_handle = NCBIWWW.qblast("tblastn", "nr", fas.format("fasta")) This information is in the version of the Biopython Tutorial, but I thought it worth bringing it up here too. The format method used here was added to the SeqRecord (and Alignment) objects in Biopython 1.48. Peter From biopython at maubp.freeserve.co.uk Sat Jan 3 08:42:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jan 2009 13:42:10 +0000 Subject: [BioPython] Deprecating Bio.Transcribe and Bio.Translate? Message-ID: <320fb6e00901030542g6292084bl98a10587e6dd9990@mail.gmail.com> Dear all, For some time now, the Bio.Seq module has provided transcription and basic translation functionality. With Biopython 1.49, this was extended further with the addition of Seq object transcription and translation methods (for use on nucleotide sequences only), and also support for translation up to the first stop codon. Using Bio.Seq is now the recommended and preferred way to do transcription or translation with Biopython. The Bio.Transcribe and Bio.Translate modules were declared obsolete with the release of Biopython 1.49, and I am wondering if anyone objects to our officially deprecating them in Biopython 1.50 (i.e. just adding a warning message when the module is imported). Alternatively, we could do this in the release after that. 
Thanks, Peter From biopython at maubp.freeserve.co.uk Sat Jan 3 14:59:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jan 2009 19:59:04 +0000 Subject: [BioPython] help for local alignment In-Reply-To: <4c2163890901030837y5b12a358gd2f87aa6989b5a75@mail.gmail.com> References: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com> <320fb6e00812220947xd9444ffp636c13c684fda2c4@mail.gmail.com> <4c2163890812222328v4d9369cq6b4d7ad749365e9b@mail.gmail.com> <320fb6e00812230320u62d915b4k7b2d334b241e8f97@mail.gmail.com> <4c2163890812310100n4100bce5sc9da85b4df391016@mail.gmail.com> <320fb6e00812310635n1685798ai42c8dc07c1c3bf45@mail.gmail.com> <4c2163890812310710o212c9603mc96f7720328baaa6@mail.gmail.com> <4c2163890901012011o47a4bef1l19d951ae40b84aaf@mail.gmail.com> <320fb6e00901020522q51819fbt28c29ae9333d4831@mail.gmail.com> <4c2163890901030837y5b12a358gd2f87aa6989b5a75@mail.gmail.com> Message-ID: <320fb6e00901031159g5554ab29i837ed466ab28cfbc@mail.gmail.com> Hi Chen, I've copied the Biopython mailing list on my reply as I think this is of general interest. On 1/3/09, Chen Ku wrote: > Dear Peter, > it will be a great help of you if you can send me > the exact code for this problem using Bio package. I think as you are expert > in this you can write me and I think it will be few line code. > > My Problem: Given two protein sequence I have to perform Global alignment > using Blosum 62 scoring scheme. > > v = NGPSTKDFGKISESREFDNQNGPSTKDFGKISESREFDNQ > w = QNQLERSFGKINMRLEDALVQNQLERSFGKINMRLEDALV > Scoring matrix: BLOSUM62 > Gap open: 10.0 > Gap extended: 0.5 > > The answer I did manually will come 20 using BLOSUM matrix. > .... > > Regards > Chen I'd already told Chen that Biopython provides the BLOSUM matrices in Bio.SubsMat.MatrixInfo (as simple dictionaries) and pointed him at the Bio.pairwise2 for doing pairwise alignments. The information is all there in the pairwise2 docstrings, but perhaps it could be clearer. 
There are lots of global alignment functions named globalXX in Bio.pairwise2.align, where the two letter code "XX" tells you the type of parameters for matches (and mismatches), and the parameters for gap penalties. In this case we want to use the function globalds because we want to use the BLOSUM62 matrix which we have as a dictionary (d for dictionary), and the two sequences have the same gap parameters (s for same).

from Bio import pairwise2
from Bio.SubsMat.MatrixInfo import blosum62

v = "NGPSTKDFGKISESREFDNQNGPSTKDFGKISESREFDNQ"
w = "QNQLERSFGKINMRLEDALVQNQLERSFGKINMRLEDALV"
open_penalty = -10
extend_penalty = -0.5

alignments = pairwise2.align.globalds(v, w, blosum62, open_penalty, extend_penalty)
for align1, align2, score, begin, end in alignments:
    print(pairwise2.format_alignment(align1, align2, score, begin, end))

This gives me two alignments back, both scoring twenty - as you had calculated by hand Chen. But do double check this is doing what you expected!

Peter

From sudhir.cr at gmail.com Sun Jan 4 00:23:25 2009
From: sudhir.cr at gmail.com (sudhir cr)
Date: Sun, 4 Jan 2009 10:53:25 +0530
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To: <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
 <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>
Message-ID:

Hi Peter,

The new KEGG format has changed to "Other DBs" from "DBLINKS" only on the html page but not when we download from KEGG FTP. So, I guess it's not yet needed to log a bug.

Thanks for your help,
Sudhir

On Wed, Dec 31, 2008 at 8:47 PM, Peter wrote:
> On Wed, Dec 31, 2008 at 3:07 PM, sudhir cr wrote:
> > Hello Peter,
> >
> > Thanks for the quick reply. This code is working great.
>
> Great.
>
> > P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS"
>
> Do you have a link for this? If we need to update our parser could
> you file a bug on Bugzilla please?
http://bugzilla.open-bio.org/ > > Thanks, > > Peter > -- Sudhir Chowbina Bioinformatics Graduate Student & Research Assistant Discovery Informatics and Computing Laboratory Indiana University School of Informatics Indianapolis, USA 317-847 7721 schowbin at iupui.edu From sedaalper at yahoo.com Mon Jan 5 07:49:44 2009 From: sedaalper at yahoo.com (Seda Alper) Date: Mon, 5 Jan 2009 04:49:44 -0800 (PST) Subject: [BioPython] do_alignment Message-ID: <544161.80500.qm@web90603.mail.mud.yahoo.com> Hi! I executed the code below about Clustalw . However it doesn't work. import os from Bio.Clustalw import MultipleAlignCL cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta")) cline.set_output("test.aln") print cline from Bio import Clustalw alignment = Clustalw.do_alignment(cline) The error like that >>> clustalw -INFILE=.\opuntia.fasta -OUTFILE=test.aln Traceback (most recent call last): File "C:\Python25\ders\se.py", line 10, in alignment = Clustalw.do_alignment(cline) File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment shell=(sys.platform!="win32") File "C:\Python25\lib\subprocess.py", line 594, in __init__ errread, errwrite) File "C:\Python25\lib\subprocess.py", line 816, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified What to do? Thanks Seda From biopython at maubp.freeserve.co.uk Mon Jan 5 08:09:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jan 2009 13:09:22 +0000 Subject: [BioPython] do_alignment In-Reply-To: <544161.80500.qm@web90603.mail.mud.yahoo.com> References: <544161.80500.qm@web90603.mail.mud.yahoo.com> Message-ID: <320fb6e00901050509g75ca62a5sb532a375d6a2543b@mail.gmail.com> On Mon, Jan 5, 2009 at 12:49 PM, Seda Alper wrote: > Hi! > > I executed the code below about Clustalw . However it doesn't work. 
> > import os > from Bio.Clustalw import MultipleAlignCL > > cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta")) > cline.set_output("test.aln") > print cline > > from Bio import Clustalw > > alignment = Clustalw.do_alignment(cline) > > The error like that >>>> > clustalw -INFILE=.\opuntia.fasta -OUTFILE=test.aln > > Traceback (most recent call last): > File "C:\Python25\ders\se.py", line 10, in > alignment = Clustalw.do_alignment(cline) > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment > shell=(sys.platform!="win32") > File "C:\Python25\lib\subprocess.py", line 594, in __init__ > errread, errwrite) > File "C:\Python25\lib\subprocess.py", line 816, in _execute_child > startupinfo) > WindowsError: [Error 2] The system cannot find the file specified > > What to do? > > Thanks > Seda I'm not at my Windows machine to double check this, but I suspect you don't have clustalw on your path. If you don't have clustalw on your path, you'll have to tell Biopython where it is: clustalw_exe = r"C:\Program Files\...\clustalw.exe" assert os.path.isfile(clustalw_exe) cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta"), clustalw_exe) ... Peter From biopython at maubp.freeserve.co.uk Tue Jan 6 05:33:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jan 2009 10:33:16 +0000 Subject: [BioPython] do_alignment In-Reply-To: <127996.50517.qm@web90605.mail.mud.yahoo.com> References: <320fb6e00901050509g75ca62a5sb532a375d6a2543b@mail.gmail.com> <127996.50517.qm@web90605.mail.mud.yahoo.com> Message-ID: <320fb6e00901060233k62abf10eyd715bc2b3b52b8c1@mail.gmail.com> On Tue, Jan 6, 2009 at 9:58 AM, Seda Alper wrote: > > Hi Peter, > > I applied what you do. 
However, now the error looks like this:
>
> import os
> from Bio.Clustalw import MultipleAlignCL
>
> clustalw_exe = r"C:\Python\Biopython-1.49\clustalw_exe"

The above is wrong - the filename should end with ".exe", so it might be this:

clustalw_exe = r"C:\Python\Biopython-1.49\clustalw.exe"

(assuming you really do have the ClustalW executable in your Biopython directory)

> assert os.path.isfile(clustalw_exe)
> cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta"),clustalw_exe)
> ...
> Traceback (most recent call last):
>   File "C:\Python25\ders\se.py", line 5, in <module>
>     assert os.path.isfile(clustalw_exe)
> AssertionError

The assertion failed because the ClustalW file name you gave does not exist.

Peter

From lueck at ipk-gatersleben.de Thu Jan 8 10:07:06 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 8 Jan 2009 16:07:06 +0100
Subject: [BioPython] blastall directory problem
Message-ID: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de>

Hi!

I finally finished my program and tested it on several PCs. I'm doing standalone blasts, it's a GUI program for Windows.

While doing this I found a problem on English operating systems, because blastall has problems with spaces in the full path.

I tried to replace the path "C:\Program Files\Final\test" with "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file. Maybe there is a way to give only the file name, rather than the full path, for the database and blastall.exe?

Does anyone have an idea how I can solve this problem?

Kind regards and a happy new year!
Stefanie From biopython at maubp.freeserve.co.uk Fri Jan 9 07:16:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jan 2009 12:16:32 +0000 Subject: [BioPython] blastall directory problem In-Reply-To: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com> On Thu, Jan 8, 2009 at 3:07 PM, Stefanie L?ck wrote: > Hi! > > I finally finished my program and tested on several PC's. > I'm doing standalone blasts, it's a GUI program for Windows. > > By this I found a problem on the english operating systems because blastall has problems with spaces in the full path. > > I tried to replace the path "C:\Program Files\Final\test" into "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file. Maby there is a way not to give the full path of the database and the blastall.exe but only the file name? > > Does someone has a idea how I can solve this problem? > > Kind regards and a happy new year! > Stefanie The problem here is that BLAST allows you to specify multiple databases using spaces to separate the list - while you may also want to use spaces in the filename(s)! The solution is that BLAST should understand some fiddly escape quoted string arguments, i.e. slash double quote at the beginning and end of the filename. Try this (using a mixture of single and double quotes!): my_blast_db =r'"\"C:\Program Files\Final\test\""' (There was some discussion on this issue on Bug 2480). You should also be able to use the DOS 8.3 style names (which have no spaces), something like "C:\PROGRA~1\Final\test" which you said you tried. Read about the win32api.GetShortPathName() function for how to get this name programatically. From your email it sounds like this worked - but you had another problem with specifying your input filename. 
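The escaped-quote form above can be built with a small helper rather than typed by hand. This is only a sketch: quote_blast_path is a hypothetical name, not a Biopython function, and whether blastall accepts the result may depend on the BLAST version and on how the command line is launched.

```python
def quote_blast_path(path):
    """Wrap a path in double quotes plus escaped double quotes.

    Hypothetical helper: it only builds the string, it does not check
    that the path exists or that BLAST will accept it.
    """
    return '"\\"' + path + '\\""'


# Example path taken from the message above.
db_path = r"C:\Program Files\Final\test"
print(quote_blast_path(db_path))
```

The printed string should match the hand-written my_blast_db value given earlier in this message.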
If you are still stuck, could you give us an example showing the arguments used for the blast exe, database and input file? Peter From mmokrejs at ribosome.natur.cuni.cz Fri Jan 9 09:24:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 09 Jan 2009 15:24:02 +0100 Subject: [BioPython] Does biopython have a parser for .qual files? Message-ID: <49675E02.5050102@ribosome.natur.cuni.cz> Hi, is there a way in biopython to access the quality values from NCBI trace archive? I had a look briefly into http://biopython.org/DIST/docs/api/ but cannot find anything related. NCBItrace provides some perl script (maybe I could the same with Bio.Entrez.esearch (haven't tried yet) ... I will need to revert the order of values to get them for minus strand orientation. If nobody needed to do this before I will invent the wheel. ;) Thanks for your comments, Martin $ perl NCBItrace/query_tracedb "retrieve quality 5728631" >gnl|ti|5728631 name:jea17d09.b1 7 7 7 7 7 7 10 9 8 6 6 9 9 9 7 10 9 13 13 10 19 8 6 6 13 8 4 4 4 6 13 13 6 6 9 9 10 16 19 19 19 4 0 4 13 19 19 32 32 25 19 15 6 6 9 19 19 22 25 25 25 25 22 22 22 29 29 27 22 16 10 19 16 15 6 6 6 6 6 6 8 19 23 33 39 34 34 34 34 39 39 39 39 39 39 39 39 40 28 19 11 9 15 11 28 37 40 45 35 35 35 35 39 39 51 51 39 39 39 39 39 39 35 35 32 33 33 33 32 32 40 40 56 56 40 32 32 32 32 34 35 51 51 51 51 35 34 34 34 35 35 39 40 40 51 51 51 51 51 51 51 45 40 40 40 40 40 40 51 45 45 45 45 51 51 56 56 51 51 51 51 51 40 45 45 45 45 45 56 56 56 56 56 56 56 56 40 28 28 23 23 23 25 29 35 38 38 38 38 38 38 38 38 40 51 51 51 51 56 56 [cut] From biopython at maubp.freeserve.co.uk Fri Jan 9 09:31:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jan 2009 14:31:30 +0000 Subject: [BioPython] do_alignment In-Reply-To: <328891.8790.qm@web90608.mail.mud.yahoo.com> References: <320fb6e00901060428v59cb5d5ek8521b49bdff55da2@mail.gmail.com> <328891.8790.qm@web90608.mail.mud.yahoo.com> Message-ID: 
<320fb6e00901090631s342f96dcgd24bbae07e19cec3@mail.gmail.com> On Fri, Jan 9, 2009 at 2:22 PM, Seda Alper wrote: > > Dear Peter, > > I've executed my code at the end! I only changed the file name( from > opuntia.fasta to my file mouse.fasta). I think the problem may be > resulted from the file opuntia. Now, everything works. > > Thanks for your help! > Seda Oh good, I was going to say double check that opuntia.fasta was in the current directory. Your other message where you got the error "ValueError: No records found in handle" is usually caused by an empty output alignment file. Perhaps an aborted ClustalW run had left behind an empty output file? Peter From biopython at maubp.freeserve.co.uk Fri Jan 9 10:01:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jan 2009 15:01:24 +0000 Subject: [BioPython] Does biopython have a parser for .qual files? In-Reply-To: <49675E02.5050102@ribosome.natur.cuni.cz> References: <49675E02.5050102@ribosome.natur.cuni.cz> Message-ID: <320fb6e00901090701n5a85bb17lb1769fa1d55d3a88@mail.gmail.com> On Fri, Jan 9, 2009 at 2:24 PM, Martin MOKREJ? wrote: > Hi, > is there a way in biopython to access the quality values from > NCBI trace archive? I had a look briefly into > http://biopython.org/DIST/docs/api/ but cannot find anything > related. NCBItrace provides some perl script (maybe I could the > same with Bio.Entrez.esearch (haven't tried yet) ... I will need > to revert the order of values to get them for minus strand > orientation. If nobody needed to do this before I will invent > the wheel. ;) > Thanks for your comments, > Martin In the short term, I'm sure a quick parser shouldn't take you more than five minutes to implement (based on any of the FASTA parsers), giving you record names with lists of integer scores. The trouble for integrating this into Biopython nicely is how to represent the data. 
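As a rough illustration of the quick parser mentioned above, here is a minimal sketch in plain Python. It assumes the .qual layout shown in Martin's example (a ">" header line followed by whitespace-separated integer scores, possibly spread over several lines). The name parse_qual is hypothetical and this is not part of Biopython.

```python
from io import StringIO


def parse_qual(handle):
    """Yield (name, scores) pairs from a FASTA-like .qual handle.

    Assumes each record starts with a '>' header line and that every
    other non-blank line holds whitespace-separated integer scores.
    """
    name, scores = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, scores
            name, scores = line[1:], []
        else:
            scores.extend(int(value) for value in line.split())
    if name is not None:
        yield name, scores


# Small made-up example in the same layout as the query_tracedb output.
example = StringIO(">gnl|ti|5728631 name:jea17d09.b1\n7 7 7 10\n9 8\n")
for record_name, record_scores in parse_qual(example):
    print(record_name, record_scores)
```

For a read on the minus strand, the scores could then be reversed with record_scores[::-1].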
Have a look at Bug 2382 for some related ideas (including other FASTA-like formats), and this thread just over a year ago:
http://lists.open-bio.org/pipermail/biopython-dev/2007-October/003131.html
http://bugzilla.open-bio.org/show_bug.cgi?id=2382

I can see these qual files (and also fastq files which have both the sequence and the quality scores) fitting into Bio.SeqIO but this would require an elegant way to deal with unknown sequences of known length (see next paragraph), and a good way to handle per-letter-annotation (which we have touched on on the mailing lists fairly recently).

For this reason, I had wondered about creating an UnknownSeq as a subclass of Seq. To create an instance you would supply the length and a character to use (typically N or X for nucleotides and proteins respectively, perhaps defaulting to ?). This would then act like a Seq object as much as possible (for example, translation of an UnknownSeq with a nucleotide alphabet could give an UnknownSeq with a protein alphabet with appropriate length). An UnknownSeq object could be used for these qual files, or even certain GenBank files (where the sequence is not always included). There is a risk of user confusion here though, as there isn't really a sequence present!

Peter

From bjorn_johansson at bio.uminho.pt Mon Jan 12 05:58:16 2009
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Mon, 12 Jan 2009 10:58:16 +0000
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
Message-ID:

Hi, I am fairly new to biopython, so I don't know if this question has been answered in the archives (tried to look but found nothing).

Is there a (bio)python module or code snippet that I can use to determine if a sequence is likely to be nucleic acid or protein?

I believe the program ReadSeq does this for example, when formatting a fasta sequence to genbank.

Grateful for answers!
/bjorn

--
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariat) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From biopython at maubp.freeserve.co.uk Mon Jan 12 06:30:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jan 2009 11:30:26 +0000
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To:
References:
Message-ID: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>

On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson wrote:
> Hi, I am fairly new to biopython, so I don't know if this question has
> been answered in the archives (tried to look but found nothing).
>
> Is there a (bio)python module or code snippet that I can use to
> determine if a sequence is likely to be nucleic acid or protein?
>
> I believe the program ReadSeq does this for example, when formatting a
> fasta sequence to genbank.
>
> Grateful for answers!
>
> /bjorn

It seems like lots of different tools (e.g. FASTA) have come up with their own way to try and guess this, usually by looking at the letter content. This is impossible to get right 100% of the time (especially if the nucleotide includes ambiguous characters - which can make it look more protein like). I don't think we have a standard bit of code in Biopython to do this (but I've never searched).

In python there is a general preference for making things explicit rather than trying to guess and do the right thing. If you don't know which you have (e.g. user input?) then you are in an awkward position. What are you going to do with the sequence? If you are going to pass it to a command line tool, maybe you can let it guess?
Peter From lueck at ipk-gatersleben.de Mon Jan 12 07:06:39 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 12 Jan 2009 13:06:39 +0100 Subject: [BioPython] blastall directory problem References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com> Message-ID: <001a01c974ae$39b0a820$1022a8c0@ipkgatersleben.de> Thanks for the response! This worked! Lifesaver ;-) Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, January 09, 2009 1:16 PM Subject: Re: [BioPython] blastall directory problem On Thu, Jan 8, 2009 at 3:07 PM, Stefanie L?ck wrote: > Hi! > > I finally finished my program and tested on several PC's. > I'm doing standalone blasts, it's a GUI program for Windows. > > By this I found a problem on the english operating systems because > blastall has problems with spaces in the full path. > > I tried to replace the path "C:\Program Files\Final\test" into > "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open > the input file. Maby there is a way not to give the full path of the > database and the blastall.exe but only the file name? > > Does someone has a idea how I can solve this problem? > > Kind regards and a happy new year! > Stefanie The problem here is that BLAST allows you to specify multiple databases using spaces to separate the list - while you may also want to use spaces in the filename(s)! The solution is that BLAST should understand some fiddly escape quoted string arguments, i.e. slash double quote at the beginning and end of the filename. Try this (using a mixture of single and double quotes!): my_blast_db =r'"\"C:\Program Files\Final\test\""' (There was some discussion on this issue on Bug 2480). You should also be able to use the DOS 8.3 style names (which have no spaces), something like "C:\PROGRA~1\Final\test" which you said you tried. 
Read about the win32api.GetShortPathName() function for how to get this name programmatically. From your email it sounds like this worked - but you had another problem with specifying your input filename.

If you are still stuck, could you give us an example showing the arguments used for the blast exe, database and input file?

Peter

From chapmanb at 50mail.com Mon Jan 12 08:47:37 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 12 Jan 2009 08:47:37 -0500
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
Message-ID: <20090112134737.GG4135@sobchak.mgh.harvard.edu>

Hi Björn:
I agree with Peter; guessing should be the last resort. The guessing is not that smart, and will fall apart for very pathological cases like short amino acids with lots of Gly, Ala, Cys or Thrs. That being said, here is some code that does this. Hope this helps,

Brad

from Bio import Seq

def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
    """Guess if the given sequence is DNA.

    It's considered DNA if more than 90% of the sequence is GATCs. The threshold
    is configurable via the thresh parameter. dna_letters can be used to configure
    which letters are considered DNA; for instance, adding N might be useful if
    you are expecting data with ambiguous bases.
    """
    if isinstance(seq, Seq.Seq):
        seq = seq.data
    elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
        seq = str(seq)
    else:
        raise ValueError("Do not know provided type: %s" % seq)
    seq = seq.upper()
    dna_alpha_count = 0
    for letter in dna_letters:
        dna_alpha_count += seq.count(letter)
    if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
        return True
    else:
        return False

On Mon, Jan 12, 2009 at 11:30:26AM +0000, Peter wrote:
> On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson
> wrote:
> > Hi, I am fairly new to biopython, so I don't know if this question has
> > been answered in the archives (tried to look but found nothing).
> >
> > Is there a (bio)python module or code snippet that I can use to
> > determine if a sequence is likely to be nucleic acid or protein?
> >
> > I believe the program ReadSeq does this for example, when formatting a
> > fasta sequence to genbank.
> >
> > Grateful for answers!
> >
> > /bjorn
>
> It seems like lots of different tools (e.g. FASTA) have come up with
> their own way to try and guess this, usually by looking at the letter
> content. This is impossible to get right 100% of the time (especially
> if the nucleotide includes ambiguous characters - which can make it
> look more protein like). I don't think we have a standard bit of code
> in Biopython to do this (but I've never searched).
>
> In python there is a general preference for making things explicit
> rather than trying to guess and do the right thing. If you don't know
> which you have (e.g. user input?) then you are in an awkward position.
> What are you going to do with the sequence? If you are going to pass
> it to a command line tool, maybe you can let it guess?
> > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Jan 12 09:34:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Jan 2009 14:34:13 +0000 Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence In-Reply-To: <20090112134737.GG4135@sobchak.mgh.harvard.edu> References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com> <20090112134737.GG4135@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com> On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman wrote: > Hi Bj?rn: > I am agreed with Peter; guessing should be the last resort. The > guessing is not that smart, and will fall apart for very > pathological cases like short amino acids with lots of Gly, Ala, Cys > or Thrs. That being said, here is some code that does this. Hope > this helps, > > Brad > > from Bio import Seq > > def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']): > """Guess if the given sequence is DNA. > > It's considered DNA if more than 90% of the sequence is GATCs. The threshold > is configurable via the thresh parameter. dna_letters can be used to configure > which letters are considered DNA; for instance, adding N might be useful if > you are expecting data with ambiguous bases. > """ > if isinstance(seq, Seq.Seq): > seq = seq.data > elif isinstance(seq, type("")) or isinstance(seq, type(u"")): > seq = str(seq) > else: > raise ValueError("Do not know provided type: %s" % seq) > seq = seq.upper() This code is trying to get the sequence as an upper case string, given that the Seq object does not support the upper method (yet - I've just filed enhancement Bug 2731 on this, something I'd been thinking about for a while). Anyway, this would be shorter and would cope with strings or Seq objects, or even MutableSeq objects. 

    seq = str(seq).upper()

Also using the Seq object's data property is discouraged (see Bug 2509).

>     dna_alpha_count = 0
>     for letter in dna_letters:
>         dna_alpha_count += seq.count(letter)
>     if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
>         return True
>     else:
>         return False

You could just do:

    return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh)

Peter

From bjorn_johansson at bio.uminho.pt  Mon Jan 12 13:34:36 2009
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Mon, 12 Jan 2009 18:34:36 +0000
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To: <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com>
References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
 <20090112134737.GG4135@sobchak.mgh.harvard.edu>
 <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com>
Message-ID:

Hi,
and thanks for the quick replies and the submitted code! It's very nice
to have the help of such a devoted community!

I am writing a plug-in to deal with reformatting pasted code (DNA or
protein) snippets into the editor (incidentally WikidPad which is
written in python and uses scintilla, open-source
http://wikidpad.sourceforge.net/) and I would like to be able to
format (DNA or protein) code in the selection from raw format to fasta
and genbank.

The identity of the code (DNA or protein) is only needed to feed into
the SeqIO.write method, it demands to know if the sequence is DNA or
protein to write genbank format.

I know I could add a dialog, but I want a function to quickly reformat
sequences, although I agree that guessing is bad from a theoretical
viewpoint.

I'll try the code that you submitted as soon as I can and I'll get back to you!
thanks,
/bjorn

On Mon, Jan 12, 2009 at 14:34, Peter wrote:
> On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman wrote:
>> Hi Björn:
>> I am agreed with Peter; guessing should be the last resort.
>> The guessing is not that smart, and will fall apart for very
>> pathological cases like short amino acids with lots of Gly, Ala, Cys
>> or Thrs. That being said, here is some code that does this. Hope
>> this helps,
>>
>> Brad
>>
>> from Bio import Seq
>>
>> def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
>>     """Guess if the given sequence is DNA.
>>
>>     It's considered DNA if more than 90% of the sequence is GATCs. The threshold
>>     is configurable via the thresh parameter. dna_letters can be used to configure
>>     which letters are considered DNA; for instance, adding N might be useful if
>>     you are expecting data with ambiguous bases.
>>     """
>>     if isinstance(seq, Seq.Seq):
>>         seq = seq.data
>>     elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
>>         seq = str(seq)
>>     else:
>>         raise ValueError("Do not know provided type: %s" % seq)
>>     seq = seq.upper()
>
> This code is trying to get the sequence as an upper case string, given
> that the Seq object does not support the upper method (yet - I've just
> filed enhancement Bug 2731 on this, something I'd been thinking about
> for a while).
>
> Anyway, this would be shorter and would cope with strings or Seq
> objects, or even MutableSeq objects.
>
>     seq = str(seq).upper()
>
> Also using the Seq object's data property is discouraged (see Bug 2509).
>
>>     dna_alpha_count = 0
>>     for letter in dna_letters:
>>         dna_alpha_count += seq.count(letter)
>>     if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
>>         return True
>>     else:
>>         return False
>
> You could just do:
>
>     return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh)
>
> Peter
>

--
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob.
+351-967 147 704
Dept of Biology (secretariat) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From biopython at maubp.freeserve.co.uk  Mon Jan 12 16:57:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jan 2009 21:57:39 +0000
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To:
References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
 <20090112134737.GG4135@sobchak.mgh.harvard.edu>
 <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com>
Message-ID: <320fb6e00901121357k1d9c39a2tec3610af8bf4c448@mail.gmail.com>

On Mon, Jan 12, 2009 at 6:34 PM, Björn Johansson wrote:
>
> Hi,
> and thanks for the quick replies and the submitted code! It's very nice
> to have the help of such a devoted community!
>
> I am writing a plug-in to deal with reformatting pasted code (DNA or
> protein) snippets into the editor (incidentally WikidPad which is
> written in python and uses scintilla, open-source
> http://wikidpad.sourceforge.net/) and I would like to be able to
> format (DNA or protein) code in the selection from raw format to fasta
> and genbank.
>
> The identity of the code (DNA or protein) is only needed to feed into
> the SeqIO.write method, it demands to know if the sequence is DNA or
> protein to write genbank format.

Yes - this is because the GenBank format distinguishes between
nucleotides and proteins, so if you try and output a SeqRecord using a
generic alphabet, we have a problem. We could guess, but from a python
style point of view I think most would agree it is preferable to make
you (the programmer) make the choice explicitly.

As an aside, you might prefer to use the SeqRecord's format method to
get the record as a FASTA or GenBank string - but this calls
Bio.SeqIO.write() internally anyway, so the alphabet problem remains.

> I know I could add a dialog, but I want a function to quickly reformat
> sequences, although I agree that guessing is bad from a theoretical
> viewpoint.

You could have a selection box offering:

(*) Guess (default)
(*) Nucleotide
(*) Amino acids

That way for any borderline cases, the web site user can easily
change this if they need to. Once you know you have nucleotides,
deciding if it is DNA or RNA is pretty easy :)

> I'll try the code that you submitted as soon as I can and I'll get back to you!
> thanks,
> /bjorn

Peter

From lueck at ipk-gatersleben.de  Tue Jan 13 02:38:43 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 13 Jan 2009 08:38:43 +0100
Subject: [BioPython] blastall directory problem
References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de>
 <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com>
Message-ID: <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de>

I was a little bit too optimistic...

After compilation with py2exe, blast hangs. In the log file of py2exe I get
the following error message:

Traceback (most recent call last):
  File "prim_search.pyc", line 464, in make_xml
  File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall
  File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast
  File "subprocess.pyc", line 586, in __init__
  File "subprocess.pyc", line 681, in _get_handles
  File "subprocess.pyc", line 722, in _make_inheritable
TypeError: an integer is required

Any ideas?
Stefanie

----- Original Message -----
From: "Peter"
To: "Stefanie Lück"
Cc:
Sent: Friday, January 09, 2009 1:16 PM
Subject: Re: [BioPython] blastall directory problem

On Thu, Jan 8, 2009 at 3:07 PM, Stefanie Lück wrote:
> Hi!
>
> I finally finished my program and tested on several PC's.
> I'm doing standalone blasts, it's a GUI program for Windows.
>
> By this I found a problem on the english operating systems because blastall has problems with spaces in the full path.
>
> I tried to replace the path "C:\Program Files\Final\test" into "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file.
> Maybe there is a way not to give the full path of the database and the blastall.exe but only the file name?
>
> Does someone have an idea how I can solve this problem?
>
> Kind regards and a happy new year!
> Stefanie

The problem here is that BLAST allows you to specify multiple
databases using spaces to separate the list - while you may also want
to use spaces in the filename(s)!

The solution is that BLAST should understand some fiddly escape quoted
string arguments, i.e. backslash double quote at the beginning and end
of the filename. Try this (using a mixture of single and double quotes!):

my_blast_db = r'"\"C:\Program Files\Final\test\""'

(There was some discussion on this issue on Bug 2480).

You should also be able to use the DOS 8.3 style names (which have no
spaces), something like "C:\PROGRA~1\Final\test" which you said you
tried. Read about the win32api.GetShortPathName() function for how to
get this name programmatically. From your email it sounds like this
worked - but you had another problem with specifying your input
filename.

If you are still stuck, could you give us an example showing the
arguments used for the blast exe, database and input file?

Peter

From biopython at maubp.freeserve.co.uk  Tue Jan 13 05:41:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Jan 2009 10:41:20 +0000
Subject: [BioPython] blastall directory problem
In-Reply-To: <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de>
References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de>
 <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com>
 <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00901130241t2b116d59j2428415d60e2f177@mail.gmail.com>

On Tue, Jan 13, 2009 at 7:38 AM, Stefanie Lück wrote:
> I was a little bit too optimistic...
>
> After compilation with py2exe, blast hangs.
> In the log file of py2exe
> I get the following error message:
>
> Traceback (most recent call last):
>   File "prim_search.pyc", line 464, in make_xml
>   File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall
>   File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast
>   File "subprocess.pyc", line 586, in __init__
>   File "subprocess.pyc", line 681, in _get_handles
>   File "subprocess.pyc", line 722, in _make_inheritable
> TypeError: an integer is required
>
> Any ideas?
> Stefanie

Are you using Biopython 1.49? What version of Python are you using
here? (Python 2.3 is handled a little differently, as it does not have
the subprocess module).

Can you confirm the exact same code works fine run from Python
directly (via IDLE or the commandline?), but fails via py2exe?

Are you running the py2exe compiled version from the Windows command
line? Can you try that, even though you said it was a GUI program.

This might be related to the following python bug on Windows to do
with pipe redirection, http://bugs.python.org/issue1124861
If so, I think there is a suggested work around we can try (this will
require a change to the Biopython code).

Peter

From animesh.agrawal at anu.edu.au  Thu Jan 15 04:21:11 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Thu, 15 Jan 2009 20:21:11 +1100
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
Message-ID: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au>

Hi,

I have been trying to write a python script to do the codon wise alignment
of given nucleotide sequences. I have downloaded CDS sequences (by a script
found on biopython mailing list) from genbank for a particular protein and
now would like to check codon usage for few specific amino acid positions.
Could you please provide me few pointers on how to do that. I also want to
take this opportunity to thank you guys for excellent work on biopython
documentation.
I am new to python, but I am able to use cookbook/tutorial
example for my work with relative ease.

Cheers,

Animesh Agrawal
PhD Scholar
Proteomics & Therapy Design Group
Division of Molecular Biosciences
The John Curtin School of Medical Research
The Australian National University
P.O. Box 334
Canberra ACT 2601
AUSTRALIA
T: +61 2 6125 8303

From dalloliogm at fastwebnet.it  Thu Jan 15 06:45:18 2009
From: dalloliogm at fastwebnet.it (Giovanni Marco Dall'Olio)
Date: Thu, 15 Jan 2009 12:45:18 +0100
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To: <319800150623214077@unknownmsgid>
References: <319800150623214077@unknownmsgid>
Message-ID: <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>

On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal wrote:
> Hi,
>
> I have been trying to write a python script to do the codon wise alignment
> of given nucleotide sequences.

Note that there are many tools that already do a 'codon wise'
alignment, if it is what I think you mean by it.
I think t-coffee does this. It is always better to use a tool that
already exists rather than develop a new one, if you can, because
otherwise your results will be different to compare with other
experiments.

> I have downloaded CDS sequences (by a script
> found on biopython mailing list) from genbank for a particular protein and
> now would like to check codon usage for few specific amino acid positions.

Can you provide a better example of what do you want to obtain?
Do you want to know:
- for a particular aminoacid position (e.g. the first, or the third,
  or the last) the codon usage in a set of sequences?
- for those aminoacids that are coded by more than a possible codon
  (e.g. Ala) the frequency with which every codon is used?
- the frequency at which every possible codon is used, in general.

If I can give you an advice, I would spend some time in developing a
test case first.
For example, create a fake sequence and calculate the
output that you expect from your experiment.
It is a lot easier to describe your experiment to other people if you
can provide the test cases you are using, it will be easier to
understand what you want to do.

> Could you please provide me few pointers on how to do that. I also want to
> take this opportunity to thank you guys for excellent work on biopython
> documentation. I am new to python, but I am able to use cookbook/tutorial
> example for my work with relative ease.
>
> Cheers,
>
> Animesh Agrawal
>
> PhD Scholar
>
> Proteomics & Therapy Design Group
>
> Division of Molecular Biosciences
>
> The John Curtin School of Medical Research
>
> The Australian National University
>
> P.O. Box 334
>
> Canberra ACT 2601
>
> AUSTRALIA
>
> T: +61 2 6125 8303
>
>
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From chapmanb at 50mail.com  Thu Jan 15 06:52:49 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 15 Jan 2009 06:52:49 -0500
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au>
References: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au>
Message-ID: <20090115115249.GA61956@kunkel>

Hi Animesh;

> I have been trying to write a python script to do the codon wise alignment
> of given nucleotide sequences. I have downloaded CDS sequences (by a script
> found on biopython mailing list) from genbank for a particular protein and
> now would like to check codon usage for few specific amino acid positions.

Biopython does not contain codon usage dictionaries; the possible
organisms and usage frequencies themselves are changing as
additional organisms are sequenced.
Your best bet is to parse out
the values from the codon usage database
(http://www.kazusa.or.jp/codon/) for your organism of interest. An
example is pasted below from E coli; you did not mention which
organism you were interested in. The values are reported as usage
per 1000 codons.

When you have defined this, here is some Biopython code to create a
dictionary (positional_usage) of usage at each codon position (using
python 0-based indexing for positions):

from Bio import SeqIO

handle = open("example.fasta", "rU")
positional_usage = {}
for record in SeqIO.parse(handle, "fasta"):
    assert len(record.seq) % 3 == 0 # make sure you are 3 based
    for cindex in range(len(record.seq) // 3):
        cur_codon = str(record.seq[cindex * 3:(cindex + 1) * 3])
        usage = usage_dict[cur_codon]
        positional_usage[cindex] = usage
handle.close()

The input to this is usage_dict, a dictionary defined as below.

Hope this helps,
Brad

Escherichia_coli = \
{'AAA': 35.601945036625438, 'AAC': 21.202802271903,
 'AAG': 13.045009394539333, 'AAT': 22.831396289856265,
 'ACA': 10.700618181965975, 'ACC': 21.387130807992541,
 'ACG': 13.784236156652, 'ACT': 11.016200111457801,
 'AGA': 4.4652452250900074, 'AGC': 14.997074890221718,
 'AGG': 2.5626687138052029, 'AGT': 10.73241545213447,
 'ATA': 8.2158886416564805, 'ATC': 22.685559186075952,
 'ATG': 25.945855225833537, 'ATT': 29.669004762179132,
 'CAA': 14.383602745467156, 'CAC': 8.8157333849102599,
 'CAG': 28.118110840502265, 'CAT': 12.473375763164368,
 'CCA': 8.6299703855048442, 'CCC': 5.630985746455262,
 'CCG': 19.354496289402018, 'CCT': 7.8991113260680947,
 'CGA': 4.0270166820911326, 'CGC': 18.382647392898786,
 'CGG': 6.4933372765136035, 'CGT': 18.916506823622456,
 'CTA': 4.4733738505466141, 'CTC': 10.083559878921733,
 'CTG': 46.036709350716478, 'CTT': 12.48556870134928,
 'GAA': 38.019254801088948, 'GAC': 18.833307951301883,
 'GAG': 18.80390145332651, 'GAT': 32.883397975828814,
 'GCA': 21.603495691469892, 'GCC': 23.869708653328228,
 'GCG': 27.990682682608973, 'GCT':
 17.355093504295862,
 'GGA': 10.60618268033774, 'GGC': 25.658245331001215,
 'GGG': 11.57779249962166, 'GGT': 24.92882073488034,
 'GTA': 11.897916896280412, 'GTC': 14.044830325702069,
 'GTG': 23.467102616006844, 'GTT': 20.038018059414991,
 'TAA': 1.9881661557984951, 'TAC': 12.005979799409431,
 'TAG': 0.28569727707782611, 'TAT': 18.337939952887442,
 'TCA': 9.9362883118255478, 'TCC': 9.2876718158321232,
 'TCG': 8.51664778355096, 'TCT': 10.941368941813147,
 'TGA': 1.0356825140595336, 'TGC': 5.9924705020549887,
 'TGG': 13.780171843923698, 'TGT': 5.3450493921581241,
 'TTA': 14.983925643159559, 'TTC': 15.622261818722567,
 'TTG': 12.856616545721486, 'TTT': 22.459153059387496 }

From animesh.agrawal at anu.edu.au  Thu Jan 15 08:21:00 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Fri, 16 Jan 2009 00:21:00 +1100
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To: <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
References: <319800150623214077@unknownmsgid>
 <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
Message-ID:

Hi Marco,
My apologies. Probably in my last mail I didn't make myself very clear.
I have a protein which is about 475 amino acid long and is highly
conserved (over 95%) among different organisms. I have downloaded its
CDS(coding sequence) .
I would like to calculate codon use frequency for important amino acid
positions as you have put it very nicely in your reply:
"for a particular aminoacid position (e.g. the first, or the third, or the
last) the codon usage for those aminoacids that are coded by more than a
possible codon (e.g. Ala) the frequency with which every codon is used?"
For example in a set of four sequences

         1    2    3
        Ala  Gly  Ile
Seq1    GCT  GCT  ATT
Seq2    GCC  GCC  ATC
Seq3    GCA  GCA  ATA
Seq4    GCG  GCG  ATT

For first amino acid position i.e.
Ala (which is coded by 4 codons) each
codon is used once in 4 sequences that gives you frequency of 0.25 for each
codon or for third amino acid position i.e. Ile (which is coded by 3
codons) the ATT will give you frequency of 0.5 while other two will give
you frequency of 0.25.

Cheers,
Animesh

----- Original Message -----
From: Giovanni Marco Dall'Olio
Date: Thursday, January 15, 2009 10:45 pm
Subject: Re: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
To: Animesh Agrawal
Cc: biopython at lists.open-bio.org

> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal
> wrote:
> > Hi,
> >
> > I have been trying to write a python script to do the codon wise
> > alignment of given nucleotide sequences.
>
> Note that there are many tools that already do a 'codon wise'
> alignment, if it is what I think you mean by it.
> I think t-coffee does this. It is always better to use a tool that
> already exists rather than develop a new one, if you can, because
> otherwise your results will be different to compare with other
> experiments.
>
>
> > I have downloaded CDS sequences (by a script
> > found on biopython mailing list) from genbank for a particular
> > protein and now would like to check codon usage for few specific
> > amino acid positions.
>
> Can you provide a better example of what do you want to obtain?
> Do you want to know:
> - for a particular aminoacid position (e.g. the first, or the third,
>   or the last) the codon usage in a set of sequences?
> - for those aminoacids that are coded by more than a possible codon
>   (e.g. Ala) the frequency with which every codon is used?
> - the frequency at which every possible codon is used, in general.
>
> If I can give you an advice, I would spend some time in developing a
> test case first. For example, create a fake sequence and calculate the
> output that you expect from your experiment.
> It is a lot easier to describe your experiment to other people if you
> can provide the test cases you are using, it will be easier to
> understand what you want to do.
>
>
> > Could you please provide me few pointers on how to do that. I also
> > want to take this opportunity to thank you guys for excellent work on
> > biopython documentation. I am new to python, but I am able to use
> > cookbook/tutorial example for my work with relative ease.
> >
> > Cheers,
> >
> > Animesh Agrawal
> >
> > PhD Scholar
> >
> > Proteomics & Therapy Design Group
> >
> > Division of Molecular Biosciences
> >
> > The John Curtin School of Medical Research
> >
> > The Australian National University
> >
> > P.O. Box 334
> >
> > Canberra ACT 2601
> >
> > AUSTRALIA
> >
> > T: +61 2 6125 8303
> >
> >
> >
> > _______________________________________________
> > BioPython mailing list  -  BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
>
>
>
> --
>
> My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Jan 15 08:34:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jan 2009 13:34:52 +0000
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To:
References: <319800150623214077@unknownmsgid>
 <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
Message-ID: <320fb6e00901150534q1cd8880bve9392ec8ac560d70@mail.gmail.com>

On Thu, Jan 15, 2009 at 1:21 PM, Animesh Agrawal wrote:
>
> Hi Marco,
> My apologies. Probably in my last mail I didn't make myself very clear.
> I have a protein which is about 475 amino acid long and is highly
> conserved (over 95%) among different organisms. I have downloaded
> its CDS(coding sequence) .
> I would like to calculate codon use frequency for important amino acid
> positions as you have put it very nicely in your reply:
> "for a particular aminoacid position (e.g.
> the first, or the third, or the last)
> the codon usage for those aminoacids that are coded by more than a
> possible codon (e.g. Ala) the frequency with which every codon is used?"
> For example in a set of four sequences
>          1    2    3
>         Ala  Gly  Ile
> Seq1    GCT  GCT  ATT
> Seq2    GCC  GCC  ATC
> Seq3    GCA  GCA  ATA
> Seq4    GCG  GCG  ATT
>
> For first amino acid position i.e. Ala (which is coded by 4 codons) each
> codon is used once in 4 sequences that gives you frequency of 0.25 for
> each codon or for third amino acid position i.e. Ile (which is coded by 3
> codons) the ATT will give you frequency of 0.5 while other two will give
> you frequency of 0.25.

OK - first of all you will need to create an alignment of all the
different CDS sequences. If they happen to be the same length this is
easy. Otherwise, you'll want to align their PROTEIN sequences, and
then turn this into a nucleotide sequence alignment (where gaps are
only found as triples). You may be lucky and find the proteins all
align beautifully with no gaps. Do you need advice on this step?

Once you have the alignment file, it should be fairly trivial to count
the codons in each set of three columns.

Peter

From dalloliogm at fastwebnet.it  Thu Jan 15 09:26:07 2009
From: dalloliogm at fastwebnet.it (Giovanni Marco Dall'Olio)
Date: Thu, 15 Jan 2009 15:26:07 +0100
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To:
References: <319800150623214077@unknownmsgid>
 <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
Message-ID: <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com>

On Thu, Jan 15, 2009 at 2:21 PM, Animesh Agrawal wrote:
>
> Hi Marco,
> My apologies. Probably in my last mail I didn't make myself very clear. I
> have a protein which is about 475 amino acid long and is highly conserved
> (over 95%) among different organisms. I have downloaded its CDS(coding
> sequence) .

ok!
I used to work with transcript and alternative splicing

> I would like to calculate codon use frequency for important amino acid
> positions as you have put it very nicely in your reply:
> "for a particular aminoacid position (e.g. the first, or the third, or the
> last) the codon usage for those aminoacids that are coded by more than a
> possible codon (e.g. Ala) the frequency with which every codon is used?"
> For example in a set of four sequences
>          1    2    3
>         Ala  Gly  Ile
> Seq1    GCT  GCT  ATT
> Seq2    GCC  GCC  ATC
> Seq3    GCA  GCA  ATA
> Seq4    GCG  GCG  ATT

Let's see how can you do this with biopython (ehi, Peter, please
correct me if I say something wrong!! :)).

If your set of sequences is not too big, you can just put the
sequences in a dictionary:

sequences = {seq1 : , seq2 = }

The alignment file (align.txt) should look like this (or any other
format supported by AlignIO):

    >seq1
    aaacccaaa
    >seq2
    aaacccaaa
    >seq3
    tttcccaaa
    >seq4
    tttgggaaa

If you want, you can use biopython to parse the alignment file:

>>> from Bio import AlignIO
>>> alignment = AlignIO.read(open('align.txt', 'r'), 'fasta')

Then, you will have an alignment object called 'alignment', which
contains all the sequences in your file:

>>> print alignment
SingleLetterAlphabet() alignment with 4 rows and 9 columns
aaacccaaa seq1
aaacccaaa seq2
tttcccaaa seq3
tttgggaaa seq4

You will be able to access all the sequences in your alignment by the
_records property of AlignIO:

>>> sequences = alignment._records
>>> print sequences
[SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq1',
name='seq1', description='seq1', dbxrefs=[]),
SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq2',
name='seq2', description='seq2', dbxrefs=[]),
SeqRecord(seq=Seq('tttcccaaa', SingleLetterAlphabet()), id='seq3',
name='seq3', description='seq3', dbxrefs=[]),
SeqRecord(seq=Seq('tttgggaaa', SingleLetterAlphabet()), id='seq4',
name='seq4', description='seq4', dbxrefs=[])]

If you prefer, you are not obliged to use AlignIO and you can
write your own parser for your alignment. However, if you use biopython's
code, you won't have to demonstrate that your parser doesn't contain errors
(somebody could ask you this).

The alignment object in biopython doesn't have any method to count
codon usage the way you want to do.
However, you can implement it easily in many ways, for example:

>>> codon_count_by_position = {}
>>> for codon_start in (range(0, len(sequences[0]), 3)):
        codon_count_by_position[codon_start] = {}
        for sequence in sequences:
            current_codon = sequence.seq[codon_start:codon_start+3]
            # note: why do I have to do .tostring() here, and in the previous statement no?
            codon_count_by_position[codon_start].setdefault(current_codon.tostring(), 0)
            codon_count_by_position[codon_start][current_codon.tostring()] += 1. / len(sequences)

>>> print codon_count_by_position
{0: {'aaa': 0.5, 'ttt': 0.5}, 3: {'ccc': 0.75, 'ggg': 0.25}, 6: {'aaa': 1.0}}

There are many other ways you can do this and you should be careful in
handling gaps and alternative splicing, and you should have a look at
the tools that already do codon-based alignment, but I hope this can
help you.

>
> For first amino acid position i.e. Ala (which is coded by 4 codons) each
> codon is used once in 4 sequences that gives you frequency of 0.25 for each
> codon or for third amino acid position i.e. Ile ( which is coded by 3
> codons) the ATT will give you frequency of 0.5 while other two will give
> you frequency of 0.25.
>
>
> Cheers,
> Animesh
>
>
> ----- Original Message -----
> From: Giovanni Marco Dall'Olio
> Date: Thursday, January 15, 2009 10:45 pm
> Subject: Re: [BioPython] How to check codon usage for specific amino acid
> positions in a given set of CDS sequences
> To: Animesh Agrawal
> Cc: biopython at lists.open-bio.org
>
>> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal
>> wrote:
>> > Hi,
>> >
>> > I have been trying to write a python script to do the codon wise
>> > alignment of given nucleotide sequences.
>>
>> Note that there are many tools that already do a 'codon wise'
>> alignment, if it is what I think you mean by it.
>> I think t-coffee does this. It is always better to use a tool that
>> already exists rather than develop a new one, if you can, because
>> otherwise your results will be different to compare with other
>> experiments.
>>
>>
>> > I have downloaded CDS sequences (by a script
>> > found on biopython mailing list) from genbank for a particular
>> > protein and now would like to check codon usage for few specific
>> > amino acid positions.
>>
>> Can you provide a better example of what do you want to obtain?
>> Do you want to know:
>> - for a particular aminoacid position (e.g. the first, or the third,
>>   or the last) the codon usage in a set of sequences?
>> - for those aminoacids that are coded by more than a possible codon
>>   (e.g. Ala) the frequency with which every codon is used?
>> - the frequency at which every possible codon is used, in general.
>>
>> If I can give you an advice, I would spend some time in developing a
>> test case first. For example, create a fake sequence and calculate the
>> output that you expect from your experiment.
>> It is a lot easier to describe your experiment to other people if you
>> can provide the test cases you are using, it will be easier to
>> understand what you want to do.
>>
>>
>> > Could you please provide me few pointers on how to do that. I also
>> > want to take this opportunity to thank you guys for excellent work on
>> > biopython documentation. I am new to python, but I am able to use
>> > cookbook/tutorial example for my work with relative ease.
>> >
>> > Cheers,
>> >
>> > Animesh Agrawal
>> >
>> > PhD Scholar
>> >
>> > Proteomics & Therapy Design Group
>> >
>> > Division of Molecular Biosciences
>> >
>> > The John Curtin School of Medical Research
>> >
>> > The Australian National University
>> >
>> > P.O.
>> > Box 334
>> >
>> > Canberra ACT 2601
>> >
>> > AUSTRALIA
>> >
>> > T: +61 2 6125 8303
>> >
>> >
>> >
>> > _______________________________________________
>> > BioPython mailing list  -  BioPython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>>
>>
>> --
>>
>> My blog on bioinformatics (now in English): http://bioinfoblog.it

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Jan 15 13:02:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jan 2009 18:02:05 +0000
Subject: [BioPython] How to check codon usage for specific amino acid
 positions in a given set of CDS sequences
In-Reply-To: <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com>
References: <319800150623214077@unknownmsgid>
 <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
 <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com>
Message-ID: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>

On Thu, Jan 15, 2009 at 2:26 PM, Giovanni Marco Dall'Olio wrote:
> Let's see how can you do this with biopython (ehi, Peter, please
> correct me if I say something wrong!! :)).
>
> ...
>
> You will be able to access all the sequences in your alignment by the
> _records property of AlignIO:

In Python anything starting with a single underscore is considered to
be a private variable, and you should avoid using it. So you
shouldn't be doing alignment._records, and if you do, don't complain
if this implementation detail changes in a future version of
Biopython.

For the Alignment object, if you really want a list of SeqRecord
objects you should use alignment.get_all_seqs() instead. Ugly I agree
- but in practice you don't need to do this so often. You can use the
alignment object itself, e.g.
    first_record = alignment[0]
    last_record = alignment[-1]
    for record in alignment :
        print record

I think there is still room for improvement to this bit of Biopython,
and there are a couple of open enhancement bugs.

If you are interested, here is my quick solution using Bio.AlignIO,
which is one way of answering Animesh's question.

First of all, we need the alignment in a suitable file format (e.g.
FASTA, ClustalW, PHYLIP, Stockholm etc). For example, taking Animesh's
example alignment with four sequences of length nine:

handle = open("my_example.fasta","w")
handle.write(""">Alpha
GCT GCT ATT
>Beta
GCC GCC ATC
>Gamma
GCA GCA ATA
>Delta
GCG GCG ATT""")
handle.close()

Here is one solution using Bio.AlignIO to read this in as an alignment
object, and then count the codon usage at each position separately:

from Bio import AlignIO
#Change this next line if your real file is another file format:
alignment = AlignIO.read(open("my_example.fasta"),"fasta")
assert alignment.get_alignment_length() % 3 == 0, \
       "Alignment length is not a multiple of three!"
number_of_codons = int(alignment.get_alignment_length() / 3)
for codon_index in range(number_of_codons) :
    #Count the codons in a dictionary using upper case codons as keys
    counts = dict()
    for record in alignment :
        #In case the alignment is in mixed case, make everything upper case
        codon = str(record.seq[codon_index*3:codon_index*3+3]).upper()
        #Assuming you want to exclude gaps when calculating frequencies:
        if codon=="---" : continue
        #Increment the count by one, defaulting to zero
        #(there are lots of ways to write this code!)
        counts[codon] = counts.get(codon,0)+1
    #Turn the counts into frequencies - note that because I exclude gaps,
    #the total codon count can vary across the alignment.
    total = float(sum(counts.values()))
    freqs = dict((codon,count/total) for (codon,count) in counts.iteritems())
    print "Codon frequencies for columns %i to %i:" \
          % (codon_index*3+1,codon_index*3+3),
    print freqs

And the output should read:

Codon frequencies for columns 1 to 3: {'GCA': 0.25, 'GCC': 0.25, 'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 4 to 6: {'GCA': 0.25, 'GCC': 0.25, 'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 7 to 9: {'ATT': 0.5, 'ATC': 0.25, 'ATA': 0.25}

Peter

From biopython at maubp.freeserve.co.uk Thu Jan 15 13:11:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jan 2009 18:11:42 +0000
Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences
In-Reply-To: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
Message-ID: <320fb6e00901151011p387a616bw8933b56d2d2ee5c@mail.gmail.com>

On Thu, Jan 15, 2009 at 6:02 PM, Peter wrote:
> On Thu, Jan 15, 2009 at 2:26 PM, Giovanni Marco Dall'Olio
> wrote:
>> Let's see how can you do this with biopython (ehi, Peter, please
>> correct me if I say something wrong!! :)).
>> ...
>> You will be able to access all the sequences in your alignment by the
>> _records property of AlignIO:
>
> In Python anything starting with a single underscore is considered to
> be a private variable, and you should avoid using it. So you
> shouldn't be doing alignment._records, and if you do, don't complain
> if this implementation detail changes in a future version of
> Biopython.
>
> For the Alignment object, if you really want a list of SeqRecord
> objects you should use alignment.get_all_seqs() instead. ...
On a related note, if you just want a list of SeqRecord objects from an alignment file, you can do this: from Bio import AlignIO alignment = AlignIO.read(open("my_example.phy"), "phylip") records = alignment.get_all_seqs() However, any input alignment format supported by Bio.AlignIO (like the PHYLIP format used in this example) can also be used via Bio.SeqIO, so you might prefer to do this: from Bio import SeqIO records = list(SeqIO.parse(open("my_example.phy"), "phylip")) Up to you. It rather depends on what you are trying to do with the sequences - sometimes working with the SeqRecord objects directly is preferable. Peter From animesh.agrawal at anu.edu.au Thu Jan 15 23:46:52 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Fri, 16 Jan 2009 15:46:52 +1100 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com> References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com> Message-ID: <000301c97795$730e76d0$592b6470$@agrawal@anu.edu.au> Peter, Wow! The code(for positional frequency of codons) works 4 me. Thanks a ton. While we are at it please allow me to ask you another question related to downloading CDS sequences. I have copied one script from mailing list for downloading CDS given from Genbank record of protein sequence written by Andrew Dalke. I modified it a little bit to include few more exceptions and it work in most of the cases but it's still not bug free. Giving errors frequently. I am copying both the script and errors. See if you can spot the problem.. or can suggest better way of doing it.. 
---------------------------------------------------------------------------- ---------------------------------------------------------------------------- - from Bio import SeqIO from Bio.Seq import Seq from Bio import GenBank from Bio.GenBank import LocationParser from EUtils import DBIds, DBIdsClient from Bio.SeqRecord import SeqRecord import StringIO from Bio import Entrez from Bio.Alphabet import IUPAC #Animesh Agrawal Email:animesh.agrawal at anu.edu.au print "This program extracts the CDS of a given Genbank protein file\n" File_Input = raw_input("Give the name of input file:\t") File_Output = raw_input("Give the name of output file:\t") gb_handle = open(File_Input, "r") feature_parser = GenBank.FeatureParser () iterator = GenBank.Iterator (gb_handle, feature_parser) Out_file= open(File_Output, "w") def lookup(name, seq_start, seq_stop): h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name])) return h.efetch(retmode = "text", rettype = "fasta", seq_start = seq_start, seq_stop = seq_stop).read() def make_rc_record(record) : """Returns a new SeqRecord with the reverse complement sequence.""" rc_rec = SeqRecord(seq = record.seq.reverse_complement(), \ id = "rc_" + record.id, \ name = "rc_" + record.name, \ description = "reverse complement") return rc_rec while 1: cur_entry = iterator.next () Genbank_entry = str(cur_entry) if cur_entry is None: break for feature in cur_entry.features : if feature.type == "CDS": loc = feature.qualifiers["coded_by"][0] Temp1=loc Temp2 = Temp1.split('(') # for genbank record like this # coded_by="complement(NC_001713.1:67323..68795)" if Temp2[0]=="complement": Temp3 = Temp2[1].replace(')', '') parsed_loc_complement=LocationParser.parse(LocationParser.scan(Temp3)) assert isinstance(parsed_loc_complement, LocationParser.AbsoluteLocation) assert isinstance(parsed_loc_complement.local_location, LocationParser.Range) seq_start = parsed_loc_complement.local_location.low seq_stop = parsed_loc_complement.local_location.high assert 
isinstance(seq_start, LocationParser.Integer) assert isinstance(seq_stop, LocationParser.Integer) seq_start = seq_start.val seq_stop = seq_stop.val Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop) fasta_handle = StringIO.StringIO(Temp4) record = SeqIO.read(fasta_handle, "fasta") record = make_rc_record(record) Out_file.write(record.format("fasta")) break # for genbank record like this # coded_by="join(NC_008114.1:51934..52632, NC_008114.1:54315..55043)" elif Temp2[0]=="join": loc=loc.replace('join', '') loc= loc.replace('(', '') loc= loc.replace(')', '') loc = loc.split(',') loc1=loc[0].split(':') loc2=loc[1].split(':') loc3=loc1[1].split('..') loc4=loc2[1].split('..') loc5=int(loc3[0])-1 loc6=int(loc3[1]) loc7=int(loc4[0])-1 loc8=int(loc4[1]) handle = Entrez.efetch(db="nucleotide", id=loc1[0], rettype="genbank") record=SeqIO.read(handle, "genbank") seq1 = record.seq[loc5:loc6] seq2 = record.seq[loc7:loc8] record.seq = seq1+seq2 Out_file.write(record.format("fasta")) break # for genbank record like this # coded_by="FM207547.1:<1..1443" else: parsed_loc=LocationParser.parse(LocationParser.scan(loc)) assert isinstance(parsed_loc, LocationParser.AbsoluteLocation) assert isinstance(parsed_loc.local_location, LocationParser.Range) seq_start = parsed_loc.local_location.low seq_stop = parsed_loc.local_location.high assert isinstance(seq_start, LocationParser.Integer) assert isinstance(seq_stop, LocationParser.Integer) seq_start = seq_start.val seq_stop = seq_stop.val Out_file.write(lookup(parsed_loc.path.accession, seq_start, seq_stop)) break # for swissprot entries in Genbank elif Genbank_entry.find('swissprot') >= 0: Entry = cur_entry.annotations Entry = str(Entry) Entry = Entry.split('xrefs') Entry1 =Entry[1].split(',') Entry2 = Entry1[0].split(':') handle = Entrez.efetch(db="nucleotide", id=Entry2[1], rettype="genbank") data=SeqIO.read(handle, "genbank") for feature in data.features : if feature.type == "gene": Gene_id= 
feature.qualifiers['gene'] [0] if Gene_id == "rbcL": temp = str(feature.location) temp = temp.replace(':', '..') temp = temp.replace('[', '') temp = temp.replace(']', '') if feature.strand == -1: temp1 = data.id+':<'+temp parsed_loc_complement = LocationParser.parse(LocationParser.scan(temp1)) assert isinstance(parsed_loc_complement, LocationParser.AbsoluteLocation) assert isinstance(parsed_loc_complement.local_location, LocationParser.Range) seq_start = parsed_loc_complement.local_location.low seq_stop = parsed_loc_complement.local_location.high assert isinstance(seq_start, LocationParser.Integer) assert isinstance(seq_stop, LocationParser.Integer) seq_start = seq_start.val+1 seq_stop = seq_stop.val Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop) fasta_handle = StringIO.StringIO(Temp4) record = SeqIO.read(fasta_handle, "fasta") record = make_rc_record(record) #print (record.format("fasta")) Out_file.write(record.format("fasta")) break else: temp2 = data.id+':<'+temp parsed_loc=LocationParser.parse(LocationParser.scan(temp2)) assert isinstance(parsed_loc, LocationParser.AbsoluteLocation) assert isinstance(parsed_loc.local_location, LocationParser.Range) seq_start = parsed_loc.local_location.low seq_stop = parsed_loc.local_location.high assert isinstance(seq_start, LocationParser.Integer) assert isinstance(seq_stop, LocationParser.Integer) seq_start = seq_start.val+1 seq_stop = seq_stop.val #print (lookup(parsed_loc.path.accession, seq_start, seq_stop)) Out_file.write(lookup(parsed_loc.path.accession, seq_start, seq_stop)) break break Out_file.close() ---------------------------------------------------------------------------- ---------------------------------------------------------------------------- - Syntax error at or near `join' token Traceback (most recent call last): File "C:\Documents and Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 52, in 
    parsed_loc_complement=LocationParser.parse(LocationParser.scan(Temp3))
  File "C:\Python25\lib\site-packages\Bio\GenBank\LocationParser.py", line 319, in parse
    return _cached_parser.parse(tokens)
  File "C:\Python25\Lib\site-packages\Bio\Parsers\spark.py", line 204, in parse
    self.error(tokens[i-1])
  File "C:\Python25\Lib\site-packages\Bio\Parsers\spark.py", line 183, in error
    raise SystemExit
SystemExit
----------------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 76, in
    loc5=int(loc3[0])-1
ValueError: invalid literal for int() with base 10: '<1'
----------------------------------------------------------------------------

Animesh

From biopython at maubp.freeserve.co.uk Fri Jan 16 07:35:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jan 2009 12:35:25 +0000
Subject: [BioPython] Downloading CDS sequences
Message-ID: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com>

On Fri, Jan 16, 2009 at 4:46 AM, Animesh Agrawal wrote:
>
> Peter,
> Wow! The code(for positional frequency of codons) works 4 me. Thanks a ton.

Good.

> While we are at it please allow me to ask you another question related to
> downloading CDS sequences.

Sure - but I would have changed the email subject line if I were you.

> I have copied one script from mailing list for
> downloading CDS given from Genbank record of protein sequence written by
> Andrew Dalke. I modified it a little bit to include few more exceptions and
> it work in most of the cases but it's still not bug free.

Do you have a link to the original in the mail archive?
http://lists.open-bio.org/pipermail/biopython/

One minor point is I would have used Bio.SeqIO rather than
Bio.GenBank.FeatureParser and Bio.GenBank.Iterator (the same parsing
code gets used internally - I just think the code is simpler).

From a style point of view, breaking this up into some subfunctions
would make it a lot clearer what is going on.

I see you are looking at the "coded_by" qualifier, which will be a
location string like "join(NC_008114.1:51934..52632,
NC_008114.1:54315..55043)" including other sequence identifiers. For
this example you download "NC_008114.1" and extract the two
subsequences and join them up.

The Bio.GenBank.LocationParser should be able to cope with parsing
these strings - but it's a complicated thing to do. As you have seen,
there can be joins etc to deal with - but there are also fuzzy
locations which are more tricky.

Your specific error is simple enough:

> Traceback (most recent call last):
> File "C:\Documents and
> Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 76,
> in
> loc5=int(loc3[0])-1
> ValueError: invalid literal for int() with base 10: '<1'

You've got a location like "<1..456" meaning it starts before base one
and continues to base 456 (one based counting). In this particular
case, you'll just have to take the sequence from the start (base 1).
The problem is your code does int("<1") which fails.

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 17 14:30:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jan 2009 19:30:17 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To:
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com>
Message-ID: <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>

Peter wrote:
> You've got a location like "<1..456" meaning it starts before base one
> and continues to base 456 (one based counting). In this particular
> case, you'll just have to take the sequence from the start (base 1).
> The problem is your code does int("<1") which fails.

From my testing, in this case and similar examples like
"AF376133.1:<1..>553" it is safe to treat this as just from position 1
to 553. The less than and greater than signs are to indicate that the
full protein CDS may well extend beyond this region, but it was not
sequenced.

Animesh wrote:
>
> http://lists.open-bio.org/pipermail/biopython/2003-April/001255.html
> This is the link to original script in the mailing list.
> Animesh
>

Thanks! I see Andrew's original code just dealt with the "easy" cases,
where the coded_by string was a non-fuzzy location, and without a
join.

Andrew's code (and yours) uses Bio.EUtils to access the NCBI's "Entrez
Utilities" online API. I should point out this module has been
deprecated since Release 1.48 (it's still there for now but will give a
warning message when used), and we recommend you use Bio.Entrez
instead.

I hope you don't mind me giving you a few comments about your code?
You seem to be struggling with handles. Andrew defined this function:

def lookup(name, seq_start, seq_stop):
    h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
    return h.efetch(retmode = "text", rettype = "fasta",
                    seq_start = seq_start, seq_stop = seq_stop).read()

The efetch call returns a handle, but you use its read method to get
all the data as a string. This means your lookup function returns a
string containing the record in FASTA format.
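To illustrate the difference between a handle and a string, here is a
small stand-alone sketch in present-day Python 3 using only the
standard library. The function fake_efetch and its canned FASTA text
are hypothetical stand-ins for the real network call (they are not
part of Biopython), and read_first_fasta is a deliberately minimal
reader written just for this example. The point is only that calling
read() turns the handle into a plain string, and that anything
expecting a handle then needs the string wrapped up again:

```python
import io

# Canned FASTA text standing in for what the server would send back.
FAKE_FASTA = ">example_id hypothetical record\nGCTGCTATT\n"

def fake_efetch():
    """Hypothetical stand-in for a network call; returns a file-like handle."""
    return io.StringIO(FAKE_FASTA)

def read_first_fasta(handle):
    """Minimal FASTA reader for a handle: returns (title, sequence)."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                break  # stop when a second record starts
            title = line[1:]
        elif line:
            chunks.append(line)
    return title, "".join(chunks)

# Route 1: read everything into a string, then wrap it back into a handle.
text = fake_efetch().read()
via_string = read_first_fasta(io.StringIO(text))

# Route 2: pass the handle straight to the reader, with no round trip.
via_handle = read_first_fasta(fake_efetch())

print(via_string == via_handle)
```

Both routes give the same record; the second simply skips the
intermediate string.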
However, for your code, it would have made more sense to just stick
with the handle - as you had to convert back from a string of data to
a handle using StringIO:

Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
fasta_handle = StringIO.StringIO(Temp4)
record = SeqIO.read(fasta_handle, "fasta")

Using Bio.Entrez.efetch (the equivalent of the old EUtils efetch
method you were using), which returns a handle, this would be just:

from Bio import Entrez
from Bio import SeqIO
fasta_handle = Entrez.efetch("nucleotide", id=name,
                             retmode="text", rettype="fasta",
                             seq_start=seq_start, seq_stop=seq_stop)
record = SeqIO.read(fasta_handle, "fasta")

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 17 14:40:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jan 2009 19:40:49 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>
Message-ID: <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com>

Following Animesh's query, I was inspired to try and solve this
problem for myself. My own rough script (below) differs in several
ways from Andrew's and Animesh's code.

First of all, I didn't bother using the Bio.GenBank.LocationParser as
I felt that for CDS processing I only needed to cope with a handful of
location formats, and this was easier to do "by hand".

Secondly I found some GenBank/GenPept examples where there wasn't a
CDS feature with a "coded_by" qualifier in the annotation. Here the
only thing I could find that worked was to look under the DBSOURCE
information for a cross reference to the full parent nucleotide
sequence, and then try and work out which bit codes for the protein.
This is a little ugly, but seems to work.

I'm also using Bio.SeqIO and Bio.Entrez rather than Bio.GenBank and
Bio.EUtils (deprecated).
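As a rough illustration of what handling these location strings "by
hand" can look like, here is an offline sketch in present-day Python 3
(standard library only, no downloading). The function name
parse_coded_by and the tuple layout it returns are assumptions made
for this example, not part of Biopython. Fuzzy markers < and > are
simply dropped, as discussed earlier in the thread, and only the
handful of formats mentioned so far (plain range, join, complement)
are covered:

```python
def parse_coded_by(source):
    """Split a coded_by string into a list of (accession, start, end, strand).

    start and end use one-based counting, as in the GenBank file.
    strand is +1 or -1. Hypothetical helper for illustration only.
    """
    source = source.strip()
    if source.startswith("complement(") and source.endswith(")"):
        # Reverse complement: reverse the order of the parts, flip strands.
        inner = parse_coded_by(source[len("complement("):-1])
        return [(name, start, end, -strand)
                for (name, start, end, strand) in reversed(inner)]
    if source.startswith("join(") and source.endswith(")"):
        parts = []
        for piece in source[len("join("):-1].split(","):
            parts.extend(parse_coded_by(piece))
        return parts
    if "(" in source or ")" in source \
            or source.count(":") != 1 or source.count("..") != 1:
        raise ValueError("Don't understand %r" % source)
    name, loc = source.split(":")
    # Drop any < or > fuzzy markers before converting to integers.
    start, end = [int(x.strip("<>")) for x in loc.split("..")]
    return [(name, start, end, +1)]

fuzzy = parse_coded_by("AF376133.1:<1..>553")
joined = parse_coded_by(
    "join(NC_008114.1:51934..52632, NC_008114.1:54315..55043)")
reverse = parse_coded_by("complement(NC_001713.1:67323..68795)")
for parts in (fuzzy, joined, reverse):
    print(parts)
```

Each tuple could then be turned into one fetch request, with the
pieces concatenated and minus-strand pieces reverse complemented.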
I think the most important change was that I explicitly verify that
the nucleotide sequence obtained, when translated, does actually give
the expected protein sequence - just in case there was an error in my
code, the annotation, or even the downloads.

Peter

---

#Script to take a file of proteins in GenBank/GenPept format, examine their annotation,
#and use this to download their CDS from the NCBI.
#Written and tested on Biopython 1.49, on 2009/01/17
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio import Entrez

#Edit this next line (and read the NCBI Entrez usage guidelines)
Entrez.email = "Your.Name.Here@example.com"

def get_nuc_by_name(name, start=None, end=None) :
    """Fetches the sequence from the NCBI given an identifier name.

    Note start and end should be given using one based counting!

    Returns a Seq object."""
    record = SeqIO.read(Entrez.efetch("nucleotide",
                                      id=name.strip(),
                                      seq_start=start,
                                      seq_stop=end,
                                      retmode="text",
                                      rettype="fasta"), "fasta")
    return record.seq

def get_nuc_from_coded_by_string(source) :
    """Fetches the sequence from the NCBI for a "coded_by" string.

    e.g. "NM_010510.1:21..569" or "AF376133.1:<1..>553"
    or "join(AB061020.1:1..184,AB061020.1:300..1300)"
    or "complement(NC_001713.1:67323..68795)"

    Note - joins and complements are handled by recursion.

    Returns a Seq object."""
    if source.startswith("complement(") :
        assert source.endswith(")")
        #For simplicity this works by recursion
        return get_nuc_from_coded_by_string(source[11:-1]).reverse_complement()
    if source.startswith("join(") :
        assert source.endswith(")")
        #For simplicity this works by recursion.
        #Note that the Seq object (currently) does not have a join
        #method, so convert to strings and join them, then go back
        #to a Seq object:
        return Seq("".join(str(get_nuc_from_coded_by_string(s)) \
                           for s in source[5:-1].split(",")))
    if "(" in source or ")" in source \
    or source.count(":") != 1 or source.count("..") != 1 :
        raise ValueError("Don't understand %s" % repr(source))
    name, loc = source.split(":")
    #Remove and ignore any leading < or > for fuzzy locations which
    #indicate the full CDS extends beyond the region sequenced.
    start, end = [int(x.lstrip("<>")) for x in loc.split("..")]
    #We could now download the full sequence, and crop it locally:
    #return get_nuc_by_name(name)[start-1:end]
    #However, we can ask the NCBI to crop it and then download
    #just the bit we need!
    return get_nuc_by_name(name,start,end)

def get_nuc_record(protein_record, table="Standard") :
    """Given a protein record, returns a record with the CDS nucleotides.

    The protein's annotation is used to determine the CDS sequence(s)
    which are downloaded from the NCBI using Entrez.

    The translation table specified is used to check the nucleotides
    actually do give the expected protein sequence.

    Tries to get the CDS information from a "coded_by" qualifier,
    failing that it falls back on a DB_SOURCE xref entry (which does
    not specify which bit of the nucleotide sequence referenced is
    required - this is deduced from the expected translation).
    """
    if not isinstance(protein_record, SeqRecord) :
        raise TypeError("Expect a SeqRecord as the protein_record.")
    feature = None
    for f in protein_record.features :
        if f.type == "CDS" and "coded_by" in f.qualifiers :
            feature = f
            break
    if feature :
        assert feature.location.start.position == 0
        assert feature.location.end.position == len(protein_record)
        source = feature.qualifiers["coded_by"][0]
        print "Using %s" % source
        return SeqRecord(Seq(""))
        nuc = get_nuc_from_coded_by_string(source)
        #See if this included the stop codon - they don't always!
        if str(nuc[-3:].translate(table)) == "*" :
            nuc = nuc[:-3]
    elif "db_source" in protein_record.annotations :
        #Note the current parsing of the DBSOURCE lines in GenPept
        #files is non-optimal (as of Biopython 1.49). If the
        #parsing is changed then the following code will need
        #updating to pull out the first xrefs entry.
        parts = protein_record.annotations["db_source"].split()
        source = parts[parts.index("xrefs:")+1].strip(",;")
        print "Using %s" % source
        nuc_all = get_nuc_by_name(source)
        start = nuc_all.translate(table).find(protein_record.seq)
        assert start != -1, "Could not find start (assumed in frame)"
        nuc = nuc_all[3*start:3*(start+len(protein_record))]
    else :
        raise ValueError("Could not determine CDS source from record.")
    #The message uses the Seq object's own translate method, since no
    #stand-alone translate function has been imported in this script.
    assert str(nuc.translate(table)) == str(protein_record.seq), \
           "Translation:\n%s\nExpected:\n%s" \
           % (nuc.translate(table), protein_record.seq)
    return SeqRecord(nuc, id=protein_record.id,
                     description="(the CDS for this protein)")

#Now use the above functions to fetch the CDS sequence for some proteins...
nucs = (get_nuc_record(p, table="Standard") for p \
        in SeqIO.parse(open("protein.gbk"),"genbank"))
handle = open("nucleotide.fasta","w")
SeqIO.write(nucs, handle, "fasta")
handle.close()
print "Done"

From biopython at maubp.freeserve.co.uk Mon Jan 19 06:03:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 Jan 2009 11:03:41 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <-808938340139089427@unknownmsgid>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid>
Message-ID: <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com>

On Mon, Jan 19, 2009 at 1:53 AM, Animesh Agrawal wrote:
> Peter,
> Definitely the script written by you look much simpler to understand with
> defined functions for each of the cases.

Thank you!
This illustrates one of the problems I had reading your code - it was
one big lump with no clear structure. You should find this gets easier
with practice.

> But this script is giving following error. I couldn't get it working for me.
> I am attaching the input file with this mail.
> ----------------------------------------------------------------------------
> ...
> assert start != -1, "Could not find start (assumed in frame)"
> AssertionError: Could not find start (assumed in frame)
> ----------------------------------------------------------------------------

You've found a nasty example, locus P24673, where there is no CDS
feature with a "coded_by" qualifier in the annotation. My code just
tried fetching all of M59080.1 and only looked at the translation in
the first frame - but didn't find the protein. This is what my error
message was trying to convey.

In this situation, the sequence can still be found by downloading
M59080.1 but we potentially have to check all six translation frames.
I must have been lucky that all the examples I tried without the
"coded_by" information happened to be in the first frame. This is
fairly simple to fix - see below.

Peter

#Script to take a file of proteins in GenBank/GenPept format, examine
#their annotation, and use this to download their CDS from the NCBI.
#Written and tested on Biopython 1.49, on 2009/01/19
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio import Entrez

#Edit this next line (and read the NCBI Entrez usage guidelines)
#Entrez.email = "Your.Name.Here@example.com"

def get_nuc_by_name(name, start=None, end=None) :
    """Fetches the sequence from the NCBI given an identifier name.

    Note start and end should be given using one based counting!

    Returns a Seq object."""
    record = SeqIO.read(Entrez.efetch("nucleotide",
                                      id=name.strip(),
                                      seq_start=start,
                                      seq_stop=end,
                                      retmode="text",
                                      rettype="fasta"), "fasta")
    return record.seq

def get_nuc_from_coded_by_string(source) :
    """Fetches the sequence from the NCBI for a "coded_by" string.

    e.g. "NM_010510.1:21..569" or "AF376133.1:<1..>553"
    or "join(AB061020.1:1..184,AB061020.1:300..1300)"
    or "complement(NC_001713.1:67323..68795)"

    Note - joins and complements are handled by recursion.

    Returns a Seq object."""
    if source.startswith("complement(") :
        assert source.endswith(")")
        #For simplicity this works by recursion
        return get_nuc_from_coded_by_string(source[11:-1]).reverse_complement()
    if source.startswith("join(") :
        assert source.endswith(")")
        #For simplicity this works by recursion.
        #Note that the Seq object (currently) does not have a join
        #method, so convert to strings and join them, then go back
        #to a Seq object:
        return Seq("".join(str(get_nuc_from_coded_by_string(s)) \
                           for s in source[5:-1].split(",")))
    if "(" in source or ")" in source \
    or source.count(":") != 1 or source.count("..") != 1 :
        raise ValueError("Don't understand %s" % repr(source))
    name, loc = source.split(":")
    #Remove and ignore any leading < or > for fuzzy locations which
    #indicate the full CDS extends beyond the region sequenced.
    start, end = [int(x.lstrip("<>")) for x in loc.split("..")]
    #We could now download the full sequence, and crop it locally:
    #return get_nuc_by_name(name)[start-1:end]
    #However, we can ask the NCBI to crop it and then download
    #just the bit we need!
    return get_nuc_by_name(name,start,end)

def find_protein_within_nuc(protein_seq, nuc_seq, table) :
    """Search all six frames to find a protein's CDS."""
    for frame in [0,1,2] :
        start = nuc_seq[frame:].translate(table).find(protein_seq)
        if start != -1 :
            return nuc_seq[frame+3*start:frame+3*(start+len(protein_seq))]
    rev_seq = nuc_seq.reverse_complement()
    for frame in [0,1,2] :
        start = rev_seq[frame:].translate(table).find(protein_seq)
        if start != -1 :
            return rev_seq[frame+3*start:frame+3*(start+len(protein_seq))]
    raise ValueError("Could not find the protein sequence "
                     "in any of the six translation frames.")

def get_nuc_record(protein_record, table="Standard") :
    """Given a protein record, returns a record with the CDS nucleotides.

    The protein's annotation is used to determine the CDS sequence(s)
    which are downloaded from the NCBI using Entrez.

    The translation table specified is used to check the nucleotides
    actually do give the expected protein sequence.

    Tries to get the CDS information from a "coded_by" qualifier,
    failing that it falls back on a DB_SOURCE xref entry (which does
    not specify which bit of the nucleotide sequence referenced is
    required - this is deduced from the expected translation). This
    could get the wrong region if there happen to be two genes with
    different nucleotides encoding the same protein sequence!
    """
    if not isinstance(protein_record, SeqRecord) :
        raise TypeError("Expect a SeqRecord as the protein_record.")
    feature = None
    for f in protein_record.features :
        if f.type == "CDS" and "coded_by" in f.qualifiers :
            feature = f
            break
    if feature :
        #This is the good situation, there is a precise "coded_by" string
        #Check this CDS feature is for the whole protein:
        assert feature.location.start.position == 0
        assert feature.location.end.position == len(protein_record)
        source = feature.qualifiers["coded_by"][0]
        print "Using %s" % source
        return SeqRecord(Seq(""))
        nuc = get_nuc_from_coded_by_string(source)
        #See if this included the stop codon - they don't always!
        if str(nuc[-3:].translate(table)) == "*" :
            nuc = nuc[:-3]
    elif "db_source" in protein_record.annotations :
        #Note the current parsing of the DBSOURCE lines in GenPept
        #files is non-optimal (as of Biopython 1.49). If the
        #parsing is changed then the following code will need
        #updating to pull out the first xrefs entry.
        parts = protein_record.annotations["db_source"].split()
        source = parts[parts.index("xrefs:")+1].strip(",;")
        print "Using %s" % source
        nuc_all = get_nuc_by_name(source)
        nuc = find_protein_within_nuc(protein_record.seq, nuc_all, table)
    else :
        raise ValueError("Could not determine CDS source from record.")
    #The message uses the Seq object's own translate method, since no
    #stand-alone translate function has been imported in this script.
    assert str(nuc.translate(table)) == str(protein_record.seq), \
           "Translation:\n%s\nExpected:\n%s" \
           % (nuc.translate(table), protein_record.seq)
    return SeqRecord(nuc, id=protein_record.id,
                     description="(the CDS for this protein)")

#Now use the above functions to fetch the CDS sequence for some proteins...
gbk_input = "Diatoms_in.gp" #any proteins in GenBank/GenPept format.
nucs = (get_nuc_record(p, table="Standard") for p \
        in SeqIO.parse(open(gbk_input),"genbank"))
handle = open("nucleotide.fasta","w")
SeqIO.write(nucs, handle, "fasta")
handle.close()
print "Done"

From biopython at maubp.freeserve.co.uk Tue Jan 20 04:44:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jan 2009 09:44:02 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <7373673958461722326@unknownmsgid>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid> <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com> <7373673958461722326@unknownmsgid>
Message-ID: <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com>

On Tue, Jan 20, 2009 at 8:31 AM, Animesh Agrawal wrote:
>
> Peter,
> Thanks a lot. ...
> I tested your new script for Downloading CDS sequences.
It was working fine > for records like P24673 but couldn't get it working for precise "coded_by" > string situation unless I comment(#return SeqRecord(Seq(""))) statement in > get_nuc_record() function. I don't understand why? That was a deliberate mistake to test your understanding (joke). I'm pleased you worked out what was wrong! I put that line in to speed up my testing - and forgot to remove it. Basically that line was to just return a dummy SeqRecord with an empty sequence for the "coded_by" cases, rather than going online wasting the NCBI server time. With hindsight I could have edited my example GenBank file to focus on the cases of interest. Does that make sense? Sorry for the confusion, Peter From animesh.agrawal at anu.edu.au Wed Jan 21 02:14:42 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Wed, 21 Jan 2009 18:14:42 +1100 Subject: [BioPython] Downloading CDS sequences In-Reply-To: <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com> References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid> <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com> <7373673958461722326@unknownmsgid> <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com> Message-ID: <000001c97b97$ee720de0$cb5629a0$@agrawal@anu.edu.au> >just return a dummy SeqRecord with an empty >sequence for the "coded_by" cases, rather than going online wasting >the NCBI server time. Ok. So that's the reason and I was wondering why you want to return empty sequence. 
-----Original Message----- From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf Of Peter Sent: Tuesday, 20 January 2009 8:44 PM To: Animesh Agrawal Cc: BioPython Mailing List Subject: Re: [BioPython] Downloading CDS sequences On Tue, Jan 20, 2009 at 8:31 AM, Animesh Agrawal wrote: > > Peter, > Thanks a lot. ... > I tested your new script for Downloading CDS sequences. It was working fine > for records like P24673 but couldn't get it working for precise "coded_by" > string situation unless I comment(#return SeqRecord(Seq(""))) statement in > get_nuc_record() function. I don't understand why? That was a deliberate mistake to test your understanding (joke). I'm pleased you worked out what was wrong! I put that line in to speed up my testing - and forgot to remove it. Basically that line was to just return a dummy SeqRecord with an empty sequence for the "coded_by" cases, rather than going online wasting the NCBI server time. With hindsight I could have edited my example GenBank file to focus on the cases of interest. Does that make sense? Sorry for the confusion, Peter From animesh.agrawal at anu.edu.au Sun Jan 18 20:53:23 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Mon, 19 Jan 2009 12:53:23 +1100 Subject: [BioPython] Downloading CDS sequences In-Reply-To: <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> Message-ID: <000001c979d8$b65b1890$231149b0$@agrawal@anu.edu.au> Peter, Definitely the script written by you look much simpler to understand with defined functions for each of the cases. But this script is giving following error. I couldn't get it working for me. I am attaching the input file with this mail. 
---------------------------------------------------------------------------- --------------------------------- Traceback (most recent call last): File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 114, in SeqIO.write(nucs, handle, "fasta") File "C:\Python25\Lib\site-packages\Bio\SeqIO\__init__.py", line 274, in write count = writer_class(handle).write_file(sequences) File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 255, in write_file count = self.write_records(records) File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 239, in write_records for record in records : File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 110, in nucs = (get_nuc_record(p, table="Standard") for p \ File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 99, in get_nuc_record assert start != -1, "Could not find start (assumed in frame)" AssertionError: Could not find start (assumed in frame) ---------------------------------------------------------------------------- ------------------------------------ Cheers, Animesh -----Original Message----- From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf Of Peter Sent: Sunday, 18 January 2009 6:41 AM To: Animesh Agrawal Cc: BioPython Mailing List Subject: Re: [BioPython] Downloading CDS sequences Following Animesh's query, I was inspired to try and solve this problem for myself. My rough script of my own to solve this problem (below) has several differences to Andrew and Animesh's code. First of all, I didn't bother using the Bio.GenBank.LocationParser as I felt that for CDS processing I only needed to cope with a handful of location formats, and this was easier to do "by hand". Secondly I found some GenBank/GenPept examples where there wasn't a CDS feature with a "coded_by" qualifier in the annotation. 
Here the only thing I could find that worked was to look under the DBSOURCE information for a cross reference to the full parent nucleotide sequence, and then try and work out which bit codes for the protein. This is a little ugly, but seems to work.

I'm also using Bio.SeqIO and Bio.Entrez rather than Bio.GenBank and Bio.EUtils (deprecated).

I think the most important change was that I explicitly verify that the nucleotide sequence obtained, when translated, does actually give the expected protein sequence - just in case there was an error in my code, the annotation, or even the downloads.

Peter

---

#Script to take a file of proteins in GenBank/GenPept format, examine their annotation,
#and use this to download their CDS from the NCBI.
#Written and tested on Biopython 1.49, on 2009/01/17
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio import Entrez

#Edit this next line (and read the NCBI Entrez usage guidelines)
Entrez.email = "Your.Name.Here at example.com"

def get_nuc_by_name(name, start=None, end=None) :
    """Fetches the sequence from the NCBI given an identifier name.

    Note start and end should be given using one based counting!

    Returns a Seq object."""
    record = SeqIO.read(Entrez.efetch("nucleotide",
                                      id=name.strip(),
                                      seq_start=start,
                                      seq_stop=end,
                                      retmode="text",
                                      rettype="fasta"), "fasta")
    return record.seq

def get_nuc_from_coded_by_string(source) :
    """Fetches the sequence from the NCBI for a "coded_by" string.

    e.g. "NM_010510.1:21..569" or "AF376133.1:<1..>553"
    or "join(AB061020.1:1..184,AB061020.1:300..1300)"
    or "complement(NC_001713.1:67323..68795)"

    Note - joins and complements are handled by recursion.

    Returns a Seq object."""
    if source.startswith("complement(") :
        assert source.endswith(")")
        #For simplicity this works by recursion
        return get_nuc_from_coded_by_string(source[11:-1]).reverse_complement()
    if source.startswith("join(") :
        assert source.endswith(")")
        #For simplicity this works by recursion.
        #Note that the Seq object (currently) does not have a join
        #method, so convert to strings and join them, then go back
        #to a Seq object:
        return Seq("".join(str(get_nuc_from_coded_by_string(s)) \
                           for s in source[5:-1].split(",")))
    if "(" in source or ")" in source \
    or source.count(":") != 1 or source.count("..") != 1 :
        raise ValueError("Don't understand %s" % repr(source))
    name, loc = source.split(":")
    #Remove and ignore any leading < or > for fuzzy locations which
    #indicate the full CDS extends beyond the region sequenced.
    start, end = [int(x.lstrip("<>")) for x in loc.split("..")]
    #We could now download the full sequence, and crop it locally:
    #return get_nuc_by_name(name)[start-1:end]
    #However, we can ask the NCBI to crop it and then download
    #just the bit we need!
    return get_nuc_by_name(name,start,end)

def get_nuc_record(protein_record, table="Standard") :
    """Given a protein record, returns a record with the CDS nucleotides.

    The protein's annotation is used to determine the CDS sequence(s)
    which are downloaded from the NCBI using Entrez.

    The translation table specified is used to check the nucleotides
    actually do give the expected protein sequence.

    Tries to get the CDS information from a "coded_by" qualifier,
    failing that it falls back on a DB_SOURCE xref entry (which does
    not specify which bit of the nucleotide sequence referenced is
    required - this is deduced from the expected translation).
    """
    if not isinstance(protein_record, SeqRecord) :
        raise TypeError("Expect a SeqRecord as the protein_record.")
    feature = None
    for f in protein_record.features :
        if f.type == "CDS" and "coded_by" in f.qualifiers :
            feature = f
            break
    if feature :
        assert feature.location.start.position == 0
        assert feature.location.end.position == len(protein_record)
        source = feature.qualifiers["coded_by"][0]
        print "Using %s" % source
        return SeqRecord(Seq(""))
        nuc = get_nuc_from_coded_by_string(source)
        #See if this included the stop codon - they don't always!
        if str(nuc[-3:].translate(table)) == "*" :
            nuc = nuc[:-3]
    elif "db_source" in protein_record.annotations :
        #Note the current parsing of the DBSOURCE lines in GenPept
        #files is non-optimal (as of Biopython 1.49). If the
        #parsing is changed then the following code will need
        #updating to pull out the first xrefs entry.
        parts = protein_record.annotations["db_source"].split()
        source = parts[parts.index("xrefs:")+1].strip(",;")
        print "Using %s" % source
        nuc_all = get_nuc_by_name(source)
        start = nuc_all.translate(table).find(protein_record.seq)
        assert start != -1, "Could not find start (assumed in frame)"
        nuc = nuc_all[3*start:3*(start+len(protein_record))]
    else :
        raise ValueError("Could not determine CDS source from record.")
    #Use the Seq object's translate method for the error message too
    #(no stand-alone translate function is imported by this script).
    assert str(nuc.translate(table)) == str(protein_record.seq), \
           "Translation:\n%s\nExpected:\n%s" \
           % (nuc.translate(table), protein_record.seq)
    return SeqRecord(nuc, id=protein_record.id,
                     description="(the CDS for this protein)")

#Now use the above functions to fetch the CDS sequence for some proteins...
nucs = (get_nuc_record(p, table="Standard") for p \
        in SeqIO.parse(open("protein.gbk"),"genbank"))
handle = open("nucleotide.fasta","w")
SeqIO.write(nucs, handle, "fasta")
handle.close()
print "Done"

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Diatoms_in.gp
Type: application/octet-stream
Size: 78630 bytes
Desc: not available
URL:

From nir at rosettadesigngroup.com Tue Jan 27 08:09:28 2009
From: nir at rosettadesigngroup.com (Nir London)
Date: Tue, 27 Jan 2009 15:09:28 +0200
Subject: [BioPython] Rosetta Academic Training Workshop
Message-ID:

Due to public demand, "Rosetta Design Group" is organizing a "Rosetta" software training workshop, aimed at academic groups. The format of the workshop will be a "webinar" - a web seminar, enabling more groups to attend while avoiding the annoying jet lag and accommodation troubles.

Would you be interested in participating?
If so please fill the form located at: http://rosettadesigngroup.com/blog/rosetta-academic-workshop/ and we will contact you when the details are finalized.*

Nir London | Rosetta Design Group
http://rosettadesigngroup.com/

* If you're not from an academic group, don't worry, write us anyway...

From rodrigo_faccioli at uol.com.br Tue Jan 27 11:31:41 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Tue, 27 Jan 2009 14:31:41 -0200
Subject: [BioPython] Error XML Parser and another doubt
Message-ID: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com>

I have a error about read a XML file which is result from NCBIWWW.qblast. For this work, I used biopython 1.45 and python 2.5. The source-code is below:

from Bio.Blast import NCBIXML
import sys

def readxml(filenamexml):
    E_VALUE_THRESH = 0.04
    result_handle = open(filenamexml)
    blast_records = NCBIXML.parse(result_handle)
    for alignment in blast_records:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print '****Alignment****'
                print 'sequence:', alignment.title
                print 'length:', alignment.length
                print 'e value:', hsp.expect
                print hsp.query[0:75] + '...'
                print hsp.match[0:75] + '...'
                print hsp.sbjct[0:75] + '...'

def main():
    filenamexml = sys.argv[1]
    readxml(filenamexml)
    print "Done"

main()

The error message is:

Traceback (most recent call last):
  File "src/readxml.py", line 26, in <module>
    main()
  File "src/readxml.py", line 23, in main
    readxml(filenamexml)
  File "src/readxml.py", line 10, in readxml
    for alignment in blast_records:
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 574, in parse
    expat_parser.Parse(text, False)
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement
    eval("self.%s()" % method)
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in _end_BlastOutput_version
    self._header.date = self._value.split()[2][1:-1]
IndexError: list index out of range

I'm very new in Python and BioPython.
Sincerely, this is my first program without tutorial. I have another doubt: Is there a way (website, program) that read a xml file from blast and shows like ncbi web site? Thanks for any help. -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From biopython at maubp.freeserve.co.uk Tue Jan 27 11:59:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jan 2009 16:59:24 +0000 Subject: [BioPython] Error XML Parser and another doubt In-Reply-To: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com> References: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com> Message-ID: <320fb6e00901270859v29f545aeu3475cec90c493577@mail.gmail.com> On Tue, Jan 27, 2009 at 4:31 PM, Rodrigo faccioli wrote: > I have a error about read a XML file which is result from NCBIWWW.qblast. > For this work, I used biopython 1.45 and python 2.5. > ... > Traceback (most recent call last): > ... > File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in > _end_BlastOutput_version > self._header.date = self._value.split()[2][1:-1] > IndexError: list index out of range > > I'm very new in Python and BioPython. Sincerely, this is my first program > without tutorial. I'm sorry you've had trouble. This looks like an old bug in parsing the date in the XML file, caused when the NCBI changed their online server. See Biopython Bug 2499 for details: http://bugzilla.open-bio.org/show_bug.cgi?id=2499 We fixed this in Biopython 1.46, but you are using Biopython 1.45. Can you update your machine? The current release is Biopython 1.49. 
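
To illustrate the failure mode, here is a minimal sketch (not the actual NCBIXML code, and not necessarily how the bug was fixed). The traceback shows the old parser taking the third whitespace-separated token of the version text as the date, so any version text with fewer than three tokens raises IndexError. The helper name parse_version_date and the two sample strings are assumptions made up for this example.

```python
def parse_version_date(version_text):
    """Return the bracketed date from a BLAST version string, or None.

    Hypothetical helper for illustration only.
    """
    parts = version_text.split()
    # The old code did parts[2][1:-1] unconditionally, which raises
    # IndexError when there is no third token to index.
    if len(parts) >= 3 and parts[2].startswith("[") and parts[2].endswith("]"):
        return parts[2][1:-1]
    return None


# Hypothetical sample strings: one with a trailing [date], one without.
print(parse_version_date("BLASTP 2.2.18 [Mar-02-2008]"))
print(parse_version_date("BLASTP 2.2.19+"))
```

Returning None for a missing date is just one possible design choice; the point is only that the token count has to be checked before indexing.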
Peter From rodrigo_faccioli at uol.com.br Tue Jan 27 14:28:34 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Tue, 27 Jan 2009 17:28:34 -0200 Subject: [BioPython] Remove biopython 1.45 Message-ID: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com> I want to know, how can I remove the biopython 1.45 in my machine. I installed the last version (1.49) from biopython website. I read http://biopython.org/DIST/docs/install/Installation.html and I didn't find anything about uninstall. Thanks, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 From dalloliogm at gmail.com Wed Jan 28 05:05:19 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 28 Jan 2009 11:05:19 +0100 Subject: [BioPython] Remove biopython 1.45 In-Reply-To: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com> References: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com> Message-ID: <5aa3b3570901280205w191be10cr18ad32e84f18076@mail.gmail.com> On Tue, Jan 27, 2009 at 8:28 PM, Rodrigo faccioli wrote: > I want to know, how can I remove the biopython 1.45 in my machine. I > installed the last version (1.49) from biopython website. > > I read http://biopython.org/DIST/docs/install/Installation.html and I didn't > find anything about uninstall. In the future, I suggest you to always install/upgrade biopython via easy_install. Executing this from a command line: $: easy_install -U biopython is the easiest way to install and upgrade biopython along with all its dependencies. As for your problem, it should be enough to delete the folder where you have installed biopython 1.45 (please someone correct me if I am wrong). 
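
If you are not sure which folder that is, a small sketch like the following can report where Python would import a package from. It assumes a current Python 3 interpreter (importlib.util is not available on Python 2.5), and package_location is a made-up helper name rather than anything provided by Biopython.

```python
import importlib.util
import os


def package_location(name):
    """Return the directory a top-level package would be imported from.

    Returns None if the package cannot be found or has no file location.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None or not os.path.isabs(spec.origin):
        return None
    return os.path.dirname(spec.origin)


# Prints the Bio directory if Biopython is installed, otherwise None.
print(package_location("Bio"))
```

On the Python 2 versions discussed in this thread, importing the package and printing its __file__ attribute gives the same information.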
It seems that manually installing biopython using the instructions you posted puts all the scripts in the same directory; in my case (I am running an Ubuntu), it installed everything on /usr/lib/python2.5/site-packages/Bio . So basically, the manual installation you did has overwritten the old biopython you had installed on your computer... so you don't need to do anything to remove it. I suggest you to always use easy_install to install new python modules, as it is supposed to be the standard way and it creates a distinct directory for every module and every version. p.s. more info on easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall > > Thanks, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structure Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Jan 28 06:37:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Jan 2009 11:37:43 +0000 Subject: [BioPython] Remove biopython 1.45 In-Reply-To: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com> References: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com> Message-ID: <320fb6e00901280337t6c162ba2g839d54d5a9408d8c@mail.gmail.com> On Tue, Jan 27, 2009 at 7:28 PM, Rodrigo faccioli wrote: > I want to know, how can I remove the biopython 1.45 in my machine. I > installed the last version (1.49) from biopython website. > > I read http://biopython.org/DIST/docs/install/Installation.html and I didn't > find anything about uninstall. 
If you installed an old version of Biopython using your Linux distribution's package manager, you should ideally have un-installed it first via the package manager.

Installing any Python package from source will just over-write any existing installation (if there is one already there in the same place). As far as I know, this is just the way that distutils works (the standard python installation package). While easy install may be popular, it is not (yet) the official python tool for package installation.

To manually remove Biopython (e.g. to make a clean install), locate and remove the relevant directories (and if present, egg files) under your python site-packages directory, e.g.

/usr/lib/python2.5/site-packages/Bio
/usr/lib/python2.5/site-packages/BioSQL
/usr/lib/python2.5/site-packages/Martel

[These paths will depend on your OS, your version of python, and also can differ if you choose to install Biopython in a non-default directory, such as under your home folder]

Peter

From biopython at maubp.freeserve.co.uk Wed Jan 28 12:45:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jan 2009 17:45:25 +0000
Subject: [BioPython] BLAST subprocess problem with a GUI
Message-ID: <320fb6e00901280945p32eff05by64d8a42d576f76cc@mail.gmail.com>

On Tue, Jan 13, 2009 at 10:41 AM, Peter wrote:
> On Tue, Jan 13, 2009 at 7:38 AM, Stefanie Lück wrote:
>> I was a little bit to optimistic...
>>
>> After compilation with py2exe, blast hangs. In the log file of py2exe
>> I get the following error message:
>>
>> Traceback (most recent call last):
>> File "prim_search.pyc", line 464, in make_xml
>>
>> File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall
>> File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast
>> File "subprocess.pyc", line 586, in __init__
>> File "subprocess.pyc", line 681, in _get_handles
>> File "subprocess.pyc", line 722, in _make_inheritable
>> TypeError: an integer is required
>>
>> Any ideas?
>> Stefanie > > Are you using Biopython 1.49? > > What version of Python are you using here? (Python 2.3 is handled a > little differently, as it does not have the subprocess module). > > Can you confirm the exact same code works fine run from Python > directly (via IDLE or the commandline?), but fails via py2exe? > > Are you running the py2exe compiled version from the Windows command > line? Can you try that, even thought you said it was a GUI program. > This might be related to the following python bug on Windows to do > with pipe redirection, http://bugs.python.org/issue1124861 > If so, I think there is a suggested work around we can try (this will > require a change to the Biopython code). > > Peter Hi Stefanie, Did you make any progress with this problem? If as I suspect the problem is the python subprocess bug http://bugs.python.org/issue1124861 then you can try the suggested work around in Biopython, by modifying the _invoke_blast function in Bio\Blast\NCBIStandalone.py file as follows: import subprocess, sys #We don't need to supply any piped input, but we setup the #pipe anyway as a work around for a python bug if this is #called from a Windows GUI program. For details, see: #http://bugs.python.org/issue1124861 blast_process = subprocess.Popen(cmd_string, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32")) blast_process.stdin.close() return blast_process.stdout, blast_process.stderr I've checked this change doesn't seem to break anything - but does it help for your GUI program? Peter From biopython at maubp.freeserve.co.uk Fri Jan 30 07:52:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 30 Jan 2009 12:52:29 +0000 Subject: [BioPython] Does anyone use EZRetrieve? 
In-Reply-To: <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com> References: <4925CCAA.2040809@gmail.com> <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com> Message-ID: <320fb6e00901300452vdfd1a73yf70fd78d77d12eb5@mail.gmail.com> On Thu, Nov 20, 2008 at 8:53 PM, Peter wrote: > On Thu, Nov 20, 2008 at 8:46 PM, Bruce Southey wrote: >> Hi, >> Does anyone use EZRetrieve >> (http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ? >> This allows a user to retrieve a human, mouse or rat genome nucleic sequence >> based on an valid identifier. >> >> I think that most of the functionality of Bio.EZRetrieve is already present >> in Biopython and the genome sources appear to be 5 years old. For example, >> it uses LocusLink that was discontinued March 2005. >> >> If so could you please let me know? > > Actually - could you let the whole mailing list know? ;) > > Given nature of the database and the limited functionality this python > code offers, if no-one is using Bio.EZRetrieve then it could be > considered for deprecation. I've seen no replies so I've marked Bio.EZRetrieve as obsolete in CVS (and therefore for Biopython 1.50), and unless anyone speaks up it will be deprecated in the release after that. I'm not sure that the EZRetrieve data is that out of date (they may have updated things since Bruce looked), but all the Bio.EZRetrieve code does is fetch an HTML page and extract the FASTA formatted sequence (ignoring any metadata or cross references). In any case, this kind of HTML "screen scraping" is fragile (liable to break when the site gets a visual redesign) and is not explicitly condoned by the service itself. 
Peter From jgrant at email.smith.edu Fri Jan 2 18:01:45 2009 From: jgrant at email.smith.edu (Jessica Grant) Date: Fri, 2 Jan 2009 13:01:45 -0500 Subject: [BioPython] help with NCBIWWW.qblast Message-ID: I wrote a script that uses NCBIWWW.qblast and it worked last time I tried it (a few weeks ago) but this morning I get the following error message: File "tblastn.py", line 41, in tblastn result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data) File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", line 770, in qblast File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", line 837, in _parse_qblast_ref_page ValueError: A non-integer RTOE found in the 'please wait' page, '' Does this sound like an ncbi error or like something I should be able to work around? Thanks for the help! -- Jessica Grant Phone: 413-585-3750 Fax: 413-585-3786 jgrant at email.smith.edu http://www.science.smith.edu/departments/Biology/lkatz/people/jgrant From biopython at maubp.freeserve.co.uk Fri Jan 2 18:38:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jan 2009 18:38:07 +0000 Subject: [BioPython] help with NCBIWWW.qblast In-Reply-To: References: Message-ID: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com> On Fri, Jan 2, 2009 at 6:01 PM, Jessica Grant wrote: > I wrote a script that uses NCBIWWW.qblast and it worked last time I tried it > (a few weeks ago) but this morning I get the following error message: > > File "tblastn.py", line 41, in tblastn > result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data) > File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", > line 770, in qblast > File "/root/biopython-1.49/build/lib.linux-i686-2.4/Bio/Blast/NCBIWWW.py", > line 837, in _parse_qblast_ref_page > ValueError: A non-integer RTOE found in the 'please wait' page, '' It looks like an empty string was found for RTOE (which obviously cannot be turned into an integer). 
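
As a rough sketch of what goes wrong (this is not the actual Bio/Blast/NCBIWWW.py code), calling int() on an empty string raises a ValueError, so a parser has to tell "no RTOE value at all" apart from "a value that is not a number". The helper name parse_rtoe and the "RTOE = value" line layout used below are assumptions for illustration.

```python
import re


def parse_rtoe(page_text):
    """Extract the RTOE (estimated wait in seconds) from a reply page.

    Hypothetical helper; assumes the value appears as 'RTOE = <value>'.
    """
    # Only allow spaces/tabs around '=' so an empty value does not
    # swallow the start of the following line.
    match = re.search(r"RTOE[ \t]*=[ \t]*(\S*)", page_text)
    if match is None or not match.group(1):
        raise ValueError("No RTOE found in the 'please wait' page.")
    value = match.group(1)
    try:
        return int(value)
    except ValueError:
        raise ValueError("A non-integer RTOE found in the "
                         "'please wait' page, %r" % value)


print(parse_rtoe("RID = ABC123\nRTOE = 12\n"))
```

With this split, an empty or missing field is reported as "No RTOE found" and only genuinely malformed values are reported as non-integer.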
[As an aside, there was a trivial error in the error processing in Bio/Blast/NCBIWWW.py as it should have said "No RTOE found in the 'please wait' page." instead.] This suggests the NCBI sent back an error page of some kind (or even an empty page if there was some network problem), instead of the normal "please wait" page. Unfortunately, these are not so easy to deal with automatically. Does this happen repeatedly? It would help if you could include the sequence you are trying - then we can attempt to reproduce the error for ourselves. I've tried a tblastn search from my machine and it worked OK. Also, what version of Biopython are you using (I'd guess Biopython 1.49 beta, or 1.49, based on the error message)? > Does this sound like an ncbi error or like something I should be able to > work around? Thanks for the help! Given you say this script used to work, it could be something the NCBI has changed. Have you updated your installation of Biopython since the time the script worked? Thanks, Peter From biopython at maubp.freeserve.co.uk Fri Jan 2 23:04:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jan 2009 23:04:43 +0000 Subject: [BioPython] help with NCBIWWW.qblast In-Reply-To: References: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com> Message-ID: <320fb6e00901021504u737bb09dx3242ef9d2c5f767e@mail.gmail.com> On Fri, Jan 2, 2009 at 7:50 PM, Jessica Grant wrote: > > Thanks for your response...actually I had failed to save my fasta file with > unix line breaks. This causes me so many problems and I guess over vacation > I forgot that I need to deal with this. > > All is working now! > > Thanks! > > Jessica That's interesting - but I'd be surprised if the NCBI can't cope with mixed new lines in a qblast query (after all, their webserver will get queries from Linux, Windows and Mac machines). Was there a problem stemming from parsing the FASTA file? How are you reading in the FASTA file? 
Have you tried opening the handle in universal read lines mode? e.g.

handle = open("example.fasta","rU")

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 3 12:59:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 3 Jan 2009 12:59:34 +0000
Subject: [BioPython] help with NCBIWWW.qblast
In-Reply-To:
References: <320fb6e00901021038u38cd05b8p6e9ca7cff07969d6@mail.gmail.com> <320fb6e00901021504u737bb09dx3242ef9d2c5f767e@mail.gmail.com>
Message-ID: <320fb6e00901030459u4212b87aw2094b98f68bd0f1c@mail.gmail.com>

On 1/3/09, Jessica Grant wrote:
>
> I have never seen that ("rU") before. I will give it a try. Thanks!
>
> Jessica
>

I meant to type "universal new lines mode", but anyway it's been present since at least Python 2.3 and can be very helpful in this situation - see: http://docs.python.org/library/functions.html

I was hoping you could tell me the exact string you used as your tblastn query, because it could be useful to be able to reproduce this kind of error.

I hope you don't mind me sharing some comments on your original snippet of code, which looked like this:

result_handle = NCBIWWW.qblast("tblastn", "nr", fas.seq.data)

I'm assuming that the variable fas is a SeqRecord, thus fas.seq is its Seq object. Using a Seq object's data property to get the sequence as a plain string is discouraged (it hasn't been in the tutorial for some time), as the Seq object now behaves much more like a string itself. You could have just used fas.seq here.

This brings me to my next point, the NCBI qblast interface will take three kinds of queries, (1) a record identifier like a GI number, (2) a sequence, or (3) a FASTA format string. Supplying just the sequence (as in your code) means that BLAST will assign an identifier for your sequence automatically.
You might prefer to use the SeqRecord object's format method to make a fasta string (which will include the existing identifier - and that should then be present in the BLAST results): result_handle = NCBIWWW.qblast("tblastn", "nr", fas.format("fasta")) This information is in the version of the Biopython Tutorial, but I thought it worth bringing it up here too. The format method used here was added to the SeqRecord (and Alignment) objects in Biopython 1.48. Peter From biopython at maubp.freeserve.co.uk Sat Jan 3 13:42:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jan 2009 13:42:10 +0000 Subject: [BioPython] Deprecating Bio.Transcribe and Bio.Translate? Message-ID: <320fb6e00901030542g6292084bl98a10587e6dd9990@mail.gmail.com> Dear all, For some time now, the Bio.Seq module has provided transcription and basic translation functionality. With Biopython 1.49, this was extended further with the addition of Seq object transcription and translation methods (for use on nucleotide sequences only), and also support for translation up to the first stop codon. Using Bio.Seq is now the recommended and preferred way to do transcription or translation with Biopython. The Bio.Transcribe and Bio.Translate modules were declared obsolete with the release of Biopython 1.49, and I am wondering if anyone objects to our officially deprecating them in Biopython 1.50 (i.e. just adding a warning message when the module is imported). Alternatively, we could do this in the release after that. 
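
For anyone wondering what "adding a warning message when the module is imported" amounts to in practice, here is a minimal sketch using the standard library warnings module. The helper name warn_module_deprecated and the message wording are assumptions for illustration, not the text Biopython actually uses.

```python
import warnings


def warn_module_deprecated(old_name, new_name):
    """Emit a DeprecationWarning pointing users at the replacement."""
    warnings.warn("%s is deprecated; please use %s instead."
                  % (old_name, new_name),
                  DeprecationWarning, stacklevel=2)


# In a real module the call would sit at the top of the module file so
# that it fires on import. Here we capture the warning to show the text.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_module_deprecated("Bio.Translate", "Bio.Seq")
print(caught[0].message)
```

Existing scripts keep working after such a change; they just see the warning until they switch to the recommended module.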
Thanks, Peter From biopython at maubp.freeserve.co.uk Sat Jan 3 19:59:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jan 2009 19:59:04 +0000 Subject: [BioPython] help for local alignment In-Reply-To: <4c2163890901030837y5b12a358gd2f87aa6989b5a75@mail.gmail.com> References: <4c2163890812220838o40cb50fcyb55f24392cf14101@mail.gmail.com> <320fb6e00812220947xd9444ffp636c13c684fda2c4@mail.gmail.com> <4c2163890812222328v4d9369cq6b4d7ad749365e9b@mail.gmail.com> <320fb6e00812230320u62d915b4k7b2d334b241e8f97@mail.gmail.com> <4c2163890812310100n4100bce5sc9da85b4df391016@mail.gmail.com> <320fb6e00812310635n1685798ai42c8dc07c1c3bf45@mail.gmail.com> <4c2163890812310710o212c9603mc96f7720328baaa6@mail.gmail.com> <4c2163890901012011o47a4bef1l19d951ae40b84aaf@mail.gmail.com> <320fb6e00901020522q51819fbt28c29ae9333d4831@mail.gmail.com> <4c2163890901030837y5b12a358gd2f87aa6989b5a75@mail.gmail.com> Message-ID: <320fb6e00901031159g5554ab29i837ed466ab28cfbc@mail.gmail.com> Hi Chen, I've copied the Biopython mailing list on my reply as I think this is of general interest. On 1/3/09, Chen Ku wrote: > Dear Peter, > it will be a great help of you if you can send me > the exact code for this problem using Bio package. I think as you are expert > in this you can write me and I think it will be few line code. > > My Problem: Given two protein sequence I have to perform Global alignment > using Blosum 62 scoring scheme. > > v = NGPSTKDFGKISESREFDNQNGPSTKDFGKISESREFDNQ > w = QNQLERSFGKINMRLEDALVQNQLERSFGKINMRLEDALV > Scoring matrix: BLOSUM62 > Gap open: 10.0 > Gap extended: 0.5 > > The answer I did manually will come 20 using BLOSUM matrix. > .... > > Regards > Chen I'd already told Chen that Biopython provides the BLOSUM matrices in Bio.SubsMat.MatrixInfo (as simple dictionaries) and pointed him at the Bio.pairwise2 for doing pairwise alignments. The information is all there in the pairwise2 docstrings, but perhaps it could be clearer. 
There are lots of global alignment functions named globalXX in
Bio.pairwise2.align, where the two letter code "XX" tells you the type of
parameters for matches (and mismatches), and the parameters for gap
penalties. In this case we want to use the function globalds because we want
to use the BLOSUM62 matrix which we have as a dictionary (d for dictionary),
and the two sequences have the same gap parameters (s for same).

from Bio import pairwise2
from Bio.SubsMat.MatrixInfo import blosum62
v = "NGPSTKDFGKISESREFDNQNGPSTKDFGKISESREFDNQ"
w = "QNQLERSFGKINMRLEDALVQNQLERSFGKINMRLEDALV"
open_penalty = -10
extend_penalty = -0.5
alignments = pairwise2.align.globalds(v, w, blosum62, open_penalty, extend_penalty)
for align1, align2, score, begin, end in alignments :
    print pairwise2.format_alignment(align1, align2, score, begin, end)

This gives me two alignments back, both scoring twenty - as you had
calculated by hand Chen. But do double check this is doing what you
expected!

Peter

From sudhir.cr at gmail.com Sun Jan 4 05:23:25 2009
From: sudhir.cr at gmail.com (sudhir cr)
Date: Sun, 4 Jan 2009 10:53:25 +0530
Subject: [BioPython] How to use Bio.Kegg.Compound Module
In-Reply-To: <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>
References: <320fb6e00812310643s40715abj5138c2a417484c0e@mail.gmail.com>
    <320fb6e00812310717s51cce5dm52694079ffe9c253@mail.gmail.com>
Message-ID:

Hi Peter,

The new KEGG format has changed to "Other DBs" from "DBLINKS" only on the
HTML page but not when we download from KEGG FTP. So, I guess it's not yet
needed to log a bug.

Thanks for your help,
Sudhir

On Wed, Dec 31, 2008 at 8:47 PM, Peter wrote:
> On Wed, Dec 31, 2008 at 3:07 PM, sudhir cr wrote:
> > Hello Peter,
> >
> > Thanks for the quick reply. This code is working great.
>
> Great.
>
> > P.S: The new KEGG format has changed to "Other DBs" from "DBLINKS"
>
> Do you have a link for this? If we need to update our parser could
> you file a bug on Bugzilla please?
http://bugzilla.open-bio.org/ > > Thanks, > > Peter > -- Sudhir Chowbina Bioinformatics Graduate Student & Research Assistant Discovery Informatics and Computing Laboratory Indiana University School of Informatics Indianapolis, USA 317-847 7721 schowbin at iupui.edu From sedaalper at yahoo.com Mon Jan 5 12:49:44 2009 From: sedaalper at yahoo.com (Seda Alper) Date: Mon, 5 Jan 2009 04:49:44 -0800 (PST) Subject: [BioPython] do_alignment Message-ID: <544161.80500.qm@web90603.mail.mud.yahoo.com> Hi! I executed the code below about Clustalw . However it doesn't work. import os from Bio.Clustalw import MultipleAlignCL cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta")) cline.set_output("test.aln") print cline from Bio import Clustalw alignment = Clustalw.do_alignment(cline) The error like that >>> clustalw -INFILE=.\opuntia.fasta -OUTFILE=test.aln Traceback (most recent call last): File "C:\Python25\ders\se.py", line 10, in alignment = Clustalw.do_alignment(cline) File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment shell=(sys.platform!="win32") File "C:\Python25\lib\subprocess.py", line 594, in __init__ errread, errwrite) File "C:\Python25\lib\subprocess.py", line 816, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified What to do? Thanks Seda From biopython at maubp.freeserve.co.uk Mon Jan 5 13:09:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jan 2009 13:09:22 +0000 Subject: [BioPython] do_alignment In-Reply-To: <544161.80500.qm@web90603.mail.mud.yahoo.com> References: <544161.80500.qm@web90603.mail.mud.yahoo.com> Message-ID: <320fb6e00901050509g75ca62a5sb532a375d6a2543b@mail.gmail.com> On Mon, Jan 5, 2009 at 12:49 PM, Seda Alper wrote: > Hi! > > I executed the code below about Clustalw . However it doesn't work. 
>
> import os
> from Bio.Clustalw import MultipleAlignCL
>
> cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta"))
> cline.set_output("test.aln")
> print cline
>
> from Bio import Clustalw
>
> alignment = Clustalw.do_alignment(cline)
>
> The error like that
>>>>
> clustalw -INFILE=.\opuntia.fasta -OUTFILE=test.aln
>
> Traceback (most recent call last):
> File "C:\Python25\ders\se.py", line 10, in
> alignment = Clustalw.do_alignment(cline)
> File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 95, in do_alignment
> shell=(sys.platform!="win32")
> File "C:\Python25\lib\subprocess.py", line 594, in __init__
> errread, errwrite)
> File "C:\Python25\lib\subprocess.py", line 816, in _execute_child
> startupinfo)
> WindowsError: [Error 2] The system cannot find the file specified
>
> What to do?
>
> Thanks
> Seda

I'm not at my Windows machine to double check this, but I suspect you
don't have clustalw on your path. If you don't have clustalw on your
path, you'll have to tell Biopython where it is:

clustalw_exe = r"C:\Program Files\...\clustalw.exe"
assert os.path.isfile(clustalw_exe)
cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta"), clustalw_exe)
...

Peter

From biopython at maubp.freeserve.co.uk Tue Jan 6 10:33:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jan 2009 10:33:16 +0000
Subject: [BioPython] do_alignment
In-Reply-To: <127996.50517.qm@web90605.mail.mud.yahoo.com>
References: <320fb6e00901050509g75ca62a5sb532a375d6a2543b@mail.gmail.com>
    <127996.50517.qm@web90605.mail.mud.yahoo.com>
Message-ID: <320fb6e00901060233k62abf10eyd715bc2b3b52b8c1@mail.gmail.com>

On Tue, Jan 6, 2009 at 9:58 AM, Seda Alper wrote:
>
> Hi Peter,
>
> I applied what you do.
However now the error is like that > > import os > from Bio.Clustalw import MultipleAlignCL > > clustalw_exe = r"C:\Python\Biopython-1.49\clustalw_exe" The above is wrong - the filename should end with ".exe", so it might be this: clustalw_exe = r"C:\Python\Biopython-1.49\clustalw.exe" (assuming you really do have the ClustalW executable in your Biopython directory) > assert os.path.isfile(clustalw_exe) > cline = MultipleAlignCL(os.path.join(os.curdir,"opuntia.fasta"),clustalw_exe) > ... > Traceback (most recent call last): > File "C:\Python25\ders\se.py", line 5, in > assert os.path.isfile(clustalw_exe) > AssertionError The assertion failed becase the Clustalw filename you used does not exist. Peter From lueck at ipk-gatersleben.de Thu Jan 8 15:07:06 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Thu, 8 Jan 2009 16:07:06 +0100 Subject: [BioPython] blastall directory problem Message-ID: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> Hi! I finally finished my program and tested on several PC's. I'm doing standalone blasts, it's a GUI program for Windows. By this I found a problem on the english operating systems because blastall has problems with spaces in the full path. I tried to replace the path "C:\Program Files\Final\test" into "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file. Maby there is a way not to give the full path of the database and the blastall.exe but only the file name? Does someone has a idea how I can solve this problem? Kind regards and a happy new year! 
Stefanie

From biopython at maubp.freeserve.co.uk Fri Jan 9 12:16:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Jan 2009 12:16:32 +0000
Subject: [BioPython] blastall directory problem
In-Reply-To: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de>
References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com>

On Thu, Jan 8, 2009 at 3:07 PM, Stefanie Lück wrote:
> Hi!
>
> I finally finished my program and tested on several PC's.
> I'm doing standalone blasts, it's a GUI program for Windows.
>
> By this I found a problem on the english operating systems because blastall has problems with spaces in the full path.
>
> I tried to replace the path "C:\Program Files\Final\test" into "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file. Maby there is a way not to give the full path of the database and the blastall.exe but only the file name?
>
> Does someone has a idea how I can solve this problem?
>
> Kind regards and a happy new year!
> Stefanie

The problem here is that BLAST allows you to specify multiple databases
using spaces to separate the list - while you may also want to use spaces
in the filename(s)!

The solution is that BLAST should understand some fiddly escape quoted
string arguments, i.e. slash double quote at the beginning and end of the
filename. Try this (using a mixture of single and double quotes!):

my_blast_db = r'"\"C:\Program Files\Final\test\""'

(There was some discussion on this issue on Bug 2480).

You should also be able to use the DOS 8.3 style names (which have no
spaces), something like "C:\PROGRA~1\Final\test" which you said you tried.
Read about the win32api.GetShortPathName() function for how to get this
name programmatically. From your email it sounds like this worked - but you
had another problem with specifying your input filename.
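
As a rough illustration of the quoting trick above, the escaped string can be
built from any path with a small helper. This is only a sketch: the name
quote_blast_path is made up for this example and is not part of Biopython or
BLAST; only the quoting pattern itself comes from the message above.

```python
def quote_blast_path(path):
    # Hypothetical helper, not a Biopython function.
    # Builds: double quote, backslash, double quote, the path,
    # backslash, double quote, double quote.
    return '"\\"' + path + '\\""'


if __name__ == "__main__":
    # Prints the same text as the my_blast_db string shown above.
    print(quote_blast_path(r"C:\Program Files\Final\test"))
```

Whether a given BLAST version accepts this form is something to check on
your own machine; the helper only reproduces the string.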
If you are still stuck, could you give us an example showing the arguments used for the blast exe, database and input file? Peter From mmokrejs at ribosome.natur.cuni.cz Fri Jan 9 14:24:02 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 09 Jan 2009 15:24:02 +0100 Subject: [BioPython] Does biopython have a parser for .qual files? Message-ID: <49675E02.5050102@ribosome.natur.cuni.cz> Hi, is there a way in biopython to access the quality values from NCBI trace archive? I had a look briefly into http://biopython.org/DIST/docs/api/ but cannot find anything related. NCBItrace provides some perl script (maybe I could the same with Bio.Entrez.esearch (haven't tried yet) ... I will need to revert the order of values to get them for minus strand orientation. If nobody needed to do this before I will invent the wheel. ;) Thanks for your comments, Martin $ perl NCBItrace/query_tracedb "retrieve quality 5728631" >gnl|ti|5728631 name:jea17d09.b1 7 7 7 7 7 7 10 9 8 6 6 9 9 9 7 10 9 13 13 10 19 8 6 6 13 8 4 4 4 6 13 13 6 6 9 9 10 16 19 19 19 4 0 4 13 19 19 32 32 25 19 15 6 6 9 19 19 22 25 25 25 25 22 22 22 29 29 27 22 16 10 19 16 15 6 6 6 6 6 6 8 19 23 33 39 34 34 34 34 39 39 39 39 39 39 39 39 40 28 19 11 9 15 11 28 37 40 45 35 35 35 35 39 39 51 51 39 39 39 39 39 39 35 35 32 33 33 33 32 32 40 40 56 56 40 32 32 32 32 34 35 51 51 51 51 35 34 34 34 35 35 39 40 40 51 51 51 51 51 51 51 45 40 40 40 40 40 40 51 45 45 45 45 51 51 56 56 51 51 51 51 51 40 45 45 45 45 45 56 56 56 56 56 56 56 56 40 28 28 23 23 23 25 29 35 38 38 38 38 38 38 38 38 40 51 51 51 51 56 56 [cut] From biopython at maubp.freeserve.co.uk Fri Jan 9 14:31:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jan 2009 14:31:30 +0000 Subject: [BioPython] do_alignment In-Reply-To: <328891.8790.qm@web90608.mail.mud.yahoo.com> References: <320fb6e00901060428v59cb5d5ek8521b49bdff55da2@mail.gmail.com> <328891.8790.qm@web90608.mail.mud.yahoo.com> Message-ID: 
<320fb6e00901090631s342f96dcgd24bbae07e19cec3@mail.gmail.com>

On Fri, Jan 9, 2009 at 2:22 PM, Seda Alper wrote:
>
> Dear Peter,
>
> I've executed my code at the end! I only changed the file name( from
> opuntia.fasta to my file mouse.fasta). I think the problem may be
> resulted from the file opuntia. Now, everything works.
>
> Thanks for your help!
> Seda

Oh good, I was going to say double check that opuntia.fasta was in the
current directory.

The error in your other message, "ValueError: No records found in handle",
is usually caused by an empty output alignment file. Perhaps an aborted
ClustalW run had left behind an empty output file?

Peter

From biopython at maubp.freeserve.co.uk Fri Jan 9 15:01:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 9 Jan 2009 15:01:24 +0000
Subject: [BioPython] Does biopython have a parser for .qual files?
In-Reply-To: <49675E02.5050102@ribosome.natur.cuni.cz>
References: <49675E02.5050102@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00901090701n5a85bb17lb1769fa1d55d3a88@mail.gmail.com>

On Fri, Jan 9, 2009 at 2:24 PM, Martin MOKREJŠ wrote:
> Hi,
> is there a way in biopython to access the quality values from
> NCBI trace archive? I had a look briefly into
> http://biopython.org/DIST/docs/api/ but cannot find anything
> related. NCBItrace provides some perl script (maybe I could the
> same with Bio.Entrez.esearch (haven't tried yet) ... I will need
> to revert the order of values to get them for minus strand
> orientation. If nobody needed to do this before I will invent
> the wheel. ;)
> Thanks for your comments,
> Martin

In the short term, I'm sure a quick parser shouldn't take you more than
five minutes to implement (based on any of the FASTA parsers), giving you
record names with lists of integer scores.

The trouble for integrating this into Biopython nicely is how to represent
the data.
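
For what it's worth, the quick parser mentioned above might look something
like the following sketch. It uses only the standard library, and the
function name parse_qual is invented here. It assumes a FASTA-like layout: a
">" header line followed by one or more lines of whitespace-separated integer
scores. The sample record is a shortened, hand-typed stand-in rather than
real trace archive output.

```python
from io import StringIO


def parse_qual(handle):
    """Yield (name, scores) pairs from a FASTA-like .qual handle."""
    name = None
    scores = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: hand back the previous record, if any.
            if name is not None:
                yield name, scores
            name = line[1:]
            scores = []
        else:
            if name is None:
                raise ValueError("Scores found before any '>' header line")
            scores.extend(int(value) for value in line.split())
    if name is not None:
        yield name, scores


if __name__ == "__main__":
    example = StringIO(">example_read\n7 7 7 10 9\n8 6 6\n")
    for name, scores in parse_qual(example):
        print(name, scores)
        # Reversed copy of the scores, for the minus strand orientation.
        print(scores[::-1])
```

Reversing the list with scores[::-1] covers the minus strand case without
touching the parser itself.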
Have a look at Bug 2382 for some related ideas (including over FASTA like formats), and this thread just over a year ago: http://lists.open-bio.org/pipermail/biopython-dev/2007-October/003131.html http://bugzilla.open-bio.org/show_bug.cgi?id=2382 I can see these qual files (and also fastq files which have both the sequence and the quality scores) fitting into Bio.SeqIO but this would require an elegant way to deal with unknown sequences of known length (see next paragraph), and a good way to handle per-letter-annotation (which we have touched on on the mailing lists fairly recently). For this reason, I had wondered about creating an UnknownSeq as subclass of Seq. To create an instance you would supply the length and a character to use (typically N or X for nucleotides and proteins, perhaps defaulting to ?). This would then act like a Seq object as much as possible (for example, translation of an UnknownSeq with a nucleotide alphabet could give an UnknownSeq with a protein alphabet with appropriate length). An UnknownSeq object could be used for these qual files, or even certain GenBank files (where the sequence is not always included). There is a risk of user confusion here though, as there isn't really a sequence present! Peter From bjorn_johansson at bio.uminho.pt Mon Jan 12 10:58:16 2009 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Mon, 12 Jan 2009 10:58:16 +0000 Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence Message-ID: Hi, I am fairly new to biopython, so I don't now if this question has been answered in the archives (tried to loo but found nothing). Is there a (bio)python module or code snippet that I can use to determine if a sequence is liiely to be nucleic acid or protein? I believe the program ReadSeq does this for example, when formatting a fasta sequence to genbank. grateful for answers! 
/bjorn

--
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariate) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From biopython at maubp.freeserve.co.uk Mon Jan 12 11:30:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Jan 2009 11:30:26 +0000
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To:
References:
Message-ID: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>

On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson wrote:
> Hi, I am fairly new to biopython, so I don't now if this question has
> been answered in the archives (tried to loo but found nothing).
>
> Is there a (bio)python module or code snippet that I can use to
> determine if a sequence is liiely to be nucleic acid or protein?
>
> I believe the program ReadSeq does this for example, when formatting a
> fasta sequence to genbank.
>
> grateful for answers!
>
> /bjorn

It seems like lots of different tools (e.g. FASTA) have come up with
their own way to try and guess this, usually by looking at the letter
content. This is impossible to get right 100% of the time (especially
if the nucleotide includes ambiguous characters - which can make it
look more protein like). I don't think we have a standard bit of code
in Biopython to do this (but I've never searched).

In Python there is a general preference for making things explicit
rather than trying to guess and do the right thing. If you don't know
which you have (e.g. user input?) then you are in an awkward position.
What are you going to do with the sequence? If you are going to pass
it to a command line tool, maybe you can let it guess?
Peter From lueck at ipk-gatersleben.de Mon Jan 12 12:06:39 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 12 Jan 2009 13:06:39 +0100 Subject: [BioPython] blastall directory problem References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com> Message-ID: <001a01c974ae$39b0a820$1022a8c0@ipkgatersleben.de> Thanks for the response! This worked! Lifesaver ;-) Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, January 09, 2009 1:16 PM Subject: Re: [BioPython] blastall directory problem On Thu, Jan 8, 2009 at 3:07 PM, Stefanie L?ck wrote: > Hi! > > I finally finished my program and tested on several PC's. > I'm doing standalone blasts, it's a GUI program for Windows. > > By this I found a problem on the english operating systems because > blastall has problems with spaces in the full path. > > I tried to replace the path "C:\Program Files\Final\test" into > "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open > the input file. Maby there is a way not to give the full path of the > database and the blastall.exe but only the file name? > > Does someone has a idea how I can solve this problem? > > Kind regards and a happy new year! > Stefanie The problem here is that BLAST allows you to specify multiple databases using spaces to separate the list - while you may also want to use spaces in the filename(s)! The solution is that BLAST should understand some fiddly escape quoted string arguments, i.e. slash double quote at the beginning and end of the filename. Try this (using a mixture of single and double quotes!): my_blast_db =r'"\"C:\Program Files\Final\test\""' (There was some discussion on this issue on Bug 2480). You should also be able to use the DOS 8.3 style names (which have no spaces), something like "C:\PROGRA~1\Final\test" which you said you tried. 
Read about the win32api.GetShortPathName() function for how to get this
name programmatically. From your email it sounds like this worked - but you
had another problem with specifying your input filename.

If you are still stuck, could you give us an example showing the arguments
used for the blast exe, database and input file?

Peter

From chapmanb at 50mail.com Mon Jan 12 13:47:37 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 12 Jan 2009 08:47:37 -0500
Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence
In-Reply-To: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com>
Message-ID: <20090112134737.GG4135@sobchak.mgh.harvard.edu>

Hi Björn:
I am agreed with Peter; guessing should be the last resort. The
guessing is not that smart, and will fall apart for very
pathological cases like short amino acids with lots of Gly, Ala, Cys
or Thrs. That being said, here is some code that does this. Hope
this helps,

Brad

from Bio import Seq

def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']):
    """Guess if the given sequence is DNA.

    It's considered DNA if more than 90% of the sequence is GATCs. The threshold
    is configurable via the thresh parameter. dna_letters can be used to configure
    which letters are considered DNA; for instance, adding N might be useful if
    you are expecting data with ambiguous bases.
    """
    if isinstance(seq, Seq.Seq):
        seq = seq.data
    elif isinstance(seq, type("")) or isinstance(seq, type(u"")):
        seq = str(seq)
    else:
        raise ValueError("Do not know provided type: %s" % seq)
    seq = seq.upper()
    dna_alpha_count = 0
    for letter in dna_letters:
        dna_alpha_count += seq.count(letter)
    if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh):
        return True
    else:
        return False

On Mon, Jan 12, 2009 at 11:30:26AM +0000, Peter wrote:
> On Mon, Jan 12, 2009 at 10:58 AM, Björn Johansson
> wrote:
> > Hi, I am fairly new to biopython, so I don't now if this question has
> > been answered in the archives (tried to loo but found nothing).
> >
> > Is there a (bio)python module or code snippet that I can use to
> > determine if a sequence is liiely to be nucleic acid or protein?
> >
> > I believe the program ReadSeq does this for example, when formatting a
> > fasta sequence to genbank.
> >
> > grateful for answers!
> >
> > /bjorn
>
> It seems like lots of different tools (e.g. FASTA) have come up with
> their own way to try and guess this, usually by looking at the letter
> content. This is impossible to get right 100% of the time (especially
> if the nucleotide includes ambiguous characters - which can make it
> look more protein like). I don't think we have a standard bit of code
> in Biopython to do this (but I've never searched).
>
> In python there as a general preference for making things explicit
> rather than trying to guess and do the right thing. If you don't know
> which you have (e.g. user input?) then you are in an awkward position.
> What are you going to do with the sequence? If you are going to pass
> it to a command line tool, maybe you can let it guess?
> > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Jan 12 14:34:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Jan 2009 14:34:13 +0000 Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence In-Reply-To: <20090112134737.GG4135@sobchak.mgh.harvard.edu> References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com> <20090112134737.GG4135@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com> On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman wrote: > Hi Bj?rn: > I am agreed with Peter; guessing should be the last resort. The > guessing is not that smart, and will fall apart for very > pathological cases like short amino acids with lots of Gly, Ala, Cys > or Thrs. That being said, here is some code that does this. Hope > this helps, > > Brad > > from Bio import Seq > > def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']): > """Guess if the given sequence is DNA. > > It's considered DNA if more than 90% of the sequence is GATCs. The threshold > is configurable via the thresh parameter. dna_letters can be used to configure > which letters are considered DNA; for instance, adding N might be useful if > you are expecting data with ambiguous bases. > """ > if isinstance(seq, Seq.Seq): > seq = seq.data > elif isinstance(seq, type("")) or isinstance(seq, type(u"")): > seq = str(seq) > else: > raise ValueError("Do not know provided type: %s" % seq) > seq = seq.upper() This code is trying to get the sequence as an upper case string, given that the Seq object does not support the upper method (yet - I've just filed enhancement Bug 2731 on this, something I'd been thinking about for a while). Anyway, this would be shorter and would cope with strings or Seq objects, or even MutableSeq objects. 
seq = str(seq.upper()) Also using the Seq object's data property is discouraged (see Bug 2509). > dna_alpha_count = 0 > for letter in dna_letters: > dna_alpha_count += seq.count(letter) > if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh): > return True > else: > return False You could just do: return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh) Peter From bjorn_johansson at bio.uminho.pt Mon Jan 12 18:34:36 2009 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Mon, 12 Jan 2009 18:34:36 +0000 Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence In-Reply-To: <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com> References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com> <20090112134737.GG4135@sobchak.mgh.harvard.edu> <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com> Message-ID: Hi, and thanks for the quick replies and the submitted code! Its very nice to have the help of such a devoted community! I am writing a plug-in to deal with reformatting pasted code (DNA or protein) snippets into the editor (incidently WikidPad which is written in python and uses scintilla, open-source http://wikidpad.sourceforge.net/) and I would like to be able to format (DNA or protein) code in the selection from raw format to fasta and genbank. The identity of the code (DNA or protein) is only needed to feed into the SeqIO.write method, it demands to know if the sequence is DNA or protein to write genbank format. I know I could add a dialog, but I want a function to quickly reformat sequences, although I agree that guessing is bad from a theoretical viewpoint. Ill try the code that you submitted as soon as I can and Ill get back to you! thanks, /bjorn On Mon, Jan 12, 2009 at 14:34, Peter wrote: > On Mon, Jan 12, 2009 at 1:47 PM, Brad Chapman wrote: >> Hi Bj?rn: >> I am agreed with Peter; guessing should be the last resort. 
The >> guessing is not that smart, and will fall apart for very >> pathological cases like short amino acids with lots of Gly, Ala, Cys >> or Thrs. That being said, here is some code that does this. Hope >> this helps, >> >> Brad >> >> from Bio import Seq >> >> def guess_if_dna(seq, thresh = 0.90, dna_letters = ['G', 'A', 'T', 'C']): >> """Guess if the given sequence is DNA. >> >> It's considered DNA if more than 90% of the sequence is GATCs. The threshold >> is configurable via the thresh parameter. dna_letters can be used to configure >> which letters are considered DNA; for instance, adding N might be useful if >> you are expecting data with ambiguous bases. >> """ >> if isinstance(seq, Seq.Seq): >> seq = seq.data >> elif isinstance(seq, type("")) or isinstance(seq, type(u"")): >> seq = str(seq) >> else: >> raise ValueError("Do not know provided type: %s" % seq) >> seq = seq.upper() > > This code is trying to get the sequence as an upper case string, given > that the Seq object does not support the upper method (yet - I've just > filed enhancement Bug 2731 on this, something I'd been thinking about > for a while). > > Anyway, this would be shorter and would cope with strings or Seq > objects, or even MutableSeq objects. > > seq = str(seq.upper()) > > Also using the Seq object's data property is discouraged (see Bug 2509). > >> dna_alpha_count = 0 >> for letter in dna_letters: >> dna_alpha_count += seq.count(letter) >> if (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh): >> return True >> else: >> return False > > You could just do: > > return (len(seq) == 0 or float(dna_alpha_count) / float(len(seq)) >= thresh) > > Peter > -- Bj?rn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL http://www.bio.uminho.pt http://sites.google.com/site/bjornhome Work (direct) +351-253 601517 Private mob. 
+351-967 147 704 Dept of Biology (secretariate) +351-253 60 4310 Dept of Biology (fax) +351-253 678980 From biopython at maubp.freeserve.co.uk Mon Jan 12 21:57:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 12 Jan 2009 21:57:39 +0000 Subject: [BioPython] Determine alphabet (DNA or Protein) of a sequence In-Reply-To: References: <320fb6e00901120330m4c96de4cn267332ac5ff7c8c1@mail.gmail.com> <20090112134737.GG4135@sobchak.mgh.harvard.edu> <320fb6e00901120634q1ba2e80fj2cbd049a7f99a3d3@mail.gmail.com> Message-ID: <320fb6e00901121357k1d9c39a2tec3610af8bf4c448@mail.gmail.com> On Mon, Jan 12, 2009 at 6:34 PM, Bj?rn Johansson wrote: > > Hi, > and thanks for the quick replies and the submitted code! Its very nice > to have the help of such a devoted community! > > I am writing a plug-in to deal with reformatting pasted code (DNA or > protein) snippets into the editor (incidently WikidPad which is > written in python and uses scintilla, open-source > http://wikidpad.sourceforge.net/) and I would like to be able to > format (DNA or protein) code in the selection from raw format to fasta > and genbank. > > The identity of the code (DNA or protein) is only needed to feed into > the SeqIO.write method, it demands to know if the sequence is DNA or > protein to write genbank format. Yes - this is because the GenBank format distinguishes between nucleotides and proteins, so if you try and output a SeqRecord using a generic alphabet, we have a problem. We could guess, but from a python style point of view I think most would agree it is preferable to make you (the programmer) make the choice explicity. As an aside, you might prefer to use the SeqRecord's format method to get the record as a FASTA or GenBank string - but this calls Bio.SeqIO.write() internally anyway, so the alphabet problem remains. > I know I could add a dialog, but I want a function to quickly reformat > sequences, although I agree that guessing is bad from a theoretical > viewpoint. 
You could have a selection box offering: (*) Guess (default) (*) Nucleotide (*) Amino acids That way for any border line cases, the web site user can easily change this if they need to. Once you know you have nucleotides, deciding if it is DNA or RNA is pretty easy :) > Ill try the code that you submitted as soon as I can and Ill get back to you! > thanks, > /bjorn Peter From lueck at ipk-gatersleben.de Tue Jan 13 07:38:43 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 13 Jan 2009 08:38:43 +0100 Subject: [BioPython] blastall directory problem References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com> Message-ID: <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de> I was a little bit to optimistic... After compilation with py2exe, blast hangs. In the log file of py2exe I get the following error message: Traceback (most recent call last): File "prim_search.pyc", line 464, in make_xml File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast File "subprocess.pyc", line 586, in __init__ File "subprocess.pyc", line 681, in _get_handles File "subprocess.pyc", line 722, in _make_inheritable TypeError: an integer is required Any ideas? Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, January 09, 2009 1:16 PM Subject: Re: [BioPython] blastall directory problem On Thu, Jan 8, 2009 at 3:07 PM, Stefanie L?ck wrote: > Hi! > > I finally finished my program and tested on several PC's. > I'm doing standalone blasts, it's a GUI program for Windows. > > By this I found a problem on the english operating systems because blastall has problems with spaces in the full path. > > I tried to replace the path "C:\Program Files\Final\test" into "C:\PROGRA~1\Final\test" but I get a message that blast is unable to open the input file. 
Maby there is a way not to give the full path of the database and the blastall.exe but only the file name? > > Does someone has a idea how I can solve this problem? > > Kind regards and a happy new year! > Stefanie The problem here is that BLAST allows you to specify multiple databases using spaces to separate the list - while you may also want to use spaces in the filename(s)! The solution is that BLAST should understand some fiddly escape quoted string arguments, i.e. slash double quote at the beginning and end of the filename. Try this (using a mixture of single and double quotes!): my_blast_db =r'"\"C:\Program Files\Final\test\""' (There was some discussion on this issue on Bug 2480). You should also be able to use the DOS 8.3 style names (which have no spaces), something like "C:\PROGRA~1\Final\test" which you said you tried. Read about the win32api.GetShortPathName() function for how to get this name programatically. From your email it sounds like this worked - but you had another problem with specifying your input filename. If you are still stuck, could you give us an example showing the arguments used for the blast exe, database and input file? Peter From biopython at maubp.freeserve.co.uk Tue Jan 13 10:41:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Jan 2009 10:41:20 +0000 Subject: [BioPython] blastall directory problem In-Reply-To: <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de> References: <011d01c971a2$c5138040$1022a8c0@ipkgatersleben.de> <320fb6e00901090416w7ab46f0biea1d1ba2fa746229@mail.gmail.com> <001101c97551$f5f64cd0$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00901130241t2b116d59j2428415d60e2f177@mail.gmail.com> On Tue, Jan 13, 2009 at 7:38 AM, Stefanie L?ck wrote: > I was a little bit to optimistic... > > After compilation with py2exe, blast hangs. 
In the log file of py2exe > I get the following error message: > > Traceback (most recent call last): > File "prim_search.pyc", line 464, in make_xml > > File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall > File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast > File "subprocess.pyc", line 586, in __init__ > File "subprocess.pyc", line 681, in _get_handles > File "subprocess.pyc", line 722, in _make_inheritable > TypeError: an integer is required > > Any ideas? > Stefanie Are you using Biopython 1.49? What version of Python are you using here? (Python 2.3 is handled a little differently, as it does not have the subprocess module). Can you confirm the exact same code works fine run from Python directly (via IDLE or the commandline?), but fails via py2exe? Are you running the py2exe compiled version from the Windows command line? Can you try that, even thought you said it was a GUI program. This might be related to the following python bug on Windows to do with pipe redirection, http://bugs.python.org/issue1124861 If so, I think there is a suggested work around we can try (this will require a change to the Biopython code). Peter From animesh.agrawal at anu.edu.au Thu Jan 15 09:21:11 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Thu, 15 Jan 2009 20:21:11 +1100 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences Message-ID: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au> Hi, I have been trying to write a python script to do the codon wise alignment of given nucleotide sequences. I have downloaded CDS sequences (by a script found on biopython mailing list) from genbank for a particular protein and now would like to check codon usage for few specific amino acid positions. Could you please provide me few pointers on how to do that. I also want to take this opportunity to thank you guys for excellent work on biopython documentation. 
I am new to python, but I am able to use cookbook/tutorial example for my work with relative ease. Cheers, Animesh Agrawal PhD Scholar Proteomics & Therapy Design Group Division of Molecular Biosciences The John Curtin School of Medical Research The Australian National University P.O. Box 334 Canberra ACT 2601 AUSTRALIA T: +61 2 6125 8303 From dalloliogm at fastwebnet.it Thu Jan 15 11:45:18 2009 From: dalloliogm at fastwebnet.it (Giovanni Marco Dall'Olio) Date: Thu, 15 Jan 2009 12:45:18 +0100 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: <319800150623214077@unknownmsgid> References: <319800150623214077@unknownmsgid> Message-ID: <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal wrote: > Hi, > > I have been trying to write a python script to do the codon wise alignment > of given nucleotide sequences. Note that there are many tools that already do a 'codon wise' alignment, if it is what I think you mean by it. I think t-coffee does this. It is always better to use a tool that already exists rather than develop a new one, if you can, because otherwise your results will be different to compare with other experiments. > I have downloaded CDS sequences (by a script > found on biopython mailing list) from genbank for a particular protein and > now would like to check codon usage for few specific amino acid positions. Can you provide a better example of what do you want to obtain? Do you want to know: - for a particular aminoacid position (e.g. the first, or the third, or the last) the codon usage in a set of sequences? - for those aminoacids that are coded by more than a possible codon (e.g. Ala) the frequency with which every codon is used? - the frequency at which every possible codon is used, in general. If I can give you an advice, I would spend some time in developing a test case first. 
For example, create a fake sequence and calculate the output that you expect from your experiment. It is a lot easier to describe your experiment to other people if you can provide the test cases you are using, it will be easier to understand what you want to do. > Could you please provide me few pointers on how to do that. I also want to > take this opportunity to thank you guys for excellent work on biopython > documentation. I am new to python, but I am able to use cookbook/tutorial > example for my work with relative ease. > > Cheers, > > Animesh Agrawal > > PhD Scholar > > Proteomics & Therapy Design Group > > Division of Molecular Biosciences > > The John Curtin School of Medical Research > > The Australian National University > > P.O. Box 334 > > Canberra ACT 2601 > > AUSTRALIA > > T: +61 2 6125 8303 > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From chapmanb at 50mail.com Thu Jan 15 11:52:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 15 Jan 2009 06:52:49 -0500 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au> References: <000001c976f2$9b10c320$d1324960$@agrawal@anu.edu.au> Message-ID: <20090115115249.GA61956@kunkel> Hi Animesh; > I have been trying to write a python script to do the codon wise alignment > of given nucleotide sequences. I have downloaded CDS sequences (by a script > found on biopython mailing list) from genbank for a particular protein and > now would like to check codon usage for few specific amino acid positions. Biopython does not contain codon usage dictionaries; the possible organisms and usage frequencies themselves are changing as additional organisms are sequenced. 
Your best bet is to parse out the values from the codon usage database (http://www.kazusa.or.jp/codon/) for your organism of interest. An example is pasted below from E coli; you did not mention which organism you were interested in. The values are reported as usage per 1000 codons. When you have defined this, here is some Biopython code to create a dictionary (positional_usage) of usage at each codon position (using python 0-based indexing for positions): from Bio import SeqIO handle = open("example.fasta", "rU") positional_usage = {} for record in SeqIO.parse(handle, "fasta"): assert len(record.seq) % 3 == 0 # make sure you are 3 based for cindex in range(len(record.seq) // 3): cur_codon = str(record.seq[cindex * 3:(cindex + 1) * 3]) usage = usage_dict[cur_codon] positional_usage[cindex] = usage handle.close() The input to this is usage_dict, a dictionary defined as below. Hope this helps, Brad Escherichia_coli = \ {'AAA': 35.601945036625438, 'AAC': 21.202802271903, 'AAG': 13.045009394539333, 'AAT': 22.831396289856265, 'ACA': 10.700618181965975, 'ACC': 21.387130807992541, 'ACG': 13.784236156652, 'ACT': 11.016200111457801, 'AGA': 4.4652452250900074, 'AGC': 14.997074890221718, 'AGG': 2.5626687138052029, 'AGT': 10.73241545213447, 'ATA': 8.2158886416564805, 'ATC': 22.685559186075952, 'ATG': 25.945855225833537, 'ATT': 29.669004762179132, 'CAA': 14.383602745467156, 'CAC': 8.8157333849102599, 'CAG': 28.118110840502265, 'CAT': 12.473375763164368, 'CCA': 8.6299703855048442, 'CCC': 5.630985746455262, 'CCG': 19.354496289402018, 'CCT': 7.8991113260680947, 'CGA': 4.0270166820911326, 'CGC': 18.382647392898786, 'CGG': 6.4933372765136035, 'CGT': 18.916506823622456, 'CTA': 4.4733738505466141, 'CTC': 10.083559878921733, 'CTG': 46.036709350716478, 'CTT': 12.48556870134928, 'GAA': 38.019254801088948, 'GAC': 18.833307951301883, 'GAG': 18.80390145332651, 'GAT': 32.883397975828814, 'GCA': 21.603495691469892, 'GCC': 23.869708653328228, 'GCG': 27.990682682608973, 'GCT': 
17.355093504295862, 'GGA': 10.60618268033774, 'GGC': 25.658245331001215, 'GGG': 11.57779249962166, 'GGT': 24.92882073488034, 'GTA': 11.897916896280412, 'GTC': 14.044830325702069, 'GTG': 23.467102616006844, 'GTT': 20.038018059414991, 'TAA': 1.9881661557984951, 'TAC': 12.005979799409431, 'TAG': 0.28569727707782611, 'TAT': 18.337939952887442, 'TCA': 9.9362883118255478, 'TCC': 9.2876718158321232, 'TCG': 8.51664778355096, 'TCT': 10.941368941813147, 'TGA': 1.0356825140595336, 'TGC': 5.9924705020549887, 'TGG': 13.780171843923698, 'TGT': 5.3450493921581241, 'TTA': 14.983925643159559, 'TTC': 15.622261818722567, 'TTG': 12.856616545721486, 'TTT': 22.459153059387496 }

From animesh.agrawal at anu.edu.au Thu Jan 15 13:21:00 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Fri, 16 Jan 2009 00:21:00 +1100
Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences
In-Reply-To: <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com>
Message-ID:

Hi Marco,

My apologies. Probably in my last mail I didn't make myself very clear. I have a protein which is about 475 amino acids long and is highly conserved (over 95%) among different organisms. I have downloaded its CDS (coding sequence).

I would like to calculate codon use frequency for important amino acid positions, as you have put it very nicely in your reply:

"for a particular aminoacid position (e.g. the first, or the third,or the last) the codon usage for those aminoacids that are coded by more than a possible codon (e.g. Ala) the frequency with which every codon is used?"

For example, in a set of four sequences:

        1    2    3
        Ala  Gly  Ile
Seq1    GCT  GCT  ATT
Seq2    GCC  GCC  ATC
Seq3    GCA  GCA  ATA
Seq4    GCG  GCG  ATT

For the first amino acid position, i.e. Ala (which is coded by 4 codons), each codon is used once in 4 sequences, which gives a frequency of 0.25 for each codon. For the third amino acid position, i.e. Ile (which is coded by 3 codons), ATT gives a frequency of 0.5 while the other two each give a frequency of 0.25.

Cheers,
Animesh

----- Original Message -----
From: Giovanni Marco Dall'Olio
Date: Thursday, January 15, 2009 10:45 pm
Subject: Re: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences
To: Animesh Agrawal
Cc: biopython at lists.open-bio.org

> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal > wrote: > > Hi, > > > > I have been trying to write a python script to do the codon > wise alignment > > of given nucleotide sequences. > > Note that there are many tools that already do a 'codon wise' > alignment, if it is what I think you mean by it. > I think t-coffee does this. It is always better to use a tool that > already exists rather than develop a new one, if you can, because > otherwise your results will be different to compare with other > experiments. > > > > I have downloaded CDS sequences (by a script > > found on biopython mailing list) from genbank for a particular > protein and > > now would like to check codon usage for few specific amino > acid positions. > > Can you provide a better example of what do you want to obtain? > Do you want to know: > - for a particular aminoacid position (e.g. the first, or the third, > or the last) the codon usage in a set of sequences? > - for those aminoacids that are coded by more than a possible codon > (e.g. Ala) the frequency with which every codon is used? > - the frequency at which every possible codon is used, in general. > > If I can give you an advice, I would spend some time in > developing a > test case first. For example, create a fake sequence and > calculate the > output that you expect from your experiment.
> It is a lot easier to describe your experiment to other people > if you > can provide the test cases you are using, it will be easier to > understand what you want to do. > > > > Could you please provide me few pointers on how to do that. I > also want to > > take this opportunity to thank you guys for excellent work on > biopython> documentation. I am new to python, but I am able to > use cookbook/tutorial > > example for my work with relative ease. > > > > Cheers, > > > > Animesh Agrawal > > > > PhD Scholar > > > > Proteomics & Therapy Design Group > > > > Division of Molecular Biosciences > > > > The John Curtin School of Medical Research > > > > The Australian National University > > > > P.O. Box 334 > > > > Canberra ACT 2601 > > > > AUSTRALIA > > > > T: +61 2 6125 8303 > > > > > > > > _______________________________________________ > > BioPython mailing list? -? BioPython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > > My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Jan 15 13:34:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jan 2009 13:34:52 +0000 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> Message-ID: <320fb6e00901150534q1cd8880bve9392ec8ac560d70@mail.gmail.com> On Thu, Jan 15, 2009 at 1:21 PM, Animesh Agrawal wrote: > > Hi Marco, > My apologies. Probably in my last mail I didn't make myself very clear. > I have a protein which is about 475 amino acid long and is highly > conserved (over 95%) among diffrent organisms. I have downloaded > its CDS(coding sequence) . > I would like to calculate codon use frequenecy for important amino acid > positions as you have put it very nicely in your reply: > "for a particular aminoacid position (e.g. 
the first, or the third,or the last) > the codon usage for those aminoacids that are coded by more than a > possible codon (e.g. Ala) the frequency with which every codon is used?" > For example in a set of four sequenecs > 1 2 3 > Ala Gly Ile > Seq1 GCT GCT ATT > Seq2 GCC GCC ATC > Seq3 GCA GCA ATA > Seq4 GCG GCG ATT > > For first amino acid position i.e. Ala (which is coded by 4 codons) each > codon is used once in 4 sequences that gives you frequency of 0.25 for > each codon or for third amino acid position i.e. Ile ( which is coded by 3 > codons) the ATT will give you frequency of 0.5 while other two will give > you frequency of 0.25. OK - first of all you will need to create an alignment of all the different CDS sequences. If they happen to be the same length this is easy. Otherwise, you'll want to align their PROTEIN sequences, and then turn this into a nucleotide sequence alignment (where gaps are only found as triples). You may be lucky and find the proteins all align beautifully with no gaps. Do you need advice on this step? Once you have the alignment file, it should be fairly trivial to count the codons in each set of three columns. Peter From dalloliogm at fastwebnet.it Thu Jan 15 14:26:07 2009 From: dalloliogm at fastwebnet.it (Giovanni Marco Dall'Olio) Date: Thu, 15 Jan 2009 15:26:07 +0100 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> Message-ID: <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> On Thu, Jan 15, 2009 at 2:21 PM, Animesh Agrawal wrote: > > Hi Marco, > My apologies. Probably in my last mail I didn't make myself very clear. I > have a protein which is about 475 amino acid long and is highly conserved > (over 95%) among diffrent organisms. I have downloaded its CDS(coding > sequence) . ok! 
I used to work with transcript and alternative splicing

> I would like to calculate codon use frequency for important amino acid
> positions as you have put it very nicely in your reply:
> "for a particular aminoacid position (e.g. the first, or the third,or the
> last) the codon usage for those aminoacids that are coded by more than a
> possible codon (e.g. Ala) the frequency with which every codon is used?"
> For example in a set of four sequences
>         1    2    3
>         Ala  Gly  Ile
> Seq1    GCT  GCT  ATT
> Seq2    GCC  GCC  ATC
> Seq3    GCA  GCA  ATA
> Seq4    GCG  GCG  ATT

Let's see how you can do this with biopython (ehi, Peter, please correct me if I say something wrong!! :)).

If your set of sequences is not too big, you can just put the sequences in a dictionary:

sequences = {'seq1': '...', 'seq2': '...'}

The alignment file (align.txt) should look like this (or any other format supported by AlignIO):

>seq1
aaacccaaa
>seq2
aaacccaaa
>seq3
tttcccaaa
>seq4
tttgggaaa

If you want, you can use biopython to parse the alignment file:

>>> from Bio import AlignIO
>>> alignment = AlignIO.read(open('align.txt', 'r'), 'fasta')

Then, you will have an alignment object called 'alignment', which contains all the sequences in your file:

>>> print alignment
SingleLetterAlphabet() alignment with 4 rows and 9 columns
aaacccaaa seq1
aaacccaaa seq2
tttcccaaa seq3
tttgggaaa seq4

You will be able to access all the sequences in your alignment by the _records property of AlignIO:

>>> sequences = alignment._records
>>> print sequences
[SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq1', name='seq1', description='seq1', dbxrefs=[]),
 SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq2', name='seq2', description='seq2', dbxrefs=[]),
 SeqRecord(seq=Seq('tttcccaaa', SingleLetterAlphabet()), id='seq3', name='seq3', description='seq3', dbxrefs=[]),
 SeqRecord(seq=Seq('tttgggaaa', SingleLetterAlphabet()), id='seq4', name='seq4', description='seq4', dbxrefs=[])]

If you prefer, you are not obliged to use AlignIO and you can write your own parser for your alignment. However, if you use biopython's code, you won't have to demonstrate that your parser doesn't contain errors (somebody could ask you this).

The alignment object in biopython doesn't have any method to count codon usage the way you want to do. However, you can implement it easily in many ways, for example:

>>> codon_count_by_position = {}
>>> for codon_start in range(0, len(sequences[0].seq), 3):
...     codon_count_by_position[codon_start] = {}
...     for sequence in sequences:
...         current_codon = sequence.seq[codon_start:codon_start+3]
...         # note: why do I have to do .tostring() here, and in the previous statement no?
...         codon_count_by_position[codon_start].setdefault(current_codon.tostring(), 0)
...         codon_count_by_position[codon_start][current_codon.tostring()] += 1. / len(sequences)
...
>>> print codon_count_by_position
{0: {'aaa': 0.5, 'ttt': 0.5}, 3: {'ccc': 0.75, 'ggg': 0.25}, 6: {'aaa': 1.0}}

There are many other ways you can do this and you should be careful in handling gaps and alternative splicing, and you should have a look at the tools that already do codon-based alignment, but I hope this can help you.

> > For first amino acid position i.e. Ala (which is coded by 4 codons) each > codon is used once in 4 sequences that gives you frequency of 0.25 for each > codon or for third amino acid position i.e. Ile ( which is coded by 3 > codons) the ATT will give you frequency of 0.5 while other two will give > you frequency of 0.25. > > > Cheers, > Animesh > > > ----- Original Message ----- > From: Giovanni Marco Dall'Olio > Date: Thursday, January 15, 2009 10:45 pm > Subject: Re: [BioPython] How to check codon usage for specific amino acid > positions in a given set of CDS sequences > To: Animesh Agrawal > Cc: biopython at lists.open-bio.org > >> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal >> wrote: >> > Hi, >> > >> > I have been trying to write a python script to do the codon >> wise alignment >> > of given nucleotide sequences.
>> >> Note that there are many tools that already do a 'codon wise' >> alignment, if it is what I think you mean by it. >> I think t-coffee does this. It is always better to use a tool that >> already exists rather than develop a new one, if you can, because >> otherwise your results will be different to compare with other >> experiments. >> >> >> > I have downloaded CDS sequences (by a script >> > found on biopython mailing list) from genbank for a particular >> protein and >> > now would like to check codon usage for few specific amino >> acid positions. >> >> Can you provide a better example of what do you want to obtain? >> Do you want to know: >> - for a particular aminoacid position (e.g. the first, or the third, >> or the last) the codon usage in a set of sequences? >> - for those aminoacids that are coded by more than a possible codon >> (e.g. Ala) the frequency with which every codon is used? >> - the frequency at which every possible codon is used, in general. >> >> If I can give you an advice, I would spend some time in >> developing a >> test case first. For example, create a fake sequence and >> calculate the >> output that you expect from your experiment. >> It is a lot easier to describe your experiment to other people >> if you >> can provide the test cases you are using, it will be easier to >> understand what you want to do. >> >> >> > Could you please provide me few pointers on how to do that. I >> also want to >> > take this opportunity to thank you guys for excellent work on >> biopython> documentation. I am new to python, but I am able to >> use cookbook/tutorial >> > example for my work with relative ease. >> > >> > Cheers, >> > >> > Animesh Agrawal >> > >> > PhD Scholar >> > >> > Proteomics & Therapy Design Group >> > >> > Division of Molecular Biosciences >> > >> > The John Curtin School of Medical Research >> > >> > The Australian National University >> > >> > P.O. 
Box 334 >> > >> > Canberra ACT 2601 >> > >> > AUSTRALIA >> > >> > T: +61 2 6125 8303 >> > >> > >> > >> > _______________________________________________ >> > BioPython mailing list - BioPython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> >> >> -- >> >> My blog on bioinformatics (now in English): http://bioinfoblog.it -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Jan 15 18:02:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jan 2009 18:02:05 +0000 Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences In-Reply-To: <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> Message-ID: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com> On Thu, Jan 15, 2009 at 2:26 PM, Giovanni Marco Dall'Olio wrote: > Let's see how can you do this with biopython (ehi, Peter, please > correct me if I say something wrong!! :)). > > ... > > You will be able to access all the sequences in your alignment by the > _records property of AlignIO: In Python anything starting with a single underscore is considered to be a private variable, and you should avoid using it. So you shouldn't be doing alignment._records, and if you do, don't complain if this implementation detail changes in a future version of Biopython. For the Alignment object, if you really want a list of SeqRecord objects you should use alignment.get_all_seqs() instead. Ugly I agree - but in practice you don't need to do this so often. You can use the alignment object itself, e.g. 
first_record = alignment[0]
last_record = alignment[-1]
for record in alignment :
    print record

I think there is still room for improvement to this bit of Biopython, and there are a couple of open enhancement bugs.

If you are interested, here is my quick solution using Bio.AlignIO, which is one way of answering Animesh's question. First of all, we need the alignment in a suitable file format (e.g. FASTA, ClustalW, PHYLIP, Stockholm etc). e.g. taking Animesh's example alignment with four sequences of length nine:

handle = open("my_example.fasta","w")
handle.write(""">Alpha
GCT GCT ATT
>Beta
GCC GCC ATC
>Gamma
GCA GCA ATA
>Delta
GCG GCG ATT""")
handle.close()

Here is one solution using Bio.AlignIO to read in this as an alignment object, and then count the codon usage at each position separately:

from Bio import AlignIO
#Change this next line if your real file is another file format:
alignment = AlignIO.read(open("my_example.fasta"),"fasta")
assert alignment.get_alignment_length() % 3 == 0, \
       "Alignment length is not a multiple of three!"
number_of_codons = int(alignment.get_alignment_length() / 3)
for codon_index in range(number_of_codons) :
    #Count the codons in a dictionary using upper case codons as keys
    counts = dict()
    for record in alignment :
        #In case the alignment is in mixed case, make everything upper case
        codon = str(record.seq[codon_index*3:codon_index*3+3]).upper()
        #Assuming you want to exclude gaps when calculating frequencies:
        if codon=="---" :
            continue
        #Increment the count by one, defaulting to zero
        #(there are lots of ways to write this code!)
        counts[codon] = counts.get(codon,0)+1
    #Turn the counts into frequencies - note that because I exclude gaps,
    #the total codon count can vary across the alignment.
    total = float(sum(counts.values()))
    freqs = dict((codon,count/total) for (codon,count) in counts.iteritems())
    print "Codon frequencies for columns %i to %i:" \
          % (codon_index*3+1,codon_index*3+3),
    print freqs

And the output should read:

Codon frequencies for columns 1 to 3: {'GCA': 0.25, 'GCC': 0.25, 'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 4 to 6: {'GCA': 0.25, 'GCC': 0.25, 'GCT': 0.25, 'GCG': 0.25}
Codon frequencies for columns 7 to 9: {'ATT': 0.5, 'ATC': 0.25, 'ATA': 0.25}

Peter

From biopython at maubp.freeserve.co.uk Thu Jan 15 18:11:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jan 2009 18:11:42 +0000
Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences
In-Reply-To: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
Message-ID: <320fb6e00901151011p387a616bw8933b56d2d2ee5c@mail.gmail.com>

On Thu, Jan 15, 2009 at 6:02 PM, Peter wrote:
> On Thu, Jan 15, 2009 at 2:26 PM, Giovanni Marco Dall'Olio
> wrote:
>> Let's see how can you do this with biopython (ehi, Peter, please
>> correct me if I say something wrong!! :)).
>> ...
>> You will be able to access all the sequences in your alignment by the
>> _records property of AlignIO:
>
> In Python anything starting with a single underscore is considered to
> be a private variable, and you should avoid using it. So you
> shouldn't be doing alignment._records, and if you do, don't complain
> if this implementation detail changes in a future version of
> Biopython.
>
> For the Alignment object, if you really want a list of SeqRecord
> objects you should use alignment.get_all_seqs() instead.
...

On a related note, if you just want a list of SeqRecord objects from an alignment file, you can do this:

from Bio import AlignIO
alignment = AlignIO.read(open("my_example.phy"), "phylip")
records = alignment.get_all_seqs()

However, any input alignment format supported by Bio.AlignIO (like the PHYLIP format used in this example) can also be used via Bio.SeqIO, so you might prefer to do this:

from Bio import SeqIO
records = list(SeqIO.parse(open("my_example.phy"), "phylip"))

Up to you. It rather depends on what you are trying to do with the sequences - sometimes working with the SeqRecord objects directly is preferable.

Peter

From animesh.agrawal at anu.edu.au Fri Jan 16 04:46:52 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Fri, 16 Jan 2009 15:46:52 +1100
Subject: [BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences
In-Reply-To: <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
References: <319800150623214077@unknownmsgid> <5aa3b3570901150345k2f9b8109g670671ae4b6b5a81@mail.gmail.com> <5aa3b3570901150626s203f6373w5e04345a80fc8ece@mail.gmail.com> <320fb6e00901151002r77785b19h6dd66b5b3e1aa71a@mail.gmail.com>
Message-ID: <000301c97795$730e76d0$592b6470$@agrawal@anu.edu.au>

Peter,

Wow! The code (for positional frequency of codons) works for me. Thanks a ton.

While we are at it, please allow me to ask you another question related to downloading CDS sequences. I have copied a script from the mailing list, written by Andrew Dalke, for downloading the CDS given the GenBank record of a protein sequence. I modified it a little to handle a few more cases and it works most of the time, but it's still not bug free and gives errors frequently. I am copying both the script and the errors. See if you can spot the problem, or can suggest a better way of doing it.
----------------------------------------------------------------------------
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import GenBank
from Bio.GenBank import LocationParser
from EUtils import DBIds, DBIdsClient
from Bio.SeqRecord import SeqRecord
import StringIO
from Bio import Entrez
from Bio.Alphabet import IUPAC

#Animesh Agrawal Email:animesh.agrawal at anu.edu.au
print "This program extracts the CDS of a given Genbank protein file\n"
File_Input = raw_input("Give the name of input file:\t")
File_Output = raw_input("Give the name of output file:\t")
gb_handle = open(File_Input, "r")
feature_parser = GenBank.FeatureParser ()
iterator = GenBank.Iterator (gb_handle, feature_parser)
Out_file= open(File_Output, "w")

def lookup(name, seq_start, seq_stop):
    h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
    return h.efetch(retmode = "text", rettype = "fasta",
                    seq_start = seq_start, seq_stop = seq_stop).read()

def make_rc_record(record) :
    """Returns a new SeqRecord with the reverse complement sequence."""
    rc_rec = SeqRecord(seq = record.seq.reverse_complement(), \
                       id = "rc_" + record.id, \
                       name = "rc_" + record.name, \
                       description = "reverse complement")
    return rc_rec

while 1:
    cur_entry = iterator.next ()
    Genbank_entry = str(cur_entry)
    if cur_entry is None:
        break
    for feature in cur_entry.features :
        if feature.type == "CDS":
            loc = feature.qualifiers["coded_by"][0]
            Temp1=loc
            Temp2 = Temp1.split('(')
            # for genbank record like this
            # coded_by="complement(NC_001713.1:67323..68795)"
            if Temp2[0]=="complement":
                Temp3 = Temp2[1].replace(')', '')
                parsed_loc_complement=LocationParser.parse(LocationParser.scan(Temp3))
                assert isinstance(parsed_loc_complement, LocationParser.AbsoluteLocation)
                assert isinstance(parsed_loc_complement.local_location, LocationParser.Range)
                seq_start = parsed_loc_complement.local_location.low
                seq_stop = parsed_loc_complement.local_location.high
                assert isinstance(seq_start, LocationParser.Integer)
                assert isinstance(seq_stop, LocationParser.Integer)
                seq_start = seq_start.val
                seq_stop = seq_stop.val
                Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
                fasta_handle = StringIO.StringIO(Temp4)
                record = SeqIO.read(fasta_handle, "fasta")
                record = make_rc_record(record)
                Out_file.write(record.format("fasta"))
                break
            # for genbank record like this
            # coded_by="join(NC_008114.1:51934..52632, NC_008114.1:54315..55043)"
            elif Temp2[0]=="join":
                loc=loc.replace('join', '')
                loc= loc.replace('(', '')
                loc= loc.replace(')', '')
                loc = loc.split(',')
                loc1=loc[0].split(':')
                loc2=loc[1].split(':')
                loc3=loc1[1].split('..')
                loc4=loc2[1].split('..')
                loc5=int(loc3[0])-1
                loc6=int(loc3[1])
                loc7=int(loc4[0])-1
                loc8=int(loc4[1])
                handle = Entrez.efetch(db="nucleotide", id=loc1[0], rettype="genbank")
                record=SeqIO.read(handle, "genbank")
                seq1 = record.seq[loc5:loc6]
                seq2 = record.seq[loc7:loc8]
                record.seq = seq1+seq2
                Out_file.write(record.format("fasta"))
                break
            # for genbank record like this
            # coded_by="FM207547.1:<1..1443"
            else:
                parsed_loc=LocationParser.parse(LocationParser.scan(loc))
                assert isinstance(parsed_loc, LocationParser.AbsoluteLocation)
                assert isinstance(parsed_loc.local_location, LocationParser.Range)
                seq_start = parsed_loc.local_location.low
                seq_stop = parsed_loc.local_location.high
                assert isinstance(seq_start, LocationParser.Integer)
                assert isinstance(seq_stop, LocationParser.Integer)
                seq_start = seq_start.val
                seq_stop = seq_stop.val
                Out_file.write(lookup(parsed_loc.path.accession, seq_start, seq_stop))
                break
        # for swissprot entries in Genbank
        elif Genbank_entry.find('swissprot') >= 0:
            Entry = cur_entry.annotations
            Entry = str(Entry)
            Entry = Entry.split('xrefs')
            Entry1 =Entry[1].split(',')
            Entry2 = Entry1[0].split(':')
            handle = Entrez.efetch(db="nucleotide", id=Entry2[1], rettype="genbank")
            data=SeqIO.read(handle, "genbank")
            for feature in data.features :
                if feature.type == "gene":
                    Gene_id= feature.qualifiers['gene'] [0]
                    if Gene_id == "rbcL":
                        temp = str(feature.location)
                        temp = temp.replace(':', '..')
                        temp = temp.replace('[', '')
                        temp = temp.replace(']', '')
                        if feature.strand == -1:
                            temp1 = data.id+':<'+temp
                            parsed_loc_complement = LocationParser.parse(LocationParser.scan(temp1))
                            assert isinstance(parsed_loc_complement, LocationParser.AbsoluteLocation)
                            assert isinstance(parsed_loc_complement.local_location, LocationParser.Range)
                            seq_start = parsed_loc_complement.local_location.low
                            seq_stop = parsed_loc_complement.local_location.high
                            assert isinstance(seq_start, LocationParser.Integer)
                            assert isinstance(seq_stop, LocationParser.Integer)
                            seq_start = seq_start.val+1
                            seq_stop = seq_stop.val
                            Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
                            fasta_handle = StringIO.StringIO(Temp4)
                            record = SeqIO.read(fasta_handle, "fasta")
                            record = make_rc_record(record)
                            #print (record.format("fasta"))
                            Out_file.write(record.format("fasta"))
                            break
                        else:
                            temp2 = data.id+':<'+temp
                            parsed_loc=LocationParser.parse(LocationParser.scan(temp2))
                            assert isinstance(parsed_loc, LocationParser.AbsoluteLocation)
                            assert isinstance(parsed_loc.local_location, LocationParser.Range)
                            seq_start = parsed_loc.local_location.low
                            seq_stop = parsed_loc.local_location.high
                            assert isinstance(seq_start, LocationParser.Integer)
                            assert isinstance(seq_stop, LocationParser.Integer)
                            seq_start = seq_start.val+1
                            seq_stop = seq_stop.val
                            #print (lookup(parsed_loc.path.accession, seq_start, seq_stop))
                            Out_file.write(lookup(parsed_loc.path.accession, seq_start, seq_stop))
                            break
            break
Out_file.close()
----------------------------------------------------------------------------
Syntax error at or near `join' token
Traceback (most recent call last):
  File "C:\Documents and Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 52, in <module>
    parsed_loc_complement=LocationParser.parse(LocationParser.scan(Temp3))
  File "C:\Python25\lib\site-packages\Bio\GenBank\LocationParser.py", line 319, in parse
    return _cached_parser.parse(tokens)
  File "C:\Python25\Lib\site-packages\Bio\Parsers\spark.py", line 204, in parse
    self.error(tokens[i-1])
  File "C:\Python25\Lib\site-packages\Bio\Parsers\spark.py", line 183, in error
    raise SystemExit
SystemExit
----------------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 76, in <module>
    loc5=int(loc3[0])-1
ValueError: invalid literal for int() with base 10: '<1'
----------------------------------------------------------------------------

Animesh

From biopython at maubp.freeserve.co.uk Fri Jan 16 12:35:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jan 2009 12:35:25 +0000
Subject: [BioPython] Downloading CDS sequences
Message-ID: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com>

On Fri, Jan 16, 2009 at 4:46 AM, Animesh Agrawal wrote:
>
> Peter,
> Wow! The code(for positional frequency of codons) works 4 me. Thanks a ton.

Good.

> While we are at it please allow me to ask you another question related to
> downloading CDS sequences.

Sure - but I would have changed the email subject line if I were you.

> I have copied one script from mailing list for
> downloading CDS given from Genbank record of protein sequence written by
> Andrew Dalke. I modified it a little bit to include few more exceptions and
> it work in most of the cases but it's still not bug free.

Do you have a link to the original in the mail archive?
http://lists.open-bio.org/pipermail/biopython/

One minor point is I would have used Bio.SeqIO rather than
Bio.GenBank.FeatureParser and Bio.GenBank.Iterator (the same parsing
code gets used internally - I just think the code is simpler).

From a style point of view, breaking this up into some subfunctions
would make it a lot clearer what is going on.

I see you are looking at the "coded_by" qualifier, which will be a
location string like "join(NC_008114.1:51934..52632,
NC_008114.1:54315..55043)" including other sequence identifiers. For
this example you download "NC_008114.1" and extract the two
subsequences and join them up.

The Bio.GenBank.LocationParser should be able to cope with parsing
these strings - but it's a complicated thing to do. As you have seen,
there can be joins etc to deal with - but there are also fuzzy
locations which are more tricky.

Your specific error is simple enough:

> Traceback (most recent call last):
>   File "C:\Documents and
> Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 76,
> in <module>
>     loc5=int(loc3[0])-1
> ValueError: invalid literal for int() with base 10: '<1'

You've got a location like "<1..456" meaning it starts before base one
and continues to base 456 (one based counting). In this particular
case, you'll just have to take the sequence from the start (base 1).
The problem is your code does int("<1") which fails.

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 17 19:30:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jan 2009 19:30:17 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To:
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com>
Message-ID: <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>

Peter wrote:
> You've got a location like "<1..456" meaning it starts before base one
> and continues to base 456 (one based counting). In this particular
> case, you'll just have to take the sequence from the start (base 1).
> The problem is your code does int("<1") which fails.

From my testing, in this case and similar examples like
"AF376133.1:<1..>553" it is safe to treat this as just from position 1
to 553. The less than and greater than signs indicate that the full
protein CDS may well extend beyond this region, but it was not
sequenced.

Animesh wrote:
>
> http://lists.open-bio.org/pipermail/biopython/2003-April/001255.html
> This is the link to original script in the mailing list.
> Animesh
>

Thanks! I see Andrew's original code just dealt with the "easy"
cases, where the coded_by string was a non-fuzzy location, and without
a join.

Andrew's code (and yours) uses Bio.EUtils to access the NCBI's "Entrez
Utilities" online API. I should point out this module has been
deprecated since Release 1.48 (it's still there for now but will give a
warning message when used), and we recommend you use Bio.Entrez
instead.

I hope you don't mind me giving you a few comments about your code?

You seem to be struggling with handles. Andrew defined this function:

def lookup(name, seq_start, seq_stop):
    h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
    return h.efetch(retmode = "text", rettype = "fasta",
                    seq_start = seq_start, seq_stop = seq_stop).read()

The efetch call returns a handle, but you use its read method to get
all the data as a string. This means your lookup function returns a
string containing the record in FASTA format.

However, for your code, it would have made more sense to just stick
with the handle - as you had to convert back from a string of data to
a handle using StringIO:

Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
fasta_handle = StringIO.StringIO(Temp4)
record = SeqIO.read(fasta_handle, "fasta")

Using Bio.Entrez.efetch (the equivalent to the old EUtils efetch
method you were using) which returns a handle this would be just:

from Bio import Entrez
fasta_handle = Entrez.efetch("nucleotide", id=name,
                             retmode="text", rettype="fasta",
                             seq_start=seq_start, seq_stop=seq_stop)
record = SeqIO.read(fasta_handle, "fasta")

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 17 19:40:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jan 2009 19:40:49 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com>
Message-ID: <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com>

Following Animesh's query, I was inspired to try and solve this
problem for myself. My own rough script (below) has several
differences to Andrew and Animesh's code.

First of all, I didn't bother using the Bio.GenBank.LocationParser as
I felt that for CDS processing I only needed to cope with a handful of
location formats, and this was easier to do "by hand".

Secondly I found some GenBank/GenPept examples where there wasn't a
CDS feature with a "coded_by" qualifier in the annotation. Here the
only thing I could find that worked was to look under the DBSOURCE
information for a cross reference to the full parent nucleotide
sequence, and then try and work out which bit codes for the protein.
This is a little ugly, but seems to work.

I'm also using Bio.SeqIO and Bio.Entrez rather than Bio.GenBank and
Bio.EUtils (deprecated).
I think the most important change was that I explicitly verify the
nucleotide sequence obtained when translated does actually give the
expected protein sequence - just in case there was an error in my
code, the annotation, or even the downloads.

Peter

---

#Script to take a file of proteins in GenBank/GenPept format, examine their annotation,
#and use this to download their CDS from the NCBI.
#Written and tested on Biopython 1.49, on 2009/01/17
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio import Entrez

#Edit this next line (and read the NCBI Entrez usage guidelines)
Entrez.email = "Your.Name.Here at example.com"

def get_nuc_by_name(name, start=None, end=None) :
    """Fetches the sequence from the NCBI given an identifier name.

    Note start and end should be given using one based counting!

    Returns a Seq object."""
    record = SeqIO.read(Entrez.efetch("nucleotide", id=name.strip(),
                                      seq_start=start, seq_stop=end,
                                      retmode="text", rettype="fasta"),
                        "fasta")
    return record.seq

def get_nuc_from_coded_by_string(source) :
    """Fetches the sequence from the NCBI for a "coded_by" string.

    e.g. "NM_010510.1:21..569" or "AF376133.1:<1..>553"
    or "join(AB061020.1:1..184,AB061020.1:300..1300)"
    or "complement(NC_001713.1:67323..68795)"

    Note - joins and complements are handled by recursion.

    Returns a Seq object."""
    if source.startswith("complement(") :
        assert source.endswith(")")
        #For simplicity this works by recursion
        return get_nuc_from_coded_by_string(source[11:-1]).reverse_complement()
    if source.startswith("join(") :
        assert source.endswith(")")
        #For simplicity this works by recursion.
        #Note that the Seq object (currently) does not have a join
        #method, so convert to strings and join them, then go back
        #to a Seq object:
        return Seq("".join(str(get_nuc_from_coded_by_string(s)) \
                           for s in source[5:-1].split(",")))
    if "(" in source or ")" in source \
    or source.count(":") != 1 or source.count("..") != 1 :
        raise ValueError("Don't understand %s" % repr(source))
    name, loc = source.split(":")
    #Remove and ignore any leading < or > for fuzzy locations which
    #indicate the full CDS extends beyond the region sequenced.
    start, end = [int(x.lstrip("<>")) for x in loc.split("..")]
    #We could now download the full sequence, and crop it locally:
    #return get_nuc_by_name(name)[start-1:end]
    #However, we can ask the NCBI to crop it and then download
    #just the bit we need!
    return get_nuc_by_name(name,start,end)

def get_nuc_record(protein_record, table="Standard") :
    """Given a protein record, returns a record with the CDS nucleotides.

    The protein's annotation is used to determine the CDS sequence(s)
    which are downloaded from the NCBI using Entrez.

    The translation table specified is used to check the nucleotides
    actually do give the expected protein sequence.

    Tries to get the CDS information from a "coded_by" qualifier,
    failing that it falls back on a DB_SOURCE xref entry (which does
    not specify which bit of the nucleotide sequence referenced is
    required - this is deduced from the expected translation).
    """
    if not isinstance(protein_record, SeqRecord) :
        raise TypeError("Expect a SeqRecord as the protein_record.")
    feature = None
    for f in protein_record.features :
        if f.type == "CDS" and "coded_by" in f.qualifiers :
            feature = f
            break
    if feature :
        assert feature.location.start.position == 0
        assert feature.location.end.position == len(protein_record)
        source = feature.qualifiers["coded_by"][0]
        print "Using %s" % source
        return SeqRecord(Seq(""))
        nuc = get_nuc_from_coded_by_string(source)
        #See if this included the stop codon - they don't always!
        if str(nuc[-3:].translate(table)) == "*" :
            nuc = nuc[:-3]
    elif "db_source" in protein_record.annotations :
        #Note the current parsing of the DBSOURCE lines in GenPept
        #files is non-optimal (as of Biopython 1.49). If the
        #parsing is changed then the following code will need
        #updating to pull out the first xrefs entry.
        parts = protein_record.annotations["db_source"].split()
        source = parts[parts.index("xrefs:")+1].strip(",;")
        print "Using %s" % source
        nuc_all = get_nuc_by_name(source)
        start = nuc_all.translate(table).find(protein_record.seq)
        assert start != -1, "Could not find start (assumed in frame)"
        nuc = nuc_all[3*start:3*(start+len(protein_record))]
    else :
        raise ValueError("Could not determine CDS source from record.")
    assert str(nuc.translate(table)) == str(protein_record.seq), \
           "Translation:\n%s\nExpected:\n%s" \
           % (nuc.translate(table), protein_record.seq)
    return SeqRecord(nuc, id=protein_record.id,
                     description="(the CDS for this protein)")

#Now use the above functions to fetch the CDS sequence for some proteins...
nucs = (get_nuc_record(p, table="Standard") for p \
        in SeqIO.parse(open("protein.gbk"),"genbank"))
handle = open("nucleotide.fasta","w")
SeqIO.write(nucs, handle, "fasta")
handle.close()
print "Done"

From biopython at maubp.freeserve.co.uk Mon Jan 19 11:03:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 Jan 2009 11:03:41 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <-808938340139089427@unknownmsgid>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid>
Message-ID: <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com>

On Mon, Jan 19, 2009 at 1:53 AM, Animesh Agrawal wrote:
> Peter,
> Definitely the script written by you look much simpler to understand with
> defined functions for each of the cases.

Thank you!
This illustrates one of the problems I had reading your code - it was
one big lump with no clear structure. You should find this gets
easier with practice.

> But this script is giving following error. I couldn't get it working for me.
> I am attaching the input file with this mail.
> ----------------------------------------------------------------------------
> ...
>     assert start != -1, "Could not find start (assumed in frame)"
> AssertionError: Could not find start (assumed in frame)
> ----------------------------------------------------------------------------

You'd found a nasty example, locus P24673, where there is no CDS
feature with a "coded_by" qualifier in the annotation. My code just
tried fetching all of M59080.1 and only looked in the default
translation in the first frame - but didn't find the protein. This is
what my error message was trying to convey.

In this situation, the sequence can still be found by downloading
M59080.1 but we potentially have to check all six translation frames.
I must have been lucky that all the examples I tried without the
"coded_by" information happened to be in the standard frame. This is
fairly simple to fix - see below.

Peter

#Script to take a file of proteins in GenBank/GenPept format, examine
#their annotation, and use this to download their CDS from the NCBI.
#Written and tested on Biopython 1.49, on 2009/01/19
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio import Entrez

#Edit this next line (and read the NCBI Entrez usage guidelines)
#Entrez.email = "Your.Name.Here at example.com"

def get_nuc_by_name(name, start=None, end=None) :
    """Fetches the sequence from the NCBI given an identifier name.

    Note start and end should be given using one based counting!

    Returns a Seq object."""
    record = SeqIO.read(Entrez.efetch("nucleotide", id=name.strip(),
                                      seq_start=start, seq_stop=end,
                                      retmode="text", rettype="fasta"),
                        "fasta")
    return record.seq

def get_nuc_from_coded_by_string(source) :
    """Fetches the sequence from the NCBI for a "coded_by" string.

    e.g. "NM_010510.1:21..569" or "AF376133.1:<1..>553"
    or "join(AB061020.1:1..184,AB061020.1:300..1300)"
    or "complement(NC_001713.1:67323..68795)"

    Note - joins and complements are handled by recursion.

    Returns a Seq object."""
    if source.startswith("complement(") :
        assert source.endswith(")")
        #For simplicity this works by recursion
        return get_nuc_from_coded_by_string(source[11:-1]).reverse_complement()
    if source.startswith("join(") :
        assert source.endswith(")")
        #For simplicity this works by recursion.
        #Note that the Seq object (currently) does not have a join
        #method, so convert to strings and join them, then go back
        #to a Seq object:
        return Seq("".join(str(get_nuc_from_coded_by_string(s)) \
                           for s in source[5:-1].split(",")))
    if "(" in source or ")" in source \
    or source.count(":") != 1 or source.count("..") != 1 :
        raise ValueError("Don't understand %s" % repr(source))
    name, loc = source.split(":")
    #Remove and ignore any leading < or > for fuzzy locations which
    #indicate the full CDS extends beyond the region sequenced.
    start, end = [int(x.lstrip("<>")) for x in loc.split("..")]
    #We could now download the full sequence, and crop it locally:
    #return get_nuc_by_name(name)[start-1:end]
    #However, we can ask the NCBI to crop it and then download
    #just the bit we need!
    return get_nuc_by_name(name,start,end)

def find_protein_within_nuc(protein_seq, nuc_seq, table) :
    """Search all six frames to find a protein's CDS."""
    for frame in [0,1,2] :
        start = nuc_seq[frame:].translate(table).find(protein_seq)
        if start != -1 :
            return nuc_seq[frame+3*start:frame+3*(start+len(protein_seq))]
    rev_seq = nuc_seq.reverse_complement()
    for frame in [0,1,2] :
        start = rev_seq[frame:].translate(table).find(protein_seq)
        if start != -1 :
            return rev_seq[frame+3*start:frame+3*(start+len(protein_seq))]
    raise ValueError("Could not find the protein sequence "
                     "in any of the six translation frames.")

def get_nuc_record(protein_record, table="Standard") :
    """Given a protein record, returns a record with the CDS nucleotides.

    The protein's annotation is used to determine the CDS sequence(s)
    which are downloaded from the NCBI using Entrez.

    The translation table specified is used to check the nucleotides
    actually do give the expected protein sequence.

    Tries to get the CDS information from a "coded_by" qualifier,
    failing that it falls back on a DB_SOURCE xref entry (which does
    not specify which bit of the nucleotide sequence referenced is
    required - this is deduced from the expected translation). This
    could get the wrong region if there happen to be two genes with
    different nucleotides encoding the same protein sequence!
    """
    if not isinstance(protein_record, SeqRecord) :
        raise TypeError("Expect a SeqRecord as the protein_record.")
    feature = None
    for f in protein_record.features :
        if f.type == "CDS" and "coded_by" in f.qualifiers :
            feature = f
            break
    if feature :
        #This is the good situation, there is a precise "coded_by" string
        #Check this CDS feature is for the whole protein:
        assert feature.location.start.position == 0
        assert feature.location.end.position == len(protein_record)
        source = feature.qualifiers["coded_by"][0]
        print "Using %s" % source
        return SeqRecord(Seq(""))
        nuc = get_nuc_from_coded_by_string(source)
        #See if this included the stop codon - they don't always!
        if str(nuc[-3:].translate(table)) == "*" :
            nuc = nuc[:-3]
    elif "db_source" in protein_record.annotations :
        #Note the current parsing of the DBSOURCE lines in GenPept
        #files is non-optimal (as of Biopython 1.49). If the
        #parsing is changed then the following code will need
        #updating to pull out the first xrefs entry.
        parts = protein_record.annotations["db_source"].split()
        source = parts[parts.index("xrefs:")+1].strip(",;")
        print "Using %s" % source
        nuc_all = get_nuc_by_name(source)
        nuc = find_protein_within_nuc(protein_record.seq, nuc_all, table)
    else :
        raise ValueError("Could not determine CDS source from record.")
    assert str(nuc.translate(table)) == str(protein_record.seq), \
           "Translation:\n%s\nExpected:\n%s" \
           % (nuc.translate(table), protein_record.seq)
    return SeqRecord(nuc, id=protein_record.id,
                     description="(the CDS for this protein)")

#Now use the above functions to fetch the CDS sequence for some proteins...
gbk_input = "Diatoms_in.gp" #any proteins in GenBank/GenPept format.
nucs = (get_nuc_record(p, table="Standard") for p \
        in SeqIO.parse(open(gbk_input),"genbank"))
handle = open("nucleotide.fasta","w")
SeqIO.write(nucs, handle, "fasta")
handle.close()
print "Done"

From biopython at maubp.freeserve.co.uk Tue Jan 20 09:44:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jan 2009 09:44:02 +0000
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <7373673958461722326@unknownmsgid>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid> <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com> <7373673958461722326@unknownmsgid>
Message-ID: <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com>

On Tue, Jan 20, 2009 at 8:31 AM, Animesh Agrawal wrote:
>
> Peter,
> Thanks a lot. ...
> I tested your new script for Downloading CDS sequences.
> It was working fine
> for records like P24673 but couldn't get it working for precise "coded_by"
> string situation unless I comment(#return SeqRecord(Seq(""))) statement in
> get_nuc_record() function. I don't understand why?

That was a deliberate mistake to test your understanding (joke). I'm
pleased you worked out what was wrong!

I put that line in to speed up my testing - and forgot to remove it.
Basically that line was to just return a dummy SeqRecord with an empty
sequence for the "coded_by" cases, rather than going online wasting
the NCBI server time. With hindsight I could have edited my example
GenBank file to focus on the cases of interest. Does that make sense?

Sorry for the confusion,

Peter

From animesh.agrawal at anu.edu.au Wed Jan 21 07:14:42 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Wed, 21 Jan 2009 18:14:42 +1100
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com> <-808938340139089427@unknownmsgid> <320fb6e00901190303h61d9c25fx114de8b253ad4c73@mail.gmail.com> <7373673958461722326@unknownmsgid> <320fb6e00901200144u3ba0929ave72282c99815b7d5@mail.gmail.com>
Message-ID: <000001c97b97$ee720de0$cb5629a0$@agrawal@anu.edu.au>

>just return a dummy SeqRecord with an empty
>sequence for the "coded_by" cases, rather than going online wasting
>the NCBI server time.

OK, so that's the reason. I was wondering why you would want to return
an empty sequence.

From animesh.agrawal at anu.edu.au Mon Jan 19 01:53:23 2009
From: animesh.agrawal at anu.edu.au (Animesh Agrawal)
Date: Mon, 19 Jan 2009 12:53:23 +1100
Subject: [BioPython] Downloading CDS sequences
In-Reply-To: <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com>
References: <320fb6e00901160435n3035b3adva7964e31ed929f96@mail.gmail.com> <320fb6e00901171130u610b2db1s1c4ff613dc49e404@mail.gmail.com> <320fb6e00901171140k67cef606oe2426a30f41f9623@mail.gmail.com>
Message-ID: <000001c979d8$b65b1890$231149b0$@agrawal@anu.edu.au>

Peter,
Definitely the script written by you look much simpler to understand with
defined functions for each of the cases. But this script is giving following
error. I couldn't get it working for me. I am attaching the input file with
this mail.
----------------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 114, in <module>
    SeqIO.write(nucs, handle, "fasta")
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\__init__.py", line 274, in write
    count = writer_class(handle).write_file(sequences)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 255, in write_file
    count = self.write_records(records)
  File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 239, in write_records
    for record in records :
  File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 110, in <genexpr>
    nucs = (get_nuc_record(p, table="Standard") for p \
  File "C:\Documents and Settings\Animesh\Desktop\Pthon_learning\Features_extraction_peter.py", line 99, in get_nuc_record
    assert start != -1, "Could not find start (assumed in frame)"
AssertionError: Could not find start (assumed in frame)
----------------------------------------------------------------------------

Cheers,
Animesh

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Diatoms_in.gp
Type: application/octet-stream
Size: 78630 bytes
Desc: not available
URL:

From nir at rosettadesigngroup.com Tue Jan 27 13:09:28 2009
From: nir at rosettadesigngroup.com (Nir London)
Date: Tue, 27 Jan 2009 15:09:28 +0200
Subject: [BioPython] Rosetta Academic Training Workshop
Message-ID:

Due to public demand, "Rosetta Design Group" is organizing a "Rosetta"
software training workshop, aimed at academic groups.

The format of the workshop will be a "webinar" - a web seminar, enabling
more groups to attend while avoiding the annoying jet lag and
accommodation troubles.

Would you be interested in participating?
If so, please fill in the form located at:
http://rosettadesigngroup.com/blog/rosetta-academic-workshop/
and we will contact you when the details are finalized.*

Nir London | Rosetta Design Group
http://rosettadesigngroup.com/

* If you're not from an academic group, don't worry, write us anyway...

From rodrigo_faccioli at uol.com.br Tue Jan 27 16:31:41 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Tue, 27 Jan 2009 14:31:41 -0200
Subject: [BioPython] Error XML Parser and another doubt
Message-ID: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com>

I get an error when reading an XML file that is the result of
NCBIWWW.qblast. For this work, I used Biopython 1.45 and Python 2.5.

The source code is below:

from Bio.Blast import NCBIXML
import sys

def readxml(filenamexml):
    E_VALUE_THRESH = 0.04

    result_handle = open(filenamexml)

    blast_records = NCBIXML.parse(result_handle)
    for alignment in blast_records:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print '****Alignment****'
                print 'sequence:', alignment.title
                print 'length:', alignment.length
                print 'e value:', hsp.expect
                print hsp.query[0:75] + '...'
                print hsp.match[0:75] + '...'
                print hsp.sbjct[0:75] + '...'

def main():
    filenamexml = sys.argv[1]
    readxml(filenamexml)
    print "Done"

main()

The error message is:

Traceback (most recent call last):
  File "src/readxml.py", line 26, in <module>
    main()
  File "src/readxml.py", line 23, in main
    readxml(filenamexml)
  File "src/readxml.py", line 10, in readxml
    for alignment in blast_records:
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 574, in parse
    expat_parser.Parse(text, False)
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement
    eval("self.%s()" % method)
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in _end_BlastOutput_version
    self._header.date = self._value.split()[2][1:-1]
IndexError: list index out of range

I'm very new to Python and BioPython.
Honestly, this is my first program written without a tutorial.

I have another question: is there a tool (website or program) that reads
a BLAST XML file and displays it the way the NCBI web site does?

Thanks for any help.

-- 
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk Tue Jan 27 16:59:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Jan 2009 16:59:24 +0000
Subject: [BioPython] Error XML Parser and another doubt
In-Reply-To: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com>
References: <3715adb70901270831u1c2deafbu58d062bb5da5c70@mail.gmail.com>
Message-ID: <320fb6e00901270859v29f545aeu3475cec90c493577@mail.gmail.com>

On Tue, Jan 27, 2009 at 4:31 PM, Rodrigo faccioli wrote:
> I get an error when reading an XML file that is the result of
> NCBIWWW.qblast. For this work, I used Biopython 1.45 and Python 2.5.
> ...
> Traceback (most recent call last):
> ...
> File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 214, in
> _end_BlastOutput_version
> self._header.date = self._value.split()[2][1:-1]
> IndexError: list index out of range
>
> I'm very new to Python and BioPython. Honestly, this is my first program
> written without a tutorial.

I'm sorry you've had trouble. This looks like an old bug in parsing
the date in the XML file, caused when the NCBI changed their online
server. See Biopython Bug 2499 for details:
http://bugzilla.open-bio.org/show_bug.cgi?id=2499

We fixed this in Biopython 1.46, but you are using Biopython 1.45.
Can you update your machine? The current release is Biopython 1.49.

Peter

From rodrigo_faccioli at uol.com.br Tue Jan 27 19:28:34 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Tue, 27 Jan 2009 17:28:34 -0200
Subject: [BioPython] Remove biopython 1.45
Message-ID: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com>

I would like to know how I can remove Biopython 1.45 from my machine. I
installed the latest version (1.49) from the Biopython website.

I read http://biopython.org/DIST/docs/install/Installation.html and I didn't
find anything about uninstalling.

Thanks,

-- 
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From dalloliogm at gmail.com Wed Jan 28 10:05:19 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 28 Jan 2009 11:05:19 +0100
Subject: [BioPython] Remove biopython 1.45
In-Reply-To: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com>
References: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com>
Message-ID: <5aa3b3570901280205w191be10cr18ad32e84f18076@mail.gmail.com>

On Tue, Jan 27, 2009 at 8:28 PM, Rodrigo faccioli wrote:
> I would like to know how I can remove Biopython 1.45 from my machine. I
> installed the latest version (1.49) from the Biopython website.
>
> I read http://biopython.org/DIST/docs/install/Installation.html and I didn't
> find anything about uninstalling.

In the future, I suggest you always install/upgrade biopython via
easy_install. Executing this from a command line:

$: easy_install -U biopython

is the easiest way to install and upgrade biopython along with all its
dependencies.

As for your problem, it should be enough to delete the folder where
you have installed biopython 1.45 (please someone correct me if I am wrong).
It seems that manually installing biopython using the instructions you
posted puts all the scripts in the same directory; in my case (I am
running Ubuntu), it installed everything in
/usr/lib/python2.5/site-packages/Bio .
So basically, the manual installation you did has overwritten the old
biopython you had installed on your computer... so you don't need to
do anything to remove it.

I suggest you always use easy_install to install new python
modules, as it is supposed to be the standard way and it creates a
distinct directory for every module and every version.

p.s. more info on easy_install:
http://peak.telecommunity.com/DevCenter/EasyInstall

>
> Thanks,
>
> --
> Rodrigo Antonio Faccioli
> Ph.D Student in Electrical Engineering
> University of Sao Paulo - USP
> Engineering School of Sao Carlos - EESC
> Department of Electrical Engineering - SEL
> Intelligent System in Structure Bioinformatics
> http://laips.sel.eesc.usp.br
> Phone: 55 (16) 3373-9366 Ext 229
> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Wed Jan 28 11:37:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jan 2009 11:37:43 +0000
Subject: [BioPython] Remove biopython 1.45
In-Reply-To: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com>
References: <3715adb70901271128o11c7fcd3jeeb4d4dbdb064cad@mail.gmail.com>
Message-ID: <320fb6e00901280337t6c162ba2g839d54d5a9408d8c@mail.gmail.com>

On Tue, Jan 27, 2009 at 7:28 PM, Rodrigo faccioli wrote:
> I would like to know how I can remove Biopython 1.45 from my machine. I
> installed the latest version (1.49) from the Biopython website.
>
> I read http://biopython.org/DIST/docs/install/Installation.html and I didn't
> find anything about uninstalling.

If you installed an old version of Biopython using your Linux
distribution's package manager, you should ideally have un-installed
it first via the package manager.

Installing any Python package from source will just overwrite any
existing installation (if one is already there in the same place).
As far as I know, this is just the way that distutils works (the
standard python installation package). While easy_install may be
popular, it is not (yet) the official python tool for package
installation.

To manually remove Biopython (e.g. to make a clean install), locate
and remove the relevant directories (and if present, egg files) under
your python site-packages directory, e.g.

/usr/lib/python2.5/site-packages/Bio
/usr/lib/python2.5/site-packages/BioSQL
/usr/lib/python2.5/site-packages/Martel

[These paths will depend on your OS, your version of python, and also
can differ if you choose to install Biopython in a non-default
directory, such as under your home folder]

Peter

From biopython at maubp.freeserve.co.uk Wed Jan 28 17:45:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jan 2009 17:45:25 +0000
Subject: [BioPython] BLAST subprocess problem with a GUI
Message-ID: <320fb6e00901280945p32eff05by64d8a42d576f76cc@mail.gmail.com>

On Tue, Jan 13, 2009 at 10:41 AM, Peter wrote:
> On Tue, Jan 13, 2009 at 7:38 AM, Stefanie Lück wrote:
>> I was a little bit too optimistic...
>>
>> After compilation with py2exe, blast hangs. In the log file of py2exe
>> I get the following error message:
>>
>> Traceback (most recent call last):
>> File "prim_search.pyc", line 464, in make_xml
>>
>> File "Bio\Blast\NCBIStandalone.pyc", line 1668, in blastall
>> File "Bio\Blast\NCBIStandalone.pyc", line 1992, in _invoke_blast
>> File "subprocess.pyc", line 586, in __init__
>> File "subprocess.pyc", line 681, in _get_handles
>> File "subprocess.pyc", line 722, in _make_inheritable
>> TypeError: an integer is required
>>
>> Any ideas?
>> Stefanie
>
> Are you using Biopython 1.49?
>
> What version of Python are you using here? (Python 2.3 is handled a
> little differently, as it does not have the subprocess module).
>
> Can you confirm the exact same code works fine run from Python
> directly (via IDLE or the commandline?), but fails via py2exe?
>
> Are you running the py2exe compiled version from the Windows command
> line? Can you try that, even though you said it was a GUI program.
> This might be related to the following python bug on Windows to do
> with pipe redirection, http://bugs.python.org/issue1124861
> If so, I think there is a suggested work around we can try (this will
> require a change to the Biopython code).
>
> Peter

Hi Stefanie,

Did you make any progress with this problem?

If, as I suspect, the problem is the python subprocess bug
http://bugs.python.org/issue1124861 then you can try the suggested
work around in Biopython, by modifying the _invoke_blast function in
the Bio\Blast\NCBIStandalone.py file as follows:

    import subprocess, sys
    #We don't need to supply any piped input, but we setup the
    #pipe anyway as a work around for a python bug if this is
    #called from a Windows GUI program. For details, see:
    #http://bugs.python.org/issue1124861
    blast_process = subprocess.Popen(cmd_string,
                                     stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE,
                                     stderr=subprocess.PIPE,
                                     shell=(sys.platform!="win32"))
    blast_process.stdin.close()
    return blast_process.stdout, blast_process.stderr

I've checked this change doesn't seem to break anything - but does it
help for your GUI program?

Peter

From biopython at maubp.freeserve.co.uk Fri Jan 30 12:52:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 30 Jan 2009 12:52:29 +0000
Subject: [BioPython] Does anyone use EZRetrieve?
In-Reply-To: <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com>
References: <4925CCAA.2040809@gmail.com>
 <320fb6e00811201253j66336d7cl977e4e3112c9f9f7@mail.gmail.com>
Message-ID: <320fb6e00901300452vdfd1a73yf70fd78d77d12eb5@mail.gmail.com>

On Thu, Nov 20, 2008 at 8:53 PM, Peter wrote:
> On Thu, Nov 20, 2008 at 8:46 PM, Bruce Southey wrote:
>> Hi,
>> Does anyone use EZRetrieve
>> (http://siriusb.umdnj.edu:18080/EZRetrieve/single_r.jsp) ?
>> This allows a user to retrieve a human, mouse or rat genome nucleic sequence
>> based on a valid identifier.
>>
>> I think that most of the functionality of Bio.EZRetrieve is already present
>> in Biopython and the genome sources appear to be 5 years old. For example,
>> it uses LocusLink, which was discontinued in March 2005.
>>
>> If so could you please let me know?
>
> Actually - could you let the whole mailing list know? ;)
>
> Given the nature of the database and the limited functionality this python
> code offers, if no-one is using Bio.EZRetrieve then it could be
> considered for deprecation.

I've seen no replies so I've marked Bio.EZRetrieve as obsolete in CVS
(and therefore for Biopython 1.50), and unless anyone speaks up it
will be deprecated in the release after that.

I'm not sure that the EZRetrieve data is that out of date (they may
have updated things since Bruce looked), but all the Bio.EZRetrieve
code does is fetch an HTML page and extract the FASTA formatted
sequence (ignoring any metadata or cross references). In any case,
this kind of HTML "screen scraping" is fragile (liable to break when
the site gets a visual redesign) and is not explicitly condoned by the
service itself.

Peter
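
[Editorial illustration] To make the "screen scraping" point above concrete, here is a minimal sketch of pulling a FASTA record out of an HTML page using only the Python standard library. It is not the Bio.EZRetrieve code and does not contact any server. The sample page, the assumption that the record sits inside a <pre> element, and the names PreTextExtractor and extract_fasta_from_html are all hypothetical. It is written for current Python 3 rather than the Python 2 used elsewhere in this thread.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._depth = 0     # greater than zero while inside a <pre> element
        self.chunks = []    # text fragments seen inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_fasta_from_html(html):
    """Return (title, sequence) for the first FASTA record in the page.

    Hypothetical helper: assumes the record is plain text inside <pre>.
    Raises ValueError if no '>' header line is found, which is what a
    scraper like this runs into once the page layout changes.
    """
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.chunks).splitlines()]
    lines = [line for line in lines if line]
    for i, line in enumerate(lines):
        if line.startswith(">"):
            seq_lines = []
            for seq_line in lines[i + 1:]:
                if seq_line.startswith(">"):
                    break  # stop at the next record
                seq_lines.append(seq_line)
            return line[1:], "".join(seq_lines)
    raise ValueError("No FASTA record found in page")


# Made-up page standing in for a real server response.
SAMPLE_PAGE = """<html><body><h1>Result</h1>
<pre>
>example_id made-up description
ACGTACGTAC
GGTTAACC
</pre>
<p>Other metadata that the scraper ignores</p>
</body></html>"""

title, sequence = extract_fasta_from_html(SAMPLE_PAGE)
print(title)
print(sequence)
```

The sketch depends entirely on the <pre> assumption: if the markup changes so that the record is no longer inside such an element, the helper raises ValueError, which is the kind of breakage the message above warns about.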