From mauriceling at acm.org Thu Dec 8 12:16:40 2005
From: mauriceling at acm.org (Maurice Ling)
Date: Thu Dec 8 12:28:47 2005
Subject: [BioPython] problem with GenBank
Message-ID: <43986A78.6020704@acm.org>

Hi,

I'm using BioPython 1.41 with Python 2.4.1. I'm trying to use the GenBank module:

Python 2.4.1 (#1, Sep 16 2005, 17:46:53)
[GCC 4.0.0 (Apple Computer, Inc. build 5026)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import GenBank
>>> GenBank.search_for('OCT4')

But this takes forever. Any idea?

Cheers
Maurice

From idoerg at burnham.org Thu Dec 8 12:55:05 2005
From: idoerg at burnham.org (Iddo Friedberg)
Date: Thu Dec 8 13:02:55 2005
Subject: [BioPython] problem with GenBank
In-Reply-To: <43986A78.6020704@acm.org>
References: <43986A78.6020704@acm.org>
Message-ID: <43987379.1000601@burnham.org>

Hi Maurice,

Try narrowing down your initial search space. For example:

GenBank.search_for('OCT4', database='protein')

or:

GenBank.search_for('OCT4', database='nucleotide')

This actually completes, although slowly. I am puzzled as to why this is so slow myself, especially as other queries seem to complete in reasonable time. If anyone out there has any ideas, please share them with us.

./I

Maurice Ling wrote:

> Hi,
>
> I'm using BioPython 1.41 with Python 2.4.1. I'm trying to use GenBank
> module,
>
> Python 2.4.1 (#1, Sep 16 2005, 17:46:53)
> [GCC 4.0.0 (Apple Computer, Inc. build 5026)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from Bio import GenBank
> >>> GenBank.search_for('OCT4')
>
> But this takes forever. Any idea?
>
> Cheers
> Maurice
>
>
>_______________________________________________
>BioPython mailing list - BioPython@biopython.org
>http://biopython.org/mailman/listinfo/biopython
>
>

--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://ffas.ljcrf.edu/~iddo

From eko at eijkman.go.id Fri Dec 9 05:26:04 2005
From: eko at eijkman.go.id (Ismail Ekoprayitno Rozi)
Date: Fri Dec 9 05:44:38 2005
Subject: [BioPython] Problem to open PDBid 2mbw
Message-ID: <20051209172604.g5iabrowqoo84w4c@home.eijkman.go.id>

Hi,

I'm not very experienced with Python, and I'd like to use Bio.PDB (from Biopython 1.41) in my scripts. When I try to open PDB ID 2mbw, the error below comes up (other PDB IDs seem to open normally). Any idea how to solve it?

Thanks for your help,
I. E. Rozi

-------------
Python 2.3.5 (#2, Aug 30 2005, 15:50:26)
[GCC 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.PDB import PDBParser
>>> p = PDBParser(PERMISSIVE=1)
>>> structure = p.get_structure("2mbw","./pdb2mbw.ent")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/site-packages/Bio/PDB/PDBParser.py", line 66, in get_structure
    self._parse(file.readlines())
  File "/usr/lib/python2.3/site-packages/Bio/PDB/PDBParser.py", line 87, in _parse
    self.trailer=self._parse_coordinates(coords_trailer)
  File "/usr/lib/python2.3/site-packages/Bio/PDB/PDBParser.py", line 179, in _parse_coordinates
    structure_builder.init_residue(resname, hetero_flag, resseq, icode)
  File "/usr/lib/python2.3/site-packages/Bio/PDB/StructureBuilder.py", line 155, in init_residue
    self.chain.add(residue)
  File "/usr/lib/python2.3/site-packages/Bio/PDB/Entity.py", line 80, in add
    raise PDBConstructionException, "%s defined twice" % entity.get_full_id()
  File "/usr/lib/python2.3/site-packages/Bio/PDB/Entity.py", line 132, in get_full_id
    parent=self.get_parent()
  File "/usr/lib/python2.3/site-packages/Bio/PDB/Entity.py", line 102, in get_parent
    raise PDBException, 'No parent'
PDBException: No parent

From thamelry at binf.ku.dk Fri Dec 9 05:59:39 2005
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri Dec 9 06:19:08 2005
Subject: [BioPython] Problem to open PDBid 2mbw
In-Reply-To: <20051209172604.g5iabrowqoo84w4c@home.eijkman.go.id>
References: <20051209172604.g5iabrowqoo84w4c@home.eijkman.go.id>
Message-ID: <33231.87.72.27.226.1134125979.squirrel@www.binf.ku.dk>

>
> Hi,
> I'm not very experienced with Python, and I'd like to use Bio.PDB
> (from Biopython 1.41) in my scripts.
> When I try to open PDB ID 2mbw, the error below comes up (other
> PDB IDs seem to open normally).
> Any idea how to solve it?

Hi,

Water 230 is defined twice (the second water should be Water 239, BTW), so it's a buggy PDB file. But with PERMISSIVE=1 this should normally have been ignored, so it's also a bug in Bio.PDB - I'll fix it in CVS.
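
For anyone who wants to check a file for this kind of duplicate before handing it to a parser, here is a minimal sketch. It is not part of Bio.PDB: the function name find_duplicate_residues and the water() test-data helper are made up for illustration, and it assumes standard fixed-column ATOM/HETATM records (residue name in columns 18-20, chain in column 22, residue number in columns 23-26, insertion code in column 27).

```python
from collections import Counter


def find_duplicate_residues(pdb_lines):
    """Return (chain, resname, resseq, icode) ids that are defined more than once.

    Hypothetical helper, not a Bio.PDB function. A residue counts as defined
    again when its id reappears after another residue, or when an atom name
    repeats within the same residue id.
    """
    counts = Counter()
    current = None       # id of the residue block being read
    seen_atoms = set()   # (atom name, altloc) pairs seen in that block
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break        # only look at the first model
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atom = (line[12:16], line[16])
        res_id = (line[21], line[17:20].strip(), int(line[22:26]), line[26].strip())
        if res_id != current or atom in seen_atoms:
            counts[res_id] += 1
            current = res_id
            seen_atoms = set()
        seen_atoms.add(atom)
    return sorted(r for r, n in counts.items() if n > 1)


def water(serial, resseq):
    """Build a fixed-column HETATM record for a water oxygen (test data only)."""
    return "HETATM%5d  O   HOH A%4d    " % (serial, resseq)


# Made-up records: water 230 appears twice, as in the report above.
records = [water(1, 229), water(2, 230), water(3, 231), water(4, 230)]
duplicates = find_duplicate_residues(records)
```

With the made-up records above, duplicates holds the single id for water 230 in chain A. A real file would be passed in as open(path).readlines().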
Cheers, -Thomas From douglas.kojetin at gmail.com Fri Dec 9 10:25:02 2005 From: douglas.kojetin at gmail.com (Douglas Kojetin) Date: Fri Dec 9 11:18:46 2005 Subject: [BioPython] using a variable as input for restriction enzyme Message-ID: <7D01DE19-C38B-48AB-87DB-A2EBC139E726@gmail.com> Hi All- How do I use a variable to specify the restriction enzyme? Example -- instead of using the following call: Restriction.EcoR1.elucidate() instead (somehow) use a variable to specify the restriction enzyme: rsite='EcoR1' Restriction.rsite.elucidate() Thanks, Doug From fkauff at duke.edu Fri Dec 9 12:14:03 2005 From: fkauff at duke.edu (Frank Kauff) Date: Fri Dec 9 13:16:21 2005 Subject: [BioPython] using a variable as input for restriction enzyme In-Reply-To: <7D01DE19-C38B-48AB-87DB-A2EBC139E726@gmail.com> References: <7D01DE19-C38B-48AB-87DB-A2EBC139E726@gmail.com> Message-ID: <1134148444.4731.13.camel@osiris.biology.duke.edu> On Fri, 2005-12-09 at 10:25 -0500, Douglas Kojetin wrote: > Hi All- > > How do I use a variable to specify the restriction enzyme? > > Example -- instead of using the following call: > > Restriction.EcoR1.elucidate() > > instead (somehow) use a variable to specify the restriction enzyme: > > rsite='EcoR1' > Restriction.rsite.elucidate() > Not sure, but would Restriction.__getattribute__(rsite).elucidate() work? 
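
For reference, the general Python idiom for looking up an attribute by a name held in a variable is the built-in getattr. The sketch below uses a stand-in namespace rather than Bio.Restriction itself, so FakeEnzyme, enzymes and lookup are all hypothetical names; with the real module one would pass the module object in place of enzymes.

```python
import types


class FakeEnzyme:
    """Stand-in for an enzyme object; only mimics an elucidate() method."""

    def __init__(self, site):
        self.site = site

    def elucidate(self):
        return self.site


# Stand-in for a module namespace such as Bio.Restriction (assumption:
# the real module exposes one attribute per enzyme name).
enzymes = types.SimpleNamespace(EcoRI=FakeEnzyme("G^AATT_C"))


def lookup(namespace, name):
    """Return namespace.<name>, or raise a readable error if it is missing."""
    try:
        return getattr(namespace, name)
    except AttributeError:
        raise ValueError("unknown enzyme name: %r" % name)


rsite = "EcoRI"
site = lookup(enzymes, rsite).elucidate()
```

Note that the name has to match exactly: in this stand-in, 'EcoR1' (with a digit one) is not the same attribute as 'EcoRI' and raises the ValueError.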
Frank > Thanks, > Doug > > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython -- From jstroud at mbi.ucla.edu Fri Dec 9 14:57:05 2005 From: jstroud at mbi.ucla.edu (James Stroud) Date: Fri Dec 9 15:32:38 2005 Subject: [BioPython] using a variable as input for restriction enzyme In-Reply-To: <7D01DE19-C38B-48AB-87DB-A2EBC139E726@gmail.com> References: <7D01DE19-C38B-48AB-87DB-A2EBC139E726@gmail.com> Message-ID: <200512091157.05155.jstroud@mbi.ucla.edu> Best would probably be getattr(Restriction, 'EcoRI').elucidate() On Friday 09 December 2005 07:25, Douglas Kojetin wrote: > Hi All- > > How do I use a variable to specify the restriction enzyme? > > Example -- instead of using the following call: > > Restriction.EcoR1.elucidate() > > instead (somehow) use a variable to specify the restriction enzyme: > > rsite='EcoR1' > Restriction.rsite.elucidate() > > Thanks, > Doug > > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com/ From km at mrna.tn.nic.in Tue Dec 13 22:39:03 2005 From: km at mrna.tn.nic.in (km) Date: Mon Dec 12 02:47:37 2005 Subject: [BioPython] Bio.SCOP Message-ID: <20051214033903.GA2373@mrna.tn.nic.in> hi all, Is there any short intro for usage of Bio.SCOP module? i need to index a astral db and use it to emit sequences with sid didnt find anything in biopython cookbook. unfortunately biopython is poorly documented; basic intro of this module appreciated. 
regards,
KM

From frederic.sohm at iaf.cnrs-gif.fr Mon Dec 12 07:13:38 2005
From: frederic.sohm at iaf.cnrs-gif.fr (Frederic Sohm)
Date: Mon Dec 12 07:38:17 2005
Subject: [BioPython] using a variable as input for restriction enzyme
Message-ID: <200512121313.38669.frederic.sohm@iaf.cnrs-gif.fr>

Hi,

While the two previous methods will work, I find it simpler to use any of the following.

This one is especially intended for this kind of problem:

>>> from Bio.Restriction import AllEnzymes
>>> rsite = 'EcoRI'
>>> AllEnzymes.get(rsite).elucidate()
'G^AATT_C'
>>>

If you use only commercial enzymes you can use:

>>> from Bio.Restriction import CommOnly
>>> rsite = 'EcoRI'
>>> CommOnly.get(rsite).elucidate()
'G^AATT_C'
>>>

Or you can also evaluate the string:

>>> from Bio import Restriction
>>> rsite = 'EcoRI'
>>> eval('Restriction.'+rsite).elucidate()
'G^AATT_C'

Another way, less verbose but one that will put a lot of names in your namespace:

>>> from Bio.Restriction import *
>>> rsite = 'EcoRI'
>>> eval(rsite).elucidate()
'G^AATT_C'
>>>

Hope this helps.

Best regards

Fred

--
Frédéric Sohm
Equipe INRA U1126
"Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax:+33 (0) 1 69 82 34 47

From alfl0002 at stud.uni-sb.de Mon Dec 12 11:21:24 2005
From: alfl0002 at stud.uni-sb.de (Aline Flockerzi)
Date: Mon Dec 12 12:13:43 2005
Subject: [BioPython] ClustalW parameter
Message-ID: <20051212172124.shgs9v6t73fo0gs4@webmail.stud.uni-saarland.de>

Hello!

Does anybody know how to change the parameter is_no_end_pen in ClustalW?
Thanks
Aline

From as_nascimento at yahoo.com.br Tue Dec 13 06:29:56 2005
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Tue Dec 13 06:34:06 2005
Subject: [BioPython] A Blast filter
Message-ID: <20051213112957.58550.qmail@web31008.mail.mud.yahoo.com>

Hi all,

I wonder if one has developed any filter for blast searching against a local database (like nr) before performing a multiple sequence alignment of large number of sequences...

Any tips are pretty appreciable!

Thanks in advance!

Alessandro
From eirik.sonneland at student.umb.no Tue Dec 13 07:07:21 2005
From: eirik.sonneland at student.umb.no (Eirik Sønneland)
Date: Tue Dec 13 07:36:58 2005
Subject: [BioPython] A Blast filter
In-Reply-To: <20051213112957.58550.qmail@web31008.mail.mud.yahoo.com>
References: <20051213112957.58550.qmail@web31008.mail.mud.yahoo.com>
Message-ID: <3036.128.39.177.29.1134475641.squirrel@webmail.umb.no>

Hi,

I'm working on a high-throughput SNP detection pipeline, BLASTing ~150 000 trace sequences against a contig/RefSeq database and a trace archive database downloaded from NCBI. To find SNPs, we repeat-mask the 150 000 traces AND the contig database BEFORE running BLAST. This way we focus the search on the NON-repeated regions of the sequences.

When selecting BLAST hits (HSPs) I use e-value (>=e-17), BLAST score (>=1050) and identity (>=97%). This is very stringent, but since I know my trace sequences are about 1000 bp, I make sure to select hits that have a minimum of 500 bp matching/aligned (BLAST score) and that have no less than 97% identity within this aligned region. You need to tune these parameters to meet your needs.

The code for this is what is explained in the cookbook for sorting blast.records. I can send you an extract from my script if you are interested.

Eirik Sønneland

> Hi all,
>
> I wonder if one has developed any filter for blast searching against a
> local database (like nr) before performing a multiple sequence alignment
> of large number of sequences...
>
> Any tips are pretty appreciable!
>
> Thanks in advance!
>
> Alessandro
>
> _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Dec 15 08:33:37 2005 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu Dec 15 09:00:18 2005 Subject: [BioPython] Bio.Geo for NCBI's GEO microarray SOFT files Message-ID: <43A170B1.5070007@maubp.freeserve.co.uk> Does anyone on the discussion list use GEO files? Peter -------- Original Message -------- Subject: [Biopython-dev] Bio.Geo for NCBI's GEO microarry SOFT files Date: Sat, 10 Dec 2005 18:39:13 +0000 From: Peter To: biopython-dev@biopython.org I've just been looking at the Bio.Geo module by Katharine Lindner, contributed back in 2002 which should parse the NCBI's Gene Expression Omnibus (GEO) microarray data files. http://www.ncbi.nlm.nih.gov/geo/ Is anyone using Bio.Geo at the moment? The NCBI seem to call these SOFT files, (*.soft) and the format is documented here: http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat Apparently in 2005, they began a switch to a revised file format, new format files here: ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/ Old format files here: ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old/ ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old_gz/ As far as I can tell, neither the "old" or "new" versions work in Bio.Geo, so there may have been another format change between 2002 and 2005. In addition the 2005 change introduces new lines, before and after the actual data: !dataset_table_begin !dataset_table_end These are definitely not supported in the current Martel grammar for GEO files. 
Peter _______________________________________________ Biopython-dev mailing list Biopython-dev@biopython.org http://biopython.org/mailman/listinfo/biopython-dev From sbassi at gmail.com Sun Dec 18 14:32:30 2005 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun Dec 18 14:29:40 2005 Subject: [BioPython] Biopython function in a Google module In-Reply-To: References: Message-ID: I've made a Google module, that is a little page that can be inserted into the Google personalized homepage (www.google.com/ig). This module is very simple, it is more a proof of concept than a useful module. If you want to test it, just log into your gmail account and then go to www.google.com/ig. After that, you will see your personilazed homepage. In the upperleft corner there is a "Add content" link. Click on it, and a frame will be displayed, you will see a "Create a section" form, then you type: http://www.bioinformatica.info/modulomt2.xml Press OK on the prompt about an external module (since it is not a Google module) and the module will be placed in your Google personalized homepage. Now I waiting for feedback and ideas for another biopython based module. -- La web sin popups ni spyware: Usa Firefox en lugar de Internet Explorer From biopython at maubp.freeserve.co.uk Sun Dec 18 15:40:15 2005 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun Dec 18 15:37:36 2005 Subject: [BioPython] Biopython function in a Google module In-Reply-To: References: Message-ID: <43A5C92F.7000305@maubp.freeserve.co.uk> Sebastian Bassi wrote: > I've made a Google module, that is a little page that can be inserted > into the Google personalized homepage (www.google.com/ig). This module > is very simple, it is more a proof of concept than a useful module. You haven't said what it does - but by looking at your XML file it turns out to be an Oligo Melting Point Calculator. http://www.bioinformatica.info/modulomt2.xml By the way, you have a typo in the XML file's description (point not poing). 
Screenshot: http://www.bioinformatica.info/omscreen.jpg Peter From sbassi at gmail.com Sun Dec 18 21:36:48 2005 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun Dec 18 22:00:31 2005 Subject: [BioPython] Biopython function in a Google module In-Reply-To: <43A5C92F.7000305@maubp.freeserve.co.uk> References: <43A5C92F.7000305@maubp.freeserve.co.uk> Message-ID: On 12/18/05, Peter wrote: > Sebastian Bassi wrote: > > I've made a Google module, that is a little page that can be inserted > > into the Google personalized homepage (www.google.com/ig). This module > > is very simple, it is more a proof of concept than a useful module. > You haven't said what it does - but by looking at your XML file it turns > out to be an Oligo Melting Point Calculator. Yes, sorry. Anyway, you can run it on the Google webpage to see it working. > By the way, you have a typo in the XML file's description (point not poing). Thank you, just changed it :) -- La web sin popups ni spyware: Usa Firefox en lugar de Internet Explorer From borreguero at gmail.com Tue Dec 20 18:54:23 2005 From: borreguero at gmail.com (Jose Borreguero) Date: Wed Dec 21 04:32:28 2005 Subject: [BioPython] Anybody used Align.fastapairwise? Message-ID: <7cced4ed0512201554p3e300c3bh@mail.gmail.com> I'm new to biopython. I wonder if anybody used the Align.fastapairwise module, to get me started with some example. jose -- Jose M. Borreguero jmborr@gatech.edu, www.borreguero.com phone: 404 707 8980 GCATT, biology department, Georgia Tech, 250 14St NW, Atlanta GA 30318 From volcs0 at gmail.com Wed Dec 21 10:31:47 2005 From: volcs0 at gmail.com (Sam Volchenboum) Date: Wed Dec 21 10:36:06 2005 Subject: [BioPython] Help with NaiveBayes Message-ID: <8e8165be0512210731s1df9b513n354b54cef9cf76ae@mail.gmail.com> I'm trying to get the NaiveBayes function up and running. I just can't find any examples out there to learn from (which is how I usually figure these things out). 
I have a set of proteins - say 100 - that are on/off in health/disease. I have 10 samples each of health and disease. This is mass spec data. So, I have a matrix where the rows are proteins (1-100) and the columns are health/disease (10 each, 20 total), and the cell contents are 1's and 0's (present/absent). I want to create a NaiveBayes classifier based on this training data and see if it predicts health/disease based on a new set of data (a new set of results for the 100 proteins). For the training_set, I've tried this format: [[1, 0, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]] which would be an example of three states and five proteins (on or off). And results like this: ['Healthy', 'Disease', 'Disease'] But I get an error on NaiveBayes.train(training_set, results) - the two lists need to be the same length (I thought they were... length = 3)... Any help, advice, push, shove... etc., is greatly appreciated. Thanks. sam From jchang at smi.stanford.edu Wed Dec 21 14:24:46 2005 From: jchang at smi.stanford.edu (jchang@smi.stanford.edu) Date: Wed Dec 21 16:13:35 2005 Subject: [BioPython] Help with NaiveBayes In-Reply-To: <20051221170618.GA385@oliphaunt.duhs.duke.edu> References: <8e8165be0512210731s1df9b513n354b54cef9cf76ae@mail.gmail.com> <20051221170618.GA385@oliphaunt.duhs.duke.edu> Message-ID: <20051221192444.GE385@oliphaunt.duhs.duke.edu> On Wed, Dec 21, 2005 at 10:31:47AM -0500, Sam Volchenboum wrote: > I'm trying to get the NaiveBayes function up and running. > > For the training_set, I've tried this format: > > [[1, 0, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]] > > which would be an example of three states and five proteins (on or off). The first one contains data for 6 proteins. If you make them all lists of length 5, the training completes. 
>>> from Bio import NaiveBayes
>>> training_set = [[1, 0, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]
>>> results = ['Healthy', 'Disease', 'Disease']
>>> nb = NaiveBayes.train(training_set, results)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Users/jchang/lib/jchang/python/Bio/NaiveBayes.py", line 146, in train
    raise ValueError, "observations have different dimensionality"
ValueError: observations have different dimensionality
>>> training_set = [[1, 0, 1, 1, 0], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]
>>> nb = NaiveBayes.train(training_set, results)
>>> NaiveBayes.classify(nb, [1, 0, 1, 1, 0])
'Healthy'
>>> NaiveBayes.classify(nb, [0, 0, 0, 1, 1])
'Disease'
>>>

Jeff

From winter at biotec.tu-dresden.de Thu Dec 22 06:17:35 2005
From: winter at biotec.tu-dresden.de (Christof Winter)
Date: Thu Dec 22 07:14:38 2005
Subject: [BioPython] Parse bit score with NCBIXML module from BLAST output
Message-ID: <43AA8B4F.9020203@biotec.tu-dresden.de>

Hi all,

I just tried to parse a BLAST XML output. To my surprise, the parser always returns None for the bit scores (hsp.bits), whereas raw scores (hsp.score) are fine. Is this a known issue? Or a problem with my installation?

This is my code:

from Bio.Blast import NCBIXML

parser = NCBIXML.BlastParser()
record = parser.parse(open("blastout.xml"))
for alignment in record.alignments:
    for hsp in alignment.hsps:
        print hsp.score
        print hsp.bits
        print

The relevant part from blastout.xml looks like this:

<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|56554556|pdb|1XI5|I</Hit_id>
  <Hit_def>Chain I, Clathrin D6 Coat</Hit_def>
  <Hit_accession>1XI5-I</Hit_accession>
  <Hit_len>1630</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>3254.54</Hsp_bit-score>
      <Hsp_score>8437</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      ...

The corresponding output of my script is:

8437.0
None

I checked the 1.41 version of Bio.Blast.NCBIXML.BlastParser() and could not find any evidence for parsing of the <Hsp_bit-score> tag.

Any solution?
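
One possible workaround until the parser handles this is to read the value straight from the XML with the standard library. This is only a sketch: it assumes the NCBI BLAST XML element names Hsp_bit-score and Hsp_score, the SAMPLE string is a trimmed stand-in for a real report, and hsp_scores is a hypothetical helper, not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the relevant part of a BLAST XML report.
SAMPLE = """
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_def>Chain I, Clathrin D6 Coat</Hit_def>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>3254.54</Hsp_bit-score>
      <Hsp_score>8437</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
    </Hsp>
  </Hit_hsps>
</Hit>
"""


def hsp_scores(xml_text):
    """Return a list of (raw score, bit score) pairs, one per Hsp element."""
    root = ET.fromstring(xml_text)
    scores = []
    for hsp in root.iter("Hsp"):
        raw = float(hsp.findtext("Hsp_score"))
        bits = float(hsp.findtext("Hsp_bit-score"))
        scores.append((raw, bits))
    return scores


scores = hsp_scores(SAMPLE)
```

For a whole report file, ET.parse(filename).getroot() could replace ET.fromstring, at the cost of loading the full document into memory.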
Thanks,
Christof

From e.picardi at unical.it Wed Dec 21 10:45:38 2005
From: e.picardi at unical.it (Ernesto)
Date: Thu Dec 22 07:27:57 2005
Subject: [BioPython] simple class to generate random trees
References: <7cced4ed0512201554p3e300c3bh@mail.gmail.com>
Message-ID: <002201c60645$9b889f10$572561a0@mirko84cf0g99i>

Hi all,

I wrote a simple class to generate random phylogenetic trees, both clock-like (Kuhner and Felsenstein 1994) and non-clock-like (Guindon and Gascuel 2002). I attached a copy. I don't know if it could be useful for BioPython users.

Using RandomTree:

>>> from RandomTree import RandomTree
>>> tree = RandomTree()
>>> tree.nobr
0
>>> tree.ntips=4
>>> tree.constant_tree()
'((T3:0.01516,T4:0.01516):0.01332,(T1:0.02643,T2:0.02643):0.00205);'
>>> tree.variable_tree()
'(((T4:0.00050,T3:0.00954):0.00009,T2:0.01591):0.00595,T1:0.00531);'
>>> tree.nobr=1
>>> tree.constant_tree()
'((T2,(T1,T4)),T3);'
>>> tree.variable_tree()
'((T1,(T3,T2)),T4);'

The attributes of tree are:

ntips: number of tips for tree --> default is 10
nobr: 1 for trees without branch lengths --> default is 0
pm: probability of change per unit time --> default is 0.03
shape: gamma shape for variable trees --> default is 0.5
mean: mean of gamma distribution for variable trees --> default is 1

Regards,
Ernesto Picardi

PS: there is also an on-line version at: http://biologia.unical.it/py_script/tree.html

Ernesto Picardi, PhD
Dept. of Cell Biology
University of Calabria
87036, Arcavacata di Rende (CS)
Italy
Phone: +39 0984 492937
Fax: +39 0984 492911
E-mail: e.picardi@unical.it

-------------- next part --------------
"""
RandomTree is a simple class to generate random rooted trees.
Clock-like trees are generated according to the methodology of
Kuhner and Felsenstein (1994) Mol. Biol. Evol. 11: 459-468,
whereas non-clock-like trees are created following
Guindon and Gascuel (2002) Mol. Biol. Evol. 19: 534-543.
Once a clock-like tree is generated, each branch length is multiplied
by a gamma-distributed factor. If the mean of this distribution
is equal to 1 and the shape fixed to 0.5, then the departure from
molecular clock is strong. The opposite situation is when gamma
shape is fixed to 2.0.

When the RandomTree class is invoked a simple object is created.
It contains:

ntips: number of tips for tree --> default is 10
nobr: 1 for trees without branch lengths --> default is 0
pm: probability of change per unit time --> default is 0.03
shape: gamma shape for variable trees --> default is 0.5
mean: mean of gamma distribution for variable trees --> default is 1

USING this class:

>>> from RandomTree import RandomTree
>>> tree = RandomTree()
>>> tree.nobr
0
>>> tree.ntips=4
>>> tree.constant_tree()
'((T3:0.01516,T4:0.01516):0.01332,(T1:0.02643,T2:0.02643):0.00205);'
>>> tree.variable_tree()
'(((T4:0.00050,T3:0.00954):0.00009,T2:0.01591):0.00595,T1:0.00531);'
>>> tree.nobr=1
>>> tree.constant_tree()
'((T2,(T1,T4)),T3);'
>>> tree.variable_tree()
'((T1,(T3,T2)),T4);'

Copyright (c) 2004-2005, Ernesto Picardi.
This class comes with ABSOLUTELY NO WARRANTY.
"""

import math, random, re, sys   # standard modules only


class RandomTree:

    def __init__(self, ntips=10, nobr=0, pm=0.03, shape=0.5, mean=1):
        # The attribute is called ntips (not alltips) so that it matches
        # the documented usage "tree.ntips=4".
        self.ntips = ntips   # number of tips
        self.nobr = nobr     # 1 = leave out branch lengths
        self.pm = pm         # probability of change per unit time
        self.shape = shape   # gamma shape parameter
        self.mean = mean     # mean of gamma distribution

    def constant_tree(self):
        """Generate a clock-like tree and return it as a Newick string."""
        if self.ntips <= 2:
            sys.exit('At least three tips. Bye.')
        tips = ["T" + str(i) for i in range(1, self.ntips + 1)]
        Lb = [0] * len(tips)   # branch length accumulated by each lineage
        n = 1
        dictionary = {}        # node placeholder -> Newick text of the node
        while len(tips) != 1:
            # waiting time until the next join, rounded to 5 decimals
            R = random.random()
            tyme = (-(math.log(R)) / len(tips)) * self.pm
            brlens = float('%.5f' % tyme)
            for i in range(len(tips)):
                Lb[i] = Lb[i] + brlens
            nodeName = '@node%04i@' % n
            # pick two lineages at random and join them
            s1 = random.choice(tips)
            i1 = '%.5f' % Lb[tips.index(s1)]
            del Lb[tips.index(s1)]
            tips.remove(s1)
            s2 = random.choice(tips)
            i2 = '%.5f' % Lb[tips.index(s2)]
            del Lb[tips.index(s2)]
            tips.remove(s2)
            if self.nobr:
                nodo = "(" + s1 + "," + s2 + ")"
            else:
                nodo = "(" + s1 + ":" + i1 + "," + s2 + ":" + i2 + ")"
            dictionary[nodeName] = nodo
            tips.append(nodeName)
            Lb.append(0)
            n += 1
        findNodes = re.compile(r"@node.*?@", re.I)   # to identify a node name
        lastNode = max(dictionary.keys())
        treestring = lastNode
        # expand node placeholders until only tip names are left
        while 1:
            nodeList = findNodes.findall(treestring)
            if nodeList == []:
                break
            for element in nodeList:
                treestring = treestring.replace(element, dictionary[element])
        return treestring + ';'

    def variable_tree(self):
        """Generate a non-clock-like tree from a clock-like one."""
        treestring = self.constant_tree()
        findbr = re.compile(r":[0-9]+\.[0-9]+[\),]")
        allbr = findbr.findall(treestring)
        dicbr = {}
        # random.gammavariate takes a scale parameter, so scale = mean/shape
        # gives a factor with the requested mean (was shape/mean).
        beta = self.mean / float(self.shape)
        for i in allbr:
            br = (i.split(':'))[1]
            brval = float(br.strip('),'))
            gammafactor = random.gammavariate(self.shape, beta)
            dicbr[i] = '%.5f' % (brval * gammafactor)
        for j in dicbr:
            if ',' in j:
                treestring = treestring.replace(j, ':' + dicbr[j] + ',')
            elif ')' in j:
                # was replace(i, ...), which re-used the stale variable i
                treestring = treestring.replace(j, ':' + dicbr[j] + ')')
        return treestring

From mdehoon at c2b2.columbia.edu Thu Dec 22 07:38:08 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu Dec 22 07:43:49 2005
Subject: RE : [BioPython] Parse bit score with NCBIXML module from BLAST output
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECDD3@cgcmail.cgc.cpmc.columbia.edu>

> I checked the 1.41 version of Bio.Blast.NCBIXML.BlastParser() and could
> not find any evidence for parsing of the <Hsp_bit-score> tag.
> Any solution?

If this is a recent addition to Blast reports, it may not have existed at the time NCBIXML.BlastParser was written. Can you write a patch to NCBIXML.BlastParser?

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Fri Dec 23 12:26:29 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Fri Dec 23 12:25:38 2005
Subject: [BioPython] Parse bit score with NCBIXML module from BLAST output
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECDD9@cgcmail.cgc.cpmc.columbia.edu>

Hi everybody,

Christof has sent me a patch to NCBIXML.py for this bug, which is in CVS now. Thanks again, Christof!

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

-------- Original Message --------
From: biopython-bounces@portal.open-bio.org on behalf of Christof Winter
Date: Thu 12/22/2005 6:17
To: biopython@biopython.org
Subject: [BioPython] Parse bit score with NCBIXML module from BLAST output

Hi all,

I just tried to parse a BLAST XML output. To my surprise, the parser always returns None for the bit scores (hsp.bits), whereas raw scores (hsp.score) are fine. Is this a known issue? Or a problem with my installation?

This is my code:

from Bio.Blast import NCBIXML

parser = NCBIXML.BlastParser()
record = parser.parse(open("blastout.xml"))
for alignment in record.alignments:
    for hsp in alignment.hsps:
        print hsp.score
        print hsp.bits
        print

The relevant part from blastout.xml looks like this:

<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|56554556|pdb|1XI5|I</Hit_id>
  <Hit_def>Chain I, Clathrin D6 Coat</Hit_def>
  <Hit_accession>1XI5-I</Hit_accession>
  <Hit_len>1630</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>3254.54</Hsp_bit-score>
      <Hsp_score>8437</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      ...

The corresponding output of my script is:

8437.0
None

I checked the 1.41 version of Bio.Blast.NCBIXML.BlastParser() and could not find any evidence for parsing of the <Hsp_bit-score> tag.

Any solution?
Thanks,
Christof

From srini_iyyer_bio at yahoo.com Sat Dec 24 12:49:40 2005
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Sat Dec 24 12:54:17 2005
Subject: [BioPython] Bug in GenBank module - record.feature method ?
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECDD9@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <20051224174941.31939.qmail@web31610.mail.mud.yahoo.com>

Hi group,

I have been working on parsing the GO annotations out of the FEATURES section of GenBank records. The GO annotations are incorporated into the CDS entry of the FEATURES section.

Here is my script:

from Bio import GenBank
from sets import Set
import glob

def FeatureGOparser(myseq):
    parser = GenBank.RecordParser()
    record = parser.parse(open(myseq,'r'))
    ## this gives the CDS in features; CDS is second in features section ###
    feature = record.features[2]
    # extracts '/note' part
    golist = feature.qualifiers[1].value
    go_list = golist.split(';')    # split by ';' to get GO secs.
    refseq = record.version        # getting NM_
    gene_name = (feature.qualifiers[0].value).split('"')[1]
    go_comp = []
    go_func = []
    go_proc = []
    # from here trimming of GO list obtained above #
    for i in go_list:
        line = i.strip()
        if line.startswith('go_component'):
            iline = line.split('[')[0]
            com_line = gene_name+'\t'+refseq+'\t'+iline
            go_comp.append(com_line)
    for m in go_list:
        line = m.strip()
        if line.startswith('go_function'):
            mline = line.split('[')[0]
            func_line = gene_name+'\t'+refseq+'\t'+mline
            go_func.append(func_line)
    for j in go_list:
        line = j.strip()    # was i.strip(), which re-used the stale loop variable i
        if line.startswith('go_process'):
            jline = line.split('[')[0]
            proc_line = gene_name+'\t'+refseq+'\t'+jline
            go_proc.append(proc_line)
    unique_go_comp = Set(go_comp)
    unique_go_func = Set(go_func)
    unique_go_proc = Set(go_proc)
    for x in unique_go_comp:
        print x
    for y in unique_go_func:
        print y
    for z in unique_go_proc:
        print z

files = glob.glob('/home/seq/genbank/refseq/*')

def main():
    for each in files:
        FeatureGOparser(each)

main()

Error:

RGS3 NM_144489.2 go_component: membrane
RGS3 NM_144489.2 go_component: cytosol
RGS3 NM_144489.2 go_component: nucleus
RGS3 NM_144489.2 go_function: protein binding
RGS3 NM_144489.2 go_function: GTPase activator activity
RGS3 NM_144489.2 go_function: signal transducer activity
RGS3 NM_144489.2 go_process: regulation of G-protein coupled receptor protein signaling pathway
FREQ NM_014286.2 go_component: Golgi stack
FREQ NM_014286.2 go_function: calcium ion binding
DKFZp686O24166 NM_001009913.1 go_function: structural molecule activity
Traceback (most recent call last):
  File "genbank_go_parser_ver2.py", line 58, in ?
    main()
  File "genbank_go_parser_ver2.py", line 57, in main
    FeatureGOparser(each)
  File "genbank_go_parser_ver2.py", line 11, in FeatureGOparser
    feature = record.features[2]
IndexError: list index out of range

# feature = record.features[2] ### this gives the CDS part.

I see that the CDS is not always at record.features[2].
For many sequences in the FEATURES section, 'gene' is always followed
by 'CDS'. However, in some new RefSeq sequences a 'variation'
sub-section is now incorporated. This is the trouble, I guess.

So there are two things that I need some help/suggestions/comments on.

1. Is there a more technical way to parse the '/note' sub-section in
the CDS section of FEATURES? Do you think what I am doing
(record.features[2]) is a novice approach and not technically correct?
Please let me know what the best process is.

2. Is there any other way to parse GO annotations for all RefSeq
sequences in GenBank format?

thanks
srini

From amorgan at mitre.org Sat Dec 24 15:16:01 2005
From: amorgan at mitre.org (Alexander Morgan)
Date: Sat Dec 24 15:40:10 2005
Subject: [BioPython] Bug in GenBank module - record.feature method ?
In-Reply-To: <20051224174941.31939.qmail@web31610.mail.mud.yahoo.com>
Message-ID:

If you're not tied to the GenBank format, you might have greater ease
getting the GO codes associated with EntrezGene identifiers.

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

The gene2go file contains the GO code assignments, and the
gene2accession file the mappings to other accession numbers; there is
also a gene2refseq file.

They are all in an easy-to-parse tab-delimited format, and you can zip
through them pretty fast and load what you need into a hash, a shelf
file or a DB.

On 12/24/05 9:49 AM, "Srinivas Iyyer" wrote:

> 2. Is there any other way to parse GO annotations for
> all RefSeq sequences in GenBank format?

From srini_iyyer_bio at yahoo.com Sat Dec 24 17:13:21 2005
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Sat Dec 24 17:17:03 2005
Subject: [BioPython] Bug in GenBank module - record.feature method ?
In-Reply-To:
Message-ID: <20051224221321.4948.qmail@web31605.mail.mud.yahoo.com>

Aaaaaaaaagrrrrrrrrrhhhhh .....
Why was I floundering when I worked on this gene2go file in the past?
I wasted a lot of time despite knowing this fact.... Anyway, thank you
very much.

--- Alexander Morgan wrote:

> If you're not tied to the GenBank format, you might have greater ease
> getting the GO codes associated with EntrezGene identifiers.
>
> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
>
> The gene2go file contains the GO code assignments, and the
> gene2accession file the mappings to other accession numbers; there is
> also a gene2refseq file.
>
> They are all in an easy-to-parse tab-delimited format, and you can zip
> through them pretty fast and load what you need into a hash, a shelf
> file or a DB.
>
> On 12/24/05 9:49 AM, "Srinivas Iyyer" wrote:
>
> > 2. Is there any other way to parse GO annotations for
> > all RefSeq sequences in GenBank format?

From srini_iyyer_bio at yahoo.com Sun Dec 25 00:47:31 2005
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Sun Dec 25 00:51:43 2005
Subject: [BioPython] Bug in GenBank module - record.feature method ?
In-Reply-To: <20051224221321.4948.qmail@web31605.mail.mud.yahoo.com>
Message-ID: <20051225054731.37082.qmail@web31606.mail.mud.yahoo.com>

One problem (which might be somewhat troubling) with gene2go is that
the annotations are written without mentioning the GO category. It
would have been good if gene2go looked like this:

9606    2345    ISS    go_process: glucose metabolism    1348503
9606    2345    ISS    go_function: NADP transporter
9606    2345    ISS    go_component: cytoplasm

The reason I tried to parse the records my own way is to get each GO
annotation according to its category, so that in future my enrichment
analysis on GO categories can be more meaningful.

However, I successfully parsed GenBank records for GO categories.
My output looks like this now:

CNOT6    NM_015455.3    go_component: nucleus
CNOT6    NM_015455.3    go_function: hydrolase activity
CNOT6    NM_015455.3    go_function: RNA binding
CNOT6    NM_015455.3    go_function: magnesium ion binding
CNOT6    NM_015455.3    go_function: exonuclease activity
CNOT6    NM_015455.3    go_process: regulation of transcription, DNA-dependent

I can now work more easily with the GO annotations in this format.

Thanks again for reminding me about gene2go.

From biopython at maubp.freeserve.co.uk Mon Dec 26 12:05:20 2005
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon Dec 26 12:26:59 2005
Subject: [BioPython] Bug in GenBank module - record.feature method ?
In-Reply-To: <20051224174941.31939.qmail@web31610.mail.mud.yahoo.com>
References: <20051224174941.31939.qmail@web31610.mail.mud.yahoo.com>
Message-ID: <43B022D0.5040001@maubp.freeserve.co.uk>

Srinivas Iyyer wrote:
> Hi group, I have been working to parse out the GO annotations from
> FEATURE section of GenBank record.
...
>     feature = record.features[2]
>     golist = feature.qualifiers[1].value  # extracts '/Note' part
>     go_list = golist.split(';')  # split by ';' to get GO secs.
...
> # feature = record.features[2] ### this gives the CDS part.
> record.features[2]. I see that this order is not always true. For
> many sequences in FEATURES section, 'gene' is always followed by
> 'CDS'. However in some new RefSeq sequences, 'variation' sub-section
> is incorporated now. this is the trouble, I guess.
...
> 1. Is there any more technical way to parse '/note' sub-section in
> CDS section of FEATURES. Do you think what I am doing
> (record.features[2]) is more novice and not technical/correct. Please
> let me know what is the best process.

You are doing this:

    feature = record.features[2]

It depends on the feature you want being the third one in the record's
feature list (zero-based counting: 0, 1, 2).
You might be better off doing something like:

    for feature in record.features :
        if feature.type=="CDS" :
            #Do stuff...

Also, once you have found the feature(s) you are interested in, the
qualifiers property is a python dictionary. You should be able to
access the /note entry from the GenBank feature record by:

    notes = feature.qualifiers['note']

This will be a list - for some things (like db_xref) there can be
several different entries for a single feature. For others, like the
translation, there should be only one. I'm not sure what happens with
notes.

You could try something like:

    go_list = []
    for note in feature.qualifiers['note'] :
        go_list.extend(note.split(';'))

Peter

From biopyte at yahoo.de Fri Dec 30 14:57:10 2005
From: biopyte at yahoo.de (Hans Meier)
Date: Fri Dec 30 15:00:34 2005
Subject: [BioPython] extract non-coding sequence data from .gbk with absolute coordinates
Message-ID: <20051230195710.17729.qmail@web26313.mail.ukl.yahoo.com>

Dear friends,

whole-genome GenBank files include the genomic nucleotide sequence as
the last record, named 'Origin'. Is it possible with the Bio tools to
extract sequence data from the origin by giving absolute coordinates?
That is, is there a Bio way to also read *non-coding* sequence data
from a .gbk file?

Best Regards,
Harald

From biopyte at yahoo.de Fri Dec 30 15:47:56 2005
From: biopyte at yahoo.de (Hans Meier)
Date: Fri Dec 30 15:51:21 2005
Subject: [BioPython] parsing error with GenBank.RecordParser
Message-ID: <20051230204756.25343.qmail@web26311.mail.ukl.yahoo.com>

Hi,

parsing of NC_000913.gbk does not work.

Greets,
Harald

***********************************************
>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> record = parser.parse(open('NC_000913.gbk'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.3/site-packages/Bio/GenBank/__init__.py", line 240, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.3/site-packages/Bio/GenBank/__init__.py", line 1259, in feed
    self._parser.parseFile(handle)
  File "/usr/lib/python2.3/site-packages/Martel/Parser.py", line 328, in parseFile
    self.parseString(fileobj.read())
  File "/usr/lib/python2.3/site-packages/Martel/Parser.py", line 356, in parseString
    self._err_handler.fatalError(result)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 856717
*************************************************

Beyond character 856717 comes the following:

/note="2'-(5"-phosphoribosyl)-3'-dephospho-CoA transferase;
holo-citrate lyase synthase; CitG forms the prosthetic group
precursor 2'-(5"-triphosphoribosyl)-3'-dephospho-CoA which is
then transferred to apo-ACP by CitX to produce holo-ACP and
pyrophosphate; go_process: protein modification [goid 0006464]"

From biopyte at yahoo.de Fri Dec 30 21:54:37 2005
From: biopyte at yahoo.de (Hans Meier)
Date: Fri Dec 30 21:58:06 2005
Subject: [BioPython] Large GenBank files: impossible to handle?
Message-ID: <20051231025437.5979.qmail@web26304.mail.ukl.yahoo.com>

Dear friends,

I tried to handle a .gbk file of 4.7 MB in size on a "700MHz, Pentium
III, 256 MB RAM" box. Parsing with "RecordParser" and indexing with
"index_file" crashed the machine in both cases, and I had to reboot
(which does not happen often with Debian).

My final goal is to access the .gbk file somehow like a database. The
alternative would be to use .fna, .faa and .fnn files and write my own
methods, or to stuff all the data into an SQL database. But I still
hope that Biopython could help.
Before I spend more time on this, I'd like to ask you: With the
Biopython tools, is it possible to handle .gbk files of about 5 MB in
a reasonable time with a low- to middle-class desktop computer? If so,
how?

All the best and a Happy New Year,
Harald

From sbassi at gmail.com Sat Dec 31 10:09:26 2005
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sat Dec 31 10:14:08 2005
Subject: [BioPython] Large GenBank files: impossible to handle?
In-Reply-To: <20051231025437.5979.qmail@web26304.mail.ukl.yahoo.com>
References: <20051231025437.5979.qmail@web26304.mail.ukl.yahoo.com>
Message-ID:

On 12/30/05, Hans Meier wrote:
> I tried to handle a .gbk file of 4,7MB in size
> with a "700MHz, Pentium III, 256 MB RAM"-box.
...
> With the Biopython tools, is it possible to handle
> .gbk files of about 5MB in a reasonable time with
> a low- to middle-class desktop computer? If so, how?

Your computer is not underpowered and the file is not so large, so it
should not hang. Could you provide the code for us to check? (And the
data file too; you should upload it to an FTP/web server if the data
is public.)

--
The web without popups or spyware: use Firefox instead of Internet
Explorer
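
For the two questions raised above (reading sequence from the ORIGIN block by absolute coordinates, and doing so on a machine with little memory), here is a minimal sketch that does not use Biopython at all. It assumes the usual GenBank flat-file layout: a line starting with ORIGIN, then sequence lines that begin with the 1-based position of their first base, then a closing "//" line. The function name read_origin_slice and the tiny demo record are hypothetical, made up for illustration only; this is one possible approach, not the method the Biopython parsers use.

```python
import io


def read_origin_slice(handle, start, end):
    """Return bases start..end (1-based, inclusive) from the ORIGIN block.

    Hypothetical helper, not part of Biopython. The file is read line by
    line, so only the requested slice is held in memory.
    """
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    pieces = []
    in_origin = False
    for line in handle:
        if not in_origin:
            # Skip everything until the ORIGIN header line.
            in_origin = line.startswith("ORIGIN")
            continue
        if line.startswith("//"):
            break  # end of record
        fields = line.split()
        if not fields:
            continue
        line_start = int(fields[0])       # position of first base on this line
        bases = "".join(fields[1:])       # drop the spaces between 10-mers
        line_end = line_start + len(bases) - 1
        if line_end < start:
            continue                      # slice has not begun yet
        if line_start > end:
            break                         # slice is complete
        lo = max(start, line_start) - line_start
        hi = min(end, line_end) - line_start + 1
        pieces.append(bases[lo:hi])
    return "".join(pieces)


# Made-up demo record: 30 bases spread over two sequence lines.
demo = io.StringIO(
    "LOCUS       DEMO\n"
    "ORIGIN\n"
    "        1 acgtacgtac gtacgtacgt\n"
    "       21 ttttccccgg\n"
    "//\n"
)
print(read_origin_slice(demo, 19, 24))  # prints gttttt
```

Because the coordinates are taken straight from the ORIGIN block, coding and non-coding regions are treated alike. With a real file one would pass open('NC_000913.gbk') as the handle; for a file holding several records, this sketch only reads the first ORIGIN block it meets.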