From biopython-dev at maubp.freeserve.co.uk  Sat Dec 10 13:39:13 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat Dec 10 14:12:22 2005
Subject: [Biopython-dev] Bio.Geo for NCBI's GEO microarry SOFT files
Message-ID: <439B20D1.2020707@maubp.freeserve.co.uk>

I've just been looking at the Bio.Geo module by Katharine Lindner,
contributed back in 2002, which should parse the NCBI's Gene Expression
Omnibus (GEO) microarray data files.

http://www.ncbi.nlm.nih.gov/geo/

Is anyone using Bio.Geo at the moment?

The NCBI seem to call these SOFT files (*.soft), and the format is
documented here:

http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat

Apparently in 2005 they began a switch to a revised file format. New
format files are here:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/

Old format files are here:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old/
ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old_gz/

As far as I can tell, neither the "old" nor the "new" versions work in
Bio.Geo, so there may have been another format change between 2002 and
2005.

In addition, the 2005 change introduces new lines before and after the
actual data:

!dataset_table_begin
!dataset_table_end

These are definitely not supported in the current Martel grammar for GEO
files.

Peter

From ivan at biodec.com  Tue Dec 13 15:35:05 2005
From: ivan at biodec.com (Ivan Rossi)
Date: Tue Dec 13 15:39:46 2005
Subject: [Biopython-dev] tiny Align.AlignInfo patch
In-Reply-To:
References:
Message-ID:

Dear BioPythoneers,

I am submitting a tiny patch to the pos_specific_score_matrix method of
Bio.Align.AlignInfo.

It allows for the generation of PSSMs composed of the "alphabet+gap"
symbols. I use it all the time to generate 21-symbol PSSMs for proteins,
which we use as inputs for neural networks and HMMs.

The patch is not invasive at all and it preserves the default behavior
of AlignInfo.pos_specific_score_matrix().

I hope it will be considered for inclusion in the CVS.

Ivan

--
Ivan Rossi, Ph.D.
- ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
BioDec s.r.l., Via Fanin 48, I-40127 Bologna (Italy)
Phone: +39-051-4200321 - fax: +39-051-4200317 - web: www.biodec.com
-------------- next part --------------
*** AlignInfo.py.orig  Tue Dec 13 18:09:22 2005
--- AlignInfo.py  Tue Dec 13 18:18:40 2005
***************
*** 335,341 ****
      def pos_specific_score_matrix(self, axis_seq = None,
!                                   chars_to_ignore = []):
          """Create a position specific score matrix object for the alignment.

          This creates a position specific score matrix (pssm) which is an
--- 335,342 ----
      def pos_specific_score_matrix(self, axis_seq = None,
!                                   chars_to_ignore = [],
!                                   drop_gap_char = True):
          """Create a position specific score matrix object for the alignment.

          This creates a position specific score matrix (pssm) which is an
***************
*** 348,353 ****
--- 349,357 ----
          put on the axis of the PSSM. This should be a Seq object. If nothing
          is specified, the consensus sequence, calculated with default
          parameters, will be used.
+         o drop_gap_char - An optional boolean parameter to specify if the gap
+         symbol has to be accounted for in the pssm. Useful to generate the
+         "alphabet+gap" PSSMs used by some remote-homologi detection codes.

          Returns:
          o A PSSM (position specific score matrix) object.
***************
*** 355,363 ****
          # determine all of the letters we have to deal with
          all_letters = self.alignment._alphabet.letters

!         # if we have a gap char, add it to stuff to ignore
!         if isinstance(self.alignment._alphabet, Alphabet.Gapped):
!             chars_to_ignore.append(self.alignment._alphabet.gap_char)

          for char in chars_to_ignore:
              all_letters = string.replace(all_letters, char, '')
--- 359,368 ----
          # determine all of the letters we have to deal with
          all_letters = self.alignment._alphabet.letters

!         if drop_gap_char:
!             # if we have a gap char, add it to stuff to ignore
!             if isinstance(self.alignment._alphabet, Alphabet.Gapped):
!                 chars_to_ignore.append(self.alignment._alphabet.gap_char)

          for char in chars_to_ignore:
              all_letters = string.replace(all_letters, char, '')

From mdehoon at c2b2.columbia.edu  Tue Dec 13 15:43:40 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue Dec 13 15:49:13 2005
Subject: [Biopython-dev] tiny Align.AlignInfo patch
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECDC5@cgcmail.cgc.cpmc.columbia.edu>

Hi Ivan,

Thanks for the patch. But could you submit it through bugzilla? Patches
posted to mailing lists tend to get lost. (They shouldn't, but it happens
a lot in practice).

Thanks again,

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

-----Original Message-----
From: biopython-dev-bounces@portal.open-bio.org on behalf of Ivan Rossi
Sent: Tue 12/13/2005 3:35 PM
To: biopython-dev@biopython.org
Subject: [Biopython-dev] tiny Align.AlignInfo patch

Dear BioPythoneers,

I am submitting a tiny patch to the pos_specific_score_matrix method of
Bio.Align.AlignInfo

It allows for the generation of PSSMs composed by the "alphabet+gap"
symbols. I use it all the time to generate 21-symbols PSSMs for proteins,
that we use as inputs for neural networks and HMMs.

The patch is not invasive at all and it preserves the default behavior of
AlignInfo.pos_specific_score_matrix()

I hope it will be considered for inclusion in the CVS.

Ivan

--
Ivan Rossi, Ph.D.
- ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
BioDec s.r.l., Via Fanin 48, I-40127 Bologna (Italy)
Phone: +39-051-4200321 - fax: +39-051-4200317 - web: www.biodec.com

From biopython-dev at maubp.freeserve.co.uk  Tue Dec 13 17:23:11 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue Dec 13 17:20:29 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To: <43737060.4070006@maubp.freeserve.co.uk>
References: <43737060.4070006@maubp.freeserve.co.uk>
Message-ID: <439F49CF.7030006@maubp.freeserve.co.uk>

Are there any others on the list interested in parsing GenBank files who
wouldn't mind proofreading/commenting on this change to the
Tutorial/Cookbook?

i.e. changes to this document, section 3.4 GenBank:

http://www.biopython.org/docs/tutorial/Tutorial004.html#toc13
http://www.biopython.org/docs/tutorial/Tutorial.pdf

The patch is on the mailing list archive here:

http://www.biopython.org/pipermail/biopython-dev/2005-November/002193.html

Or I could log a bug & attach the patch to it.

Would I be better off asking on the Discussion List, rather than the
Development List, for this sort of question?

Bonus question: where could I find multi-record GenBank files?

Peter

On 10 Nov 2005, I wrote:
> There should be a patch attached for Biopython Doc/Tutorial.tex which
> tries to clarify GenBank parsing.
>
> Created on Windows using:-
>
> diff cvs_Tutorial.tex new_Tutorial.tex -E -Naur > patch.txt
>
> In particular, I have tried to make it clear that GenBank.Iterator() and
> GenBank.index_file() are overkill/unnecessary when dealing with GenBank
> files which contain only a single record (which is the typical case in
> my personal experience).
>
> My changes add an introductory example: parsing a small bacterial genome
> (a single large GenBank record), before moving on to the
> GenBank.Iterator() and GenBank.index_file() examples.
>
> I have also pointed out that the multi-record example GenBank file used
> in these examples (cor6_6.gb) is included in the downloadable BioPython
> source code.
>
> Plus there is a minor correction to the GenBank.index_file example,
> len(gb_dict) gives 6, not 7.
>
> Peter

From biopython-dev at maubp.freeserve.co.uk  Wed Dec 14 13:33:15 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed Dec 14 13:39:36 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To:
References: <43737060.4070006@maubp.freeserve.co.uk>
	<439F49CF.7030006@maubp.freeserve.co.uk>
Message-ID: <43A0656B.8060200@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> The patch looks good to me, but I could have missed something there. I
> forgot about the Discussion List. I really should join that list.

Motion seconded - any developer want to accept this?

> Also, I probably will be filing a bug on Bio.Fasta documentation.
> There are two basic doc changes that should be made:
>
> Under the doc for Fasta:
> RecordParser Parses FASTA sequence data into a Record object <- change
> to a Fasta.Record object which is not the same as a Seq.Record

Sounds sensible.

> Cookbooks:
>
> Then maybe in the Cookbook, give an example on using
> Fasta.SequenceParser with title2ids. Without title2ids, you don't get
> name or id. You only get description, which is the title. Fasta.Record
> only has title, which maybe should be renamed (deprecated to)
> description to make it the same default behavior as SequenceParser.

I don't usually bother with the title2ids function either.

I agree that the fact that it's .title or .description depending on the
parser used (Fasta.RecordParser or Fasta.SequenceParser) is odd.

> It seems odd that the Fasta stuff is buried within Chapter 2 (2.4.3
> Making it easier - plus it is missing "import string").

Yes, but I think it would be better to avoid using the string module
completely, and use the split method of the string object instead:

from Bio import Fasta

def parseTitle2Ids(title):
    return title.split("|")[:3]

parser = Fasta.SequenceParser(title2ids = parseTitle2Ids)
file = open("ls_orchid.fasta")
iterator = Fasta.Iterator(file, parser)
...

Peter

From mcolosimo at mitre.org  Wed Dec 14 14:01:22 2005
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Wed Dec 14 16:01:51 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To: <43A0656B.8060200@maubp.freeserve.co.uk>
Message-ID:

On 12/14/05 1:33 PM, "Peter" wrote:

> Marc Colosimo wrote:
>
>> It seems odd that the Fasta stuff is buried within Chapter 2 (2.4.3
>> Making it easier - plus it is missing "import string").
>
> Yes, but I think it would be better to avoid using the string module
> completely, and use the split method of the string object instead:
>

I totally agree with you on this. I was just following the coding style
used in the cookbook and not my own.

> from Bio import Fasta
>
> def parseTitle2Ids(title):
>     return title.split("|")[:3]
>
> parser = Fasta.SequenceParser(title2ids = parseTitle2Ids)
> file = open("ls_orchid.fasta")
> iterator = Fasta.Iterator(file, parser)
> ...
>

Marc

From bugzilla-daemon at portal.open-bio.org  Thu Dec 15 14:51:14 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Dec 15 14:58:17 2005
Subject: [Biopython-dev] [Bug 1919] New: Transcribe DNA
Message-ID: <200512151951.jBFJpEEK012122@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1919

           Summary: Transcribe DNA
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: tmagalhaes@dcc.fc.up.pt

I was reading some examples in the biopython tutorial and cookbook and,
for the first time (I had already read it many times), I got confused...

Does transcribing the DNA sequence ATCG produce the RNA sequence AUCG or
UAGC? Biopython does the first one, but until today I was completely sure
that the correct one is the second.

Probably this is a Tania's bug :) and not a biopython bug, and probably
this is not the right place to put that kind of question, but at this
time I really don't know how transcribe works. I'm really confused
because on the internet I found sites where they do it the way I thought
it was done (or at least it seems to me the same thing)...

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
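
The two readings in the report above differ only in which DNA strand the
input is taken to be. A minimal plain-Python sketch follows (Python 3, no
Biopython calls; both helper names are made up for illustration and are
not part of any library). Treating ATCG as the coding strand gives AUCG,
while complementing it base by base, as if it were the template strand
and without reversing it, gives UAGC.

```python
# Hypothetical helpers for illustration only; not Biopython API.

def transcribe_coding_strand(dna):
    """Input is the coding (sense) strand: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


# RNA base that pairs with each DNA base on the template strand.
_TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_template_strand(dna):
    """Input is the template strand: complement each base, position by position.

    No reversal is applied, so the result lines up base-for-base with the input.
    """
    return "".join(_TEMPLATE_TO_RNA[base] for base in dna.upper())


coding_result = transcribe_coding_strand("ATCG")
template_result = transcribe_template_strand("ATCG")
print(coding_result)    # AUCG
print(template_result)  # UAGC
```

Which helper is "right" depends only on the convention for the input
strand; sequence databases normally store the coding strand, which is
consistent with the T-to-U behaviour described in the report.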
From bugzilla-daemon at portal.open-bio.org  Thu Dec 15 16:55:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Dec 15 16:58:17 2005
Subject: [Biopython-dev] [Bug 1919] Transcribe DNA
Message-ID: <200512152155.jBFLtvG6014663@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1919

------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-15 16:55 -------
Transcription: DNA {using A,T,C and G} --> mRNA {using A,U,C and G}

Translation: mRNA {using A,U,C and G} --> Protein {Amino Acids}

Note that the BioPython Translation object can be used to go direct from
DNA {ATCG} to Protein {Amino Acids}, which may be helpful.

Are you asking about the effect of complementation that also happens as
part of the transcription in biology? Because your example was just the
four nucleotides, I wasn't entirely clear on what you meant.

From mhampton at d.umn.edu  Thu Dec 15 15:06:46 2005
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Thu Dec 15 23:05:42 2005
Subject: [Biopython-dev] Re: [BioPython] blastx works fine?
Message-ID:

Hi,

I am a new user of biopython - I like it a lot, thanks for all those
contributions! - and I have been wondering about this too. It would help
me a lot to automate some blastx searches. What is the best way to do
this?

Thanks,
Marshall Hampton
Dept. Mathematics & Statistics
University of Minnesota, Duluth

Frank Kauff wrote:

>Hi all,
>
>qblast currently says it works only for blastp and blastn. Actually it
>seems to work fine with blastx as well - xml output parses well with
>NCBIXML. Or am I missing something?
>
>Frank
>
>
>--
>Frank Kauff
>Dept. of Biology
>Duke University

From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:45:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:18 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512182045.jBIKjQrE015541@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920

------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-18 15:45 -------
Created an attachment (id=260)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=260&action=view)
Patch for Bio/Geo/*.py

Changes to the Martel format definition in Bio/Geo/geo_format.py
Changes to the Geo.Iterator in Bio/Geo/__init__.py

From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:50:10 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:20 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512182050.jBIKoAcG015567@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920

------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-18 15:50 -------
Created an attachment (id=261)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=261&action=view)
ZIP file containing revised test_geo.py and five test files

The example files are from the NCBI webpage. They are examples of valid
GEO SOFTtext submission files, but it's the closest they offered.

* a single Platform submission.
* three dual channel Sample submissions.
* a single Series submission.
* a family (Platform, Samples and Series) submission.
* three Affymetrix Sample submissions.

http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTsubmissionexamples
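
As a rough illustration of the table markers discussed in this thread
(`!dataset_table_begin` and `!dataset_table_end`), the sketch below pulls
the data table out of SOFT-style text in plain Python 3. It is not the
Bio.Geo or Martel implementation. The sample text is invented, the
function name is hypothetical, rows are assumed to be tab-delimited, and
treating any `!..._table_begin` / `!..._table_end` line as a marker is an
assumption made here so that other record types could be handled the same
way.

```python
# Invented sample in the style of a SOFT file; not real GEO data.
SAMPLE_SOFT = (
    "^DATASET = GDS_EXAMPLE\n"
    "!dataset_title = invented example\n"
    "#ID_REF = probe identifier\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM_A\n"
    "probe_1\tGENE1\t12.5\n"
    "probe_2\tGENE2\t7.25\n"
    "!dataset_table_end\n"
)


def read_soft_table(lines):
    """Return (header, rows) for the table between the begin/end markers.

    Hypothetical helper: lines outside the markers are ignored, and the
    first line inside the markers is taken to be the column header.
    """
    header = None
    rows = []
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        lowered = line.lower()
        if lowered.startswith("!") and lowered.endswith("_table_begin"):
            in_table = True
            continue
        if lowered.startswith("!") and lowered.endswith("_table_end"):
            in_table = False
            continue
        if not in_table or not line:
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(fields)
    return header, rows


header, rows = read_soft_table(SAMPLE_SOFT.splitlines())
print(header)
print(len(rows), "data rows")
```

A parser written before the markers were introduced would see the two
`!..._table_...` lines as unexpected input, which matches the failure
described in the report.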
From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:43:31 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:21 2005
Subject: [Biopython-dev] [Bug 1920] New: Bio.Geo does not support recent GEO files
Message-ID: <200512182043.jBIKhVM5015510@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920

           Summary: Bio.Geo does not support recent GEO files
           Product: Biopython
           Version: Not Applicable
          Platform: PC
               URL: http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Martel/Mindy
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk

The NCBI tweaked their GEO SOFT file format this year (2005) and the old
Bio.Geo parser can't cope.

I have fixed the Martel format definition to support this (and the old
test cases).

I have also changed the Geo.Iterator as it didn't seem to work (it seemed
to be doing an entire file at a time).

Patch to follow, along with new test cases from the NCBI.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon Dec 19 06:43:00 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Dec 19 06:59:23 2005
Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails
Message-ID: <200512191143.jBJBh0kH029607@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1921

------- Comment #1 from lpritc@scri.sari.ac.uk  2005-12-19 06:42 -------
Created an attachment (id=262)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=262&action=view)
Patch to BioSQL/Loader.py fixing problem with bioentry.taxon_id field

I'm not sure if this is a fix or a workaround, as I'm not confident that
it has no unfortunate downstream effects.

From bugzilla-daemon at portal.open-bio.org  Mon Dec 19 06:38:29 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Dec 19 06:59:25 2005
Subject: [Biopython-dev] [Bug 1921] New: BioSeqDatabase.load() method fails
Message-ID: <200512191138.jBJBcTPM029522@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1921

           Summary: BioSeqDatabase.load() method fails
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: lpritc@scri.sari.ac.uk

Using Fedora Core 3, MySQL Ver 14.7 Distrib 4.1.11 with BioPython from
CVS head.

On attempting to follow the documentation code at
http://www.biopython.org/docs/biosql/python_biosql_basic.html#htoc10 to
populate a BioSQL database from the example GenBank file, an error was
thrown, with traceback:

  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 37, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 251, in _load_bioentry_table
    self.adaptor.execute(sql, (self.dbid,
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 95, in execute
    return self._execute(query, args)
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.3/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1216, 'Cannot add or update a child row: a foreign key constraint fails')

This problem had previously been reported under a different configuration
on the BioPython discussion mailing list at
http://www.biopython.org/pipermail/biopython/2005-July/002716.html

The test_BioSQL.py script with the CVS BioPython failed with the same
error:

[lpritc@lplinuxdev Tests]$ python test_BioSQL.py
Load SeqRecord objects into a BioSQL database. ... ERROR
Get a list of all items in the database. ... ERROR
Test retrieval of items using various ids. ... ERROR
Make sure Seqs from BioSQL implement the right interface. ... ERROR
Check SeqFeatures of a sequence. ... ERROR
Make sure SeqRecords from BioSQL implement the right interface. ... ERROR
Check that slices of sequences are retrieved properly. ... ERROR
Make sure all records are correctly loaded. ... ERROR
Indepth check that SeqFeatures are transmitted through the db. ... ERROR

======================================================================
ERROR: Load SeqRecord objects into a BioSQL database.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 316, in t_load_database
    self.db.load(self.iterator)
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 37, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 251, in _load_bioentry_table
    self.adaptor.execute(sql, (self.dbid,
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 95, in execute
    return self._execute(query, args)
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.3/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1216, 'Cannot add or update a child row: a foreign key constraint fails')

The problem seems to stem from the DatabaseLoader._load_bioentry_table()
method in Loader.py. A previous fix attempts to solve an earlier problem
with the population of the bioentry.taxon_id field by assigning it the
value "0" in the INSERT SQL statement. Attempting to do this in a
database where the taxon table is unpopulated is a violation of a foreign
key constraint in both the current BioSQL schema and the one that ships
with BioPython, and throws the error seen.

I modified the code in DatabaseLoader._load_bioentry_table() so that the
INSERT statement no longer attempts to populate the bioentry.taxon_id
field, which is left to take the default value of NULL. The diff is
below:

226c226
< taxon_id = "0" # inserted this because the taxon population code is out of date
---
> #taxon_id = "0" # inserted this because the taxon population code is out of date
231a232,234
> # removed taxon_id field, as it was causing difficulties with the
> # schema - not inserting a value allows it to default to NULL,
> # avoiding violation of the foreign key constraint.
235d237
< taxon_id,
249d250
< %s,
252d252
< taxon_id,
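
The constraint behaviour described in this report can be reproduced in
miniature. The sketch below is an illustration under stated assumptions,
not the BioSQL code: it uses SQLite from the Python 3 standard library
instead of MySQL, a made-up two-table schema that only mimics the
taxon/bioentry relationship, and a hypothetical helper name. With an
empty taxon table, inserting taxon_id 0 is rejected, while leaving the
column NULL is accepted.

```python
import sqlite3


def try_insert(conn, taxon_id):
    """Insert one bioentry row; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bioentry (name, taxon_id) VALUES (?, ?)",
                ("EXAMPLE_ENTRY", taxon_id),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return "rejected: %s" % exc


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Simplified stand-in schema (an assumption, not the real BioSQL tables).
conn.execute("CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " taxon_id INTEGER REFERENCES taxon(taxon_id))"
)

result_zero = try_insert(conn, 0)     # no taxon row with id 0 exists
result_null = try_insert(conn, None)  # NULL is exempt from the check
row_count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]

print(result_zero)
print(result_null)
print(row_count)
```

This mirrors the reasoning in the report: a NULL foreign key column is
not checked against the parent table, whereas the placeholder value 0
must match an existing taxon row.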
From bugzilla-daemon at portal.open-bio.org  Tue Dec 20 07:32:41 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Dec 20 07:58:21 2005
Subject: [Biopython-dev] [Bug 1909] Format issue with GenBank with segmented BACs (eg GI:55276707)
Message-ID: <200512201232.jBKCWfFw021417@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909

biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-20 07:32 -------
A GenBank format entry for GI:55276707 can be downloaded from here:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=55276707

It's a 401 kb GenBank file, containing THREE separate GenBank records
(three segments), starting:

LOCUS       AY643842S1             12998 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum vulgare subsp. vulgare clone BAC 519K7 hardness locus
            region.
ACCESSION   AY643842
VERSION     AY643842.1  GI:55276708
KEYWORDS    .
SEGMENT     1 of 3
..

Using the old Martel GenBank parser (e.g.
BioPython 1.41) the following works perfectly:

print "Method 1 - Using for record in Iterator"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
for gb_record in GenBank.Iterator(input_file, GenBank.RecordParser()) :
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

Or:

print "Method 2 - Using Iterator.next()"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
gb_iterator = GenBank.Iterator(input_file, GenBank.RecordParser())
while True:
    gb_record = gb_iterator.next()
    if gb_record is None : break
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

This bit of code will reproduce the error reported:

print "Method 3 - No Iterator object, this fails"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file = open(gbk_filename, "r")
gb_record = GenBank.RecordParser().parse(input_file)
..

The reason the error message says "unparsed text remains" beyond position
18263 is that there are actually two more records in the file.

Your text editor may have a "goto character" command (TextPad does,
available to try from www.textpad.com, but it does cost money).

The following snippet of code is another way to find out where a Martel
parser is failing from a position in a file, in this case 18263:

print "Debug:"
input_file = open(gbk_filename, "r")
raw_text = "".join(input_file.readlines())
input_file.close()
print raw_text[18263:18263+100] + "..."

Debug:
LOCUS       AY643842S2            129099 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum ...

i.e. it's complaining about the presence of a second record (the LOCUS
line onwards) in the GenBank file.

Resolution
==========
If you can't be sure in advance that there is only one record, always
use the GenBank.Iterator object.

Note
====
Using the current version of the GenBank parser (in CVS, not yet
released), method 3 above will work and give you just the first record.
It does not warn you in any way that there is a second or third record
available.

P.S.
====
My testing and the original report were done on Windows. If you run this
on unix, then because of the different line endings, the exact position
of the second record will change slightly.

From e.picardi at unical.it  Thu Dec 22 03:58:52 2005
From: e.picardi at unical.it (Ernesto)
Date: Thu Dec 22 07:28:32 2005
Subject: [Biopython-dev] simple class to generate random trees
Message-ID: <005101c606d5$f7de0080$572561a0@mirko84cf0g99i>

Skipped content of type multipart/alternative
-------------- next part --------------
"""
RandomTree is a simple class to generate random rooted trees.

Clock-like trees are generated according to the methodology of
Kuhner and Felsenstein (1994) Mol. Biol. Evol. 11: 459-468,
whereas non-clock-like trees are created following
Guindon and Gascuel (2002) Mol. Biol. Evol. 19: 534-543.

Once a clock-like tree is generated, each branch length is multiplied by
a gamma distributed factor. If the mean of this distribution is equal to
1 and the shape fixed to 0.5, then the departure from molecular clock is
strong. The opposite situation is when gamma shape is fixed to 2.0.

When the RandomTree class is invoked a simple object is created.
It contains:

    ntips: number of tips for tree --> default is 10
    nobr:  1 for trees without branch lengths --> default is 0
    pm:    probability of change per unit time --> default is 0.03
    shape: gamma shape for variable trees --> default is 0.5
    mean:  mean of gamma distribution for variable trees --> default is 1

USING this class:

>>> from RandomTree import RandomTree
>>> tree = RandomTree()
>>> tree.nobr
0
>>> tree.ntips=4
>>> tree.constant_tree()
'((T3:0.01516,T4:0.01516):0.01332,(T1:0.02643,T2:0.02643):0.00205);'
>>> tree.variable_tree()
'(((T4:0.00050,T3:0.00954):0.00009,T2:0.01591):0.00595,T1:0.00531);'
>>> tree.nobr=1
>>> tree.constant_tree()
'((T2,(T1,T4)),T3);'
>>> tree.variable_tree()
'((T1,(T3,T2)),T4);'

Copyright (c) 2004-2005, Ernesto Picardi.
This class comes with ABSOLUTELY NO WARRANTY.
"""

import math,string,fpformat,random,re,sys # import of standard modules

class RandomTree:

    # The first argument is stored as ntips so that it matches the
    # attribute name documented (and set) in the docstring above.
    def __init__(self,ntips=10,nobr=0,pm=0.03,shape=0.5,mean=1):
        self.ntips=ntips # number of tips
        self.nobr=nobr # 1 to omit branch lengths
        self.pm=pm # probability of change per unit time
        self.shape=shape # gamma shape parameter
        self.mean=mean # mean of gamma distribution

    def constant_tree(self): # function to generate a clock-like tree
        if self.ntips <=2:
            sys.exit('At least three tips. Bye.')
        tips=[]
        for i in range(1, self.ntips+1):
            tips.append("T"+str(i))
        Lb=[]
        for i in range(len(tips)):
            Lb.append(0)
        n=1
        dictionary={}
        while len(tips)!=1:
            R=random.random()
            tyme=(-(math.log(R))/len(tips))*self.pm
            fixtyme=fpformat.fix(tyme,5)
            brlens=float(fixtyme)
            for i in range(len(tips)):
                Lb[i]=Lb[i]+brlens
            nodeName = '@node%04i@' % n
            # pick two lineages at random and join them under a new node
            s1=random.choice(tips)
            i1=str(Lb[tips.index(s1)])
            del Lb[tips.index(s1)]
            tips.remove(s1)
            s2=random.choice(tips)
            i2=str(Lb[tips.index(s2)])
            del Lb[tips.index(s2)]
            tips.remove(s2)
            if self.nobr:
                nodo="("+s1+","+s2+")"
            else:
                nodo="("+s1+":"+i1+","+s2+":"+i2+")"
            dictionary[nodeName]=nodo
            tips.append(nodeName)
            Lb.append(0)
            n+=1
        findNodes=re.compile(r"@node.*?@", re.I) #to identify a node name
        lastNode = max(dictionary.keys())
        treestring = lastNode
        # expand node placeholders until only tip names remain
        while 1:
            nodeList = findNodes.findall(treestring)
            if nodeList == []: break
            for element in nodeList:
                treestring=treestring.replace(element, dictionary[element])
        return treestring + ';'

    def variable_tree(self): # function to generate a variable tree
        treestring=self.constant_tree()
        findbr=re.compile(r":[0-9]+\.[0-9]+[),]")
        allbr=findbr.findall(treestring)
        dicbr={}
        for i in allbr:
            br=(i.split(':'))[1]
            brval=float(br.strip('),'))
            # random.gammavariate(alpha, beta) has mean alpha*beta, so the
            # scale is mean/shape to give the factor the documented mean
            beta=float(self.mean)/self.shape
            gammafactor=random.gammavariate(self.shape,beta)
            newbr=brval*gammafactor
            newbr1=fpformat.fix(newbr,5)
            dicbr[i]=newbr1
        for j in dicbr:
            if ',' in j:
                treestring=treestring.replace(j,':'+dicbr[j]+',')
            elif ')' in j:
                # uses j here; the loop variable i left over from the
                # previous loop would always refer to the last branch
                treestring=treestring.replace(j,':'+dicbr[j]+')')
        return treestring

From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 06:39:52 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:00:24 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512241139.jBOBdqIL008387@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920

------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 06:39 -------
The
current patch (patch 260) doesn't cope with current GPL files (GEO
platforms/annotation files) where the "before/after table comments" are
slightly different.

Also, the GEO Record object's __str__ method will attempt to show all the
rows in a data table, and for large GDS or GPL files this is a very bad
idea - python seems to lock up my computer as a result.

I propose to only print the first 20 rows, then a "...", and the final
record. 20 is a reasonably low number and will not affect the existing
test cases.

I have a revised patch prepared that tackles these two issues, but won't
have direct internet access until the New Year.

If anyone with CVS access feels the urge, don't let this stop you from
checking in the current patch and the additional test cases.

From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:18:47 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:45 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200512241218.jBOCIlWh008813@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680

------- Comment #4 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:18 -------
I think (after following references through several files) that we need
to focus on Bio/expressions/genbank.py

The "record" definition appears to allow multiple trailing blank lines at
the end of a record, see "record_end". i.e. it looks for // and then one
or more new lines.

However, the "format" definition which appears to be used to build the
index is this:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.EndsWith, ("//",))

If I am not mistaken, for files with blank lines between records (as
reported in this bug), this will lead to the first record having no
trailing lines, and subsequent records having leading blank lines.

So, my suggestions are:

(a) Allow blank lines at the start of a genbank record (before the LOCUS
line)

Or:

(b) we could try this:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.StartsWith, ("LOCUS ",))

Making this change seems to fix this bug (indexing the small 6 KB GenBank
file with three entries takes under a second).

As the GenBank.Iterator code works by looking for records that start with
LOCUS, this seems like a more consistent approach.

NOTE - I have not run the full test suite to look for any side effects.

From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:28:45 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:47 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200512241228.jBOCSjLU008952@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680

biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marc.saric@gmx.de

------- Comment #5 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:28 -------
*** Bug 1773 has been marked as a duplicate of this bug. ***
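
The difference between the two record-splitting strategies in comment #4
on bug 1680 can be shown without Martel. The sketch below is plain
Python 3 written for illustration: the two functions only imitate the
"ends with //" and "starts with LOCUS" ideas and are not Martel's
RecordReader classes, and the three-record sample is invented. With a
blank line between records, the end-marker split hands every record after
the first a leading blank line, while the start-marker split always
begins each chunk at LOCUS.

```python
# Invented three-record sample with a blank line after each "//".
SAMPLE = (
    "LOCUS       REC1\n"
    "//\n"
    "\n"
    "LOCUS       REC2\n"
    "//\n"
    "\n"
    "LOCUS       REC3\n"
    "//\n"
)


def split_ends_with(text, marker="//"):
    """Close a chunk as soon as a line equal to `marker` has been read."""
    chunks, current = [], []
    for line in text.splitlines(True):
        current.append(line)
        if line.rstrip("\r\n") == marker:
            chunks.append("".join(current))
            current = []
    if "".join(current).strip():  # keep any non-blank trailing text
        chunks.append("".join(current))
    return chunks


def split_starts_with(text, marker="LOCUS "):
    """Open a new chunk whenever a line starts with `marker`."""
    chunks, current = [], []
    for line in text.splitlines(True):
        if line.startswith(marker) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


ends_chunks = split_ends_with(SAMPLE)
starts_chunks = split_starts_with(SAMPLE)
print([chunk.startswith("LOCUS") for chunk in ends_chunks])
print([chunk.startswith("LOCUS") for chunk in starts_chunks])
```

In the start-marker version the blank separator lines end up at the tail
of the preceding chunk, which is the case the comment says the "record"
definition already tolerates.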
From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:28:43 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:48 2005
Subject: [Biopython-dev] [Bug 1773] Martel.Parser.ParserPositionException
Message-ID: <200512241228.jBOCShrh008947@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1773

biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE

------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:28 -------
Having investigated bug 1680 further, I'm sure that your issue with the
trailing blank lines is the same problem, so I'm marking this as a
duplicate.

However, as far as I can tell, your example GenBank file only has one
"genbank record" in it (i.e. it only has one LOCUS line).

*This means that indexing this particular file is rather pointless*

Indexing the features within this single GenBank record might be more
useful; there is an in-memory approach to this here:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/#indexing_features

*** This bug has been marked as a duplicate of 1680 ***