From chapmanb at arches.uga.edu Sat Sep 1 21:09:35 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
In-Reply-To:
References:
Message-ID: <20010901210935.A8873@ci350185-a.athen1.ga.home.com>

Hi Jeff!

> If nobody has any rejections, I'm going to put together the next
> release this weekend. Please let me know if I should hold off...

Sorry to be so slow in getting back to your previous message. It's
been a crazy week at lab.

I definitely think we should get a new release together, especially
to take care of the bugs and things. Let me know when you end up
getting it together, and I can go ahead and do the Windows
Installers and everything for it.

Thanks for getting this together!

Brad

From katel at worldpath.net Sun Sep 2 01:38:37 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
References:
Message-ID: <003301c13371$84f5b220$010a0a0a@cadence.com>

> If nobody has any rejections, I'm going to put together the next
> release this weekend. Please let me know if I should hold off...
>
The MetaTool parser won't be ready till the next release after the weekend
release. I'm at least a week away.
But I could check old code for "fossils" like a reference to GenBank in an
Interpro parser.

Cayte

From jchang at SMI.Stanford.EDU Sun Sep 2 00:38:26 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
In-Reply-To: <003301c13371$84f5b220$010a0a0a@cadence.com>
References: <003301c13371$84f5b220$010a0a0a@cadence.com>
Message-ID:

At 10:38 PM -0700 9/1/01, Cayte wrote:
>> If nobody has any rejections, I'm going to put together the next
>> release this weekend. Please let me know if I should hold off...
>>
>The MetaTool parser won't be ready till the next release after the weekend
>release. I'm at least a week away.
>But I could check old code for "fossils" like a reference to GenBank in an
>Interpro parser.
>
> Cayte

Thanks. It sounds like MetaTool will have to wait for one more release.
I'll start putting this release together on Monday. Please let me know
if you find any showstoppers.

Jeff

From xgtl at eth.net Tue Sep 4 07:45:26 2001
From: xgtl at eth.net (G. DEEPAK REDDY)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Bioinformatics & Molecular Modeling Training
Message-ID: <000b01c13537$18edde00$b1bf09ca@xgt15>

Dear Members,

We have recently started a training division in Bioinformatics & Molecular
Modeling. We are looking for feedback from experts about including
Biopython as part of the curriculum in the course. Please send your
suggestions as to what topics, exercises and applications should be included.

Regards

Jupudi Srinivas
Director-Technical,
Xpert Global Tech Limited,
INDIA
Jupudi@xpertglobaltech.com
http://www.xpertglobaltech.com

From Y.Benita at pharm.uu.nl Wed Sep 5 05:28:00 2001
From: Y.Benita at pharm.uu.nl (Yair Benita)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
Message-ID:

Hi guys,
I have compiled the new release for the Mac.
Please download it from "http://homepage.mac.com/ybenita" and post it on the
website.
All tests pass except:

======================================================================
FAIL: test_SubsMat
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Yair's G4:Desktop Folder:biopython-1.00a3:Tests:run_tests.py", line 153, in runTest
    expected_handle)
  File "Yair's G4:Desktop Folder:biopython-1.00a3:Tests:run_tests.py", line 247, in compare_output
    assert expected_line == output_line, \
AssertionError:
Output : 'H 0.003 0.000 0.003 0.002 0.002 0.003 0.003\n'
Expected: 'H 0.003 0.001 0.003 0.002 0.002 0.003 0.003\n'

----------------------------------------------------------------------

Yair
--
Yair Benita
Pharmaceutical Proteomics
Utrecht University

From chapmanb at arches.uga.edu Wed Sep 5 07:53:20 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
In-Reply-To:
Message-ID:

Hi Yair!

> I have compiled the new release for the Mac.
> Please download it from "http://homepage.mac.com/ybenita" and post it on the
> website.

Sweet! Thanks for doing this. I've put it up on the Download page.

> All tests pass except:
>
> ======================================================================
> FAIL: test_SubsMat
> ----------------------------------------------------------------------

Great, I'm very happy most stuff is passing! Only test_SubsMat fails
on Windows as well (grrrr, we really are going to have to do something
about that test!), so we are definitely doing well on cross-platform
this time. Great to hear!
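
One way the exact-match comparison in compare_output could be relaxed is sketched below. It assumes the 0.000 vs 0.001 mismatch comes from platform-dependent rounding, which the thread does not confirm. The function name lines_match and its tolerance parameter are hypothetical and are not part of run_tests.py.

```python
def lines_match(expected_line, output_line, tolerance=0.0015):
    """Compare two output lines field by field.

    Numeric fields may differ by up to `tolerance`; all other
    fields must be identical strings.
    """
    expected_fields = expected_line.split()
    output_fields = output_line.split()
    if len(expected_fields) != len(output_fields):
        return False
    for expected, output in zip(expected_fields, output_fields):
        try:
            # Both fields parse as numbers: compare with a tolerance.
            if abs(float(expected) - float(output)) > tolerance:
                return False
        except ValueError:
            # At least one field is not numeric: require an exact match.
            if expected != output:
                return False
    return True


# The two lines from the failure report above.
expected = 'H 0.003 0.001 0.003 0.002 0.002 0.003 0.003\n'
output = 'H 0.003 0.000 0.003 0.002 0.002 0.003 0.003\n'
print(lines_match(expected, output))
```

The tolerance has to be chosen per test: too loose and real regressions slip through, so this is only worth doing for tests whose output is known to be rounding-sensitive.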

Brad

From johann at egenetics.com Wed Sep 5 09:46:55 2001
From: johann at egenetics.com (Johann Visagie)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3 release now available
In-Reply-To: ; from jchang@SMI.Stanford.EDU on Tue, Sep 04, 2001 at 11:45:13AM -0700
References:
Message-ID: <20010905154655.E57556@fling.sanbi.ac.za>

Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
>
> A new release of Biopython is now available.

Cool. :-)

A thought: Shouldn't these announcements be cross-posted to
python-announce-list@python.org, a.k.a. comp.lang.python.announce? :-)

-- V

From thomas at cbs.dtu.dk Wed Sep 5 14:45:40 2001
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
Message-ID: <15254.29396.129922.274263@genome.cbs.dtu.dk>

Hej,

To follow up one of the discussions and questions at ISMB in Copenhagen,
- how are we going to proceed with the sequence format reader (the
biopython variant of readseq ...)

Currently we only have parsers for Fasta, Embl and GenBank. What we
need is an internal format and functions/modules which can read/write:
    Fasta
    Embl
    GenBank
    GCG
    Phylip
    PIR
    MSF
    Nexus
    Clustal
    Mase
    ??? - more suggestions ?

I can write most of the rules, but I guess we have to define a smart base
class/parser - where plugging in a new format should only take 5 seconds ...
If we brainstorm on the design of the reader/writer, I could volunteer to
implement the format rules ...

Some things to consider:
* some formats are alignment based (e.g. clustal, phylip, nexus)
* some formats have loads of information which is lost when converted to a
  less info-rich format (e.g. Embl -> Fasta).
  But Embl -> GenBank should not lose any information
* some formats allow multiple entries, some not

back-in-the-sequence-format-jungle'ly yr's
-thomas

--
Sicheritz-Ponten Thomas, Ph.D    CBS, Department of Biotechnology
thomas@biopython.org             The Technical University of Denmark
CBS: +45 45 252489               Building 208, DK-2800 Lyngby
Fax +45 45 931585                http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From reillywu at yahoo.com Wed Sep 5 15:50:25 2001
From: reillywu at yahoo.com (Chunlei Wu)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] localblast bug?
Message-ID: <20010905195025.67630.qmail@web20503.mail.yahoo.com>

Hi,
I wrote a script for localblast. It always raised a
TypeError:

  File "e:\python21\Bio\Blast\NCBIStandalone.py", line 1447, in blastall
    r, w, e = popen2.popen3([blastcmd] + params)
  File "e:\python21\lib\popen2.py", line 129, in popen3
    w, r, e = os.popen3(cmd, mode, bufsize)
TypeError: popen3() argument 1 must be string, not list

When I modified line 1447 as:

    r, w, e = popen2.popen3(' '.join([blastcmd] + params))

then it works.

Chunlei Wu

Python version: Activepython build 210
Biopython version: 1.00a3
OS: WinNT
source:

def mylocalblast(input_file,output_file,db='nt'):
    """mylocalblast"""

    from Bio.Blast import NCBIStandalone

    my_blast_db="r:\\blastdb\\"+db
    my_blast_exe=r"r:\localblast\blastall.exe"

    blast_out, error_info = NCBIStandalone.blastall(my_blast_exe,'blastn',my_blast_db,input_file)

    output_f=open(output_file,'w')
    blast_result=blast_out.read()
    output_f.write(blast_result)
    print blast_result
    output_f.close()

From chapmanb at arches.uga.edu Wed Sep 5 17:27:45 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] localblast bug?
In-Reply-To: <20010905195025.67630.qmail@web20503.mail.yahoo.com>
References: <20010905195025.67630.qmail@web20503.mail.yahoo.com>
Message-ID: <20010905172745.A4632@ci350185-a.athen1.ga.home.com>

Hi Chunlei;

> I wrote a script for localblast. It always raised a
> TypeError:
[...]
> When I modified line 1447 as:
>
>     r, w, e = popen2.popen3(' '.join([blastcmd] + params))
>
> then it works.

Thanks for the fix. I think you're probably the first to use the
localblast module on Windows, so you get to run into the
platform-specific problems (aren't you lucky :-). Your fix works fine for me
on UNIX as well (with the Doc/examples/local_blast.py script), so I
checked your change into CVS. It is available from anonymous CVS and
should be in the next release.

Thanks again!
Brad

From chapmanb at arches.uga.edu Wed Sep 5 17:46:00 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: <15254.29396.129922.274263@genome.cbs.dtu.dk>
References: <15254.29396.129922.274263@genome.cbs.dtu.dk>
Message-ID: <20010905174600.A4766@ci350185-a.athen1.ga.home.com>

Hi Thomas!

> To follow up one of the discussions and questions at ISMB in Copenhagen,
> - how are we going to proceed with the sequence format reader (the
> biopython variant of readseq ...)

It's great that you're going to work on this! It's definitely much
desired by a lot o' people (in fact I was just having a conversation
today about format conversion).

> Currently we only have parsers for Fasta, Embl and GenBank. What we
> need is an internal format and functions/modules which can read/write:
[...impressive list o' formats...]
> ??? - more suggestions ?

I think supporting this many would be an *excellent* start :-).

> I can write most of the rules, but I guess we have to define a smart base
> class/parser - where plugging in a new format should only take 5 seconds ...
> If we brainstorm on the design of the reader/writer, I could volunteer to
> implement the format rules ...
>
> Some things to consider:
> * some formats are alignment based (e.g. clustal, phylip, nexus)
> * some formats have loads of information which is lost when converted to a
>   less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
>   not lose any information
> * some formats allow multiple entries, some not

Just as a way of getting things started (I haven't done a lot of
thinking about this), my opinion is that the best way to do this is
to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
system would be the standard SeqRecord object that we currently
have. The advantage of this is that existing parsers (i.e. Fasta,
GenBank) already parse into this, so all that would need to be done
is to define a mapping that converts a generic SeqRecord object to
and from the format's "native" Record-based representation. So to
convert from GenBank to Fasta you could do:

GenBank Record Format --> SeqRecord --> Fasta Record Format

Since the Record formats already provide writing capabilities (and
we have the parsers to parse into them) we would already get writing
and parsing "for free." Also, we would make good use of our existing
"generic" Sequence representations.

The advantage of this is that it would help us avoid having to make
a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
specific converters. The disadvantage of this is that we may lose
some information in the conversion process (but then again, what
converters don't :-).

The tricky part of doing it this way is that we would then need to
define the Record --> SeqRecord mapping, which, as you mention,
may take some thinking for alignment formats and other
complications.
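
A minimal sketch of the "Record --> SeqRecord --> Record" route described above. All class and method names here (SimpleSeqRecord, GenBankLikeRecord, FastaLikeRecord, to_seq_record, from_seq_record) are hypothetical stand-ins invented for illustration. They are not the Biopython classes, and the accession "DEMO0001" is made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleSeqRecord:
    """Hypothetical generic intermediate (stand-in for a SeqRecord)."""
    id: str
    seq: str
    description: str = ""
    annotations: dict = field(default_factory=dict)


@dataclass
class GenBankLikeRecord:
    """Hypothetical format-specific record carrying rich annotation."""
    locus: str
    definition: str
    organism: str
    sequence: str

    def to_seq_record(self):
        # Fields with no generic slot go into the annotations dictionary.
        return SimpleSeqRecord(
            id=self.locus,
            seq=self.sequence,
            description=self.definition,
            annotations={"organism": self.organism},
        )


@dataclass
class FastaLikeRecord:
    """Hypothetical format-specific record: a title line and a sequence."""
    title: str
    sequence: str

    @classmethod
    def from_seq_record(cls, record):
        # Fasta has no place for annotations, so they are dropped here.
        title = f"{record.id} {record.description}".strip()
        return cls(title=title, sequence=record.seq)

    def __str__(self):
        return f">{self.title}\n{self.sequence}\n"


def genbank_to_fasta(gb_record):
    """Convert through the generic intermediate rather than directly."""
    return FastaLikeRecord.from_seq_record(gb_record.to_seq_record())


gb = GenBankLikeRecord("DEMO0001", "example gene", "Homo sapiens", "ATGC")
fasta_text = str(genbank_to_fasta(gb))
print(fasta_text)
```

With this shape each format needs only one mapping to and from the intermediate, and the information loss happens in one visible place (the writer for the poorer format) instead of being spread over many pairwise converters.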

Hopefully-rambling-on-and-on-about-this-helps-a-little-bit-ly yr's,

Brad

From jchang at SMI.Stanford.EDU Wed Sep 5 19:08:49 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3 release now available
In-Reply-To: <20010905154655.E57556@fling.sanbi.ac.za>
References: <20010905154655.E57556@fling.sanbi.ac.za>
Message-ID:

At 3:46 PM +0200 9/5/01, Johann Visagie wrote:
>Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
>>
>> A new release of Biopython is now available.
>
>Cool. :-)
>
>A thought: Shouldn't these announcements be cross-posted to
>python-announce-list@python.org, a.k.a. comp.lang.python.announce? :-)

Yes. Next time. :)

Thanks,
Jeff

From thomas at cbs.dtu.dk Thu Sep 6 05:22:01 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: Brad Chapman's message of "Wed, 5 Sep 2001 17:46:00 -0400"
References: <15254.29396.129922.274263@genome.cbs.dtu.dk> <20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID:

Brad Chapman writes:

> Hi Thomas!
>
> > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
>
> It's great that you're going to work on this! It's definitely much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
>
> > Currently we only have parsers for Fasta, Embl and GenBank. What we
> > need is an internal format and functions/modules which can read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
>
> I think supporting this many would be an *excellent* start :-).
>
> > I can write most of the rules, but I guess we have to define a smart base
> > class/parser - where plugging in a new format should only take 5 seconds ...
> > If we brainstorm on the design of the reader/writer, I could volunteer to
> > implement the format rules ...
> >
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when converted to a
> >   less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
> >   not lose any information
> > * some formats allow multiple entries, some not
>
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (i.e. Fasta,
> GenBank) already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the format's "native" Record-based representation. So to
> convert from GenBank to Fasta you could do:
>
> GenBank Record Format --> SeqRecord --> Fasta Record Format
>
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
>
> The advantage of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage of this is that we may lose
> some information in the conversion process (but then again, what
> converters don't :-).

I think subclassing the Seq object into a SeqIOSeq object is enough.
We just need to add a single dictionary (features) where all
Swiss/EMBL/GenBank extra annotations can be added.

e.g.
class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq needs the sequence data passed to its constructor
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}

In the case of
GenBank Record Format --> SeqIOSeq --> Fasta Record Format
we pick only the name and sequence ...

but for
GenBank Record Format --> SeqIOSeq --> EMBL Record Format
the writer function should check if there are any additional features
(self.features.keys())
That way we shouldn't lose any information.

It would be nice if a new format can be added by simply adding functions
for reading, writing and recognizing the format.
I'm not completely sure of how to define these functions - any ideas ?

example code ...

from Bio.Seq import Seq
NO, YES = 0,1

class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq needs the sequence data passed to its constructor
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}

class SeqIO:
    # class-level dictionaries (shared by all instances) to store
    # functions for recognizing, reading and writing of different
    # sequence formats
    recognizers = {}
    readers = {}
    writers = {}

    def __init__(self, **kwds):
        self.name = None
        self.format = None
        self.sequence = SeqIOSeq()
        self.is_an_alignment = NO
        self.allow_multiple_entries = YES
        # keyword arguments override the defaults above
        for k,v in kwds.items(): setattr(self, k, v)

    def AddFormat(self, name, recognizeF, readF, writeF):
        self.recognizers[name] = recognizeF
        self.readers[name] = readF
        self.writers[name] = writeF

needing-a-machete-for-the-sequence-format-jungle'ly yr's
-thomas

--
Sicheritz-Ponten Thomas, Ph.D    CBS, Department of Biotechnology
thomas@biopython.org             The Technical University of Denmark
CBS: +45 45 252489               Building 208, DK-2800 Lyngby
Fax +45 45 931585                http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From johann at egenetics.com Thu Sep 6 08:49:29 2001
From: johann at egenetics.com (Johann Visagie)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
In-Reply-To: ; from Y.Benita@pharm.uu.nl on Wed, Sep 05, 2001 at 11:28:00AM +0200
References:
Message-ID: <20010906144929.F35666@fling.sanbi.ac.za>

Yair Benita on 2001-09-05 (Wed) at 11:28:00 +0200:
>
> I have compiled the new release for the Mac.

FreeBSD port has also just been updated:
http://www.freebsd.org/cgi/cvsweb.cgi/ports/biology/py-biopython/

Pre-built package (minus CORBA) should appear here in a couple of days:
ftp://ftp.freebsd.org/pub/FreeBSD/ports/i386/packages-stable/All/py-biopython-1.00.a3.tgz

Unfortunately, this all comes a day or two too late to make it into
4.4-RELEASE, and hence onto the distribution CDs. :-(

-- V

From pewilkinson at informaxinc.com Thu Sep 6 19:14:21 2001
From: pewilkinson at informaxinc.com (Peter Wilkinson)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
In-Reply-To: <200109061602.f86G29B28847@pw600a.bioperl.org>
Message-ID: <005001c13729$aa920e50$f70210ac@l001696w00>

I have almost completed code for reading in Refseek data. I have finished
classes (1st draft, but functions well) for the smaller organisms, and now
I am moving on to the Human records ....

Also, we need a parser for Derwent data, which should inherit from EMBL,
since its formatting is EMBL-like.

Next also is the expression data from different manufacturers ....

there are piles more I am sure

Peter Wilkinson

P.S. I am sitting on code for specific FASTA-formatted types .... how about
that?

From katel at worldpath.net Thu Sep 6 23:38:32 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
References: <005001c13729$aa920e50$f70210ac@l001696w00>
Message-ID: <001601c1374e$95d62260$010a0a0a@cadence.com>

----- Original Message -----
From: "Peter Wilkinson"
To:
Sent: Thursday, September 06, 2001 4:14 PM
Subject: [Biopython-dev] RE:Sequence format readers

> I have almost completed code for reading in Refseek data. I have finished
> classes (1st draft, but functions well) for the smaller organisms, and now
> I am moving on to the Human records ....
>
> Also, we need a parser for Derwent data, which should inherit from EMBL,
> since its formatting is EMBL-like.
>
> Next also is the expression data from different manufacturers ....
>
We have tons of good ideas. I think we need to set priorities so we focus
on the most important tasks. I'm not the right person to set priorities
because I don't use these tools for my day job but others may have ideas.

Cayte

From jchang at SMI.Stanford.EDU Thu Sep 6 23:39:05 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To:
References: <15254.29396.129922.274263@genome.cbs.dtu.dk> <20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID:

At 11:22 AM +0200 9/6/01, Thomas Sicheritz-Ponten wrote:
>Brad Chapman writes:

[Thomas]
>> > I can write most of the rules, but I guess we have to define a smart base
>> > class/parser - where plugging in a new format should only take 5
>> > seconds ...
>> > If we brainstorm on the design of the reader/writer, I could volunteer to
>> > implement the format rules ...
>> >
>> > Some things to consider:
>> > * some formats are alignment based (e.g. clustal, phylip, nexus)
>> > * some formats have loads of information which is lost when converted to a
>> >   lower info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
>> >   not lose any information
>> > * some formats allow multiple entries, some not
>>
>> Just as a way of getting things started (I haven't done a lot of
>> thinking about this), my opinion is that the best way to do this is
>> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
>> system would be the standard SeqRecord object that we currently
>> have. The advantage of this is that existing parsers (ie Fasta,
>> GenBank), already parse into this, so all that would need to be done
>> is to define a mapping that converts a generic SeqRecord object to
>> and from the format's "native" Record based representation. So to
>> convert from GenBank to Fasta you could do:
>>
>> GenBank Record Format --> SeqRecord --> Fasta Record Format
>>
>> Since the Record formats already provide writing capabilities (and
>> we have the parsers to parse into them) we would already get writing
>> and parsing "for free." Also, we would make good use of our existing
>> "generic" Sequence representations.
>>
>> The advantage of this is that it would help us avoid having to make
>> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
>> specific converters. The disadvantage of this is that we may lose
>> some information in the conversion process (but then again, what
>> converters don't :-).

Yes. It would be nice to have a design where any conversion can be
done via an intermediate data structure. However, it should also be
possible to plug in your own converter if you want. For example, if
you really need to have a good GenBank -> EMBL translator, you can
code one up that bypasses the intermediate, and Biopython should use
it. That is, biopython should have 2 methods for translation,
1) general, but possibly lossy translation via an intermediate, and
2) direct translation if we happen to have a translator for those two
types; and the methods should work together as seamlessly as possible.

>I think inheriting the Seq object to a SeqIOSeq object is enough.
>We just need to add a single dictionary (features) where all
>Swiss/EMBL/GenBank extra annotations can be added.
>
>e.g.
>class SeqIOSeq(Seq):
>    def __init__(self):
>        Seq.__init__(self)
>        # dictionary for extra annotations (e.g. Embl, GenBank)
>        self.features = {}
>
>
>In the case of
>GenBank Record Format --> SeqIOSeq --> Fasta Record Format
>we pick only the name and sequence ...
>
>but for
>GenBank Record Format --> SeqIOSeq --> EMBL Record Format
>the writer function should check if there are any additional features
>(self.features.keys())
>That way we shouldn't lose any information.

This seems like the same solution as the one that Brad suggested,
except that SeqRecord is replaced by SeqIOSeq. The SeqIOSeq is a much
simpler format, so may be easier to use. However, it leaves
unspecified how the features should be stored, which may be
problematic. For example, the converter from SeqIOSeq to Fasta.Record
will have to know what to use as the Fasta description. For GenBank,
it might be the accession and comments. For a SProt.Record, it might
be the entry_name and description. Thus, unless the SeqIOSeq.features
elements are specified better, I'm afraid the SeqIOSeq -> X converter
will have to know about all the other formats.

SeqRecord gets around this by defining (theoretically) all the
information people would care about from a record, with a consistent
interface. Thus, a SeqRecord -> Fasta.Record converter will always
use the SeqRecord.id and SeqRecord.description (or some other
combination of attributes).
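
The two translation routes described above (a direct converter when one is registered, otherwise a general and possibly lossy trip through the intermediate) might be sketched roughly as follows. This is only an illustration in present-day Python 3, not code from the Biopython tree: GenericRecord, the three registries, convert() and the toy "tab" format are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GenericRecord:
    """Hypothetical stand-in for the shared intermediate (SeqRecord-like)."""
    id: str
    description: str
    sequence: str
    annotations: dict = field(default_factory=dict)


# (source format, target format) -> function mapping text to text
DIRECT_CONVERTERS = {}
# format name -> function parsing text into a GenericRecord
READERS = {}
# format name -> function rendering a GenericRecord as text
WRITERS = {}


def convert(text, source, target):
    """Use a direct converter if one is registered, else go via the intermediate."""
    direct = DIRECT_CONVERTERS.get((source, target))
    if direct is not None:
        return direct(text)
    return WRITERS[target](READERS[source](text))


def read_fasta(text):
    """Parse a single FASTA entry into the intermediate record."""
    lines = text.strip().splitlines()
    parts = lines[0][1:].split(None, 1) or [""]
    description = parts[1] if len(parts) > 1 else ""
    return GenericRecord(parts[0], description, "".join(lines[1:]))


def write_fasta(record):
    """Render the intermediate record as FASTA, 60 residues per line."""
    header = (">%s %s" % (record.id, record.description)).rstrip()
    seq = record.sequence
    body = [seq[i:i + 60] for i in range(0, len(seq), 60)]
    return "\n".join([header] + body) + "\n"


def write_tab(record):
    """Toy lossy format: id, a tab, then the sequence on one line."""
    return "%s\t%s\n" % (record.id, record.sequence)


READERS["fasta"] = read_fasta
WRITERS["fasta"] = write_fasta
WRITERS["tab"] = write_tab
```

With only readers and writers registered, convert(text, "fasta", "tab") goes through GenericRecord and drops the description; registering a function under DIRECT_CONVERTERS[("fasta", "tab")] makes the same call bypass the intermediate without the caller changing anything.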
>It would be nice if a new format can be added by simply adding functions
>for reading, writing and recognizing the format.
>I'm not completely sure of how to define these functions - any ideas ?

Not exactly, but it would be nice if those functions were exposed.
For example, there should be a function somewhere called "whichformat"
(similar to the whichdb package in Python's standard library) that
returns a best guess at the format. In the past, Andrew's talked
about building this kind of functionality into Martel...

Jeff

From jchang at SMI.Stanford.EDU Thu Sep 6 23:40:45 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
In-Reply-To: <005001c13729$aa920e50$f70210ac@l001696w00>
References: <005001c13729$aa920e50$f70210ac@l001696w00>
Message-ID:

>P.S. I am sitting on code for specific fasta formatted types .... how about
>that?

Yes, please send it in!

Jeff

From katel at worldpath.net Sat Sep 8 00:46:36 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
Message-ID: <004301c13821$3f6bf340$010a0a0a@cadence.com>

I just added MetaTool. It passes a superficial test but I need to take a
closer look.

Cayte

From katel at worldpath.net Sat Sep 8 16:52:18 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <004301c13821$3f6bf340$010a0a0a@cadence.com>
Message-ID: <000f01c138a8$27932360$010a0a0a@cadence.com>

When I've finished checking zillions of Matrixes from the Martel parser,
the next rainy day, I could look into Nexus since that is on the list. I'm
holding off on more pathway stuff till I find out what Tarjei has. I want
to post my plans so we don't have two people duplicating effort on the
same parser.

Cayte

From tarjei at genome.wi.mit.edu Sun Sep 9 00:49:23 2001
From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
In-Reply-To: <000f01c138a8$27932360$010a0a0a@cadence.com>
Message-ID: <000601c138ea$cc57e6a0$6d04fa12@mit.edu>

>I'm holding off on more pathway stuff till I find out what Tarjei has

I've completed a prototype for the core species/reaction/system classes
as described earlier.

Next, I intend to use these classes to implement the missing part from
my KEGG parser - the reaction database. Then I'd like to write a bridge
from the pathway classes to Metatool. Combined with your output parser
this would give us a complete "vertical slice" from database (KEGG/WIT)
through the pathway classes to an analysis program.

I'll do a first commit as soon as I am satisfied that the pathway
classes are well designed. Time is getting precious these days as school
is starting again, so it might take a while for this to happen.

Tarjei

From katel at worldpath.net Sun Sep 9 16:09:43 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <000601c138ea$cc57e6a0$6d04fa12@mit.edu>
Message-ID: <001401c1396b$5ed0a320$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen"
To: "'Cayte'" ;
Sent: Saturday, September 08, 2001 9:49 PM
Subject: RE: [Biopython-dev] MetaTool

> >I'm holding off on more pathway stuff till I find out what Tarjei has
>
> I've completed a prototype for the core species/reaction/system classes
> as described earlier.
>
> Next, I intend to use these classes to implement the missing part from
> my KEGG parser - the reaction database. Then I'd like to write a bridge
> from the pathway classes to Metatool. Combined with your output parser
> this would give us a complete "vertical slice" from database (KEGG/WIT)
> through the pathway classes to an analysis program.
>

I may need to retrofit some MetaTool classes to fit your classes. Please
share your ideas on what's needed for the MetaTool end of the bridge.

> I'll do a first commit as soon as I am satisfied that the pathway
> classes are well designed. Time is getting precious these days as school
> is starting again, so it might take a while for this to happen.
>

Committing code early may allow users to offer suggestions on class
structure.

>
> _______________________________________________

Cayte

From tarjei at genome.wi.mit.edu Sun Sep 9 19:11:16 2001
From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
In-Reply-To: <001401c1396b$5ed0a320$010a0a0a@cadence.com>
Message-ID: <000601c13984$bb47dd80$0f05f612@mit.edu>

>> Next, I intend to use these classes to implement the missing part from
>> my KEGG parser - the reaction database. Then I'd like to write a bridge
>> from the pathway classes to Metatool. Combined with your output parser
>> this would give us a complete "vertical slice" from database (KEGG/WIT)
>> through the pathway classes to an analysis program.
>>
> I may need to retrofit some MetaTool classes to fit your classes.
> Please share your ideas on what's needed for the MetaTool end of
> the bridge.

What I'm envisioning is simply a function/class that converts a system
object into a string or text file that can be used directly as the input
to Metatool.

I haven't looked at your Metatool module in detail, but unless you
already have a comprehensive method for interfacing with the input side
of Metatool there shouldn't be any need for retrofitting to do this.

How does that sound?

Tarjei

From katel at worldpath.net Mon Sep 10 22:08:50 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <000601c13984$bb47dd80$0f05f612@mit.edu>
Message-ID: <003201c13a66$b4cb92c0$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen"

> What I'm envisioning is simply a function/class that converts a system
> object into a string or text file that can be used directly as the input
> to Metatool.

I was thinking of the output. You may for example want to check if an
elementary mode matches a pathway in a frog. Pathways, elementary modes,
basis vectors, etc, all look the same to the metatool parser. They are
stored as matrices. I don't see why they couldn't all be mixed and matched
in a search query. But they would need to be converted to a format the
search engine understands.

>
> I haven't looked at your Metatool module in detail, but unless you
> already have a comprehensive method for interfacing with the input side
> of Metatool there shouldn't be any need for retrofitting to do this.
>

No, I just parse the output of Metatool. The parser may have to change
with new revs of MetaTool though.

Cayte

From sarah at staff.cs.usyd.edu.au Tue Sep 11 00:46:17 2001
From: sarah at staff.cs.usyd.edu.au (Sarah Kummerfeld)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Local blast problem
Message-ID:

Hi,

Just wondering whether anyone knows about an intermittent
problem with locally run blast.

I'm running lots of blast searches on a very small
database. It will work for a while, or sometimes for
the whole program. But other times it will core dump.

Occasionally I get a python traceback which suggests
that it tried to read a stream that was not there,
but other times there is no error, it just dumps
core.

If I run blast on its own (not from my python program)
it sometimes does the same thing. One time I rebooted
my machine (linux) and found that a blast search I had
run just before and that had crashed, now worked.

I couldn't find anything helpful in the core file.

I had thought it might be some new memory I put in,
so I had it replaced -- but still have the problem.

Any suggestions would be greatly appreciated!

Sarah

From jchang at SMI.Stanford.EDU Tue Sep 11 13:27:15 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Local blast problem
In-Reply-To:
References:
Message-ID:

Hmmm... I'm not sure. Is Python core-dumping, or just blast? I've
run into BLAST crashing before, but it usually results in a truncated
stream and hasn't caused python to core-dump.

Some possible workarounds are to:
1) fork off a separate thread to run blast, so if it crashes, it
won't take down your main application. The MultiProc library might
help here.
2) hack the blastall (or blastpgp) function so that it saves the
output to a file, and then return a handle to that file. This is
technically a variation of the first solution, and might be more
straightforward to implement.

Jeff

At 2:46 PM +1000 9/11/01, Sarah Kummerfeld wrote:
>Hi,
>
>Just wondering whether anyone knows about an intermittent
>problem with locally run blast.
>
>I'm running lots of blast searches on a very small
>database. It will work for a while, or sometimes for
>the whole program. But other times it will core dump.
>
>Occasionally I get a python traceback which suggests
>that it tried to read a stream that was not there,
>but other times there is no error, it just dumps
>core.
>
>If I run blast on its own (not from my python program)
>it sometimes does the same thing. One time I rebooted
>my machine (linux) and found that a blast search I had
>run just before and that had crashed, now worked.
>
>I couldn't find anything helpful in the core file.
>
>I had thought it might be some new memory I put in,
>so I had it replaced -- but still have the problem.
>
>Any suggestions would be greatly appreciated!
>
>Sarah
>
>
>
>_______________________________________________
>Biopython-dev mailing list
>Biopython-dev@biopython.org
>http://biopython.org/mailman/listinfo/biopython-dev

From thomas at cbs.dtu.dk Tue Sep 11 20:43:43 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: Brad Chapman's message of "Wed, 5 Sep 2001 17:46:00 -0400"
References: <15254.29396.129922.274263@genome.cbs.dtu.dk> <20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID:

Brad, I made some changes to our initial SeqRecord and FastaReader/Writer
classes in order to use them for inheritance.

Before I start defining rules for the other formats we should brainstorm
over possible drawbacks/pitfalls of the current implementation
(e.g. alignments).
Any ideas/suggestions ?

cheers
-thomas

# ----- SNIP ----- SNAP ----- SNIP ----- SNAP -----
import sys
import string
import Bio.Alphabet
from Bio.Seq import Seq
#from Bio.SeqRecord import SeqRecord

class SeqRecord:
    def __init__(self, seq, id = "", name = "", description = ""):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description
        # annotations about the whole sequence
        self.annotations = {}

        # annotations about parts of the sequence
        self.features = []

    def __str__(self):
        res = ''
        res += '%s %s' % (self.name, self.seq.data)
        return res

class GenericFormat:
    def __init__(self, instream=None, outstream=None,
                 alphabet = Bio.Alphabet.generic_alphabet):
        self.instream = instream
        self.outstream = outstream
        self.alphabet = alphabet
        self._n = -1
        self._lookahead = None

    def find_start(self):
        # find the start of data
        pass

    def next(self):
        pass

    def __getitem__(self, i):
        # wrapper to the normal Python "for spam in list:" idiom
        assert i == self._n  # forward iteration only!
        x = self.next()
        if x is None:
            raise IndexError, i
        return x

    def write(self, record):
        pass

    def write_records(self, records):
        # In general, can assume homogenous records... useful?
        for record in records:
            self.write(record)

    def close(self):
        return self.outstream.close()

    def flush(self):
        return self.outstream.flush()

class FastaFormat(GenericFormat):
    def __init__(self, instream=None, outstream=None,
                 alphabet = Bio.Alphabet.generic_alphabet):
        # pass outstream and the caller's alphabet through to the base
        # class (they were previously dropped in favour of the defaults)
        GenericFormat.__init__(self, instream, outstream, alphabet)
        self.find_start()

    def find_start(self):
        # skip ahead to the first ">" header line
        line = self.instream.readline()
        while line and line[0] != ">":
            line = self.instream.readline()
        self._lookahead = line
        self._n = 0

    def next(self):
        self._n = self._n + 1

        line = self._lookahead
        if not line:
            return None

        x = string.split(line[1:-1], None, 1)
        if len(x) == 1:
            # split returns a list; take the id string out of it
            id = x[0]
            desc = ""
        else:
            id, desc = x

        lines = []
        line = self.instream.readline()
        while line:
            if line[0] == ">":
                break
            lines.append(line[:-1])
            line = self.instream.readline()

        self._lookahead = line

        return SeqRecord(Seq(string.join(lines, ""), self.alphabet),
                         id = id, name = id, description = desc)

    def write(self, record):
        id = record.id
        assert "\n" not in id
        description = record.description
        assert "\n" not in description

        self.outstream.write(">%s %s\n" % (id, description))

        # wrap the sequence at 60 characters per line
        data = record.seq.tostring()
        for i in range(0, len(data), 60):
            self.outstream.write(data[i:i+60] + "\n")


if __name__ == '__main__':
    txt = """
>TM0001 hypothetical protein
MVYGKEGYGRSKNILLSECVCGIISLELNGFQYFLRGMETL
>TM0002 hypothetical protein
MSPEDWKRLICFHTSKEVLKQTLDDAQQNISDSVSIPLRKY
>TM0003 hypothetical protein
METVKAYEVEDIPAIGFNNSLEVWKLFPASSSRSTSSSFQ
>TM0004 hypothetical protein
MKDLYERFNNSLEVWKLVELFGTSIRIHLFQ
"""

    from StringIO import StringIO

    test = FastaFormat(instream = StringIO(txt))
    while 1:
        r = test.next()
        if not r: break
        print r
# ----- SNIP ----- SNAP ----- SNIP ----- SNAP -----

--
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS: +45 45 252489             Building 208, DK-2800 Lyngby
Fax  +45 45 931585             http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu Wed Sep 12 04:50:15 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To:
References: <15254.29396.129922.274263@genome.cbs.dtu.dk> <20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID: <20010912045015.A24186@ci350185-a.athen1.ga.home.com>

Hi Thomas;

> Brad, I made some changes to our initial SeqRecord and FastaReader/Writer
> classes in order to use them for inheritance.

Cool! Thanks for working on this. With regards to SeqRecord, adding
__str__ stuff for debugging is great.

Abstracting out the common stuff in Reader/Writer is definitely a
plus. I have to admit to not having looked at or used the SeqIO
stuff much, mostly because I always figured it was a
work-in-progress. One thing that comes to mind is you might want to
support the Iterator stuff coming in python 2.2:

http://www.amk.ca/python/2.2/index.html#SECTION000300000000000000000

Seems like all we need to do is add __iter__ that returns the object
itself and we'll be all set (and it should be back compatible and all
of that).

> Before I start defining rules for the other formats we should
> brainstorm over possible drawbacks/pitfalls of the current
> implementation (e.g. alignments).

Hmm, I guess I just figured we would run into pitfalls after it was
already coded :-). Seriously, I'm pretty happy with the SeqRecord +
SeqFeature classes (with a few mistakes I made which I'll write about
in a separate thread in a second), so it might be best to go forward
and see how they handle what we need. Everything does a decent job
of supporting the BioCorba spec, which is a good sign (to me!) that
they can handle "most common cases."
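
The __iter__ idea mentioned above for the forward-only reader could look roughly like the sketch below. It is written for present-day Python 3, where the iterator method is spelled __next__ (the Python 2.2 protocol called it next), and it is only an illustration: ForwardReader and its plain (id, description, sequence) tuples are hypothetical and are not the Bio.SeqIO classes under discussion.

```python
import io


class ForwardReader:
    """Hypothetical forward-only FASTA reader, not the Bio.SeqIO class."""

    def __init__(self, handle):
        self._handle = handle
        self._lookahead = handle.readline()
        # Skip anything that comes before the first ">" header line.
        while self._lookahead and not self._lookahead.startswith(">"):
            self._lookahead = handle.readline()

    def next(self):
        """Return the next (id, description, sequence) tuple, or None at the end."""
        line = self._lookahead
        if not line:
            return None
        parts = line[1:].strip().split(None, 1)
        ident = parts[0] if parts else ""
        desc = parts[1] if len(parts) > 1 else ""
        chunks = []
        line = self._handle.readline()
        while line and not line.startswith(">"):
            chunks.append(line.strip())
            line = self._handle.readline()
        self._lookahead = line
        return ident, desc, "".join(chunks)

    def __iter__(self):
        # The reader is its own iterator, as suggested in the message.
        return self

    def __next__(self):
        record = self.next()
        if record is None:
            raise StopIteration
        return record


handle = io.StringIO(">TM0001 hypothetical protein\nMVYG\nKEGY\n>TM0002\nMSPE\n")
records = list(ForwardReader(handle))
```

Because the existing next() method stays in place, older callers that loop with "while 1: r = reader.next()" keep working alongside "for r in reader:".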

In terms of alignments, I think these will end up being more "high
level" than SeqRecords. For instance, in the Generic alignment stuff
I coded up, an Alignment is basically a collection of SeqRecords. So
the conversions here will be a little different, I guess:

A File of FASTA records (lots of SeqRecords) --> one Alignment
one Alignment --> a bunch of FASTA records

Other than this, I think you're on target (at least with my
understanding of how conversions will work). If you can coerce
Andrew into commenting, he might have some opinions about how the
SeqIO stuff should work, since he wrote it.

May-the-force-be-with-you-on-sequence-conversions-ly yr's,
Brad

From chapmanb at arches.uga.edu Tue Sep 18 21:25:32 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Re: Parsing Protein GenBank Records
In-Reply-To:
References:
Message-ID: <20010918212532.A3580@ci350185-a.athen1.ga.home.com>

Hi Joung;
(ccing this to biopython-dev since this is relevant to everyone)

> I'm having trouble parsing GenBank records obtained from the protein
> database. The parser works fine for nucleotide GenBank records, but not for
> protein records. I would appreciate it very much if you can guide me in the
> right direction for parsing such records.
>
> Here is the code and the error that I get back.
>
> >>> parser = GenBank.RecordParser()
> >>> ncbi = GenBank.NCBIDictionary(database='Protein')
> >>> rec = ncbi['6754304']

The parser does work for proteins in general, but does fail badly on
this particular REFSEQ sequence. In the past, REFSEQ stuff has been
only "sort of" GenBank format, and this record is no exception. It
has a lot of formatting problems (has no identifier for the sequence
type in the LOCUS line, has extra DBSOURCE tag, has non-standard
feature table types and keys (Protein, Region, region_name)).
Anyways, it is a big non-standard formatting mess.

I've fixed the GenBank parser to be able to handle this, and checked the changes into CVS. Diffs to the relevant files (Record.py, __init__.py and genbank_format.py in Bio.GenBank) are also attached to this file in case you don't have CVS access. Thanks for the bug report. Hope this works for you! Brad -- PGP public key available from http://pgp.mit.edu/ -------------- next part -------------- *** Record.py.orig Sat May 19 15:31:16 2001 --- Record.py Tue Sep 18 21:02:18 2001 *************** *** 106,112 **** --- 106,114 ---- o date - The date of submission of the record, in a form like '28-JUL-1998' o accession - list of all accession numbers for the sequence. o nid - Nucleotide identifier number. + o pid - Proteint identifier number o version - The accession number + version (ie. AB01234.2) + o db_source - Information about the database the record came from o gi - The NCBI gi identifier for the record. o keywords - A list of keywords related to the record. o segment - If the record is one of a series, this is info about which *************** *** 153,159 **** --- 155,163 ---- self.definition = '' self.accession = [] self.nid = '' + self.pid = '' self.version = '' + self.db_source = '' self.gi = '' self.keywords = [] self.segment = '' *************** *** 185,191 **** --- 189,197 ---- output += self._accession_line() output += self._version_line() output += self._nid_line() + output += self._pid_line() output += self._keywords_line() + output += self._db_source_line() output += self._segment_line() output += self._source_line() output += self._organism_line() *************** *** 210,216 **** output += "%-9s" % self.locus output += " " # 22 space output += "%7s" % self.size ! output += " bp " # treat circular types differently, since they'll have long residue # types --- 216,225 ---- output += "%-9s" % self.locus output += " " # 22 space output += "%7s" % self.size ! if self.residue_type.find("PROTEIN") >= 0: ! output += " aa" ! else: ! 
output += " bp " # treat circular types differently, since they'll have long residue # types *************** *** 272,277 **** --- 281,296 ---- output = "" return output + def _pid_line(self): + """Output for PID line. Presumedly, PID usage is also obsolete. + """ + if self.pid: + output = Record.BASE_FORMAT % "PID" + output += "%s\n" % self.pid + else: + output = "" + return output + def _keywords_line(self): """Output for the KEYWORDS line. """ *************** *** 288,293 **** --- 307,322 ---- output += _wrapped_genbank(keyword_info, Record.GB_BASE_INDENT) + return output + + def _db_source_line(self): + """Output for DBSOURCE line. + """ + if self.db_source: + output = Record.BASE_FORMAT % "DBSOURCE" + output += "%s\n" % self.db_source + else: + output = "" return output def _segment_line(self): -------------- next part -------------- *** __init__.py.orig Sat Jul 28 12:02:25 2001 --- __init__.py Tue Sep 18 21:13:48 2001 *************** *** 98,112 **** def __getitem__(self, key): """Retrieve an item from the dictionary. """ ! print "keys:", self._index.keys() # get the location of the record of interest in the file start, len = self._index[key] ! print "start:", start, "len:", len # read through and get the data from the file self._handle.seek(start) data = self._handle.read(len) ! print "data:", data # run the data through the parser if one is specified if self._parser is not None: --- 98,112 ---- def __getitem__(self, key): """Retrieve an item from the dictionary. """ ! # print "keys:", self._index.keys() # get the location of the record of interest in the file start, len = self._index[key] ! # print "start:", start, "len:", len # read through and get the data from the file self._handle.seek(start) data = self._handle.read(len) ! 
# print "data:", data # run the data through the parser if one is specified if self._parser is not None: *************** *** 434,439 **** --- 434,442 ---- def nid(self, content): self.data.annotations['nid'] = content + def pid(self, content): + self.data.annotations['pid'] = content + def version(self, version_id): """Set the version to overwrite the id. *************** *** 443,448 **** --- 446,454 ---- """ self.data.id = version_id + def db_source(self, content): + self.data.annotations['db_source'] = content.rstrip() + def gi(self, content): self.data.annotations['gi'] = content *************** *** 485,510 **** (bases 1 to 86436) (sites) (bases 1 to 105654; 110423 to 111122) """ ! # first remove the parentheses ref_base_info = content[1:-1] all_locations = [] ! # only attempt to get out information if we find the words ! # 'bases' and 'to' if (string.find(ref_base_info, 'bases') != -1 and string.find(ref_base_info, 'to') != -1): # get rid of the beginning 'bases' ref_base_info = ref_base_info[5:] ! # split possibly multiple locations using the ';' ! all_base_info = string.split(ref_base_info, ';') ! ! for base_info in all_base_info: ! start, end = string.split(base_info, 'to') ! this_location = \ ! SeqFeature.FeatureLocation(int(string.strip(start)), ! int(string.strip(end))) ! all_locations.append(this_location) # make sure if we are not finding information then we have # the string 'sites' or the string 'bases' --- 491,516 ---- (bases 1 to 86436) (sites) (bases 1 to 105654; 110423 to 111122) + 1 (residues 1 to 182) """ ! # first remove the parentheses or other junk ref_base_info = content[1:-1] all_locations = [] ! # parse if we've got 'bases' and 'to' if (string.find(ref_base_info, 'bases') != -1 and string.find(ref_base_info, 'to') != -1): # get rid of the beginning 'bases' ref_base_info = ref_base_info[5:] ! locations = self._split_reference_locations(ref_base_info) ! all_locations.extend(locations) ! elif (ref_base_info.find("residues") >= 0 and ! 
ref_base_info.find("to") >= 0): ! residues_start = ref_base_info.find("residues") ! # get only the information after "residues" ! ref_base_info = ref_base_info[(residues_start + len("residues ")):] ! locations = self._split_reference_locations(ref_base_info) ! all_locations.extend(locations) # make sure if we are not finding information then we have # the string 'sites' or the string 'bases' *************** *** 517,523 **** (ref_base_info, self.data.id)) self._current_ref.location = all_locations ! def authors(self, content): self._current_ref.authors = content --- 523,551 ---- (ref_base_info, self.data.id)) self._current_ref.location = all_locations ! ! def _split_reference_locations(self, location_string): ! """Get reference locations out of a string of reference information ! ! The passed string should be of the form: ! ! 1 to 20; 20 to 100 ! ! This splits the information out and returns a list of location objects ! based on the reference locations. ! """ ! # split possibly multiple locations using the ';' ! all_base_info = location_string.split(';') ! ! new_locations = [] ! for base_info in all_base_info: ! start, end = base_info.split('to') ! this_location = \ ! SeqFeature.FeatureLocation(int(string.strip(start)), ! int(string.strip(end))) ! new_locations.append(this_location) ! return new_locations ! def authors(self, content): self._current_ref.authors = content *************** *** 905,913 **** --- 933,947 ---- def nid(self, content): self.data.nid = content + def pid(self, content): + self.data.pid = content + def version(self, content): self.data.version = content + def db_source(self, content): + self.data.db_source = content.rstrip() + def gi(self, content): self.data.gi = content *************** *** 1070,1076 **** # in the MartelParser self.interest_tags = ["locus", "size", "residue_type", "data_file_division", "date", ! 
"definition", "accession", "nid", "version", "gi", "keywords", "segment", "source", "organism", "taxonomy", "reference_num", --- 1104,1111 ---- # in the MartelParser self.interest_tags = ["locus", "size", "residue_type", "data_file_division", "date", ! "definition", "accession", "nid", ! "pid", "version", "db_source", "gi", "keywords", "segment", "source", "organism", "taxonomy", "reference_num", -------------- next part -------------- *** genbank_format.py.orig Thu May 10 17:42:43 2001 --- genbank_format.py Tue Sep 18 21:07:11 2001 *************** *** 142,147 **** --- 142,156 ---- nid + Martel.AnyEol()) + # PID g6754304 + pid = Martel.Group("pid", + Martel.Re("[\w\d]+")) + pid_line = Martel.Group("pid_line", + Martel.Str("PID") + + blank_space + + pid + + Martel.AnyEol()) + # version and GI line # VERSION AC007323.5 GI:6587720 version = Martel.Group("version", *************** *** 159,164 **** --- 168,181 ---- gi + Martel.AnyEol()) + # DBSOURCE REFSEQ: accession NM_010510.1 + db_source = Martel.Group("db_source", + Martel.ToEol()) + db_source_line = Martel.Group("db_source_line", + Martel.Str("DBSOURCE") + + blank_space + + db_source) + # keywords line # KEYWORDS antifreeze protein homology; cold-regulated gene; cor6.6 gene; # KIN1 homology. 
*************** *** 312,319 **** --- 329,338 ---- "primer", # Primer binding region used with PCR XXX not in # http://www.ncbi.nlm.nih.gov/collab/FT/index.html "promoter", # A region involved in transcription initiation + "Protein", # A REFSEQ invention for referring to a protein "protein_bind", # Non-covalent protein binding site on DNA or RNA "RBS", # Ribosome binding site + "Region", # Another REFSEQ invention that doesn't make any sense "rep_origin", # Replication origin for duplex DNA "repeat_region", # Sequence containing repeated subsequences "repeat_unit", # One repeated unit of a repeat_region *************** *** 424,429 **** --- 443,449 ---- # evidence for a feature "clone_lib", # Clone library from which the sequence was obtained "clone", # Clone from which the sequence was obtained + "coded_by", # REFSEQ invention to specify a crossreference "codon_start", # Indicates the first base of the first complete codon # in a CDS (as 1 or 2 or 3) "codon", # Specifies a codon that is different from any found *************** *** 505,510 **** --- 525,531 ---- "rearranged", # If the sequence shown is DNA and a member of the # immunoglobulin family, this qualifier is used to # denote that the sequence is from rearranged DNA + "region_name", # REFSEQ invention to go with their Region Type "replace", # Indicates that the sequence identified a feature's # intervals is replaced by the sequence shown in # "text" *************** *** 624,630 **** --- 645,653 ---- definition_block + \ accession_block + \ Martel.Opt(nid_line) + \ + Martel.Opt(pid_line) + \ Martel.Opt(version_line) + \ + Martel.Opt(db_source_line) + \ keywords_block + \ Martel.Opt(segment_line) + \ source_block + \ *************** *** 633,639 **** Martel.Opt(comment_block) + \ features_line + \ Martel.Rep1(feature) + \ ! base_count_line + \ sequence_entry + \ record_end) --- 656,662 ---- Martel.Opt(comment_block) + \ features_line + \ Martel.Rep1(feature) + \ ! 
                 Martel.Opt(base_count_line) + \
                 sequence_entry + \
                 record_end)

From thomas at cbs.dtu.dk Wed Sep 19 04:53:49 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Q: How to undo a CVS update ?
Message-ID:

ARGHHHHHH !

Unfortunately I made a cvs update before committing my local changes to
generic.py ... :(

Are there any hidden cvs commands to undo this stupid error ?

thx
-thomas

--
Sicheritz-Ponten Thomas, Ph.D
CBS, Department of Biotechnology
thomas@biopython.org
The Technical University of Denmark
CBS: +45 45 252489
Building 208, DK-2800 Lyngby
Fax +45 45 931585
http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From adalke at mindspring.com Wed Sep 19 06:34:19 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Q: How to undo a CVS update ?
Message-ID: <00d501c140f6$a3ef8380$0301a8c0@josiah.dalkescientific.com>

Thomas of the Moving Turtle:
>Unfortunately I made a cvs update before committing my local changes to
>generic.py ... :(
>
>Are there any hidden cvs commands to undo this stupid error ?

Don't know the official way to do that. Here's what I do. Suppose you
want to revert to version 1.15:

  rm filename.py
  cvs co -r1.15 filename.py
  mv filename.py filename.py.tmp
  cvs co filename.py
  rm filename.py
  mv filename.py.tmp filename.py
  cvs commit filename.py

The reason for the mv, co, rm, mv back is that the co with a version
number is sticky. By moving it then checking out the current version,
I get rid of the sticky part.

There's documentation at http://www.cvshome.org/docs/

Huh. How about http://www.cvshome.org/docs/manual/cvs_4.html#SEC53

] However, this isn't the easiest way, if you are asking
] how to undo a previous checkin (in this example, put
] `file1' back to the way it was as of revision 1.1). In
] that case you are better off using the `-j' option to
] update; for further discussion see 5.8 Merging differences
] between any two revisions.

and points to http://www.cvshome.org/docs/manual/cvs_5.html#SEC62

Try out the suggestion on that page. And let me know if it works :)

Andrew
dalke@dalkescientific.com

From chapmanb at arches.uga.edu Wed Sep 19 06:53:36 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Proposed Incompatible Changes to GenBank SeqFeatures
Message-ID: <20010919065335.B4962@ci350185-a.athen1.ga.home.com>

Hello all;

Recently I've been slugging through the biopython implementation of the
new BioCorba spec, and have been forced to come back into some bad
coding I did on the parts of the GenBank parser that convert GenBank
into SeqRecord and SeqFeature objects. Most of the mistakes that I'm
working on fixing are detailed by Andrew:

http://www.biopython.org/pipermail/biopython-dev/2001-July/000451.html

(unfortunately he sent this right before ISMB when I was crazily coding
to get my poster done :-<)

Anyways, I'd like to try and fix the problems he mentions here (and
which are also a problem for biopython-corba), and have got the changes
together. The issue is that some of the changes are back-incompatible,
so I'd like to talk about them here and get people's thoughts on whether
the changes will severely affect their existing code. Note that this
only influences parsing GenBank into SeqRecord and SeqFeature objects,
not into GenBank Record objects (ie. the FeatureParser, not the
RecordParser).

I'll go through the changes step by step, and attach the diffs (against
current CVS) which implement the changes. I'll start with incompatible
changes, and then go on to the more benign back-compatible changes.
Comments are very welcome. Okay, here we go:

Back-incompatible changes
-------------------------

1.
Andrew:
> feature qualifiers shouldn't really be a dictionary

I have been storing feature qualifiers as a dictionary with the keys
equal to the qualifier keys and the values equal to the qualifier
values. The problem comes when there are multiple keys of the same
name, ie. with db_xref in the following CDS Feature:

     CDS             50..250
                     /gene="cor6.6"
                     /db_xref="GI:16230"
                     /db_xref="SWISS-PROT:P31169"

What I did was this hideous hack to add numbers to the end of the
qualifiers to make them unique, so the feature above would be:

{"gene" : "cor6.6", "db_xref" : "GI:16230", "db_xref1" : "SWISS-PROT:P31169"}

As Andrew points out, this is hideous and all around bad news. I would
like to propose to use a combination dictionary/list structure to hold
multiple keys, so the above now looks like:

{"gene" : ["cor6.6"], "db_xref" : ["GI:16230", "SWISS-PROT:P31169"]}

This affects all people using qualifiers (since even single items are
now in a list), but I think it is easier to always have the values be
lists (otherwise you'll always have to check whether the value is of
type("") or type([])).

2.
> - What's the numbering system of the FeatureLocation?

There is a difference between "biological" coordinates used in GenBank
and python/biopython coordinates. In "biological" coordinates 1 is the
first base in a sequence, and if you want to get 1 to 50, that includes
both the first and 50th base. In python, 0 is the first base in the
sequence, and 1 to 50 includes 1, but not 50.

Previously I did no conversion of GenBank "biological" coordinates to
python coordinates, but Andrew argues (and I agree) that I should do
this to make things less confusing. The new implementation does the
conversion, so this will give an "off-by-one" type error to people
using the current numbers.

3. Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:

     CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that.

I'd like to propose adding a location_operator attribute to SeqFeature
(already done in CVS) and have the top level SeqFeature be type "CDS"
with location_operator "join", and all sub_features also be of the same
type and location_operator. This will only affect people who relied on
the previous (fairly ugly) type/location_operator concatenation
mechanism.

4. strand information

Previously I had the default strand for a SeqFeature be 0, which I now
think is a mistake. In my mind, the strand information is:

None -- No strand information, or not relevant (ie. protein)
1 -- The top strand
-1 -- The bottom strand
0 -- both strands

So, I changed the default strand info to be None. Hopefully this is a
relatively minor change.

Back-compatible changes
-----------------------

1. Andrew:
> The biopython SeqFeature currently must be used like:
>   feature = SeqFeature()
>   feature.type = "variation"
>   ...
> I would much rather prefer allowing the values to be set
> through the constructor, as in
>   feature = SeqFeature(type = "variation", ...)

I agree with Andrew on this and have implemented it. The only current
issue is that if I try to set sub_features or annotations in the
constructor I'll get infinite recursion problems when printing out the
features (ie. try doing this and running test_GenBank). I haven't been
able to figure out why that is yet.

2. New attributes in SeqFeature

I added location_operator and id attributes to SeqFeatures (to help
support BioCorba), which can now be used. These shouldn't affect
anyone's old code and you can now use 'em if you want.

Whew, I think that is everything :-).
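
As a rough illustration of incompatible changes 1 and 2 above, here is a minimal sketch of list-valued qualifiers and the coordinate shift. It is not the code in Bio.GenBank: add_qualifier and to_python_range are names made up for this example, and the 300-base sequence is only a placeholder.

```python
# Illustrative sketch only: add_qualifier and to_python_range are
# hypothetical helpers, not functions from Bio.GenBank or Bio.SeqFeature.


def add_qualifier(qualifiers, key, value):
    """Append value to the list stored under key, creating the list if needed."""
    qualifiers.setdefault(key, []).append(value)
    return qualifiers


def to_python_range(start, end):
    """Convert an inclusive, 1-based GenBank range to a 0-based, end-exclusive one."""
    return start - 1, end


# Qualifiers from the CDS example: repeated keys end up in the same list.
qualifiers = {}
add_qualifier(qualifiers, "gene", "cor6.6")
add_qualifier(qualifiers, "db_xref", "GI:16230")
add_qualifier(qualifiers, "db_xref", "SWISS-PROT:P31169")

# GenBank "50..250" covers bases 50 through 250 inclusive.
py_start, py_end = to_python_range(50, 250)
sequence = "A" * 300  # placeholder sequence, just to show the slice
cds = sequence[py_start:py_end]
```

With this convention qualifiers["db_xref"] is always a list, even when a feature carries a single cross-reference, and the slice holds 201 bases, the same count as the inclusive range 50..250.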
Thanks for reading all of the way through this. I'd really like comments on whether these changes are good/bad, and most importantly whether I can check 'em in or should do something different. Thanks much! Brad -- PGP public key available from http://pgp.mit.edu/ -------------- next part -------------- *** __init__.py Wed Sep 19 05:56:29 2001 --- __init__.py.orig Tue Sep 18 21:13:48 2001 *************** *** 374,397 **** """ return text.replace(" ", "") - def _convert_to_python_numbers(self, start, end): - """Convert a start and end range to python notation. - - In GenBank, starts and ends are defined in "biological" coordinates, - where 1 is the first base and [i, j] means to include both i and j. - - In python, 0 is the first base and [i, j] means to include i, but - not j. - - So, to convert "biological" to python coordinates, we need to - subtract 1 from the start, and leave the end and things should - be converted happily. - """ - new_start = start - 1 - new_end = end - - return new_start, new_end - class _FeatureConsumer(_BaseGenBankConsumer): """Create a SeqRecord object with Features to return. --- 374,379 ---- *************** *** 558,567 **** new_locations = [] for base_info in all_base_info: start, end = base_info.split('to') ! new_start, new_end = \ ! self._convert_to_python_numbers(int(start.strip()), ! int(end.strip())) ! this_location = SeqFeature.FeatureLocation(new_start, new_end) new_locations.append(this_location) return new_locations --- 540,548 ---- new_locations = [] for base_info in all_base_info: start, end = base_info.split('to') ! this_location = \ ! SeqFeature.FeatureLocation(int(string.strip(start)), ! int(string.strip(end))) new_locations.append(this_location) return new_locations *************** *** 707,717 **** # current feature, then get the information for this feature for inner_element in function.args: new_sub_feature = SeqFeature.SeqFeature() ! # inherit the type from the parent ! new_sub_feature.type = cur_feature.type ! 
# add the join or order info to the location_operator ! cur_feature.location_operator = function.name ! new_sub_feature.location_operator = function.name # inherit references and strand from the parent feature new_sub_feature.ref = cur_feature.ref new_sub_feature.ref_db = cur_feature.ref_db --- 688,695 ---- # current feature, then get the information for this feature for inner_element in function.args: new_sub_feature = SeqFeature.SeqFeature() ! # add _join or _order to the name to make the type clear ! new_sub_feature.type = cur_feature.type + '_' + function.name # inherit references and strand from the parent feature new_sub_feature.ref = cur_feature.ref new_sub_feature.ref_db = cur_feature.ref_db *************** *** 726,735 **** # set the location of the top -- this should be a combination of # the start position of the first sub_feature and the end position # of the last sub_feature - - # these positions are already converted to python coordinates - # (when the sub_features were added) so they don't need to - # be converted again feature_start = cur_feature.sub_features[0].location.start feature_end = cur_feature.sub_features[-1].location.end cur_feature.location = SeqFeature.FeatureLocation(feature_start, --- 704,709 ---- *************** *** 788,796 **** # check if we just have a single base if not(isinstance(range_info, LocationParser.Range)): pos = self._get_position(range_info) ! # move the single position back one to be consistent with how ! # python indexes numbers (starting at 0) ! pos.position = pos.position - 1 return SeqFeature.FeatureLocation(pos, pos) # otherwise we need to get both sides of the range else: --- 762,768 ---- # check if we just have a single base if not(isinstance(range_info, LocationParser.Range)): pos = self._get_position(range_info) ! 
return SeqFeature.FeatureLocation(pos, pos) # otherwise we need to get both sides of the range else: *************** *** 798,807 **** start_pos = self._get_position(range_info.low) end_pos = self._get_position(range_info.high) - start_pos.position, end_pos.position = \ - self._convert_to_python_numbers(start_pos.position, - end_pos.position) - return SeqFeature.FeatureLocation(start_pos, end_pos) def _get_position(self, position): --- 770,775 ---- *************** *** 854,867 **** # if we've got a key from before, add it to the dictionary of # qualifiers if self._cur_qualifier_key: ! key = self._cur_qualifier_key ! value = self._cur_qualifier_value ! # if the qualifier name exists, append the value ! if self._cur_feature.qualifiers.has_key(key): ! self._cur_feature.qualifiers[key].append(value) ! # otherwise start a new list of the key with its values ! else: ! self._cur_feature.qualifiers[key] = [value] def qualifier_key(self, content): """When we get a qualifier key, use it as a dictionary key. --- 822,837 ---- # if we've got a key from before, add it to the dictionary of # qualifiers if self._cur_qualifier_key: ! # get a unique name ! unique_name = self._cur_qualifier_key ! counter = 1 ! while self._cur_feature.qualifiers.has_key(unique_name): ! unique_name = self._cur_qualifier_key + str(counter) ! counter = counter + 1 ! ! ! self._cur_feature.qualifiers[unique_name] = \ ! self._cur_qualifier_value def qualifier_key(self, content): """When we get a qualifier key, use it as a dictionary key. -------------- next part -------------- *** SeqFeature.py.orig Tue Sep 18 20:35:03 2001 --- SeqFeature.py Wed Sep 19 01:30:26 2001 *************** *** 69,75 **** 40 and 50 to 60, respectively. """ def __init__(self, location = None, type = '', location_operator = '', ! strand = 0, id = "", qualifiers = {}, sub_features = [], ref = None, ref_db = None): """Initialize a SeqFeature on a Sequence. --- 69,75 ---- 40 and 50 to 60, respectively. 
""" def __init__(self, location = None, type = '', location_operator = '', ! strand = None, id = "", qualifiers = {}, sub_features = [], ref = None, ref_db = None): """Initialize a SeqFeature on a Sequence. From thomas at cbs.dtu.dk Wed Sep 19 07:28:11 2001 From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten) Date: Sat Mar 5 14:43:04 2005 Subject: [Biopython-dev] SeqIO Message-ID: Hej, ok - I rewrote and _commited_ the lost code for sequence conversion :-) Brad or/and Andrew: could you check how we can use the GenBank and the SWISS parser in the SeqIO stuff ? The current file for seqeunce format IO is SeqIO/generic.py ... (should definitely change name, maybe to SeqIO.py ?) We need to design a method for guessing sequence formats - which is quite easy with filenames and file streams, but a litte trickier with sys.stdin ... (you can't tell and seek in a stdin - can you???) cheers -thomas -- Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology thomas@biopython.org The Technical University of Denmark CBS: +45 45 252489 Building 208, DK-2800 Lyngby Fax +45 45 931585 http://www.cbs.dtu.dk/thomas De Chelonian Mobile ... The Turtle Moves ... From pewilkinson at informaxinc.com Mon Sep 24 16:43:08 2001 From: pewilkinson at informaxinc.com (Peter Wilkinson) Date: Sat Mar 5 14:43:04 2005 Subject: [Biopython-dev] Parsing features fuzzie in Genbank annotation att Brad. In-Reply-To: <200109191055.f8JAt6p14463@pw600a.bioperl.org> Message-ID: <002401c14539$859ff670$f00210ac@l001696w00> Since this is being reviewed .... Brad, please make a note of this everyone should understand this is there. In genbank records the following join format will pop up: "join(10000,10200..10450)" The numbers used here represent a one base join with a second exon. Can this happen in biology, I am still working that out, or what does this annotation represent of the biology, if this is not a real one base join, I am working that out too. 
However, please note that this is a possible annotation in any case.
Programs that use feature information should know how to handle this.

This cropped up while I was parsing the Refseq S_cerevisiae data. Go to
NCBI and download Chromosome 9, and you will see what I am talking
about.

I will post what I find out, but if anyone else has some insight on
this, please post.

Peter

P.S. pretty unbelievable is it not?

In response to the following comment
--------------------------

3. Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:

     CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that.

I'd like to propose adding a location_operator attribute to SeqFeature
(already done in CVS) and have the top level SeqFeature be type "CDS"
with location_operator "join", and all sub_features also be of the same
type and location_operator. This will only affect people who relied on
the previous (fairly ugly) type/location_operator concatenation
mechanism.

From chapmanb at arches.uga.edu Wed Sep 26 22:04:59 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] Parsing features fuzzie in Genbank annotation att Brad.
In-Reply-To: <002401c14539$859ff670$f00210ac@l001696w00>
References: <002401c14539$859ff670$f00210ac@l001696w00>
Message-ID: <20010926220458.A27721@ci350185-a.athen1.ga.home.com>

Hi Peter;

> In genbank records the following join format will pop up:
>
> "join(10000,10200..10450)"

Thanks for the heads up on this. I tried this location in Andrew's
parser and it seems to handle it just fine, so I'm pretty sure the
GenBank stuff should be able to handle it. If you run across this case
in a record and the parser fails or produces erroneous results, send
the accession number along and I can fix things.

> The numbers used here represent a one base join with a second exon.
> Can this happen in biology,

Hmm, I'm not sure if I can think of a biological case off the top of my
head where this makes good sense. It certainly doesn't make sense for
an exon (ie. a 1 base pair exon) but might make sense if the location
described something like a protein binding location or something
similar.

> P.S. pretty unbelievable is it not?

:-). GenBank has lots of surprises.

BTW, since I haven't heard any negative comments about my proposed
SeqFeature/GenBank parser changes, I committed 'em to CVS. If anyone
gets problems on account of this, please let me know!

Brad
--
PGP public key available from http://pgp.mit.edu/

From chapmanb at arches.uga.edu Wed Sep 26 22:47:54 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:04 2005
Subject: [Biopython-dev] SeqIO
In-Reply-To:
References:
Message-ID: <20010926224754.E27721@ci350185-a.athen1.ga.home.com>

Hi Thomas!

> ok - I rewrote and _commited_ the lost code for sequence conversion :-)

Glad it made it in okay :-)

> Brad or/and Andrew: could you check how we can use the GenBank and the SWISS
> parser in the SeqIO stuff ?

Yeah, I spent some time looking through it (a few more comments are
below), and I think what I'd need to do for GenBank is just create a
converter that takes SeqRecord objects and turns them into a
GenBank.Record object. This way, I could just do str(the_record) to get
the output and re-use the output work I already did.

One big question I have is, how many of the features do you want to try
and retain in the conversion? So, for GenBank format, do you want me to
just write out the basic information (sequence, type, etc) and ignore
the feature table, or do we want to somehow map the features from
format to format (ie. EMBL <-> GenBank).
If we want to think about feature conversion, this'll be tougher and we'll need to think about converters between "similar" formats like EMBL and GenBank. > The current file for seqeunce format IO is SeqIO/generic.py ... (should > definitely change name, maybe to SeqIO.py ?) You could just change it to __init__.py, like in the other modules (so we could do from Bio import SeqIO and get it). I also had a couple of questions from looking at this: => Why are you duplicating SeqRecord in the SeqIO stuff instead of just reusing it? I don't think I understand what you are talking about with stripping newlines... => Is there a way to plug in a specialized converter for similar formats, like I was talking about above with EMBL/GenBank? I think Jeff suggested this earlier, and it seems like a good idea to me. I guess right now you could subclass ReadSeq and define your own Convert function, but maybe there is another way to do it. Thanks for your work and code on this. Nice to see it progressing along! Brad -- PGP public key available from http://pgp.mit.edu/ From biopython-bugs at bioperl.org Thu Sep 27 05:22:02 2001 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:04 2005 Subject: [Biopython-dev] Notification: incoming/43 Message-ID: <200109270922.f8R9M2p18294@pw600a.bioperl.org> JitterBug notification new message incoming/43 Message summary for PR#43 From: mkersz@pasteur.fr Subject: GenBank parser fails (on large files?) Date: Thu, 27 Sep 2001 05:22:01 -0400 0 replies 0 followups ====> ORIGINAL MESSAGE FOLLOWS <==== >From mkersz@pasteur.fr Thu Sep 27 05:22:02 2001 Received: from localhost (localhost [127.0.0.1]) by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f8R9M1p18288 for ; Thu, 27 Sep 2001 05:22:01 -0400 Date: Thu, 27 Sep 2001 05:22:01 -0400 Message-Id: <200109270922.f8R9M1p18288@pw600a.bioperl.org> From: mkersz@pasteur.fr To: biopython-bugs@bioperl.org Subject: GenBank parser fails (on large files?) 
Full_Name: Michel Kerszberg Module: GenBank Version: 1.00a3 OS: linux 2.2 Submission from: cache.pasteur.fr (157.99.64.13) fetch ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk open this with file_handle = open( ... ,'r') pars = GenBank.FeatureParser() iter = GenBank.Iterator(file_handle, pars) rec = iter.next() This fails with: rec = iter.next() File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 182, in next return self._parser.parse(File.StringHandle(data)) File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 260, in parse self._scanner.feed(handle, self._consumer) File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 1108, in feed self._parser.parseFile(handle) File "/usr/lib/python2.0/site-packages/Martel/Parser.py", line 205, in parseFile self.parseString(fileobj.read()) File "/usr/lib/python2.0/site-packages/Martel/Parser.py", line 233, in parseString self._err_handler.fatalError(result) File "/var/tmp/python-root//usr/lib/python2.0/xml/sax/handler.py", line 38, in fatalError Martel.Parser.ParserPositionException: error parsing at or beyond character 42 This is in the first line of the record, which seems correctly formatted. No amount of massaging of the file seems to help. I have seen this problem reported with other large GenBank records. From adalke at mindspring.com Thu Sep 27 15:35:50 2001 From: adalke at mindspring.com (Andrew Dalke) Date: Sat Mar 5 14:43:04 2005 Subject: [Biopython-dev] GenBank parser fails (on large files?) Message-ID: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com> >Full_Name: Michel Kerszberg >ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_ H37Rv/AL123456.gbk >This fails with: >Martel.Parser.ParserPositionException: error parsing at or > beyond character 42 > >This is in the first line of the record, which seems >correctly formatted. No amount of massaging of the >file seems to help. 
> >I have seen this problem reported with other large >GenBank records. I've found the problem. Here's the format definition locus_line = Martel.Group("locus_line", ... blank_space + Martel.Opt(residue_type + residue_type = Martel.Group("residue_type", Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + Martel.Opt(blank_space + Martel.Str("circular"))) In this record, the locus line is LOCUS AL123456 4411529 bp circular BCT 07-JUL-1998 ^^^^^^^^^^ all spaces so there is no residue type. The 'blank_space' in 'locus_line' eats up all those spaces, leaving the parser at the word 'circular'. That doesn't match the residue_prefixes or the residue_types. There's no " " so it doesn't match the 'blank_space', so the residue_type fails. Here's a likely solution - move 'blank_space' to occur after the residue_type = Martel.Group("residue_type", Martel.Alt( Martel.Opt(Martel.Alt(*residue_prefixes)) + \ Martel.Alt(*residue_types) + \ Martel.Opt(blank_space + Martel.Str("circular")), Martel.Opt(Martel.Str("circular"))) I've not tested this, since I think the format definition needs to be revisited first because I've now more experience in writing these things, and second because the LOCUS line definition is changing in the next couple months, according to ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt Andrew From chapmanb at arches.uga.edu Thu Sep 27 16:05:54 2001 From: chapmanb at arches.uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:05 2005 Subject: [Biopython-dev] GenBank parser fails (on large files?) 
In-Reply-To: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com> References: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com> Message-ID: <20010927160554.A29159@ci350185-a.athen1.ga.home.com> Hi Michel, Andrew; Michel: > >ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_ > H37Rv/AL123456.gbk > > >This fails with: > > >Martel.Parser.ParserPositionException: error parsing at or > > beyond character 42 Andrew: > I've found the problem. Here's the format definition [...] > In this record, the locus line is > > LOCUS AL123456 4411529 bp circular BCT 07-JUL-1998 > ^^^^^^^^^^ all spaces > > so there is no residue type. The 'blank_space' in 'locus_line' > eats up all those spaces, leaving the parser at the word 'circular'. Thanks for looking at this Andrew -- I've also been checking it out concurrently and came to the same conclusion. Wow, I never would have expected to have circular without the residue type :-). I've fixed this and also a second problem with this file, the version line has no GI: VERSION AL123456 I've added these examples to the GenBankFormat test so that we should be able to catch them in the future. For Michel, the fixes are in CVS and the patches to GenBank/__init__.py and GenBank/genbank_format.py are attached. With these I can parse your file without problems. I've also added a couple of things which will (hopefully) speed up dealing with large sequences some. Thanks for the bug report on this; Let us know if you come across anything else that fails. > I've not tested this, since I think the format definition needs > to be revisited first because I've now more experience in writing > these things, and second because the LOCUS line definition is > changing in the next couple months, according to > > ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt Yeah, I had read about this previously and I _think_ the format will handle them (after some modifications I made a while back). 
In test_GenBankFormat.py there are a couple of example locus lines with this new format that it'll parse okay. We'll see if it will hold up when the full-scale change comes on, though. But, you are still more than welcome to attack the locus line parsing anytime you feel up to it -- you are definately the master o' Martel :-). Brad -- PGP public key available from http://pgp.mit.edu/ -------------- next part -------------- Index: genbank_format.py =================================================================== RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v retrieving revision 1.8 diff -c -r1.8 genbank_format.py *** genbank_format.py 2001/09/19 01:15:52 1.8 --- genbank_format.py 2001/09/27 20:02:47 *************** *** 85,92 **** residue_type = Martel.Group("residue_type", Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + ! Martel.Opt(blank_space + Martel.Str("circular"))) date = Martel.Group("date", Martel.Re("[-\w]+")) --- 85,93 ---- residue_type = Martel.Group("residue_type", Martel.Opt(Martel.Alt(*residue_prefixes)) + Martel.Opt(Martel.Alt(*residue_types)) + ! Martel.Opt(Martel.Opt(blank_space) + Martel.Str("circular"))) + date = Martel.Group("date", Martel.Re("[-\w]+")) *************** *** 163,171 **** Martel.Str("VERSION") + blank_space + version + ! blank_space + ! Martel.Str("GI:") + ! gi + Martel.AnyEol()) # DBSOURCE REFSEQ: accession NM_010510.1 --- 164,172 ---- Martel.Str("VERSION") + blank_space + version + ! Martel.Opt(blank_space + ! Martel.Str("GI:") + ! 
gi) + Martel.AnyEol()) # DBSOURCE REFSEQ: accession NM_010510.1 From biopython-bugs at bioperl.org Thu Sep 27 16:11:50 2001 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:05 2005 Subject: [Biopython-dev] Notification: incoming/43 Message-ID: <200109272011.f8RKBop24395@pw600a.bioperl.org> JitterBug notification chapmanb changed notes Message summary for PR#43 From: mkersz@pasteur.fr Subject: GenBank parser fails (on large files?) Date: Thu, 27 Sep 2001 05:22:01 -0400 0 replies 0 followups Notes: Parser problem was with a LOCUS line containing "circular" but no sequence type information (ie. DNA, RNA, etc). This is fixed in CVS in revision 1.24 of __init__.py and 1.9 of genbank_format.py
From biopython-bugs at bioperl.org Thu Sep 27 16:11:50 2001 From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org) Date: Sat Mar 5 14:43:05 2005 Subject: [Biopython-dev] Notification: incoming/43 Message-ID: <200109272011.f8RKBop24399@pw600a.bioperl.org> JitterBug notification chapmanb moved PR#43 from incoming to fixed-bugs Message summary for PR#43 From: mkersz@pasteur.fr Subject: GenBank parser fails (on large files?) Date: Thu, 27 Sep 2001 05:22:01 -0400 0 replies 0 followups Notes: Parser problem was with a LOCUS line containing "circular" but no sequence type information (ie. DNA, RNA, etc).
This is fixed in CVS in revision 1.24 of __init__.py and 1.9 of genbank_format.py

From biopython-bugs at bioperl.org Thu Sep 27 16:12:18 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/40
Message-ID: <200109272012.f8RKCIp24464@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#40 from incoming to fixed-bugs

Message summary for PR#40
From: joungjh@AptusGenomics.com
Subject: retrieving GenBank records from NCBI
Date: Tue, 14 Aug 2001 16:44:34 -0400
0 replies 0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From joungjh@AptusGenomics.com Tue Aug 14 16:44:35 2001
Received: from localhost (localhost [127.0.0.1])
    by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7EKiYq02770
    for ; Tue, 14 Aug 2001 16:44:34 -0400
Date: Tue, 14 Aug 2001 16:44:34 -0400
Message-Id: <200108142044.f7EKiYq02770@pw600a.bioperl.org>
From: joungjh@AptusGenomics.com
To: biopython-bugs@bioperl.org
Subject: retrieving GenBank records from NCBI

Full_Name: J. Joung
Module: GenBank
Version: biopython-1.00a2
OS: UNIX
Submission from: gw-aptusgen1.cust.fast.net (209.92.248.166)

I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
record is missing the following information: LOCUS, DEFINITION, ACCESSION,
VERSION, and KEYWORDS.

Is there a way of obtaining the GenBank id from a known locuslink id in
biopython?

From biopython-bugs at bioperl.org Thu Sep 27 16:12:19 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/41
Message-ID: <200109272012.f8RKCIp24468@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#41 from incoming to trash

Message summary for PR#41
From: Jeffrey Chang
Subject: Re: [Biopython-dev] Notification: incoming/40
Date: Tue, 14 Aug 2001 22:46:45 -0700
0 replies 0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From jchang@SMI.Stanford.EDU Wed Aug 15 01:45:11 2001
Received: from crg-gw.Stanford.EDU (root@crg-gw.Stanford.EDU [171.65.32.201])
    by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7F5jAq05966
    for ; Wed, 15 Aug 2001 01:45:11 -0400
Received: from [192.168.0.4] (c1128134-a.stcla1.sfba.home.com [24.176.209.55])
    by crg-gw.Stanford.EDU (8.11.5/8.11.5) with ESMTP id f7F5jDU24945;
    Tue, 14 Aug 2001 22:45:13 -0700 (PDT)
Mime-Version: 1.0
X-Sender: jchang@smi.stanford.edu (Unverified)
Message-Id:
In-Reply-To: <200108142044.f7EKiZq02776@pw600a.bioperl.org>
References: <200108142044.f7EKiZq02776@pw600a.bioperl.org>
Date: Tue, 14 Aug 2001 22:46:45 -0700
To: biopython-bugs@bioperl.org, biopython-dev@biopython.org, joungjh@aptusgenomics.com
From: Jeffrey Chang
Subject: Re: [Biopython-dev] Notification: incoming/40
Content-Type: text/plain; charset="us-ascii" ; format="flowed"

At 4:44 PM -0400 8/14/01, biopython-bugs@bioperl.org wrote:
>Full_Name: J. Joung
>I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
>record is missing the following information: LOCUS, DEFINITION, ACCESSION,
>VERSION, and KEYWORDS.

Is this information that's in the Genbank record? It should be
returning whatever NCBI returns, or raising an exception. Dropping
information would be odd. Do you have a reproducible? What is the
accession you're using?

>Is there a way of obtaining the GenBank id from a known locuslink id in
>biopython?

No, we don't have any locuslink functionality at the moment.

Jeff

From biopython-bugs at bioperl.org Thu Sep 27 16:12:19 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/42
Message-ID: <200109272012.f8RKCJp24473@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#42 from incoming to trash

Message summary for PR#42
From: joungjh@email.com
Subject: Re: [Biopython-dev] Notification: incoming/40
Date: Wed, 15 Aug 2001 08:22:26 -0400
0 replies 0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From joungjh@email.com Wed Aug 15 08:22:26 2001
Received: from localhost (localhost [127.0.0.1])
    by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7FCMPq08874
    for ; Wed, 15 Aug 2001 08:22:26 -0400
Date: Wed, 15 Aug 2001 08:22:26 -0400
Message-Id: <200108151222.f7FCMPq08874@pw600a.bioperl.org>
From: joungjh@email.com
To: biopython-bugs@bioperl.org
Subject: Re: [Biopython-dev] Notification: incoming/40

Full_Name:
Module:
Version:
OS:
Submission from: gw-aptusgen1.cust.fast.net (209.92.248.166)

>>I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
>>record is missing the following information: LOCUS, DEFINITION, ACCESSION,
>>VERSION, and KEYWORDS.

>Is this information that's in the Genbank record? It should be
>returning whatever NCBI returns, or raising an exception. Dropping
>information would be odd. Do you have a reproducible? What is the
>accession you're using?

Yes, LOCUS, DEFINITION, ACCESSION, VERSION, and KEYWORDS information is in
the GenBank record. Any GenBank id would drop this information on UNIX. You
can try GenBank id of '15145772'. I have installed the biopython-1.00a1
windows version on my pc and this seems to return all information correctly.

Thank you for your quick response.

From thomas at cbs.dtu.dk Fri Sep 28 00:51:45 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] SeqIO
In-Reply-To: Brad Chapman's message of "Wed, 26 Sep 2001 22:47:54 -0400"
References: <20010926224754.E27721@ci350185-a.athen1.ga.home.com>
Message-ID:

Brad Chapman writes:

thx for the comments !

> One big question I have is, how many of the features do you want to
> try and retain in the conversion? So, for GenBank format, do you
> want me to just write out the basic information (sequence, type,
> etc) and ignore the feature table, or do we want to somehow map the
> features from format to format (ie. EMBL <-> GenBank).

All of them. I think each GenBank feature has an exact equivalent in EMBL
and SwissProt (GenPept). So that leaves us just with the definition of the
corresponding feature names.

>
> If we want to think about feature conversion, this'll be tougher and
> we'll need to think about converters between "similar" formats like
> EMBL and GenBank.

GenBank, EMBL and SwissProt ... where EMBL and SwissProt are almost
identical (I think...)

> => Why are you duplicating SeqRecord in the SeqIO stuff instead of
> just reusing it? I don't think I understand what you are talking
> about with stripping newlines...

I copied everything so that I could play around without breaking e.g. your
code. Now I think the changes are actually backward compatible - so we
could move it back.

>
> => Is there a way to plug in a specialized converter for similar
> formats, like I was talking about above with EMBL/GenBank? I think
> Jeff suggested this earlier, and it seems like a good idea to me. I
> guess right now you could subclass ReadSeq and define your own
> Convert function, but maybe there is another way to do it.

I don't know if I understood this question...

A colleague and I are thinking about converting SWISSPROT into a SQL
database for local use ...
which actually gets close to a former discussion where Andrew and I dreamed
about a python variant of SRS !

My question: does anybody know about already existing SQL tables for
SWISSPROT ?
The step after that is actually creating a python interface for generic
queries, which would beat SRS ... at least on SWISSPROT.

cheers
-thomas

P.S. is anybody going to the Atlanta meeting in November ?

--
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS: +45 45 252489             Building 208, DK-2800 Lyngby
Fax  +45 45 931585             http://www.cbs.dtu.dk/thomas

De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu Thu Sep 27 17:02:33 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To:
References:
Message-ID: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>

Hi Jeong;
I'm ccing this message to biopython-dev@biopython.org. By the way,
asking your questions there is probably a better place than asking
me directly, as there are lots of people there to help.

> Hello, I would like to know if the blast standalone parser supports parsing
> of the BLASTX results. When I use blast standalone parser to parse BLASTX
> results, I get an error message of the following:
[...]
> SyntaxError: Line does not start with 'length of query':
> length of database: 27,975,647

Yup, it looks like the blastx output format has changed somewhat
since the last time it was used/tested with blastx. The specific
things that have changed are the lack of the following lines in
blastx output:

'length of query'
'effective length of query'
'effective search space:'
'S2'

I've fixed Bio/Blast/NCBIStandalone.py so that it works again on
blastx. The diff to this file is attached. Jeff, if you have a
chance could you give me the okay on this before I check it in? The
current regression tests all pass with these changes.
When I check it in, I can also add the blastx example file I used to fix
this.

Jeong, thanks for the bug report! Please let us know if this fix
doesn't get things working again for you.

Brad
--
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** NCBIStandalone.py.orig Wed Sep 5 17:22:14 2001
--- NCBIStandalone.py Thu Sep 27 17:01:34 2001
***************
*** 462,481 ****
                        start="Number of HSP's that")
          read_and_call(uhandle, consumer.hsps_gapped,
                        start="Number of HSP's gapped")
!
!         read_and_call(uhandle, consumer.query_length,
!                       start='length of query')
          read_and_call(uhandle, consumer.database_length,
                        start='length of database')
          read_and_call(uhandle, consumer.effective_hsp_length,
                        start='effective HSP')
!         read_and_call(uhandle, consumer.effective_query_length,
!                       start='effective length of query')
          read_and_call(uhandle, consumer.effective_database_length,
                        start='effective length of database')
!         read_and_call(uhandle, consumer.effective_search_space,
!                       start='effective search space')
          # Does not appear in BLASTP 2.0.5
          attempt_read_and_call(uhandle, consumer.effective_search_space_used,
                                start='effective search space used')
--- 462,484 ----
                        start="Number of HSP's that")
          read_and_call(uhandle, consumer.hsps_gapped,
                        start="Number of HSP's gapped")
!         # not in blastx 2.2.1
!         attempt_read_and_call(uhandle, consumer.query_length,
!                               start='length of query')
          read_and_call(uhandle, consumer.database_length,
                        start='length of database')
          read_and_call(uhandle, consumer.effective_hsp_length,
                        start='effective HSP')
!         # Not in blastx 2.2.1
!         attempt_read_and_call(uhandle, consumer.effective_query_length,
!                               start='effective length of query')
          read_and_call(uhandle, consumer.effective_database_length,
                        start='effective length of database')
!         # Not in blastx 2.2.1, added a ':' to distinguish between
!         # this and the 'effective search space used' line
!         attempt_read_and_call(uhandle, consumer.effective_search_space,
!                               start='effective search space:')
          # Does not appear in BLASTP 2.0.5
          attempt_read_and_call(uhandle, consumer.effective_search_space_used,
                                start='effective search space used')
***************
*** 490,496 ****
          attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
                                start='X3')
          read_and_call(uhandle, consumer.gap_trigger, start='S1')
!         read_and_call(uhandle, consumer.blast_cutoff, start='S2')
          consumer.end_parameters()
--- 493,507 ----
          attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
                                start='X3')
          read_and_call(uhandle, consumer.gap_trigger, start='S1')
!         # not in blastx 2.2.1
!         # need to enclose this inside a try/except because
!         # attempt_read_and_call will still complain about end of stream.
!         # All attempts are made to be sure we've got the expected error
!         try:
!             read_and_call(uhandle, consumer.blast_cutoff, start='S2')
!         except SyntaxError, reason:
!             assert str(reason) == "Unexpected end of stream.", \
!                    "Unexpected reason: '%s'" % reason
          consumer.end_parameters()

From jchang at SMI.Stanford.EDU Thu Sep 27 19:16:15 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>
References: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>
Message-ID:

Great! Thanks a lot. The patch looks really good. The only
thing is, can you change the try: except: to an explicit test for the
end of the stream? That would be more robust to changes in the error
message.

try:
    read_and_call(uhandle, consumer.blast_cutoff, start='S2')
except SyntaxError, reason:
    assert str(reason) == "Unexpected end of stream.", \
           "Unexpected reason: '%s'" % reason

(untested)
if uhandle.peekline():
    attempt_read_and_call(uhandle, consumer.blast_cutoff, start='S2')

Jeff

At 5:02 PM -0400 9/27/01, Brad Chapman wrote:
>Hi Jeong;
>I'm ccing this message to biopython-dev@biopython.org. By the way,
>asking your questions there is probably a better place than asking
>me directly, as there are lots of people there to help.
>
>> Hello, I would like to know if the blast standalone parser supports parsing
>> of the BLASTX results. When I use blast standalone parser to parse BLASTX
>> results, I get an error message of the following:
>[...]
>> SyntaxError: Line does not start with 'length of query':
>> length of database: 27,975,647
>
>Yup, it looks like the blastx output format has changed somewhat
>since the last time it was used/tested with blastx. The specific
>things that have changed are the lack of the following lines in
>blastx output:
>
>'length of query'
>'effective length of query'
>'effective search space:'
>'S2'
>
>I've fixed Bio/Blast/NCBIStandalone.py so that it works again on
>blastx. The diff to this file is attached. Jeff, if you have a
>chance could you give me the okay on this before I check it in? The
>current regression tests all pass with these changes. When I check
>it in, I can also add the blastx example file I used to fix this.
>
>Jeong, thanks for the bug report! Please let us know if this fix
>doesn't get things working again for you.
>
>Brad
>--
>PGP public key available from http://pgp.mit.edu/
>
>Attachment converted: Macintosh HD:NCBIStandalone.py.diff
>(TEXT/text) (0015B4C0)

From chapmanb at arches.uga.edu Thu Sep 27 19:59:16 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To:
References: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>
Message-ID: <20010927195916.A29452@ci350185-a.athen1.ga.home.com>

[problems with new blastx output and proposed patch]

> Great! Thanks a lot. The patch looks really good. The only
> thing is, can you change the try: except: to an explicit test for the
> end of the stream?

Great idea -- thanks! This is much nicer than the uglish try/except I
was using.
I've checked in the patch with your suggested change, as well as the test
blastx file and associated changes to the test suite. Everything seems to
pass a-okay.

Brad
--
PGP public key available from http://pgp.mit.edu/

From chapmanb at arches.uga.edu Fri Sep 28 08:17:02 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] GenBank parser fails (on large files?)
In-Reply-To: <200109281141.f8SBfDM188776@electre.pasteur.fr>
References: <200109281141.f8SBfDM188776@electre.pasteur.fr>
Message-ID: <20010928081702.A973@ci350185-a.athen1.ga.home.com>

Hi Michel;

> Thanks, the fix worked.

Great to hear. Thanks for reporting back.

> However your solution to make parsing of large sequences
> faster has currently a side effect. If I print the first
> feature with qualifier 'translation', I get
>
> ['MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012
> CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR']
>
> before, when I would have gotten a slightly different result:
>
> "MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012
> CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"

This is actually not a side-effect of the recent changes, but a
deliberate change I made in CVS. I wrote a long message about this
last week concerning non-compatible fixes I made to the GenBank
SeqFeature parser:

http://www.biopython.org/pipermail/biopython-dev/2001-September/000579.html

The part related to your problem involves how I was handling features
that had multiple qualifier keys with the same name (ie. two
'translation' keys). Previously, I was doing something really ugly --
appending numbers on to the end of multiple keys to make them unique
(translation, translation1, translation2 ...). This allowed me to have
one key and one string value and store things in a dictionary.
But, this is an ugly way to do things and actually makes life very hard
for people who wanted to get, say, all translation qualifiers in a
feature (if there were multiple translations).

The fix was to use the qualifier key and store the values as a list, ie:

qualifiers = {"translation" : ["CREL", "CRET"]}

When there is only one qualifier name, I also store this as a list to
help people avoid having to do:

if type(qualifier[key]) == type(""):
    # do something with the string
elif type(qualifier[key]) == type([]):
    # do something with the list

in their code.

I am definitely sensitive to the fact that the change is bad news for
current code -- I'm sorry about that; it's all due to that bad design
decision I made earlier.

> Now the problem is, I had a hack to shape this string better, namely
>
> >>> newseq= string.join(string.split(sq.qualifiers['translation']), sep='')
>
> This works with the " " form, but not with the [' '] form, which is how I
> noticed the difference.

Yes, sorry about that. A potential change (untested) would be:

clean_translations = []
for translation in sq.qualifiers['translation']:
    clean_translations.append(string.join(string.split(translation),
                                          sep = ''))
sq.qualifiers['translation'] = clean_translations

But on to the other problem:

[Talking about the translation]
> Note, incidentally, that this is a bit ugly, because the \012's and spaces
> should have been cleaned out

I agree with you here -- I haven't yet done any work at massaging the
feature value information. I'll think about a good way to do this (I'm
sure there are other cases where this also needs to be done), and try
to get something done on it this weekend.

Thanks again for the feedback.

Brad
--
PGP public key available from http://pgp.mit.edu/

From j.joung at AptusGenomics.com Fri Sep 28 09:36:02 2001
From: j.joung at AptusGenomics.com (Jeong Joung)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To:
Message-ID:

Thank you so much for your responses.
The changes work really well.

Jeong

-----Original Message-----
From: Jeffrey Chang [mailto:jchang@SMI.Stanford.EDU]
Sent: Thursday, September 27, 2001 7:16 PM
To: Brad Chapman; Jeong Joung
Cc: biopython-dev@biopython.org
Subject: [Biopython-dev] Re: Blast parser

Great! Thanks a lot. The patch looks really good. The only
thing is, can you change the try: except: to an explicit test for the
end of the stream? That would be more robust to changes in the error
message.

try:
    read_and_call(uhandle, consumer.blast_cutoff, start='S2')
except SyntaxError, reason:
    assert str(reason) == "Unexpected end of stream.", \
           "Unexpected reason: '%s'" % reason

(untested)
if uhandle.peekline():
    attempt_read_and_call(uhandle, consumer.blast_cutoff, start='S2')

Jeff

At 5:02 PM -0400 9/27/01, Brad Chapman wrote:
>Hi Jeong;
>I'm ccing this message to biopython-dev@biopython.org. By the way,
>asking your questions there is probably a better place than asking
>me directly, as there are lots of people there to help.
>
>> Hello, I would like to know if the blast standalone parser supports parsing
>> of the BLASTX results. When I use blast standalone parser to parse BLASTX
>> results, I get an error message of the following:
>[...]
>> SyntaxError: Line does not start with 'length of query':
>> length of database: 27,975,647
>
>Yup, it looks like the blastx output format has changed somewhat
>since the last time it was used/tested with blastx. The specific
>things that have changed are the lack of the following lines in
>blastx output:
>
>'length of query'
>'effective length of query'
>'effective search space:'
>'S2'
>
>I've fixed Bio/Blast/NCBIStandalone.py so that it works again on
>blastx. The diff to this file is attached. Jeff, if you have a
>chance could you give me the okay on this before I check it in? The
>current regression tests all pass with these changes. When I check
>it in, I can also add the blastx example file I used to fix this.
>
>Jeong, thanks for the bug report! Please let us know if this fix
>doesn't get things working again for you.
>
>Brad
>--
>PGP public key available from http://pgp.mit.edu/
>
>Attachment converted: Macintosh HD:NCBIStandalone.py.diff
>(TEXT/text) (0015B4C0)

From katel at worldpath.net Sun Sep 30 20:41:35 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:43:05 2005
Subject: [Biopython-dev] what I'm up to
Message-ID: <002201c14a11$d4425160$010a0a0a@cadence.com>

I'm still looking into nexus, but I'm not sure where the NCL library
fits in. Do we use it to read in nexus files? How much continuing
support and development can we expect? What will the library buy us?
Also, nexus is more than a sequence format. It supports phylogenetic
and other types of data. Mostly, we support sequence data although
we're doing a little with pathways.

I'm in the process of moving from our summer home to our winter home and
work has turned into an alligator swamp since I got back from vacation.
For this reason I plan to continue investigating nexus but at a low
level. I'd like to write a parser for MASE instead ( IntelliGenetic
format ) because it is an almost-FASTA format and should be doable
without a large time commitment that I can't promise for a month or so.

Let me know if someone has already written a MASE parser or has ideas
about nexus or NCL.

Cayte
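
To illustrate why an "almost-FASTA" format is a small job, here is a
rough sketch of a reader in present-day Python. It assumes the layout
commonly described for MASE files: optional ";;" header lines, then for
each sequence one or more ";" comment lines, a name line, and the
sequence lines. That layout, the parse_mase name and the sample text
are assumptions made for the example; nothing here is existing
Biopython code.

```python
def parse_mase(text):
    """Return (name, comment, sequence) tuples from MASE-like text.

    Assumes every record starts with at least one ';' comment line,
    followed by a name line and then the sequence lines.
    """
    records = []
    comments, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue  # blank line or file-level header
        if line.startswith(";"):
            if name is not None:
                # A comment after sequence data starts the next record.
                records.append((name, " ".join(comments), "".join(chunks)))
                comments, name, chunks = [], None, []
            comments.append(line[1:].strip())
        elif name is None:
            name = line  # first non-comment line is the name
        else:
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, " ".join(comments), "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up sample alignment with two short records.
    sample = (
        ";; example alignment\n"
        ";first sequence\n"
        "seq1\n"
        "ACGT-ACGT\n"
        "ACGT\n"
        ";second sequence\n"
        "seq2\n"
        "ACGTTACGT\n"
        "ACGA\n"
    )
    for record in parse_mase(sample):
        print(record)
```

A record without its own comment line would be read as more sequence
data for the previous record, so a real parser would need a firmer rule
for where one record ends.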