From bugzilla-daemon at portal.open-bio.org Sun Oct 1 02:10:37 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 1 Oct 2006 02:10:37 -0400 Subject: [Biopython-dev] [Bug 1939] Doc/Makefile does not build pdf, html, txt files completely correctly In-Reply-To: Message-ID: <200610010610.k916Ab3S003487@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1939 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #19 from mdehoon at ims.u-tokyo.ac.jp 2006-10-01 02:10 ------- I've taken bits and pieces of the patch to get the recursive behavior for make. Getting html output from biopdb_faq is not essential, and does not warrant adding a hack to Biopython. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chris.lasher at gmail.com Mon Oct 9 22:55:17 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Mon, 9 Oct 2006 22:55:17 -0400 Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster Message-ID: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> I just checked out the latest CVS and setup.py failed on installation during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't be found. Any suggestions? Was this file supposed to be in the CVS? The checkout notes indicate that it has been replaced with something else. 
Thanks, Chris From mdehoon at c2b2.columbia.edu Tue Oct 10 00:10:22 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Tue, 10 Oct 2006 00:10:22 -0400 Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster In-Reply-To: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> References: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> Message-ID: <452B1D2E.4060703@c2b2.columbia.edu> There was some confusion about the license status of ranlib, so I removed it from Bio.Cluster and replaced it with a new random number generator written from scratch. Apparently I forgot to update setup.py in CVS accordingly. I have done that now, so if you get the new setup.py from CVS the compilation should work. You could also edit your local copy of setup.py and remove ranlib.c, linpack.c, and com.c. Sorry for the confusion. --Michiel. Chris Lasher wrote: > I just checked out the latest CVS and setup.py failed on installation > during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't > be found. Any suggestions? Was this file supposed to be in the CVS? > The checkout notes indicate that it has been replaced with something > else. > > Thanks, > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From chris.lasher at gmail.com Tue Oct 10 00:46:29 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 10 Oct 2006 00:46:29 -0400 Subject: [Biopython-dev] Subversion Repository Message-ID: <128a885f0610092146y5a184ccfw31d433d228a9b05d@mail.gmail.com> Anybody know if BioPython (I suppose all Open Bio projects) will switch over to Subversion, and if so, when? I think the merits and advantages of Subversion over CVS speak for themselves. It's certainly become my revision control system of preference. Anybody else's? 
Curious, Chris From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:14:23 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Oct 2006 00:14:23 -0400 Subject: [Biopython-dev] [Bug 2014] Bio/Blast/NCBIStandalone.py parsing of psiblast fails In-Reply-To: Message-ID: <200610190414.k9J4ENQm025952@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2014 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:14 ------- Fixed in CVS, thanks. Note though that the parser for plain-text blast output is very difficult to maintain, because the output format keeps changing with different versions of blast. I'd encourage you to use the XML parser instead, as it is much more stable. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:21:40 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Oct 2006 00:21:40 -0400 Subject: [Biopython-dev] [Bug 2032] query_to and sbjct_to added in parsed NCBI-Blast XML In-Reply-To: Message-ID: <200610190421.k9J4LeGH026582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2032 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:21 ------- Fixed in CVS following same bug report on the mailing list. 
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:44:57 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:44:57 -0400
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200610190444.k9J4ivGv028629@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2051

mdehoon at ims.u-tokyo.ac.jp changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |mdehoon at ims.u-tokyo.ac.jp

------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:44 -------
A new blast version (2.2.15) came out recently, so I tried the XML parser with its output for a multiple query. I didn't notice a problem except that all alignments are put into one list, which is annoying because then we have to find out which alignment corresponds to which query.

So, which specific problem with the XML parser are you trying to solve? And do these problems still occur with blast 2.2.15? (as far as I can tell, its XML output is the same as for blast 2.2.14, so it's probably here to stay).

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
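[Editor's sketch] The comment above notes that alignments from a multiple-query run end up in one list, so the caller has to work out which alignment belongs to which query. A minimal illustration of grouping hits per query follows. It assumes the XML carries one Iteration element per query with an Iteration_query-def child, as NCBI's BlastOutput DTD allows; the thread does not confirm that blast 2.2.15 output looks exactly like this. The sample document is invented, hits_by_query is a hypothetical helper, and the code uses only the standard library rather than the Biopython parser discussed in the bug.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed sample; real BLAST XML has many more fields.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-def>query_one</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject_A</Hit_def></Hit>
        <Hit><Hit_def>subject_B</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-def>query_two</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject_C</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hits_by_query(xml_text):
    """Return {query description: [hit descriptions]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    grouped = {}
    # Assumption: each Iteration element corresponds to exactly one query.
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        hits = [hit.findtext("Hit_def", default="")
                for hit in iteration.iter("Hit")]
        grouped.setdefault(query, []).extend(hits)
    return grouped


grouped = hits_by_query(SAMPLE_XML)
for query, hits in grouped.items():
    print(query, hits)
```

Keying on the query description keeps the example short; if two queries could share a description, the iteration number would be a safer key.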
From chris.lasher at gmail.com Tue Oct 24 22:22:06 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 24 Oct 2006 22:22:06 -0400 Subject: [Biopython-dev] [BioPython] Martel-based parsing of Unigene flat files In-Reply-To: <453975D5.4070701@mail.nih.gov> References: <453975D5.4070701@mail.nih.gov> Message-ID: <128a885f0610241922h5db02fbfod1a83cfeade29801@mail.gmail.com> Hi Sean, FWIW this should probably have been posted to BioPython-dev, but I don't think that would improve your chances of getting a response. I am cross-posting it there, anyways. Unfortunately for you, I do not have an answer for you. :-( I, myself, would be interested in a response to this question from the Devs, as I would like to write a parser for PTT files. Last I saw there was a lot of chatter about the Martel parsers being incredibly slow compared to straightforward solutions. It seems that standard format parsers would be one of the easiest ways for BioPython newbies to contribute to developing the BioPython project, however, there isn't very much in the way of documentation on the BioPython way to do so, let alone developer documentation at all. I would like to know what can be done to get some dev docs going on the wiki. Chris On 10/20/06, Sean Davis wrote: > I am relatively new to python and biopython (coming from perl side of > things). I would like to make a parser for Unigene flat file format. > However, after digging through the LocusLink parsing code (as probably > the most similar format, etc.), I'm still at a loss for how Martel-based > parsing works. I understand the big picture (converting an re-based > parsing of a file into events), but it is the detail that I am missing. > I know about pydoc, but the pydoc for much of Martel is not very helpful > to me, at least not in my current state of knowledge. Any suggestions > on how to get started? 
>
> Thanks,
> Sean
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From sdavis2 at mail.nih.gov Thu Oct 26 08:09:43 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:09:43 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
Message-ID: <200610260809.43245.sdavis2@mail.nih.gov>

Let me start off by saying that I am a python newbie after working in perl for the last few years. I am working on a Unigene flat file parser. In my scanner, I have a construct that looks like:

for line in handle:
    tag = line.split(' ')[0]
    line = line.rstrip()
    if tag=='ID':
        consumer.ID(line)
    if tag=='GENE':
        consumer.GENE(line)
    if tag=='TITLE':
        consumer.TITLE(line)
    if tag=='EXPRESS':
        consumer.EXPRESS(line)
    ....

Since I am setting things up so that there is a 1:1 correspondence between the "tag" and the consumer method, is there an easy way to reduce this long set of IF statements to a simple mapping procedure that maps a tag to the correct method?

Sorry for the naive question....

Thanks,
Sean

From sdavis2 at mail.nih.gov Thu Oct 26 08:30:08 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:30:08 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
In-Reply-To: <200610260809.43245.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
Message-ID: <200610260830.08326.sdavis2@mail.nih.gov>

On Thursday 26 October 2006 08:09, Sean Davis wrote:
> Let me start off by saying that I am a python newbie after working in perl
> for the last few years. I am working on a Unigene flat file parser.
> In my scanner, I have a construct that looks like:
>
> for line in handle:
>     tag = line.split(' ')[0]
>     line = line.rstrip()
>     if tag=='ID':
>         consumer.ID(line)
>     if tag=='GENE':
>         consumer.GENE(line)
>     if tag=='TITLE':
>         consumer.TITLE(line)
>     if tag=='EXPRESS':
>         consumer.EXPRESS(line)
>     ....
>
> Since I am setting things up so that there is a 1:1 correspondence between
> the "tag" and the consumer method, is there an easy way to reduce this long
> set of IF statements to a simple mapping procedure that maps a tag to the
> correct method?
>
> Sorry for the naive question....

Even more apologies. I answered my own question. Something like this seems to work:

exec('consumer.'+tag+'(line)')

which replaces all the IF statements quite nicely.

Sean

From james.balhoff at duke.edu Thu Oct 26 09:46:34 2006
From: james.balhoff at duke.edu (Jim Balhoff)
Date: Thu, 26 Oct 2006 09:46:34 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
In-Reply-To: <200610260830.08326.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov> <200610260830.08326.sdavis2@mail.nih.gov>
Message-ID:

Hi Sean,

On Oct 26, 2006, at 8:30 AM, Sean Davis wrote:
> On Thursday 26 October 2006 08:09, Sean Davis wrote:
>> Let me start off by saying that I am a python newbie after working
>> in perl
>> for the last few years. I am working on a Unigene flat file
>> parser. In my
>> scanner, I have a construct that looks like:
>>
>> for line in handle:
>>     tag = line.split(' ')[0]
>>     line = line.rstrip()
>>     if tag=='ID':
>>         consumer.ID(line)
>>     if tag=='GENE':
>>         consumer.GENE(line)
>>     if tag=='TITLE':
>>         consumer.TITLE(line)
>>     if tag=='EXPRESS':
>>         consumer.EXPRESS(line)
>>     ....
>>
>> Since I am setting things up so that there is a 1:1 correspondence
>> between
>> the "tag" and the consumer method, is there an easy way to reduce
>> this long
>> set of IF statements to a simple mapping procedure that maps a tag
>> to the
>> correct method?
>>
>> Sorry for the naive question....
>
> Even more apologies. I answered my own question. Something like
> this seems
> to work:
>
> exec('consumer.'+tag+'(line)')
>
> which replaces all the IF statements quite nicely.

Alternatively, you may want to look at getattr(). There is a good description here:

Jim

From sdavis2 at mail.nih.gov Thu Oct 26 10:56:25 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 10:56:25 -0400
Subject: [Biopython-dev] Unigene flat file parser
Message-ID: <200610261056.25883.sdavis2@mail.nih.gov>

I have put together a parser for the Unigene flat file format described here:

ftp://ftp.ncbi.nih.gov/repository/UniGene/README

under the Hs.data section. The actual .data files are included in the various organism-specific directories. Is there any interest in including this in biopython? If so, I would appreciate some input on the code and details of contributions, etc. The current code is available here:

http://watson.nci.nih.gov/pressa/~sdavis/Unigene.py

Use like so and note that the ugrecord has much more information (in fact, all information is captured) in it than is given in its __repr__.

#!/usr/bin/python
import Unigene
fh = file('Hs.data') #downloaded previously from ftp, or whatever
ugparser = Unigene.Iterator(fh,Unigene.RecordParser())
for ugrecord in ugparser:
    print ugrecord

From mdehoon at c2b2.columbia.edu Thu Oct 26 14:01:24 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 26 Oct 2006 14:01:24 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <200610261056.25883.sdavis2@mail.nih.gov>
References: <200610261056.25883.sdavis2@mail.nih.gov>
Message-ID: <4540F7F4.2050003@c2b2.columbia.edu>

Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:

Perhaps a silly question from a non-Unigene user, but what is the relation between your parser and the one in Bio/UniGene/__init__.py?
The latter seems to parse HTML files (see the example in Tests/test_unigene.py) instead of flat files. Is your parser intended as a replacement for Bio/UniGene/__init__.py? --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From sdavis2 at mail.nih.gov Thu Oct 26 16:15:52 2006 From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E]) Date: Thu, 26 Oct 2006 16:15:52 -0400 Subject: [Biopython-dev] Unigene flat file parser References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> Message-ID: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> Michiel, It looks to me like it parses an HTML file downloaded from the NCBI website containing a single unigene record of interest--potentially useful if one knows what one needs. I, on the other hand, have always just used the flat files as the source for unigene, as I typically want ALL the data for one or several species available. A single flat file is available for each organism and contains ALL the unigene entries and their associated information for that organism. By concatenating several files (they are simple text files), one can parse the entire unigene database. So, in short, I don't see this unigene parser as a replacement for the current module. They fill different needs; this one fills a need that I have and is useful for whole-genome, multiple species work, or microarray analyses and whether and where it fits into biopython is really up to the community. Just a quick comment on speed for the parser--it parses Hs.data (the largest flat file in unigene, 84,000 entries, with just under 7,000,000 sequence entries, 150 Mb file size) in just under 5 minutes on my Xeon desktop. 
Sean

-----Original Message-----
From: Michiel Jan Laurens de Hoon [mailto:mdehoon at c2b2.columbia.edu]
Sent: Thu 10/26/2006 2:01 PM
To: Davis, Sean (NIH/NCI) [E]
Cc: biopython-dev at lists.open-bio.org
Subject: Re: [Biopython-dev] Unigene flat file parser

Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:

Perhaps a silly question from a non-Unigene user, but what is the relation between your parser and the one in Bio/UniGene/__init__.py? The latter seems to parse HTML files (see the example in Tests/test_unigene.py) instead of flat files. Is your parser intended as a replacement for Bio/UniGene/__init__.py?

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Fri Oct 27 15:08:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 27 Oct 2006 20:08:21 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
Message-ID: <45425925.8090607@maubp.freeserve.co.uk>

Hello list,

I've checked in a somewhat cleaned up (and more tested) version of the earlier attachments to bug 2059.

And I've updated the wiki page:
http://biopython.org/wiki/SeqIO

Has anyone got any tips on formatting python code on Wiki? Maybe I should just write the docs in LaTeX like the cook book etc.

Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord objects, it would be a good idea to make them slightly more user-friendly:

http://bugzilla.open-bio.org/show_bug.cgi?id=2057

(I would like to check this in before writing too much of the SeqIO documentation)

If any of you want to check this out and have a look, I'd be pleased to get some feedback. There should be no impact on the rest of BioPython, or existing scripts.
Peter

-----------------------------------------------------------------

Link to view CVS,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython

Old files, not touched:
Bio/SeqIO/FASTA.py
Bio/SeqIO/generic.py

Bio/SeqIO/__init__.py (replaces almost empty old file)
======================
* the helper functions (i.e. the functions I expect people to use)
* mappings from file types to parsers and writers
* mappings from file extensions to file types
* large self test suite (which does not need any input files, but will create a temp file in the current directory)

Bio/SeqIO/Interfaces.py
=======================
Base classes for readers/writers

Bio/SeqIO/FastaIO.py
====================
Uses a generator function for the reader.
Uses a sub-class of SequentialSequenceWriter for the writer.

Bio/SeqIO/ClustalIO.py
======================
Uses a generator function for the reader, based on the old class in Bio/SeqIO/generic.py

Bio/SeqIO/PhylipIO.py
=====================
Reads and writes phylip files with names strictly truncated at 10 characters.
Uses a generator function for the reader, subclasses SequenceWriter

Bio/SeqIO/StockholmIO.py
========================
Uses subclasses from Interfaces.py

Unlike prior code attached to bug 2059, this code contains just one writer and parser, which expects the Stockholm file to follow the PFAM conventions. It should read other files fine - but what happens to the annotation is less well defined. This is what BioPerl does
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c10

Bio/SeqIO/GenBankIO.py
======================
Uses a generator function for the reader, which just calls Bio.GenBank to do the work. See also bug 2059 comment 11 on my thoughts about how to include EMBL support:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c11

Bio/SeqIO/NexusIO.py
====================
Uses a generator function for the reader, which just calls Bio.Nexus to do the parsing and then extracts the sequences.
Has not been tested much. Peter From mdehoon at c2b2.columbia.edu Sat Oct 28 01:40:02 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 28 Oct 2006 01:40:02 -0400 Subject: [Biopython-dev] Unigene flat file parser In-Reply-To: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> Message-ID: <4542ED32.8060702@c2b2.columbia.edu> OK, that's fine then. Is anybody actually using the current Bio/UniGene stuff? I couldn't find documentation for it and it hasn't been updated in more than two years, so it may be some dead code sitting around. If so, we can remove this code; Bio/UniGene would be a nice place to put Sean's code (even though it is doing something different from the current Bio/UniGene). --Michiel. Davis, Sean (NIH/NCI) [E] wrote: > So, in short, I don't see this unigene parser as a replacement for > the current module. They fill different needs; this one fills a need > that I have and is useful for whole-genome, multiple species work, or > microarray analyses and whether and where it fits into biopython is > really up to the community. > > Michiel wrote: >> Perhaps a silly question from a non-Unigene user, but what is the >> relation between your parser and the one in >> Bio/UniGene/__init__.py? The latter seems to parse HTML files (see >> the example in Tests/test_unigene.py) instead of flat files. Is >> your parser intended as a replacement for Bio/UniGene/__init__.py? From mdehoon at c2b2.columbia.edu Sat Oct 28 01:56:51 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 28 Oct 2006 01:56:51 -0400 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> Message-ID: <4542F123.9050106@c2b2.columbia.edu> Thanks, Peter! It looks very nice. 
Actually, I have been using an earlier version of the new SeqIO module (from your code on Bugzilla) and found it to work quite well.

A few short comments:

To parse a Fasta file using the new SeqIO looks like this:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.fasta") :
    print record.id
    print record.seq

I would rather have something like this:

from Bio.SeqIO import Fasta
for record in Fasta.parse(open("example.fasta")):
    print record.id
    print record.seq

where Fasta.parse returns a FastaIterator object, and the argument is either a file object or a file name. You can in addition have a function Bio.SeqIO.parse that guesses the file type from the file name extension (as you have now for File2SequenceIterator), though that wouldn't work for file handles.

On a related note, I don't think we need the SequenceList and SequenceDict class. To make a list, one can do

from Bio.SeqIO import Fasta
records = [record for record in Fasta.parse(open("example.fasta"))]

To convert an iterator to a dictionary takes one line more, and is probably more straightforward than SequenceDict.

--Michiel.

Peter (BioPython Dev) wrote:
> Hello list,
>
> I've checked in a somewhat cleaned up (and more tested) version of the
> earlier attachments to bug 2059.
>
> And I've updated the wiki page:
> http://biopython.org/wiki/SeqIO
>
> Has anyone got any tips on formatting python code on Wiki? Maybe I
> should just write the docs in LaTeX like the cook book etc.
>
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)
>
> If any of you want to check this out and have a look, I'd be pleased to
> get some feedback.
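[Editor's sketch] The message above argues that a record iterator turns into a list in one line and into a dictionary in barely more, so dedicated SequenceList and SequenceDict classes may not be needed. A minimal, self-contained illustration follows. Record and iter_fasta are hypothetical stand-ins written for this sketch, not the Bio.SeqIO classes or functions under discussion, and the FASTA text is made up.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord; not the Biopython class.
Record = namedtuple("Record", ["id", "seq"])


def iter_fasta(handle):
    """Yield one Record per FASTA entry read from an open handle."""
    rec_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if rec_id is not None:
                yield Record(rec_id, "".join(chunks))
            # The identifier is taken to be the first word after ">".
            fields = line[1:].split(None, 1)
            rec_id, chunks = (fields[0] if fields else ""), []
        elif rec_id is not None:
            chunks.append(line)
    if rec_id is not None:
        yield Record(rec_id, "".join(chunks))


FASTA_TEXT = ">alpha first example\nACGT\nAC\n>beta\nGGCC\n"

# An iterator becomes a list with a single call ...
records = list(iter_fasta(io.StringIO(FASTA_TEXT)))

# ... and a dictionary keyed on the identifier with one comprehension.
record_dict = {rec.id: rec for rec in iter_fasta(io.StringIO(FASTA_TEXT))}

print([rec.id for rec in records])
print(record_dict["beta"].seq)
```

One design point the plain dictionary leaves open: a later record with a duplicate identifier silently replaces the earlier one, which a dedicated helper could choose to report instead.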
From biopython-dev at maubp.freeserve.co.uk Sat Oct 28 07:59:13 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Oct 2006 12:59:13 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4542F123.9050106@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu>
Message-ID: <45434611.1040708@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of
> the new SeqIO module (from your code on Bugzilla) and found it to work
> quite well.

Thank you - and good to hear the (old version) is working OK.

> A few short comments:
>
> To parse a Fasta file using the new SeqIO looks like this:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta") :
>     print record.id
>     print record.seq
>
> I would rather have something like this:
>
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>     print record.id
>     print record.seq
>
> where Fasta.parse returns a FastaIterator object, and the argument is
> either a file object or a file name.

I think you have raised two issues - file names/handles (discussed below), and the use of a generic function versus a format specific one (or at least the naming conventions).

I like the idea of a generic function File2SequenceIterator() which can be used on lots of different file formats, just by changing the arguments.
However, there is nothing to stop you using the underlying format specific iterators directly:

from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
    print record.id
    print record.seq

(which is similar to your suggestion above)

As long as you don't need to use any file format specific options, then for every file format the style of the code is the same - but switching file formats takes a little more work:

from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
    print record.id
    print record.seq

versus:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus") :
    print record.id
    print record.seq

or, to give an example where the file extension is no use and the format must be explicitly stated:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
    print record.id
    print record.seq

I expect the "helper functions" like File2SequenceIterator() to be used for the simple cases where the user does not care about the minor options we might offer for individual file formats (this would cover beginners).

They are also nice for writing multiple file format test cases ;)

I see later in your email you suggested a generic Bio.SeqIO.parse(file) function which would cope with multiple file formats. Was your point more about what we call things?

I'm happy to go from File2SequenceIterator() to something like SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() - with matching versions like SeqList() and SeqDict()

However, I'm not so keen on "parse()" because it gives no clue as to what it will return.

---

On the other point, filenames/handles. Right now, the individual iterators only take a handle. This was a simplification I made to make my life as straightforward as possible.
The File2SequenceIterator() function (and friends) can take a filename, handle, or a string containing the contents of a file (in addition to the format). However, these are done as three separate arguments. I could have one argument that takes a file name or handle, and works it out on its own. Bio.Nexus tries to do this for example. Having the individual iterators also do this trick would be pretty simple (using a shared utility function). The "contents of a file" string argument was handy when testing, but I imagine this is not going to be a common situation. If people need this, they can use python's StringIO module to turn their data string into a handle easily enough. > You can in addition have a function > Bio.SeqIO.parse that guesses the file type from the file name extension > (as you have now for File2SequenceIterator), though that wouldn't work > for file handles. When dealing with a file handle, converting it to an undo file handle would probably work - if we had code to guess the file format. I have tried to raise a syntax error when a parser is given an invalid file - which would mean we could just try some common file formats in order until one works without a syntax error. But I felt this was not needed right away, so I put it off. > On a related note, I don't think we need the SequenceList and > SequenceDict class. 
> To make a list, one can do
>
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]

Currently that would be written:

from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]

Or even just the following, which I find simpler:

from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))

Versus the alternatives:

from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")

from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")

> To convert an iterator to a dictionary takes one line more, and is
> probably more straightforward than SequenceDict.

That was one thing I wanted to discuss - having a SequenceDict and SequenceList class would let us add doc strings and perhaps methods like maxlength, minlength, totallength, ...

Or, I can just use simple list and dict objects in the functions File2SequenceList and File2SequenceDict.

I have no strong preference on this issue - so unless someone else speaks up, I'll go back to simple lists and dictionaries - keeps things simple.

Peter

From sdavis2 at mail.nih.gov Sat Oct 28 12:47:03 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sat, 28 Oct 2006 12:47:03 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <4542ED32.8060702@c2b2.columbia.edu>
References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> <4542ED32.8060702@c2b2.columbia.edu>
Message-ID: <45438987.1070403@mail.nih.gov>

Michiel de Hoon wrote:
> OK, that's fine then.
>
> Is anybody actually using the current Bio/UniGene stuff? I couldn't
> find documentation for it and it hasn't been updated in more than two
> years, so it may be some dead code sitting around.
If so, we can > remove this code; Bio/UniGene would be a nice place to put Sean's code > (even though it is doing something different from the current > Bio/UniGene). I haven't looked into it much, but for dynamic queries of individual Unigene entries, it seems that Eutils might be the better way to go, anyway. Sean From mdehoon at c2b2.columbia.edu Sun Oct 29 01:09:14 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 29 Oct 2006 01:09:14 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45434611.1040708@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> Message-ID: <4544458A.5000102@c2b2.columbia.edu> Well let's first decide which functions we want in Bio.SeqIO, and then decide how to name them. I'm fine with the idea of having a function that can guess the file format from the extension. I also agree that a parser that can guess the file format from the file contents is not needed at this point. > That was one thing I wanted to discuss - having a SequenceDict and > SequenceList class would let us add doc strings and perhaps methods > like maxlength, minlength, totallength, ... > > Or, I can just use simple list and dict objects in the functions > File2SequenceList and File2SequenceDict. > > I have no strong preference on this issue - so unless someone else > speaks up, I'll go back to simple lists and dictionaries - keeps > things simple. If we go back to simple lists and dictionaries, do we still need the functions File2SequenceList and File2SequenceDict? I'd like to avoid software bloat as much as possible, so if we don't need these two functions, so much the better. About file handles: > The File2SequenceIterator() function (and friends) can take a > filename, handle, or a string containing the contents of a file (in > addition to the format). However, these are done as three separate > arguments. 
> > I could have one argument that takes a file name or handle, and works > it out on its own. Bio.Nexus tries to do this for example. Having > the individual iterators also do this trick would be pretty simple > (using a shared utility function). > > The "contents of a file" string argument was handy when testing, but I > imagine this is not going to be a common situation. If people need > this, they can use python's StringIO module to turn their data string > into a handle easily enough. I like the idea of one argument that takes a file name or handle. I believe that that is how other Biopython functions work. --Michiel. Peter wrote: > Michiel de Hoon wrote: >> Thanks, Peter! >> It looks very nice. Actually, I have been using an earlier version of >> the new SeqIO module (from your code on Bugzilla) and found it to work >> quite well. > > Thank you - and good to here the (old version) is working OK. > > > A few short comments: >> >> To parse a Fasta file using the new SeqIO looks like this: >> >> from Bio.SeqIO import File2SequenceIterator >> for record in File2SequenceIterator("example.fasta") : >> print record.id >> print record.seq >> >> I would rather have something like this: >> >> from Bio.SeqIO import Fasta >> for record in Fasta.parse(open("example.fasta")): >> print record.id >> print record.seq >> >> where Fasta.parse returns a FastaIterator object, and the argument is >> either a file object or a file name. > > I think you have raised two issues - file names/handles (discussed > below), and the use of a generic function versus a format specific one > (or at least the naming conventions). > > I like the idea of a generic function File2SequenceIterator() which can > be used on lots of different file formats, just by changing the > arguments. 
However, there is nothing to stop you using the underlying > format specific iterators directly: > > from Bio.SeqIO.FastaIO import FastaIterator > for record in FastaIterator(open("example.fasta")): > print record.id > print record.seq > > (which is similar to your suggestion above) > > As long as you don't need to use any file format specific options, then > for every file format the style of the code is the same - but switching > file formats takes a little more work: > > from Bio.SeqIO.NexusIO import NexusIterator > for record in NexusIterator(open("example.nexus")): > print record.id > print record.seq > > versus: > > from Bio.SeqIO import File2SequenceIterator > for record in File2SequenceIterator("example.nexus") : > print record.id > print record.seq > > or, to give an example where the file extension is no use and the format > must be explicitly stated: > > from Bio.SeqIO import File2SequenceIterator > for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") : > print record.id > print record.seq > > I expect the "helper functions" like File2SequenceIterator() to be used > for the simple cases where the user does not care about the minor > options we might offer for individual file formats (this would cover > beginners). > > They are also nice for writing multiple file format test cases ;) > > I see later in you email you suggested a generic Bio.SeqIO.parse(file) > function which would cope with multiple file formats. Was your point > more about what we call things? > > I'm happy to go from File2SequenceIterator() to something like > SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() - > with matching versions like SeqList() and SeqDict() > > However, I'm not so keen on "parse()" because it gives no clue as to > what it will return. > > --- > > On the other point, filenames/handles. Right now, the individual > iterators only take a handle. This was a simplification I made to make > my life as straight forward as possible. 
> > The File2SequenceIterator() function (and friends) can take a filename, > handle, or a string containing the contents of a file (in addition to > the format). However, these are done as three separate arguments. > > I could have one argument that takes a file name or handle, and works it > out on its own. Bio.Nexus tries to do this for example. Having the > individual iterators also do this trick would be pretty simple (using a > shared utility function). > > The "contents of a file" string argument was handy when testing, but I > imagine this is not going to be a common situation. If people need > this, they can use python's StringIO module to turn their data string > into a handle easily enough. > > > You can in addition have a function >> Bio.SeqIO.parse that guesses the file type from the file name >> extension (as you have now for File2SequenceIterator), though that >> wouldn't work for file handles. > > When dealing with a file handle, converting it to an undo file handle > would probably work - if we had code to guess the file format. I have > tried to raise a syntax error when a parser is given an invalid file - > which would mean we could just try some common file formats in order > until one works without a syntax error. > > But I felt this was not needed right away, so I put it off. > >> On a related note, I don't think we need the SequenceList and >> SequenceDict class. 
To make a list, one can do >> >> from Bio.SeqIO import Fasta >> records = [record for record in Fasta.parse(open("example.fasta"))] > > Currently that would be written: > > from Bio.SeqIO.FastaIO import FastaIterator > records = [record for record in FastaIterator(open("example.fasta"))] > > Or even just the following, which I find simpler: > > from Bio.SeqIO.FastaIO import FastaIterator > records = list(FastaIterator(open("example.fasta"))) > > Versus the alternatives: > > from Bio.SeqIO import File2SequenceList > records = File2SequenceList("example.fasta") > > from Bio.SeqIO import File2SequenceDict > record_dict = File2SequenceDict("example.fasta") > >> To convert an iterator to a dictionary takes one line more, and is >> probably more straightforward than SequenceDict. > > That was one thing I wanted to discuss - having a SequenceDict and > SequenceList class would let us add doc strings and perhaps methods like > maxlength, minlength, totallength, ... > > Or, I can just use simple list and dict objects in the functions > File2SequenceList and File2SequenceDict. > > I have no strong preference on this issue - so unless someone else > speaks up, I'll go back to simple lists and dictionaries - keeps things > simple. > > Peter > From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 06:25:35 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Oct 2006 11:25:35 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4544458A.5000102@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> Message-ID: <45448FAF.1090104@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Well let's first decide which functions we want in Bio.SeqIO, and then > decide how to name them. Agreed. One point against names like File2SequenceIterator is the pun on two versus to (i.e. 
convert) will not be so obvious to non-native English speakers.

> > That was one thing I wanted to discuss - having a SequenceDict and
> > SequenceList class would let us add doc strings and perhaps methods
> > like maxlength, minlength, totallength, ...
> >
> > Or, I can just use simple list and dict objects in the functions
> > File2SequenceList and File2SequenceDict.
> >
> > I have no strong preference on this issue - so unless someone else
> > speaks up, I'll go back to simple lists and dictionaries - keeps
> > things simple.
>
> If we go back to simple lists and dictionaries, do we still need the
> functions File2SequenceList and File2SequenceDict? I'd like to avoid
> software bloat as much as possible, so if we don't need these two
> functions, so much the better.

I think there is some benefit to having File2SequenceDict included, as
converting from a SeqRecord iterator to a dictionary of SeqRecords isn't
completely trivial.

There are at least two important questions: what to use as the
dictionary key (e.g. record.id) and how to deal with duplicate keys
(e.g. use first/last record with that id, or simply abort).

Consider this code as an alternative to File2SequenceDict:

    iterator = File2SequenceList(...)
    d = dict([record.id, record] for record in iterator)

I don't think it's very readable, or intuitive (and could scare
beginners). Part of my aim with Bio.SeqIO was to make the interface
simple.

More importantly, if there are records with duplicate ids then with this
code the resulting dictionary will have only the last record. Personally
I would want duplicate keys to cause an exception.

Rewriting File2SequenceDict() to use a simple dict would give something
like this, where record2key is an optional user-supplied function.

    def File2SequenceDict(..., record2key=None) :
        iterator = File2SequenceIterator(...)
        if record2key is None : record2key = lambda record : record.id
        answer = dict()
        for record in iterator :
            key = record2key(record)
            assert key not in answer, "Duplicate key"
            answer[key] = record
        return answer

The record2key function is perhaps not needed - I was trying to make the
function flexible. The duplicate key behaviour could also be an option.

The other function, File2SequenceList, isn't really needed if we are
using simple lists. It's basically a wrapper for
list(File2SequenceIterator(...)) or some other one-liner.

The main reason I invented File2SequenceList() was for completeness -
given I already had File2SequenceDict() and File2SequenceIterator()

> About file handles:
>
> > The File2SequenceIterator() function (and friends) can take a
> > filename, handle, or a string containing the contents of a file (in
> > addition to the format). However, these are done as three separate
> > arguments.
> >
> > I could have one argument that takes a file name or handle, and works
> > it out on its own. Bio.Nexus tries to do this for example. Having
> > the individual iterators also do this trick would be pretty simple
> > (using a shared utility function).
> >
> > The "contents of a file" string argument was handy when testing, but I
> > imagine this is not going to be a common situation. If people need
> > this, they can use python's StringIO module to turn their data string
> > into a handle easily enough.
>
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.

OK then - I'll do that.
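
A rough sketch of the "one argument that takes a file name or handle" idea, for illustration only. The names open_source and count_fasta_records are made up here (they are not existing Biopython functions), and the sketch assumes a file name is always passed as a plain string.

```python
import io


def open_source(source, mode="r"):
    """Return (handle, opened_here) for a file name or an open handle.

    Hypothetical helper: a string is treated as a file name and opened;
    anything else is assumed to already be a file-like handle.
    """
    if isinstance(source, str):
        return open(source, mode), True
    return source, False


def count_fasta_records(source):
    """Count lines starting with '>' given a FASTA file name or handle."""
    handle, opened_here = open_source(source)
    try:
        return sum(1 for line in handle if line.startswith(">"))
    finally:
        # Only close handles opened here; the caller owns any others.
        if opened_here:
            handle.close()


print(count_fasta_records(io.StringIO(">a\nACGT\n>b\nGGCC\n")))  # prints 2
```

A string of file contents can still be handled this way by wrapping it in StringIO first, as suggested in the quoted text.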
Peter From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 19:13:57 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 30 Oct 2006 00:13:57 +0000 Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna Message-ID: <454543C5.1080209@maubp.freeserve.co.uk> Hello all, I've been looking at writing multiple sequence alignments in Nexus format for the new Bio.SeqIO code, and came up with the following little problem: Given one or more Seq objects, how can I reliably decide if they are protein, DNA, or RNA? (These are the relevant choices in a Nexus file's format datatype=... header.) I'm resigned to the fact that if the Seq object has the generic alphabet this boils down to looking at the sequence strings and making an educated guess (probably following an established algorithm from an alignment program). Does any such code already exist in BioPython? However - is there a nice/official way to ask an alphabet object what it is (protein, DNA, RNA)? Looking over the code in Bio.Alphabet the only thing I can think of is to get the class name as a string and search it(!) We can't look at the letters property as this is None for the base classes like ProteinAlphabet. If we are prepared to meddle with the alphabet system we might add attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these base classes. Or simply have a "sequence_type" method, which the subclasses can re-define as required. (I wasn't meaning to reopen the whole "do we need alphabets" conversation last discussed in July 2006. At least, not yet...) 
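
One way the "sequence_type" idea could look, as a sketch only. The classes below are stand-ins that borrow names in the style of Bio.Alphabet purely for illustration; they are not the real Biopython classes, the string values are assumptions, and a class attribute is used here where the text above suggests a method.

```python
class Alphabet:
    # Generic alphabet: the sequence type is unknown.
    sequence_type = None


class ProteinAlphabet(Alphabet):
    sequence_type = "protein"


class NucleotideAlphabet(Alphabet):
    # Known to be nucleotide, but not whether DNA or RNA.
    sequence_type = "nucleotide"


class DNAAlphabet(NucleotideAlphabet):
    sequence_type = "dna"


class RNAAlphabet(NucleotideAlphabet):
    sequence_type = "rna"


def nexus_datatype(alphabets):
    """Return 'protein', 'dna' or 'rna' if all alphabets agree, else None."""
    types = {alphabet.sequence_type for alphabet in alphabets}
    if len(types) == 1:
        (only,) = types
        if only in ("protein", "dna", "rna"):
            return only
    # Mixed, generic, or merely "nucleotide": the caller has to guess.
    return None


print(nexus_datatype([DNAAlphabet(), DNAAlphabet()]))  # prints dna
```

Subclasses inherit the value, so only the base classes would need touching; a None result is the case where looking at the sequence strings is still required.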
Peter From fkauff at duke.edu Sun Oct 29 19:48:39 2006 From: fkauff at duke.edu (Frank) Date: Sun, 29 Oct 2006 19:48:39 -0500 Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk> References: <454543C5.1080209@maubp.freeserve.co.uk> Message-ID: <1162169319.12941.5.camel@cpe-071-077-002-012.nc.res.rr.com> Hi all, On Mon, 2006-10-30 at 00:13 +0000, Peter (BioPython Dev) wrote: > Hello all, > > I've been looking at writing multiple sequence alignments in Nexus > format for the new Bio.SeqIO code, and came up with the following little > problem: > > Given one or more Seq objects, how can I reliably decide if they are > protein, DNA, or RNA? > > (These are the relevant choices in a Nexus file's format datatype=... > header.) > > I'm resigned to the fact that if the Seq object has the generic alphabet > this boils down to looking at the sequence strings and making an > educated guess (probably following an established algorithm from an > alignment program). Does any such code already exist in BioPython? > I'm not aware of any such code - however, an educated guess would be easy, (more or less ACGTNX only, ACGUNX only, everything else...?). With NEXUS it becomes tricky, as a dataset could potentially be partitioned into a mix of all types. And there is no "official" way to indicate this in the datatype= option. Frank > However - is there a nice/official way to ask an alphabet object what it > is (protein, DNA, RNA)? > > Looking over the code in Bio.Alphabet the only thing I can think of is > to get the class name as a string and search it(!) We can't look at the > letters property as this is None for the base classes like ProteinAlphabet. > > If we are prepared to meddle with the alphabet system we might add > attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these > base classes. Or simply have a "sequence_type" method, which the > subclasses can re-define as required. 
> > (I wasn't meaning to reopen the whole "do we need alphabets" > conversation last discussed in July 2006. At least, not yet...) > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From mdehoon at c2b2.columbia.edu Sun Oct 29 22:20:48 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 29 Oct 2006 22:20:48 -0500 Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk> References: <454543C5.1080209@maubp.freeserve.co.uk> Message-ID: <45456F90.1090005@c2b2.columbia.edu> Peter (BioPython Dev) wrote: > Given one or more Seq objects, how can I reliably decide if they are > protein, DNA, or RNA? > > (These are the relevant choices in a Nexus file's format datatype=... > header.) > > I'm resigned to the fact that if the Seq object has the generic alphabet > this boils down to looking at the sequence strings and making an > educated guess (probably following an established algorithm from an > alignment program). Does any such code already exist in BioPython? Something similar exists in Bio.Seq in the complement, reverse_complement methods of Seq objects, but it only distinguishes between DNA and RNA. I don't know of any official way to do that in Biopython. --Michiel. 
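
The "educated guess" from the sequence letters (ACGTNX only, ACGUNX only, everything else) could be sketched along these lines. This is an illustration under assumptions, not existing Biopython code: the function name, the gap characters that are ignored, and the rule that anything else counts as protein are all choices made for the example.

```python
def guess_sequence_type(sequence):
    """Guess 'dna', 'rna' or 'protein' from the letters in a sequence."""
    # Ignore common gap / missing-data characters (an assumption).
    letters = set(sequence.upper()) - set("-.?*")
    if not letters:
        return None  # nothing to go on
    if letters <= set("ACGTNX"):
        return "dna"
    if letters <= set("ACGUNX"):
        return "rna"
    return "protein"


for example in ("ACGT-TGCA", "ACGUUGCA", "MKVLAAGIVG"):
    print(example, guess_sequence_type(example))
```

A sequence with neither T nor U is reported as DNA simply because that test comes first, and a short peptide made only of A, C, G and T residues would be misread as DNA, so this really is only a guess.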
From mdehoon at c2b2.columbia.edu Sun Oct 29 22:42:44 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 29 Oct 2006 22:42:44 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45448FAF.1090104@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> Message-ID: <454574B4.3050407@c2b2.columbia.edu> Peter wrote: > There are at least two important questions: What to use as the > dictionary key (e.g. record.id) and how to deal with duplicate keys > (e.g. use first/last record with that id, or simply abort). > > Rewriting File2SequenceDict() to use a simple dict would give something > like this, where record2key is an optional user supplied function. > > def File2SequenceDict(..., record2key=None) : > iterator = File2SequenceIterator(...) > if record2key is None : record2key = lambda record : record.id > answer = dict() > for record in iterator : > key = record2key(record) > assert key not in answer, "Duplicate key" > answer[key] = record > return answer > > The record2key function is perhaps not needed - I was trying to make the > function flexible. The duplicate key behaviour could also be an option. > I am using File2SequenceIterator in one of my scripts (thanks by the way for that, my script is a lot faster now. I didn't do a rigorous timing, but it's about a zillion times faster), and convert the iterator to a dictionary using plain Python. If I were to use File2SequenceDict instead, I would need the record2key argument, because in my application I want only part of record.id as the key. In the File2SequenceDict above, answer[key] contains the complete record. Some people will want that. However, in my application I only want to store the record.seq part in answer[key]. Somebody else may want str(record.seq). So we'd also need a record2value argument. 
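
The plain Python route with both a key function and a value function could look roughly like this. It is only a sketch: Record is a stand-in for SeqRecord, the "|"-separated identifiers are made up, records_to_dict is a hypothetical name, and raising ValueError on a duplicate key is just one possible policy.

```python
from collections import namedtuple

# Stand-in for SeqRecord, with only the two attributes used here.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key, record2value):
    """Build a dict from an iterator of records, refusing duplicate keys."""
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key: %r" % (key,))
        answer[key] = record2value(record)
    return answer


records = iter([
    Record("gi|1|example|A1", "ACGT"),
    Record("gi|2|example|B2", "GGCC"),
])
seqs = records_to_dict(
    records,
    record2key=lambda record: record.id.split("|")[3],  # part of the id
    record2value=lambda record: record.seq,             # sequence only
)
print(seqs)  # prints {'A1': 'ACGT', 'B2': 'GGCC'}
```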
For duplicate keys, there are at least four possibilities (raise an
exception, store only one of the keys, store neither of the keys and
don't raise an exception, store both after modifying one of the keys).
So this should also be an option.

You'll end up with a File2SequenceDict function that is more complicated
than the plain Python solution.

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 05:54:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 10:54:41 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454574B4.3050407@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu>
Message-ID: <4545D9F1.2040902@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...

I've updated the new code in Bio.SeqIO to remove SequenceDict and
SequenceList and use the standard dictionary and list instead.

Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way
> for that, my script is a lot faster now. I didn't do a rigorous timing,
> but it's about a zillion times faster), and convert the iterator to a
> dictionary using plain Python. If I were to use File2SequenceDict
> instead, I would need the record2key argument, because in my application
> I want only part of record.id as the key.

With such a speed up, I'd guess you were using Bio.Fasta before. I've
noticed the same thing.

Are you dealing with NCBI style fasta identifiers made up of several
fields separated by "|" characters?

> In the File2SequenceDict above, answer[key] contains the complete
> record. Some people will want that. However, in my application I only
> want to store the record.seq part in answer[key]. Somebody else may want
> str(record.seq). So we'd also need a record2value argument.

It does slightly undermine the "you only get SeqRecord objects"
principle. On the other hand, it's a simple addition that is easy to
explain and implement. I'm happy to add this.

> For duplicate keys, there are at least four possibilities (raise an
> exception, store only one of the keys, store neither of the keys and
> don't raise an exception, store both after modifying one of the keys).
> So this should also be an option.

Supporting all these options with an easy to understand interface looks
too hard.

In my opinion if someone is trying to build a dictionary using repeated
keys they have made a mistake (either in their datafile, or their
record2key function) - so raising an exception is reasonable default
behaviour (and is easy to code).

Apart from the "exception" option, which of these actions do you
generally find most appropriate?

> You'll end up with a File2SequenceDict function that is more complicated
> than the plain Python solution.

Yes. Trying to do everything would be bad - both complicated to
implement, probably complicated to use as well.

Peter

From mdehoon at c2b2.columbia.edu Mon Oct 30 17:02:34 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 30 Oct 2006 17:02:34 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4546767A.70302@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)

Looks good to me. Thanks!

--Michiel.
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From bugzilla-daemon at portal.open-bio.org Sun Oct 1 06:10:37 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 1 Oct 2006 02:10:37 -0400 Subject: [Biopython-dev] [Bug 1939] Doc/Makefile does not build pdf, html, txt files completely correctly In-Reply-To: Message-ID: <200610010610.k916Ab3S003487@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1939 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #19 from mdehoon at ims.u-tokyo.ac.jp 2006-10-01 02:10 ------- I've taken bits and pieces of the patch to get the recursive behavior for make. Getting html output from biopdb_faq is not essential, and does not warrant adding a hack to Biopython. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chris.lasher at gmail.com Tue Oct 10 02:55:17 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Mon, 9 Oct 2006 22:55:17 -0400 Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster Message-ID: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> I just checked out the latest CVS and setup.py failed on installation during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't be found. Any suggestions? Was this file supposed to be in the CVS? The checkout notes indicate that it has been replaced with something else. 
Thanks, Chris From mdehoon at c2b2.columbia.edu Tue Oct 10 04:10:22 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Tue, 10 Oct 2006 00:10:22 -0400 Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster In-Reply-To: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> References: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com> Message-ID: <452B1D2E.4060703@c2b2.columbia.edu> There was some confusion about the license status of ranlib, so I removed it from Bio.Cluster and replaced it with a new random number generator written from scratch. Apparently I forgot to update setup.py in CVS accordingly. I have done that now, so if you get the new setup.py from CVS the compilation should work. You could also edit your local copy of setup.py and remove ranlib.c, linpack.c, and com.c. Sorry for the confusion. --Michiel. Chris Lasher wrote: > I just checked out the latest CVS and setup.py failed on installation > during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't > be found. Any suggestions? Was this file supposed to be in the CVS? > The checkout notes indicate that it has been replaced with something > else. > > Thanks, > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From chris.lasher at gmail.com Tue Oct 10 04:46:29 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 10 Oct 2006 00:46:29 -0400 Subject: [Biopython-dev] Subversion Repository Message-ID: <128a885f0610092146y5a184ccfw31d433d228a9b05d@mail.gmail.com> Anybody know if BioPython (I suppose all Open Bio projects) will switch over to Subversion, and if so, when? I think the merits and advantages of Subversion over CVS speak for themselves. It's certainly become my revision control system of preference. Anybody else's? 
Curious, Chris From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:14:23 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Oct 2006 00:14:23 -0400 Subject: [Biopython-dev] [Bug 2014] Bio/Blast/NCBIStandalone.py parsing of psiblast fails In-Reply-To: Message-ID: <200610190414.k9J4ENQm025952@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2014 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:14 ------- Fixed in CVS, thanks. Note though that the parser for plain-text blast output is very difficult to maintain, because the output format keeps changing with different versions of blast. I'd encourage you to use the XML parser instead, as it is much more stable. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:21:40 2006 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Oct 2006 00:21:40 -0400 Subject: [Biopython-dev] [Bug 2032] query_to and sbjct_to added in parsed NCBI-Blast XML In-Reply-To: Message-ID: <200610190421.k9J4LeGH026582@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2032 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:21 ------- Fixed in CVS following same bug report on the mailing list. 
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:44:57 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:44:57 -0400
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200610190444.k9J4ivGv028629@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2051

mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mdehoon at ims.u-tokyo.ac.jp

------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:44 -------
A new blast version (2.2.15) came out recently, so I tried the XML parser on
its output for a multiple query. I didn't notice a problem except that all
alignments are put into one list, which is annoying because then we have to
find out which alignment corresponds to which query. So, which specific
problem with the XML parser are you trying to solve? And do these problems
still occur with blast 2.2.15? (As far as I can tell, its XML output is the
same as for blast 2.2.14, so it's probably here to stay.)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chris.lasher at gmail.com Wed Oct 25 02:22:06 2006 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 24 Oct 2006 22:22:06 -0400 Subject: [Biopython-dev] [BioPython] Martel-based parsing of Unigene flat files In-Reply-To: <453975D5.4070701@mail.nih.gov> References: <453975D5.4070701@mail.nih.gov> Message-ID: <128a885f0610241922h5db02fbfod1a83cfeade29801@mail.gmail.com> Hi Sean, FWIW this should probably have been posted to BioPython-dev, but I don't think that would improve your chances of getting a response. I am cross-posting it there, anyways. Unfortunately for you, I do not have an answer for you. :-( I, myself, would be interested in a response to this question from the Devs, as I would like to write a parser for PTT files. Last I saw there was a lot of chatter about the Martel parsers being incredibly slow compared to straightforward solutions. It seems that standard format parsers would be one of the easiest ways for BioPython newbies to contribute to developing the BioPython project, however, there isn't very much in the way of documentation on the BioPython way to do so, let alone developer documentation at all. I would like to know what can be done to get some dev docs going on the wiki. Chris On 10/20/06, Sean Davis wrote: > I am relatively new to python and biopython (coming from perl side of > things). I would like to make a parser for Unigene flat file format. > However, after digging through the LocusLink parsing code (as probably > the most similar format, etc.), I'm still at a loss for how Martel-based > parsing works. I understand the big picture (converting an re-based > parsing of a file into events), but it is the detail that I am missing. > I know about pydoc, but the pydoc for much of Martel is not very helpful > to me, at least not in my current state of knowledge. Any suggestions > on how to get started? 
>
> Thanks,
> Sean
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From sdavis2 at mail.nih.gov Thu Oct 26 12:09:43 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:09:43 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
Message-ID: <200610260809.43245.sdavis2@mail.nih.gov>

Let me start off by saying that I am a python newbie after working in perl
for the last few years. I am working on a Unigene flat file parser. In my
scanner, I have a construct that looks like:

    for line in handle:
        tag = line.split(' ')[0]
        line = line.rstrip()
        if tag=='ID':
            consumer.ID(line)
        if tag=='GENE':
            consumer.GENE(line)
        if tag=='TITLE':
            consumer.TITLE(line)
        if tag=='EXPRESS':
            consumer.EXPRESS(line)
        ....

Since I am setting things up so that there is a 1:1 correspondence between
the "tag" and the consumer method, is there an easy way to reduce this long
set of IF statements to a simple mapping procedure that maps a tag to the
correct method?

Sorry for the naive question....

Thanks,
Sean

From sdavis2 at mail.nih.gov Thu Oct 26 12:30:08 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:30:08 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
In-Reply-To: <200610260809.43245.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
Message-ID: <200610260830.08326.sdavis2@mail.nih.gov>

On Thursday 26 October 2006 08:09, Sean Davis wrote:
> Let me start off by saying that I am a python newbie after working in perl
> for the last few years. I am working on a Unigene flat file parser. In my
> scanner, I have a construct that looks like:
>
> for line in handle:
>     tag = line.split(' ')[0]
>     line = line.rstrip()
>     if tag=='ID':
>         consumer.ID(line)
>     if tag=='GENE':
>         consumer.GENE(line)
>     if tag=='TITLE':
>         consumer.TITLE(line)
>     if tag=='EXPRESS':
>         consumer.EXPRESS(line)
>     ....
>
> Since I am setting things up so that there is a 1:1 correspondence between
> the "tag" and the consumer method, is there an easy way to reduce this long
> set of IF statements to a simple mapping procedure that maps a tag to the
> correct method?
>
> Sorry for the naive question....

Even more apologies. I answered my own question. Something like this seems
to work:

    exec('consumer.'+tag+'(line)')

which replaces all the IF statements quite nicely.

Sean

From james.balhoff at duke.edu Thu Oct 26 13:46:34 2006
From: james.balhoff at duke.edu (Jim Balhoff)
Date: Thu, 26 Oct 2006 09:46:34 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
In-Reply-To: <200610260830.08326.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov> <200610260830.08326.sdavis2@mail.nih.gov>
Message-ID:

Hi Sean,

On Oct 26, 2006, at 8:30 AM, Sean Davis wrote:
> On Thursday 26 October 2006 08:09, Sean Davis wrote:
>> Let me start off by saying that I am a python newbie after working
>> in perl
>> for the last few years. I am working on a Unigene flat file
>> parser. In my
>> scanner, I have a construct that looks like:
>>
>> for line in handle:
>>     tag = line.split(' ')[0]
>>     line = line.rstrip()
>>     if tag=='ID':
>>         consumer.ID(line)
>>     if tag=='GENE':
>>         consumer.GENE(line)
>>     if tag=='TITLE':
>>         consumer.TITLE(line)
>>     if tag=='EXPRESS':
>>         consumer.EXPRESS(line)
>>     ....
>>
>> Since I am setting things up so that there is a 1:1 correspondence
>> between
>> the "tag" and the consumer method, is there an easy way to reduce
>> this long
>> set of IF statements to a simple mapping procedure that maps a tag
>> to the
>> correct method?
>> >> Sorry for the naive question.... > > Even more apologies. I answered my own question. Something like > this seems > to work: > > exec('consumer.'+tag+'(line)') > > which replaces all the IF statements quite nicely. Alternatively, you may want to look at getattr(). There is a good description here: Jim From sdavis2 at mail.nih.gov Thu Oct 26 14:56:25 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 26 Oct 2006 10:56:25 -0400 Subject: [Biopython-dev] Unigene flat file parser Message-ID: <200610261056.25883.sdavis2@mail.nih.gov> I have put together a parser for the Unigene flat file format described here: ftp://ftp.ncbi.nih.gov/repository/UniGene/README under the Hs.data section. The actual .data files are included in the various organism-specific directories. Is there any interest in including this in biopython? If so, I would appreciate some input on the code and details of contributions, etc. The current code is available here: http://watson.nci.nih.gov/pressa/~sdavis/Unigene.py Use like so and note that the ugrecord has much more information (in fact, all information is captured) in it that given in its __repr__. #!/usr/bin/python import Unigene fh = file('Hs.data') #downloaded previously from ftp, or whatever ugparser = Unigene.Iterator(fh,Unigene.RecordParser()) for ugrecord in ugparser: print ugrecord From mdehoon at c2b2.columbia.edu Thu Oct 26 18:01:24 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Thu, 26 Oct 2006 14:01:24 -0400 Subject: [Biopython-dev] Unigene flat file parser In-Reply-To: <200610261056.25883.sdavis2@mail.nih.gov> References: <200610261056.25883.sdavis2@mail.nih.gov> Message-ID: <4540F7F4.2050003@c2b2.columbia.edu> Sean Davis wrote: > I have put together a parser for the Unigene flat file format described here: Perhaps a silly question from a non-Unigene user, but what is the relation between your parser and the one in Bio/UniGene/__init__.py? 
The latter seems to parse HTML files (see the example in Tests/test_unigene.py) instead of flat files. Is your parser intended as a replacement for Bio/UniGene/__init__.py? --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From sdavis2 at mail.nih.gov Thu Oct 26 20:15:52 2006 From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E]) Date: Thu, 26 Oct 2006 16:15:52 -0400 Subject: [Biopython-dev] Unigene flat file parser References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> Message-ID: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> Michiel, It looks to me like it parses an HTML file downloaded from the NCBI website containing a single unigene record of interest--potentially useful if one knows what one needs. I, on the other hand, have always just used the flat files as the source for unigene, as I typically want ALL the data for one or several species available. A single flat file is available for each organism and contains ALL the unigene entries and their associated information for that organism. By concatenating several files (they are simple text files), one can parse the entire unigene database. So, in short, I don't see this unigene parser as a replacement for the current module. They fill different needs; this one fills a need that I have and is useful for whole-genome, multiple species work, or microarray analyses and whether and where it fits into biopython is really up to the community. Just a quick comment on speed for the parser--it parses Hs.data (the largest flat file in unigene, 84,000 entries, with just under 7,000,000 sequence entries, 150 Mb file size) in just under 5 minutes on my Xeon desktop. 
Sean

From biopython-dev at maubp.freeserve.co.uk Fri Oct 27 19:08:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 27 Oct 2006 20:08:21 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
Message-ID: <45425925.8090607@maubp.freeserve.co.uk>

Hello list,

I've checked in a somewhat cleaned up (and more tested) version of the
earlier attachments to bug 2059.

And I've updated the wiki page:
http://biopython.org/wiki/SeqIO

Has anyone got any tips on formatting python code on Wiki? Maybe I
should just write the docs in LaTeX like the cook book etc.

Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
objects, it would be a good idea to make them slightly more user-friendly:

http://bugzilla.open-bio.org/show_bug.cgi?id=2057

(I would like to check this in before writing too much of the SeqIO
documentation)

If any of you want to check this out and have a look, I'd be pleased to
get some feedback. There should be no impact on the rest of BioPython,
or existing scripts.
Peter

-----------------------------------------------------------------

Link to view CVS,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython

Old files, not touched:
Bio/SeqIO/FASTA.py
Bio/SeqIO/generic.py

Bio/SeqIO/__init__.py (replaces almost empty old file)
======================
* the helper functions (i.e. the functions I expect people to use)
* mappings from file types to parsers and writers
* mappings from file extensions to file types
* large self test suite (which does not need any input files, but will
  create a temp file in the current directory)

Bio/SeqIO/Interfaces.py
=======================
Base classes for readers/writers

Bio/SeqIO/FastaIO.py
====================
Uses a generator function for the reader.
Uses a sub-class of SequentialSequenceWriter for the writer.

Bio/SeqIO/ClustalIO.py
======================
Uses a generator function for the reader, based on the old class in
Bio/SeqIO/generic.py

Bio/SeqIO/PhylipIO.py
=====================
Reads and writes phylip files with names strictly truncated at 10
characters. Uses a generator function for the reader, subclasses
SequenceWriter

Bio/SeqIO/StockholmIO.py
========================
Uses subclasses from Interfaces.py

Unlike prior code attached to bug 2059, this code contains just one
writer and parser, which expects the Stockholm file to follow the PFAM
conventions. It should read other files fine - but what happens to the
annotation is less well defined. This is what BioPerl does
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c10

Bio/SeqIO/GenBankIO.py
======================
Uses a generator function for the reader, which just calls Bio.GenBank
to do the work. See also bug 2059 comment 11 on my thoughts about how
to include EMBL support:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c11

Bio/SeqIO/NexusIO.py
====================
Uses a generator function for the reader, which just calls Bio.Nexus to
do the parsing and then extracts the sequences.
Has not been tested much. Peter From mdehoon at c2b2.columbia.edu Sat Oct 28 05:40:02 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 28 Oct 2006 01:40:02 -0400 Subject: [Biopython-dev] Unigene flat file parser In-Reply-To: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> Message-ID: <4542ED32.8060702@c2b2.columbia.edu> OK, that's fine then. Is anybody actually using the current Bio/UniGene stuff? I couldn't find documentation for it and it hasn't been updated in more than two years, so it may be some dead code sitting around. If so, we can remove this code; Bio/UniGene would be a nice place to put Sean's code (even though it is doing something different from the current Bio/UniGene). --Michiel. Davis, Sean (NIH/NCI) [E] wrote: > So, in short, I don't see this unigene parser as a replacement for > the current module. They fill different needs; this one fills a need > that I have and is useful for whole-genome, multiple species work, or > microarray analyses and whether and where it fits into biopython is > really up to the community. > > Michiel wrote: >> Perhaps a silly question from a non-Unigene user, but what is the >> relation between your parser and the one in >> Bio/UniGene/__init__.py? The latter seems to parse HTML files (see >> the example in Tests/test_unigene.py) instead of flat files. Is >> your parser intended as a replacement for Bio/UniGene/__init__.py? From mdehoon at c2b2.columbia.edu Sat Oct 28 05:56:51 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 28 Oct 2006 01:56:51 -0400 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> Message-ID: <4542F123.9050106@c2b2.columbia.edu> Thanks, Peter! It looks very nice. 
Actually, I have been using an earlier version of the new SeqIO module
(from your code on Bugzilla) and found it to work quite well.
A few short comments:

To parse a Fasta file using the new SeqIO looks like this:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.fasta") :
    print record.id
    print record.seq

I would rather have something like this:

from Bio.SeqIO import Fasta
for record in Fasta.parse(open("example.fasta")):
    print record.id
    print record.seq

where Fasta.parse returns a FastaIterator object, and the argument is
either a file object or a file name. You can in addition have a function
Bio.SeqIO.parse that guesses the file type from the file name extension
(as you have now for File2SequenceIterator), though that wouldn't work
for file handles.

On a related note, I don't think we need the SequenceList and
SequenceDict class. To make a list, one can do

from Bio.SeqIO import Fasta
records = [record for record in Fasta.parse(open("example.fasta"))]

To convert an iterator to a dictionary takes one line more, and is
probably more straightforward than SequenceDict.

--Michiel.
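For readers following the thread, the list and dictionary conversions
mentioned in the message above can be sketched roughly as follows. This is
written for present-day Python 3, not the 2006 code base, and it does not
use the Bio.SeqIO code under discussion: the Record tuple and the
fasta_records generator are hypothetical stand-ins for SeqRecord and a
FASTA iterator.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord; only id and seq are modelled.
Record = namedtuple("Record", ["id", "seq"])


def fasta_records(handle):
    """Yield Record objects from a FASTA-formatted handle (simplified)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield Record(name, "".join(chunks))
            # The id is the first whitespace-delimited word after '>'.
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield Record(name, "".join(chunks))


EXAMPLE = ">alpha first\nACGT\nAC\n>beta\nGGTT\n"

# An iterator becomes a list with list() ...
records = list(fasta_records(io.StringIO(EXAMPLE)))

# ... and a dictionary keyed on record.id with one more expression.
# Note that a later record silently replaces an earlier one with the same id.
by_id = {rec.id: rec for rec in fasta_records(io.StringIO(EXAMPLE))}
```

The dictionary comprehension keeps only the last record for a repeated id,
which may or may not be the behaviour one wants.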
From biopython-dev at maubp.freeserve.co.uk Sat Oct 28 11:59:13 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Oct 2006 12:59:13 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4542F123.9050106@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu>
Message-ID: <45434611.1040708@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of
> the new SeqIO module (from your code on Bugzilla) and found it to work
> quite well.

Thank you - and good to hear the (old version) is working OK.

> A few short comments:
>
> To parse a Fasta file using the new SeqIO looks like this:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta") :
>     print record.id
>     print record.seq
>
> I would rather have something like this:
>
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>     print record.id
>     print record.seq
>
> where Fasta.parse returns a FastaIterator object, and the argument is
> either a file object or a file name.

I think you have raised two issues - file names/handles (discussed
below), and the use of a generic function versus a format specific one
(or at least the naming conventions).

I like the idea of a generic function File2SequenceIterator() which can
be used on lots of different file formats, just by changing the
arguments.
However, there is nothing to stop you using the underlying format
specific iterators directly:

from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
    print record.id
    print record.seq

(which is similar to your suggestion above)

As long as you don't need to use any file format specific options, then
for every file format the style of the code is the same - but switching
file formats takes a little more work:

from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
    print record.id
    print record.seq

versus:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus") :
    print record.id
    print record.seq

or, to give an example where the file extension is no use and the format
must be explicitly stated:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
    print record.id
    print record.seq

I expect the "helper functions" like File2SequenceIterator() to be used
for the simple cases where the user does not care about the minor
options we might offer for individual file formats (this would cover
beginners).

They are also nice for writing multiple file format test cases ;)

I see later in your email you suggested a generic Bio.SeqIO.parse(file)
function which would cope with multiple file formats. Was your point
more about what we call things?

I'm happy to go from File2SequenceIterator() to something like
SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() -
with matching versions like SeqList() and SeqDict()

However, I'm not so keen on "parse()" because it gives no clue as to
what it will return.

---

On the other point, filenames/handles. Right now, the individual
iterators only take a handle. This was a simplification I made to make
my life as straightforward as possible.
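To make the explicit-format versus guessed-from-extension distinction
from earlier in this message concrete, here is a small sketch in
present-day Python 3. It is only an illustration: guess_format and the
EXTENSION_TO_FORMAT table are hypothetical names, and the entries are
assumptions rather than the mappings actually held in
Bio/SeqIO/__init__.py.

```python
import os

# Hypothetical extension table; the entries are illustrative assumptions.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fas": "fasta",
    ".nexus": "nexus",
    ".nex": "nexus",
    ".aln": "clustal",
    ".phy": "phylip",
}


def guess_format(filename, format=None):
    """Return the explicit format if given, else guess from the extension."""
    if format is not None:
        # An explicit format always wins over the file name.
        return format.lower()
    extension = os.path.splitext(filename)[1].lower()
    if extension not in EXTENSION_TO_FORMAT:
        raise ValueError(
            "Cannot guess format of %r; pass format=... explicitly" % filename
        )
    return EXTENSION_TO_FORMAT[extension]
```

With this, guess_format("example.nexus") resolves from the extension,
while a name such as "nexus_seqs.txt" needs format="nexus" and otherwise
raises ValueError instead of guessing.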
The File2SequenceIterator() function (and friends) can take a filename, handle, or a string containing the contents of a file (in addition to the format). However, these are done as three separate arguments. I could have one argument that takes a file name or handle, and works it out on its own. Bio.Nexus tries to do this for example. Having the individual iterators also do this trick would be pretty simple (using a shared utility function). The "contents of a file" string argument was handy when testing, but I imagine this is not going to be a common situation. If people need this, they can use python's StringIO module to turn their data string into a handle easily enough. > You can in addition have a function > Bio.SeqIO.parse that guesses the file type from the file name extension > (as you have now for File2SequenceIterator), though that wouldn't work > for file handles. When dealing with a file handle, converting it to an undo file handle would probably work - if we had code to guess the file format. I have tried to raise a syntax error when a parser is given an invalid file - which would mean we could just try some common file formats in order until one works without a syntax error. But I felt this was not needed right away, so I put it off. > On a related note, I don't think we need the SequenceList and > SequenceDict class. 
To make a list, one can do
>
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]

Currently that would be written:

from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]

Or even just the following, which I find simpler:

from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))

Versus the alternatives:

from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")

from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")

> To convert an iterator to a dictionary takes one line more, and is
> probably more straightforward than SequenceDict.

That was one thing I wanted to discuss - having a SequenceDict and
SequenceList class would let us add doc strings and perhaps methods like
maxlength, minlength, totallength, ...

Or, I can just use simple list and dict objects in the functions
File2SequenceList and File2SequenceDict.

I have no strong preference on this issue - so unless someone else
speaks up, I'll go back to simple lists and dictionaries - keeps things
simple.

Peter

From sdavis2 at mail.nih.gov Sat Oct 28 16:47:03 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sat, 28 Oct 2006 12:47:03 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <4542ED32.8060702@c2b2.columbia.edu>
References: <200610261056.25883.sdavis2@mail.nih.gov> <4540F7F4.2050003@c2b2.columbia.edu> <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov> <4542ED32.8060702@c2b2.columbia.edu>
Message-ID: <45438987.1070403@mail.nih.gov>

Michiel de Hoon wrote:
> OK, that's fine then.
>
> Is anybody actually using the current Bio/UniGene stuff? I couldn't
> find documentation for it and it hasn't been updated in more than two
> years, so it may be some dead code sitting around.
If so, we can > remove this code; Bio/UniGene would be a nice place to put Sean's code > (even though it is doing something different from the current > Bio/UniGene). I haven't looked into it much, but for dynamic queries of individual Unigene entries, it seems that Eutils might be the better way to go, anyway. Sean From mdehoon at c2b2.columbia.edu Sun Oct 29 06:09:14 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 29 Oct 2006 01:09:14 -0500 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <45434611.1040708@maubp.freeserve.co.uk> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> Message-ID: <4544458A.5000102@c2b2.columbia.edu> Well let's first decide which functions we want in Bio.SeqIO, and then decide how to name them. I'm fine with the idea of having a function that can guess the file format from the extension. I also agree that a parser that can guess the file format from the file contents is not needed at this point. > That was one thing I wanted to discuss - having a SequenceDict and > SequenceList class would let us add doc strings and perhaps methods > like maxlength, minlength, totallength, ... > > Or, I can just use simple list and dict objects in the functions > File2SequenceList and File2SequenceDict. > > I have no strong preference on this issue - so unless someone else > speaks up, I'll go back to simple lists and dictionaries - keeps > things simple. If we go back to simple lists and dictionaries, do we still need the functions File2SequenceList and File2SequenceDict? I'd like to avoid software bloat as much as possible, so if we don't need these two functions, so much the better. About file handles: > The File2SequenceIterator() function (and friends) can take a > filename, handle, or a string containing the contents of a file (in > addition to the format). However, these are done as three separate > arguments. 
> > I could have one argument that takes a file name or handle, and works > it out on its own. Bio.Nexus tries to do this for example. Having > the individual iterators also do this trick would be pretty simple > (using a shared utility function). > > The "contents of a file" string argument was handy when testing, but I > imagine this is not going to be a common situation. If people need > this, they can use python's StringIO module to turn their data string > into a handle easily enough. I like the idea of one argument that takes a file name or handle. I believe that that is how other Biopython functions work. --Michiel.
From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 11:25:35 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Oct 2006 11:25:35 +0000 Subject: [Biopython-dev] New Bio.SeqIO code In-Reply-To: <4544458A.5000102@c2b2.columbia.edu> References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> Message-ID: <45448FAF.1090104@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Well let's first decide which functions we want in Bio.SeqIO, and then > decide how to name them. Agreed. One point against names like File2SequenceIterator is the pun on two versus to (i.e.
convert) will not be so obvious to non-native English speakers.

> > That was one thing I wanted to discuss - having a SequenceDict and
> > SequenceList class would let us add doc strings and perhaps methods
> > like maxlength, minlength, totallength, ...
> >
> > Or, I can just use simple list and dict objects in the functions
> > File2SequenceList and File2SequenceDict.
> >
> > I have no strong preference on this issue - so unless someone else
> > speaks up, I'll go back to simple lists and dictionaries - keeps
> > things simple.
>
> If we go back to simple lists and dictionaries, do we still need the
> functions File2SequenceList and File2SequenceDict? I'd like to avoid
> software bloat as much as possible, so if we don't need these two
> functions, so much the better.

I think there is some benefit to having File2SequenceDict included as
converting from a SeqRecord iterator to a dictionary of SeqRecords isn't
completely trivial. There are at least two important questions: What to
use as the dictionary key (e.g. record.id) and how to deal with duplicate
keys (e.g. use first/last record with that id, or simply abort).

Consider this line of code as an alternative to File2SequenceDict:

iterator = File2SequenceList(...)
d = dict([record.id, record] for record in iterator)

I don't think it's very readable, or intuitive (and could scare
beginners). Part of my aim with Bio.SeqIO was to make the interface
simple.

More importantly, if there are records with duplicate ids then with this
code the resulting dictionary will have only the last record. Personally
I would want duplicate keys to cause an exception.

Rewriting File2SequenceDict() to use a simple dict would give something
like this, where record2key is an optional user supplied function.

def File2SequenceDict(..., record2key=None) :
    iterator = File2SequenceIterator(...)
    if record2key is None :
        record2key = lambda record : record.id
    answer = dict()
    for record in iterator :
        key = record2key(record)
        assert key not in answer, "Duplicate key"
        answer[key] = record
    return answer

The record2key function is perhaps not needed - I was trying to make the
function flexible. The duplicate key behaviour could also be an option.

The other function, File2SequenceList isn't really needed if we are
using simple lists. It's basically a wrapper for
list(File2SequenceIterator(...)) or some other one liner.

The main reason I invented File2SequenceList() was for completeness -
given I already had File2SequenceDict() and File2SequenceIterator()

> About file handles:
>
> > The File2SequenceIterator() function (and friends) can take a
> > filename, handle, or a string containing the contents of a file (in
> > addition to the format). However, these are done as three separate
> > arguments.
> >
> > I could have one argument that takes a file name or handle, and works
> > it out on its own. Bio.Nexus tries to do this for example. Having
> > the individual iterators also do this trick would be pretty simple
> > (using a shared utility function).
> >
> > The "contents of a file" string argument was handy when testing, but I
> > imagine this is not going to be a common situation. If people need
> > this, they can use python's StringIO module to turn their data string
> > into a handle easily enough.
>
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.

OK then - I'll do that.
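The "one argument that takes a file name or handle" idea agreed above
could look roughly like the following in present-day Python 3. This is a
sketch under assumptions, not the shared utility function that ended up
in Biopython: as_handle and count_lines are hypothetical names chosen for
the illustration.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def as_handle(source, mode="r"):
    """Yield a handle for either a file name or an already open handle."""
    if isinstance(source, (str, bytes, os.PathLike)):
        # We opened the file, so we are responsible for closing it.
        with open(source, mode) as handle:
            yield handle
    else:
        # Assume a file-like object; the caller keeps ownership of it.
        yield source


def count_lines(source):
    """Toy consumer that accepts a file name or a handle."""
    with as_handle(source) as handle:
        return sum(1 for _ in handle)


# A string of file contents can be wrapped in a handle with io.StringIO.
text_handle = io.StringIO(">a\nACGT\n")
lines_from_handle = count_lines(text_handle)
```

A handle passed in by the caller is left open afterwards, whereas a file
opened from a name is closed when the with block ends.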
Peter

From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 00:13:57 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 00:13:57 +0000
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
Message-ID: <454543C5.1080209@maubp.freeserve.co.uk>

Hello all,

I've been looking at writing multiple sequence alignments in Nexus
format for the new Bio.SeqIO code, and came up with the following little
problem:

Given one or more Seq objects, how can I reliably decide if they are
protein, DNA, or RNA?

(These are the relevant choices in a Nexus file's format datatype=...
header.)

I'm resigned to the fact that if the Seq object has the generic alphabet
this boils down to looking at the sequence strings and making an
educated guess (probably following an established algorithm from an
alignment program). Does any such code already exist in BioPython?

However - is there a nice/official way to ask an alphabet object what it
is (protein, DNA, RNA)?

Looking over the code in Bio.Alphabet the only thing I can think of is
to get the class name as a string and search it(!) We can't look at the
letters property as this is None for the base classes like ProteinAlphabet.

If we are prepared to meddle with the alphabet system we might add
attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
base classes. Or simply have a "sequence_type" method, which the
subclasses can re-define as required.

(I wasn't meaning to reopen the whole "do we need alphabets"
conversation last discussed in July 2006. At least, not yet...)
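
To make the "sequence_type" method idea concrete, here is a rough sketch
using simplified stand-in classes. These classes only borrow names from
Bio.Alphabet for readability; they are not the real Biopython classes, and
the sequence_type method and the nexus_datatype helper are hypothetical.

```python
class Alphabet:
    """Stand-in for a generic alphabet: the sequence type is unknown."""

    def sequence_type(self):
        return None


class ProteinAlphabet(Alphabet):
    def sequence_type(self):
        return "protein"


class NucleotideAlphabet(Alphabet):
    """Nucleotide, but DNA versus RNA is not known at this level."""

    def sequence_type(self):
        return "nucleotide"


class DNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "dna"


class RNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "rna"


class UnambiguousDNA(DNAAlphabet):
    """Concrete subclass: inherits sequence_type() from DNAAlphabet."""

    letters = "GATC"


def nexus_datatype(alphabet):
    """Return 'protein', 'dna' or 'rna', or None if a guess is needed."""
    kind = alphabet.sequence_type()
    return kind if kind in ("protein", "dna", "rna") else None


print(nexus_datatype(UnambiguousDNA()))
print(nexus_datatype(Alphabet()))
```

With this layout the base classes answer the question even though their
letters property is None, and a None result signals that the caller has to
fall back on inspecting the sequence strings.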
Peter

From fkauff at duke.edu Mon Oct 30 00:48:39 2006
From: fkauff at duke.edu (Frank)
Date: Sun, 29 Oct 2006 19:48:39 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <1162169319.12941.5.camel@cpe-071-077-002-012.nc.res.rr.com>

Hi all,

On Mon, 2006-10-30 at 00:13 +0000, Peter (BioPython Dev) wrote:
> Hello all,
>
> I've been looking at writing multiple sequence alignments in Nexus
> format for the new Bio.SeqIO code, and came up with the following little
> problem:
>
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?
>

I'm not aware of any such code - however, an educated guess would be
easy (more or less ACGTNX only, ACGUNX only, everything else...?). With
NEXUS it becomes tricky, as a dataset could potentially be partitioned
into a mix of all types. And there is no "official" way to indicate this
in the datatype= option.

Frank

> However - is there a nice/official way to ask an alphabet object what it
> is (protein, DNA, RNA)?
>
> Looking over the code in Bio.Alphabet the only thing I can think of is
> to get the class name as a string and search it(!) We can't look at the
> letters property as this is None for the base classes like ProteinAlphabet.
>
> If we are prepared to meddle with the alphabet system we might add
> attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
> base classes. Or simply have a "sequence_type" method, which the
> subclasses can re-define as required.
>
> (I wasn't meaning to reopen the whole "do we need alphabets"
> conversation last discussed in July 2006. At least, not yet...)
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From mdehoon at c2b2.columbia.edu Mon Oct 30 03:20:48 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:20:48 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <45456F90.1090005@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?

Something similar exists in Bio.Seq in the complement, reverse_complement
methods of Seq objects, but it only distinguishes between DNA and RNA.
I don't know of any official way to do that in Biopython.

--Michiel.
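
As a rough illustration of the educated guess suggested in this thread
(ACGTNX only, ACGUNX only, everything else), a sketch in Python 3 might
look like this. The function name guess_sequence_type and the choice of
gap characters to ignore are assumptions made for the example; this is
not existing Biopython code, and it does not address the mixed-partition
NEXUS case.

```python
DNA_LETTERS = frozenset("ACGTNX")
RNA_LETTERS = frozenset("ACGUNX")
GAP_CHARACTERS = "-?."  # assumption: alignment gap/missing symbols to skip


def guess_sequence_type(sequences):
    """Guess 'dna', 'rna' or 'protein' from the letters actually used.

    Heuristic only: a short protein made up solely of A, C, G, T, N or X
    would be misreported as DNA.
    """
    letters = set()
    for sequence in sequences:
        letters.update(str(sequence).upper())
    letters.difference_update(GAP_CHARACTERS)
    if not letters:
        raise ValueError("no sequence letters to inspect")
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    # Anything else, including a mix of T and U, is reported as protein.
    return "protein"


print(guess_sequence_type(["ACGT-N", "acgtta"]))
print(guess_sequence_type(["ACGUUN"]))
print(guess_sequence_type(["MKVLAT"]))
```

Pooling the letters from all sequences before deciding means one protein
sequence in the set is enough to classify the whole alignment as protein.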

From mdehoon at c2b2.columbia.edu Mon Oct 30 03:42:44 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:42:44 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45448FAF.1090104@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk>
Message-ID: <454574B4.3050407@c2b2.columbia.edu>

Peter wrote:
> There are at least two important questions: What to use as the
> dictionary key (e.g. record.id) and how to deal with duplicate keys
> (e.g. use first/last record with that id, or simply abort).
>
> Rewriting File2SequenceDict() to use a simple dict would give something
> like this, where record2key is an optional user supplied function.
>
> def File2SequenceDict(..., record2key=None) :
>     iterator = File2SequenceIterator(...)
>     if record2key is None : record2key = lambda record : record.id
>     answer = dict()
>     for record in iterator :
>         key = record2key(record)
>         assert key not in answer, "Duplicate key"
>         answer[key] = record
>     return answer
>
> The record2key function is perhaps not needed - I was trying to make the
> function flexible. The duplicate key behaviour could also be an option.
>

I am using File2SequenceIterator in one of my scripts (thanks by the way
for that, my script is a lot faster now. I didn't do a rigorous timing,
but it's about a zillion times faster), and convert the iterator to a
dictionary using plain Python. If I were to use File2SequenceDict
instead, I would need the record2key argument, because in my application
I want only part of record.id as the key.

In the File2SequenceDict above, answer[key] contains the complete
record. Some people will want that. However, in my application I only
want to store the record.seq part in answer[key]. Somebody else may want
str(record.seq). So we'd also need a record2value argument.
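
A minimal sketch of what record2key and record2value arguments might look
like together, in Python 3. The records_to_dict function and the Record
stand-in type are hypothetical and exist only for this example (real
SeqRecord objects are not used), and raising ValueError instead of using
assert for duplicate keys is an assumption.

```python
from collections import namedtuple

# Stand-in for a SeqRecord: only the id and seq attributes are modelled.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None, record2value=None):
    """Build a dict from an iterable of records.

    record2key defaults to record.id and record2value to the whole record.
    A repeated key raises ValueError rather than silently overwriting.
    """
    answer = {}
    for record in records:
        key = record.id if record2key is None else record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record if record2value is None else record2value(record)
    return answer


records = [Record("alpha.1", "ACGT"), Record("beta.2", "TTGA")]

# Use only part of the id as the key, and store just the sequence.
by_name = records_to_dict(
    records,
    record2key=lambda record: record.id.split(".")[0],
    record2value=lambda record: record.seq,
)
print(by_name)
```

The plain Python equivalent for this particular case is a single dict
built from a generator expression, which is the trade-off being weighed in
the thread.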

For duplicate keys, there are at least four possibilities (raise an
exception, store only one of the keys, store neither of the keys and
don't raise an exception, store both after modifying one of the keys).
So this should also be an option.

You'll end up with a File2SequenceDict function that is more complicated
than the plain Python solution.

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 10:54:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 10:54:41 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454574B4.3050407@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk> <454574B4.3050407@c2b2.columbia.edu>
Message-ID: <4545D9F1.2040902@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...

I've updated the new code in Bio.SeqIO to remove SequenceDict and
SequenceList and use the standard dictionary and list instead.

Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way
> for that, my script is a lot faster now. I didn't do a rigorous timing,
> but it's about a zillion times faster), and convert the iterator to a
> dictionary using plain Python. If I were to use File2SequenceDict
> instead, I would need the record2key argument, because in my application
> I want only part of record.id as the key.

With such a speed up, I'd guess you were using Bio.Fasta before. I've
noticed the same thing.

Are you dealing with NCBI style fasta identifiers made up of several
fields separated by "|" characters?

> In the File2SequenceDict above, answer[key] contains the complete
> record. Some people will want that. However, in my application I only
> want to store the record.seq part in answer[key]. Somebody else may want
> str(record.seq). So we'd also need a record2value argument.

It does slightly undermine the "you only get SeqRecord objects"
principle. On the other hand, it's a simple addition that is easy to
explain and implement. I'm happy to add this.

> For duplicate keys, there are at least four possibilities (raise an
> exception, store only one of the keys, store neither of the keys and
> don't raise an exception, store both after modifying one of the keys).
> So this should also be an option.

Supporting all these options with an easy to understand interface looks
too hard.

In my opinion if someone is trying to build a dictionary using repeated
keys they have made a mistake (either in their datafile, or their
record2key function) - so raising an exception is reasonable default
behaviour (and is easy to code).

Apart from the "exception" option, which of these actions do you
generally find most appropriate?

> You'll end up with a File2SequenceDict function that is more complicated
> than the plain Python solution.

Yes. Trying to do everything would be bad - both complicated to
implement, and probably complicated to use as well.

Peter

From mdehoon at c2b2.columbia.edu Mon Oct 30 22:02:34 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 30 Oct 2006 17:02:34 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4546767A.70302@c2b2.columbia.edu>

Peter (BioPython Dev) wrote:
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)

Looks good to me. Thanks!

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032