From mdehoon at c2b2.columbia.edu Sat Jul 1 17:47:28 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 01 Jul 2006 17:47:28 -0400
Subject: [Biopython-dev] Fasta parser
Message-ID: <44A6ED70.9080204@c2b2.columbia.edu>

Hi everybody,

The Biopython shows the following approach to parsing a Fasta file:

>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> file = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(file, parser)
>>> cur_record = iterator.next()

But for large Fasta files, it's very slow, compared to file.read(),
which may be due to going through Martel (I believe the same was true
for large GenBank files).

So I'm thinking about writing a simple-minded Fasta parser for better
performance with large files. What I'm wondering about:
1) Is there some advantage that I overlooked of using Martel for parsing
Fasta files?
2) Why is it necessary to create a parser first and passing it to
Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
other than a Fasta.RecordParser?

--Michiel.

From idoerg at burnham.org Sat Jul 1 18:52:43 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 15:52:43 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>

Michiel,

There is actually a simple minded fasta reader/writer that does not use
Martel. Bio.SeqIO.FASTA

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From mdehoon at c2b2.columbia.edu Sun Jul 2 00:43:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 00:43:47 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
References: <44A6ED70.9080204@c2b2.columbia.edu>
 <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
Message-ID: <44A74F03.8020801@c2b2.columbia.edu>

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than
the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules.
However, it raises a bunch
of design questions (such as Fasta.Record versus SeqRecord, and Seq
versus string), so it's probably better to wait with that until after
the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.

From idoerg at burnham.org Sun Jul 2 00:48:50 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 21:48:50 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu>
 <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
 <44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A6@MAIL.burnham.org>

By (lack of?) design, my own biopython using code seems to be using both
the martel and non-Martel parsers. I imagine others may have the same.
Point being: any design change should make sure that we are back
compatible.

Thanks very much for your work on the Biopython release.

Cheers,

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From mdehoon at c2b2.columbia.edu Sun Jul 2 10:58:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 10:58:35 -0400
Subject: [Biopython-dev] New Biopython release coming up
Message-ID: <44A7DF1B.1000008@c2b2.columbia.edu>

Hi everybody,

The next Biopython release (1.42, code-named "Brooklyn") is coming up.
I'm planning to finish this release about two weeks from now. The tests
of Biopython in CVS all pass, so we are doing well. However, there are
25 bugs listed in Bugzilla, so please have a look to see if there's
something we can do about them.
If you have some code sitting around,
now would be a good time to commit it to CVS. However, if you are not
sure if your code is ready for prime time, please hold off until after
this release. Also, if you have a cvs checkout of Biopython, please make
sure to update it before doing any commits to avoid overwriting.

Thanks everybody for your contributions to Biopython.

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Sun Jul 2 14:11:47 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 02 Jul 2006 19:11:47 +0100
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A7DF1B.1000008@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
Message-ID: <44A80C63.7060809@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Hi everybody,
>
> The next Biopython release (1.42, code-named "Brooklyn") is coming up.
> I'm planning to finish this release about two weeks from now. The tests
> of Biopython in CVS all pass, so we are doing well. However, there are
> 25 bugs listed in Bugzilla, so please have a look to see if there's
> something we can do about them. If you have some code sitting around,
> now would be a good time to commit it to CVS. However, if you are not
> sure if your code is ready for prime time, please hold off until after
> this release. Also, if you have a cvs checkout of Biopython, please make
> sure to update it before doing any commits to avoid overwriting.
>
> Thanks everybody for your contributions to Biopython.
>
> --Michiel.

Sounds like a good plan, Michiel.

Did anyone get back to you about the NCBI Blast XML format? I would say
parsing blast output is a fairly important feature to a lot of users (I
may of course be biased)...

Getting down to specifics:

Bugzilla Bug 1997 VARCHAR too small in SCOP tables
http://bugzilla.open-bio.org/show_bug.cgi?id=1997
Suggested fix looked OK to me, but as I've never used SCOP a second
opinion would be wise.

Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
http://bugzilla.open-bio.org/show_bug.cgi?id=1987
I have attached a suggested patch, second opinion welcome

Bugzilla Bug 1981 GenBank parser generates unusual feature qualifiers.
http://bugzilla.open-bio.org/show_bug.cgi?id=1981
A question about the white space in GenBank comments etc. Changing this
is probably harmless, but we are already making a big change internally
with the move away from Martel, so I would rather postpone any further
change until after the next release.

Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in
error
http://bugzilla.open-bio.org/show_bug.cgi?id=1936
One for Thomas Hamelryck which on the face of it looks fairly simple.

Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
Does anyone use the new project line? Would a simple string be enough
to store this?

Peter

From mcolosimo at mitre.org Sun Jul 2 14:36:22 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 14:36:22 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID:

On 7/1/06 5:47 PM, "Michiel de Hoon" wrote:

> Hi everybody,
>
> The Biopython shows the following approach to parsing a Fasta file:
>
>>>> from Bio import Fasta
>>>> parser = Fasta.RecordParser()
>>>> file = open("ls_orchid.fasta")
>>>> iterator = Fasta.Iterator(file, parser)
>>>> cur_record = iterator.next()
>
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
>
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator?
> Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?

Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
(Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
remap into a SeqRecord.

Also, could someone re-run epydoc! My changes in the code have not made it
to the on-line API docs.

Marc

From mcolosimo at mitre.org Sun Jul 2 15:12:23 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 15:12:23 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A74F03.8020801@c2b2.columbia.edu>
Message-ID:

Michiel,

When will this next release be made and what is going into it?

Since you brought up the issue of design questions, I'll have my little
rant now. But first, I would like to say that I think it is great that
people contribute code and more importantly their time to this project.
Without all of the core developers there would be no BioPython. So,
kudos to anyone who has contributed code.

Now on to my rant....

I'm not a big user of either BioPerl or BioJava. However, they are well
structured and more consistent than BioPython. This FastaIO issue is one
of several design issues that really need to be addressed. For example,
both BioPerl and BioJava use a SeqIO object structure. Our SeqIO module
is heavily underused. For example, we have Fasta, GenBank, LocusLink,
NBRF, SwissProt, UniGene main Modules. Interestingly, there is a
writers.SeqRecord.embl but I can't quickly find something to read in an
embl file! Just look at what BioPerl can read in and how easy it is to
find this out (even without the doc page, all of these are listed under
Bio::SeqIO::*).

There is a very short "Coding Convention", which doesn't seem to be
followed all that well.

My suggestion is if enough people are going to ISMB this year (which I
am not), that time should be made to think about a road map for
BioPython.

My suggestions are:
1) split off a branch for ver 2.0 that supports Python 2.4 only
(this would suck for Mac people, like me, but it's time to move on)
2) clean house - remove deprecated items, restructure IO, etc...
3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/convertcode.py")
4) use Cheese Shop for missing modules
5) documentation

marc

From mdehoon at c2b2.columbia.edu Sun Jul 2 16:54:27 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 16:54:27 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To:
References:
Message-ID: <44A83283.4060401@c2b2.columbia.edu>

>> 2) Why is it necessary to create a parser first and passing it to
>> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
>> other than a Fasta.RecordParser?
>
> Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
> (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
> remap into a SeqRecord.

I see. This is one of the design issues I ran into when comparing
Bio.Fasta and Bio.SeqIO.FASTA: Whether parsing a Fasta file should
result in a Fasta.Record object or a SeqRecord.

> Also, could someone re-run epydoc! My changes in the code have not made it
> to the on-line API docs.

Done.

--Michiel.

From mdehoon at c2b2.columbia.edu Sun Jul 2 17:19:46 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 17:19:46 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To:
References:
Message-ID: <44A83872.4070209@c2b2.columbia.edu>

Colosimo, Marc E.
wrote:
> When will this next release be made ...

I'm planning for the weekend of 15/16 July.

> ... and what is going into it?

Whatever is in CVS at that time. So essentially today's CVS plus as many
bug fixes as possible. I'd hold off on any major changes until after the
release.

>

I pretty much agree with Marc here.

> My suggestion is if enough people are going to ISMB this year
> (which I am not), that time should be made to think about a
> road map for BioPython.

Unfortunately, I won't be going either. A Biopython road map seems like
a good idea though.

> My suggestions are:
> 1) split off a branch for ver 2.0 that supports Python 2.4 only
> (this would suck for Mac people, like me, but it's time to move on)

Is there something essential in 2.4 that's missing in 2.3? Not that I
object to supporting 2.4 only, I'm just wondering. Though I'd be
hesitant to split off a separate branch, since Biopython is confusing
enough already as it is.

Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no problem
for Mac users to support 2.4 only.

> 2) clean house - remove deprecated items, restructure IO, etc...

I totally agree.

> 3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/convertcode.py")

Here, I'm a bit hesitant. SciPy does not have a good track record in
terms of portability. The latest version of numpy looks better though
(it compiled without problems on all platforms I tried). But I don't
really want to pay $40 for the documentation.

> 4) use Cheese Shop for missing modules
> 5) documentation

My guess is that maintaining the documentation will be easier once we
have cleaned up Biopython.

--Michiel.
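
As a point of reference for the Fasta discussion in this thread, a
"simple-minded" parser of the kind Michiel describes might look like the
sketch below. It is not the code in Bio.Fasta or Bio.SeqIO.FASTA: the
function name iter_fasta and the choice to yield plain (title, sequence)
string pairs are assumptions made for illustration, and it is written
for current Python 3 rather than the Python 2.3/2.4 of the thread.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) pairs of plain strings from a FASTA handle.

    Hypothetical helper for illustration; not part of Biopython.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence lines may be wrapped; strip and collect them.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# Small in-memory example instead of a real file such as ls_orchid.fasta.
example = io.StringIO(">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(iter_fasta(example))
```

Returning plain strings sidesteps the Fasta.Record versus SeqRecord
question raised above; a wrapper could build either object from each
(title, sequence) pair.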

From mdehoon at c2b2.columbia.edu Sun Jul 2 21:21:00 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 21:21:00 -0400
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <44A870FC.4060909@c2b2.columbia.edu>

Peter wrote:
> Did anyone get back to you about the NCBI Blast XML format? I would say
> parsing blast output is a fairly important feature to a lot of users (I
> may of course be biased)...

No response yet, but I'll ask them again before the upcoming release.
The existing XML parser still works as advertised for single blast
searches. For multiple blast searches, people will have to run a
previous version of blast locally.

> Bugzilla Bug 1997 VARCHAR too small in SCOP tables
> http://bugzilla.open-bio.org/show_bug.cgi?id=1997
> Suggested fix looked OK to me, but as I've never used SCOP a second
> opinion would be wise.

This one looks fine to me, but I'm not a SCOP user either.

> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
> I have attached a suggested patch, second opinion welcome

Whereas the patch looks fine, I have no idea what this code is supposed
to do, or why it needs to be so complicated.

> Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
> Does anyone use the new project line? Would a simple string be enough
> to store this?

>From NCBI's description, it appears they're not quite sure yet what this
project line should look like (note that the project line in the
description is different from the project line in the GenBank file:
GenomeProject vs. GENOME_PROJECT). I would just store the line in a
simple string, and do something more fancy once we know the proper
format.

My 2¢.

--Michiel.
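
The suggestion to keep the PROJECT line as a simple string could be
sketched roughly as follows. The helper name parse_project_line and the
sample line are hypothetical; neither is taken from the Biopython
GenBank parser or from a real GenBank record.

```python
def parse_project_line(line):
    """Return the text after the PROJECT keyword as a plain string.

    Hypothetical helper for illustration; not the Biopython parser.
    """
    keyword, _, value = line.partition(" ")
    if keyword != "PROJECT":
        raise ValueError("not a PROJECT line: %r" % line)
    # Keep the value unparsed until the format is settled.
    return value.strip()


sample = "PROJECT     GenomeProject:12345"  # made-up example line
project = parse_project_line(sample)
```

Storing the raw value defers any decision about its internal structure,
which matches the "do something more fancy once we know the proper
format" approach described above.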

From idoerg at burnham.org Mon Jul 3 13:52:44 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Mon, 03 Jul 2006 10:52:44 -0700
Subject: [Biopython-dev] [Fwd: [OBF] Call For Birds of a Feather Suggestions]
Message-ID: <44A9596C.90208@burnham.org>

The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOF's have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development.

If you would like to create a BOF, just sign up for a wiki account,
login, and edit the BOSC 2006 Birds of a Feather page.

_______________________________________________
Open-Bioinformatics-Foundation mailing list
Open-Bioinformatics-Foundation at lists.open-bio.org

This is a broadcast-only announce list used to distribute emails to
people who subscribe to OBF hosted email discussion or announce lists.
To prevent our most active members from getting many duplicate copies of
important announcements we created this list today so that only one
email gets sent to each subscribed email address.

You do not need to subscribe/unsubscribe from this list.

Problems or Concerns? -- send an email to the OBF mailteam at:
mailteam at open-bio.org

--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org

From biopython-dev at maubp.freeserve.co.uk Thu Jul 6 05:06:07 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 06 Jul 2006 10:06:07 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44A870FC.4060909@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
 <44A870FC.4060909@c2b2.columbia.edu>
Message-ID: <44ACD27F.90906@maubp.freeserve.co.uk>

Peter wrote:
>> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
>> I have attached a suggested patch, second opinion welcome

Michiel de Hoon wrote:
> Whereas the patch looks fine, I have no idea what this code is supposed
> to do, or why it needs to be so complicated.

I'm not the person to ask.

The whole Alphabet is something that confused me a little when first
using BioPython. I see why a special class for sequences is a nice
idea, and that handling the different variants of RNA, DNA and proteins
is a good idea.

But to be honest, I have generally used plain strings in my own
programs, and meddled with alphabets only when needed (e.g. for
translating from DNA to protein sequences).

Peter

From hoffman at ebi.ac.uk Thu Jul 6 06:36:53 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu, 6 Jul 2006 11:36:53 +0100 (BST)
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
 <44A870FC.4060909@c2b2.columbia.edu>
 <44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID:

[Peter]
> The whole Alphabet is something that confused me a little when first
> using BioPython.
> I see why a special class for sequences is a nice
> idea, and that handling the different variants of RNA, DNA and proteins
> is a good idea.
>
> But to be honest, I have generally used plain strings in my own
> programs, and meddled with alphabets only when needed (e.g. for
> translating from DNA to protein sequences).

I agree. In general, I think that the alphabet stuff adds unnecessary
complexity to perhaps 95 % of the sort of things I would do with
Biopython. But as it stands I usually use strs myself instead.
--
Michael Hoffman
European Bioinformatics Institute

From Leighton.Pritchard at scri.ac.uk Thu Jul 6 06:34:46 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Thu, 6 Jul 2006 11:34:46 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
 <44A870FC.4060909@c2b2.columbia.edu>
 <44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <1152182087.4828.96.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment-0002.pl
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard"
Subject: Re: [Biopython-dev] New Biopython release coming up / Alphabets
Date: Thu, 6 Jul 2006 11:34:46 +0100
Size: 4250
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment-0002.mht

From mdehoon at c2b2.columbia.edu Thu Jul 6 12:39:09 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 06 Jul 2006 12:39:09 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To:
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
 <44A870FC.4060909@c2b2.columbia.edu>
 <44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <44AD3CAD.8030504@c2b2.columbia.edu>

Michael Hoffman wrote:
> [Peter]
>> But to be honest, I have generally used plain strings in my own
>> programs, and meddled with alphabets only when needed (e.g. for
>> translating from DNA to protein sequences).

Note that there is a function "translate" in Bio.Seq that translates DNA
to protein using plain strings.
>
> I agree. In general, I think that the alphabet stuff adds unnecessary
> complexity to perhaps 95 % of the sort of things I would do with
> Biopython. But as it stands I usually use strs myself instead.

It appears that most people (myself included) use plain strings instead
of Seq objects (= string + Alphabet). We should check on the biopython
mailing list if anybody really needs alphabets, and if not get rid of
them (after the upcoming Brooklyn-release (1.42) though).

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From fkauff at duke.edu Thu Jul 6 13:53:23 2006
From: fkauff at duke.edu (Frank Kauff)
Date: Thu, 06 Jul 2006 13:53:23 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
 <44A870FC.4060909@c2b2.columbia.edu>
 <44ACD27F.90906@maubp.freeserve.co.uk>
 <44AD3CAD.8030504@c2b2.columbia.edu>
Message-ID: <1152208403.2487.36.camel@osiris.biology.duke.edu>

On Thu, 2006-07-06 at 12:39 -0400, Michiel Jan Laurens de Hoon wrote:
> Michael Hoffman wrote:
> > [Peter]
> >> But to be honest, I have generally used plain strings in my own
> >> programs, and meddled with alphabets only when needed (e.g. for
> >> translating from DNA to protein sequences).
>
> Note that there is a function "translate" in Bio.Seq that translates DNA
> to protein using plain strings.
> >
> > I agree. In general, I think that the alphabet stuff adds unnecessary
> > complexity to perhaps 95 % of the sort of things I would do with
> > Biopython. But as it stands I usually use strs myself instead.
>
> It appears that most people (myself included) use plain strings instead
> of Seq objects (= string + Alphabet).
> We should check on the biopython
> mailing list if anybody really needs alphabets, and if not get rid of
> them (after the upcoming Brooklyn-release (1.42) though).
>

I use seq objects and the alphabet stuff in the nexus parser, but I
don't really know why and wouldn't mind at all to get rid of them.

Frank

> --Michiel.
>
>
--
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708 USA
Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net

From thamelry at binf.ku.dk Fri Jul 7 06:44:24 2006
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 7 Jul 2006 12:44:24 +0200 (CEST)
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
 <44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <38979.129.11.36.25.1152269064.squirrel@secure.binf.ku.dk>

Hi,

> Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in
> error
> http://bugzilla.open-bio.org/show_bug.cgi?id=1936
> One for Thomas Hamelryck which on the face of it looks fairly simple.

Won't have time to work on biopython before august I'm afraid (CASP+
articles that need to be finished, etc.). Sorry!
Best regards, -Thomas From mcolosimo at mitre.org Tue Jul 11 12:01:15 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 11 Jul 2006 12:01:15 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> <44AD3CAD.8030504@c2b2.columbia.edu> Message-ID: On Jul 6, 2006, at 12:39 PM, Michiel Jan Laurens de Hoon wrote: > Michael Hoffman wrote: >> [Peter] >>> But to be honest, I have generally used plain strings in my own >>> programs, and meddled with alphabets only when needed (e.g. for >>> translating from DNA to protein sequences). > > Note that there is a function "translate" in Bio.Seq that > translates DNA > to protein using plain strings. >> >> I agree. In general, I think that the alphabet stuff adds unnecessary >> complexity to perhaps 95 % of the sort of things I would do with >> Biopython. But as it stands I usually use strs myself instead. > > It appears that most people (myself included) use plain strings > instead > of Seq objects (= string + Alphabet). We should check on the biopython > mailing list if anybody really needs alphabets, and if not get rid of > them (after the upcoming Brooklyn-release (1.42) though). > > --Michiel. I am strongly arguing against removing the alphabets. You would lose all of the cool features of Seq objects (complement, reverse_complement). There are similar functions under Bio.SeqUtils but those are "Deprecated". From just looking around, I think this would break many things. Having said that, I do find them a pain to deal with, but that might have more to do with the structure/layout of the classes. My simple suggestion is to fix/change the base Alphabet classes in Bio.Alphabet.__init__.
I am trying to think of a way that we can have a "true" GenericAlphabet class (not generic_alphabet = Alphabet() ) and use just strings. The problem is that I don't know if just using letters = None (or letters = []) will cause problems down the road (things like if x in alphabet.letters are used in many classes). Also, I'm really confused as to what is going on in IUPAC.py with the default_manager stuff and _bootstrap. Marc From mcolosimo at mitre.org Tue Jul 11 13:29:52 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 11 Jul 2006 13:29:52 -0400 Subject: [Biopython-dev] BioPython Design In-Reply-To: <44A83872.4070209@c2b2.columbia.edu> References: <44A83872.4070209@c2b2.columbia.edu> Message-ID: <7C24AEA4-68EC-4517-9391-C07512CDD146@mitre.org> On Jul 2, 2006, at 5:19 PM, Michiel de Hoon wrote: > > > My suggestions are: > > 1) split off a branch for ver 2.0 that supports Python 2.4 only > > (this would suck for Mac people, like me, but it's time to move on) > > Is there something essential in 2.4 that's missing in 2.3? Not that > I object against supporting 2.4 only, I'm just wondering. Though > I'd be hesitant to split off a separate branch, since Biopython is > confusing enough already as it is. > > Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no > problem for Mac users to support 2.4 only. There are two off the top of my head: Generator expressions (PEP 289). This could be very useful in cleaning up the old code. Decorators for Functions (PEP 318). I like the idea of using staticmethod and classmethod. The accepts and returns decorators are also interesting. I wish I could find a list of all possible decorators. In any case, some clean up of the code is needed because people have used the string "Decorator" (Alphabet.__init__.py and NeCatch.py) > > > 2) clean house - remove deprecated items, restructure IO, etc... > > I totally agree.
> > > 3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/ > convertcode.py") > > Here, I'm a bit hesitant. SciPy does not have a good track record > in terms of portability. The latest version of numpy looks better > though (it compiled without problems on all platforms I tried). But > I don't really want to pay $40 for the documentation. I saw this, but didn't know it was the only documentation. However, as far as I can tell Numeric is dead is NumPy! Marc From krewink at inb.uni-luebeck.de Tue Jul 11 17:23:14 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Tue, 11 Jul 2006 23:23:14 +0200 Subject: [Biopython-dev] BioPython Design Message-ID: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> On 11.07.2006 at 18:01, Marc Colosimo wrote: > It appears that most people (myself included) use plain strings instead > of Seq objects (= string + Alphabet). We should check on the biopython > mailing list if anybody really needs alphabets, and if not get rid of > them (after the upcoming Brooklyn-release (1.42) though). There are some good points about Seq objects in the discussion last year: http://lists.open-bio.org/pipermail/biopython-dev/2005-April/002074.html Personally, I would prefer to keep Alphabets as a part of Seq, but make it behave more like Python strings, i.e.: str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:] Furthermore, alphabets could be more useful with an __init__ method looking like def __init__(self, data, alphabet, validate=False) This way, sequences could be checked for consistency on demand. To make Alphabets more usable, it would be nice to have some kind of dictionary interface to map different alphabets: e.g.
Alphabet.Alphabets['protein'] == Bio.Alphabet.IUPAC.protein Cheers, Albert -- Albert Krewinkel University of Luebeck phone: +49 (451) 500 5516 email: krewink at inb.uni-luebeck.de From f.schlesinger at iu-bremen.de Wed Jul 12 09:25:43 2006 From: f.schlesinger at iu-bremen.de (Felix Schlesinger) Date: Wed, 12 Jul 2006 15:25:43 +0200 Subject: [Biopython-dev] BioPython Design In-Reply-To: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> References: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> Message-ID: <7317d50c0607120625x7e76008fo961814b280dbad51@mail.gmail.com> > Personally, I would prefer to keep Alphabets as a part of Seq, > but make it behave more like Python strings, i.e.: > str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:] Isn't the whole alphabet thing just type information in the end? (I.e. "This string is of type protein") And if it is, shouldn't we let the Python type system handle it via a class hierarchy? Or use the Python concept of duck typing and assume the string has whatever type is needed at the moment until it fails? Felix Schlesinger From mdehoon at c2b2.columbia.edu Wed Jul 26 13:39:46 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Wed, 26 Jul 2006 13:39:46 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> <44AD3CAD.8030504@c2b2.columbia.edu> Message-ID: <44C7A8E2.2050100@c2b2.columbia.edu> Marc Colosimo wrote: >> [Michiel] >> It appears that most people (myself included) use plain strings instead >> of Seq objects (= string + Alphabet). We should check on the biopython >> mailing list if anybody really needs alphabets, and if not get rid of >> them (after the upcoming Brooklyn-release (1.42) though). > > [Marc] > I am strongly arguing against removing the alphabets.
You would loss > all of the cool features of Seq Objects (complement, > reverse_complement). There are similar functions under Bio.SeqUtils but > those are "Deprecated". From just looking around, I think this would > break many things. There is a function reverse_complement in Bio.Seq that works on plain strings. (If you need the complement instead, you can of course reverse the result). So can you be more specific on which features of Seq objects are actually needed? While I can see the intuitive appeal of having a Seq class, I cannot think of any practical cases where a simple string wouldn't do. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Fri Jul 28 09:50:39 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Fri, 28 Jul 2006 14:50:39 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc Message-ID: <44CA162F.1040604@maubp.freeserve.co.uk> This follows on from the discussion last month started by Marc Colosimo, but I want to focus just on reading in sequence files: http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html There was also a thread back a few years ago where Michael Hoffman was looking at timings for parsing Fasta files. http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html Jeffrey Chang wrote: > That is a nice implementation. However, Biopython already has at least > 3 Fasta parsers! > Bio/Fasta > Bio/SeqIO/FASTA > Bio/expressions/fasta > > Bio/Fasta, the one you compared against, is easily the slowest one. > Bio/SeqIO/FASTA is very similar to your implementation and not likely > to be significantly faster or slower. Bio/expressions/fasta uses > Martel. I don't know how well that will perform. The parsing part > should be blazingly fast (since it is mostly in C), but building the > object will be slow. It might be a wash. 
> > Jeff Clearly we could try and consolidate these (while making things as nice as possible with deprecation warnings etc for existing code). I've had a little read on the BioPerl SeqIO system: http://www.bioperl.org/wiki/HOWTO:SeqIO I agree with Marc that what we have in BioPython could (and should) be more organised. Ideally (in my opinion) BioPython should be able to read sequences from multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...) * using a standard interface * into a standard object * quickly The resulting object should be able to hold additional information like annotation and (sub)features; the Bio.SeqRecord.SeqRecord object seems ideal. It looks like we have: (1) A number of format-specific sequence reading modules (in particular Fasta and GenBank) which can read their particular file format into one or more different object representations. These seem to be the best documented (in my opinion). (2) A fairly generic (but relatively slow) framework in the Bio.FormatIO system which uses Martel expressions internally. I have found Martel frustrating to debug, and especially slow with large individual records (like genomic GenBank files). There is some documentation on this, e.g. http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html (3) The start of a generic "pure python" framework in the Bio.SeqIO system, but it needs some work (e.g. some doc strings, fixing the LargeFastaFormat class, GenBank support). QUESTION: What do you all tend to use? Should I draft a "questionnaire" to be posted on the main discussion list (and the announcements?). Personally, I have been using Bio.Fasta and Bio.GenBank to read sequences. I tend to only output Fasta files, and usually do this "by hand" as they are so simple and I want full control over the description lines.
Peter From biopython-dev at maubp.freeserve.co.uk Fri Jul 28 11:05:21 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Fri, 28 Jul 2006 16:05:21 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> Message-ID: <44CA27B1.30107@maubp.freeserve.co.uk> Jeffrey Chang wrote: > ... However, Biopython already has at least > 3 Fasta parsers! > Bio/Fasta > Bio/SeqIO/FASTA > Bio/expressions/fasta > > Bio/Fasta, the one you compared against, is easily the slowest one. > Bio/SeqIO/FASTA is very similar to your implementation and not likely > to be significantly faster or slower. Bio/expressions/fasta uses > Martel. I don't know how well that will perform. The parsing part > should be blazingly fast (since it is mostly in C), but building the > object will be slow. It might be a wash. The following timings are for iterating over a large fasta file (Escherichia_coli_K12, NC_000913.ffn, with 5254 nucleotide CDS sequences). The test script is attached, the test input is available here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.ffn I used BioPython 1.42 with Python 2.3 on Windows XP on a laptop computer. Apart from Fasta.RecordParser, these all return a SeqRecord object with a generic alphabet: 0.89s SeqIO.FASTA.FastaReader (for record in interator) 0.88s SeqIO.FASTA.FastaReader (iterator.next) 0.88s SeqIO.FASTA.FastaReader (iterator[i]) 5.52s FormatIO/SeqRecord (for record in interator) 5.41s FormatIO/SeqRecord (iterator.next) 6.06s Fasta.RecordParser (for record in interator) 6.10s Fasta.SequenceParser (for record in interator) 6.27s Fasta.SequenceParser (iterator.next) As you can see, SeqIO.FASTA.FastaReader (written in simple python) is about six times faster than both the martel based parsers. I have tried this on a file with 2000 records and see a similar scaling. 
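A timing loop of the kind described above might look like the following sketch. It is not the original test script and does not use any Biopython parser: `iter_fasta` and `time_reader` are hypothetical names, the reader is a stand-in written for illustration, and the generated sample data is made up. It is written for current Python rather than the Python 2.3 used for the figures above.

```python
import io
import time


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA file handle (illustrative reader)."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def time_reader(reader, text, repeats=3):
    """Return (best_seconds, record_count) for iterating reader over text."""
    best, count = None, 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = sum(1 for _ in reader(io.StringIO(text)))
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best, count


# Made-up sample: 1000 short records, each split over two sequence lines.
sample = "".join(">seq%d demo\nACGTACGTAC\nGGTTAACC\n" % i for i in range(1000))
seconds, n = time_reader(iter_fasta, sample)
print("%.4fs for %d records" % (seconds, n))
```

Taking the best of a few repeats reduces noise from disk caching and other processes; any other reader that accepts a file handle and yields records could be passed to `time_reader` in the same way.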
Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: test_fasta_methods.py Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060728/93dbbbb7/attachment.pl From mdehoon at c2b2.columbia.edu Sun Jul 30 21:20:50 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 30 Jul 2006 21:20:50 -0400 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> Message-ID: <44CD5AF2.10708@c2b2.columbia.edu> Thanks Peter. Peter (BioPython Dev) wrote: > QUESTION: What do you all tend to use? I use the stuff in Bio.Fasta, but actually just because it's in the documentation. From your timings, and also because I'm not smart enough to be able to understand Martel, let alone maintain Martel-based parsers, I'm pretty much in favor of Bio.SeqIO. > Should I draft a "questionnaire" > to be posted on the main discussion list (and the announcements?). By all means, yes. In the questionnaire, be sure to separate the issue of parser internals (Martel vs. pure Python) from the issue of how the results should be formatted (Fasta.Record or SeqRecord). --Michiel From lpritc at scri.sari.ac.uk Mon Jul 31 05:59:47 2006 From: lpritc at scri.sari.ac.uk (Leighton Pritchard) Date: Mon, 31 Jul 2006 10:59:47 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA27B1.30107@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> Message-ID: <1154339988.1490.81.camel@lplinuxdev> Hi all, On Fri, 2006-07-28 at 16:05 +0100, Peter (BioPython Dev) wrote: > Jeffrey Chang wrote: > > ... However, Biopython already has at least > > 3 Fasta parsers! > > Bio/Fasta > > Bio/SeqIO/FASTA > > Bio/expressions/fasta > > > > Bio/Fasta, the one you compared against, is easily the slowest one. 
> > Bio/SeqIO/FASTA is very similar to your implementation and not likely > > to be significantly faster or slower. Bio/expressions/fasta uses > > Martel. I don't know how well that will perform. The parsing part > > should be blazingly fast (since it is mostly in C), but building the > > object will be slow. It might be a wash. Just to add to the confusion, when parsing large FASTA sequence files, I have been using a home-rolled Flex/Pyrex parser (if you'd like a copy, drop me a line). I've used Peter's test framework on the same input file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora Core 3 (up-to-date, eh? ;) ) to get the following typical results: 4.07s FormatIO/SeqRecord (for record in interator) 4.05s FormatIO/SeqRecord (iterator.next) 0.32s SeqIO.FASTA.FastaReader (for record in interator) 0.30s SeqIO.FASTA.FastaReader (iterator.next) 0.31s SeqIO.FASTA.FastaReader (iterator[i]) 5.53s Fasta.RecordParser (for record in interator) 5.00s Fasta.SequenceParser (for record in interator) 4.80s Fasta.SequenceParser (iterator.next) 0.18s SeqUtils/quick_FASTA_reader 0.11s pyfastaseqlexer/next_record 0.09s pyfastaseqlexer/quick_FASTA_reader 0.19s SeqUtils/quick_FASTA_reader (conversion to Seq) 0.14s pyfastaseqlexer/next_record (conversion to Seq) 0.11s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq) 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord) 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) pyfastaseqlexer is my Flex/Pyrex combination, which has a number of methods for reading in FASTA sequences. Here I've used the two that correspond to the Bio.SeqUtils.quick_FASTA_reader method (overlooked in the original list, but also included here for comparison), and Peter's iterator method for his tests. 
Since these extra methods don't return Bio.Seq or Bio.SeqRecord objects, but instead lists of (name, sequence) tuples, I've also included test functions that carry out the conversion in Python, and their timings. It's probably not a surprise that a dedicated Flex-based parser shows such a dramatic speed improvement over the Martel-based parsers. The improvement over SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader is only marginal, though (a factor of approximately two when conversion to SeqRecord is taken into account). Since we've been discussing the need to use only strings to represent sequences recently, it's interesting to note that SeqUtils.quick_FASTA_reader is about twice as fast as SeqIO.FASTA.FastaReader if there is no conversion of sequences from strings to Seq or SeqRecord objects. While the Flex-based parser is the fastest in these tests, the time saved is marginal unless a large FASTA file is being parsed. Using a file with over 72000 entries (Phytophthora infestans ESTs), my typical timings become: 51.22s FormatIO/SeqRecord (for record in interator) 45.64s FormatIO/SeqRecord (iterator.next) 4.26s SeqIO.FASTA.FastaReader (for record in interator) 4.10s SeqIO.FASTA.FastaReader (iterator.next) 4.30s SeqIO.FASTA.FastaReader (iterator[i]) 58.39s Fasta.RecordParser (for record in interator) 59.97s Fasta.SequenceParser (for record in interator) 58.70s Fasta.SequenceParser (iterator.next) 2.20s SeqUtils/quick_FASTA_reader 1.13s pyfastaseqlexer/next_record 0.56s pyfastaseqlexer/quick_FASTA_reader 2.20s SeqUtils/quick_FASTA_reader (conversion to Seq) 1.53s pyfastaseqlexer/next_record (conversion to Seq) 0.84s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq) 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord) 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) The Martel-based parsers become almost unworkable when dealing with files of this size. 
Note that the conversion of strings to SeqRecord objects is pretty much a constant overhead for the Bio.SeqUtils and pyfastaseqlexer methods (taking around 1s), but that there are apparently additional overheads in the SeqIO.FASTA.FastaReader method. Of course, the hassles of including a Flex-based parser in a general BioPython release probably outweigh the marginal time-saving benefits (see MMCIFlex for details ;) ). I think SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and beat the inclusion of a Flex-based parser hands-down in terms of maintainability and portability. L. -- Dr Leighton Pritchard AMRSC D131, Plant-Pathogen Interactions, Scottish Crop Research Institute Invergowrie, Dundee, Scotland, DD2 5DA, UK T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578 E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net (If the signature does not verify, please remove the SCRI disclaimer) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 06:36:00 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 11:36:00 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CD5AF2.10708@c2b2.columbia.edu> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> Message-ID: <44CDDD10.4020904@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Thanks Peter. > > Peter wrote: >>QUESTION: What do you all tend to use? > > I use the stuff in Bio.Fasta, but actually just because it's in the > documentation. Me too. > From your timings, and also because I'm not smart enough > to be able to understand Martel, let alone maintain Martel-based > parsers, I'm pretty much in favor of Bio.SeqIO. That was my gut instinct too. Starting with Bio.SeqIO as a base, I've been "playing" with the code and have a rough "Sequence Iterator" class that supports iteration (provides a next() and __iter__() method), as well as strictly increasing index access. At the moment I have iterators returning SeqRecords for: - Fasta Files - GenBank features (returns the CDS features and their translations) - Genbank files (with the features as SeqFeature objects) There is code in Bio/SeqIO/general.py for a few more file formats which I haven't used yet. This new GenBank iterator actually uses the current Bio.Genbank parser (with a slight tweak to how it acts once it reaches the end of a record). Michiel de Hoon wrote: > >Peter wrote: >> Should I draft a "questionnaire" >> to be posted on the main discussion list (and the announcements?). > > By all means, yes. In the questionnaire, be sure to separate the issue > of parser internals (Martel vs. pure Python) from the issue of how the > results should be formatted (Fasta.Record or SeqRecord). > Draft questionnaire follows, I have included by comments for the record. Too long? Missing any important questions? 
Peter -- Introduction ============ There is some discussion on the Developer's Mailing list about BioPython's sequence input/output routines. For example, it's a bit silly that there are three different Fasta reading routines in BioPython (even if only one of them, Bio.Fasta, is properly documented). Note that we are not going to "just remove" any of the current functionality. Some existing code may be re-written internally, while other code might be marked with a DeprecationWarning. If you could answer the following questions that would help guide our choices. Question One ============ Is reading sequence files an important function to you, and if so which file formats in particular (e.g. Fasta, GenBank, ...)? If you have had to write your own code to read a "common" file format which BioPython doesn't support, please get in touch. Peter's answer: > I read Fasta and GenBank files mostly. Also Clustalw alignments, > and Stockholm alignments. Question Two - Reading Fasta Files ================================== Which of the following do you currently use (and why)? (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a title, and the sequence as a string) (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) (c) Bio.Fasta with your own parser (Could you tell us more?) (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) (e) Bio.FormatIO (giving SeqRecord objects) (f) Other (Could you tell us more?) Peter's answer: > In most of my scripts I use Bio.Fasta with either the RecordParser or > FeatureParser. I did look at Bio.FormatIO when I started but found > Bio.Fasta was much better documented (and a similar speed). I have > only recently looked at Bio.SeqIO (hence this entire thread).
Question Three - index_file based dictionaries ============================================== Do you use any of the following? (a) Bio.Fasta.Dictionary (b) Bio.GenBank.Dictionary (c) Any other "Martel/Mindy" based dictionary which first requires creation of an index using the index_file function If so, do you have any comments? Peter's answer: > I do not use multi-record GenBank files (mine are single chromosomes). > > I have used Bio.Fasta.Dictionary but found dealing with the indexes > created by index_file to be annoying - especially when re-indexing > Fasta files which change often. > > I now use a simple wrapper function to load a Fasta file with an > iterator and build the dictionary in memory. For me this is much > less hassle and the memory demands are not too great. Question Four - Record Access... ================================ When loading a file with multiple sequences do you use: (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the records one by one in the order from the file. (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you random access to the records using their identifier. (c) A list giving random access by index number (e.g. load the records using an iterator but saving them in a list). Do you have any additional comments on this? For example, flexibility versus memory requirements. For example, when I need random access to a Fasta file, I build a dictionary in memory (using an iterator) rather than messing about with the index_file based dictionary. Peter's answer: > I usually deal with each record sequentially using an iterator. > > However, I often need random access using the record identifier and > for this I use a dictionary which I create in memory using an iterator. > > As stated in the question, I had tried using Bio.Fasta.Dictionary but > found dealing with the indexes created by index_file to be annoying, > especially having to re-index Fasta files which change often.
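The "dictionary built in memory from an iterator" approach mentioned in the answer above could be sketched roughly as follows. This is not the wrapper function referred to in the answer, which is not shown in the thread; `records_to_dict` and `first_word` are hypothetical names, and the sketch assumes only that some record iterator yields (title, sequence) pairs. The two sample records are made up.

```python
def first_word(title):
    """Use the first whitespace-delimited word of the title as the key."""
    parts = title.split(None, 1)
    return parts[0] if parts else ""


def records_to_dict(records, key=first_word):
    """Build {identifier: (title, sequence)} from an iterator of pairs."""
    index = {}
    for title, sequence in records:
        identifier = key(title)
        # Refuse silently overwriting a record that shares an identifier.
        if identifier in index:
            raise ValueError("duplicate identifier: %r" % identifier)
        index[identifier] = (title, sequence)
    return index


# Stand-in for a real record iterator; the records are invented examples.
records = iter([
    ("gi|1 hypothetical protein", "ACGTACACGT"),
    ("gi|2 another record", "TTGACA"),
])
lookup = records_to_dict(records)
print(lookup["gi|2"][1])
```

Passing a different `key` function plays a role similar to a custom title-to-identifier mapping. The whole file ends up in memory, which is the trade-off against an on-disk index that the question asks about.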
Question Five - Fasta files: FastaRecord or SeqRecord ===================================================== If you use Fasta files, do you want to get records returned as FastaRecords or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? For example, >name text text text ACGTACACGT As a FastaRecord this would have: FastaRecord.title = "name text text text" (string) FastaRecord.sequence = "ACGTACACGT" (string) As a SeqRecord (with the default title2ids mapping): SeqRecord.id = (default string) SeqRecord.name = (default string) SeqRecord.description = "name text text text" (string) SeqRecord.seq = Seq("ACGTACACGT", alphabet) Peter's answer: > For FASTA files I have usually used FastaRecord objects (with the > sequence as a string) but I have no strong preference. Thinking of > the big picture it would be better to have every parser return > SeqRecords by default. Question Six - GenBank files: GenbankRecord or SeqRecord ======================================================== If you use GenBank files, do you use: (a) Bio.GenBank.FeatureParser which returns SeqRecord objects (b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects Do you care much either way? For me the only significant difference is that feature locations are held as objects in the SeqRecord, and as the raw string in the Record. Peter's answer: > I have no strong preference - unless I wanted to manipulate the > feature locations. I think there might be a performance difference... Question Seven - Martel, Scanners and Consumers =============================================== Some of BioPython's existing parsers (e.g. those using Martel) use an event/callback model, where the scanner component generates parsing events which are dealt with by the consumer component. Do any of you use this system to modify existing parser behaviour, or use it as part of your own personal file parser? (a) I don't know, or don't care. I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please provide details). Peter's answer > As a user I don't care about the internals. I do care about what > gets used as the name/id/description for SeqRecords but that level > of flexibility is enough. > > As a BioPython contributor: Martel is scary. I think I understand > the whole scanner/consumer model but don't see the point (unless > using a event based scanner like Martel). I suspect all the > function call backs is one reason Martel parsers are slow. Peter From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 08:12:26 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 13:12:26 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <1154339988.1490.81.camel@lplinuxdev> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> Message-ID: <44CDF3AA.2020308@maubp.freeserve.co.uk> Leighton Pritchard wrote: > Just to add to the confusion, when parsing large FASTA sequence files, I > have been using a home-rolled Flex/Pyrex parser (if you'd like a copy, > drop me a line). I've used Peter's test framework on the same input > file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora > Core 3 (up-to-date, eh? 
;) ) to get the following typical results: Times for NC_000913.ffn when returning SeqRecord objects: > 4.07s FormatIO/SeqRecord (for record in interator) > 4.05s FormatIO/SeqRecord (iterator.next) > 5.00s Fasta.SequenceParser (for record in interator) > 4.80s Fasta.SequenceParser (iterator.next) > 0.32s SeqIO.FASTA.FastaReader (for record in interator) > 0.30s SeqIO.FASTA.FastaReader (iterator.next) > 0.31s SeqIO.FASTA.FastaReader (iterator[i]) > 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) > 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord) > 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) And again, but for Phytophthora infestans ESTs with 72000 entries: > 51.22s FormatIO/SeqRecord (for record in interator) > 45.64s FormatIO/SeqRecord (iterator.next) > 59.97s Fasta.SequenceParser (for record in interator) > 58.70s Fasta.SequenceParser (iterator.next) > 4.26s SeqIO.FASTA.FastaReader (for record in interator) > 4.10s SeqIO.FASTA.FastaReader (iterator.next) > 4.30s SeqIO.FASTA.FastaReader (iterator[i]) > 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) > 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord) > 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) I imagine this file is much much larger than what most of our users work with - but it does clearly show that the Martel parsers do not scale well. Out of interest, are the sequences in this file split into multiple lines (e.g. max length 80) or are they all single (long) lines? I would expect the latter to be quicker to load due to fewer string operations. > Of course, the hassles of including a Flex-based parser in a general > BioPython release probably outweigh the marginal time-saving benefits > (see MMCIFlex for details ;) ). I think SeqIO.FASTA.FastaReader and > SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and > beat the inclusion of a Flex-based parser hands-down in terms of > maintainability and portability.
I agree with you completely that we should avoid the Flex parser on those grounds, as we can get "close enough" with pure Python. Especially if we do something about the overhead of Seq and SeqRecord objects.

I did some work on a brand new SeqIO over the weekend. I had got the fasta iterator slightly quicker too.

The SeqUtils/quick_FASTA_reader is interesting in that it loads the entire file into memory in one go, and then parses it. On the other hand it's not perfect: I would use "\n>" as the split marker rather than ">" which could appear in the description of a sequence. The iterator approach is probably slower but requires much less memory.

How big is your 72,000 entry file in MB? Do we need to worry about the size of the raw file in memory - allowing the parsers to load it into memory could make things much faster...

Peter

From lpritc at scri.sari.ac.uk Mon Jul 31 10:15:54 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Mon, 31 Jul 2006 15:15:54 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDF3AA.2020308@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk>
Message-ID: <1154355358.1490.116.camel@lplinuxdev>

On Mon, 2006-07-31 at 13:12 +0100, Peter (BioPython Dev) wrote:
> I imagine this file is much much larger than what most of our users work
> with - but it does clearly show that the Martel parsers do not scale well.

I noticed the scaling problem mostly for GenBank files. Your new GenBank parser is a welcome improvement in speed.

> Out of interest, are the sequences in this file split into multiple
> lines (e.g. max length 80) or are they all single (long) lines? I would
> expect the latter to be quicker to load due to fewer string operations.

They're multiple lines with max length 50, and the whole file is 33Mb.
It's not the largest FASTA sequence file I'm working with, that's 353Mb (530801 sequences, it's most of a eukaryotic genome with sequences split into multiple lines), so I ran your test script on it, just to see what happened:

419.42s FormatIO/SeqRecord (for record in interator)
389.05s FormatIO/SeqRecord (iterator.next)
 35.46s SeqIO.FASTA.FastaReader (for record in interator)
 33.73s SeqIO.FASTA.FastaReader (iterator.next)
 36.19s SeqIO.FASTA.FastaReader (iterator[i])
490.19s Fasta.RecordParser (for record in interator)
555.43s Fasta.SequenceParser (for record in interator)
546.87s Fasta.SequenceParser (iterator.next)
 37.94s SeqUtils/quick_FASTA_reader
 12.84s pyfastaseqlexer/next_record
  6.06s pyfastaseqlexer/quick_FASTA_reader
 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
 12.27s pyfastaseqlexer/next_record (conversion to Seq)
  8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

This is only one run - my patience has limits. Again, scaling is a big problem for some methods.

> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> entire file into memory in one go, and then parses it. On the other
> hand it's not perfect: I would use "\n>" as the split marker rather than
> ">" which could appear in the description of a sequence.

I agree (not that it's bitten me, yet), but I'd be inclined to go with "%s>" % os.linesep as the split marker, just in case.

> Do we need to worry about the size of the raw file in memory - allowing the parsers to load it
> into memory could make things much faster...

I use very few FASTA files where that would be a problem, so long as the sequences remain as strings - when they're converted to SeqRecords/SeqFeatures is where I start to get nervous about memory use.

L.
--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net

From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 11:14:04 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 16:14:04 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154355358.1490.116.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev>
Message-ID: <44CE1E3C.2050502@maubp.freeserve.co.uk>

>
> They're multiple lines with max length 50, and the whole file is 33Mb.
> It's not the largest FASTA sequence file I'm working with, that's 353Mb
> (530801 sequences, it's most of a eukaryotic genome with sequences split
> into multiple lines), so I ran your test script on it, just to see what
> happened:
>
> 419.42s FormatIO/SeqRecord (for record in interator)
> 389.05s FormatIO/SeqRecord (iterator.next)
>  35.46s SeqIO.FASTA.FastaReader (for record in interator)
>  33.73s SeqIO.FASTA.FastaReader (iterator.next)
>  36.19s SeqIO.FASTA.FastaReader (iterator[i])
> 490.19s Fasta.RecordParser (for record in interator)
> 555.43s Fasta.SequenceParser (for record in interator)
> 546.87s Fasta.SequenceParser (iterator.next)
>  37.94s SeqUtils/quick_FASTA_reader
>  12.84s pyfastaseqlexer/next_record
>   6.06s pyfastaseqlexer/quick_FASTA_reader
>  24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
>  12.27s pyfastaseqlexer/next_record (conversion to Seq)
>   8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
>  24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
>  18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
>  13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
>
> This is only one run - my patience has limits. Again, scaling is
> a big problem for some methods.

Interesting - but no big surprises, except maybe just how slow Martel is. Did you notice if it ran out of memory, and had to page to the hard disk?

>>The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>entire file into memory in one go, and then parses it. On the other
>>hand it's not perfect: I would use "\n>" as the split marker rather than
>>">" which could appear in the description of a sequence.
>
> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> "%s>" % os.linesep as the split marker, just in case.

Good point. I wonder how many people even know this function exists?
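
The split-marker point above can be illustrated in plain Python. This is a minimal sketch, not the actual SeqUtils.quick_FASTA_reader code: the function name split_fasta_text is hypothetical, and normalising line endings before splitting is an assumption added here so that a single "\n>" marker works whatever platform wrote the file.

```python
def split_fasta_text(text):
    """Split the full text of a FASTA file into (title, sequence) pairs.

    Hypothetical helper for illustration only.
    """
    # Assumption: normalise Windows (\r\n) and old Mac (\r) line endings
    # to \n, so the marker does not depend on os.linesep.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    # Prepend a newline so the first record is found by the same "\n>"
    # marker; anything before the first record is discarded by [1:].
    for chunk in ("\n" + text).split("\n>")[1:]:
        lines = chunk.split("\n")
        title = lines[0]
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((title, sequence))
    return records


example = ">seq1 ratio A>B\nACGT\nTTGA\n>seq2\nGGCC\n"
print(split_fasta_text(example))
# [('seq1 ratio A>B', 'ACGTTTGA'), ('seq2', 'GGCC')]
```

Splitting on a bare ">" would have cut the first title at "A>B". The whole file still has to fit in memory, as with any read-then-split approach.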
>>Do we need to worry about the size of the raw file in memory - allowing
>>the parsers to load it into memory could make things much faster...
>
> I use very few FASTA files where that would be a problem, so long as the
> sequences remain as strings - when they're converted to
> SeqRecords/SeqFeatures is where I start to get nervous about memory use.

Maybe we should avoid loading entire files into memory while parsing - except for those formats like Clustal alignments where there is no real choice.

Have you got a feeling for the difference in memory required for a large Fasta file in memory as:
* Title string, sequence string
* Title string, sequence as Seq object
* SeqRecords (which include the sequence as a Seq object)

While it's overkill for simple file formats like FASTA, I think we do need a fairly high level object like the SeqRecord when dealing with things like GenBank/EMBL to hold the basic annotation and identifiers (id/name/description).

I am thinking that we should have a set of sequence parsers that all return SeqRecord objects (with format specific options in some cases to control the exact mapping of the data, e.g. title2ids for Fasta files).

And a matching set of sequence writers that take SeqRecord object(s) and write them to a file.

Such a mapping won't be perfect, so maybe there is still a place for "format specific representations" like the Record object in Bio.GenBank.Record.

In the short term maybe we should just replace the internals of the current Bio.Fasta module with a pure Python implementation like that in Bio.SeqIO.FASTA - good idea? Bad idea?
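
The parser design floated above (an iterator that yields one record at a time, with a title2ids hook controlling id/name/description) might look roughly like this. It is a sketch only: plain dicts stand in for SeqRecord objects, the names fasta_iterator and _make_record are hypothetical, and the idea that title2ids returns an (id, name, description) tuple is an assumption made for illustration rather than anything settled in this thread.

```python
import io


def _make_record(title, seq_lines, title2ids):
    """Build a dict standing in for a SeqRecord (hypothetical helper)."""
    if title2ids is not None:
        # Assumption: the callback returns (id, name, description).
        rec_id, name, description = title2ids(title)
    else:
        # Default mapping: the first word of the title is both id and name.
        words = title.split(None, 1)
        rec_id = name = words[0] if words else ""
        description = title
    return {"id": rec_id, "name": name, "description": description,
            "seq": "".join(seq_lines)}


def fasta_iterator(handle, title2ids=None):
    """Yield one record at a time, holding only one record in memory."""
    title = None
    seq_lines = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield _make_record(title, seq_lines, title2ids)
            title = line[1:]
            seq_lines = []
        elif title is not None:
            # Lines before the first ">" are ignored.
            seq_lines.append(line.strip())
    if title is not None:
        yield _make_record(title, seq_lines, title2ids)


handle = io.StringIO(">gi|123|demo first example\nACGT\nTTGA\n>second\nGGCC\n")
for record in fasta_iterator(handle):
    print(record["id"], len(record["seq"]))
```

Because the function is a generator, memory use is bounded by the largest single record rather than by the size of the file.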
Peter

From f.schlesinger at iu-bremen.de Mon Jul 31 12:07:08 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Mon, 31 Jul 2006 18:07:08 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <7317d50c0607310907sc468843nfe3945225d2ace76@mail.gmail.com>

> Have you got a feeling for the difference in memory required for a large
> Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)

From looking at the code, the only difference should be one instance of alphabet and one reference to it per sequence. The main difference is that Seq.data.method involves some Python, while string.method is pure C code.

Felix

From mcolosimo at mitre.org Mon Jul 31 12:08:50 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 31 Jul 2006 12:08:50 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID:

On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it. On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than
>>> ">" which could appear in the description of a sequence.
>>
>> I agree (not that it's bitten me, yet), but I'd be inclined to go
>> with
>> "%s>" % os.linesep as the split marker, just in case.
>
> Good point. I wonder how many people even know this function exists?
>

The only problem with this is if someone sends you a file not created on your system. I remember huge problems 5 or so years ago in BioPerl dealing with the Mac, Unix, Windows line-ending issues. This has mostly simplified down to two - Unix and Windows - unless the person uses a Mac GUI app, some of which use \r (CR) instead of \n (LF), where Windows uses \r\n (CRLF). I think the standard Python distro comes with crlf.py and lfcr.py that can convert the line endings.

> Maybe we should avoid loading entire files into memory while parsing -
> except for those formats like Clustal alignments where there is no real
> choice.
>
> Have you got a feeling for the difference in memory required for a large
> Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)
>
> While its overkill for simple file formats like FASTA, I think we do
> need a fairly high level object like the SeqRecord when dealing with
> things like Genbank/EMBL to hold the basic annotation and identifiers
> (id/name/description).
>
> I am thinking that we should have a set of sequence parsers that all
> return SeqRecord objects (with format specific options in some cases to
> control the exact mapping of the data, e.g. title2ids for Fasta files).
>
> And a matching set of sequence writers that take SeqRecord object(s) and
> write them to a file.
>
> Such a mapping won't be perfect, so maybe there is still a place for
> "format specific representations" like the Record object in
> Bio.GenBank.Record
>
> In the short term maybe we should just replace the internals of the
> current Bio.Fasta module with a pure python implementation like that in
> Bio.SeqIO.FASTA - good idea?
Bad idea?

I would keep them separate but change the documentation on the how-to site to point to Bio.SeqIO.FASTA, since that is where I think we want people to start going. The code change to Bio.Fasta should be to add a deprecation warning.

Marc

From mdehoon at c2b2.columbia.edu Mon Jul 31 13:34:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 31 Jul 2006 13:34:41 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To:
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <44CE3F31.2080404@c2b2.columbia.edu>

Marc Colosimo wrote:
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>> In the short term maybe we should just replace the internals of the
>> current Bio.Fasta module with a pure python implementation like
>> that in
>> Bio.SeqIO.FASTA - good idea? Bad idea?
>
> I would keep them separate but change the documentation on the how-to
> site to point to using the Bio.SeqIO.FASTA since that is where I
> think we want people to start going. The code change to Bio.Fasta
> should be to add a deprecation warning.

I agree with Marc here. No need to modify Bio.Fasta if it's on its way out.

--Michiel.
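
For reference, a deprecation warning of the kind suggested above is normally issued with the standard warnings module. The sketch below is illustrative only: old_fasta_entry_point is a hypothetical stand-in for whichever Bio.Fasta function would carry the warning, and the message text is an assumption.

```python
import warnings


def old_fasta_entry_point():
    """Hypothetical stand-in for a Bio.Fasta function being retired."""
    warnings.warn(
        "This interface is deprecated; please use Bio.SeqIO.FASTA instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return None


# DeprecationWarning is often hidden by default, so enable it explicitly
# here and record what is raised instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_fasta_entry_point()

print(caught[0].category.__name__)
# DeprecationWarning
```

Existing callers keep working; they only see the warning if their warning filters show it.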
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 13:41:49 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 18:41:49 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To:
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <44CE40DD.3010101@maubp.freeserve.co.uk>

Peter wrote:
>>In the short term maybe we should just replace the internals of the
>>current Bio.Fasta module with a pure python implementation like
>>that in Bio.SeqIO.FASTA - good idea? Bad idea?

Marc wrote:
> I would keep them separate but change the documentation on the how-to
> site to point to using the Bio.SeqIO.FASTA since that is where I
> think we want people to start going. The code change to Bio.Fasta
> should be to add a deprecation warning.

Certainly long term we could do that. There may be advantages to the current very flexible Bio.Fasta code that the SeqIO replacement may not offer (e.g. if we focus on just parsing into SeqRecords).

Short Term
----------
Right now I guess most people dealing with Fasta files will be using Bio.Fasta, and it is very slow, hence bug 2058:

http://bugzilla.open-bio.org/show_bug.cgi?id=2058

My patch makes Bio.Fasta almost as fast as Bio.SeqIO.FASTA according to my tests (modest sized files).

Could any of you try this patch on your machines, on the off chance that it causes problems for any existing code? It does pass test_Fasta.py and test_Fasta2.py on Windows at least.
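
For anyone trying the patch, a small timing harness along these lines may help compare parsers on the same input. It is a sketch, not the test script used for the figures in this thread: count_records is a toy stand-in for a real parser, time_parser is a hypothetical helper, and taking the best of several runs is an assumption about how to reduce noise from a single run.

```python
import io
import time


def count_records(handle):
    """Toy parser: count the header lines in a FASTA handle."""
    return sum(1 for line in handle if line.startswith(">"))


def time_parser(parser, text, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = None
    for _ in range(repeats):
        handle = io.StringIO(text)  # fresh handle for every run
        start = time.perf_counter()
        parser(handle)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


text = ">seq1\nACGT\n>seq2\nGGCC\n" * 1000
print("best of 3: %.4fs" % time_parser(count_records, text))
```

Any callable that takes a file-like handle can be passed in place of count_records, so the old and patched parsers can be timed on identical text.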
Medium/Long Term
----------------
We need to sort out what to do with Bio.SeqIO as currently the existing code in Bio/SeqIO/generic.py and Bio/SeqIO/FASTA.py uses different interfaces. But do agree that something like that should be OK.

I have been working on a possible replacement (but it doesn't seem to have made it to the mailing list yet - must check my recent email).

Peter

From mdehoon at c2b2.columbia.edu Sun Jul 2 04:43:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 00:43:47 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
References: <44A6ED70.9080204@c2b2.columbia.edu> <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
Message-ID: <44A74F03.8020801@c2b2.columbia.edu>

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules.
However, it raises a bunch of design questions (such as Fasta.Record versus SeqRecord, and Seq versus string), so it's probably better to wait with that until after the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.

Iddo Friedberg wrote:
> Michiel,
>
> There is actually a simple minded fasta reader/writer that does not use
> Martel. Bio.SeqIO.FASTA
>
> ./I
>
> --
> Iddo Friedberg, PhD
> Burnham Institute for Medical Research
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037 USA
> T: +1 858 646 3100 x3516
> http://iddo-friedberg.org
> http://BioFunctionPrediction.org
>
>
>
> -----Original Message-----
> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
> Sent: Sat 7/1/2006 2:47 PM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] Fasta parser
>
> Hi everybody,
>
> The Biopython shows the following approach to parsing a Fasta file:
>
> >>> from Bio import Fasta
> >>> parser = Fasta.RecordParser()
> >>> file = open("ls_orchid.fasta")
> >>> iterator = Fasta.Iterator(file, parser)
> >>> cur_record = iterator.next()
>
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
>
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?
>
> --Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From idoerg at burnham.org Sun Jul 2 04:48:50 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 21:48:50 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu> <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org> <44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A6@MAIL.burnham.org>

By (lack of?) design, my own Biopython-using code seems to be using both the Martel and non-Martel parsers. I imagine others may have the same. Point being: any design change should make sure that we are backward compatible.

Thanks very much for your work on the Biopython release.

Cheers,

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org

-----Original Message-----
From: Michiel de Hoon [mailto:mdehoon at c2b2.columbia.edu]
Sent: Sat 7/1/2006 9:43 PM
To: Iddo Friedberg
Cc: biopython-dev at biopython.org
Subject: Re: [Biopython-dev] Fasta parser

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules. However, it raises a bunch of design questions (such as Fasta.Record versus SeqRecord, and Seq versus string), so it's probably better to wait with that until after the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.

Iddo Friedberg wrote:
> Michiel,
>
> There is actually a simple minded fasta reader/writer that does not use
> Martel. Bio.SeqIO.FASTA
>
> ./I
>
> --
> Iddo Friedberg, PhD
> Burnham Institute for Medical Research
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037 USA
> T: +1 858 646 3100 x3516
> http://iddo-friedberg.org
> http://BioFunctionPrediction.org
>
>
>
> -----Original Message-----
> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
> Sent: Sat 7/1/2006 2:47 PM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] Fasta parser
>
> Hi everybody,
>
> The Biopython shows the following approach to parsing a Fasta file:
>
> >>> from Bio import Fasta
> >>> parser = Fasta.RecordParser()
> >>> file = open("ls_orchid.fasta")
> >>> iterator = Fasta.Iterator(file, parser)
> >>> cur_record = iterator.next()
>
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
>
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?
>
> --Michiel.
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From mdehoon at c2b2.columbia.edu Sun Jul 2 14:58:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 10:58:35 -0400
Subject: [Biopython-dev] New Biopython release coming up
Message-ID: <44A7DF1B.1000008@c2b2.columbia.edu>

Hi everybody,

The next Biopython release (1.42, code-named "Brooklyn") is coming up. I'm planning to finish this release about two weeks from now. The tests of Biopython in CVS all pass, so we are doing well. However, there are 25 bugs listed in Bugzilla, so please have a look to see if there's something we can do about them.
If you have some code sitting around, now would be a good time to commit it to CVS. However, if you are not sure if your code is ready for prime time, please hold off until after this release. Also, if you have a cvs checkout of Biopython, please make sure to update it before doing any commits to avoid overwriting.

Thanks everybody for your contributions to Biopython.

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Sun Jul 2 18:11:47 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 02 Jul 2006 19:11:47 +0100
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A7DF1B.1000008@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
Message-ID: <44A80C63.7060809@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Hi everybody,
>
> The next Biopython release (1.42, code-named "Brooklyn") is coming up.
> I'm planning to finish this release about two weeks from now. The tests
> of Biopython in CVS all pass, so we are doing well. However, there are
> 25 bugs listed in Bugzilla, so please have a look to see if there's
> something we can do about them. If you have some code sitting around,
> now would be a good time to commit it to CVS. However, if you are not
> sure if your code is ready for prime time, please hold off until after
> this release. Also, if you have a cvs checkout of Biopython, please make
> sure to update it before doing any commits to avoid overwriting.
>
> Thanks everybody for your contributions to Biopython.
>
> --Michiel.

Sounds like a good plan Michiel

Did anyone get back to you about the NCBI Blast XML format? I would say parsing blast output is a fairly important feature to a lot of users (I may of course be biased)...

Getting down to specifics:

Bugzilla Bug 1997
VARCHAR too small in SCOP tables
http://bugzilla.open-bio.org/show_bug.cgi?id=1997
Suggested fix looked OK to me, but as I've never used SCOP a second opinion would be wise.

Bugzilla Bug 1987
Alphabet.Gapped does not retain gap character
http://bugzilla.open-bio.org/show_bug.cgi?id=1987
I have attached a suggested patch, second opinion welcome

Bugzilla Bug 1981
GenBank parser generates unusual feature qualifiers.
http://bugzilla.open-bio.org/show_bug.cgi?id=1981
A question about the white space in GenBank comments etc. Changing this is probably harmless but we are already making a big change internally with the move away from Martel, so I would rather postpone any further change until after the next release.

Bugzilla Bug 1936
Bio.PDB.PDBParser throws PDBException, 'No parent' in error
http://bugzilla.open-bio.org/show_bug.cgi?id=1936
One for Thomas Hamelryck which on the face of it looks fairly simple.

Bugzilla Bug 1946
Parsing GenBank Files - unknown line type PROJECT
Does anyone use the new project line? Would a simple string be enough to store this?

Peter

From mcolosimo at mitre.org Sun Jul 2 18:36:22 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 14:36:22 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID:

On 7/1/06 5:47 PM, "Michiel de Hoon" wrote:
> Hi everybody,
>
> The Biopython shows the following approach to parsing a Fasta file:
>
>>>> from Bio import Fasta
>>>> parser = Fasta.RecordParser()
>>>> file = open("ls_orchid.fasta")
>>>> iterator = Fasta.Iterator(file, parser)
>>>> cur_record = iterator.next()
>
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
>
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator?
Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?

Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then remap into a SeqRecord.

Also, could someone re-run epydoc! My changes in the code have not made it to the on-line API docs.

Marc

From mcolosimo at mitre.org Sun Jul 2 19:12:23 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 15:12:23 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A74F03.8020801@c2b2.columbia.edu>
Message-ID:

Michiel,

When will this next release be made and what is going into it?

Since you brought up the issue of design questions, I'll have my little rant now. But first, I would like to say that I think it is great that people contribute code and more importantly their time to this project. Without all of the core developers there would be no BioPython. So, kudos to anyone who has contributed code.

Now on to my rant....

I'm not a big user of either BioPerl or BioJava. However, they are well structured and more consistent than BioPython. This FastaIO issue is one of several design issues that really need to be addressed. For example, both BioPerl and BioJava use a SeqIO object structure. Our SeqIO module is heavily underused. For example, we have Fasta, GenBank, LocusLink, NBRF, SwissProt, UniGene main Modules. Interestingly, there is a writers.SeqRecord.embl but I can't quickly find something to read in an embl file! Just look at what BioPerl can read in and how easy it is to find this out (even without the doc page, all of these are listed under Bio::SeqIO::*)

There is a very short "Coding Convention", which doesn't seem to be followed all that well.

My suggestion is if enough people are going to ISMB this year (which I am not), that time should be made to think about a road map for BioPython.
My suggestions are:
1) split off a branch for ver 2.0 that supports Python 2.4 only (this would suck for Mac people, like me, but it's time to move on)
2) clean house - remove deprecated items, restructure IO, etc...
3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/convertcode.py")
4) use Cheese Shop for missing modules
5) documentation

marc

On 7/2/06 12:43 AM, "Michiel de Hoon" wrote:
> Thanks Iddo!
> I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than
> the Martel-based one in Bio.Fasta.
>
> It would be nice to merge these two modules. However, it raises a bunch
> of design questions (such as Fasta.Record versus SeqRecord, and Seq
> versus string), so it's probably better to wait with that until after
> the next Biopython release. Which, by the way, will be coming up soon.
>
> Thanks,
>
> --Michiel.
>
> Iddo Friedberg wrote:
>> Michiel,
>>
>> There is actually a simple minded fasta reader/writer that does not use
>> Martel. Bio.SeqIO.FASTA
>>
>> ./I
>>
>> --
>> Iddo Friedberg, PhD
>> Burnham Institute for Medical Research
>> 10901 N. Torrey Pines Rd.
>> La Jolla, CA 92037 USA
>> T: +1 858 646 3100 x3516
>> http://iddo-friedberg.org
>> http://BioFunctionPrediction.org
>>
>>
>>
>> -----Original Message-----
>> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
>> Sent: Sat 7/1/2006 2:47 PM
>> To: biopython-dev at biopython.org
>> Subject: [Biopython-dev] Fasta parser
>>
>> Hi everybody,
>>
>> The Biopython shows the following approach to parsing a Fasta file:
>>
>>>>> from Bio import Fasta
>>>>> parser = Fasta.RecordParser()
>>>>> file = open("ls_orchid.fasta")
>>>>> iterator = Fasta.Iterator(file, parser)
>>>>> cur_record = iterator.next()
>>
>> But for large Fasta files, it's very slow, compared to file.read(),
>> which may be due to going through Martel (I believe the same was true
>> for large GenBank files).
>> >> So I'm thinking about writing a simple-minded Fasta parser for better >> performance with large files. What I'm wondering about: >> 1) Is there some advantage that I overlooked of using Martel for parsing >> Fasta files? >> 2) Why is it necessary to create a parser first and passing it to >> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something >> other than a Fasta.RecordParser? >> >> --Michiel. >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From mdehoon at c2b2.columbia.edu Sun Jul 2 20:54:27 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 02 Jul 2006 16:54:27 -0400 Subject: [Biopython-dev] Fasta parser In-Reply-To: References: Message-ID: <44A83283.4060401@c2b2.columbia.edu> >> 2) Why is it necessary to create a parser first and passing it to >> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something >> other than a Fasta.RecordParser? > > Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object > (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then > remap into a SeqRecord. I see. This is one of the design issues I ran into when comparing Bio.Fasta and Bio.SeqIO.FASTA: Whether parsing a Fasta file should result in a Fasta.Record object or a SeqRecord. > Also, could someone re-run epydoc! My changes in the code have not made it > to the on-line API docs. Done. --Michiel. From mdehoon at c2b2.columbia.edu Sun Jul 2 21:19:46 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 02 Jul 2006 17:19:46 -0400 Subject: [Biopython-dev] BioPython Design In-Reply-To: References: Message-ID: <44A83872.4070209@c2b2.columbia.edu> Colosimo, Marc E. 
wrote: > When will this next release be made ... I'm planning for the weekend of 15/16 July. > ... and what is going into it? Whatever is in CVS at that time. So essentially today's CVS plus as many bug fixes as possible. I'd hold off on any major changes until after the release. I pretty much agree with Marc here. > My suggestion is if enough people are going to ISMB this year > (which I am not), that time should be made to think about a > road map for BioPython. Unfortunately, I won't be going either. A Biopython road map seems like a good idea though. > My suggestions are: > 1) split off a branch for ver 2.0 that supports Python 2.4 only > (this would suck for Mac people, like me, but it's time to move on) Is there something essential in 2.4 that's missing in 2.3? Not that I object to supporting 2.4 only, I'm just wondering. Though I'd be hesitant to split off a separate branch, since Biopython is confusing enough already as it is. Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no problem for Mac users if we support 2.4 only. > 2) clean house - remove deprecated items, restructure IO, etc... I totally agree. > 3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/convertcode.py") Here, I'm a bit hesitant. SciPy does not have a good track record in terms of portability. The latest version of numpy looks better though (it compiled without problems on all platforms I tried). But I don't really want to pay $40 for the documentation. > 4) use Cheese Shop for missing modules > 5) documentation My guess is that maintaining the documentation will be easier once we have cleaned up Biopython. --Michiel.
From mdehoon at c2b2.columbia.edu Mon Jul 3 01:21:00 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 02 Jul 2006 21:21:00 -0400 Subject: [Biopython-dev] New Biopython release coming up In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> Message-ID: <44A870FC.4060909@c2b2.columbia.edu> Peter wrote: > Did anyone get back to you about the NCBI Blast XML format? I would say > parsing blast output is a fairly important feature to a lot of users (I > may of course be biased)... No response yet, but I'll ask them again before the upcoming release. The existing XML parser still works as advertised for single blast searches. For multiple blast searches, people will have to run a previous version of blast locally. > Bugzilla Bug 1997 VARCHAR too small in SCOP tables > http://bugzilla.open-bio.org/show_bug.cgi?id=1997 > Suggested fix looked OK to me, but as I've never used SCOP a second > opinion would be wise. This one looks fine to me, but I'm not a SCOP user either. > Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character > http://bugzilla.open-bio.org/show_bug.cgi?id=1987 > I have attached a suggested patch, second opinion welcome While the patch looks fine, I have no idea what this code is supposed to do, or why it needs to be so complicated. > Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT > Does anyone use the new project line? Would a simple string be enough > to store this? From NCBI's description, it appears they're not quite sure yet what this project line should look like (note that the project line in the description is different from the project line in the GenBank file: GenomeProject vs. GENOME_PROJECT). I would just store the line in a simple string, and do something more fancy once we know the proper format. My 2¢. --Michiel.
From idoerg at burnham.org Mon Jul 3 17:52:44 2006 From: idoerg at burnham.org (Iddo Friedberg) Date: Mon, 03 Jul 2006 10:52:44 -0700 Subject: [Biopython-dev] [Fwd: [OBF] Call For Birds of a Feather Suggestions] Message-ID: <44A9596C.90208@burnham.org> The BOSC organizing committee is currently seeking suggestions for Birds of a Feather meeting ideas. Birds of a Feather meetings are one of the more popular activities at BOSC, occurring at the end of each day's session. These are free-form meetings organized by the attendees themselves to discuss one or a few topics of interest in greater detail. BOFs have been formed to allow developers and users of individual OBF software to meet each other face-to-face to discuss the project, or to discuss completely new ideas, and even start new software development projects. These meetings offer a unique opportunity for individuals to explore more about the activities of the various Open Source Projects, and, in some cases, even take an active role influencing the future of Open Source Software development. If you would like to create a BOF, just sign up for a wiki account, log in, and edit the BOSC 2006 Birds of a Feather page. _______________________________________________ Open-Bioinformatics-Foundation mailing list Open-Bioinformatics-Foundation at lists.open-bio.org This is a broadcast-only announce list used to distribute emails to people who subscribe to OBF hosted email discussion or announce lists. To prevent our most active members from getting many duplicate copies of important announcements we created this list today so that only one email gets sent to each subscribed email address. You do not need to subscribe/unsubscribe from this list. Problems or Concerns? -- send an email to the OBF mailteam at: mailteam at open-bio.org -- Iddo Friedberg, Ph.D. Burnham Institute for Medical Research 10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9949 http://iddo-friedberg.org http://BioFunctionPrediction.org From biopython-dev at maubp.freeserve.co.uk Thu Jul 6 09:06:07 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Thu, 06 Jul 2006 10:06:07 +0100 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44A870FC.4060909@c2b2.columbia.edu> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> Message-ID: <44ACD27F.90906@maubp.freeserve.co.uk> Peter wrote: >> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character >> http://bugzilla.open-bio.org/show_bug.cgi?id=1987 >> I have attached a suggested patch, second opinion welcome Michiel de Hoon wrote: > Whereas the patch looks fine, I have no idea what this code is supposed > to do, or why it needs to be so complicated. I'm not the person to ask. The whole Alphabet is something that confused me a little when first using BioPython. I see why a special class for sequences is a nice idea, and that handling the different variants of RNA, DNA and proteins is a good idea. But to be honest, I have generally used plain strings in my own programs, and meddled with alphabets only when needed (e.g. for translating from DNA to protein sequences). Peter From hoffman at ebi.ac.uk Thu Jul 6 10:36:53 2006 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Thu, 6 Jul 2006 11:36:53 +0100 (BST) Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> Message-ID: [Peter] > The whole Alphabet is something that confused me a little when first > using BioPython. 
I see why a special class for sequences is a nice > idea, and that handling the different variants of RNA, DNA and proteins > is a good idea. > > But to be honest, I have generally used plain strings in my own > programs, and meddled with alphabets only when needed (e.g. for > translating from DNA to protein sequences). I agree. In general, I think that the alphabet stuff adds unnecessary complexity to perhaps 95 % of the sort of things I would do with Biopython. But as it stands I usually use strs myself instead. -- Michael Hoffman European Bioinformatics Institute From Leighton.Pritchard at scri.ac.uk Thu Jul 6 10:34:46 2006 From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard) Date: Thu, 6 Jul 2006 11:34:46 +0100 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> Message-ID: <1152182087.4828.96.camel@lplinuxdev> An embedded and charset-unspecified text was scrubbed...
Name: not available URL: -------------- next part -------------- An embedded message was scrubbed... From: "Leighton Pritchard" Subject: Re: [Biopython-dev] New Biopython release coming up / Alphabets Date: Thu, 6 Jul 2006 11:34:46 +0100 Size: 4250 URL: From mdehoon at c2b2.columbia.edu Thu Jul 6 16:39:09 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Thu, 06 Jul 2006 12:39:09 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> Message-ID: <44AD3CAD.8030504@c2b2.columbia.edu> Michael Hoffman wrote: > [Peter] >> But to be honest, I have generally used plain strings in my own >> programs, and meddled with alphabets only when needed (e.g. for >> translating from DNA to protein sequences). Note that there is a function "translate" in Bio.Seq that translates DNA to protein using plain strings. > > I agree. In general, I think that the alphabet stuff adds unnecessary > complexity to perhaps 95 % of the sort of things I would do with > Biopython. But as it stands I usually use strs myself instead. It appears that most people (myself included) use plain strings instead of Seq objects (= string + Alphabet). We should check on the biopython mailing list if anybody really needs alphabets, and if not get rid of them (after the upcoming Brooklyn-release (1.42) though). --Michiel. 
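[Editorial sketch, not part of the original thread.] To make the "plain strings are enough" point above concrete, here is a minimal sketch of translating DNA to protein without any Seq or Alphabet object. It is not the implementation of the "translate" function in Bio.Seq mentioned above; the names BASES, AMINO_ACIDS, CODON_TABLE and translate_plain are hypothetical and chosen only for illustration. It assumes the standard genetic code and modern Python syntax.

```python
# Hypothetical helper for illustration only; this is NOT Bio.Seq.translate.
BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon table, e.g. "ATG" -> "M" and "TAA" -> "*".
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_plain(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X") for pos in range(0, usable, 3)
    )


protein = translate_plain("ATGGCCAAATAA")
```

The input and the result are both ordinary strings, so no alphabet bookkeeping is involved. The cost is that nothing stops a caller from passing a protein or RNA string by mistake, which is the kind of check an alphabet could provide.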
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From fkauff at duke.edu Thu Jul 6 17:53:23 2006 From: fkauff at duke.edu (Frank Kauff) Date: Thu, 06 Jul 2006 13:53:23 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> <44AD3CAD.8030504@c2b2.columbia.edu> Message-ID: <1152208403.2487.36.camel@osiris.biology.duke.edu> On Thu, 2006-07-06 at 12:39 -0400, Michiel Jan Laurens de Hoon wrote: > Michael Hoffman wrote: > > [Peter] > >> But to be honest, I have generally used plain strings in my own > >> programs, and meddled with alphabets only when needed (e.g. for > >> translating from DNA to protein sequences). > > Note that there is a function "translate" in Bio.Seq that translates DNA > to protein using plain strings. > > > > I agree. In general, I think that the alphabet stuff adds unnecessary > > complexity to perhaps 95 % of the sort of things I would do with > > Biopython. But as it stands I usually use strs myself instead. > > It appears that most people (myself included) use plain strings instead > of Seq objects (= string + Alphabet). We should check on the biopython > mailing list if anybody really needs alphabets, and if not get rid of > them (after the upcoming Brooklyn-release (1.42) though). > I use seq objects and the alphabet stuff in the nexus parser, but I don't really know why and wouldn't mind at all to get rid of them. Frank > --Michiel. > > -- Frank Kauff Dept. 
of Biology Duke University Box 90338 Durham, NC 27708 USA Phone 919-660-7382 Fax 919-660-7293 Web http://www.lutzonilab.net From thamelry at binf.ku.dk Fri Jul 7 10:44:24 2006 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Fri, 7 Jul 2006 12:44:24 +0200 (CEST) Subject: [Biopython-dev] New Biopython release coming up In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> Message-ID: <38979.129.11.36.25.1152269064.squirrel@secure.binf.ku.dk> Hi, > Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in > error > http://bugzilla.open-bio.org/show_bug.cgi?id=1936 > One for Thomas Hamelryck which on the face of it looks fairly simple. Won't have time to work on biopython before august I'm afraid (CASP+ articles that need to be finished, etc.). Sorry!
Best regards, -Thomas From mcolosimo at mitre.org Tue Jul 11 16:01:15 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 11 Jul 2006 12:01:15 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu> References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> <44AD3CAD.8030504@c2b2.columbia.edu> Message-ID: On Jul 6, 2006, at 12:39 PM, Michiel Jan Laurens de Hoon wrote: > Michael Hoffman wrote: >> [Peter] >>> But to be honest, I have generally used plain strings in my own >>> programs, and meddled with alphabets only when needed (e.g. for >>> translating from DNA to protein sequences). > > Note that there is a function "translate" in Bio.Seq that > translates DNA > to protein using plain strings. >> >> I agree. In general, I think that the alphabet stuff adds unnecessary >> complexity to perhaps 95 % of the sort of things I would do with >> Biopython. But as it stands I usually use strs myself instead. > > It appears that most people (myself included) use plain strings > instead > of Seq objects (= string + Alphabet). We should check on the biopython > mailing list if anybody really needs alphabets, and if not get rid of > them (after the upcoming Brooklyn-release (1.42) though). > > --Michiel. I am strongly arguing against removing the alphabets. You would lose all of the cool features of Seq Objects (complement, reverse_complement). There are similar functions under Bio.SeqUtils but those are "Deprecated". From just looking around, I think this would break many things. Having said that, I do find them a pain to deal with, but that might have more to do with the structure/layout of the classes. My simple suggestion is to fix/change the base Alphabet classes in Bio.Alphabet.__init__.
I am trying to think of a way that we can have a "true" GenericAlphabet class (not generic_alphabet = Alphabet() ) and still use just strings. The problem is that I don't know if just using letters = None (or letters = []) will cause problems down the road (things like if x in alphabet.letters are used in many classes). Also, I'm really confused as to what is going on in IUPAC.py with the default_manager stuff and _bootstrap. Marc From mcolosimo at mitre.org Tue Jul 11 17:29:52 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Tue, 11 Jul 2006 13:29:52 -0400 Subject: [Biopython-dev] BioPython Design In-Reply-To: <44A83872.4070209@c2b2.columbia.edu> References: <44A83872.4070209@c2b2.columbia.edu> Message-ID: <7C24AEA4-68EC-4517-9391-C07512CDD146@mitre.org> On Jul 2, 2006, at 5:19 PM, Michiel de Hoon wrote: > > > My suggestions are: > > 1) split off a branch for ver 2.0 that supports Python 2.4 only > > (this would suck for Mac people, like me, but it's time to move on) > > Is there something essential in 2.4 that's missing in 2.3? Not that > I object to supporting 2.4 only, I'm just wondering. Though > I'd be hesitant to split off a separate branch, since Biopython is > confusing enough already as it is. > > Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no > problem for Mac users to support 2.4 only. There are two off the top of my head: Generator expressions (PEP 289). This could be very useful in cleaning up the old code. Decorators for Functions (PEP 318). I like the idea of using staticmethod and classmethod. The accepts and returns decorators are also interesting. I wish I could find a list of all possible decorators. In any case, some cleanup of the code is needed because people have used the string "Decorator" (Alphabet.__init__.py and NeCatch.py) > > > 2) clean house - remove deprecated items, restructure IO, etc... > > I totally agree.
> > > 3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/ > convertcode.py") > > Here, I'm a bit hesitant. SciPy does not have a good track record > in terms of portability. The latest version of numpy looks better > though (it compiled without problems on all platforms I tried). But > I don't really want to pay $40 for the documentation. I saw this, but didn't know it was the only documentation. However, as far as I can tell Numeric is dead is NumPy! Marc From krewink at inb.uni-luebeck.de Tue Jul 11 21:23:14 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Tue, 11 Jul 2006 23:23:14 +0200 Subject: [Biopython-dev] BioPython Design Message-ID: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> On 11.07.2006 at 18:01, Marc Colosimo wrote: > It appears that most people (myself included) use plain strings instead > of Seq objects (= string + Alphabet). We should check on the biopython > mailing list if anybody really needs alphabets, and if not get rid of > them (after the upcoming Brooklyn-release (1.42) though). There are some good points about Seq objects in the discussion last year: http://lists.open-bio.org/pipermail/biopython-dev/2005-April/002074.html Personally, I would prefer to keep Alphabets as a part of Seq, but make it behave more like Python strings, i.e.: str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:] Furthermore, alphabets could be more useful with an __init__ method looking like def __init__(self, data, alphabet, validate=False) This way, sequences could be checked for consistency on demand. To make Alphabets more usable, it would be nice to have some kind of dictionary interface to map different alphabets: e.g.
Alphabet.Alphabets['protein'] == Bio.Alphabet.IUPAC.protein Cheers, Albert -- Albert Krewinkel University of Luebeck phone: +49 (451) 500 5516 email: krewink at inb.uni-luebeck.de From f.schlesinger at iu-bremen.de Wed Jul 12 13:25:43 2006 From: f.schlesinger at iu-bremen.de (Felix Schlesinger) Date: Wed, 12 Jul 2006 15:25:43 +0200 Subject: [Biopython-dev] BioPython Design In-Reply-To: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> References: <20060711212314.GA31351@pc06.inb.mu-luebeck.de> Message-ID: <7317d50c0607120625x7e76008fo961814b280dbad51@mail.gmail.com> > Personally, I would prefer to keep Alphabets as a part of Seq, > but make it behave more like Python strings, i.e.: > str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:] Isn't the whole alphabet thing just type information in the end? (I.e. "This string is of type protein") And if it is, shouldn't we let the Python type system handle it via a class hierarchy? Or use the Python concept of duck typing and assume the string has whatever type is needed at the moment until it fails? Felix Schlesinger From mdehoon at c2b2.columbia.edu Wed Jul 26 17:39:46 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Wed, 26 Jul 2006 13:39:46 -0400 Subject: [Biopython-dev] New Biopython release coming up / Alphabets In-Reply-To: References: <44A7DF1B.1000008@c2b2.columbia.edu> <44A80C63.7060809@maubp.freeserve.co.uk> <44A870FC.4060909@c2b2.columbia.edu> <44ACD27F.90906@maubp.freeserve.co.uk> <44AD3CAD.8030504@c2b2.columbia.edu> Message-ID: <44C7A8E2.2050100@c2b2.columbia.edu> Marc Colosimo wrote: >> [Michiel] >> It appears that most people (myself included) use plain strings instead >> of Seq objects (= string + Alphabet). We should check on the biopython >> mailing list if anybody really needs alphabets, and if not get rid of >> them (after the upcoming Brooklyn-release (1.42) though). > > [Marc] > I am strongly arguing against removing the alphabets.
You would lose > all of the cool features of Seq Objects (complement, > reverse_complement). There are similar functions under Bio.SeqUtils but > those are "Deprecated". From just looking around, I think this would > break many things. There is a function reverse_complement in Bio.Seq that works on plain strings. (If you need the complement instead, you can of course reverse the result.) So can you be more specific on which features of Seq objects are actually needed? While I can see the intuitive appeal of having a Seq class, I cannot think of any practical cases where a simple string wouldn't do. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Fri Jul 28 13:50:39 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Fri, 28 Jul 2006 14:50:39 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc Message-ID: <44CA162F.1040604@maubp.freeserve.co.uk> This follows on from the discussion started earlier this month by Marc Colosimo, but I want to focus just on reading in sequence files: http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html There was also a thread a few years ago where Michael Hoffman was looking at timings for parsing Fasta files. http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html Jeffrey Chang wrote: > That is a nice implementation. However, Biopython already has at least > 3 Fasta parsers! > Bio/Fasta > Bio/SeqIO/FASTA > Bio/expressions/fasta > > Bio/Fasta, the one you compared against, is easily the slowest one. > Bio/SeqIO/FASTA is very similar to your implementation and not likely > to be significantly faster or slower. Bio/expressions/fasta uses > Martel. I don't know how well that will perform. The parsing part > should be blazingly fast (since it is mostly in C), but building the > object will be slow. It might be a wash.
> > Jeff Clearly we could try and consolidate these (while making things as nice as possible with deprecation warnings etc. for existing code). I've had a little read on the BioPerl SeqIO system: http://www.bioperl.org/wiki/HOWTO:SeqIO I agree with Marc that what we have in BioPython could (and should) be more organised. Ideally (in my opinion) BioPython should be able to read sequences from multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...) * using a standard interface * into a standard object * quickly The resulting object should be able to hold additional information like annotation and (sub)features; the Bio.SeqRecord.SeqRecord object seems ideal. It looks like we have: (1) A number of format-specific sequence reading modules (in particular Fasta and GenBank) which can read their particular file format into one or more different object representations. These seem to be the best documented (in my opinion). (2) A fairly generic (but relatively slow) framework in the Bio.FormatIO system which uses Martel expressions internally. I have found Martel frustrating to debug, and especially slow with large individual records (like genomic GenBank files). There is some documentation on this, e.g. http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html (3) The start of a generic "pure python" framework in the Bio.SeqIO system, but it needs some work (e.g. some doc strings, fixing the LargeFastaFormat class, GenBank support). QUESTION: What do you all tend to use? Should I draft a "questionnaire" to be posted on the main discussion list (and the announcements list?). Personally, I have been using Bio.Fasta and Bio.GenBank to read sequences. I tend to only output Fasta files, and usually do this "by hand" as they are so simple and I want full control over the description lines.
Peter From biopython-dev at maubp.freeserve.co.uk Fri Jul 28 15:05:21 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Fri, 28 Jul 2006 16:05:21 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> Message-ID: <44CA27B1.30107@maubp.freeserve.co.uk> Jeffrey Chang wrote: > ... However, Biopython already has at least > 3 Fasta parsers! > Bio/Fasta > Bio/SeqIO/FASTA > Bio/expressions/fasta > > Bio/Fasta, the one you compared against, is easily the slowest one. > Bio/SeqIO/FASTA is very similar to your implementation and not likely > to be significantly faster or slower. Bio/expressions/fasta uses > Martel. I don't know how well that will perform. The parsing part > should be blazingly fast (since it is mostly in C), but building the > object will be slow. It might be a wash. The following timings are for iterating over a large fasta file (Escherichia_coli_K12, NC_000913.ffn, with 5254 nucleotide CDS sequences). The test script is attached; the test input is available here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.ffn I used BioPython 1.42 with Python 2.3 on Windows XP on a laptop computer. Apart from Fasta.RecordParser, these all return a SeqRecord object with a generic alphabet:
0.89s SeqIO.FASTA.FastaReader (for record in iterator)
0.88s SeqIO.FASTA.FastaReader (iterator.next)
0.88s SeqIO.FASTA.FastaReader (iterator[i])
5.52s FormatIO/SeqRecord (for record in iterator)
5.41s FormatIO/SeqRecord (iterator.next)
6.06s Fasta.RecordParser (for record in iterator)
6.10s Fasta.SequenceParser (for record in iterator)
6.27s Fasta.SequenceParser (iterator.next)
As you can see, SeqIO.FASTA.FastaReader (written in simple Python) is about six times faster than the Martel-based parsers. I have tried this on a file with 2000 records and see a similar scaling.
Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: test_fasta_methods.py URL: From mdehoon at c2b2.columbia.edu Mon Jul 31 01:20:50 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 30 Jul 2006 21:20:50 -0400 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> Message-ID: <44CD5AF2.10708@c2b2.columbia.edu> Thanks Peter. Peter (BioPython Dev) wrote: > QUESTION: What do you all tend to use? I use the stuff in Bio.Fasta, but actually just because it's in the documentation. From your timings, and also because I'm not smart enough to be able to understand Martel, let alone maintain Martel-based parsers, I'm pretty much in favor of Bio.SeqIO. > Should I draft a "questionnaire" > to be posted on the main discussion list (and the announcements?). By all means, yes. In the questionnaire, be sure to separate the issue of parser internals (Martel vs. pure Python) from the issue of how the results should be formatted (Fasta.Record or SeqRecord). --Michiel From lpritc at scri.sari.ac.uk Mon Jul 31 09:59:47 2006 From: lpritc at scri.sari.ac.uk (Leighton Pritchard) Date: Mon, 31 Jul 2006 10:59:47 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CA27B1.30107@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> Message-ID: <1154339988.1490.81.camel@lplinuxdev> Hi all, On Fri, 2006-07-28 at 16:05 +0100, Peter (BioPython Dev) wrote: > Jeffrey Chang wrote: > > ... However, Biopython already has at least > > 3 Fasta parsers! > > Bio/Fasta > > Bio/SeqIO/FASTA > > Bio/expressions/fasta > > > > Bio/Fasta, the one you compared against, is easily the slowest one. > > Bio/SeqIO/FASTA is very similar to your implementation and not likely > > to be significantly faster or slower. 
Bio/expressions/fasta uses > > Martel. I don't know how well that will perform. The parsing part > > should be blazingly fast (since it is mostly in C), but building the > > object will be slow. It might be a wash. Just to add to the confusion, when parsing large FASTA sequence files, I have been using a home-rolled Flex/Pyrex parser (if you'd like a copy, drop me a line). I've used Peter's test framework on the same input file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora Core 3 (up-to-date, eh? ;) ) to get the following typical results:
4.07s FormatIO/SeqRecord (for record in iterator)
4.05s FormatIO/SeqRecord (iterator.next)
0.32s SeqIO.FASTA.FastaReader (for record in iterator)
0.30s SeqIO.FASTA.FastaReader (iterator.next)
0.31s SeqIO.FASTA.FastaReader (iterator[i])
5.53s Fasta.RecordParser (for record in iterator)
5.00s Fasta.SequenceParser (for record in iterator)
4.80s Fasta.SequenceParser (iterator.next)
0.18s SeqUtils/quick_FASTA_reader
0.11s pyfastaseqlexer/next_record
0.09s pyfastaseqlexer/quick_FASTA_reader
0.19s SeqUtils/quick_FASTA_reader (conversion to Seq)
0.14s pyfastaseqlexer/next_record (conversion to Seq)
0.11s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
pyfastaseqlexer is my Flex/Pyrex combination, which has a number of methods for reading in FASTA sequences. Here I've used the two that correspond to the Bio.SeqUtils.quick_FASTA_reader method (overlooked in the original list, but also included here for comparison), and Peter's iterator method for his tests. Since these extra methods don't return Bio.Seq or Bio.SeqRecord objects, but instead lists of (name, sequence) tuples, I've also included test functions that carry out the conversion in Python, and their timings.
It's probably not a surprise that a dedicated Flex-based parser shows such a dramatic speed improvement over the Martel-based parsers. The improvement over SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader is only marginal, though (a factor of approximately two when conversion to SeqRecord is taken into account). Since we've been discussing the need to use only strings to represent sequences recently, it's interesting to note that SeqUtils.quick_FASTA_reader is about twice as fast as SeqIO.FASTA.FastaReader if there is no conversion of sequences from strings to Seq or SeqRecord objects. While the Flex-based parser is the fastest in these tests, the time saved is marginal unless a large FASTA file is being parsed. Using a file with over 72000 entries (Phytophthora infestans ESTs), my typical timings become: 51.22s FormatIO/SeqRecord (for record in interator) 45.64s FormatIO/SeqRecord (iterator.next) 4.26s SeqIO.FASTA.FastaReader (for record in interator) 4.10s SeqIO.FASTA.FastaReader (iterator.next) 4.30s SeqIO.FASTA.FastaReader (iterator[i]) 58.39s Fasta.RecordParser (for record in interator) 59.97s Fasta.SequenceParser (for record in interator) 58.70s Fasta.SequenceParser (iterator.next) 2.20s SeqUtils/quick_FASTA_reader 1.13s pyfastaseqlexer/next_record 0.56s pyfastaseqlexer/quick_FASTA_reader 2.20s SeqUtils/quick_FASTA_reader (conversion to Seq) 1.53s pyfastaseqlexer/next_record (conversion to Seq) 0.84s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq) 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord) 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) The Martel-based parsers become almost unworkable when dealing with files of this size. 
Note that the conversion of strings to SeqRecord objects is pretty much a constant overhead for the Bio.SeqUtils and pyfastaseqlexer methods (taking around 1s), but that there are apparently additional overheads in the SeqIO.FASTA.FastaReader method. Of course, the hassles of including a Flex-based parser in a general BioPython release probably outweigh the marginal time-saving benefits (see MMCIFlex for details ;) ). I think SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and beat the inclusion of a Flex-based parser hands-down in terms of maintainability and portability. L. -- Dr Leighton Pritchard AMRSC D131, Plant-Pathogen Interactions, Scottish Crop Research Institute Invergowrie, Dundee, Scotland, DD2 5DA, UK T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578 E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net (If the signature does not verify, please remove the SCRI disclaimer) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
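For readers comparing the timings above, here is a minimal sketch of what a pure-Python reader in this style might look like. It is not the code of Bio.SeqIO.FASTA.FastaReader or Bio.SeqUtils.quick_FASTA_reader; the name iter_fasta is hypothetical, and the sketch assumes current Python 3 rather than the Python 2.4 used in the thread. It yields plain (title, sequence) string tuples and leaves any conversion to Seq or SeqRecord objects to the caller, which is where the thread suggests much of the overhead lies.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) string tuples from an open FASTA handle.

    Hypothetical helper for illustration only; not a Biopython function.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines are collected and joined once per record.
            chunks.append(line.strip())
    # Emit the final record, which has no following header to close it.
    if title is not None:
        yield title, "".join(chunks)


# Small in-memory example; a real file opened with open() works the same way.
example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2\nTTTT\n")
records = list(iter_fasta(example))
```

Because only one record is held at a time, memory use stays roughly proportional to the longest sequence rather than to the size of the file.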
From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 10:36:00 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 11:36:00 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CD5AF2.10708@c2b2.columbia.edu> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> Message-ID: <44CDDD10.4020904@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Thanks Peter. > > Peter wrote: >>QUESTION: What do you all tend to use? > > I use the stuff in Bio.Fasta, but actually just because it's in the > documentation. Me too. > From your timings, and also because I'm not smart enough > to be able to understand Martel, let alone maintain Martel-based > parsers, I'm pretty much in favor of Bio.SeqIO. That was my gut instinct too. Starting with Bio.SeqIO as a base, I've been "playing" with the code and have a rough "Sequence Iterator" class that supports iteration (provides a next() and __iter__() method), as well as strictly increasing index access. At the moment I have iterators returning SeqRecords for: - Fasta Files - GenBank features (returns the CDS features and their translations) - Genbank files (with the features as SeqFeature objects) There is code in Bio/SeqIO/general.py for a few more file formats which I haven't used yet. This new GenBank iterator actually uses the current Bio.Genbank parser (with a slight tweak to how it acts once it reaches the end of a record). Michiel de Hoon wrote: > >Peter wrote: >> Should I draft a "questionnaire" >> to be posted on the main discussion list (and the announcements?). > > By all means, yes. In the questionnaire, be sure to separate the issue > of parser internals (Martel vs. pure Python) from the issue of how the > results should be formatted (Fasta.Record or SeqRecord). > Draft questionnaire follows, I have included by comments for the record. Too long? Missing any important questions? 
Peter -- Introduction ============ There is some discussion on the Developer's Mailing list about BioPython's sequence input/output routines. For example, its a bit silly that there are three different Fasta reading routines in BioPython (even if only one of them, Bio.Fasta, is properly documented). Note that we are not going to "just remove" any of the current functionality. Some existing code may be re-written internally, while other code might be marked with a DeprecationWarning. If you could answer the following questions that would help guide our choices. Question One ============ Is reading sequence files an important function to you, and if so which file formats in particular (e.g. Fasta, GenBank, ...) If you have had to write you own code to read a "common" file format which BioPython doesn't support, please get in touch. Peter's answer: > I read Fasta and GenBank files mostly. Also Clustalw alignments, > and Stockholm alignments. Question Two - Reading Fasta Files ================================== Which of the following do you currently use (and why)?: (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a title, and the sequence as a string) (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) (c) Bio.Fasta with your own parser (Could you tell us more?) (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) (e) Bio.FormatIO (giving SeqRecord objects) (f) Other (Could you tell us more?) Peter's answer: > In most of my script I use Bio.Fasta with either the RecordParser or > FeatureParser. I did look at Bio.FormatIO when I started but found > Bio.Fasta was much better documented (and a similar speed). I have > only recently looked at Bio.SeqIO (hence this entire thread). 
Question Three - index_file based dictionaries ============================================== Do you use any of the following: (a) Bio.Fasta.Dictionary (b) Bio.Genbank.Dictionary (c) Any other "Martel/Mindy" based dictionary which first requires creation of an index using the index_file function If so, do you have any comments? Peter's answer: > I do not use multi-record Genbank files (mine are single chromosomes). > > I have used Bio.Fasta.Dictionary but found dealing with the indexes > created by index_file to be annoying - especially when re-indexing > Fasta files which change often. > > I now use a simple wrapper function to load a Fasta file with an > iterator and build the dictionary in memory. For me this is much > less hassle and the memory demands are not too great. Question Four - Record Access... ================================ When loading a file with multiple sequences do you use: (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the records one by one in the order from the file. (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you random access to the records using their identifier. (c) A list giving random access by index number (e.g. load the records using an iterator but saving them in a list). Do you have any additional comments on this? For example, flexibility versus memory requirements. For example, when I need random access to a Fasta file, I build a dictionary in memory (using an iterator) rather than messing about with the index_file based dictionary. Peter's answer: > I usually deal with each record sequentially using an iterator. > > However, I often need random access using the record identifier and > for this I use a dictionary which I create in memory using an iterator. > > As stated in the question, I had tired used Bio.Fasta.Dictionary but > found dealing with the indexes created by index_file to be annoying, > especially having to re-indexing Fasta files which change often. 
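The in-memory approach described in the answer above (iterate over the file once and build a dictionary keyed on the record identifier) can be sketched roughly as follows. This is an illustration only, written for current Python 3: iter_fasta, first_word and fasta_to_dict are hypothetical names, not Biopython functions, and taking the first word of the title line as the identifier is an assumption that will not suit every file.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples; hypothetical minimal FASTA reader."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def first_word(title):
    """Assumed identifier rule: the first whitespace-separated word."""
    parts = title.split(None, 1)
    return parts[0] if parts else ""


def fasta_to_dict(handle, key=first_word):
    """Load every record into a dict for random access by identifier."""
    records = {}
    for title, sequence in iter_fasta(handle):
        ident = key(title)
        if ident in records:
            # Silently overwriting would hide a problem in the input file.
            raise ValueError("duplicate identifier: %s" % ident)
        records[ident] = (title, sequence)
    return records


example = io.StringIO(">seq1 first\nACGT\n>seq2 second\nTT\nTT\n")
by_id = fasta_to_dict(example)
```

There is no index file to keep in sync, so a Fasta file that changes often is simply re-read; the price is that every record is held in memory at once.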
Question Five - Fasta files: FastaRecord or SeqRecord
=====================================================
If you use Fasta files, do you want to get records returned as FastaRecords or as SeqRecords? If SeqRecords, do you use your own title2ids mapping?

For example,

>name text text text
ACGTACACGT

As a FastaRecord this would have:

FastaRecord.title = "name text text text" (string)
FastaRecord.sequence = "ACGTACACGT" (string)

As a SeqRecord (with the default title2ids mapping):

SeqRecord.id = (default string)
SeqRecord.name = (default string)
SeqRecord.description = "name text text text" (string)
SeqRecord.seq = Seq("ACGTACACGT", alphabet)

Peter's answer:
> For FASTA files I have usually used FastaRecord objects (with the
> sequence as a string) but I have no strong preference. Thinking of
> the big picture it would be better to have every parser return
> SeqRecords by default.

Question Six - GenBank files: GenbankRecord or SeqRecord
========================================================
If you use GenBank files, do you use:

(a) Bio.Genbank.FeatureParser which returns SeqRecord objects
(b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects

Do you care much either way? For me the only significant difference is that feature locations are held as objects in the SeqRecord, and as the raw string in the Record.

Peter's answer:
> I have no strong preference - unless I wanted to manipulate the
> feature locations. I think there might be a performance difference...

Question Seven - Martel, Scanners and Consumers
===============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an event/callback model, where the scanner component generates parsing events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or use it as part of your own personal file parser?

(a) I don't know, or don't care. I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please provide details). Peter's answer > As a user I don't care about the internals. I do care about what > gets used as the name/id/description for SeqRecords but that level > of flexibility is enough. > > As a BioPython contributor: Martel is scary. I think I understand > the whole scanner/consumer model but don't see the point (unless > using a event based scanner like Martel). I suspect all the > function call backs is one reason Martel parsers are slow. Peter From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 12:12:26 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 13:12:26 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <1154339988.1490.81.camel@lplinuxdev> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> Message-ID: <44CDF3AA.2020308@maubp.freeserve.co.uk> Leighton Pritchard wrote: > Just to add to the confusion, when parsing large FASTA sequence files, I > have been using a home-rolled Flex/Pyrex parser (if you'd like a copy, > drop me a line). I've used Peter's test framework on the same input > file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora > Core 3 (up-to-date, eh? 
;) ) to get the following typical results: Times for NC_000913.ffn when returning SeqRecord objects: > 4.07s FormatIO/SeqRecord (for record in interator) > 4.05s FormatIO/SeqRecord (iterator.next) > 5.00s Fasta.SequenceParser (for record in interator) > 4.80s Fasta.SequenceParser (iterator.next) > 0.32s SeqIO.FASTA.FastaReader (for record in interator) > 0.30s SeqIO.FASTA.FastaReader (iterator.next) > 0.31s SeqIO.FASTA.FastaReader (iterator[i]) > 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) > 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord) > 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) And again, but for Phytophthora infestans ESTs with 72000 entries > 51.22s FormatIO/SeqRecord (for record in interator) > 45.64s FormatIO/SeqRecord (iterator.next) > 59.97s Fasta.SequenceParser (for record in interator) > 58.70s Fasta.SequenceParser (iterator.next) > 4.26s SeqIO.FASTA.FastaReader (for record in interator) > 4.10s SeqIO.FASTA.FastaReader (iterator.next) > 4.30s SeqIO.FASTA.FastaReader (iterator[i]) > 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) > 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord) > 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) I imagine this file is much much larger than what most of our uses work with - but it does clearly show that the Martel parsers do not scale well. Out of interest, are the sequences in this file split into multiple lines (e.g. max length 80) or are they all single (long) lines? I would expect the later to be quicker to load due to less string operations. > Of course, the hassles of including a Flex-based parser in a general > BioPython release probably outweigh the marginal time-saving benefits > (see MMCIFlex for details ;) ). I think SeqIO.FASTA.FastaReader and > SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and > beat the inclusion of a Flex-based parser hands-down in terms of > maintainability and portability. 
I agree with you completely that we should avoid the Flex parser based on those grounds, as we can get "close enough" with pure python. Especially if we do something about the overhead of Seq and SeqRecord objects. I did some work on a brand new SeqIO over the weekend. I had got the fasta iterator slightly quicker too. The SeqUtils/quick_FASTA_reader is interesting in that it loads the entire file into memory in one go, and then parses it. On the other hand its not perfect: I would use "\n>" as the split marker rather than ">" which could appear in the description of a sequence. The iterator approach is probably slower but requires much less memory. How big is your 72,000 entry file in MB? Do we need to worry about the size of the raw file in memory - allowing the parsers to load it into memory could make things much faster... Peter From lpritc at scri.sari.ac.uk Mon Jul 31 14:15:54 2006 From: lpritc at scri.sari.ac.uk (Leighton Pritchard) Date: Mon, 31 Jul 2006 15:15:54 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CDF3AA.2020308@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> Message-ID: <1154355358.1490.116.camel@lplinuxdev> On Mon, 2006-07-31 at 13:12 +0100, Peter (BioPython Dev) wrote: > I imagine this file is much much larger than what most of our uses work > with - but it does clearly show that the Martel parsers do not scale well. I noticed the scaling problem mostly for GenBank files. Your new GenBank parser is a welcome improvement in speed. > Out of interest, are the sequences in this file split into multiple > lines (e.g. max length 80) or are they all single (long) lines? I would > expect the later to be quicker to load due to less string operations. They're multiple lines with max length 50, and the whole file is 33Mb. 
It's not the largest FASTA sequence file I'm working with, that's 353Mb (530801 sequences, it's most of a eukaryotic genome with sequences split into multiple lines), so I ran your test script on it, just to see what happened:

419.42s FormatIO/SeqRecord (for record in interator)
389.05s FormatIO/SeqRecord (iterator.next)
 35.46s SeqIO.FASTA.FastaReader (for record in interator)
 33.73s SeqIO.FASTA.FastaReader (iterator.next)
 36.19s SeqIO.FASTA.FastaReader (iterator[i])
490.19s Fasta.RecordParser (for record in interator)
555.43s Fasta.SequenceParser (for record in interator)
546.87s Fasta.SequenceParser (iterator.next)
 37.94s SeqUtils/quick_FASTA_reader
 12.84s pyfastaseqlexer/next_record
  6.06s pyfastaseqlexer/quick_FASTA_reader
 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
 12.27s pyfastaseqlexer/next_record (conversion to Seq)
  8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

This is only one run - my patience has limits. Again, scaling is a big problem for some methods.

> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> entire file into memory in one go, and then parses it. On the other
> hand its not perfect: I would use "\n>" as the split marker rather than
> ">" which could appear in the description of a sequence.

I agree (not that it's bitten me, yet), but I'd be inclined to go with "%s>" % os.linesep as the split marker, just in case.

> Do we need to worry about the size of the raw file in memory - allowing
> the parsers to load it into memory could make things much faster...

I use very few FASTA files where that would be a problem, so long as the sequences remain as strings - when they're converted to SeqRecords/SeqFeatures is where I start to get nervous about memory use.

L.
-- Dr Leighton Pritchard AMRSC D131, Plant-Pathogen Interactions, Scottish Crop Research Institute Invergowrie, Dundee, Scotland, DD2 5DA, UK T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578 E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net (If the signature does not verify, please remove the SCRI disclaimer)

From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 15:14:04 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 16:14:04 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <1154355358.1490.116.camel@lplinuxdev> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> Message-ID: <44CE1E3C.2050502@maubp.freeserve.co.uk>

> > They're multiple lines with max length 50, and the whole file is 33Mb.
> It's not the largest FASTA sequence file I'm working with, that's 353Mb > (530801 sequences, it's most of a eukaryotic genome with sequences split > into multiple lines), so I ran your test script on it, just to see what > happened: > > 419.42s FormatIO/SeqRecord (for record in interator) > 389.05s FormatIO/SeqRecord (iterator.next) > 35.46s SeqIO.FASTA.FastaReader (for record in interator) > 33.73s SeqIO.FASTA.FastaReader (iterator.next) > 36.19s SeqIO.FASTA.FastaReader (iterator[i]) > 490.19s Fasta.RecordParser (for record in interator) > 555.43s Fasta.SequenceParser (for record in interator) > 546.87s Fasta.SequenceParser (iterator.next) > 37.94s SeqUtils/quick_FASTA_reader > 12.84s pyfastaseqlexer/next_record > 6.06s pyfastaseqlexer/quick_FASTA_reader > 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq) > 12.27s pyfastaseqlexer/next_record (conversion to Seq) > 8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq) > 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord) > 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord) > 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord) > > This is only one run - my patience has limits Again, scaling is > a big problem for some methods. Interesting - but no big surprises, except maybe just how slow Martel is. Did you notice if it run out of memory, and have to page to the hard disk? >>The SeqUtils/quick_FASTA_reader is interesting in that it loads the >>entire file into memory in one go, and then parses it. On the other >>hand its not perfect: I would use "\n>" as the split marker rather than >>">" which could appear in the description of a sequence. > > I agree (not that it's bitten me, yet), but I'd be inclined to go with > "%s>" % os.linesep as the split marker, just in case. Good point. I wonder how many people even know this function exists? 
>>Do we need to worry about the size of the raw file in memory - allowing >>the parsers to load it into memory could make things much faster... > > I use very few FASTA files where that would be a problem, so long as the > sequences remain as strings - when they're converted to > SeqRecords/SeqFeatures is where I start to get nervous about memory use. Maybe we should avoid loading entire files into memory while parsing - except for those formats like Clustal alignments where there is no real choice. Have you got a feeling for the difference in memory required for a large Fasta file in memory as: * Title string, sequence string * Title string, sequence as Seq object * SeqRecords (which include the sequence as a Seq object) While its overkill for simple file formats like FASTA, I think we do need a fairly high level object like the SeqRecord when dealing with things like Genbank/EMBL to hold the basic annotation and identifiers (id/name/description). I am thinking that we should have a set of sequence parsers that all return SeqRecord objects (with format specific options in some cases to control the exact mapping of the data, e.g. title2ids for Fasta files). And a matching set of sequence writers that take SeqRecord object(s) and write them to a file. Such a mapping won't be perfect, so maybe there is still a place for "format specific representations" like the Record object in Bio.GenBank.Record In the short term maybe we should just replace the internals of the current Bio.Fasta module with a pure python implementation like that in Bio.SeqIO.FASTA - good idea? Bad idea? 
Peter From f.schlesinger at iu-bremen.de Mon Jul 31 16:07:08 2006 From: f.schlesinger at iu-bremen.de (Felix Schlesinger) Date: Mon, 31 Jul 2006 18:07:08 +0200 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> Message-ID: <7317d50c0607310907sc468843nfe3945225d2ace76@mail.gmail.com> > Have you got a feeling for the difference in memory required for a large > Fasta file in memory as: > * Title string, sequence string > * Title string, sequence as Seq object > * SeqRecords (which include the sequence as a Seq object) >From looking at the code the only difference should be one instance of alphabet and one reference to it per sequence. The main difference is that Seq.data.method involves some python, while string.method is pure C code. Felix From mcolosimo at mitre.org Mon Jul 31 16:08:50 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Mon, 31 Jul 2006 12:08:50 -0400 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> Message-ID: On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote: > >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the >>> entire file into memory in one go, and then parses it. On the other >>> hand its not perfect: I would use "\n>" as the split marker >>> rather than >>> ">" which could appear in the description of a sequence. 
>> >> I agree (not that it's bitten me, yet), but I'd be inclined to go >> with >> "%s>" % os.linesep as the split marker, just in case. > > Good point. I wonder how many people even know this function exists? > The only problem with this is that if someone sends you a file not created on your system. I remember hugh problems 5 or so years ago in BioPerl with dealing with the Mac, Unix, Windows line-ending issues. This has mostly simplied down to two - Unix and Windows - unless the person uses a Mac GUI app some of which use \r (CR) instead of \n (LF) where Windows uses \r\n (CRLF). I think the standard python disto comes with crlf.py and lfcr.py that can convert the line endings. > Maybe we should avoid loading entire files into memory while parsing - > except for those formats like Clustal alignments where there is no > real > choice. > > Have you got a feeling for the difference in memory required for a > large > Fasta file in memory as: > * Title string, sequence string > * Title string, sequence as Seq object > * SeqRecords (which include the sequence as a Seq object) > > While its overkill for simple file formats like FASTA, I think we do > need a fairly high level object like the SeqRecord when dealing with > things like Genbank/EMBL to hold the basic annotation and identifiers > (id/name/description). > > I am thinking that we should have a set of sequence parsers that all > return SeqRecord objects (with format specific options in some > cases to > control the exact mapping of the data, e.g. title2ids for Fasta > files). > > And a matching set of sequence writers that take SeqRecord object > (s) and > write them to a file. > > Such a mapping won't be perfect, so maybe there is still a place for > "format specific representations" like the Record object in > Bio.GenBank.Record > > In the short term maybe we should just replace the internals of the > current Bio.Fasta module with a pure python implementation like > that in > Bio.SeqIO.FASTA - good idea? 
Bad idea? I would keep them separate but change the documentation on the how-to site to point to using the Bio.SeqIO.FASTA since that is where I think we want people to start going. The code change to Bio.Fasta should be to add a depreciation warning. Marc From mdehoon at c2b2.columbia.edu Mon Jul 31 17:34:41 2006 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 31 Jul 2006 13:34:41 -0400 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> Message-ID: <44CE3F31.2080404@c2b2.columbia.edu> Marc Colosimo wrote: > On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote: >> In the short term maybe we should just replace the internals of the >> current Bio.Fasta module with a pure python implementation like >> that in >> Bio.SeqIO.FASTA - good idea? Bad idea? > > I would keep them separate but change the documentation on the how-to > site to point to using the Bio.SeqIO.FASTA since that is where I > think we want people to start going. The code change to Bio.Fasta > should be to add a depreciation warning. I agree with Marc here. No need to modify Bio.Fasta if it's on its way out. --Michiel. 
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Mon Jul 31 17:41:49 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 31 Jul 2006 18:41:49 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> Message-ID: <44CE40DD.3010101@maubp.freeserve.co.uk> Peter wrote: >>In the short term maybe we should just replace the internals of the >>current Bio.Fasta module with a pure python implementation like >>that in Bio.SeqIO.FASTA - good idea? Bad idea? Marc wrote: > I would keep them separate but change the documentation on the how-to > site to point to using the Bio.SeqIO.FASTA since that is where I > think we want people to start going. The code change to Bio.Fasta > should be to add a depreciation warning. Certainly long term we could do that. There may be advantages to the current very flexible Bio.Fasta code that the SeqIO replacement may not offer (e.g. if we focus on just parsing into SeqRecords). Short Term ---------- Right now I guess most people dealing with Fasta files will be using Bio.Fasta, and it is very slow, hence bug 2058: http://bugzilla.open-bio.org/show_bug.cgi?id=2058 My patch makes Bio.Fasta almost as fast as Bio.SeqIO.FASTA according to my tests (modest sized files). If any of you could try this patch on your machines - on the off chance that it causes problems for any existing code. It does pass test_Fasta.py and test_Fasta2.py on Windows at least. 
Medium/Long Term
----------------
We need to sort out what to do with Bio.SeqIO, as the existing code in Bio/SeqIO/generic.py and Bio/SeqIO/FASTA.py currently uses different interfaces. But I do agree that something like that should be OK.

I have been working on a possible replacement (but it doesn't seem to have made it to the mailing list yet - must check my recent email).

Peter
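On the split-marker question raised earlier in this thread ("\n>" versus "%s>" % os.linesep, and files created on another platform), one possible approach is to normalise the line endings first and then always split on "\n>". The sketch below is an illustration written for current Python 3, not the code of Bio.SeqUtils.quick_FASTA_reader; split_fasta_text is a hypothetical name. Like quick_FASTA_reader as described above, it assumes the whole file has already been read into memory as one string.

```python
def split_fasta_text(text):
    """Split the full text of a FASTA file into (title, sequence) tuples.

    Hypothetical helper for illustration only; not a Biopython function.
    """
    # Normalise Windows (CRLF) and old Mac (CR) line endings to LF, so the
    # result does not depend on the platform the file was created on.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Prepend a newline so the first record starts with the same "\n>"
    # marker as the others; the piece before the first marker is dropped.
    entries = ("\n" + text).split("\n>")[1:]
    records = []
    for entry in entries:
        # The title is the first line; the rest is the wrapped sequence.
        title, _, body = entry.partition("\n")
        records.append((title, body.replace("\n", "")))
    return records


# A ">" inside a description no longer starts a new record, and mixed
# CRLF / CR line endings are handled the same way as plain LF.
mixed = ">a desc with > inside\r\nAC\r\nGT\r\n>b\rTT\r"
parsed = split_fasta_text(mixed)
```

Only a ">" at the start of a line is treated as the beginning of a record, which is the point of using "\n>" rather than ">" as the marker.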