From bugzilla-daemon at portal.open-bio.org Tue Oct 2 05:09:48 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Oct 2007 05:09:48 -0400
Subject: [Biopython-dev] [Bug 2362] test_copen fails on Windows XP as tries os.fork()
In-Reply-To:
Message-ID: <200710020909.l9299moD015903@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2362

mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-10-02 05:09 EST -------
I removed test_copen.py from CVS and deprecated the Bio.MultiProc code.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mdehoon at c2b2.columbia.edu Tue Oct 2 05:06:54 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 2 Oct 2007 05:06:54 -0400
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
References: <46E6A845.3030601@c2b2.columbia.edu>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>

Hi everybody,

Since no users of Bio.MultiProc came forward, I deprecated it for the upcoming release.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Michiel De Hoon
Sent: Tue 9/11/2007 10:37 AM
To: BioPython Developers List; biopython at biopython.org
Subject: [BioPython] Bio.MultiProc

Hi everybody,

In preparation for the upcoming release, I was running the Biopython test suite and found that test_copen.py hangs on Cygwin. It doesn't fail, it just sits there forever. This may be related to the use of fork() instead of select() in Bio/MultiProc/copen.py.
Anyway, while it is probably possible to fix this, I'd have to dig fairly deep into the code, and I am not sure if it is worth it. It looks like the copen functions are used only in Bio/config, which is needed for Bio.db. A description of the functionality of this module can be found in the tutorial section 4.7.2.

Now, I don't remember users asking about this module on the mailing list. From the tutorial documentation, it seems to be a nice piece of code, but I doubt that it is being used often in practice.

So I was wondering:
1) Is anybody on this list using this code?
2) If not, can I mark it as deprecated for the upcoming release?
Hopefully, people who are using this code will notice, and let us know that they need it.

--Michiel.
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From idoerg at gmail.com Tue Oct 2 12:00:41 2007
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 2 Oct 2007 09:00:41 -0700
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
References: <46E6A845.3030601@c2b2.columbia.edu> <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID:

Would it be possible to include the module, comment out the unworkable source code and print a deprecation warning when it is imported? That way we:
1) Don't have a clunky module BUT
2) we warn anyone who uses it (but didn't happen to read your post) that it is deprecated when they install a new biopython version AND
3) Leave an option of fixing and commenting the code back in (i.e. it is not lost forever).

Also, is it possible to track down the original author?

./I

On 10/2/07, Michiel De Hoon wrote:
>
> Hi everybody,
>
> Since no users of Bio.MultiProc came forward, I deprecated it for the
> upcoming release.
>
> --Michiel.

--
I. Friedberg
"The only problem with troubleshooting is that sometimes trouble shoots back."
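The suggestion above (keep the module importable, but warn when it is imported) can be illustrated with Python's standard `warnings` module. This is only a minimal sketch, not the code that was committed to Biopython: the helper name `warn_deprecated_module` and the message wording are assumptions made for illustration. In a real package the `warnings.warn` call would typically sit at the top of the deprecated module's `__init__.py`.

```python
import warnings


def warn_deprecated_module(name, alternative=None):
    """Issue a DeprecationWarning of the kind a deprecated module could emit on import.

    `name` and `alternative` are plain strings; this helper is hypothetical
    and only illustrates the mechanism discussed in the thread.
    """
    message = "%s is deprecated and may be removed in a future release." % name
    if alternative:
        message += " Please consider %s instead." % alternative
    # stacklevel=2 attributes the warning to the caller rather than to this helper.
    warnings.warn(message, DeprecationWarning, stacklevel=2)
    return message


if __name__ == "__main__":
    # Make sure the warning is displayed even if the default filters would hide it.
    warnings.simplefilter("always", DeprecationWarning)
    warn_deprecated_module("Bio.MultiProc")
```

Note that current Python versions hide `DeprecationWarning` in most contexts unless warnings are enabled (for example with `python -W default`), so a warning like this mainly reaches people who run their test suites with warnings switched on.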
From biopython-dev at maubp.freeserve.co.uk Tue Oct 2 12:55:53 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 02 Oct 2007 17:55:53 +0100
Subject: [Biopython-dev] Bio.MultiProc / Bio.FormatIO
In-Reply-To:
References: <46E6A845.3030601@c2b2.columbia.edu> <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID: <47027819.1010207@maubp.freeserve.co.uk>

Iddo Friedberg wrote:
> Would it be possible to include the module, comment out the unworkable
> source code and print a deprecation warning when it is imported?

That is sort of what Michiel did - he's just added a deprecation warning, but not touched the code itself.

This isn't an option for some of the more "integrated" bits of code like Bio.FormatIO which I suggested removing in Bug 2361 (see also my email to the main list on 19 September):

http://bugzilla.open-bio.org/show_bug.cgi?id=2361#c27

Peter

From rhaygood at duke.edu Tue Oct 2 19:59:43 2007
From: rhaygood at duke.edu (Ralph Haygood)
Date: Tue, 2 Oct 2007 19:59:43 -0400 (EDT)
Subject: [Biopython-dev] Statistics code
In-Reply-To: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
References: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
Message-ID:

Tiago,

Sorry to be so long replying---I've been almost drowning in work.

Use anything you find useful in my code. If you do write an article about it, I'd be glad to be a coauthor, not just in name but actually to help with writing the discussion of sequence statistics.

There *is* a lot of stuff in my code, not all of it generally important. For example, few people will care about indel statistics, beyond counting them and maybe getting the frequency distribution of their lengths. The things most people will care about are K (the number of polymorphic sites), Watterson's theta, pi, Tajima's D, Fu and Li's D, Fay and Wu's H, F_ST, and McDonald--Kreitman testing.
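To make the first three statistics in that list concrete, here is a minimal sketch. It is not Ralph's code and not the Bio.PopGen API; the function names are hypothetical. It assumes an alignment given as a list of equal-length uppercase strings with at least two sequences, and it skips any column containing a character other than A, C, G or T, which mirrors the default handling of ambiguous nucleotides described in this message (skipping gap characters the same way is an extra assumption). K is the number of polymorphic columns, Watterson's theta is K divided by a_n = 1 + 1/2 + ... + 1/(n-1), and pi is the mean number of pairwise differences; both estimators are computed per locus here, not per site.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def usable_sites(alignment):
    """Return the alignment columns in which every sequence has A, C, G or T."""
    return [col for col in zip(*alignment) if set(col) <= UNAMBIGUOUS]


def segregating_sites(alignment):
    """K: the number of polymorphic sites among the usable columns."""
    return sum(1 for col in usable_sites(alignment) if len(set(col)) > 1)


def wattersons_theta(alignment):
    """Watterson's estimator, K / a_n with a_n = sum of 1/i for i = 1 .. n-1."""
    n = len(alignment)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(alignment) / a_n


def nucleotide_diversity(alignment):
    """pi: the mean number of differences over all pairs of sequences."""
    columns = usable_sites(alignment)
    pairs = list(combinations(range(len(alignment)), 2))
    differences = sum(1 for col in columns for i, j in pairs if col[i] != col[j])
    return differences / float(len(pairs))


if __name__ == "__main__":
    sample = ["ACGTAC", "ACGTAT", "ACATAT", "ACRTAC"]
    print(segregating_sites(sample))
    print(wattersons_theta(sample))
    print(nucleotide_diversity(sample))
```

In the sample, the third column is dropped because one sequence carries the ambiguity code R, so only the last column counts as polymorphic.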

As for ambiguous nucleotides, my code handles them in one of two ways, at the programmer's option. By default, a site at which any sequence in the alignment contains an ambiguous nucleotide is ignored; for example,

ACRGTY
ACAGTC

is effectively equivalent to

ACGT
ACGT .

However, if the 'expand_diplotypes' option is specified when the Sample object is constructed, each sequence in the alignment is interpreted as a diplotype and converted into a pair of pseudo-haplotypes, two-fold ambiguous nucleotides (R, Y, W, S, M, and K) being interpreted as heterozygous; for example,

ACRGTY
ACAGTC

is effectively equivalent to

ACAGTC
ACGGTT
ACAGTC
ACAGTC .

In expand_diplotypes mode, sites containing three- or four-fold ambiguous nucleotides are still ignored. Also, you'll get a warning if you request a statistic that depends on correct SNP phasing, which most statistics don't. So far, I've found these two operating modes sufficient for my needs.

I think your plan sounds very reasonable, just adding sequence statistics at a pace that's comfortable for you. Any time you have questions, feel free to ask me, and I'll give you whatever benefit there is in my opinion and experience.

I'm happy for all this to happen on biopython-dev, so that other people (e.g., Alex Lancaster) can add to it. I'll leave it to the core developers to tell us if we're too noisy. (I'd recommend still sending messages to me with copies to biopython-dev, however, so that I don't accidentally miss them on biopython-dev, which I don't always read carefully.)

Ralph

On Sat, 29 Sep 2007, Tiago Antão wrote:

> Hi Ralph,
>
> Hope all is good with you. I am now finally starting to commit
> statistics code to Biopython. But before I go ahead I would like to
> ask you for some advice (plus some extra comments):
>
> About code merging and authorship:
>
> I am finally looking at your code. There is really lots of stuff
> there! Would it be OK with you if I merged your code with mine into
> Bio.PopGen.Stats?
> Obviously the copyright/authorship for the module
> would be co-shared as would any authorship of any article deriving
> from it...
>
> About a strategy to advance:
>
> 1. I personally don't have any experience, really, with working with
> sequence data (my background is SNPs, microsatellites/STRs, AFLPs and
> that sort of stuff)
> 2. Starting on Monday I am beginning a PhD which will require, part
> time, sequence analysis
> 3. What I mean from 1 and 2 is that I currently don't have the maturity to
> architect and design a good framework for sequence analysis but I will
> gain it with time.
> My plan is then to defer all sequence code until I feel I know what I
> am doing (although I was still thinking of providing something like
> BioPerl's facility for extracting all SNPs from sequences)
> If this is OK with you I plan to start committing code the week
> starting on this Monday,
>
> About request for insight:
>
> If you have any comments to offer on issues regarding representing
> indels and ambiguous data (i.e. ambiguous nucleotides) they might be
> useful, as I suppose that is the biggest issue that makes me afraid of
> sequence code.
>
>
> Finally: I would summarize our discussion here on biopython-dev (I am
> not taking it there directly just because you might not want your code
> on Biopython or might want it in other terms).
>
> Thanks,
> Tiago
>

From mdehoon at c2b2.columbia.edu Tue Oct 2 20:18:59 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 2 Oct 2007 20:18:59 -0400
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
References: <46E6A845.3030601@c2b2.columbia.edu> <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62D@mail2.exch.c2b2.columbia.edu>

> Would it be possible to include the module, comment out the unworkable
> source code and print a deprecation warning when it is imported?

That is what I did.

> 3) Leave an option of fixing and commenting the code back in (i.e. it is not
> lost forever).

Even after removing the code in some future release, the code will not be lost forever. It can always be retrieved from CVS and from older Biopython releases.

> Also, is it possible to track down the original author?

That would be Jeff Chang.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From tiagoantao at gmail.com Wed Oct 3 06:14:33 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 11:14:33 +0100
Subject: [Biopython-dev] Coalescent code
Message-ID: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>

Hi,

I had planned to start committing statistics-related code this weekend, but (contrary to my expectations) I am getting requests for the coalescent code. As such, I am planning to commit the coalescent code instead.

It is quite straightforward code, with only one issue on which I would need advice: Some of the code (regarding modeling demographies) requires some templates (very small text files, circa 10 of around 700 bytes each) to go along. Where should I put the files in Biopython?
Also, on installation those files have to be put somewhere...

Tiago
--
http://www.tiago.org/ps

From biopython-dev at maubp.freeserve.co.uk Wed Oct 3 10:18:21 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 03 Oct 2007 15:18:21 +0100
Subject: [Biopython-dev] Coalescent code
In-Reply-To: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
Message-ID: <4703A4AD.7030008@maubp.freeserve.co.uk>

Tiago Antão wrote:
> It is quite straightforward code, with only one issue on which I would
> need advice: Some of the code (regarding modeling demographies)
> requires some templates (very small text files, circa 10 of around 700
> bytes each) to go along. Where should I put the files in Biopython?
> Also, on installation those files have to be put somewhere...

There is a similar precedent with Bio/EUtils/DTDs (where the data files are XML DTD files). I guess you could have the 10 plain text data files in with the python files (or under a subdirectory). Opinions?

I should really refresh myself on current python packaging guidelines...

Peter

From tiagoantao at gmail.com Wed Oct 3 11:37:17 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 16:37:17 +0100
Subject: [Biopython-dev] Statistics code
In-Reply-To:
References: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
Message-ID: <6d941f120710030837k1aa2d4ak7eca8e6e27e35fdd@mail.gmail.com>

Ralph,

Thanks for the detailed explanation. Because of a couple of requests I had, I am going to commit the coalescent code first, but after the coalescent code is in, I will pick this up.

Tiago

--
http://www.tiago.org/ps

From tiagoantao at gmail.com Wed Oct 3 12:04:07 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 17:04:07 +0100
Subject: [Biopython-dev] Coalescent code
In-Reply-To: <4703A4AD.7030008@maubp.freeserve.co.uk>
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com> <4703A4AD.7030008@maubp.freeserve.co.uk>
Message-ID: <6d941f120710030904k70b098dcnbbc40bc3420ea831@mail.gmail.com>

Hi

On 10/3/07, Peter wrote:
> There is a similar precedent with Bio/EUtils/DTDs (where the data files
> are XML DTD files). I guess you could have the 10 plain text data files
> in with the python files (or under a subdirectory). Opinions?

In the meantime, I will start committing the code (I can easily accommodate the details of the places to put the files later, when there is a decision).

Michiel, please, please don't include the SimCoal code that I will be committing in the next public version.

Regards,
Tiago

From mdehoon at c2b2.columbia.edu Wed Oct 3 20:39:47 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 3 Oct 2007 20:39:47 -0400
Subject: [Biopython-dev] Coalescent code
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com> <4703A4AD.7030008@maubp.freeserve.co.uk> <6d941f120710030904k70b098dcnbbc40bc3420ea831@mail.gmail.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62E@mail2.exch.c2b2.columbia.edu>

> Michiel, please, please don't include the SimCoal code that I will be
> committing in the next public version.

To avoid confusion, please don't commit code to CVS that you don't want to be included in the next Biopython release.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From bugzilla-daemon at portal.open-bio.org Wed Oct 3 22:10:13 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Oct 2007 22:10:13 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710040210.l942ADGF030763@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2361

------- Comment #30 from mdehoon at ims.u-tokyo.ac.jp 2007-10-03 22:10 EST -------
Looking at the patch for Bio.FormatIO:

-------------------------
#Would like to have just issued a deprecation warning, and removed this
#module later. However, due to the FormatIO code in Bio/SeqRecord.py the
#deprecation warning would be triggered whenever someone used the SeqRecord.
raise ImportError, "Bio.FormatIO has been removed. Please try Bio.SeqIO instead"
-------------------------

Since the patch for Bio/SeqRecord.py removes its dependence on Bio.FormatIO, is it still necessary to raise an ImportError instead of issuing a DeprecationWarning?

From bugzilla-daemon at portal.open-bio.org Fri Oct 5 05:44:09 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 05:44:09 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710050944.l959i9BX029760@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2361

------- Comment #31 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-05 05:44 EST -------
In terms of typical usage, SeqRecord does not depend on FormatIO. However, from a code perspective, FormatIO and SeqRecord "depend" on each other.

If we remove the FormatIO "hooks" from SeqRecord.py (so that SeqRecord does not depend on FormatIO), then FormatIO breaks. Rather than leaving in a broken module, I wanted to remove it. A DeprecationWarning doesn't seem right if FormatIO is removed, which is why I suggested an ImportError.

We might be able instead to MOVE the FormatIO hooks out of SeqRecord and then issue a DeprecationWarning for FormatIO ... but it looks rather complicated, and probably means tackling the Bio.config code as well.

From bugzilla-daemon at portal.open-bio.org Fri Oct 5 07:05:49 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 07:05:49 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710051105.l95B5nXW001755@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2361

------- Comment #32 from mdehoon at ims.u-tokyo.ac.jp 2007-10-05 07:05 EST -------
> If we remove the FormatIO "hooks" from SeqRecord.py (so that SeqRecord does not
> depend on FormatIO), then FormatIO breaks. Rather than leaving in a broken
> module, I wanted to remove it. A DeprecationWarning doesn't seem right if
> FormatIO is removed, which is why I suggested an ImportError.

OK, I see. As far as I'm concerned, your patch is fine then.

From bugzilla-daemon at portal.open-bio.org Fri Oct 5 09:46:51 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 09:46:51 -0400
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200710051346.l95Dkpc2010074@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2174

tiagoantao at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #6 from tiagoantao at gmail.com 2007-10-05 09:46 EST -------
It is implemented, documented and with test code.

From tiagoantao at gmail.com Fri Oct 5 10:26:43 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 5 Oct 2007 15:26:43 +0100
Subject: [Biopython-dev] Configuration files
Message-ID: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>

Hi,

Is there any (Biopython standard) way to configure Biopython at runtime? When writing code sometimes I think it would be very convenient (especially to the programmer using Biopython) to abstract some configuration parameters away from the code. Things like the location of binaries, hosts, user names (and maybe passwords) of databases, timeout parameters, etc. These could be stored in a configuration file (or registry entry, or whatever), thus saving users from having to supply these in the code...
Just an idea...

Tiago
--
http://www.tiago.org/ps

From bugzilla-daemon at portal.open-bio.org Mon Oct 8 07:14:30 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Oct 2007 07:14:30 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710081114.l98BEUZh019757@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2361

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #759 is|0                           |1
           obsolete|                            |

------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-08 07:14 EST -------
(From update of attachment 759)
Applied these changes to CVS.
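
The idea raised in the "Configuration files" thread (keeping executable locations, hosts and timeouts out of the code) could be sketched with the standard library's `configparser` module. Everything specific here is an assumption for illustration: the function name `load_settings`, the `[blast]` section, its keys and the default values are hypothetical and are not an existing Biopython facility.

```python
import configparser

# Hypothetical built-in defaults, used when no configuration file overrides them.
DEFAULTS = {
    "blast": {"executable": "blastall", "timeout": "30"},
}


def load_settings(paths, defaults=DEFAULTS):
    """Read INI-style settings from `paths`, layered on top of `defaults`.

    `paths` is a list of candidate file names; files that do not exist are
    skipped silently and later files override earlier ones.
    """
    parser = configparser.ConfigParser()
    parser.read_dict(defaults)
    parser.read(paths)
    return parser


if __name__ == "__main__":
    # Hypothetical file names: a site-wide file first, then a per-directory one.
    settings = load_settings(["/etc/example_tool.ini", "example_tool.ini"])
    print(settings.get("blast", "executable"))
    print(settings.getint("blast", "timeout"))
```

Layering a per-user or per-machine file over built-in defaults is one way to avoid hard-coding machine-specific locations in each script; storing passwords in such a file would need more care than this sketch takes.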

From biopython-dev at maubp.freeserve.co.uk Mon Oct 8 06:52:48 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 08 Oct 2007 11:52:48 +0100
Subject: [Biopython-dev] Configuration files
In-Reply-To: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
References: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
Message-ID: <470A0C00.50505@maubp.freeserve.co.uk>

Tiago Antão wrote:
> Hi,
>
> Is there any (Biopython standard) way to configure Biopython at
> runtime? When writing code sometimes I think it would be very
> convenient (especially to the programmer using Biopython) to abstract
> some configuration parameters away from the code. Things like the
> location of binaries, hosts, user names (and maybe passwords) of
> databases, timeout parameters, etc. These could be stored in a
> configuration file (or registry entry, or whatever), thus saving users
> from having to supply these in the code...
> Just an idea...

This sounds like a fairly general thing (i.e. for all of python) rather than being Biopython specific. For example, I find a lot of my scripts have a few if statements at the top setting locations of files and executables based on which user/machine I'm running on (I use both Windows and a couple of Linux boxes with different user names).

e.g. Where are the blast executables, the blast databases, and my genome collection, ...
Peter From bugzilla-daemon at portal.open-bio.org Mon Oct 8 07:30:03 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Oct 2007 07:30:03 -0400 Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0 In-Reply-To: Message-ID: <200710081130.l98BU36u021016@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2361 ------- Comment #34 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-08 07:30 EST ------- Recap, most of the issues were resolved by switching Bio.Fasta from Martel to pure python. Additionally: test_Fasta - 'fixed' by deprecating the Mindy indexing functions test_KEGG - fixed by switching from Martel to pure python test_format_registry - 'fixed' by removing FormatIO test_geo - fixed by switching from Martel to pure python test_GenBankFormat - this entire test is for the little-used Martel GenBank expression, and this works with mxTextTools 2.0 but fails with mxTextTools 3.0 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mdehoon at c2b2.columbia.edu Tue Oct 9 00:34:28 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Tue, 9 Oct 2007 00:34:28 -0400 Subject: [Biopython-dev] Output of Biopython tests Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> Hi everybody, With the help of several Biopython developers, especially Peter, the problems with Martel and the new mxTextTools release have now been solved (in the sense that all unit tests now succeed). So we're a lot closer to a new Biopython release. Thanks everybody! When I was running the Biopython tests, one thing bothered me though. All Biopython tests now have a corresponding output file that contains the output the test should generate if it runs correctly. 
For some tests, this makes perfect sense, particularly if the output is large. For others, on the other hand, having the test output explicitly in a file doesn't actually add much information. For example, the output for test_psw is test_psw test_AlignmentColumn_assertions (test_psw.TestPSW) ... ok test_AlignmentColumn_full (test_psw.TestPSW) ... ok test_AlignmentColumn_kinds (test_psw.TestPSW) ... ok test_AlignmentColumn_repr (test_psw.TestPSW) ... ok test_Alignment_assertions (test_psw.TestPSW) ... ok test_Alignment_normal (test_psw.TestPSW) ... ok test_ColumnUnit (test_psw.TestPSW) ... ok Doctest: Bio.Wise.psw.parse_line ... ok ---------------------------------------------------------------------- Ran 8 tests in 0.002s OK For comparison, this is the test output if test_psw.py fails: test_AlignmentColumn_assertions (__main__.TestPSW) ... ok test_AlignmentColumn_full (__main__.TestPSW) ... ok test_AlignmentColumn_kinds (__main__.TestPSW) ... FAIL test_AlignmentColumn_repr (__main__.TestPSW) ... ok test_Alignment_assertions (__main__.TestPSW) ... ok test_Alignment_normal (__main__.TestPSW) ... ok test_ColumnUnit (__main__.TestPSW) ... ok Doctest: Bio.Wise.psw.parse_line ... ok ====================================================================== FAIL: test_AlignmentColumn_kinds (__main__.TestPSW) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_psw.py", line 47, in test_AlignmentColumn_kinds self.assertEqual(ac.kind, "some_funny_output_I_made_up_instead_of_INSERT") AssertionError: 'INSERT' != 'some_funny_output_I_made_up_instead_of_INSERT' ---------------------------------------------------------------------- Ran 8 tests in 0.000s The point is that for this test, having the output explicitly is not needed in order to identify the problem. Now, for some tests having the output explicitly actually causes a problem. 
I'm thinking about those unit tests that only run if some particular software is installed on the system (for example, SQL). In those cases, we need to distinguish failure due to missing software from a true failure (the former may not bother the user much if he's not interested in that particular part of Biopython). If a test cannot be run because of missing prerequisites, currently a unit test generates an ImportError, which is then caught inside run_tests. Hence, we get the following output when running the Biopython tests:

test_BioSQL ... Skipping test because of import error: Skipping BioSQL tests -- enable tests in Tests/test_BioSQL.py ok

When you look inside test_BioSQL.py, you'll see that the actual error is not an ImportError. In addition, if a true ImportError occurs during the test, the test will inadvertently be treated as skipped.

My solution would be to skip tests inside test_BioSQL if the prerequisites are not met. However, in that case the test output no longer agrees with the expected test output, generating a failure message.

I'd therefore like to suggest the following:

1) Keep the test output, but let each test_* script (instead of run_tests.py) be responsible for comparing the test output with the expected output.
2) If the expected output is trivial, simply use assert statements to verify the test output instead of storing it in a file and reading it from there.

Any objections?

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mhobbs_of_lawson at bigpond.com Mon Oct 8 22:18:39 2007
From: mhobbs_of_lawson at bigpond.com (mhobbs_of_lawson)
Date: Tue, 9 Oct 2007 12:18:39 +1000
Subject: [Biopython-dev] translate
Message-ID: <5496247.1191896319102.JavaMail.root@web06sl>

Hi,

Please can someone tell me what is wrong here. I simply want to be able to translate ambiguous DNA which includes an 'NNN' triplet.
Thanks,
Matthew

>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio import Translate
>>> s = "NNNTCAAAAAGGTGCATCTAGATG"
>>> dna = Seq.Seq(s, IUPAC.ambiguous_dna)
>>> trans = Translate.ambiguous_dna_by_id[1]
>>> print trans.translate(dna)
Traceback (most recent call last):
  File "", line 1, in
  File "/cygdrive/c/Python24/Lib/site-packages/Bio/Translate.py", line 20, in translate
    append(get(s[i:i+3], stop_symbol))
  File "/cygdrive/c/Python24/Lib/site-packages/Bio/Data/CodonTable.py", line 544, in get
    return self.__getitem__(codon)
  File "/cygdrive/c/Python24/Lib/site-packages/Bio/Data/CodonTable.py", line 577, in __getitem__
    raise TranslationError, codon # does not code
Bio.Data.CodonTable.TranslationError: NNN

From biopython-dev at maubp.freeserve.co.uk Tue Oct 9 07:54:29 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 09 Oct 2007 12:54:29 +0100
Subject: [Biopython-dev] translate
In-Reply-To: <5496247.1191896319102.JavaMail.root@web06sl>
References: <5496247.1191896319102.JavaMail.root@web06sl>
Message-ID: <470B6BF5.607@maubp.freeserve.co.uk>

mhobbs_of_lawson wrote:
> Hi,
>
> Please can someone tell me what is wrong here. I simply want to be able to translate ambiguous DNA which includes an 'NNN' triplet.

A very reasonable request. I assume you expect just an X for an NNN codon?

I have the general impression that some of Biopython's handling of ambiguous sequences isn't all wonderful... something I have started to tackle in bug 2366:

http://bugzilla.open-bio.org/show_bug.cgi?id=2366

Obviously sequence manipulation is a core bit of functionality - and I would like at least one other person to comment on that code before I risk committing it ;)

Translation of ambiguous codons would be next on my hit list... as right now it doesn't seem to do what I would expect at all.

In the short term, manually adding additional mappings to the forward table (a python dictionary) would probably "fix" your specific issue.
While we are on this topic, we use "*" for stop codons and "X" for an ambiguous amino acid - but is anyone aware of a character convention for something that might be either a stop codon or an amino acid? (other than just using "X" for this too)? Peter From biopython-dev at maubp.freeserve.co.uk Tue Oct 9 07:44:01 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 09 Oct 2007 12:44:01 +0100 Subject: [Biopython-dev] Output of Biopython tests In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> Message-ID: <470B6981.3020707@maubp.freeserve.co.uk> Michiel De Hoon wrote: > When I was running the Biopython tests, one thing bothered me though. > All Biopython tests now have a corresponding output file that > contains the output the test should generate if it runs correctly. > For some tests, this makes perfect sense, particularly if the output > is large. For others, on the other hand, having the test output > explicitly in a file doesn't actually add much information. Is this actually a problem? It gives us a simple unified test framework where developers can use whatever fancy test frameworks they want to. Personally I have tried to write simple scripts with meaningful output (plus often additional assertions). I think that because these are very simple, they can double as examples/documentation for the curious. My personal view is that some of the "fancy frameworks" used in some test cases are very intimidating to a beginner (and act as a barrier to taking the code and modifying it for their own use). > The point is that for this test, having the output explicitly is not > needed in order to identify the problem. True. I would have written that particular test to give some meaningful output; I find it makes it easier to start debugging why a test fails. > Now, for some tests having the output explicitly actually causes a > problem. 
I'm thinking about those unit tests that only run if some > particular software is installed on the system (for example, SQL). In > those cases, we need to distinguish failure due to missing software > from a true failure (the former may not bother the user much if he's > not interested in that particular part of Biopython). If a test > cannot be run because of missing prerequisites, currently a unit test > generates an ImportError, which is then caught inside run_tests. > ... > When you look inside test_BioSQL.py, you'll see that the actual error > is not an ImportError. In addition, if a true ImportError occurs > during the test, the test will inadvertently be treated as skipped. Perhaps we should introduce a MissingExternalDependency error instead, used for this specific case, and catch that in run_tests.py, while treating ImportError as a real error. As you say, if we have done some dramatic restructuring (such as removing a module) there could be some REAL ImportErrors which we might risk ignoring. > I'd therefore like to suggest the following: > 1) Keep the test output, but let each test_* script (instead of > run_tests.py) be responsible of comparing the test output with the > expected output. I'm not keen on that - it means duplication of code (or at least some common functionality to call) and makes writing simple tests that little bit harder. I like the fact that the more verbose test scripts can be run on their own as an example of what the module can do. > 2) If the expected output is trivial, simply use the assert > statements to verify the test output instead of storing them in a > file and reading them from there. By all means, test trivial output with assertions. I already do this within many of my "verbose" tests where I want to keep the console output reasonably short. 
Peter

From tiagoantao at gmail.com Tue Oct 9 10:27:18 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 9 Oct 2007 15:27:18 +0100
Subject: [Biopython-dev] Configuration files
In-Reply-To: <470A0C00.50505@maubp.freeserve.co.uk>
References: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com> <470A0C00.50505@maubp.freeserve.co.uk>
Message-ID: <6d941f120710090727m787c08abn13665c662727446c@mail.gmail.com>

Would it be interesting to have something like

config = Bio.Config.getConfig()
fdist_path = config['PopGen.FDistDir']

Something that:
1. Would allow for a standard configuration mechanism (as opposed to having different styles for each module/author)
2. Would abstract away how the configuration is stored (registry, conf file, ...)

If there was an agreement on doing this (or something along these lines), I would volunteer the time to do it.

On 10/8/07, Peter wrote:
> Tiago Antão wrote:
> > Hi,
> >
> > Is there any (Biopython standard) way to configure Biopython during
> > runtime? When writing code sometimes I think it would be very
> > convenient (especially to the programmer using Biopython) to abstract
> > some configuration parameters away from the code. Things like the
> > location of binaries, hosts, user names (and maybe passwords) of
> > databases, timeout parameters, etc. These could be stored on a
> > configuration file (or registry entry, or whatever) thus saving users
> > to have to deal in the code with supplying these...
> > Just an idea...
>
> This sounds like a fairly general thing (i.e. for all of python) rather
> than being Biopython specific.
>
> For example, I find a lot of my scripts have a few if statements at the
> top setting locations of files and executables based on which
> user/machine I'm running on (I use both Windows and a couple of Linux
> boxes with different user names).
>
> e.g. Where are the blast executables, the blast databases, and my genome
> collection, ...
> > Peter > > -- http://www.tiago.org/ps From mhobbs_of_lawson at bigpond.com Tue Oct 9 19:07:43 2007 From: mhobbs_of_lawson at bigpond.com (Matthew Hobbs) Date: Wed, 10 Oct 2007 09:07:43 +1000 Subject: [Biopython-dev] translate In-Reply-To: <470B6BF5.607@maubp.freeserve.co.uk> References: <5496247.1191896319102.JavaMail.root@web06sl> <470B6BF5.607@maubp.freeserve.co.uk> Message-ID: <470C09BF.8050906@bigpond.com> Thanks Peter for your reply. Peter wrote: > mhobbs_of_lawson wrote: >> Please can someone tell me what is wrong here. I simply want to be >> able to translate ambiguous DNA which includes an 'NNN' triplet. > > A very reasonable request. I assume you expect just an X for an NNN codon? yep > In the short term, manually adding additional mappings to the forward > table (a python dictionary) would probably "fix" your specific issue. OK - so this works: from Bio import Seq from Bio.Alphabet import IUPAC from Bio import Translate s = "NNNTCAAAAAGGTGCATCTAGATG" dna = Seq.Seq(s, IUPAC.ambiguous_dna) trans = Translate.ambiguous_dna_by_id[1] trans.table.forward_table.forward_table['NNN'] = 'X' print trans.translate(dna) > While we are on this topic, we use "*" for stop codons and "X" for an > ambiguous amino acid - but is anyone aware of a character convention for > something that might be either a stop codon or an amino acid? (other > than just using "X" for this too)? 
No I don't know Thanks, Matthew From mdehoon at c2b2.columbia.edu Thu Oct 11 06:31:59 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Thu, 11 Oct 2007 06:31:59 -0400 Subject: [Biopython-dev] Output of Biopython tests References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> <470B6981.3020707@maubp.freeserve.co.uk> Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu> > Perhaps we should introduce a MissingExternalDependency error instead, > used for this specific case, and catch that in run_tests.py, while > treating ImportError as a real error. OK. I added a MissingExternalDependencyError exception to Bio/__init__.py, and modified BioSQL, Bio.GFF, and some test scripts accordingly. When MissingExternalDependencyError occurs in a test, a warning is printed but it is not counted as a failure. --Michiel. Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1150 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Thu Oct 11 06:44:56 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Thu, 11 Oct 2007 06:44:56 -0400 Subject: [Biopython-dev] function enumerate in Bio/GFF/GenericTools.py; Bio/DocSQL.py Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B637@mail2.exch.c2b2.columbia.edu> Do we still need the function "enumerate" in Bio/GFF/GenericTools.py and Bio/DocSQL.py? AFAICT, this function does exactly the same as the Python built-in enumerate function. --Michiel. 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Thu Oct 11 16:44:46 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Oct 2007 21:44:46 +0100
Subject: [Biopython-dev] Revised tutorial
Message-ID: <470E8B3E.6080709@maubp.freeserve.co.uk>

In anticipation of the next release, I've done some more work on the tutorial today -- in particular the section on the Seq object which I have turned into a new chapter.

If anyone has the time to go over this soon that would be great. I'll be away tomorrow (Friday) but will probably have time to make any revisions needed at the weekend.
It's here in CVS:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/Tutorial.tex?cvsroot=biopython

This is a LaTeX file which gets turned into the PDF and HTML versions of the tutorial using pdflatex and hevea. If you want to proofread but don't know anything about LaTeX then I can probably email you the PDF version for comment (half a megabyte).

Peter

From sbassi at gmail.com Thu Oct 11 18:48:39 2007
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu, 11 Oct 2007 19:48:39 -0300
Subject: [Biopython-dev] Revised tutorial
In-Reply-To: <470E8B3E.6080709@maubp.freeserve.co.uk>
References: <470E8B3E.6080709@maubp.freeserve.co.uk>
Message-ID:

Hello,

I can't resolve all the dependencies to install hevea so I can't generate the dvi from the tex file. Could you please send me by email the final PDF?

Best,
SB.

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318

From mdehoon at c2b2.columbia.edu Thu Oct 11 21:53:19 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 11 Oct 2007 21:53:19 -0400
Subject: [Biopython-dev] Output of Biopython tests
References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> <470B6981.3020707@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu> <470E3E7E.1000301@maubp.freeserve.co.uk>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B638@mail2.exch.c2b2.columbia.edu>

Peter wrote:
> Michiel De Hoon wrote:
> > OK. I added a MissingExternalDependencyError exception to Bio/__init__.py,
> > and modified BioSQL, Bio.GFF, and some test scripts accordingly. When
> > MissingExternalDependencyError occurs in a test, a warning is printed but it
> > is not counted as a failure.
>
> I might have defined the exception within the test framework rather than
> Bio/__init__.py, but now that it's there we can start to use in things
> like modules that wrap external tools.

That is why I put it in Bio/__init__.py; Bio/GFF/__init__.py is already using this exception (outside of the testing framework).

> I've updated Tests/requires_internet.py and Test/requires_wise.py to
> match (I don't have wise on my machine which is why I noticed it still
> threw an ImportError).

Thanks! I missed those.

> Is there anything I can do to help get things ready for the release of
> Biopython 1.44?

At some point, somebody will need to go through the documentation to check if everything documented there still works with the Biopython in CVS, and to remove sections in the documentation describing deprecated code. But it's probably better to wait until after we decide what to do with test_GenBankFormat.

> If you do have time to give the patch on bug 2366 a check, I think it
> would be worth including before the next release.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2366

No time to check it. But I'd be happy to rely on your judgement and include it.

--Michiel.
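The distinction discussed in this thread (a missing external dependency is reported as a skip, while an ImportError is treated as a real failure) could look roughly like the sketch below. Only the exception name `MissingExternalDependencyError` comes from the messages above; the class is redefined locally so the example runs on its own, and `run_one_test` and the three sample tests are hypothetical, not the actual run_tests.py code.

```python
class MissingExternalDependencyError(Exception):
    """Local stand-in for the exception the thread says was added to Bio/__init__.py."""


def run_one_test(name, test_func):
    """Run one test callable and return (status, report_line)."""
    try:
        test_func()
    except MissingExternalDependencyError as err:
        # Missing optional software: warn, but do not count it as a failure.
        return "skipped", "%s ... skipping: %s" % (name, err)
    except Exception as err:
        # Everything else, including ImportError, is a genuine failure.
        return "failed", "%s ... FAIL (%s: %s)" % (name, type(err).__name__, err)
    return "ok", "%s ... ok" % name


def needs_database():
    raise MissingExternalDependencyError("no database driver installed")


def broken_import():
    raise ImportError("No module named Bio.Removed")


def passes():
    assert 1 + 1 == 2


results = {}
for name, func in [("test_db", needs_database),
                   ("test_import", broken_import),
                   ("test_ok", passes)]:
    status, line = run_one_test(name, func)
    results[name] = status
    print(line)
```

With a dedicated exception type, a module that was removed or renamed by mistake surfaces as a failure instead of being silently counted as a skipped test.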
From bugzilla-daemon at portal.open-bio.org Thu Oct 11 22:32:05 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 11 Oct 2007 22:32:05 -0400 Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with egenix mxTextTools 3.0 In-Reply-To: Message-ID: <200710120232.l9C2W5e9022504@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2361 ------- Comment #35 from mdehoon at ims.u-tokyo.ac.jp 2007-10-11 22:32 EST ------- > test_GenBankFormat - this entire test is for the little-used Martel GenBank > expression, and this works with mxTextTools 2.0 but fails with mxTextTools 3.0 If it's little-used, should we include it for the next release or can it be removed? If we remove the test, should we then also remove the corresponding module? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Thu Oct 11 16:37:52 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Oct 2007 21:37:52 +0100 Subject: [Biopython-dev] Output of Biopython tests In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu> References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> <470B6981.3020707@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu> Message-ID: <470E89A0.1010502@maubp.freeserve.co.uk> Michiel De Hoon wrote: >> Perhaps we should introduce a MissingExternalDependency error instead, >> used for this specific case, and catch that in run_tests.py, while >> treating ImportError as a real error. > > OK. I added a MissingExternalDependencyError exception to Bio/__init__.py, > and modified BioSQL, Bio.GFF, and some test scripts accordingly. 
When > MissingExternalDependencyError occurs in a test, a warning is printed but it > is not counted as a failure.

I might have defined the exception within the test framework rather than Bio/__init__.py, but now that it's there we can start to use it in things like modules that wrap external tools.

I've updated Tests/requires_internet.py and Test/requires_wise.py to match (I don't have wise on my machine which is why I noticed it still threw an ImportError). This means run_tests.py now runs without errors using CVS on my 64 bit Linux machine (bar the mxTextTools 3.0 issue with test_GenBankFormat.py (bug 2361)).

Is there anything I can do to help get things ready for the release of Biopython 1.44?

If you do have time to give the patch on bug 2366 a check, I think it would be worth including before the next release.

http://bugzilla.open-bio.org/show_bug.cgi?id=2366

Peter

From fennan at gmail.com Mon Oct 15 05:48:45 2007
From: fennan at gmail.com (Fernando)
Date: Mon, 15 Oct 2007 11:48:45 +0200
Subject: [Biopython-dev] Database into variables
Message-ID: <7b13e61d0710150248v72a550d6h38e1467edf5073eb@mail.gmail.com>

Hi everybody,

I am thinking of including some algorithms that I work with into biopython. My first concern is that I'm using a local image of the Gene Ontology database to perform several operations. In order to avoid such database accesses I could precompute the information I need and load it once the module is called. How should I do it? Is there a guideline style to load external variables or something like that? Any other ideas/suggestions?

Thanks

From bugzilla-daemon at portal.open-bio.org Mon Oct 15 07:11:26 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Oct 2007 07:11:26 -0400
Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in (Reverse)complement functions in Bio.Seq
In-Reply-To:
Message-ID: <200710151111.l9FBBQOE012625@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2366

tiagoantao at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tiagoantao at gmail.com

------- Comment #3 from tiagoantao at gmail.com 2007-10-15 07:11 EST -------
I had a look at the test code and tried to find which test case is changing the ambiguous_dna dict. I used this little script (putting it here as it might be useful for detecting these types of problems):

for i in test_*py; do python run_tests.py $i; done

It turns out that it is test_Nexus.py. Further inspection of the code seems to reveal that it is not the test case that pollutes the dictionary but the Nexus module itself.

Maybe it makes sense to raise a bug on the Nexus module...

Any comments on these findings?

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
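The kind of problem described in the comment above, where one piece of code edits a shared module-level dictionary in place and later code sees the changed contents, can also be hunted from inside Python. The following is only a sketch under stated assumptions: `find_polluters`, the sample dictionary and the two sample steps are invented for illustration and are not taken from Bio.Nexus or the Biopython test suite.

```python
import copy


def find_polluters(shared, steps):
    """Run each (name, func) step and return the names of steps that changed `shared`."""
    polluters = []
    for name, func in steps:
        before = copy.deepcopy(shared)
        func()
        if shared != before:
            polluters.append(name)
            # Restore the snapshot so later steps are judged independently.
            shared.clear()
            shared.update(before)
    return polluters


# Hypothetical stand-in for a module-level ambiguity table.
ambiguous_values = {"N": "ACGT", "R": "AG"}


def well_behaved():
    local = dict(ambiguous_values)  # works on a private copy
    local["?"] = "ACGT"


def polluting():
    ambiguous_values["?"] = "ACGT"  # edits the shared table in place


result = find_polluters(ambiguous_values,
                        [("well_behaved", well_behaved), ("polluting", polluting)])
print(result)
```

The `well_behaved` step also shows the usual remedy: take a copy of the shared table before adding extra symbols to it.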
From bugzilla-daemon at portal.open-bio.org Mon Oct 15 10:16:00 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Oct 2007 10:16:00 -0400 Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in (Reverse)complement functions in Bio.Seq In-Reply-To: Message-ID: <200710151416.l9FEG01A023797@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2366 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-15 10:16 EST ------- Thanks for that Tiago, I guess we should file a bug on Bio.Nexus on the alphabet issue; It may be that it should create a copy or subclass of the ambiguous DNA alphabet in order to include "?" (I imagine that Nexus uses this rather than "N"), and see if it is using the Gapped() alphabet system or not. Did you have any comments on this patch for (reverse) complements? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jflatow at northwestern.edu Mon Oct 15 20:08:13 2007 From: jflatow at northwestern.edu (Jared Flatow) Date: Mon, 15 Oct 2007 19:08:13 -0500 Subject: [Biopython-dev] Biopython status Message-ID: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> Hi all, I've just started using Biopython and I am wondering about the status of the group, since I've heard rumors that its dying. So far I have found the library very useful, if not at times frustrating, though I will admit I am fairly new to developing python as well. I have been hesitant to make changes to existing code, however I have found that in a few cases it has been by far the best way to accomplish what I need, and have only done so in cases where it seems to be the *right* thing to do. With that in mind, I have a few questions I was hoping you all could answer. 
First, how might I put these changes up for review in order to contribute back to the code base? The main changes have been to the AlignAce parser, since as it was it just ignored information contained in the alignace file regarding the motif instances (namely which input sequence they came from, where they started in the sequence, and what strand they were on). I have also needed to create a modified FASTA parser so that I can read things like quality score files. I would be happy to submit the changes to the group or an individual for inspection, but I would like to avoid having to maintain my own separate version of Biopython if possible. I am also wondering how it would be received if I did something like add a to_fasta method to SeqRecord instead of having to go through writing it to a file using a SeqIO when all I want is the string. Finally, are there plans to move to a subversion repository at any point? Thanks! Jared Flatow From sbassi at gmail.com Tue Oct 16 01:09:16 2007 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 16 Oct 2007 02:09:16 -0300 Subject: [Biopython-dev] Biopython status In-Reply-To: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> Message-ID: On 10/15/07, Jared Flatow wrote: > I've just started using Biopython and I am wondering about the status > of the group, since I've heard rumors that its dying. So far I have You could subscribe to the rss feed of the CVS and you will see a lot of activity. 
The developers list and the bug tracker (Bugzilla) are also pretty busy; that doesn't look like a dying group to me :) -- Molecular biology course for programmers: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Lriser: http://www.linspire.com/lraiser_success.php?serial=318 From mdehoon at c2b2.columbia.edu Tue Oct 16 01:37:14 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Tue, 16 Oct 2007 01:37:14 -0400 Subject: [Biopython-dev] Biopython status References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu> Hi Jared, > I've just started using Biopython and I am wondering about the status > of the group, since I've heard rumors that its dying. From looking at the activity on the Biopython mailing lists in recent months, it doesn't seem to be dying :-). > So far I have found the library very useful, if not at times frustrating, > though I will admit I am fairly new to developing python as well. One thing to keep in mind is that Biopython started about eight years ago, and some approaches that seemed to be a good idea at that time may not seem so now. Nevertheless, I feel that Biopython is moving in the right direction in terms of ease of use. > First, how might I put these changes up for review in order > to contribute back to the code base? The main changes have been to > the AlignAce parser, since as it was it just ignored information > contained in the alignace file regarding the motif instances (namely > which input sequence they came from, where they started in the > sequence, and what strand they were on). In this case, it is a good idea to contact the current maintainer of Bio.AlignAce, either via the mailing list or directly. From the Biopython CVS, it seems that Bartek is currently the main maintainer of Bio.AlignAce, so it would be a good idea to discuss the changes with him.
> I have also needed to create a modified FASTA parser so that I > can read things like quality score files. At some point, Biopython had several (two or three?) Fasta parsers, two Fasta formats, etc. This is a situation we should definitely avoid. So if your modifications fit in well with the existing Fasta parser in Bio.SeqIO, it may very well be accepted into Biopython. Otherwise, it's better to leave it out. This is just my opinion though. > I am also wondering how it would be received if I did something like > add a to_fasta method to SeqRecord instead of having to go through > writing it to a file using a SeqIO when all I want is the string. This sounds like feature creep to me, so I would be against it. It's easy to add code to Biopython, it's much harder to remove stuff. Code bloat is a real problem in Biopython. > Finally, are there plans to move to a subversion repository at any > point? There were some plans at some point, but I don't know the current status. Best, --Michiel. Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1150 St Nicholas Avenue New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 04:16:01 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 09:16:01 +0100 Subject: [Biopython-dev] Biopython status In-Reply-To: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> Message-ID: <47147341.4020708@maubp.freeserve.co.uk> Jared Flatow wrote: > I have also needed to create a modified FASTA parser so that I can > read things like quality score files. Could you be a little more specific - what exactly do you mean by a quality score files (links and/or examples).
It may be that this warrants setting up a new file format in Bio.SeqIO > I would be happy to submit the changes to the group or an individual > for inspection, but I would like to avoid having to maintain my own > separate version of Biopython if possible. As has already been said - please file some (enhancement) bugs and attach your patches, or raise specific issues for discussion on this mailing list. Depending on the nature of your changes, you might be able to achieve some of them by subclassing Biopython's objects - rather than literally maintaining your own branch of the project. > I am also wondering how it would be received if I did something like > add a to_fasta method to SeqRecord instead of having to go through > writing it to a file using a SeqIO when all I want is the string. Out of interest, why do you want to create a FASTA record as a string? Did you know you can write to a string using any Bio.SeqIO supported file format using StringIO? Perhaps we should spell this out more explicitly in the documentation, but a motivating example would help. I would suggest rather than adding a to_fasta method to the SeqRecord, simply write your own "seqrecord_to_string" function (or create a subclass of SeqRecord with this method). > Finally, are there plans to move to a subversion repository at any > point? It was raised a while ago, and our cunning plan was to let BioPerl try the move first. Once that has been proven, it should be fairly easy for the OBF guys to also move us over. I should email them to see how things stand... 
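The "write to a string via StringIO" suggestion above can be sketched as follows. This is only an illustration of the pattern, not Biopython's own code: `write_fasta` is a hypothetical stand-in for a handle-based writer such as `Bio.SeqIO.write`, records are plain `(identifier, sequence)` tuples rather than SeqRecord objects, and `io.StringIO` is the modern spelling of the `StringIO` module of the time.

```python
from io import StringIO


def write_fasta(records, handle, width=60):
    """Hypothetical stand-in for a handle-based writer.

    Writes (identifier, sequence) pairs to any file-like object
    and returns the number of records written.
    """
    count = 0
    for identifier, sequence in records:
        handle.write(">%s\n" % identifier)
        # Wrap the sequence at a fixed line width, as FASTA writers usually do.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


def seqrecord_to_string(identifier, sequence):
    """Return one record as a FASTA-formatted string, without touching disk."""
    handle = StringIO()          # in-memory "file"
    write_fasta([(identifier, sequence)], handle)
    return handle.getvalue()
```

Because the writer only needs an object with a `write` method, the same function serves both real files and in-memory strings, which is why a dedicated `to_fasta` method is not strictly required.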
Peter From bartek at rezolwenta.eu.org Tue Oct 16 05:11:01 2007 From: bartek at rezolwenta.eu.org (bartek wilczynski) Date: Tue, 16 Oct 2007 11:11:01 +0200 Subject: [Biopython-dev] Biopython status In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu> Message-ID: <1192525861.4714802535dae@imp.rezolwenta.eu.org> Michiel De Hoon wrote: > > First, how might I put these changes up for review in order > > to contribute back to the code base? The main changes have been to > > the AlignAce parser, since as it was it just ignored information > > contained in the alignace file regarding the motif instances (namely > > which input sequence they came from, where they started in the > > sequence, and what strand they were on). > > In this case, it is a good idea to contact the current maintainer of > Bio.AlignAce, either via the mailing list or directly. From the Biopython > CVS, it seems that Bartek is currently the main maintainer of Bio.AlignAce, > so it would be a good idea to discuss with him. I'm not dying either ;). I'm the author of the Bio.AlignAce module and if you have any new code to contribute to it, I'll be glad to help you. The best way to do it would be to submit an enhancement bug report in bugzilla. If the changes are smaller, you can just send them (as a diff) to the list and I'll try to fit them to the current cvs version of Bio.AlignAce Bartek Wilczynski From bugzilla-daemon at portal.open-bio.org Tue Oct 16 05:55:37 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 05:55:37 -0400 Subject: [Biopython-dev] [Bug 2380] New: Bio.Nexus is adding "?" and "-" to Bio.Data.IUPACData.ambiguous_dna_values Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2380 Summary: Bio.Nexus is adding "?" 
and "-" to Bio.Data.IUPACData.ambiguous_dna_values Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk This issue was raised in Bug 2366 where a unit test was found to be "polluting" ambiguous_dna_values, later identified as Bio.Nexus via test_Nexus.py Need to see if Bio.Nexus should be making a copy of this dict, or perhaps defining a subclass of the alphabet (using the Gapped() class maybe). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 05:56:37 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 05:56:37 -0400 Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in (Reverse)complement functions in Bio.Seq In-Reply-To: Message-ID: <200710160956.l9G9ub18007735@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2366 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 05:56 EST ------- Fix committed (after Michiel's OK on the mailing list), marking as fixed. 
Checking in Tests/test_seq.py;
/home/repository/biopython/biopython/Tests/test_seq.py,v <-- test_seq.py
new revision: 1.6; previous revision: 1.5
done
Checking in Tests/output/test_seq;
/home/repository/biopython/biopython/Tests/output/test_seq,v <-- test_seq
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Seq.py;
/home/repository/biopython/biopython/Bio/Seq.py,v <-- Seq.py
new revision: 1.17; previous revision: 1.16
done
I've filed Bug 2380 for the Nexus issue: Bio.Nexus is adding "?" and "-" to Bio.Data.IUPACData.ambiguous_dna_values -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:11:09 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 06:11:09 -0400 Subject: [Biopython-dev] [Bug 2381] New: translate and transcibe method for the the Seq object (in Bio.Seq) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2381 Summary: translate and transcibe method for the the Seq object (in Bio.Seq) Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Biopython has translation and transcription modules (Bio/Translate.py and Bio/Transcribe.py) but I find them a little bit complicated to use. There are module level functions translate, transcribe, and back_transcribe in Bio/Seq.py which take either a string, a Seq object or a MutableSeq object. I would like to add similar methods to the Seq object (also defined in Bio/Seq.py) to make this functionality more accessible from a Seq object. NOTE: Python strings have a translate method of their own which is rather different.
Having the Seq translate method do a biological translation makes sense. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:13:35 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 06:13:35 -0400 Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq) In-Reply-To: Message-ID: <200710161013.l9GADZtJ008751@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2381 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|translate and transcibe |translate and transcibe |method for the the Seq |methods for the Seq object |object (in Bio.Seq) |(in Bio.Seq) ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 06:13 EST ------- fixed typo in the bug summary -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:26:44 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 06:26:44 -0400 Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq) In-Reply-To: Message-ID: <200710161026.l9GAQixw009268@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2381 ------- Comment #2 from dalloliogm at gmail.com 2007-10-16 06:26 EST ------- I find it difficult to translate a sequence in the 6 reading frames with a single command. Currently I use something like this (for the forward frames, with seq being a Seq object):
for i in range(3):
    translate(seq[i:])
which is not very nice.
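For illustration, a six-frame translation can be sketched in plain Python as below. This is a minimal sketch, not the Bio.Seq or Bio.Translate implementation: it assumes unambiguous upper- or lower-case DNA, uses the standard genetic code (NCBI table 1) only, and the helper names and the +1..+3 / -1..-3 frame labels are conventions chosen here, not EMBOSS transeq's or Biopython's.

```python
# Standard genetic code (NCBI table 1), laid out in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def translate_frame(dna, frame=0):
    """Translate starting at offset `frame` (0, 1 or 2); trailing bases are dropped."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translation(dna):
    """Return {+1, +2, +3, -1, -2, -3: protein}; negative frames read the reverse complement."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate_frame(dna, offset)
        frames[-(offset + 1)] = translate_frame(reverse, offset)
    return frames
```

Keeping the frame handling in a helper like this leaves the basic translate call itself free of extra parameters.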
It would be nice to add a parameter to the translate function, as in the EMBOSS application transeq (http://emboss.sourceforge.net/apps/cvs/emboss/apps/transeq.html), something like this:
>>> a = Seq('CAGCTAGCT')
>>> a.translate()
[(translation of a in the frame 0)]
>>> a.translate(1)
[(translation of a in the frame 1)]
>>> a.translate('F')
[(translation of a in the 3 forward frames)]
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:46:47 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 06:46:47 -0400 Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq) In-Reply-To: Message-ID: <200710161046.l9GAklI6010391@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2381 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 06:46 EST ------- Doing a three/six frame translation is however fairly common, and perhaps warrants an "official" implementation in Bio.SeqUtils. My current inclination is to try to keep the Bio.Seq translation function as simple as possible. There are lots of possible options to worry about... catering to them all could make the translate method rather daunting. Perhaps things like the frame (or even the starting nucleotide) could be done in Bio.Translate only. Another "special case" example I personally would like is an option to check that the first codon is a valid start codon for the specified codon table, and to translate it as methionine (M). Then there is the question of whether Bio.Translate's "translate_to_stop" functionality should be exposed in a Seq method. Note there is yet another (!)
translation function Bio.SeqUtils.translate() which is frame aware [personally I would mark a lot of this module as deprecated]. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jflatow at northwestern.edu Tue Oct 16 12:02:19 2007 From: jflatow at northwestern.edu (Jared Flatow) Date: Tue, 16 Oct 2007 11:02:19 -0500 Subject: [Biopython-dev] Biopython status In-Reply-To: <47147341.4020708@maubp.freeserve.co.uk> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> Message-ID: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> Please forgive me for ever doubting your health, it seems the group is very much alive! On Oct 16, 2007, at 3:16 AM, Peter wrote: > Jared Flatow wrote: >> I have also needed to create a modified FASTA parser so that I can >> read things like quality score files. > > Could you be a little more specific - what exactly do you mean by a > quality score files (links and/or examples). It may be that this > warrants setting up a new file format in Bio.SeqIO That is what I did. The quality score files I meant are simply FASTA-like records that indicate the quality of each base pair read from a sequencing machine, on a scale of something like 1 to 64. The values are tab separated and correspond to 'reads' in another FASTA file that contains the actual sequences read. This is the way the 454 GSFlex machines output their sequencing reads, so for every set of reads there will be a pair of 454Reads.fna, 454Reads.qual files. The only difference between a parser that processes these qual files and one that processes the sequence files is that it shouldn't get rid of spaces, and the newlines should not be stripped but converted into spaces (when 454 writes a newline of scores they omit the space).
Essentially I have made a duplicate of FastaIO's iterator, named it something else, made these two small changes and put an entry for it in the SeqIO file.
16,17c16,17
< def GSQualIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
< """Generator function to iterate over GSFlex quality records (as SeqRecord objects).
---
> def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
> """Generator function to iterate over Fasta records (as SeqRecord objects).
54c54
< lines.append(line.rstrip()) # .replace(" ","")) leave off the replacing internal spaces so we can process qscore files (jf)
---
> lines.append(line.rstrip().replace(" ",""))
58c58
< yield SeqRecord(Seq(" ".join(lines), alphabet),
---
> yield SeqRecord(Seq("".join(lines), alphabet),
63a64,199
As you can see a parser like this might be useful for other FASTA-like formats as well and is in no way specific to the GS quality files (it's just a space-preserving parser). If it were to be implemented in Biopython you might call it something else. > >> I would be happy to submit the changes to the group or an individual >> for inspection, but I would like to avoid having to maintain my own >> separate version of Biopython if possible. > > As has already been said - please file some (enhancement) bugs and > attach your patches, or raise specific issues for discussion on this > mailing list. > > Depending on the nature of your changes, you might be able to achieve > some of them by subclassing Biopython's objects - rather than > literally > maintaining your own branch of the project. > >> I am also wondering how it would be received if I did something like >> add a to_fasta method to SeqRecord instead of having to go >> through writing it to a file using a SeqIO when all I want is the >> string. > > Out of interest, why do you want to create a FASTA record as a string? I am serving the fasta from a database of sequences dynamically via a web server.
> > Did you know you can write to a string using any Bio.SeqIO supported > file format using StringIO? Perhaps we should spell this out more > explicitly in the documentation, but a motivating example would help. This is what I do now, but it seems like a hack to me to go this route. To always have to write to a file feels strange, but I see that it would be messy to go OO since there are so many formats. However, giving preference to fasta over other formats by making it innate doesn't seem like such a terrible idea. I do have mixed feelings about 'bloating' the code which is why I asked, and you have convinced me that this is not quite appropriate given existing convention. However the idea would be to put the to_fasta or to_format method inside the SeqRecord, then to call it from the IO when needed to actually write to a file, but call it directly when all that is wanted is a string... > > I would suggest rather than adding a to_fasta method to the > SeqRecord, simply write your own "seqrecord_to_string" function (or > create a subclass of SeqRecord with this method). > I'll leave it alone for now until I can come up with a real proposal =) >> Finally, are there plans to move to a subversion repository at any >> point? > > It was raised a while ago, and our cunning plan was to let BioPerl try > the move first. Once that has been proven, it should be fairly > easy for > the OBF guys to also move us over. I should email them to see how > things stand... BioPerl seems to be the guinea pigs for everything. Leading the way on this might put a stop to those nasty rumors about Biopython. 
Best Regards, Jared From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:47:48 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 17:47:48 +0100 Subject: [Biopython-dev] CVS to SVN In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> Message-ID: <4714EB34.8000207@maubp.freeserve.co.uk> Jared wrote: > Leading the way on this ... [CVS to SVN] I would say one reason why we aren't charging ahead with a move from CVS to subversion is only a few posters on this mailing list actively WANT to move to subversion, and no-one has really championed the move (yet). I'm sure if we as a group wanted to this, then the OBF would be happy to assist. After all, moving us rather than BioPerl as the first CVS/SVN migration should be easier as we have a smaller code base. Peter From jflatow at northwestern.edu Tue Oct 16 14:46:53 2007 From: jflatow at northwestern.edu (Jared Flatow) Date: Tue, 16 Oct 2007 13:46:53 -0500 Subject: [Biopython-dev] 454 GSFlex quality score files In-Reply-To: <4714EBC7.1040504@maubp.freeserve.co.uk> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk> Message-ID: <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> Hi Peter, >>>> I have also needed to create a modified FASTA parser so that I >>>> can read things like quality score files. >>> >>> Could you be a little more specific - what exactly do you mean by a >>> quality score files (links and/or examples). It may be that this >>> warrants setting up a new file format in Bio.SeqIO >> That is what I did. 
The quality score files I meant are simply >> FASTA- like records that indicate the quality of each base pair >> read from a sequencing machine, on a scale of something like 1 to >> 64. The values are tab separated and correspond to 'reads' in >> another FASTA file that contain the actual sequences read. This >> is the way the 454 GSFlex machines output their sequencing reads, >> so for every set of reads there will be a pair of 454Reads.fna, >> 454Reads.qual files. The only difference between a parser that >> processes these qual files and one that processes the sequence >> files is that it shouldn't get rid of spaces, and the newlines >> should not to be stripped but converted into spaces (when 454 >> writes a newline of scores they omit the space). Essentially I >> have made a duplicate of FastaIOs iterator, named it something >> else, made these two small changes and put an entry for it in the >> SeqIO file. > > Patches and emails don't do well together. Could you file an > enhancement bug, and then upload your code as an attachment? If > you have a few examples of matched pairs of FASTA files and quality > files which you can contribute that would be very helpful too. > Yes I'll get on that. > It looks like you are trying to construct a "sequence" of numerical > values (rather than a sequence of letters like nucleotides/amino > acids). As written I don't think it would work for element access/ > splicing etc. However, with some extra work I suppose we could > stretch the Seq object in this way - and define a new > "IntegerAlphabet". > > But on balance, I don't think "lists of quality values" should be > treated in the same way as sequences (and thus it doesn't seem to > belong in Bio.SeqIO). > I agree. > Alternatively you could regard the quality scores as sequence meta- > data or annotation. 
One idea would be to generate SeqRecord > objects containing dummy sequences of the correct length made up of > the ambiguous character "N", with the associated quality scores > held as a list of integers in the SeqRecord's annotation > dictionary. Then it would fit into the Bio.SeqIO framework [I was > thinking of something similar for PTT files, NCBI Protein tables, > where again we have annotation but not the actual sequence]. I agree, and this way is most flexible. > > Maybe there should just be a separate parser for GSFlex quality > records which returns iterator giving each record name with a list > of integers. A more elegant scheme would read in the pair of files > together (the FASTA file and the quality file) and generate nicely > annotated SeqRecords with the sequence and the quality. This isn't > really possible with the Bio.SeqIO framework. > Yes, at first I liked this idea best, but it puts some constraints on the way these things are read in. Like if it is to be an iterator, you must have a guarantee that these files contain exactly the same sequences in exactly the same order. This seems like it could potentially be fine for the GSFlex files, but I wonder if there might somewhere down the line be use for quality information about sequences in other cases. If I am not mistaken, some sources use upper/lower case letters now to indicate a bistable degree of confidence in a sequence letter. In any event, this seems like an unnecessary restriction. The way I do it now is I load the reads into a database, then update the database when I read in a quality score file. I think Biopython should have a simple way of implementing something similar which can solve both our metadata problems. In Bio.Fasta there are Parsers which really belong in Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more general Fasta reader, nothing to do with sequences. 
It can iterate over a FASTA file using the '>' as the record separator, creating Record objects, much like it does now, except without processing them at all or assuming they are sequences.
>Record.header
Record.data
Now Bio.SeqIO.FastaIO can use Bio.Fasta to iterate over the Record objects in a file and transform them into SeqRecord objects. If you like, you can provide it with a function header_todict, which takes a string (in this case Record.header) and returns a dictionary, which gets unpacked and passed to the SeqRecord initializer. Basically the Bio.SeqIO.FastaIO returns a generator that looks something like this:
(SeqRecord(seq=cleanup(record.data), **header_todict(record.header))
 for record in Bio.Fasta.parse(file))
I can also use the Bio.Fasta.parse function now to parse my quality files and add them as metadata:
# I create an initial SeqRecord dictionary using the Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file, my_header_todict))
# Then I iterate over the sequences in the qual file and look them up in the seq_dict using the same header parsing function
# I passed to create my initial SeqRecords, setting the quality scores as I find them
for record in Bio.Fasta.parse(qual_file):
    seq_dict[my_header_todict(record.header)['id']].quality = my_qualitycleanup(record.data)
I hope that makes sense. The advantage to doing it this way is that I can reuse my header parsing function for both the sequence and the metadata, and I can do whatever I want with the fasta record data without writing a whole new parser. The SeqIO fasta parsing functions just make some default assumptions (like the data is a sequence). Let me know what you think.
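The proposal above can be sketched without Biopython as follows. This is an illustrative sketch only: `parse_records`, `header_to_id`, `clean_sequence`, `clean_quality` and `attach_qualities` are hypothetical names, the records are plain named tuples and dictionaries rather than Bio.Fasta.Record or SeqRecord objects, and the identifier is assumed to be the first whitespace-delimited word of a non-empty header.

```python
from collections import namedtuple
from io import StringIO

Record = namedtuple("Record", ["header", "data"])


def parse_records(handle):
    """Yield Record(header, data) for each '>' entry; data is the list of raw lines."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield Record(header, lines)
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)          # kept untouched: no assumption about content
    if header is not None:
        yield Record(header, lines)


def header_to_id(header):
    """Assumes a non-empty header whose first word is the identifier."""
    return header.split(None, 1)[0]


def clean_sequence(lines):
    """Sequence data: drop internal spaces and join lines directly."""
    return "".join(line.replace(" ", "") for line in lines)


def clean_quality(lines):
    """Quality data: a line break counts as a separator between scores."""
    return [int(value) for value in " ".join(lines).split()]


def attach_qualities(seq_handle, qual_handle):
    """Build {id: {'seq': ..., 'quality': ...}}; record order in the two files may differ."""
    reads = {
        header_to_id(record.header): {"seq": clean_sequence(record.data), "quality": None}
        for record in parse_records(seq_handle)
    }
    for record in parse_records(qual_handle):
        reads[header_to_id(record.header)]["quality"] = clean_quality(record.data)
    return reads
```

The header parsing and the data clean-up are separate functions, so the same record iterator serves both the sequence file and the quality file.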
Jared From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:50:15 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 17:50:15 +0100 Subject: [Biopython-dev] 454 GSFlex quality score files In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> Message-ID: <4714EBC7.1040504@maubp.freeserve.co.uk> Hi Jared, >>> I have also needed to create a modified FASTA parser so that I can >>> read things like quality score files. >> >> Could you be a little more specific - what exactly do you mean by a >> quality score files (links and/or examples). It may be that this >> warrants setting up a new file format in Bio.SeqIO > > That is what I did. The quality score files I meant are simply FASTA- > like records that indicate the quality of each base pair read from a > sequencing machine, on a scale of something like 1 to 64. The values > are tab separated and correspond to 'reads' in another FASTA file > that contain the actual sequences read. This is the way the 454 > GSFlex machines output their sequencing reads, so for every set of > reads there will be a pair of 454Reads.fna, 454Reads.qual files. The > only difference between a parser that processes these qual files and > one that processes the sequence files is that it shouldn't get rid of > spaces, and the newlines should not to be stripped but converted into > spaces (when 454 writes a newline of scores they omit the space). > Essentially I have made a duplicate of FastaIOs iterator, named it > something else, made these two small changes and put an entry for it > in the SeqIO file. Patches and emails don't do well together. Could you file an enhancement bug, and then upload your code as an attachment? 
If you have a few examples of matched pairs of FASTA files and quality files which you can contribute that would be very helpful too. It looks like you are trying to construct a "sequence" of numerical values (rather than a sequence of letters like nucleotides/amino acids). As written I don't think it would work for element access/splicing etc. However, with some extra work I suppose we could stretch the Seq object in this way - and define a new "IntegerAlphabet". But on balance, I don't think "lists of quality values" should be treated in the same way as sequences (and thus it doesn't seem to belong in Bio.SeqIO). Alternatively you could regard the quality scores as sequence meta-data or annotation. One idea would be to generate SeqRecord objects containing dummy sequences of the correct length made up of the ambiguous character "N", with the associated quality scores held as a list of integers in the SeqRecord's annotation dictionary. Then it would fit into the Bio.SeqIO framework [I was thinking of something similar for PTT files, NCBI Protein tables, where again we have annotation but not the actual sequence]. Maybe there should just be a separate parser for GSFlex quality records which returns iterator giving each record name with a list of integers. A more elegant scheme would read in the pair of files together (the FASTA file and the quality file) and generate nicely annotated SeqRecords with the sequence and the quality. This isn't really possible with the Bio.SeqIO framework. 
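The "read the pair of files together" idea can be sketched as below. This is a hypothetical, Biopython-free illustration: `read_fasta_like` and `paired_reads` are names invented here, and the sketch assumes both files list the same reads in the same order, raising an error as soon as that assumption fails.

```python
from io import StringIO
from itertools import zip_longest


def read_fasta_like(handle):
    """Yield (identifier, lines) for each '>' entry in a FASTA-like file."""
    identifier, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, lines
            fields = line[1:].split(None, 1)
            identifier, lines = (fields[0] if fields else ""), []
        elif identifier is not None and line:
            lines.append(line)
    if identifier is not None:
        yield identifier, lines


def paired_reads(seq_handle, qual_handle):
    """Yield (identifier, sequence, scores), walking both files in step."""
    pairs = zip_longest(read_fasta_like(seq_handle), read_fasta_like(qual_handle))
    for seq_entry, qual_entry in pairs:
        if seq_entry is None or qual_entry is None:
            raise ValueError("files contain different numbers of records")
        (seq_id, seq_lines), (qual_id, qual_lines) = seq_entry, qual_entry
        if seq_id != qual_id:
            raise ValueError("record mismatch: %r vs %r" % (seq_id, qual_id))
        sequence = "".join(seq_lines).replace(" ", "")
        scores = [int(value) for value in " ".join(qual_lines).split()]
        if len(scores) != len(sequence):
            raise ValueError("%s: %d scores for %d bases"
                             % (seq_id, len(scores), len(sequence)))
        yield seq_id, sequence, scores
```

Because it is a generator over both handles, nothing is held in memory beyond the current record, at the cost of requiring identical record order in the two files.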
Peter From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 15:33:54 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 20:33:54 +0100 Subject: [Biopython-dev] 454 GSFlex quality score files In-Reply-To: <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk> <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> Message-ID: <47151222.1060502@maubp.freeserve.co.uk> > In Bio.Fasta there are Parsers which really belong in > Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more > general Fasta reader, nothing to do with sequences. ... In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was thinking in a few releases time of suggesting its deprecation (but not just yet as for several years it was the best documented and most used parser in Biopython). If we do decided keep Bio.Fasta (or extend it), then perhaps Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta I'm still digressing your ideas to turn Bio.Fasta into a generic parser that copes with sequences, qualities scores, or anything else. Peter From bugzilla-daemon at portal.open-bio.org Tue Oct 16 15:57:35 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 15:57:35 -0400 Subject: [Biopython-dev] [Bug 2382] New: Generic FASTA parser Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2382 Summary: Generic FASTA parser Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jflatow at northwestern.edu I would like to be able read in and iterate over records in generic fasta files of the format: >header data >header data ... 
This iterator should return Bio.Fasta.Record objects with the corresponding header and data fields. I suggest putting this inside the existing Bio.Fasta module and updating Bio.SeqIO.Fasta to use this iterator and transform the records returned into Bio.SeqRecord objects. This should make it easier to add metadata to SeqRecord objects parsed in from FASTA. Consider the following example for illustration. I have data from a genome sequencing machine that outputs pairs of files. One contains the sequence reads which look like this, the other contains estimates of the quality of each base call in the sequence. The sequence file might look something like this (only with hundreds of thousands more entries): >ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run=R_runname CAATATAATTTCTCTTAAAATTATTCCCATGGCCAGGTGTGGTGGCTCACACCTGTAGTC CCGGCACTTTGGGAGGCCAAGGCACACAGGGGATAGG >ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname GGTCTCCAGTGCCCTGTCTCCCCATATTTCTGACACACCTTCTCACAGCCTGGCCCATCT TGCTGGGTCCCTCTTCTCCTCCCTTCCTGCTCCATTTGTCAACACTGCTGGGACATTAGA ATTCAGATCTCCCGGGTCACCG >ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname AAAGTGACTAAAGAATCAATTTACATTAATATTCTATGTGAACAGGCAAAATACTTACAA AGAAGTAGAGAAAATATGAATTCAGTACAGAATTCAGATCTCCCGGGTCACCG The corresponding quality score file might look something like this: >ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run= R_runname 27 28 21 27 27 27 28 22 28 25 3 27 27 27 28 21 33 31 20 6 28 21 26 26 18 28 25 2 26 25 29 23 31 24 27 29 22 27 27 27 29 23 27 31 25 27 27 27 27 27 27 32 26 27 27 27 27 26 27 33 30 12 32 26 27 27 27 33 30 12 33 30 12 26 31 25 33 27 32 28 33 28 27 27 27 27 27 26 33 32 20 7 27 27 27 32 26 >ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname 28 9 26 24 27 27 20 26 18 25 27 32 29 10 26 26 27 18 25 32 30 17 1 25 27 22 32 30 12 27 27 22 26 25 27 23 25 28 21 32 27 27 27 25 26 27 26 25 27 20 26 26 19 28 25 3 25 27 22 27 19 24 24 24 32 29 11 24 34 31 17 23 23 30 23 27 25 30 23 27 33 31 17 27 20 28 
21 27 25 26 26 30 24 27 33 31 13 26 27 27 31 25 27 25 23 26 16 26 27 30 27 7 27 27 27 32 27 26 26 32 27 30 26 27 27 27 27 27 27 27 30 27 6 34 31 17 27 21 27 32 28 18 >ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname 29 26 5 25 27 24 27 27 27 30 27 7 26 27 19 25 26 31 26 34 32 16 20 27 26 32 27 32 28 27 25 26 18 27 25 27 26 26 24 27 31 25 27 27 31 26 26 34 32 23 11 26 22 27 32 26 27 26 32 30 11 26 31 24 27 27 25 23 27 27 33 30 19 4 17 26 25 26 31 27 30 26 27 26 22 26 18 24 27 26 32 26 32 28 27 27 25 27 25 24 25 31 28 10 34 31 15 27 21 27 28 21 27 I would like to be able to do the following: # create a function to parse the header line and return a dictionary def parse_gsflex_header(gs_header): parts = gs_record.description.split(' ') assert len(parts) == 5 xy = parts[2].split('=')[1].split('_') return {'letters': gs_record.seq.tostring(), 'name': parts[0], 'length': parts[1].split('=')[1], 'xpos': xy[0], 'ypos': xy[1], 'region': parts[3].split('=')[1], 'run': parts[4].split('=')[1]} # Bio.SeqIO.FastaIO wraps the Bio.Fasta parser, might look something like this class Fasta(): # or however its organized def data_toseq(data): # do some parsing of the data return Seq(...) def parse(file, header_todict): return (SeqRecord(seq=data_toseq(record.data), **header_todict(record.header)) for record in Bio.Fasta.parse(file)) # I create an initial SeqRecord dictionary using the Bio.SeqIO.FastaIO parser seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file, parse_gsflex_header)) # Then I iterate over the sequences in the qual file and look them up in the seq_dict # setting the quality scores as I find them them for record in Bio.Fasta.parse(qual_file): seq_dict[my_header_todict(record.header)['id']].quality = my_qualitycleanup(record.data) This would work well for parsing all kinds of FASTA-like files and provides a simple mechanism for dealing with them record by record. 
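The record-by-record usage described above can be tried without any Biopython changes. The following is a minimal sketch under assumptions: `iter_fasta_like` and `record_id` are hypothetical names (not the proposed Bio.Fasta API itself), the header is everything after '>', the data is handed back untouched, and plain dictionaries stand in for SeqRecord objects.

```python
import io


def iter_fasta_like(handle):
    """Yield (header, data) for each '>' record; data is left unparsed."""
    header, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(lines)
            header, lines = line[1:].rstrip("\r\n"), []
        elif header is not None:
            lines.append(line)
        # anything before the first '>' is ignored
    if header is not None:
        yield header, "".join(lines)


def record_id(header):
    """Use the first word of the header as the record identifier."""
    return header.split(None, 1)[0]


seq_file = io.StringIO(">r1 length=8\nCAAT\nATAA\n>r2 length=4\nGGTC\n")
qual_file = io.StringIO(
    ">r1 length=8\n27 28 21 27\n27 27 28 22\n>r2 length=4\n28 9 26 24\n"
)

# Sequences: strip all whitespace from the data block.
reads = {
    record_id(header): {"seq": "".join(data.split())}
    for header, data in iter_fasta_like(seq_file)
}

# Qualities: split on any whitespace and convert to integers.
for header, data in iter_fasta_like(qual_file):
    reads[record_id(header)]["quality"] = [int(x) for x in data.split()]
```

Keeping the data block unparsed in the iterator is what lets the same loop serve both files; only the per-format conversion differs.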
-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Oct 16 16:03:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 16:03:33 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162003.l9GK3XmF007588@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #1 from jflatow at northwestern.edu 2007-10-16 16:03 EST -------
My mistake, the parse_gsflex_header function should look something like this:

import re

def parse_gsflex_header(header):
    # split the id (first word) from the rest of the header line
    parts = re.split(r'[,|]?\s+', header, maxsplit=1)
    assert len(parts) == 2
    return {'id': parts[0], 'description': header}

def my_qualitycleanup(data):
    # split() handles both the spaces and the newlines between scores
    return [int(x) for x in data.split()]

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From jflatow at northwestern.edu Tue Oct 16 16:11:04 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Tue, 16 Oct 2007 15:11:04 -0500
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <47151222.1060502@maubp.freeserve.co.uk>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk> <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> <47151222.1060502@maubp.freeserve.co.uk>
Message-ID: <156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu>

On Oct 16, 2007, at 2:33 PM, Peter wrote:
> > In Bio.Fasta there are Parsers which really belong in
> > Bio.SeqIO.FastaIO, if anywhere.
How about Bio.Fasta becomes the more > > general Fasta reader, nothing to do with sequences. ... > > In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was > thinking in a few releases time of suggesting its deprecation (but > not just yet as for several years it was the best documented and > most used parser in Biopython). > I see, it looks like its meant to be deprecated, I was just saying its actually doing SeqIO functionality. > If we do decided keep Bio.Fasta (or extend it), then perhaps > Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta > > I'm still digressing your ideas to turn Bio.Fasta into a generic > parser that copes with sequences, qualities scores, or anything else. I'm not quite sure you're meaning of digressing, if you mean thinking it over, then great =) Otherwise I hope you'll seriously consider it anyway. Either way, I think I posted a more coherent message on bugzilla with some example data and motivation. jared From jflatow at northwestern.edu Tue Oct 16 16:14:16 2007 From: jflatow at northwestern.edu (Jared Flatow) Date: Tue, 16 Oct 2007 15:14:16 -0500 Subject: [Biopython-dev] CVS to SVN In-Reply-To: <4714EB34.8000207@maubp.freeserve.co.uk> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB34.8000207@maubp.freeserve.co.uk> Message-ID: <6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu> > I would say one reason why we aren't charging ahead with a move > from CVS to subversion is only a few posters on this mailing list > actively WANT to move to subversion, and no-one has really > championed the move (yet). Does that mean most developers don't WANT to move, or just that they don't ACTIVELY want to move? 
jared From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 16:42:18 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 21:42:18 +0100 Subject: [Biopython-dev] 454 GSFlex quality score files In-Reply-To: <156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk> <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> <47151222.1060502@maubp.freeserve.co.uk> <156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu> Message-ID: <4715222A.2070909@maubp.freeserve.co.uk> Jared Flatow wrote: > On Oct 16, 2007, at 2:33 PM, Peter wrote: > >>> In Bio.Fasta there are Parsers which really belong in >>> Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more >>> general Fasta reader, nothing to do with sequences. ... >> In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was >> thinking in a few releases time of suggesting its deprecation (but >> not just yet as for several years it was the best documented and >> most used parser in Biopython). > > I see, it looks like its meant to be deprecated, I was just saying > its actually doing SeqIO functionality. Well I'm currently just making a suggestion for the future, deprecating Bio.Fasta, we should still canvas opinion on the main mailing list before taking that action. >> If we do decided keep Bio.Fasta (or extend it), then perhaps >> Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta >> >> I'm still digressing your ideas to turn Bio.Fasta into a generic >> parser that copes with sequences, qualities scores, or anything else. That was a typo, but you managed to guess my meaning. I meant to say: I'm still digesting [i.e. thinking about] your ideas to turn Bio.Fasta into a generic parser that copes with sequences, qualities scores, or anything else. 
> I'm not quite sure you're meaning of digressing, if you mean thinking > it over, then great =) Otherwise I hope you'll seriously consider it > anyway. Either way, I think I posted a more coherent message on > bugzilla with some example data and motivation. I'll take a look, Bug 2382 - Generic FASTA parser http://bugzilla.open-bio.org/show_bug.cgi?id=2382 Peter From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 17:01:29 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 22:01:29 +0100 Subject: [Biopython-dev] CVS to SVN In-Reply-To: <6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB34.8000207@maubp.freeserve.co.uk> <6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu> Message-ID: <471526A9.1010709@maubp.freeserve.co.uk> Jared Flatow wrote: >> I would say one reason why we aren't charging ahead with a move >> from CVS to subversion is only a few posters on this mailing list >> actively WANT to move to subversion, and no-one has really >> championed the move (yet). > > Does that mean most developers don't WANT to move, or just that they > don't ACTIVELY want to move? Going back over the archives, Chris Lasher was most vocal in supporting the move, and there were a few other positive voices. Speaking for myself, I have no strong desire either way, and I don't think Michiel objected either (except over the timing). Then as now, we are hoping to get the next release out "shortly", so after that would be a good time to make the switch. 
[I'm assuming we won't lose any revision history or comments, and that things like the web-based ViewCVS and its RSS feed will still be available]

Peter

From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:02:03 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:02:03 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162102.l9GL23rr010250@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 17:02 EST -------
Are there any other "FASTA like" formats you know of, in addition to traditional sequence data and the 454 GSFlex quality score files?

We could do this using the old Scanner/Consumer model (see the pre-Martel parser, CVS revision 1.8 of Bio/Fasta/__init__.py for example).

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Fasta/__init__.py?rev=1.8&cvsroot=biopython&content-type=text/vnd.viewcvs-markup

The scanner would be the same for all formats, and would pass the data with whitespace (spaces, new lines etc) as is. We could then have one consumer for each supported FASTA variant:

_Scanner           Scans a FASTA-format stream.
_RecordConsumer    Consumes FASTA data to a Record object.
_SequenceConsumer  Consumes FASTA data to a Sequence object.
_QualityConsumer   (new) could build a list of integers for each record?

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
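The scanner/consumer split suggested in Comment #2 might look roughly like the sketch below. It is an illustration only: the class and method names (`Scanner.feed`, `start_record`, `data`, `end_record`, `convert`) are assumptions made for this example and are not claimed to match the old Bio.Fasta revision 1.8 code.

```python
import io


class Scanner:
    """Scans a FASTA-like stream and reports events to a consumer."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record(line[1:].rstrip("\r\n"))
                in_record = True
            elif in_record:
                # data is passed through with its whitespace as is
                consumer.data(line)
        if in_record:
            consumer.end_record()


class BaseConsumer:
    """Collects (title, converted data) pairs; subclasses define convert()."""

    def __init__(self):
        self.records = []

    def start_record(self, title):
        self._title, self._chunks = title, []

    def data(self, text):
        self._chunks.append(text)

    def end_record(self):
        raw = "".join(self._chunks)
        self.records.append((self._title, self.convert(raw)))


class SequenceConsumer(BaseConsumer):
    def convert(self, text):
        return "".join(text.split())  # drop all whitespace


class QualityConsumer(BaseConsumer):
    def convert(self, text):
        return [int(value) for value in text.split()]


seqs = SequenceConsumer()
Scanner().feed(io.StringIO(">r1 desc\nACGT\nAC\n>r2\nGG\n"), seqs)

quals = QualityConsumer()
Scanner().feed(io.StringIO(">r1 desc\n27 28\n21\n>r2\n9 26\n"), quals)
```

The same `Scanner` serves both inputs; supporting another FASTA variant would only need a new `convert` method.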
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:26:29 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:26:29 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162126.l9GLQT8O011239@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #3 from jflatow at northwestern.edu 2007-10-16 17:26 EST -------
On second thought, let me just rewrite all the code:

# The Bio.Fasta parser
class Fasta():  # or whatever
    @staticmethod
    def parse(file):
        # return an iterator over the file as Bio.Fasta.Records
        # for the records, trim newline from header, don't do anything to data
        pass

# The Bio.SeqIO.FastaIO wrapper for Bio.Fasta
import re

class FastaIO():  # or however it's organized
    @staticmethod
    def header_todict(header):
        parts = re.split(r'[,|]?\s+', header, maxsplit=1)
        assert len(parts) == 2
        return {'id': parts[0], 'description': header}

    @staticmethod
    def data_toseq(data, alphabet):
        return Seq(re.sub(r'\s+', '', data), alphabet)

    @staticmethod
    def parse(file, header_todict=None, alphabet=single_letter_alphabet):
        # default to the header parser defined on this class
        if header_todict is None:
            header_todict = FastaIO.header_todict
        return (SeqRecord(seq=FastaIO.data_toseq(record.data, alphabet),
                          **header_todict(record.header))
                for record in Bio.Fasta.parse(file))

# Now to use these in my example I can do
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file))
for record in Bio.Fasta.parse(qual_file):
    seq_id = Bio.SeqIO.FastaIO.header_todict(record.header)['id']
    seq_dict[seq_id].quality = [int(x) for x in record.data.split()]

# Suppose instead I have an alignment file, which looks like this:
>contigname
A A 10 64
T T 9 64
C C 9 64
...
# and on, where the first column is a reference sequence, the second column is a consensus
# sequence, the third column is the number of reads aligned, the fourth column is the combined
# quality score

# Now it's just as easy for me to parse this into an object
class ContigAlign():
    def __init__(self, name, ref, consensus, numreads, qscore):
        self.name = name
        self.ref = ref
        self.consensus = consensus
        self.numreads = numreads
        self.qscore = qscore

# I'll make a dictionary of my contigaligns
d = {}
for record in Bio.Fasta.parse(file):
    # split each line into its four fields, then transpose rows into columns
    rows = [line.split() for line in record.data.splitlines() if line.strip()]
    (ref, consensus, numreads, qscore) = zip(*rows)
    d[record.header] = ContigAlign(record.header, ref, consensus, numreads, qscore)

# maybe I would turn ref and consensus into Seqs, but you get the point

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:38:45 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:38:45 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162138.l9GLcj29011655@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 17:38 EST -------
In comment 3, did you just make up this file format as an example?

>contigname
A A 10 64
T T 9 64
C C 9 64
...

with four columns: reference sequence, consensus, number of reads aligned, and combined quality score.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
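For the four-column layout discussed in comments 3 and 4, turning the rows of one record into per-column values is a transpose. A minimal sketch, assuming whitespace-separated fields and integer counts and scores; `parse_contig_align` and the namedtuple layout are hypothetical names chosen for this example:

```python
from collections import namedtuple

ContigAlign = namedtuple("ContigAlign", "name ref consensus numreads qscore")


def parse_contig_align(name, data):
    """Build a ContigAlign from the raw data block of one record."""
    # One row per non-blank line, one field per whitespace-separated column.
    rows = [line.split() for line in data.splitlines() if line.strip()]
    # zip(*rows) transposes rows into columns; an empty block would fail here.
    ref, consensus, numreads, qscore = zip(*rows)
    return ContigAlign(
        name,
        "".join(ref),
        "".join(consensus),
        [int(n) for n in numreads],
        [int(q) for q in qscore],
    )


align = parse_contig_align("contigname", "A A 10 64\nT T 9 64\nC C 9 64\n")
```

Note that `zip(data.split('\n'))` without the per-line split and the `*` would yield one 1-tuple per line rather than four columns.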
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:58:38 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:58:38 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162158.l9GLwc68012343@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #5 from jflatow at northwestern.edu 2007-10-16 17:58 EST -------
Nope, they actually have a file format that looks like this:

Position  Consensus  Quality Score  Depth  Signal  StdDeviation
>contig00001  1
1         G          64             2      1.00    0.00
2         A          64             2      1.00    0.00
3         G          64             2      1.00    0.00
4         A          64             2      1.00    0.00
5         G          64             2      2.00    0.00
6         G          64             2      2.00    0.00
7         A          64             2      3.00    0.00
8         A          64             2      3.00    0.00
9         A          64             2      3.00    0.00
10        C          64             2      2.00    0.00
11        C          64             2      2.00    0.00
12        T          64             2      1.00    0.00
13        C          64             2      3.00    0.00
14        C          64             2      3.00    0.00
15        C          64             2      3.00    0.00
16        G          64             2      1.00    0.00
17        T          64             2      1.00    0.00
18        G          64             2      1.00    0.00
19        A          64             2      1.00    0.00
20        T          64             2      1.00    0.00
21        C          64             2      2.00    0.00
22        C          64             2      2.00    0.00

Note the file-wide header at the top of the file: a generic FASTA-like parser might skip to the first '>'. We could strip that header beforehand, but it would be nice if the parser were smart about it.

Also, here is another sample FASTA-like file format they use for pair alignments:

>ERSGEES01EM5WC, 2..30 of 95 and ERSGEES01C1ZV2, 1..29 of 268 (29/29 ident)
  2 CGGTGACCCGGGAGATCTGAATTCCTGGT 30
  1 CGGTGACCCGGGAGATCTGAATTCCTGGT 29
>ERSGEES01EM5WC, 2..29 of 95 and ERSGEES01DMS5T, 1..28 of 259 (28/28 ident)
  2 CGGTGACCCGGGAGATCTGAATTCCTGG 29
  1 CGGTGACCCGGGAGATCTGAATTCCTGG 28
>ERSGEES01EM5WC, 29..2 of 95 and ERSGEES01D8GDV, 205..232 of 232 (28/28 ident)
 29 CCAGGAATTCAGATCTCCCGGGTCACCG 2
205 CCAGGAATTCAGATCTCCCGGGTCACCG 232

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 18:09:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 18:09:06 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162209.l9GM96N5012764@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #6 from jflatow at northwestern.edu 2007-10-16 18:09 EST -------
The reference/consensus one was inspired by yet another format they have: there are 2 tools they provide, one for mapping to an existing sequence, the other for ab initio contig building. The mapping one has the extra reference column. As you can see it might be hard to keep up with all these similar formats as part of Biopython (these are only from one source). Certainly the common ones should have wrappers but we should also be able to easily get the stream of records.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Oct 16 18:13:48 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 18:13:48 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710162213.l9GMDmBM012914@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 18:13 EST -------
Could you attach a few of these real files? Including where they came from, i.e. the company whose software writes such output, and what they call each file format variant. If you can get a matched set (i.e. all associated with the same few sequences) then even better.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 19:09:00 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 19:09:00 -0400 Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser In-Reply-To: Message-ID: <200710162309.l9GN90wg015092@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2382 ------- Comment #8 from jflatow at northwestern.edu 2007-10-16 19:08 EST ------- The files are very large, I assure you they are just longer versions of what I have supplied here though. The company is Roche Diagnostics. The initial reads/quality files are the output of the 454 GSFlex genome sequencing machines. They have two pieces of software: gsMapper and gsAssembler which output the contigs. Reads/Quality files from the machine are called: 454Reads.{fna,qual} gs* output: 454{All,Large}Contigs.{fna,qual} 454PairAlign.txt 454AlignmentInfo.tsv -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Oct 16 20:10:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Oct 2007 20:10:45 -0400 Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the Seq object (in Bio.Seq) In-Reply-To: Message-ID: <200710170010.l9H0AjYe018147@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2381 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2007-10-16 20:10 EST ------- > Note there is yet another (!) translation function Bio.SeqUtils.translate() > which is frame aware [personally I would mark a lot of this module as > deprecated]. 
Given the various translate functions we already have in Biopython, why do you want to add another one? Is there something the "translate" method can do that the "translate" function cannot? Since the "translate" function can take Seq objects as well as simple strings, I'd prefer the "translate" function over a "translate" method. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:49:18 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Oct 2007 17:49:18 +0100 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> Message-ID: <4714EB8E.3000700@maubp.freeserve.co.uk> >> Did you know you can write to a string using any Bio.SeqIO supported >> file format using StringIO? Perhaps we should spell this out more >> explicitly in the documentation, but a motivating example would help. > > This is what I do now, but it seems like a hack to me to go this > route. To always have to write to a file feels strange, but I see > that it would be messy to go OO since there are so many formats. > However, giving preference to fasta over other formats by making it > innate doesn't seem like such a terrible idea. I do have mixed > feelings about 'bloating' the code which is why I asked, and you have > convinced me that this is not quite appropriate given existing > convention. However the idea would be to put the to_fasta or > to_format method inside the SeqRecord, then to call it from the IO > when needed to actually write to a file, but call it directly when > all that is wanted is a string... Its debatable isn't it? 
I suspect that for most users, when they want a record in a particular file format it's for writing to a file. However, adding a to_format() method to a SeqRecord makes some sense (suitable for sequential file formats only). This would take a format name and return a string, by calling Bio.SeqIO with a StringIO object internally.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Oct 16 22:17:28 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 22:17:28 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To: 
Message-ID: <200710170217.l9H2HSAx024040@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2382

------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-10-16 22:17 EST -------
If all these special fasta files are coming from Roche Diagnostics, I'd suggest creating a separate module rather than trying to put this in Bio.SeqIO. Bio.SeqIO is one of the few modules in Biopython that is used by most users, so I'd like to keep it as clean as possible. To avoid confusion for users who just want to parse regular Fasta files, I think the module should not be called Bio.Fasta. In addition, I doubt we'd get much code reuse from a generic Bio.Fasta module beyond what is needed for the Roche files, since the only thing they have in common is that they use ">" to separate records.

With a separate module to handle the Roche files, my preferred usage would be something like this:

from Bio import SeqIO, GSFlex # Or whatever you'd like to call it
seqrecords = SeqIO.parse(open("mysequences.fa"), "fasta")
qualities = GSFlex.parse(open("myqualities.qual"),