From bugzilla-daemon at portal.open-bio.org Tue Oct 2 05:09:48 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Oct 2007 05:09:48 -0400
Subject: [Biopython-dev] [Bug 2362] test_copen fails on Windows XP as tries
os.fork()
In-Reply-To:
Message-ID: <200710020909.l9299moD015903@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2362
mdehoon at ims.u-tokyo.ac.jp changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2007-10-02 05:09 EST -------
I removed test_copen.py from CVS and deprecated the Bio.MultiProc code.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mdehoon at c2b2.columbia.edu Tue Oct 2 05:06:54 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 2 Oct 2007 05:06:54 -0400
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
References: <46E6A845.3030601@c2b2.columbia.edu>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Hi everybody,
Since no users of Bio.MultiProc came forward, I deprecated it for the
upcoming release.
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Michiel De Hoon
Sent: Tue 9/11/2007 10:37 AM
To: BioPython Developers List; biopython at biopython.org
Subject: [BioPython] Bio.MultiProc
Hi everybody,
In preparation for the upcoming release, I was running the Biopython
test suite and found that test_copen.py hangs on Cygwin. It doesn't
fail, it just sits there forever. This may be related to the use of
fork() instead of select() in Bio/MultiProc/copen.py. Anyway, while it
is probably possible to fix this, I'd have to dig fairly deep into the
code, and I am not sure if it is worth it. It looks like the copen
functions are used only in Bio/config, which is needed for Bio.db. A
description of the functionality of this module can be found in the
tutorial section 4.7.2.
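A minimal sketch of how a fork-dependent test could be guarded, assuming the
test only needs to be skipped (not ported) on platforms without os.fork(),
such as native Windows (see Bug 2362). The class and function names are
hypothetical and not taken from test_copen.py, and this would not by itself
explain or fix the hang on Cygwin, where os.fork() is present:

```python
import os
import unittest

# os.fork() exists only on Unix-like platforms; native Windows lacks it.
HAVE_FORK = hasattr(os, "fork")


@unittest.skipUnless(HAVE_FORK, "os.fork() is not available on this platform")
class ForkSmokeTest(unittest.TestCase):  # hypothetical test case
    def test_child_exits_cleanly(self):
        pid = os.fork()
        if pid == 0:
            # Child process: exit immediately without running cleanup handlers.
            os._exit(0)
        # Parent process: wait for the child and check its exit status.
        _, status = os.waitpid(pid, 0)
        self.assertTrue(os.WIFEXITED(status))
        self.assertEqual(os.WEXITSTATUS(status), 0)


def run_fork_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ForkSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a platform without os.fork() the test is reported as skipped rather than
failing with an AttributeError.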
Now, I don't remember users asking about this module on the mailing
list. From the tutorial documentation, it seems to be a nice piece of
code, but I doubt that it is being used often in practice.
So I was wondering:
1) Is anybody on this list using this code?
2) If not, can I mark it as deprecated for the upcoming release?
Hopefully, people who are using this code will notice, and let us know
that they need it.
--Michiel.
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
From idoerg at gmail.com Tue Oct 2 12:00:41 2007
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 2 Oct 2007 09:00:41 -0700
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
References: <46E6A845.3030601@c2b2.columbia.edu>
<6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID:
Would it be possible to include the module, comment out the unworkable
source code and print a deprecation warning when it is imported? That way
we:
1) Don't have a clunky module BUT
2) we warn anyone who uses it (but didn't happen to read your post) that it
is deprecated when they install a new biopython version AND
3) Leave an option of fixing and commenting the code back in (i.e. it is not
lost forever).
Also, is it possible to track down the original author?
./I
--
I. Friedberg
"The only problem with troubleshooting is that
sometimes trouble shoots back."
From biopython-dev at maubp.freeserve.co.uk Tue Oct 2 12:55:53 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 02 Oct 2007 17:55:53 +0100
Subject: [Biopython-dev] Bio.MultiProc / Bio.FormatIO
In-Reply-To:
References: <46E6A845.3030601@c2b2.columbia.edu> <6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID: <47027819.1010207@maubp.freeserve.co.uk>
Iddo Friedberg wrote:
> Would it be possible to include the module, comment out the unworkable
> source code and print a deprecation warning when it is imported?
That is sort of what Michiel did - he's just added a deprecation
warning, but not touched the code itself.
This isn't an option for some of the more "integrated" bits of code like
Bio.FormatIO which I suggested removing in Bug 2361 (see also my email
to the main list on 19 September):
http://bugzilla.open-bio.org/show_bug.cgi?id=2361#c27
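
A minimal sketch of the import-time deprecation warning discussed above, using
the standard library warnings module. The helper name and the message wording
are hypothetical; this is not the code that was actually committed to
Bio.MultiProc:

```python
import warnings


def warn_deprecated_module(name, alternative=None):
    """Issue a DeprecationWarning for a module that is still importable."""
    message = "%s is deprecated and will be removed in a future release." % name
    if alternative:
        message += " Please use %s instead." % alternative
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(message, DeprecationWarning, stacklevel=2)
```

Calling warn_deprecated_module("Bio.MultiProc") at the top of the package's
__init__.py would leave the rest of the code untouched. Depending on the Python
version and the active warning filters, DeprecationWarning may be hidden by
default, so users might only see it when running with -W default or similar.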
Peter
From rhaygood at duke.edu Tue Oct 2 19:59:43 2007
From: rhaygood at duke.edu (Ralph Haygood)
Date: Tue, 2 Oct 2007 19:59:43 -0400 (EDT)
Subject: [Biopython-dev] Statistics code
In-Reply-To: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
References: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
Message-ID:
Tiago,
Sorry to be so long replying---I've been almost drowning in work.
Use anything you find useful in my code. If you do write an article
about it, I'd be glad to be a coauthor, not just in name but actually
to help with writing the discussion of sequence statistics.
There *is* a lot of stuff in my code, not all of it generally
important. For example, few people will care about indel statistics,
beyond counting them and maybe getting the frequency distribution of
their lengths. The things most people will care about are K (the
number of polymorphic sites), Watterson's theta, pi, Tajima's D, Fu
and Li's D, Fay and Wu's H, F_ST, and McDonald--Kreitman testing.
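
For readers less familiar with these quantities, here is a rough sketch of
the three simplest ones (K, Watterson's theta, and pi) using their textbook
definitions. It assumes at least two equal-length, gap-free sequences with no
ambiguous nucleotides; the function names are hypothetical and this is not
the code being discussed:

```python
from itertools import combinations


def polymorphic_sites(seqs):
    """Return K, the number of alignment columns with more than one state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def wattersons_theta(seqs):
    """Watterson's estimator: K / a_n, with a_n = sum of 1/i for i = 1..n-1."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return polymorphic_sites(seqs) / a_n


def nucleotide_diversity(seqs):
    """pi as the mean number of pairwise differences (per alignment, not per site)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / float(len(pairs))
```

Dividing either estimate by the alignment length would give the per-site
version.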
As for ambiguous nucleotides, my code handles them in one of two ways,
at the programmer's option. By default, a site at which any sequence
in the alignment contains an ambiguous nucleotide is ignored; for
example,
ACRGTY
ACAGTC
is effectively equivalent to
ACGT
ACGT .
However, if the 'expand_diplotypes' option is specified when the
Sample object is constructed, each sequence in the alignment is
interpreted as a diplotype and converted into a pair of pseudo-
haplotypes, two-fold ambiguous nucleotides (R, Y, W, S, M, and K)
being interpreted as heterozygous; for example,
ACRGTY
ACAGTC
is effectively equivalent to
ACAGTC
ACGGTT
ACAGTC
ACAGTC .
In expand_diplotypes mode, sites containing three- or four-fold
ambiguous nucleotides are still ignored. Also, you'll get a warning
if you request a statistic that depends on correct SNP phasing, which
most statistics don't. So far, I've found these two operating modes
sufficient for my needs.
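
The two modes described above can be sketched roughly as follows. This is an
illustration only, not the actual implementation: the function names are
hypothetical, sequences are assumed to be upper-case and of equal length, and
any character other than A, C, G, T (including gaps) is treated as ambiguous.
The assignment of alleles to the two pseudo-haplotypes is arbitrary, which is
why phase-dependent statistics need care:

```python
UNAMBIGUOUS = set("ACGT")
# IUPAC two-fold ambiguity codes and the two bases each one stands for.
TWO_FOLD = {"R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT"}


def drop_ambiguous_sites(seqs):
    """Default mode: discard every column containing any ambiguous nucleotide."""
    kept = [col for col in zip(*seqs) if set(col) <= UNAMBIGUOUS]
    return ["".join(col[i] for col in kept) for i in range(len(seqs))]


def expand_diplotypes(seqs):
    """Diplotype mode: split each sequence into two pseudo-haplotypes.

    Columns with three- or four-fold ambiguity codes are still discarded.
    """
    allowed = UNAMBIGUOUS | set(TWO_FOLD)
    kept = [col for col in zip(*seqs) if set(col) <= allowed]
    haplotypes = []
    for i in range(len(seqs)):
        # An unambiguous base is homozygous, so it contributes itself twice.
        alleles = [TWO_FOLD.get(col[i], col[i] * 2) for col in kept]
        haplotypes.append("".join(pair[0] for pair in alleles))
        haplotypes.append("".join(pair[1] for pair in alleles))
    return haplotypes
```

With the alignment used in the examples above, the first function yields the
two-sequence result and the second yields the four pseudo-haplotypes.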
I think your plan sounds very reasonable, just adding sequence
statistics at a pace that's comfortable for you. Any time you have
questions, feel free to ask me, and I'll give you whatever benefit
there is in my opinion and experience.
I'm happy for all this to happen on biopython-dev, so that other
people (e.g., Alex Lancaster) can add to it. I'll leave it to the
core developers to tell us if we're too noisy. (I'd recommend still
sending messages to me with copies to biopython-dev, however, so that
I don't accidentally miss them on biopython-dev, which I don't always
read carefully.)
Ralph
On Sat, 29 Sep 2007, Tiago Antão wrote:
> Hi Ralph,
>
> Hope all is good with you. I am now finally starting to commit
> statistics code to Biopython. But before I go ahead I would like to
> ask you for some advice (plus some extra comments):
>
> About code merging and authorship:
>
> I am finally looking at your code. There is really lots of stuff
> there! Would it be OK with you if I merged your code with mine into
> Bio.PopGen.Stats? Obviously the copyright/authorship for the module
> would be co-shared as would any authorship of any article deriving
> from it...
>
> About a strategy to advance:
>
> 1. I personally don't have any experience, really, with working with
> sequence data (my background is in SNPs, microsatellites/STRs, AFLPs and
> that sort of stuff)
> 2. Starting on Monday I am beginning a PhD which will require, part
> time, sequence analysis
> 3. What I mean by 1 and 2 is that I currently don't have the maturity to
> architect and design a good framework for sequence analysis but I will
> gain it with time.
> My plan is then to defer all sequence code until I feel I know what I
> am doing (although I was still thinking of providing something like
> BioPerl's facility for extracting all SNPs from sequences)
> If this is OK with you I plan to start committing code the week
> starting on this Monday.
>
> About request for insight:
>
> If you have any comments to offer on issues regarding representing
> indels and ambiguous data (i.e. ambiguous nucleotides) they might be
> useful, as I suppose that is the biggest issue that makes me afraid of
> sequence code.
>
>
> Finally: I would summarize our discussion here on biopython-dev (I am
> not taking it there directly just because you might not want your code
> on Biopython or might want it in other terms).
>
> Thanks,
> Tiago
>
From mdehoon at c2b2.columbia.edu Tue Oct 2 20:18:59 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 2 Oct 2007 20:18:59 -0400
Subject: [Biopython-dev] [BioPython] Bio.MultiProc
References: <46E6A845.3030601@c2b2.columbia.edu><6243BAA9F5E0D24DA41B27997D1FD14402B62B@mail2.exch.c2b2.columbia.edu>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62D@mail2.exch.c2b2.columbia.edu>
> Would it be possible to include the module, comment out the unworkable
> source code and print a deprecation warning when it is imported?
That is what I did.
> 3) Leave an option of fixing and commenting the code back in (i.e. it is
> not lost forever).
Even after removing the code in some future release, the code will not be
lost forever. It can always be retrieved from CVS and from older Biopython
releases.
> Also, is it possible to track down the original author?
That would be Jeff Chang.
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From tiagoantao at gmail.com Wed Oct 3 06:14:33 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 11:14:33 +0100
Subject: [Biopython-dev] Coalescent code
Message-ID: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
Hi,
I had planned to start committing statistics-related code this
weekend, but (contrary to my expectations) I am getting requests for
the coalescent code. As such, I am planning to commit the coalescent
code instead.
It is quite straightforward code, with only one issue on which I
need advice: some of the code (for modeling demographies) requires
templates to go along with it (very small text files, about 10 of
around 700 bytes each). Where should I put the files in Biopython?
Also, on installation those files have to be put somewhere...
Tiago
--
http://www.tiago.org/ps
From biopython-dev at maubp.freeserve.co.uk Wed Oct 3 10:18:21 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 03 Oct 2007 15:18:21 +0100
Subject: [Biopython-dev] Coalescent code
In-Reply-To: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
Message-ID: <4703A4AD.7030008@maubp.freeserve.co.uk>
Tiago Antão wrote:
> It is quite straightforward code, with only one issue on which I
> need advice: some of the code (for modeling demographies) requires
> templates to go along with it (very small text files, about 10 of
> around 700 bytes each). Where should I put the files in Biopython?
> Also, on installation those files have to be put somewhere...
There is a similar precedent with Bio/EUtils/DTDs (where the data files
are XML DTD files). I guess you could have the 10 plain text data files
in with the python files (or under a subdirectory). Opinions?
I should really refresh myself on current python packaging guidelines...
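
As a rough sketch of the "data files next to the python files" option, a
module can locate files relative to its own location. The helper name, the
"templates" subdirectory, and the example file name in the note below are all
hypothetical:

```python
import os


def data_file_path(module_file, name, subdir="templates"):
    """Build the path of a data file stored in a subdirectory next to a module."""
    package_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(package_dir, subdir, name)
```

Inside the package this would be called as data_file_path(__file__,
"example.par"). For the files to be copied on installation they would also
need to be declared to the installer, for example through the package_data
argument of distutils' setup(); how that fits Biopython's own setup.py is not
something this sketch covers.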
Peter
From tiagoantao at gmail.com Wed Oct 3 11:37:17 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 16:37:17 +0100
Subject: [Biopython-dev] Statistics code
In-Reply-To:
References: <6d941f120709291328q6a9aae97kdcf489549cc9b3f0@mail.gmail.com>
Message-ID: <6d941f120710030837k1aa2d4ak7eca8e6e27e35fdd@mail.gmail.com>
Ralph,
Thanks for the detailed explanation. Because of a couple of requests I
had, I am going to commit first the coalescent code, but after the
coalescent code is in, I will pick this up.
Tiago
--
http://www.tiago.org/ps
From tiagoantao at gmail.com Wed Oct 3 12:04:07 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 3 Oct 2007 17:04:07 +0100
Subject: [Biopython-dev] Coalescent code
In-Reply-To: <4703A4AD.7030008@maubp.freeserve.co.uk>
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com>
<4703A4AD.7030008@maubp.freeserve.co.uk>
Message-ID: <6d941f120710030904k70b098dcnbbc40bc3420ea831@mail.gmail.com>
Hi
On 10/3/07, Peter wrote:
> There is a similar precedent with Bio/EUtils/DTDs (where the data files
> are XML DTD files). I guess you could have the 10 plain text data files
> in with the python files (or under a subdirectory). Opinions?
In the meantime, I will start committing the code (I can easily
sort out where the files should go later, once
there is a decision).
Michiel, please, please don't include the SimCoal code that I will be
committing in the next public release.
Regards,
Tiago
From mdehoon at c2b2.columbia.edu Wed Oct 3 20:39:47 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 3 Oct 2007 20:39:47 -0400
Subject: [Biopython-dev] Coalescent code
References: <6d941f120710030314g73e38aa4w8c3b473eeaa18cc9@mail.gmail.com><4703A4AD.7030008@maubp.freeserve.co.uk>
<6d941f120710030904k70b098dcnbbc40bc3420ea831@mail.gmail.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B62E@mail2.exch.c2b2.columbia.edu>
> Michiel, please, please don't include the SimCoal code that I will be
> committing in the next public release.
To avoid confusion, please don't commit code to CVS that you don't want to be
included in the next Biopython release.
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From bugzilla-daemon at portal.open-bio.org Wed Oct 3 22:10:13 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Oct 2007 22:10:13 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710040210.l942ADGF030763@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
------- Comment #30 from mdehoon at ims.u-tokyo.ac.jp 2007-10-03 22:10 EST -------
Looking at the patch for Bio.FormatIO:
-------------------------
#Would like to have just issued a deprecation warning, and removed this
#module later. However, due to the FormatIO code in Bio/SeqRecord.py the
#deprecation warning would be triggered whenever someone used the SeqRecord.
raise ImportError, "Bio.FormatIO has been removed. Please try Bio.SeqIO instead"
-------------------------
Since the patch for Bio/SeqRecord.py removes its dependence on Bio.FormatIO, is
it still necessary to raise an ImportError instead of issuing a
DeprecationWarning?
From bugzilla-daemon at portal.open-bio.org Fri Oct 5 05:44:09 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 05:44:09 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710050944.l959i9BX029760@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
------- Comment #31 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-05 05:44 EST -------
In terms of typical usage, SeqRecord does not depend on FormatIO.
However, from a code perspective, FormatIO and SeqRecord "depend" on each
other.
If we remove the FormatIO "hooks" from SeqRecord.py (so that SeqRecord does not
depend on FormatIO), then FormatIO breaks. Rather than leaving in a broken
module, I wanted to remove it. A DeprecationWarning doesn't seem right if
FormatIO is removed, which is why I suggested an ImportError.
We might be able instead to MOVE the FormatIO hooks out of SeqRecord and then
issue a DeprecationWarning for FormatIO ... but it looks rather complicated,
and probably means tackling the Bio.config code as well.
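
To make the difference between the two options concrete, here is a small
self-contained sketch. It simulates importing a module by executing source
text in a fresh module object; the module names and the deprecated module's
contents are hypothetical, and only the ImportError message is taken from the
patch quoted in comment #30:

```python
import types
import warnings

# Source of a module that still works but warns when it is imported.
DEPRECATED_SOURCE = '''
import warnings
warnings.warn("this module is deprecated", DeprecationWarning)

def still_works():
    return "ok"
'''

# Source of a stub left behind after the real code has been removed.
REMOVED_SOURCE = '''
raise ImportError("Bio.FormatIO has been removed. Please try Bio.SeqIO instead")
'''


def load_from_source(name, source):
    """Execute module source in a fresh module object, roughly as import does."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


def demo():
    """Return (result of deprecated code, number of warnings, removal message)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        deprecated = load_from_source("old_module", DEPRECATED_SOURCE)
    try:
        load_from_source("removed_module", REMOVED_SOURCE)
        message = None
    except ImportError as err:
        message = str(err)
    return deprecated.still_works(), len(caught), message
```

The deprecated module remains usable after the warning, whereas the stub
stops the import and points the user to the replacement.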
From bugzilla-daemon at portal.open-bio.org Fri Oct 5 07:05:49 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 07:05:49 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710051105.l95B5nXW001755@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
------- Comment #32 from mdehoon at ims.u-tokyo.ac.jp 2007-10-05 07:05 EST -------
> If we remove the FormatIO "hooks" from SeqRecord.py (so that SeqRecord does not
> depend on FormatIO), then FormatIO breaks. Rather than leaving in a broken
> module, I wanted to remove it. A DeprecationWarning doesn't seem right if
> FormatIO is removed, which is why I suggested an ImportError.
OK, I see. As far as I'm concerned, your patch is fine then.
From bugzilla-daemon at portal.open-bio.org Fri Oct 5 09:46:51 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 5 Oct 2007 09:46:51 -0400
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200710051346.l95Dkpc2010074@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
tiagoantao at gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED
------- Comment #6 from tiagoantao at gmail.com 2007-10-05 09:46 EST -------
It is implemented and documented, and has test code.
From tiagoantao at gmail.com Fri Oct 5 10:26:43 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 5 Oct 2007 15:26:43 +0100
Subject: [Biopython-dev] Configuration files
Message-ID: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
Hi,
Is there any (Biopython standard) way to configure Biopython at
runtime? When writing code I sometimes think it would be very
convenient (especially for the programmer using Biopython) to abstract
some configuration parameters away from the code. Things like the
location of binaries, hosts, user names (and maybe passwords) of
databases, timeout parameters, etc. These could be stored in a
configuration file (or registry entry, or whatever), thus saving users
from having to supply them in their code...
Just an idea...
Tiago
--
http://www.tiago.org/ps
From bugzilla-daemon at portal.open-bio.org Mon Oct 8 07:14:30 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Oct 2007 07:14:30 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710081114.l98BEUZh019757@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #759 is|0                           |1
           obsolete|                            |
------- Comment #33 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-08 07:14 EST -------
(From update of attachment 759)
Applied these changes to CVS.
From biopython-dev at maubp.freeserve.co.uk Mon Oct 8 06:52:48 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 08 Oct 2007 11:52:48 +0100
Subject: [Biopython-dev] Configuration files
In-Reply-To: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
References: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
Message-ID: <470A0C00.50505@maubp.freeserve.co.uk>
Tiago Antão wrote:
> Hi,
>
> Is there any (Biopython standard) way to configure Biopython at
> runtime? When writing code I sometimes think it would be very
> convenient (especially for the programmer using Biopython) to abstract
> some configuration parameters away from the code. Things like the
> location of binaries, hosts, user names (and maybe passwords) of
> databases, timeout parameters, etc. These could be stored in a
> configuration file (or registry entry, or whatever), thus saving users
> from having to supply them in their code...
> Just an idea...
This sounds like a fairly general thing (i.e. for all of python) rather
than being Biopython specific.
For example, I find a lot of my scripts have a few if statements at the
top setting locations of files and executables based on which
user/machine I'm running on (I use both Windows and a couple of Linux
boxes with different user names).
e.g. Where are the blast executables, the blast databases, and my genome
collection, ...
Peter
From bugzilla-daemon at portal.open-bio.org Mon Oct 8 07:30:03 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Oct 2007 07:30:03 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710081130.l98BU36u021016@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
------- Comment #34 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-08 07:30 EST -------
Recap, most of the issues were resolved by switching Bio.Fasta from Martel to
pure python. Additionally:
test_Fasta - 'fixed' by deprecating the Mindy indexing functions
test_KEGG - fixed by switching from Martel to pure python
test_format_registry - 'fixed' by removing FormatIO
test_geo - fixed by switching from Martel to pure python
test_GenBankFormat - this entire test is for the little-used Martel GenBank
expression, and this works with mxTextTools 2.0 but fails with mxTextTools 3.0
From mdehoon at c2b2.columbia.edu Tue Oct 9 00:34:28 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 9 Oct 2007 00:34:28 -0400
Subject: [Biopython-dev] Output of Biopython tests
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu>
Hi everybody,
With the help of several Biopython developers, especially Peter, the problems
with Martel and the new mxTextTools release have now been solved (in the
sense that all unit tests now succeed). So we're a lot closer to a new
Biopython release. Thanks everybody!
When I was running the Biopython tests, one thing bothered me though. All
Biopython tests now have a corresponding output file that contains the output
the test should generate if it runs correctly. For some tests, this makes
perfect sense, particularly if the output is large. For others, on the other
hand, having the test output explicitly in a file doesn't actually add much
information. For example, the output for test_psw is
test_psw
test_AlignmentColumn_assertions (test_psw.TestPSW) ... ok
test_AlignmentColumn_full (test_psw.TestPSW) ... ok
test_AlignmentColumn_kinds (test_psw.TestPSW) ... ok
test_AlignmentColumn_repr (test_psw.TestPSW) ... ok
test_Alignment_assertions (test_psw.TestPSW) ... ok
test_Alignment_normal (test_psw.TestPSW) ... ok
test_ColumnUnit (test_psw.TestPSW) ... ok
Doctest: Bio.Wise.psw.parse_line ... ok
----------------------------------------------------------------------
Ran 8 tests in 0.002s
OK
For comparison, this is the test output if test_psw.py fails:
test_AlignmentColumn_assertions (__main__.TestPSW) ... ok
test_AlignmentColumn_full (__main__.TestPSW) ... ok
test_AlignmentColumn_kinds (__main__.TestPSW) ... FAIL
test_AlignmentColumn_repr (__main__.TestPSW) ... ok
test_Alignment_assertions (__main__.TestPSW) ... ok
test_Alignment_normal (__main__.TestPSW) ... ok
test_ColumnUnit (__main__.TestPSW) ... ok
Doctest: Bio.Wise.psw.parse_line ... ok
======================================================================
FAIL: test_AlignmentColumn_kinds (__main__.TestPSW)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_psw.py", line 47, in test_AlignmentColumn_kinds
self.assertEqual(ac.kind,
"some_funny_output_I_made_up_instead_of_INSERT")
AssertionError: 'INSERT' != 'some_funny_output_I_made_up_instead_of_INSERT'
----------------------------------------------------------------------
Ran 8 tests in 0.000s
The point is that for this test, having the output explicitly is not needed
in order to identify the problem.
Now, for some tests having the output explicitly actually causes a problem.
I'm thinking about those unit tests that only run if some particular software
is installed on the system (for example, SQL). In those cases, we need to
distinguish failure due to missing software from a true failure (the former
may not bother the user much if he's not interested in that particular part
of Biopython). If a test cannot be run because of missing prerequisites,
currently a unit test generates an ImportError, which is then caught inside
run_tests. Hence, we get the following output when running the Biopython
tests:
test_BioSQL ... Skipping test because of import error: Skipping BioSQL tests --
enable tests in Tests/test_BioSQL.py
ok
When you look inside test_BioSQL.py, you'll see that the actual error is not
an ImportError. In addition, if a true ImportError occurs during the test,
the test will inadvertently be treated as skipped.
My solution would be to skip tests inside test_BioSQL if the prerequisites
are not met. However, in that case the test output no longer agrees with the
expected test output, generating a failure message.
I'd therefore like to suggest the following:
1) Keep the test output, but let each test_* script (instead of run_tests.py)
be responsible for comparing the test output with the expected output.
2) If the expected output is trivial, simply use the assert statements to
verify the test output instead of storing them in a file and reading them
from there.
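A minimal sketch of the two suggestions above, assuming a modern Python 3
interpreter. The names run_and_capture and demo_test are made up for
illustration and are not part of run_tests.py or any existing test script:

```python
import contextlib
import io


def run_and_capture(func):
    """Run func() and return everything it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def demo_test():
    # Stand-in for the body of a verbose test script (hypothetical).
    print("Loading record")
    print("Length: 24")


# Expected output kept next to the test; it could equally be read from a file.
EXPECTED = "Loading record\nLength: 24\n"

# Suggestion 1: the test script itself compares actual and expected output.
actual = run_and_capture(demo_test)
assert actual == EXPECTED, "output differs from expected output"

# Suggestion 2: a trivial result needs only an assertion, no stored output.
assert len("NNNTCAAAAAGGTGCATCTAGATG") == 24
```

With this arrangement a test that skips part of its work can also adjust the
expected output itself, which is the case that run_tests.py cannot handle.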
Any objections?
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From mhobbs_of_lawson at bigpond.com Mon Oct 8 22:18:39 2007
From: mhobbs_of_lawson at bigpond.com (mhobbs_of_lawson)
Date: Tue, 9 Oct 2007 12:18:39 +1000
Subject: [Biopython-dev] translate
Message-ID: <5496247.1191896319102.JavaMail.root@web06sl>
Hi,
Please can someone tell me what is wrong here. I simply want to be able to translate ambiguous DNA which includes an 'NNN' triplet.
Thanks,
Matthew
>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio import Translate
>>> s = "NNNTCAAAAAGGTGCATCTAGATG"
>>> dna = Seq.Seq(s, IUPAC.ambiguous_dna)
>>> trans = Translate.ambiguous_dna_by_id[1]
>>> print trans.translate(dna)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/Python24/Lib/site-packages/Bio/Translate.py", line 20, in translate
append(get(s[i:i+3], stop_symbol))
File "/cygdrive/c/Python24/Lib/site-packages/Bio/Data/CodonTable.py", line 544, in get
return self.__getitem__(codon)
File "/cygdrive/c/Python24/Lib/site-packages/Bio/Data/CodonTable.py", line 577, in __getitem__
raise TranslationError, codon # does not code
Bio.Data.CodonTable.TranslationError: NNN
From biopython-dev at maubp.freeserve.co.uk Tue Oct 9 07:54:29 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 09 Oct 2007 12:54:29 +0100
Subject: [Biopython-dev] translate
In-Reply-To: <5496247.1191896319102.JavaMail.root@web06sl>
References: <5496247.1191896319102.JavaMail.root@web06sl>
Message-ID: <470B6BF5.607@maubp.freeserve.co.uk>
mhobbs_of_lawson wrote:
> Hi,
>
> Please can someone tell me what is wrong here. I simply want to be able to translate ambiguous DNA which includes an 'NNN' triplet.
A very reasonable request. I assume you expect just an X for an NNN codon?
I have the general impression that some of Biopython's handling of
ambiguous sequences isn't all wonderful... something I have started to
tackle in bug 2366:
http://bugzilla.open-bio.org/show_bug.cgi?id=2366
Obviously sequence manipulation is a core bit of functionality - and I
would like at least one other person to comment on that code before I
risk committing it ;)
Translation of ambiguous codons would be next on my hit list... as right
now it doesn't seem to do what I would expect at all.
In the short term, manually adding additional mappings to the forward
table (a python dictionary) would probably "fix" your specific issue.
While we are on this topic, we use "*" for stop codons and "X" for an
ambiguous amino acid - but is anyone aware of a character convention for
something that might be either a stop codon or an amino acid? (other
than just using "X" for this too)?
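To make the expected behaviour concrete, here is a rough sketch of one way
ambiguous codons could be translated. It is not the Bio.Translate code; the
function translate_ambiguous and the tables are written out here only for
illustration, assuming Python 3, the standard genetic code, and the IUPAC
nucleotide ambiguity codes. Each ambiguous codon is expanded to every codon it
could stand for; if they all give the same result that result is used,
otherwise "X" (which here also covers the "stop or amino acid" case):

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC ambiguity codes for DNA.
AMBIGUOUS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_ambiguous(dna, unknown="X"):
    """Translate DNA, expanding each ambiguous codon to all its meanings."""
    protein = []
    # Ignore any trailing partial codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3].upper()
        # Every unambiguous codon this (possibly ambiguous) codon could be.
        options = product(*(AMBIGUOUS[base] for base in codon))
        results = {CODON_TABLE["".join(option)] for option in options}
        # One possible outcome: use it. Several: fall back to `unknown`.
        protein.append(results.pop() if len(results) == 1 else unknown)
    return "".join(protein)


print(translate_ambiguous("NNNTCAAAAAGGTGCATCTAGATG"))  # prints XSKRCI*M
```

Under this scheme TCN still translates to S and TAR to "*", while TGR (stop or
W) becomes "X".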
Peter
From biopython-dev at maubp.freeserve.co.uk Tue Oct 9 07:44:01 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 09 Oct 2007 12:44:01 +0100
Subject: [Biopython-dev] Output of Biopython tests
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu>
References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu>
Message-ID: <470B6981.3020707@maubp.freeserve.co.uk>
Michiel De Hoon wrote:
> When I was running the Biopython tests, one thing bothered me though.
> All Biopython tests now have a corresponding output file that
> contains the output the test should generate if it runs correctly.
> For some tests, this makes perfect sense, particularly if the output
> is large. For others, on the other hand, having the test output
> explicitly in a file doesn't actually add much information.
Is this actually a problem? It gives us a simple unified test framework
where developers can use whatever fancy test frameworks they want to.
Personally I have tried to write simple scripts with meaningful output
(plus often additional assertions). I think that because these are very
simple, they can double as examples/documentation for the curious.
My personal view is that some of the "fancy frameworks" used in some
test cases are very intimidating to a beginner (and act as a barrier to
taking the code and modifying it for their own use).
> The point is that for this test, having the output explicitly is not
> needed in order to identify the problem.
True. I would have written that particular test to give some meaningful
output; I find it makes it easier to start debugging why a test fails.
> Now, for some tests having the output explicitly actually causes a
> problem. I'm thinking about those unit tests that only run if some
> particular software is installed on the system (for example, SQL). In
> those cases, we need to distinguish failure due to missing software
> from a true failure (the former may not bother the user much if he's
> not interested in that particular part of Biopython). If a test
> cannot be run because of missing prerequisites, currently a unit test
> generates an ImportError, which is then caught inside run_tests.
> ...
> When you look inside test_BioSQL.py, you'll see that the actual error
> is not an ImportError. In addition, if a true ImportError occurs
> during the test, the test will inadvertently be treated as skipped.
Perhaps we should introduce a MissingExternalDependency error instead,
used for this specific case, and catch that in run_tests.py, while
treating ImportError as a real error.
As you say, if we have done some dramatic restructuring (such as
removing a module) there could be some REAL ImportErrors which we might
risk ignoring.
> I'd therefore like to suggest the following:
> 1) Keep the test output, but let each test_* script (instead of
> run_tests.py) be responsible for comparing the test output with the
> expected output.
I'm not keen on that - it means duplication of code (or at least some
common functionality to call) and makes writing simple tests that little
bit harder. I like the fact that the more verbose test scripts can be
run on their own as an example of what the module can do.
> 2) If the expected output is trivial, simply use the assert
> statements to verify the test output instead of storing them in a
> file and reading them from there.
By all means, test trivial output with assertions. I already do this
within many of my "verbose" tests where I want to keep the console
output reasonably short.
Peter
From tiagoantao at gmail.com Tue Oct 9 10:27:18 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 9 Oct 2007 15:27:18 +0100
Subject: [Biopython-dev] Configuration files
In-Reply-To: <470A0C00.50505@maubp.freeserve.co.uk>
References: <6d941f120710050726s4ca53349h1b8d499650e5726a@mail.gmail.com>
<470A0C00.50505@maubp.freeserve.co.uk>
Message-ID: <6d941f120710090727m787c08abn13665c662727446c@mail.gmail.com>
Would it be interesting to have something like
config = Bio.Config.getConfig()
fdist_path = config['PopGen.FDistDir']
Something that:
1. Would allow for a standard configuration mechanism (as opposed to
having different styles for each module/author)
2. Would abstract away how the configuration is stored (registry, conf
file, ...)
If there was an agreement on doing this (or something along these
lines), I would volunteer the time to do it.
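As a rough illustration of the idea, here is a sketch built on the standard
library's configparser, assuming Python 3. Nothing here exists in Biopython:
the Config class, the get_config function, the section and option names, and
the paths are all hypothetical, and the storage (an INI-style text) is just
one possible backend:

```python
import configparser

# Hypothetical settings; where such a file would live is not decided here.
EXAMPLE = """
[PopGen]
FDistDir = /usr/local/fdist

[Blast]
ExeDir = /usr/local/blast/bin
"""


class Config:
    """Dictionary-style access using 'Section.Option' keys."""

    def __init__(self, parser):
        self._parser = parser

    def __getitem__(self, key):
        section, option = key.split(".", 1)
        try:
            return self._parser.get(section, option)
        except (configparser.NoSectionError, configparser.NoOptionError):
            raise KeyError(key) from None


def get_config(text=EXAMPLE):
    """Build a Config from INI text (reading a file would work similarly)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return Config(parser)


config = get_config()
fdist_path = config["PopGen.FDistDir"]
print(fdist_path)
```

Because callers only see the dictionary-style lookup, the backend could later
be swapped for a registry entry or something else without touching user code.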
On 10/8/07, Peter wrote:
> Tiago Antão wrote:
> > Hi,
> >
> > Is there any (Biopython standard) way to configure Biopython during
> > runtime? When writing code sometimes I think it would be very
> > convenient (especially to the programmer using Biopython) to abstract
> > some configuration parameters away from the code. Things like the
> > location of binaries, hosts, user names (and maybe passwords) of
> > databases, timeout parameters, etc. These could be stored in a
> > configuration file (or registry entry, or whatever), saving users
> > from having to supply them in the code...
> > Just an idea...
>
> This sounds like a fairly general thing (i.e. for all of python) rather
> than being Biopython specific.
>
> For example, I find a lot of my scripts have a few if statements at the
> top setting locations of files and executables based on which
> user/machine I'm running on (I use both Windows and a couple of Linux
> boxes with different user names).
>
> e.g. Where are the blast executables, the blast databases, and my genome
> collection, ...
>
> Peter
>
>
--
http://www.tiago.org/ps
From mhobbs_of_lawson at bigpond.com Tue Oct 9 19:07:43 2007
From: mhobbs_of_lawson at bigpond.com (Matthew Hobbs)
Date: Wed, 10 Oct 2007 09:07:43 +1000
Subject: [Biopython-dev] translate
In-Reply-To: <470B6BF5.607@maubp.freeserve.co.uk>
References: <5496247.1191896319102.JavaMail.root@web06sl>
<470B6BF5.607@maubp.freeserve.co.uk>
Message-ID: <470C09BF.8050906@bigpond.com>
Thanks Peter for your reply.
Peter wrote:
> mhobbs_of_lawson wrote:
>> Please can someone tell me what is wrong here. I simply want to be
>> able to translate ambiguous DNA which includes an 'NNN' triplet.
>
> A very reasonable request. I assume you expect just an X for an NNN codon?
yep
> In the short term, manually adding additional mappings to the forward
> table (a python dictionary) would probably "fix" your specific issue.
OK - so this works:
from Bio import Seq
from Bio.Alphabet import IUPAC
from Bio import Translate
s = "NNNTCAAAAAGGTGCATCTAGATG"
dna = Seq.Seq(s, IUPAC.ambiguous_dna)
trans = Translate.ambiguous_dna_by_id[1]
trans.table.forward_table.forward_table['NNN'] = 'X'
print trans.translate(dna)
> While we are on this topic, we use "*" for stop codons and "X" for an
> ambiguous amino acid - but is anyone aware of a character convention for
> something that might be either a stop codon or an amino acid? (other
> than just using "X" for this too)?
No I don't know
Thanks,
Matthew
From mdehoon at c2b2.columbia.edu Thu Oct 11 06:31:59 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 11 Oct 2007 06:31:59 -0400
Subject: [Biopython-dev] Output of Biopython tests
References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu>
<470B6981.3020707@maubp.freeserve.co.uk>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu>
> Perhaps we should introduce a MissingExternalDependency error instead,
> used for this specific case, and catch that in run_tests.py, while
> treating ImportError as a real error.
OK. I added a MissingExternalDependencyError exception to Bio/__init__.py,
and modified BioSQL, Bio.GFF, and some test scripts accordingly. When
MissingExternalDependencyError occurs in a test, a warning is printed but it
is not counted as a failure.
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From mdehoon at c2b2.columbia.edu Thu Oct 11 06:44:56 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 11 Oct 2007 06:44:56 -0400
Subject: [Biopython-dev] function enumerate in Bio/GFF/GenericTools.py;
Bio/DocSQL.py
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B637@mail2.exch.c2b2.columbia.edu>
Do we still need the function "enumerate" in Bio/GFF/GenericTools.py and
Bio/DocSQL.py?
AFAICT, this function does exactly the same as the Python built-in enumerate
function.
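For reference, a quick comparison, assuming Python 3. The hand-written version
below is only a guess at the typical shape of such a helper (the built-in
arrived in Python 2.3); it is not copied from Bio/GFF/GenericTools.py or
Bio/DocSQL.py:

```python
def enumerate_by_hand(iterable):
    """Hand-written equivalent of the built-in enumerate (illustrative)."""
    index = 0
    for item in iterable:
        yield index, item
        index += 1


codons = ["ATG", "TCA", "TAG"]

# Both versions give the same (index, item) pairs.
assert list(enumerate_by_hand(codons)) == list(enumerate(codons))

for index, codon in enumerate(codons):
    print(index, codon)
```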
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Thu Oct 11 16:44:46 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Oct 2007 21:44:46 +0100
Subject: [Biopython-dev] Revised tutorial
Message-ID: <470E8B3E.6080709@maubp.freeserve.co.uk>
In anticipation of the next release, I've done some more work on the
tutorial today -- in particular the section on the Seq object which I
have turned into a new chapter.
If anyone has the time to go over this soon that would be great. I'll be
away tomorrow (Friday) but will probably have time to make any revisions
needed at the weekend.
It's here in CVS:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/Tutorial.tex?cvsroot=biopython
This is a LaTeX file which gets turned into the PDF and HTML versions of
the tutorial using pdflatex and hevea. If you want to proof read but
don't know anything about LaTeX then I can probably email you the PDF
version for comment (half a megabyte).
Peter
From sbassi at gmail.com Thu Oct 11 18:48:39 2007
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu, 11 Oct 2007 19:48:39 -0300
Subject: [Biopython-dev] Revised tutorial
In-Reply-To: <470E8B3E.6080709@maubp.freeserve.co.uk>
References: <470E8B3E.6080709@maubp.freeserve.co.uk>
Message-ID:
Hello,
I can't resolve all the dependencies to install hevea so I can't
generate the dvi from the tex file. Could you please send me by email
the final PDF?
Best,
SB.
--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318
From mdehoon at c2b2.columbia.edu Thu Oct 11 21:53:19 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 11 Oct 2007 21:53:19 -0400
Subject: [Biopython-dev] Output of Biopython tests
References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> <470B6981.3020707@maubp.freeserve.co.uk>
<6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu>
<470E3E7E.1000301@maubp.freeserve.co.uk>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B638@mail2.exch.c2b2.columbia.edu>
Peter wrote:
> Michiel De Hoon wrote:
> > OK. I added a MissingExternalDependencyError exception to
> > Bio/__init__.py, and modified BioSQL, Bio.GFF, and some test scripts
> > accordingly. When MissingExternalDependencyError occurs in a test, a
> > warning is printed but it is not counted as a failure.
>
> I might have defined the exception within the test framework rather than
> Bio/__init__.py, but now that it's there we can start to use it in things
> like modules that wrap external tools.
That is why I put it in Bio/__init__.py; Bio/GFF/__init__.py is already using
this exception (outside of the testing framework).
> I've updated Tests/requires_internet.py and Tests/requires_wise.py to
> match (I don't have wise on my machine which is why I noticed it still
> threw an ImportError).
Thanks! I missed those.
> Is there anything I can do to help get things ready for the release of
> Biopython 1.44?
At some point, somebody will need to go through the documentation to check if
everything documented there still works with the Biopython in CVS, and to
remove sections in the documentation describing deprecated code. But it's
probably better to wait until after we decide what to do with
test_GenBankFormat.
> If you do have time to give the patch on bug 2366 a check, I think it
> would be worth including before the next release.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2366
No time to check it. But I'd be happy to rely on your judgement and include
it.
--Michiel.
From bugzilla-daemon at portal.open-bio.org Thu Oct 11 22:32:05 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 11 Oct 2007 22:32:05 -0400
Subject: [Biopython-dev] [Bug 2361] Test Suite Failures from Martel/Sax with
egenix mxTextTools 3.0
In-Reply-To:
Message-ID: <200710120232.l9C2W5e9022504@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2361
------- Comment #35 from mdehoon at ims.u-tokyo.ac.jp 2007-10-11 22:32 EST -------
> test_GenBankFormat - this entire test is for the little-used Martel GenBank
> expression, and this works with mxTextTools 2.0 but fails with mxTextTools 3.0
If it's little-used, should we include it for the next release or can it be
removed? If we remove the test, should we then also remove the corresponding
module?
From biopython-dev at maubp.freeserve.co.uk Thu Oct 11 16:37:52 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Oct 2007 21:37:52 +0100
Subject: [Biopython-dev] Output of Biopython tests
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu>
References: <6243BAA9F5E0D24DA41B27997D1FD14402B634@mail2.exch.c2b2.columbia.edu> <470B6981.3020707@maubp.freeserve.co.uk>
<6243BAA9F5E0D24DA41B27997D1FD14402B636@mail2.exch.c2b2.columbia.edu>
Message-ID: <470E89A0.1010502@maubp.freeserve.co.uk>
Michiel De Hoon wrote:
>> Perhaps we should introduce a MissingExternalDependency error instead,
>> used for this specific case, and catch that in run_tests.py, while
>> treating ImportError as a real error.
>
> OK. I added a MissingExternalDependencyError exception to Bio/__init__.py,
> and modified BioSQL, Bio.GFF, and some test scripts accordingly. When
> MissingExternalDependencyError occurs in a test, a warning is printed but it
> is not counted as a failure.
I might have defined the exception within the test framework rather than
Bio/__init__.py, but now that it's there we can start to use it in things
like modules that wrap external tools.
I've updated Tests/requires_internet.py and Tests/requires_wise.py to
match (I don't have wise on my machine which is why I noticed it still
threw an ImportError).
This means run_tests.py now runs without errors using CVS on my 64 bit
Linux machine (bar the mxTextTools 3.0 issue with test_GenBankFormat.py,
bug 2361).
Is there anything I can do to help get things ready for the release of
Biopython 1.44?
If you do have time to give the patch on bug 2366 a check, I think it
would be worth including before the next release.
http://bugzilla.open-bio.org/show_bug.cgi?id=2366
Peter
From fennan at gmail.com Mon Oct 15 05:48:45 2007
From: fennan at gmail.com (Fernando)
Date: Mon, 15 Oct 2007 11:48:45 +0200
Subject: [Biopython-dev] Database into variables
Message-ID: <7b13e61d0710150248v72a550d6h38e1467edf5073eb@mail.gmail.com>
Hi everybody,
I am thinking of including some algorithms that I work with into biopython.
My first concern is that I'm using a local image of the Gene Ontology
database to perform several operations. In order to avoid such database
accesses I could precompute the information I need and load it once the
module is called. How should I do it? Is there a guideline style to load
external variables or something like that? Any other ideas/suggestions?
Thanks
From fennan at gmail.com Mon Oct 15 06:28:56 2007
From: fennan at gmail.com (Fernando)
Date: Mon, 15 Oct 2007 12:28:56 +0200
Subject: [Biopython-dev] Precompute database information
Message-ID: <7b13e61d0710150328l354bfb5eu1b76ed05024a65c4@mail.gmail.com>
Hi everybody,
I am thinking of including some algorithms that I work with into biopython.
My first concern is that I'm using a local image of the Gene Ontology
database to perform several operations. In order to avoid such database
accesses I could precompute the information I need and load it once the
module is called. How should I do it? Is there a guideline style to load
external variables or something like that? Any other ideas/suggestions?
Thanks
From bugzilla-daemon at portal.open-bio.org Mon Oct 15 07:11:26 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Oct 2007 07:11:26 -0400
Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in
(Reverse)complement functions in Bio.Seq
In-Reply-To:
Message-ID: <200710151111.l9FBBQOE012625@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2366
tiagoantao at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |tiagoantao at gmail.com
------- Comment #3 from tiagoantao at gmail.com 2007-10-15 07:11 EST -------
I had a look at the test code and tried to find which test case is changing the
ambiguous_dna dict.
I used this little script (putting it here as it might be useful for detecting
these types of problems):
for i in test_*py; do
python run_tests.py $i;
done
It turns out that it is test_Nexus.py. Further inspection of the code seems
to reveal that it is not the test case that pollutes the dictionary but the
Nexus module itself.
Maybe it makes sense to raise a bug on the Nexus module... Any comments on
these findings?
From bugzilla-daemon at portal.open-bio.org Mon Oct 15 10:16:00 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Oct 2007 10:16:00 -0400
Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in
(Reverse)complement functions in Bio.Seq
In-Reply-To:
Message-ID: <200710151416.l9FEG01A023797@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2366
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-15 10:16 EST -------
Thanks for that Tiago,
I guess we should file a bug on Bio.Nexus about the alphabet issue. It may be that
it should create a copy or subclass of the ambiguous DNA alphabet in order to
include "?" (I imagine that Nexus uses this rather than "N"), and see if it is
using the Gapped() alphabet system or not.
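The difference between polluting the shared dictionary and working on a copy
can be sketched like this, assuming Python 3. The dictionary below is a
trimmed stand-in, and both functions are hypothetical; this is not the
Bio.Nexus code:

```python
import copy

# Stand-in for a shared, module-level table (hypothetical, heavily trimmed).
ambiguous_dna_values = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def nexus_table_polluting():
    """Adds '?' to the shared dict itself: every other user sees the change."""
    table = ambiguous_dna_values
    table["?"] = "ACGT"
    return table


def nexus_table_private():
    """Adds '?' to a private copy: the shared dict is left untouched."""
    table = copy.deepcopy(ambiguous_dna_values)
    table["?"] = "ACGT"
    return table


private = nexus_table_private()
print("?" in private)               # the copy has the extra symbol
print("?" in ambiguous_dna_values)  # the shared table does not
```

Calling nexus_table_polluting() instead would add "?" to the shared table for
every module imported afterwards, which matches the symptom seen when the
tests are run together.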
Did you have any comments on this patch for (reverse) complements?
From jflatow at northwestern.edu Mon Oct 15 20:08:13 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Mon, 15 Oct 2007 19:08:13 -0500
Subject: [Biopython-dev] Biopython status
Message-ID: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
Hi all,
I've just started using Biopython and I am wondering about the status
of the group, since I've heard rumors that it's dying. So far I have
found the library very useful, if at times frustrating, though I
will admit I am fairly new to developing python as well. I have been
hesitant to make changes to existing code, however I have found that
in a few cases it has been by far the best way to accomplish what I
need, and have only done so in cases where it seems to be the *right*
thing to do.
With that in mind, I have a few questions I was hoping you all could
answer. First, how might I put these changes up for review in order
to contribute back to the code base? The main changes have been to
the AlignAce parser, since as it was it just ignored information
contained in the alignace file regarding the motif instances (namely
which input sequence they came from, where they started in the
sequence, and what strand they were on). I have also needed to create
a modified FASTA parser so that I can read things like quality score
files. I would be happy to submit the changes to the group or an
individual for inspection, but I would like to avoid having to
maintain my own separate version of Biopython if possible.
I am also wondering how it would be received if I did something like
add a to_fasta method to SeqRecord instead of having to go through
writing it to a file using a SeqIO when all I want is the string.
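Something along these lines is what I have in mind, sketched here as a
standalone function in Python 3 rather than a real SeqRecord method. The name
to_fasta, its parameters, and the 60-column default are assumptions made for
illustration only:

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Return one FASTA record as a string, wrapped at `width` columns."""
    sequence = str(sequence)
    # Title line: identifier, then an optional free-text description.
    title = ("%s %s" % (identifier, description)).strip()
    # Break the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (title, "\n".join(lines))


print(to_fasta("example1", "NNNTCAAAAAGGTGCATCTAGATG", "made-up record", width=10))
```

The caller gets the string directly and can still write it to a file handle
if needed.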
Finally, are there plans to move to a subversion repository at any
point?
Thanks!
Jared Flatow
From sbassi at gmail.com Tue Oct 16 01:09:16 2007
From: sbassi at gmail.com (Sebastian Bassi)
Date: Tue, 16 Oct 2007 02:09:16 -0300
Subject: [Biopython-dev] Biopython status
In-Reply-To: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
Message-ID:
On 10/15/07, Jared Flatow wrote:
> I've just started using Biopython and I am wondering about the status
> of the group, since I've heard rumors that its dying. So far I have
You could subscribe to the rss feed of the CVS and you will see a lot
of activity. The developers list and the bug tracking program
(bugzilla) are also pretty busy; that doesn't look like a dying group to
me :)
--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318
From mdehoon at c2b2.columbia.edu Tue Oct 16 01:37:14 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue, 16 Oct 2007 01:37:14 -0400
Subject: [Biopython-dev] Biopython status
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu>
Hi Jared,
> I've just started using Biopython and I am wondering about the status
> of the group, since I've heard rumors that its dying.
From looking at the activity on the Biopython mailing lists in recent months,
it doesn't seem to be dying :-).
> So far I have found the library very useful, if not at times frustrating,
> though I will admit I am fairly new to developing python as well.
One thing to keep in mind is that Biopython started about eight years ago,
and some approaches that seemed to be a good idea at that time may not seem
to be so now. Nevertheless, I feel that Biopython is moving in the right
direction in terms of ease-of-use.
> First, how might I put these changes up for review in order
> to contribute back to the code base? The main changes have been to
> the AlignAce parser, since as it was it just ignored information
> contained in the alignace file regarding the motif instances (namely
> which input sequence they came from, where they started in the
> sequence, and what strand they were on).
In this case, it is a good idea to contact the current maintainer of
Bio.AlignAce, either via the mailing list or directly. From the Biopython
CVS, it seems that Bartek is currently the main maintainer of Bio.AlignAce,
so it would be a good idea to discuss with him.
> I have also needed to create a modified FASTA parser so that I
> can read things like quality score files.
At some point, Biopython had several (two or three?) Fasta parsers, two Fasta
formats, etc. This is a situation we should definitely avoid. So if your
modifications fit in well with the existing Fasta parser in Bio.SeqIO, it may
very well be accepted into Biopython. Otherwise, it's better to leave it out.
This is just my opinion though.
> I am also wondering how it would be received if I did something like
> add a to_fasta method to SeqRecord instead of having to go through
> writing it to a file using a SeqIO when all I want is the string.
This sounds like feature creep to me, so I would be against it. It's easy to
add code to Biopython, it's much harder to remove stuff. Code bloat is a real
problem in Biopython.
> Finally, are there plans to move to a subversion repository at any
> point?
There were some plans at some point, but I don't know the current status.
Best,
--Michiel.
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 04:16:01 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 09:16:01 +0100
Subject: [Biopython-dev] Biopython status
In-Reply-To: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
Message-ID: <47147341.4020708@maubp.freeserve.co.uk>
Jared Flatow wrote:
> I have also needed to create a modified FASTA parser so that I can
> read things like quality score files.
Could you be a little more specific - what exactly do you mean by
quality score files (links and/or examples)? It may be that this
warrants setting up a new file format in Bio.SeqIO.
> I would be happy to submit the changes to the group or an individual
> for inspection, but I would like to avoid having to maintain my own
> separate version of Biopython if possible.
As has already been said - please file some (enhancement) bugs and
attach your patches, or raise specific issues for discussion on this
mailing list.
Depending on the nature of your changes, you might be able to achieve
some of them by subclassing Biopython's objects - rather than literally
maintaining your own branch of the project.
> I am also wondering how it would be received if I did something like
> add a to_fasta method to SeqRecord instead of having to go through
> writing it to a file using a SeqIO when all I want is the string.
Out of interest, why do you want to create a FASTA record as a string?
Did you know you can write to a string using any Bio.SeqIO supported
file format using StringIO? Perhaps we should spell this out more
explicitly in the documentation, but a motivating example would help.
I would suggest rather than adding a to_fasta method to the SeqRecord,
simply write your own "seqrecord_to_string" function (or create a
subclass of SeqRecord with this method).
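The StringIO route described above can be sketched roughly as follows. This is a standard-library-only illustration: Record and write_fasta are hypothetical stand-ins for a SeqRecord and for a handle-based writer, not Biopython's actual API. The only point is that any function which writes to a file handle can produce a string when it is given an in-memory handle.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord: an id, a description and a sequence.
Record = namedtuple("Record", ["id", "description", "seq"])


def write_fasta(records, handle, width=60):
    """Write records to any file-like handle in FASTA format; return the count."""
    count = 0
    for record in records:
        # Title line is the id followed by the description, if there is one.
        title = " ".join(part for part in (record.id, record.description) if part)
        handle.write(">" + title + "\n")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(record.seq), width):
            handle.write(record.seq[start:start + width] + "\n")
        count += 1
    return count


def seqrecord_to_string(record):
    """Return one record as a FASTA-formatted string via an in-memory handle."""
    handle = io.StringIO()
    write_fasta([record], handle)
    return handle.getvalue()


fasta_text = seqrecord_to_string(Record("read1", "example read", "CAGCTAGCT"))
print(fasta_text, end="")
```

A web server could return fasta_text directly, while file output keeps using the same writer with a real file handle.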
> Finally, are there plans to move to a subversion repository at any
> point?
It was raised a while ago, and our cunning plan was to let BioPerl try
the move first. Once that has been proven, it should be fairly easy for
the OBF guys to also move us over. I should email them to see how
things stand...
Peter
From bartek at rezolwenta.eu.org Tue Oct 16 05:11:01 2007
From: bartek at rezolwenta.eu.org (bartek wilczynski)
Date: Tue, 16 Oct 2007 11:11:01 +0200
Subject: [Biopython-dev] Biopython status
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
<6243BAA9F5E0D24DA41B27997D1FD14402B639@mail2.exch.c2b2.columbia.edu>
Message-ID: <1192525861.4714802535dae@imp.rezolwenta.eu.org>
Michiel De Hoon wrote:
> > First, how might I put these changes up for review in order
> > to contribute back to the code base? The main changes have been to
> > the AlignAce parser, since as it was it just ignored information
> > contained in the alignace file regarding the motif instances (namely
> > which input sequence they came from, where they started in the
> > sequence, and what strand they were on).
>
> In this case, it is a good idea to contact the current maintainer of
> Bio.AlignAce, either via the mailing list or directly. From the Biopython
> CVS, it seems that Bartek is currently the main maintainer of Bio.AlignAce,
> so it would be a good idea to discuss with him.
I'm not dying either ;). I'm the author of the Bio.AlignAce module and if you
have any new code to contribute to it, I'll be glad to help you. The best way
to do it would be to submit an enhancement bug report in bugzilla. If the
changes are smaller, you can just send them (as a diff) to the list and I'll
try to fit them to the current cvs version of Bio.AlignAce
Bartek Wilczynski
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 05:55:37 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 05:55:37 -0400
Subject: [Biopython-dev] [Bug 2380] New: Bio.Nexus is adding "?" and "-" to
Bio.Data.IUPACData.ambiguous_dna_values
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2380
Summary: Bio.Nexus is adding "?" and "-" to
Bio.Data.IUPACData.ambiguous_dna_values
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: minor
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
This issue was raised in Bug 2366 where a unit test was found to be "polluting"
ambiguous_dna_values, later identified as Bio.Nexus via test_Nexus.py
Need to see if Bio.Nexus should be making a copy of this dict, or perhaps
defining a subclass of the alphabet (using the Gapped() class maybe).
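The "make a copy" option mentioned above might look something like the sketch below. The dictionary here is a small stand-in for the shared table, and the values given to "?" and "-" are placeholders assumed for illustration, not necessarily what Bio.Nexus actually uses.

```python
# Stand-in for a shared, module-level lookup table such as
# Bio.Data.IUPACData.ambiguous_dna_values (only a few entries shown).
ambiguous_dna_values = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def private_table(shared):
    """Return an extended copy, leaving the shared dict untouched."""
    # A shallow copy is enough because the values are immutable strings.
    table = dict(shared)
    table["?"] = "ACGT"  # placeholder value, assumed for illustration
    table["-"] = "-"     # placeholder value, assumed for illustration
    return table


nexus_values = private_table(ambiguous_dna_values)
print("?" in nexus_values, "?" in ambiguous_dna_values)
```

With this pattern other modules (and other unit tests) that import the shared table never see the extra gap and missing-data symbols.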
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 05:56:37 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 05:56:37 -0400
Subject: [Biopython-dev] [Bug 2366] Ambiguous nucleotides in
(Reverse)complement functions in Bio.Seq
In-Reply-To:
Message-ID: <200710160956.l9G9ub18007735@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2366
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 05:56 EST -------
Fix committed (after Michiel's OK on the mailing list), marking as fixed.
Checking in Tests/test_seq.py;
/home/repository/biopython/biopython/Tests/test_seq.py,v <-- test_seq.py
new revision: 1.6; previous revision: 1.5
done
Checking in Tests/output/test_seq;
/home/repository/biopython/biopython/Tests/output/test_seq,v <-- test_seq
new revision: 1.6; previous revision: 1.5
done
Checking in Bio/Seq.py;
/home/repository/biopython/biopython/Bio/Seq.py,v <-- Seq.py
new revision: 1.17; previous revision: 1.16
done
I've filed Bug 2380 for the Nexus issue:
Bio.Nexus is adding "?" and "-" to Bio.Data.IUPACData.ambiguous_dna_values
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:11:09 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 06:11:09 -0400
Subject: [Biopython-dev] [Bug 2381] New: translate and transcibe method for
the the Seq object (in Bio.Seq)
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2381
Summary: translate and transcibe method for the the Seq object
(in Bio.Seq)
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
Biopython has translation and transcription modules (Bio/Translate.py and
Bio/Transcribe.py) but I find them a little bit complicated to use.
There are module level functions translate, transcribe, and back_transcribe in
Bio/Seq.py which take either a string, a Seq object or a MutableSeq object.
I would like to add similar methods to the Seq object (also defined in Bio/Seq.py)
to make this functionality more accessible from a Seq object.
NOTE: Python strings have a translate method of their own which is rather
different. Having the Seq translate method doing a biological translation
makes sense.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:13:35 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 06:13:35 -0400
Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the
Seq object (in Bio.Seq)
In-Reply-To:
Message-ID: <200710161013.l9GADZtJ008751@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2381
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|translate and transcibe |translate and transcibe
|method for the the Seq |methods for the Seq object
|object (in Bio.Seq) |(in Bio.Seq)
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 06:13 EST -------
fixed typo in the bug summary
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:26:44 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 06:26:44 -0400
Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the
Seq object (in Bio.Seq)
In-Reply-To:
Message-ID: <200710161026.l9GAQixw009268@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2381
------- Comment #2 from dalloliogm at gmail.com 2007-10-16 06:26 EST -------
I find it difficult to translate a sequence in the 6 reading frames with a single
command.
Actually I use something like this (for the three forward frames):
    for i in range(3):
        translate(seq[i:])
which is not very nice.
It would be nice to add a parameter to the translate function like in the
emboss application transeq
(http://emboss.sourceforge.net/apps/cvs/emboss/apps/transeq.html), something
like this:
>>> a = Seq('CAGCTAGCT')
>>> a.translate()
[(translation of a in the frame 0)]
>>> a.translate(1)
[(translation of a in the frame 1)]
>>> a.translate(F)
[(translation of a in the 3 forward frames)]
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 06:46:47 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 06:46:47 -0400
Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the
Seq object (in Bio.Seq)
In-Reply-To:
Message-ID: <200710161046.l9GAklI6010391@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2381
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 06:46 EST -------
Doing a three/six frame translation is however fairly common, and perhaps
warrants an "official" implementation in Bio.SeqUtils.
My current inclination is to try and keep the Bio.Seq translation function as
simple as possible. There are lots of possible options to worry about...
catering to them all could make the translate method rather daunting.
Perhaps things like the frame (or even the starting nucleotide) could be done
in Bio.Translate only. Another "special case" example I personally would like
is an option to check the first codon is a valid start codon for the specified
codon table, and to translate it as methionine (M). Then there is the question
of whether Bio.Translate's "translate_to_stop" functionality should be exposed in a
Seq method.
Note there is yet another (!) translation function Bio.SeqUtils.translate()
which is frame aware [personally I would mark a lot of this module as
deprecated].
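For reference, a stand-alone six-frame translation of the kind discussed in this bug could be sketched as below. It uses only the standard library and the standard genetic code; the function names and the 1..3 / -1..-3 frame numbering are assumptions made for this example and do not describe Bio.Seq, Bio.Translate or Bio.SeqUtils.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; trailing partial codons are dropped."""
    usable = len(dna) - len(dna) % 3
    # Codons containing anything other than A, C, G or T become "X".
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def reverse_complement(dna):
    """Reverse complement of an unambiguous upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translations(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("CAGCTAGCT")
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, frames[frame])
```

Keeping this in a helper function, rather than adding a frame argument to the Seq method, matches the inclination above to keep the method itself simple.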
From jflatow at northwestern.edu Tue Oct 16 12:02:19 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Tue, 16 Oct 2007 11:02:19 -0500
Subject: [Biopython-dev] Biopython status
In-Reply-To: <47147341.4020708@maubp.freeserve.co.uk>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu>
<47147341.4020708@maubp.freeserve.co.uk>
Message-ID: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
Please forgive me for ever doubting your health, it seems the group
is very much alive!
On Oct 16, 2007, at 3:16 AM, Peter wrote:
> Jared Flatow wrote:
>> I have also needed to create a modified FASTA parser so that I can
>> read things like quality score files.
>
> Could you be a little more specific - what exactly do you mean by a
> quality score files (links and/or examples). It may be that this
> warrants setting up a new file format in Bio.SeqIO
That is what I did. The quality score files I meant are simply FASTA-
like records that indicate the quality of each base pair read from a
sequencing machine, on a scale of something like 1 to 64. The values
are tab separated and correspond to 'reads' in another FASTA file
that contain the actual sequences read. This is the way the 454
GSFlex machines output their sequencing reads, so for every set of
reads there will be a pair of 454Reads.fna, 454Reads.qual files. The
only difference between a parser that processes these qual files and
one that processes the sequence files is that it shouldn't get rid of
spaces, and the newlines should not be stripped but converted into
spaces (when 454 writes a newline of scores they omit the space).
Essentially I have made a duplicate of FastaIO's iterator, named it
something else, made these two small changes and put an entry for it
in the SeqIO file.
16,17c16,17
< def GSQualIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
< """Generator function to iterate over GSFlex quality records (as SeqRecord objects).
---
> def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
> """Generator function to iterate over Fasta records (as SeqRecord objects).
54c54
< lines.append(line.rstrip()) # .replace(" ","")) leave off the replacing internal spaces so we can process qscore files (jf)
---
> lines.append(line.rstrip().replace(" ",""))
58c58
< yield SeqRecord(Seq(" ".join(lines), alphabet),
---
> yield SeqRecord(Seq("".join(lines), alphabet),
63a64,199
As you can see a parser like this might be useful for other FASTA-
like formats as well and is in no way specific to the GS quality
files (it's just a space preserving parser). If it were to be
implemented in Biopython you might call it something else.
>
>> I would be happy to submit the changes to the group or an individual
>> for inspection, but I would like to avoid having to maintain my own
>> separate version of Biopython if possible.
>
> As has already been said - please file some (enhancement) bugs and
> attach your patches, or raise specific issues for discussion on this
> mailing list.
>
> Depending on the nature of your changes, you might be able to achieve
> some of them by subclassing Biopython's objects - rather than
> literally
> maintaining your own branch of the project.
>
>> I am also wondering how it would be received if I did something like
>> add a to_fasta method to SeqRecord instead of having to go
>> through writing it to a file using a SeqIO when all I want is the
>> string.
>
> Out of interest, why do you want to create a FASTA record as a string?
I am serving the fasta from a database of sequences dynamically via a
web server.
>
> Did you know you can write to a string using any Bio.SeqIO supported
> file format using StringIO? Perhaps we should spell this out more
> explicitly in the documentation, but a motivating example would help.
This is what I do now, but it seems like a hack to me to go this
route. To always have to write to a file feels strange, but I see
that it would be messy to go OO since there are so many formats.
However, giving preference to fasta over other formats by making it
innate doesn't seem like such a terrible idea. I do have mixed
feelings about 'bloating' the code which is why I asked, and you have
convinced me that this is not quite appropriate given existing
convention. However the idea would be to put the to_fasta or
to_format method inside the SeqRecord, then to call it from the IO
when needed to actually write to a file, but call it directly when
all that is wanted is a string...
>
> I would suggest rather than adding a to_fasta method to the
> SeqRecord, simply write your own "seqrecord_to_string" function (or
> create a subclass of SeqRecord with this method).
>
I'll leave it alone for now until I can come up with a real proposal =)
>> Finally, are there plans to move to a subversion repository at any
>> point?
>
> It was raised a while ago, and our cunning plan was to let BioPerl try
> the move first. Once that has been proven, it should be fairly
> easy for
> the OBF guys to also move us over. I should email them to see how
> things stand...
BioPerl seems to be the guinea pigs for everything. Leading the way
on this might put a stop to those nasty rumors about Biopython.
Best Regards,
Jared
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:47:48 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 17:47:48 +0100
Subject: [Biopython-dev] CVS to SVN
In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk>
<7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
Message-ID: <4714EB34.8000207@maubp.freeserve.co.uk>
Jared wrote:
> Leading the way on this ... [CVS to SVN]
I would say one reason why we aren't charging ahead with a move from CVS
to subversion is that only a few posters on this mailing list actively WANT
to move to subversion, and no-one has really championed the move (yet).
I'm sure if we as a group wanted to do this, then the OBF would be happy to
assist. After all, moving us rather than BioPerl as the first CVS/SVN
migration should be easier as we have a smaller code base.
Peter
From jflatow at northwestern.edu Tue Oct 16 14:46:53 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Tue, 16 Oct 2007 13:46:53 -0500
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <4714EBC7.1040504@maubp.freeserve.co.uk>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk>
<7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
<4714EBC7.1040504@maubp.freeserve.co.uk>
Message-ID: <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu>
Hi Peter,
>>>> I have also needed to create a modified FASTA parser so that I
>>>> can read things like quality score files.
>>>
>>> Could you be a little more specific - what exactly do you mean by a
>>> quality score files (links and/or examples). It may be that this
>>> warrants setting up a new file format in Bio.SeqIO
>> That is what I did. The quality score files I meant are simply
>> FASTA- like records that indicate the quality of each base pair
>> read from a sequencing machine, on a scale of something like 1 to
>> 64. The values are tab separated and correspond to 'reads' in
>> another FASTA file that contain the actual sequences read. This
>> is the way the 454 GSFlex machines output their sequencing reads,
>> so for every set of reads there will be a pair of 454Reads.fna,
>> 454Reads.qual files. The only difference between a parser that
>> processes these qual files and one that processes the sequence
>> files is that it shouldn't get rid of spaces, and the newlines
>> should not to be stripped but converted into spaces (when 454
>> writes a newline of scores they omit the space). Essentially I
>> have made a duplicate of FastaIOs iterator, named it something
>> else, made these two small changes and put an entry for it in the
>> SeqIO file.
>
> Patches and emails don't do well together. Could you file an
> enhancement bug, and then upload your code as an attachment? If
> you have a few examples of matched pairs of FASTA files and quality
> files which you can contribute that would be very helpful too.
>
Yes I'll get on that.
> It looks like you are trying to construct a "sequence" of numerical
> values (rather than a sequence of letters like nucleotides/amino
> acids). As written I don't think it would work for element access/
> splicing etc. However, with some extra work I suppose we could
> stretch the Seq object in this way - and define a new
> "IntegerAlphabet".
>
> But on balance, I don't think "lists of quality values" should be
> treated in the same way as sequences (and thus it doesn't seem to
> belong in Bio.SeqIO).
>
I agree.
> Alternatively you could regard the quality scores as sequence meta-
> data or annotation. One idea would be to generate SeqRecord
> objects containing dummy sequences of the correct length made up of
> the ambiguous character "N", with the associated quality scores
> held as a list of integers in the SeqRecord's annotation
> dictionary. Then it would fit into the Bio.SeqIO framework [I was
> thinking of something similar for PTT files, NCBI Protein tables,
> where again we have annotation but not the actual sequence].
I agree, and this way is most flexible.
>
> Maybe there should just be a separate parser for GSFlex quality
> records which returns iterator giving each record name with a list
> of integers. A more elegant scheme would read in the pair of files
> together (the FASTA file and the quality file) and generate nicely
> annotated SeqRecords with the sequence and the quality. This isn't
> really possible with the Bio.SeqIO framework.
>
Yes, at first I liked this idea best, but it puts some constraints on
the way these things are read in. Like if it is to be an iterator,
you must have a guarantee that these files contain exactly the same
sequences in exactly the same order. This seems like it could
potentially be fine for the GSFlex files, but I wonder if there might
somewhere down the line be use for quality information about
sequences in other cases. If I am not mistaken, some sources use
upper/lower case letters now to indicate a bistable degree of
confidence in a sequence letter. In any event, this seems like an
unnecessary restriction.
The way I do it now is I load the reads into a database, then update
the database when I read in a quality score file. I think Biopython
should have a simple way of implementing something similar which can
solve both our metadata problems.
In Bio.Fasta there are Parsers which really belong in
Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more
general Fasta reader, nothing to do with sequences. It can iterate
over a FASTA file using the '>' as the record separator, creating
Record objects, much like it does now, except without processing them
at all or assuming they are sequences.
>Record.header
Record.data
Now Bio.SeqIO.FastaIO can use Bio.Fasta to iterate over the Record
objects in a file and transform them into SeqRecord object. If you
like, you can provide it with a function header_todict, which takes a
string (in this case Record.header) and returns a dictionary, which
gets unpacked and passed to the SeqRecord initializer. Basically the
Bio.SeqIO.FastaIO returns a generator that looks something like this:
    (SeqRecord(seq=cleanup(record.data), **header_todict(record.header))
     for record in Bio.Fasta.parse(file))
I can also use the Bio.Fasta.parse function now to parse my quality
files and add them as metadata:
    # I create an initial SeqRecord dictionary using the
    # Bio.SeqIO.FastaIO parser
    seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file,
                                                 my_header_todict))
    # Then I iterate over the sequences in the qual file and look them up
    # in the seq_dict using the same header parsing function
    # I passed to create my initial SeqRecords, setting the quality
    # scores as I find them
    for record in Bio.Fasta.parse(qual_file):
        seq_dict[my_header_todict(record.header)['id']].quality = \
            my_qualitycleanup(record.data)
I hope that makes sense. The advantage to doing it this way is that I
can reuse my header parsing function for both the sequence and the
metadata, and I can do whatever I want with the fasta record data
without writing a whole new parser. The SeqIO fasta parsing functions
just make some default assumptions (like the data is a sequence).
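To make the proposal concrete, here is a rough standard-library sketch of the two layers: a generic reader that only splits a file on '>' and a caller that decides what the data means. None of these names exist in Biopython; parse_records, header_to_dict and load_reads are hypothetical, and plain dictionaries stand in for SeqRecord objects.

```python
import io
from collections import namedtuple

# Generic record: the title line (without '>') and the raw data lines.
Record = namedtuple("Record", ["header", "data"])


def parse_records(handle):
    """Yield a Record for each '>' entry without interpreting the data."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield Record(header, lines)
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield Record(header, lines)


def header_to_dict(header):
    """Hypothetical header parser: first word is the id, the rest a description."""
    parts = header.split(None, 1)
    return {
        "id": parts[0] if parts else "",
        "description": parts[1] if len(parts) > 1 else "",
    }


def load_reads(seq_handle, qual_handle):
    """Merge sequences and quality scores by id; file order may differ."""
    reads = {}
    for record in parse_records(seq_handle):
        info = header_to_dict(record.header)
        # Sequence data: join lines and drop internal spaces.
        info["seq"] = "".join(record.data).replace(" ", "")
        reads[info["id"]] = info
    for record in parse_records(qual_handle):
        read_id = header_to_dict(record.header)["id"]
        # Quality data: treat each line break as a separator, then split.
        scores = [int(value) for value in " ".join(record.data).split()]
        reads[read_id]["quality"] = scores
    return reads


seqs = io.StringIO(">r1 first\nCAGC\nTAGCT\n>r2\nACGT\n")
quals = io.StringIO(">r2\n30 31 32 33\n>r1 first\n40 40 39 38\n20 21 22 23 24\n")
reads = load_reads(seqs, quals)
print(reads["r1"]["seq"], reads["r1"]["quality"])
```

As written, a quality record whose id is missing from the sequence file raises a KeyError; whether that should be an error or be skipped is a design choice left open here.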
Let me know what you think.
Jared
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:50:15 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 17:50:15 +0100
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk>
<7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
Message-ID: <4714EBC7.1040504@maubp.freeserve.co.uk>
Hi Jared,
>>> I have also needed to create a modified FASTA parser so that I can
>>> read things like quality score files.
>>
>> Could you be a little more specific - what exactly do you mean by a
>> quality score files (links and/or examples). It may be that this
>> warrants setting up a new file format in Bio.SeqIO
>
> That is what I did. The quality score files I meant are simply FASTA-
> like records that indicate the quality of each base pair read from a
> sequencing machine, on a scale of something like 1 to 64. The values
> are tab separated and correspond to 'reads' in another FASTA file
> that contain the actual sequences read. This is the way the 454
> GSFlex machines output their sequencing reads, so for every set of
> reads there will be a pair of 454Reads.fna, 454Reads.qual files. The
> only difference between a parser that processes these qual files and
> one that processes the sequence files is that it shouldn't get rid of
> spaces, and the newlines should not to be stripped but converted into
> spaces (when 454 writes a newline of scores they omit the space).
> Essentially I have made a duplicate of FastaIOs iterator, named it
> something else, made these two small changes and put an entry for it
> in the SeqIO file.
Patches and emails don't do well together. Could you file an
enhancement bug, and then upload your code as an attachment? If you
have a few examples of matched pairs of FASTA files and quality files
which you can contribute that would be very helpful too.
It looks like you are trying to construct a "sequence" of numerical
values (rather than a sequence of letters like nucleotides/amino acids).
As written I don't think it would work for element access/slicing
etc. However, with some extra work I suppose we could stretch the Seq
object in this way - and define a new "IntegerAlphabet".
But on balance, I don't think "lists of quality values" should be
treated in the same way as sequences (and thus it doesn't seem to belong
in Bio.SeqIO).
Alternatively you could regard the quality scores as sequence meta-data
or annotation. One idea would be to generate SeqRecord objects
containing dummy sequences of the correct length made up of the
ambiguous character "N", with the associated quality scores held as a
list of integers in the SeqRecord's annotation dictionary. Then it
would fit into the Bio.SeqIO framework [I was thinking of something
similar for PTT files, NCBI Protein tables, where again we have
annotation but not the actual sequence].
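A minimal sketch of the dummy-sequence idea, assuming a plain stand-in class
rather than the real SeqRecord (SimpleRecord and quality_only_record are
hypothetical names used only for illustration, and the "quality" key is an
assumption):

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for SeqRecord, so the sketch needs no Biopython."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


def quality_only_record(name, scores):
    """Build a record whose sequence is unknown but whose qualities are known."""
    scores = list(scores)
    # The real bases are not available here, so use one ambiguous "N" per score.
    return SimpleRecord(id=name, seq="N" * len(scores),
                        annotations={"quality": scores})


rec = quality_only_record("ERSGEES02IKV6B", [27, 28, 21, 27])
```

The dummy sequence keeps the record length consistent with the number of
scores, which is what lets such a record pass through code that expects a
sequence.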
Maybe there should just be a separate parser for GSFlex quality records
which returns an iterator giving each record name with a list of
integers. A more elegant scheme would read in the pair of files together
(the FASTA file and the quality file) and generate nicely annotated
SeqRecords with the sequence and the quality. This isn't really
possible with the Bio.SeqIO framework.
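Outside of Bio.SeqIO, reading the two files in step could be sketched roughly
as follows. This assumes both files list the same reads in the same order;
read_blocks and paired_reads are hypothetical helper names, not existing
Biopython functions:

```python
import io


def read_blocks(handle):
    """Yield (name, text) per '>' record; name is the first word of the header."""
    name, chunks = None, []
    for line in handle:
        if line.startswith(">"):
            if name is not None:
                yield name, " ".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, " ".join(chunks)


def paired_reads(fasta_handle, qual_handle):
    """Yield (name, sequence, qualities) by walking both files together."""
    pairs = zip(read_blocks(fasta_handle), read_blocks(qual_handle))
    for (seq_name, seq_text), (qual_name, qual_text) in pairs:
        if seq_name != qual_name:
            raise ValueError("Mismatched records: %s vs %s" % (seq_name, qual_name))
        sequence = seq_text.replace(" ", "")          # sequence lines are joined
        qualities = [int(x) for x in qual_text.split()]  # scores stay separate
        if len(sequence) != len(qualities):
            raise ValueError("Length mismatch for %s" % seq_name)
        yield seq_name, sequence, qualities


fna = io.StringIO(">r1 length=6\nACG\nTTA\n>r2 length=2\nGG\n")
qual = io.StringIO(">r1 length=6\n27 28 21\n30 12 32\n>r2 length=2\n9 26\n")
pairs = list(paired_reads(fna, qual))
```

Because both inputs are consumed lazily, only one record from each file needs
to be held in memory at a time.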
Peter
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 15:33:54 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 20:33:54 +0100
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk>
<48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu>
Message-ID: <47151222.1060502@maubp.freeserve.co.uk>
> In Bio.Fasta there are Parsers which really belong in
> Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more
> general Fasta reader, nothing to do with sequences. ...
In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was
thinking in a few releases time of suggesting its deprecation (but not
just yet as for several years it was the best documented and most used
parser in Biopython).
If we do decide to keep Bio.Fasta (or extend it), then perhaps
Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta.
I'm still digressing your ideas to turn Bio.Fasta into a generic parser
that copes with sequences, qualities scores, or anything else.
Peter
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 15:57:35 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 15:57:35 -0400
Subject: [Biopython-dev] [Bug 2382] New: Generic FASTA parser
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
Summary: Generic FASTA parser
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: jflatow at northwestern.edu
I would like to be able to read in and iterate over records in generic fasta files
of the format:
>header
data
>header
data
...
This iterator should return Bio.Fasta.Record objects with the corresponding
header and data fields.
I suggest putting this inside the existing Bio.Fasta module and updating
Bio.SeqIO.Fasta to use this iterator and transform the records returned into
Bio.SeqRecord objects.
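A rough sketch of such an iterator, using only the standard library. The
Record namedtuple and the parse_fasta_like name are assumptions standing in
for Bio.Fasta.Record and the proposed parser; this is not the existing
Bio.Fasta API:

```python
from collections import namedtuple
import io

# Hypothetical stand-in for Bio.Fasta.Record: a header plus the raw data block.
Record = namedtuple("Record", ["header", "data"])


def parse_fasta_like(handle):
    """Yield Record(header, data) for each '>' block in a FASTA-like file.

    The header loses its leading '>' and trailing newline. The data block is
    returned untouched (newlines included) so the caller decides what it means.
    """
    header = None
    lines = []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield Record(header, "".join(lines))
            header = line[1:].rstrip("\r\n")
            lines = []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield Record(header, "".join(lines))


example = io.StringIO(">read1 length=4\nACGT\n>read2 length=6\nAAA\nCCC\n")
records = list(parse_fasta_like(example))
```

Leaving the data block unprocessed is the design choice that makes the
iterator reusable: a sequence wrapper can strip whitespace, while a quality
wrapper can split on it.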
This should make it easier to add metadata to SeqRecord objects parsed in from
FASTA. Consider the following example for illustration. I have data from a
genome sequencing machine that outputs pairs of files. One contains the
sequence reads, the other contains estimates of the quality of each base
call in those reads.
The sequence file might look something like this (only with hundreds of
thousands more entries):
>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run=R_runname
CAATATAATTTCTCTTAAAATTATTCCCATGGCCAGGTGTGGTGGCTCACACCTGTAGTC
CCGGCACTTTGGGAGGCCAAGGCACACAGGGGATAGG
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
GGTCTCCAGTGCCCTGTCTCCCCATATTTCTGACACACCTTCTCACAGCCTGGCCCATCT
TGCTGGGTCCCTCTTCTCCTCCCTTCCTGCTCCATTTGTCAACACTGCTGGGACATTAGA
ATTCAGATCTCCCGGGTCACCG
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
AAAGTGACTAAAGAATCAATTTACATTAATATTCTATGTGAACAGGCAAAATACTTACAA
AGAAGTAGAGAAAATATGAATTCAGTACAGAATTCAGATCTCCCGGGTCACCG
The corresponding quality score file might look something like this:
>ERSGEES02IKV6B length=97 xy=3401_1361 region=2 run= R_runname
27 28 21 27 27 27 28 22 28 25 3 27 27 27 28 21 33 31 20 6 28 21 26 26 18 28 25
2 26 25 29 23 31 24 27 29 22 27 27 27 29 23 27 31 25 27 27 27 27 27 27 32 26 27
27 27 27 26 27 33
30 12 32 26 27 27 27 33 30 12 33 30 12 26 31 25 33 27 32 28 33 28 27 27 27 27
27 26 33 32 20 7 27 27 27 32 26
>ERSGEES02GGZDB length=142 xy=2536_2685 region=2 run= R_runname
28 9 26 24 27 27 20 26 18 25 27 32 29 10 26 26 27 18 25 32 30 17 1 25 27 22 32
30 12 27 27 22 26 25 27 23 25 28 21 32 27 27 27 25 26 27 26 25 27 20 26 26 19
28 25 3 25 27 22 27
19 24 24 24 32 29 11 24 34 31 17 23 23 30 23 27 25 30 23 27 33 31 17 27 20 28
21 27 25 26 26 30 24 27 33 31 13 26 27 27 31 25 27 25 23 26 16 26 27 30 27 7 27
27 27 32 27 26 26 32
27 30 26 27 27 27 27 27 27 27 30 27 6 34 31 17 27 21 27 32 28 18
>ERSGEES02JQUCP length=113 xy=3879_0663 region=2 run= R_runname
29 26 5 25 27 24 27 27 27 30 27 7 26 27 19 25 26 31 26 34 32 16 20 27 26 32 27
32 28 27 25 26 18 27 25 27 26 26 24 27 31 25 27 27 31 26 26 34 32 23 11 26 22
27 32 26 27 26 32 30
11 26 31 24 27 27 25 23 27 27 33 30 19 4 17 26 25 26 31 27 30 26 27 26 22 26 18
24 27 26 32 26 32 28 27 27 25 27 25 24 25 31 28 10 34 31 15 27 21 27 28 21 27
I would like to be able to do the following:
# create a function to parse the header line and return a dictionary
def parse_gsflex_header(gs_header):
    parts = gs_record.description.split(' ')
    assert len(parts) == 5
    xy = parts[2].split('=')[1].split('_')
    return {'letters': gs_record.seq.tostring(),
            'name': parts[0],
            'length': parts[1].split('=')[1],
            'xpos': xy[0],
            'ypos': xy[1],
            'region': parts[3].split('=')[1],
            'run': parts[4].split('=')[1]}
# Bio.SeqIO.FastaIO wraps the Bio.Fasta parser, might look something like this
class Fasta(): # or however it's organized
    def data_toseq(data):
        # do some parsing of the data
        return Seq(...)
    def parse(file, header_todict):
        return (SeqRecord(seq=data_toseq(record.data),
                          **header_todict(record.header))
                for record in Bio.Fasta.parse(file))
# I create an initial SeqRecord dictionary using the Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file, parse_gsflex_header))
# Then I iterate over the sequences in the qual file and look them up in the
# seq_dict, setting the quality scores as I find them
for record in Bio.Fasta.parse(qual_file):
    seq_dict[my_header_todict(record.header)['id']].quality = \
        my_qualitycleanup(record.data)
This would work well for parsing all kinds of FASTA-like files and provides a
simple mechanism for dealing with them record by record.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 16:03:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 16:03:33 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162003.l9GK3XmF007588@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #1 from jflatow at northwestern.edu 2007-10-16 16:03 EST -------
My mistake, the parse_gsflex_header function should look something like this:
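import re

def parse_gsflex_header(header):
    # split off the first word as the id; keep the full header as description
    parts = re.split(r'[,|]?\s+', header, maxsplit=1)
    assert len(parts) == 2
    return {'id': parts[0],
            'description': header}

def my_qualitycleanup(data):
    # split on any whitespace, so line breaks also separate scores
    return [int(x) for x in data.split()]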
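When cleaning up the quality data, line breaks have to count as separators:
the scores wrap across lines without a trailing space, so simply deleting the
newlines would glue the last score of one line onto the first score of the
next. A small illustration with made-up values (quality_values is a
hypothetical name):

```python
def quality_values(data):
    """Turn the raw data block of a quality record into a list of ints."""
    # str.split() with no arguments splits on any run of whitespace, so a
    # newline separates two scores exactly as a space does.
    return [int(x) for x in data.split()]


raw = "27 28 21\n2 26 25\n"

# Deleting the newlines first merges "21" and "2" into a single bogus "212".
naive = raw.replace("\n", "").split(" ")

scores = quality_values(raw)
```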
From jflatow at northwestern.edu Tue Oct 16 16:11:04 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Tue, 16 Oct 2007 15:11:04 -0500
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <47151222.1060502@maubp.freeserve.co.uk>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk>
<48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu>
<47151222.1060502@maubp.freeserve.co.uk>
Message-ID: <156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu>
On Oct 16, 2007, at 2:33 PM, Peter wrote:
> > In Bio.Fasta there are Parsers which really belong in
> > Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more
> > general Fasta reader, nothing to do with sequences. ...
>
> In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was
> thinking in a few releases time of suggesting its deprecation (but
> not just yet as for several years it was the best documented and
> most used parser in Biopython).
>
I see, it looks like it's meant to be deprecated, I was just saying
it's actually doing SeqIO functionality.
> If we do decide to keep Bio.Fasta (or extend it), then perhaps
> Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta
>
> I'm still digressing your ideas to turn Bio.Fasta into a generic
> parser that copes with sequences, qualities scores, or anything else.
I'm not quite sure of your meaning of digressing, if you mean thinking
it over, then great =) Otherwise I hope you'll seriously consider it
anyway. Either way, I think I posted a more coherent message on
bugzilla with some example data and motivation.
jared
From jflatow at northwestern.edu Tue Oct 16 16:14:16 2007
From: jflatow at northwestern.edu (Jared Flatow)
Date: Tue, 16 Oct 2007 15:14:16 -0500
Subject: [Biopython-dev] CVS to SVN
In-Reply-To: <4714EB34.8000207@maubp.freeserve.co.uk>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk>
<7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
<4714EB34.8000207@maubp.freeserve.co.uk>
Message-ID: <6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu>
> I would say one reason why we aren't charging ahead with a move
> from CVS to subversion is only a few posters on this mailing list
> actively WANT to move to subversion, and no-one has really
> championed the move (yet).
Does that mean most developers don't WANT to move, or just that they
don't ACTIVELY want to move?
jared
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 16:42:18 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 21:42:18 +0100
Subject: [Biopython-dev] 454 GSFlex quality score files
In-Reply-To: <156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EBC7.1040504@maubp.freeserve.co.uk> <48D92CF4-04B5-42F9-92D2-3A2D9D2FE7E2@northwestern.edu> <47151222.1060502@maubp.freeserve.co.uk>
<156C46BF-1798-43D5-BA10-2A94FC63A3AB@northwestern.edu>
Message-ID: <4715222A.2070909@maubp.freeserve.co.uk>
Jared Flatow wrote:
> On Oct 16, 2007, at 2:33 PM, Peter wrote:
>
>>> In Bio.Fasta there are Parsers which really belong in
>>> Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more
>>> general Fasta reader, nothing to do with sequences. ...
>> In actual fact, the Bio.Fasta module predates Bio.SeqIO, and I was
>> thinking in a few releases time of suggesting its deprecation (but
>> not just yet as for several years it was the best documented and
>> most used parser in Biopython).
>
> I see, it looks like it's meant to be deprecated, I was just saying
> it's actually doing SeqIO functionality.
Well, for now I'm just suggesting deprecating Bio.Fasta at some point in
the future; we should still canvass opinion on the main mailing list
before taking that action.
>> If we do decide to keep Bio.Fasta (or extend it), then perhaps
>> Bio.SeqIO.FastaIO should become just a wrapper for Bio.Fasta
>>
>> I'm still digressing your ideas to turn Bio.Fasta into a generic
>> parser that copes with sequences, qualities scores, or anything else.
That was a typo, but you managed to guess my meaning. I meant to say:
I'm still digesting [i.e. thinking about] your ideas to turn Bio.Fasta
into a generic parser that copes with sequences, qualities scores, or
anything else.
> I'm not quite sure of your meaning of digressing, if you mean thinking
> it over, then great =) Otherwise I hope you'll seriously consider it
> anyway. Either way, I think I posted a more coherent message on
> bugzilla with some example data and motivation.
I'll take a look, Bug 2382 - Generic FASTA parser
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
Peter
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 17:01:29 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 22:01:29 +0100
Subject: [Biopython-dev] CVS to SVN
In-Reply-To: <6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB34.8000207@maubp.freeserve.co.uk>
<6DFB6FBB-CC55-41D1-8D35-4906E6B502CF@northwestern.edu>
Message-ID: <471526A9.1010709@maubp.freeserve.co.uk>
Jared Flatow wrote:
>> I would say one reason why we aren't charging ahead with a move
>> from CVS to subversion is only a few posters on this mailing list
>> actively WANT to move to subversion, and no-one has really
>> championed the move (yet).
>
> Does that mean most developers don't WANT to move, or just that they
> don't ACTIVELY want to move?
Going back over the archives, Chris Lasher was most vocal in supporting
the move, and there were a few other positive voices.
Speaking for myself, I have no strong desire either way, and I don't
think Michiel objected either (except over the timing). Then as now, we
are hoping to get the next release out "shortly", so after that would be
a good time to make the switch.
[I'm assuming we won't lose any revision history or comments, and that
things like the web based ViewCVS and its RSS feed will still be available]
Peter
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:02:03 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:02:03 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162102.l9GL23rr010250@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 17:02 EST -------
Are there any other "FASTA like" formats you know of, in addition to
traditional sequence data and the 454 GSFlex quality score files?
We could do this using the old Scanner/Consumer model (see the pre-Martel
parser, CVS revision 1.8 of Bio/Fasta/__init__.py, for example).
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Fasta/__init__.py?rev=1.8&cvsroot=biopython&content-type=text/vnd.viewcvs-markup
The scanner would be the same for all formats, and would pass the data with
whitespace (spaces, new lines etc) as is. We could then have one consumer for
each supported FASTA variant:
_Scanner           Scans a FASTA-format stream.
_RecordConsumer    Consumes FASTA data to a Record object.
_SequenceConsumer  Consumes FASTA data to a Sequence object.
_QualityConsumer   (new) could build a list of integers for each record?
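The scanner/consumer split could be sketched along these lines. This is an
illustration of the pattern only: the class and method names below
(Scanner.feed, start_record, data_line, end_record) are hypothetical and are
not the names used in the old Bio.Fasta code:

```python
import io


class Scanner:
    """Scans a FASTA-like stream and reports events to a consumer."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record(line[1:].rstrip("\r\n"))
                in_record = True
            elif in_record:
                consumer.data_line(line)  # passed on with whitespace intact
        if in_record:
            consumer.end_record()


class BaseConsumer:
    """Collects the title and raw data lines of the current record."""

    def __init__(self):
        self.records = []

    def start_record(self, title):
        self._title, self._lines = title, []

    def data_line(self, line):
        self._lines.append(line)


class SequenceConsumer(BaseConsumer):
    def end_record(self):
        # Sequence data: all whitespace is dropped.
        letters = "".join("".join(self._lines).split())
        self.records.append((self._title, letters))


class QualityConsumer(BaseConsumer):
    def end_record(self):
        # Quality data: whitespace separates integer scores.
        scores = [int(x) for x in "".join(self._lines).split()]
        self.records.append((self._title, scores))


seqs = SequenceConsumer()
Scanner().feed(io.StringIO(">r1\nAC GT\nTT\n"), seqs)
quals = QualityConsumer()
Scanner().feed(io.StringIO(">r1\n27 28\n3 9\n"), quals)
```

The scanner never interprets the data lines, so supporting another FASTA
variant only means writing another consumer.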
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:26:29 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:26:29 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162126.l9GLQT8O011239@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #3 from jflatow at northwestern.edu 2007-10-16 17:26 EST -------
On second thought, let me just rewrite all the code:
# The Bio.Fasta parser
class Fasta(): # or whatever
    @staticmethod
    def parse(file):
        # return an iterator over the file as Bio.Fasta.Records
        # for the records, trim newline from header, don't do anything to data
# The Bio.SeqIO.FastaIO wrapper for Bio.Fasta
class FastaIO(): # or however it's organized
    @staticmethod
    def header_todict(header):
        parts = re.split(r'[,|]?\s+', header, maxsplit=1)
        assert len(parts) == 2
        return {'id': parts[0],
                'description': header}
    @staticmethod
    def data_toseq(data, alphabet):
        return Seq(re.sub(r'\s+', '', data), alphabet)
    @staticmethod
    def parse(file, header_todict=FastaIO.header_todict,
              alphabet=single_letter_alphabet):
        return (SeqRecord(seq=data_toseq(record.data, alphabet),
                          **header_todict(record.header))
                for record in Bio.Fasta.parse(file))
# Now to use these in my example I can do
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file))
for record in Bio.Fasta.parse(qual_file):
    id = Bio.SeqIO.FastaIO.header_todict(record.header)['id']
    seq_dict[id].quality = [int(x) for x in record.data.split()]
# Suppose instead I have an alignment file, which looks like this:
>contigname
A A 10 64
T T 9 64
C C 9 64
...
# and so on, where the first column is a reference sequence, the second column
# is a consensus sequence, the third column is the number of reads aligned, and
# the fourth column is the combined quality score
# Now it's just as easy for me to parse this into an object
class ContigAlign():
    def __init__(self, name, ref, consensus, numreads, qscore):
        self.name = name
        self.ref = ref
        self.consensus = consensus
        self.numreads = numreads
        self.qscore = qscore
# I'll make a dictionary of my contigaligns
d = {}
for record in Bio.Fasta.parse(file):
    # transpose the rows of the data block into four columns
    rows = [line.split() for line in record.data.splitlines() if line.strip()]
    (ref, consensus, numreads, qscore) = zip(*rows)
    d[record.header] = ContigAlign(record.header, ref, consensus, numreads,
                                   qscore)
# maybe I would turn ref and consensus into Seqs, but you get the point
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:38:45 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:38:45 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162138.l9GLcj29011655@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 17:38 EST -------
In comment 3, did you just make up this file format as an example?
>contigname
A A 10 64
T T 9 64
C C 9 64
...
with four columns: reference sequence, consensus, number of reads aligned, and
combined quality score.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 17:58:38 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 17:58:38 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162158.l9GLwc68012343@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #5 from jflatow at northwestern.edu 2007-10-16 17:58 EST -------
Nope, they actually have a file format that looks like this:
Position Consensus Quality Score Depth Signal StdDeviation
>contig00001 1
1 G 64 2 1.00 0.00
2 A 64 2 1.00 0.00
3 G 64 2 1.00 0.00
4 A 64 2 1.00 0.00
5 G 64 2 2.00 0.00
6 G 64 2 2.00 0.00
7 A 64 2 3.00 0.00
8 A 64 2 3.00 0.00
9 A 64 2 3.00 0.00
10 C 64 2 2.00 0.00
11 C 64 2 2.00 0.00
12 T 64 2 1.00 0.00
13 C 64 2 3.00 0.00
14 C 64 2 3.00 0.00
15 C 64 2 3.00 0.00
16 G 64 2 1.00 0.00
17 T 64 2 1.00 0.00
18 G 64 2 1.00 0.00
19 A 64 2 1.00 0.00
20 T 64 2 1.00 0.00
21 C 64 2 2.00 0.00
22 C 64 2 2.00 0.00
Note the file-wide header at the top of the page (a generic FASTA-like parser
might skip to the first '>'), or we could get rid of that beforehand but it
would be nice if it were smart.
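One way a parser might cope with the file-wide header is to split off
everything before the first '>' and hand the rest on unchanged. A minimal
sketch, assuming the preamble is worth keeping (for example as column names);
split_preamble is a hypothetical helper name:

```python
import io
import itertools


def split_preamble(handle):
    """Return (preamble_lines, remaining_lines), split at the first '>' line."""
    lines = iter(handle)
    preamble = []
    for line in lines:
        if line.startswith(">"):
            # Put the first record header back in front of the remaining lines.
            return preamble, itertools.chain([line], lines)
        preamble.append(line.rstrip("\r\n"))
    # No '>' found at all: everything was preamble.
    return preamble, iter(())


text = (
    "Position Consensus Quality Score Depth Signal StdDeviation\n"
    ">contig00001 1\n"
    "1 G 64 2 1.00 0.00\n"
    "2 A 64 2 1.00 0.00\n"
)
preamble, rest = split_preamble(io.StringIO(text))
rest = list(rest)
```

The remaining lines can then be fed to whatever record iterator is in use,
which no longer needs to know that a preamble ever existed.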
Also, here is another sample FASTA-like file format they use for pair
alignments:
>ERSGEES01EM5WC, 2..30 of 95 and ERSGEES01C1ZV2, 1..29 of 268 (29/29 ident)
2 CGGTGACCCGGGAGATCTGAATTCCTGGT 30
1 CGGTGACCCGGGAGATCTGAATTCCTGGT 29
>ERSGEES01EM5WC, 2..29 of 95 and ERSGEES01DMS5T, 1..28 of 259 (28/28 ident)
2 CGGTGACCCGGGAGATCTGAATTCCTGG 29
1 CGGTGACCCGGGAGATCTGAATTCCTGG 28
>ERSGEES01EM5WC, 29..2 of 95 and ERSGEES01D8GDV, 205..232 of 232 (28/28 ident)
29 CCAGGAATTCAGATCTCCCGGGTCACCG 2
205 CCAGGAATTCAGATCTCCCGGGTCACCG 232
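For the data lines of this pair alignment format, a per-line parse might look
like the sketch below. The layout (start position, aligned bases, end
position) is inferred from the sample above rather than from any format
specification, and parse_alignment_line is a hypothetical name:

```python
import re

# Assumed line layout: start position, aligned bases (or gaps), end position.
ALIGN_LINE = re.compile(r"^\s*(\d+)\s+([A-Za-z-]+)\s+(\d+)\s*$")


def parse_alignment_line(line):
    """Return (start, bases, end) for one line of a pair alignment record."""
    match = ALIGN_LINE.match(line)
    if match is None:
        raise ValueError("Unrecognised alignment line: %r" % line)
    start, bases, end = match.groups()
    return int(start), bases, int(end)


first = parse_alignment_line("29 CCAGGAATTCAGATCTCCCGGGTCACCG 2")
second = parse_alignment_line("205 CCAGGAATTCAGATCTCCCGGGTCACCG 232")
```

Note that the start can be larger than the end, as in the first line here,
which presumably marks a read aligned in the reverse direction.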
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 18:09:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 18:09:06 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162209.l9GM96N5012764@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #6 from jflatow at northwestern.edu 2007-10-16 18:09 EST -------
The reference/consensus one was inspired by yet another format they have: there
are 2 tools they provide, one for mapping to an existing sequence, the other
for ab initio contig building. The mapping one has the extra reference column.
As you can see it might be hard to keep up with all these similar formats as
part of Biopython (these are only from one source). Certainly the common ones
should have wrappers but we should also be able to easily get the stream of
records.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 18:13:48 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 18:13:48 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162213.l9GMDmBM012914@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2007-10-16 18:13 EST -------
Could you attach a few of these real files? Including where they came from,
i.e. the company whose software writes such output, and what they call each file
format variant.
If you can get a matched set (i.e. all associated with the same few sequences)
then even better.
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 19:09:00 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 19:09:00 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710162309.l9GN90wg015092@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #8 from jflatow at northwestern.edu 2007-10-16 19:08 EST -------
The files are very large, I assure you they are just longer versions of what I
have supplied here though. The company is Roche Diagnostics. The initial
reads/quality files are the output of the 454 GSFlex genome sequencing
machines. They have two pieces of software: gsMapper and gsAssembler which
output the contigs.
Reads/Quality files from the machine are called:
454Reads.{fna,qual}
gs* output:
454{All,Large}Contigs.{fna,qual}
454PairAlign.txt
454AlignmentInfo.tsv
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 20:10:45 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 20:10:45 -0400
Subject: [Biopython-dev] [Bug 2381] translate and transcibe methods for the
Seq object (in Bio.Seq)
In-Reply-To:
Message-ID: <200710170010.l9H0AjYe018147@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2381
------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2007-10-16 20:10 EST -------
> Note there is yet another (!) translation function Bio.SeqUtils.translate()
> which is frame aware [personally I would mark a lot of this module as
> deprecated].
Given the various translate functions we already have in Biopython, why do you
want to add another one? Is there something the "translate" method can do that
the "translate" function cannot? Since the "translate" function can take Seq
objects as well as simple strings, I'd prefer the "translate" function over a
"translate" method.
From biopython-dev at maubp.freeserve.co.uk Tue Oct 16 12:49:18 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Oct 2007 17:49:18 +0100
Subject: [Biopython-dev] SeqRecord to file format as string
In-Reply-To: <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk>
<7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu>
Message-ID: <4714EB8E.3000700@maubp.freeserve.co.uk>
>> Did you know you can write to a string using any Bio.SeqIO supported
>> file format using StringIO? Perhaps we should spell this out more
>> explicitly in the documentation, but a motivating example would help.
>
> This is what I do now, but it seems like a hack to me to go this
> route. To always have to write to a file feels strange, but I see
> that it would be messy to go OO since there are so many formats.
> However, giving preference to fasta over other formats by making it
> innate doesn't seem like such a terrible idea. I do have mixed
> feelings about 'bloating' the code which is why I asked, and you have
> convinced me that this is not quite appropriate given existing
> convention. However the idea would be to put the to_fasta or
> to_format method inside the SeqRecord, then to call it from the IO
> when needed to actually write to a file, but call it directly when
> all that is wanted is a string...
It's debatable, isn't it? I suspect that for most users, when they want a
record in a particular file format it's for writing to a file. However,
adding a to_format() method to a SeqRecord makes some sense (suitable for
sequential file formats only). This would take a format name and return
a string, by calling Bio.SeqIO with a StringIO object internally.
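The mechanism can be sketched without Biopython: any writer that targets a
file handle can be pointed at an in-memory handle instead. Here write_fasta
is a hypothetical stand-in for the Bio.SeqIO writer, records are plain
(name, sequence) tuples, and io.StringIO is the current standard library
spelling of StringIO:

```python
import io


def write_fasta(records, handle):
    """Write (name, sequence) pairs to a file-like handle in FASTA format."""
    for name, sequence in records:
        handle.write(">%s\n" % name)
        # Wrap the sequence at 60 characters per line.
        for i in range(0, len(sequence), 60):
            handle.write(sequence[i:i + 60] + "\n")


def to_format(record, writer=write_fasta):
    """Return one record as a string by writing it to an in-memory handle."""
    handle = io.StringIO()
    writer([record], handle)
    return handle.getvalue()


text = to_format(("read1", "ACGT" * 20))
```

Since the string version just reuses the file writer, each output format
only has to be implemented once.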
Peter
From bugzilla-daemon at portal.open-bio.org Tue Oct 16 22:17:28 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Oct 2007 22:17:28 -0400
Subject: [Biopython-dev] [Bug 2382] Generic FASTA parser
In-Reply-To:
Message-ID: <200710170217.l9H2HSAx024040@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-10-16 22:17 EST -------
If all these special fasta files are coming from Roche Diagnostics, I'd suggest
creating a separate module rather than trying to put this in Bio.SeqIO. Bio.SeqIO is
one of the few modules in Biopython that is used by most users, so I'd like to
keep it clean as much as possible. To avoid confusion for users who just want
to parse regular Fasta files, I think the module should not be called
Bio.Fasta. In addition, I doubt we'd get much code reuse from a generic
Bio.Fasta module beyond what is needed for the Roche files, since the only
thing they have in common is that they use ">" to separate records.
With a separate module to handle the Roche files, my preferred usage would be
something like this:
from Bio import SeqIO, GSFlex # Or whatever you'd like to call it
seqrecords = SeqIO.parse(open("mysequences.fa"), "fasta")
qualities = GSFlex.parse(open("myqualities.qual"),