From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:15:00 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:15:00 -0500
Subject: [Biopython-dev] [Bug 2174] New: FDist Support in BioPython
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
Summary: FDist Support in BioPython
Product: Biopython
Version: 1.24
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: tiagoantao at gmail.com
This is an enhancement bug to submit code related to fdist2
http://www.rubic.rdg.ac.uk/~mab/software.html
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:15:18 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:15:18 -0500
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200701031315.l03DFIGn007058@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
tiagoantao at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:16:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:16:06 -0500
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200701031316.l03DG6qL007102@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
------- Comment #1 from tiagoantao at gmail.com 2007-01-03 08:16 -------
Created an attachment (id=532)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=532&action=view)
Code support fdist
From tiagoantao at gmail.com Wed Jan 3 08:16:30 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 3 Jan 2007 13:16:30 +0000
Subject: [Biopython-dev] FDist: more Population Genetics code
Message-ID: <6d941f120701030516m1adb3daeh6e4645121ba8679d@mail.gmail.com>
Hi!
I have submitted another enhancement bug, with support for FDist. It
can generate and parse FDist files and control the fdist
applications. There are also a couple of utility functions. FDist is a
niche application (mainly used to detect selection in animal
genetics). It is not the most fundamental one to support, but it is
the one I am currently working on, hence the code.
Regarding the code I submitted for GenePop, I have submitted a different
version on bugzilla. The main difference is that I moved everything
from Bio to Bio.PopGen.
Before I continue putting code on bugzilla I would like to know if it
is worthwhile doing it... Any opinions on the code submitted or if any
changes are required? I would really like to continue converting my
code to BioPython, but only if it has any possibility of ending up
being useful/included in distribution somewhere in the future... ;)
I am currently working on code related to SimCoal2, Arlequin and
general statistics (Fst, heterozygosity, ...), which will probably be
ready quite soon (i.e., in the next two weeks). This is more mainstream
than FDist.
I have some other code lying around mainly related to HapMap, but I
will only submit it after reviewing and reusing it again. This is more
distant future ... like a couple of months.
Tiago
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:38:39 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:38:39 -0500
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200701032138.l03Lcdji028402@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2007-01-03 16:38 -------
> Regardless, I do still see a
> number of inconsistencies.
Please submit a separate bug report (including your patch) for these
inconsistencies. The current bug report is titled
"XML Blast parser unusable with multiple queries and recent (2.2.13) blast -
patch attached"
With Peter's patch, we can now parse multiple blast queries, so I'd like to
close this bug report.
For future bug reports and patches:
Try to handle separate bugs in separate bug reports and patches. For
developers, when looking at a patch handling several issues at the same time,
it's difficult to understand which parts of the patch are essential, which are
good but non-essential, and which are code cleanup. Speaking for myself, I
would probably have considered this patch earlier if it had been less
convoluted.
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:48:45 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:48:45 -0500
Subject: [Biopython-dev] [Bug 2176] New: XML Blast parser: miscellaneous bug
fixes and cleanup
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
Summary: XML Blast parser: miscellaneous bug fixes and cleanup
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: jmjoseph at andrew.cmu.edu
This follows the discussion started in bug 2051. The blast XML parser does now
work (Thanks!), but could still use a little work. Here's a list of the issues
I can see now. I'll follow with patches to correct a few.
In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
defined as (None,None) tuples. However, in NCBIXML.py, these
variables are set as integers. I don't see a point of a tuple at all,
at least for NCBIXML. (I realize it is used in NCBIStandalone.py).
Most importantly, the inconsistency makes it difficult to handle cases
when the parameter is not set. It seems easiest, though, to just
retain the tuple format.
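A minimal sketch of how calling code could cope with both representations until the parsers agree. The helper name hsp_count is hypothetical, and the (count, length) layout of the tuple form is an assumption based on the plain-text parser; this is not code from either patch.

```python
def hsp_count(value):
    """Return the match count as an int, or None if it was never set.

    Accepts the plain integer form, the tuple form (assumed here to be
    (count, alignment_length)), and the unset defaults None and
    (None, None).
    """
    if isinstance(value, tuple):
        # Tuple form: the first element carries the count.
        value = value[0]
    return value


print(hsp_count(42))            # integer form
print(hsp_count((42, 50)))      # tuple form
print(hsp_count((None, None)))  # unset default in Record.py
```

With a helper like this, "not set" is always None, whichever parser produced the record.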
In the past, I worried that the order of tuple building for
self._blast.gap_penalties or ka_params could cause the tuple to have
an incorrect ordering. I seem to remember hitting an issue where the
tuple was built with the wrong length, but I can't be specific. In
general, it remains odd to me to not just use a list and set each
element respectively. If necessary, one could convert to a tuple when
finished or use some other approach that does not rely upon order.
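One order-independent approach could look like the sketch below: each value is stored under its own name as it is parsed, and the tuple is only assembled at the end. The function and key names are hypothetical, and the (open, extend) layout is an assumption.

```python
def build_gap_penalties(fields):
    """Build the (open, extend) tuple by name, independent of parse order."""
    return (fields.get("gap_open"), fields.get("gap_extend"))


# Elements may arrive in any order; each value is stored under its own name.
fields = {}
fields["gap_extend"] = 1
fields["gap_open"] = 11

print(build_gap_penalties(fields))  # (11, 1)
```

A value that never appears simply ends up as None in its slot, so the tuple always has the expected length.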
Why not use query_len, as defined in the XML file, or query_length
instead of query_letters as a variable name? In
BlastParser._end_Iteration, self._blast.query_letters is set. This is
not defined/documented in the Parameters class in Record.py. Rather,
query_length is defined there. In the Header class, though, the name
query_letters is used. There also seems to be some confusion between
num_letters_in_database, num_sequences_in_database, database_letters,
and database_sequences. Note that even if this naming is not
corrected, NCBIXML.py:186 is wrong with "self._blast_query_letters"
rather than "self._blast.query_letters".
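To illustrate why the underscore-for-dot slip goes unnoticed: Python happily creates a new attribute on the parser instead of setting the one on the record, so no error is raised. The classes below are simplified stand-ins, not the real NCBIXML or Record code.

```python
class Record:
    """Stand-in for the Blast record object."""

    def __init__(self):
        self.query_letters = None


class Parser:
    """Stand-in for the XML parser that fills in a record."""

    def __init__(self):
        self._blast = Record()

    def end_iteration_buggy(self, value):
        # Underscore instead of dot: this creates a brand-new attribute
        # on the parser and leaves the record untouched.
        self._blast_query_letters = value

    def end_iteration_fixed(self, value):
        # Dot: this sets the attribute on the record, as intended.
        self._blast.query_letters = value


parser = Parser()
parser.end_iteration_buggy(100)
print(parser._blast.query_letters)  # still None

parser.end_iteration_fixed(100)
print(parser._blast.query_letters)  # now 100
```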
Similarly, why store the bit score and E-value as 'bits' and
'_hsp.expect'/'descr.e' rather than just using bit_score and
evalue, as in the blast XML output?
I make use of <Hsp_align-len> in 2.2.13. This value is missing
entirely.
The parsing of <Hit_id> and <Hit_def> is confusing. For example,
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|0</Hit_id>
<Hit_def>3377250</Hit_def>
...
results in _hit.title set to "gnl|BL_ORD_ID|0 3377250". I would
rather they remain separate (or both methods be used).
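A sketch of keeping the two fields separate while still offering the combined title, assuming the elements in question are Hit_id and Hit_def as named in NCBI's BLAST XML output. It uses the standard-library ElementTree rather than the parser in NCBIXML.py, and the function and key names are hypothetical.

```python
import xml.etree.ElementTree as ET

HIT_XML = """
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
</Hit>
"""


def parse_hit(xml_text):
    """Return the hit id and definition separately, plus a joined title."""
    hit = ET.fromstring(xml_text.strip())
    hit_id = hit.findtext("Hit_id")
    hit_def = hit.findtext("Hit_def")
    return {
        "hit_id": hit_id,
        "hit_def": hit_def,
        # Combined form kept for code that already relies on the title.
        "title": "%s %s" % (hit_id, hit_def),
    }


print(parse_hit(HIT_XML))
```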
This is certainly not an exhaustive list. I'm happy to provide
another patch correcting many of these inconsistencies. At the
very least, the variable names defined in Record.py should be
used in NCBIXML.py. May I modify at least the above names to
correspond more closely to the names used in the XML? I know
I've found this particularly confusing.
-Jacob
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:50:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:50:33 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701032150.l03LoXp4028921@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #1 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 -------
Created an attachment (id=533)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=533&action=view)
Patch to NCBIXML.py
These patches to NCBIXML and Record:
* replace query_letters with query_length,
* use tuples for _hsp.identities, positives, and gaps
* store _hsp.align_length
* separate the hit id and hit def elements. title is retained
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:50:53 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:50:53 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701032150.l03Lorvn028958@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #2 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 -------
Created an attachment (id=534)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=534&action=view)
Patch to Record.py
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:53:02 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:53:02 -0500
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200701032153.l03Lr2Th029085@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
jmjoseph at andrew.cmu.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution| |FIXED
------- Comment #14 from jmjoseph at andrew.cmu.edu 2007-01-03 16:53 -------
Michiel, I have started bug 2176. Thank you for your assistance.
-Jacob
From tiagoantao at gmail.com Fri Jan 5 05:35:59 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Fri, 5 Jan 2007 10:35:59 +0000
Subject: [Biopython-dev] biopython-dev
In-Reply-To:
References:
Message-ID: <6d941f120701050235p437e9283sfad21772401baefa@mail.gmail.com>
Hi Ralph,
Thanks for the info, let me see if I can sum up what I have and what I
am planning to do...
I currently work with microsatellite and SNP data (already isolated
ones, not retrieved from sequences that I have). I have code (parsers,
controllers... it varies from case to case; the quality also varies)
related to GenePop, fdist2, SimCoal2, Arlequin. I also have
preliminary code to work with HapMap and the UCSC table browser.
I have code implementing some statistics like Fst (Cockerham and Weir),
expected/observed heterozygosity, ...
I will be, in the medium term, quite interested in all the sequence
part (Tajima's D, Fu and Li's D, and e.g. the new statistic in the Voight
2006 paper). Also, linkage disequilibrium is very high on my priority
list.
I have been thinking quite a bit on representation of markers and
populations (especially in a genomic context). e.g. I have noticed
that you use a couple of arrays, one with names, the other with
sequences, to represent population data. I am currently scratching my
head with representation on a genomic scale (ie, multi-marker, mainly
because of LD). But I think this will come smoothly when I really
start to do LD studies...
This is all in a context of detecting selection, disentangling
selection from population structure, and hopefully, in the near future
coevolution in the context of host/parasite (diseases...).
I have set aside some time to ensure that all the code that I am writing
can be reused by the community. It is my plan to build and maintain
this code during the next years (I am funded until 2010 with a PhD
grant).
Regards,
Tiago
On 1/4/07, Ralph Haygood wrote:
> Tiago,
>
> Yes, I do still read biopython-dev. But at the moment, I have even
> less time than usual, because I'm at a conference. If there's
> something you want to ask me, go ahead, but unless the answer is
> trivial, it may take me several days.
>
> You're right that my stuff is very sequence oriented. In fact, it's
> very alignment oriented. It can analyze simple insertion/deletion as
> well as single-nucleotide variation. Here's a typical use case, to
> give you the flavor:
>
> alignment = phylip_file_to_alignment("sm50PromoterSpurAfra.phy")
> populations = {'Spur': range(20), 'Afra': [20]}
> statistics = Statistics(alignment, populations)
> print "ungapped length: %d" % statistics.ungapped_length()
> print "K SNPs: %d" % statistics.get_K('Spur')
> print "K simple indels: %d" % statistics.get_K_simple_indel('Spur')
> print "theta_W SNPs: %g" % statistics.get_theta_W('Spur')
> print "theta_W simple indels: %g" % statistics.get_theta_W_simple_indel('Spur')
> print "pi SNPs: %g" % statistics.get_pi('Spur')
> print "pi simple indels: %g" % statistics.get_pi_simple_indel('Spur')
> print "D_T SNPs: %g" % statistics.get_D_T('Spur')
> print "D_T simple indels: %g" % statistics.get_D_T_simple_indel('Spur')
> print "D_FL SNPs: %g" % statistics.get_D_FL('Spur', 'Afra')
> print "D_FL simple indels: %g" % statistics.get_D_FL_simple_indel('Spur', 'Afra')
> etc.
>
> Spur is Strongylocentrotus purpuratus and Afra is Allocentrotus
> fragilis, two closely related species of sea urchin. In this example,
> I have 20 sequences of a certain region from Spur and one from Afra,
> so I'm analyzing the population genetics of the region within Spur,
> with Afra as an outgroup for doing things like inferring which allele
> is ancestral at a polymorphism within Spur. K is the number of
> polymorphisms, theta_W is Watterson's estimator of 4 x effective
> population size x neutral mutation rate, pi is the average number of
> pairwise differences between alleles, D_T is Tajima's D, D_FL is Fu
> and Li's D (which requires an outgroup), etc. The software can do
> more elaborate things like permutation tests for assessing whether a
> statistic differs between two alignments, which might be something
> like known transcription factor binding sites versus other nucleotide
> sites in a promoter. The canned software DnaSP can't do that, which
> is one of the reasons why I wrote my stuff.
>
> Ralph
>
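For readers unfamiliar with the statistics Ralph lists, here is a minimal sketch of K, theta_W and pi as he defines them. It is not his implementation: the function names are hypothetical, and it assumes equal-length, ungapped sequences and counts SNPs only.

```python
from itertools import combinations


def segregating_sites(seqs):
    """K: the number of alignment columns with more than one allele."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """theta_W = K / a_n, where a_n = 1 + 1/2 + ... + 1/(n - 1)."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


def nucleotide_diversity(seqs):
    """pi: the average number of differences over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)


seqs = ["ACGT", "ACGA", "ACTT", "ACGT"]
print(segregating_sites(seqs))     # 2
print(watterson_theta(seqs))       # 2 / (1 + 1/2 + 1/3)
print(nucleotide_diversity(seqs))  # 1.0
```

Tajima's D compares these two estimates: it is the difference between pi and theta_W, scaled by an estimate of its standard deviation.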
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From bugzilla-daemon at portal.open-bio.org Sun Jan 7 16:25:18 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 7 Jan 2007 16:25:18 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701072125.l07LPIiS032620@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-01-07 16:25 -------
> In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
> defined as (None,None) tuples. However, in NCBIXML.py, these
> variables are set as integers. I don't see a point of a tuple at all,
> at least for NCBIXML. (I realize it is used in NCBIStandalone.py).
> Most importantly, the inconsistency makes it difficult to handle cases
> when the parameter is not set. It seems easiest, though, to just
> retain the tuple format.
I don't see a good reason for a tuple either -- though it may have seemed like
a good idea back in the days that Blast only produced plain-text output.
Instead of making NCBIXML also use a tuple, I'd rather set
HSP.identities|gaps|positives to None instead of (None, None) in Record.py.
This may break some code for people using NCBIStandalone. On the other hand, it
doesn't break Biopython's test suite.
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 10:24:20 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Jan 2007 10:24:20 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701081524.l08FOKFn008935@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 10:24 -------
Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps
- I would like all the parsers NCBIStandalone and NCBIXML (and ideally the HTML
parser too) to return identical record objects.
To do this, we could either:
(a) change NCBIXML to use tuples instead of integers (as suggested by Jacob)
or,
(b) change NCBIStandalone to use simple integers instead of tuples (is this
what you meant in comment 3 Michiel?)
Choice (b) would seem simpler in the long term - but would probably break more
existing code. Also, users of NCBIXML are going to have to update their
scripts anyway after bug 2051, so choice (a) would disrupt fewer people.
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 11:14:36 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Jan 2007 11:14:36 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
Swiss-Prot version (RX and OH lines are broken)
In-Reply-To:
Message-ID: <200701081614.l08GEaMm011511@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2043
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 11:14 -------
Checked in support for line type OH (Organism Host) for viral hosts based
on code from Kristian Rother. These lines were just being ignored.
From bugzilla-daemon at portal.open-bio.org Tue Jan 9 11:10:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Jan 2007 11:10:06 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701091610.l09GA6Wm004669@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-01-09 11:10 -------
> Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps
> - I would like all the parsers NCBIStandalone and NCBIXML (and ideally the
> HTML parser too) to return identical record objects.
Even after patch #2090, the NCBIStandalone parser is broken for multiple Blast
records, and will probably be broken for single Blast records also when a new
Blast version comes out. I haven't tried the HTML parser, but I'd be surprised
if it can parse HTML output from recent versions of Blast. So whereas I agree
in principle that the three parsers should return identical record objects, in
practice it's hardly relevant given that two of the three parsers either don't
work or cannot work reliably.
> To do this, we could either:
>
> (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob)
All three of us agree that there's no good reason for tuples. Option (a)
implies copying a bad design choice from a semi-broken parser to a functioning
parser.
> or,
>
> (b) change NCBIStandalone to use simple integers instead of tuples (is this
> what you meant in comment 3 Michiel?)
>
> Choice (b) would seem simpler in the long term - but would probably break more
> existing code. Also, users of NCBIXML are going to have to update their
> scripts anyway after bug 2051, so choice (a) would disrupt fewer people.
Both option (a) and (b) break existing code. So let me suggest option (c):
(c) Don't do anything.
This doesn't break any code. In the near term, people that use both the
plain-text parser and the XML parser will have to deal with differences in the
Blast record produced by the parser. But how many people is that anyway? Most
likely, not enough to justify option (a). In the long term, assuming that both
the plain-text parser and the HTML parser will be deprecated, there will be no
more inconsistencies.
My question to Jacob:
Why do you need to use the plain-text Blast parser? Is there something it can
do that the XML parser cannot?
From bugzilla-daemon at portal.open-bio.org Tue Jan 9 12:29:01 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Jan 2007 12:29:01 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701091729.l09HT1Vi009189@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-09 12:29 -------
Tuples/Integers for HSP.identities, HSP.gaps, and HSP.positives
---------------------------------------------------------------
Michiel's option (c) of doing nothing is very pragmatic.
If we go for this, I think we should at least update the record object's
documentation to say it will be a tuple (when used with NCBIStandalone) or an
integer (when used with NCBIXML). Perhaps we should also change the default in
a new record object too...
query_letters versus query_length
---------------------------------
Another of Jacob's suggestions was to rename the record.query_letters (short for
number of letters in query?) to something like query_length (which is closer to
the actual text of query_len used in the XML file). I personally am not
inclined to change this even though it would be slightly clearer.
Note that I have corrected the error on line 186 of NCBIXML.py in CVS - well
spotted Jacob. This mistake was my fault - recently introduced as part of the
changes I made on bug 2051.
From bugzilla-daemon at portal.open-bio.org Wed Jan 10 15:47:59 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Jan 2007 15:47:59 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701102047.l0AKlxX6027453@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-01-10 15:47 -------
> Another of Jacob's suggestions was to rename the record.query_letters (short for
> number of letters in query?) to something like query_length (which is closer to
> the actual text of query_len used in the XML file). I personally am not
> inclined to change this even though it would be slightly clearer.
In principle I agree with Jacob on this one. But as Jacob also indicates, there
are probably more variable names that are less than ideal. So if we change
these variable names, it's better to change all of them at the same time. This,
however, will break a lot of existing code. With all the other changes to the
Blast parsers, now doesn't seem to be the best time for such a change. However,
let's get back to this point once the dust settles with the Blast parsers.
With hit_id, hit_def, and hsp.align_length, I see no problems with Jacob's
suggestion. Objections, anybody?
From bugzilla-daemon at portal.open-bio.org Wed Jan 10 16:05:23 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Jan 2007 16:05:23 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701102105.l0AL5Nht028299@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #8 from jmjoseph at andrew.cmu.edu 2007-01-10 16:05 -------
> My question to Jacob:
> Why do you need to use the plain-text Blast parser? Is there something it can
> do that the XML parser cannot?
I use only the XML parser. My greatest concern is not that the plain-text and
XML parsers are different, but rather that the XML parser is not consistent
with Record.py. An example that I consider completely broken is the definition
of query_length in Record.py, but the use of self._blast.query_letters in
NCBIXML.py.
To avoid breaking the existing plain-text parser code, would it be too
objectionable to use a new class, Record-XML.py, with definitions that exactly
match the usage in NCBIXML.py? Since few people are likely to use both
parsers, and any using the XML parser have required recent code updates anyway,
perhaps this separation would be easiest.
Thanks. -Jacob
From bugzilla-daemon at portal.open-bio.org Thu Jan 11 00:18:17 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 11 Jan 2007 00:18:17 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701110518.l0B5IHuD018624@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-01-11 00:18 -------
> To avoid breaking the existing plain-text parser code, would it be too
> objectionable to use a new class, Record-XML.py, with definitions that exactly
> match the usage in NCBIXML.py?
Go ahead, but don't add it to Biopython ;-).
It would just add to the confusion, without a real benefit to users.
From kosa at genesilico.pl Thu Jan 11 03:52:27 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 09:52:27 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
Message-ID: <45A5FACB.40109@genesilico.pl>
Hi,
Is anyone developing, or planning to develop, an Alignment class in
Biopython as powerful as, for example, SimpleAlign in Bioperl? See, for
instance, the methods available in Bioperl:
http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
The reason I am asking is that I do not know if I should start working
on a more functional subclass of the Biopython Alignment class (I do not
want to go back to Perl ;-)...
Regards,
Janek
From fkauff at duke.edu Thu Jan 11 04:27:25 2007
From: fkauff at duke.edu (Frank)
Date: Thu, 11 Jan 2007 10:27:25 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
In-Reply-To: <45A5FACB.40109@genesilico.pl>
References: <45A5FACB.40109@genesilico.pl>
Message-ID: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
Hi Janek,
the Nexus parser in Biopython (for which I still haven't written any
documentation...) basically holds an alignment, and has some methods
that deal with basic alignment functionality. If you're going to work on
a more sophisticated alignment class, maybe we should try to get the Nexus
class and the alignment class to work smoothly together.
Frank
On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote:
> Hi,
>
> Is anyone going to develop or developing now an Alignment class in
> Biopython as powerful as for example SimpleAlign in Bioperl? Look here
> for instance for methods available in Bioperl
> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
>
> The reason I am asking is that I do not know if I should start working
> on more functional subclass of biopython Alignment class (I do not want
> to come back to Perl ;-)...
>
> Regards,
> Janek
>
>
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From tiagoantao at gmail.com Thu Jan 11 05:36:35 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 11 Jan 2007 10:36:35 +0000
Subject: [Biopython-dev] PopGen code
Message-ID: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
Hi,
A couple of weeks ago I put code related to population genetics on
bugzilla, namely parsing of GenePop and Fdist files, plus
code to control fdist.
I have had no feedback whatsoever, such as comments on the quality of
the code or on whether there is interest in adding it to
BioPython in the future.
I have much more code that I could start converting to BioPython
format, some of which is a bit more complicated to convert (e.g., an
Arlequin format parser). Before I start doing so, I would like to know
if there will be any feedback at all or if I am just wasting my
time...
Regards,
Tiago
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From kosa at genesilico.pl Thu Jan 11 06:34:50 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 12:34:50 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
In-Reply-To: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
References: <45A5FACB.40109@genesilico.pl>
<1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
Message-ID: <45A620DA.4000302@genesilico.pl>
Hi,
I have a feeling that it would be better to write all methods similar to
the BioPerl ones directly for the BioPython Alignment class. The main
reason is that this class is not tied to any format like Fasta, Clustal or
Nexus. It stores SeqRecords, which are also not tied to Fasta or any other
format. It would make many things easier. For instance, I can write all my
functions that do something with alignments so that they accept general
Alignment objects (and not necessarily FastaAlignment or
ClustalAlignment objects). Would it not be better to write all the code
that does general things with alignments (column counting, column
selection/removal etc.) so that it works with the general Alignment class
rather than with a class for alignments in one specific format?
Janek
Frank wrote:
> Hi Janek,
>
> the Nexus parser in Biopython (for which I still haven't written any
> documentation...) basically holds an alignment, and has some methods
> that deal with basic alignment functionality. If you're going to work on
> a more sophisticated alignment class, maybe we should try to get the Nexus
> class and the alignment class to work smoothly together.
>
> Frank
>
>
> On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote:
>
>> Hi,
>>
>> Is anyone developing, or planning to develop, an Alignment class in
>> Biopython as powerful as, for example, SimpleAlign in Bioperl? See, for
>> instance, the methods available in Bioperl:
>> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
>>
>> The reason I am asking is that I do not know if I should start working
>> on a more functional subclass of the Biopython Alignment class (I do not
>> want to go back to Perl ;-)...
>>
>> Regards,
>> Janek
From kosa at genesilico.pl Thu Jan 11 07:11:11 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 13:11:11 +0100
Subject: [Biopython-dev] [BioPython] what to use for working with fasta
sequences and alignments?
In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
<45A50D24.1090906@maubp.freeserve.co.uk>
Message-ID: <45A6295F.5030103@genesilico.pl>
Are you going to fix this in the new SeqIO?
When using Bio.SeqIO.FASTA.FastaReader, the names of the sequences are
truncated at the first space.
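To make the request concrete, here is a minimal sketch of the behaviour being asked for. It assumes a hypothetical helper, split_fasta_title, which is not part of Bio.SeqIO: the first word of the title line serves as the identifier, while the complete title is kept as the description so that nothing after the first space is lost.

```python
def split_fasta_title(title_line):
    """Split a FASTA '>' line into (identifier, full description).

    Hypothetical helper for illustration only; not Biopython API.
    """
    text = title_line.lstrip(">").strip()
    # The identifier is the first whitespace-delimited word (empty if none).
    words = text.split(None, 1)
    identifier = words[0] if words else ""
    # The full title is returned as well, so no information is discarded.
    return identifier, text


name, description = split_fasta_title(">gi|12345 hypothetical protein [E. coli]")
```

A caller that needs only a short key can use the identifier, and one that needs the original name still has the description.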
Janek
Peter (BioPython List) wrote:
> Jan Kosinski wrote:
>> Hi,
>>
>> I am quite new to BioPython and I am a little confused when
>> trying to use it to work with FASTA sequences and alignments.
>>
>> For instance, I can read and parse FASTA files with Bio.Fasta, return
>> records (of the Fasta.Record class), iterate and so on. But then I
>> turn to the Bio.Fasta.FastaAlign module, which offers the FastaAlignment
>> class (a subclass of the Alignment class). However, this class has very
>> few methods, and get_all_seqs and get_seq_by_num return SeqRecord
>> objects instead of Fasta.Record objects (why??), which makes it hard to
>> use Bio.Fasta.FastaAlign (with SeqRecord) for alignments together with
>> Bio.Fasta (with Fasta.Record) for sequences. Maybe I am wrong, but
>> Biopython seems to be full of incompatibilities. Or should one know
>> which modules and classes not to use?
>>
>> Could you recommend what I should use for my work with FASTA
>> sequences and alignments? Which BioPython modules and classes?
>
> You can use Bio.Fasta to read in files either as Fasta.Record objects,
> or as SeqRecord objects. I would use SeqRecord objects - they are
> more general should you ever want to use a different input file format
> - plus as you have noticed, the alignment object also uses SeqRecord
> objects to hold each (gapped) sequence.
>
> There are other options if you search the code - but Bio.Fasta is the
> best documented and most used.
>
> If you are brave, then you might have a look at the new code in
> Bio.SeqIO which you can get from CVS. This is still in a state of
> flux however... but the Fasta parsing is much faster. See this page
> and the mailing list archives for more:
>
> http://www.biopython.org/wiki/SeqIO
>
> > Or should I use other packages like CoreBio?
>
> You could do - it has the advantage of having started recently from a
> clean slate, and having much less "old code".
>
>> Thank you in advance for any guidelines,
>> Janek Kosinski
>
> Peter
From biopython-dev at maubp.freeserve.co.uk Fri Jan 12 07:36:56 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Jan 2007 12:36:56 +0000
Subject: [Biopython-dev] PopGen code
In-Reply-To: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
Message-ID: <45A780E8.2070803@maubp.freeserve.co.uk>
Tiago Antão wrote:
> Hi,
>
> A couple of weeks ago I put code related to population genetics on
> Bugzilla, namely parsers for GenePop and Fdist files, plus code to
> control fdist.
> I have had no feedback whatsoever: no comments on the quality of
> the code, on whether there is interest in adding it to BioPython in
> the future, etc.
>
> I have much more code that I could start converting to BioPython
> format, some of which is a bit more complicated to convert (e.g., an
> Arlequin format parser). Before I start doing so I would like to know
> if there will be any feedback at all or if I am just wasting my
> time...
I suppose I/we would be able to read your code from a general
perspective (coding style, clarity of comments, etc). I haven't made
time for this.
I suspect BioPython currently has no active developers who feel
qualified to interpret your population genetics code. I was hoping that
you and Ralph Haygood would combine forces - if you are both happy with
some code that does bode well. Any comments Michiel?
Regarding population genetic file formats - from a very quick search
about Arlequin it sounds like this file format can hold lots of
different types of data. I would encourage you to try and come up with
a generic population record data object that could hold this or
information from GenePop or Fdist as well. I have no idea how easy this
would be...
Peter
From tiagoantao at gmail.com Fri Jan 12 09:16:53 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 12 Jan 2007 14:16:53 +0000
Subject: [Biopython-dev] PopGen code
In-Reply-To: <45A780E8.2070803@maubp.freeserve.co.uk>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
<45A780E8.2070803@maubp.freeserve.co.uk>
Message-ID: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
Hi,
Thanks for the answer.
> I suspect BioPython currently has no active developers who feel
> qualified to interpret your population genetics code. I was hoping that
> you and Ralph Haygood would combine forces - if you are both happy with
> some code that does bode well. Any comments Michiel?
I think Ralph (who subscribes to this list, and thus can comment) has
strong time constraints, and will probably have little available time
in the near future...
> Regarding population genetic file formats - from a very quick search
> about Arlequin it sounds like this file format can hold lots of
> different types of data. I would encourage you to try and come up with
> a generic population record data object that could hold this or
> information from GenePop or Fdist as well. I have no idea how easy this
> would be...
I have been thinking a lot about a generic data structure to hold
population genomic (i.e. not only genetic) data. I have, in fact,
implemented (in CAML, not Python) quite a few different data
representations. I was not happy with any of them. Different kinds of
markers (that sometimes overlap, e.g. sequences and SNPs), linkage
disequilibrium (thus relations between markers...), ploidy (no need to
think of different organisms: think of mitochondria, nuclear chromosomes,
the Y chromosome), ... make a general solution non-trivial.
As I see it, there are a few options:
1. Have a grand, unified structure, but that will take time to mature
2. Accept that there will be different representations for different
scopes, regard that as a bad thing and live with it
3. Accept that there will be different representations, and that this
is good, in the sense that a one-size-fits-all approach in this case
has lots of problems
I think the pragmatic approach for now is not to have a generic
representation. I would lean towards letting things mature (develop
statistics, parsers, ...) and, once there is more experience (and,
hopefully, user feedback), reassessing the issue of a general
representation. I am aware that this will entail each part of the code
having a different calling data structure, but I think that with care
and common sense that won't be very problematic.
I don't mind having the code on an alpha branch for as long as you see
fit. I just want to be sure that whatever effort I put into converting
my code to BioPython (or writing new code) is not lost, which is why I
would like feedback on what will happen to the code that I am
submitting. I am willing to accommodate any reasonable requirements
regarding code quality and development process...
Regards,
Tiago
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From bsouthey at gmail.com Fri Jan 12 09:30:26 2007
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 12 Jan 2007 08:30:26 -0600
Subject: [Biopython-dev] PopGen code
In-Reply-To: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
<45A780E8.2070803@maubp.freeserve.co.uk>
<6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
Message-ID:
Hi,
While I do have a remote interest in this, I do not have any time to
look at it at present. As I mentioned in a previous email,
John Cole is doing related work in Python, but not as part of BioPython.
It would probably be good to have some unified approach and direction,
because of the overlaps that occur and so that other pieces of code can be
easily added.
Regards
Bruce
From bugzilla-daemon at portal.open-bio.org Fri Jan 12 19:27:11 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Jan 2007 19:27:11 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701130027.l0D0RBlR027978@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2007-01-12 19:27 -------
I've committed the code to handle hit_id, hit_def, and hsp.align_length to CVS.
Let's keep this bug report open for now to remind ourselves to revisit the
issues with variable names at some point.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:04:24 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:04:24 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
Swiss-Prot version (RX and OH lines are broken)
In-Reply-To:
Message-ID: <200701151104.l0FB4OdQ015531@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2043
k_rother at yahoo.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |k_rother at yahoo.de
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:07:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:07:33 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151107.l0FB7Xdb015724@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
------- Comment #2 from k_rother at yahoo.de 2007-01-15 06:07 -------
Created an attachment (id=543)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=543&action=view)
new date() method handling new style DT lines
I am new to Bugzilla and don't know whether this is the proper way to submit code.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:17:55 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:17:55 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151117.l0FBHtd1016589@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
------- Comment #3 from k_rother at yahoo.de 2007-01-15 06:17 -------
Created an attachment (id=544)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=544&action=view)
SProt.py that digests all 250,000 Uniprot entries successfully.
I also checked the data record contents to confirm that the dates and version
numbers of the first few entries are correct.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:19:59 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:19:59 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151119.l0FBJxt9016846@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
k_rother at yahoo.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |k_rother at yahoo.de
Status|NEW |RESOLVED
Resolution| |WORKSFORME
------- Comment #4 from k_rother at yahoo.de 2007-01-15 06:19 -------
I think this should close the bug unless someone wants to beautify the code.
KR
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 07:26:57 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:26:57 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151226.l0FCQvNE022159@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |biopython-
| |bugzilla at maubp.freeserve.co.
| |uk
Status|RESOLVED |REOPENED
Resolution|WORKSFORME |
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:26 -------
Reopening - it's not fixed until we have updated the code in CVS.
However, I will try and have a look at your code.
By the way - in general developers prefer patches rather than chunks of code, or
edited copies of the original.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 07:51:46 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:51:46 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151251.l0FCpk8H023891@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|REOPENED |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:51 -------
I have updated CVS with a slightly modified version of your code Kristian.
See revision 1.36, web version here:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython
It passes the old unit test, test_SProt.py, but if you could double check this
on the latest release that would be great.
Thanks very much.
From biopython-dev at maubp.freeserve.co.uk Mon Jan 15 15:04:34 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Jan 2007 20:04:34 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45A94BFD.5080209@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
Message-ID: <45ABDE52.1030300@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> In my opinion, the new Bio.SeqIO code is a huge improvement to
> Biopython, so I'd be happy to make a new release for it.
>
> ...
>
> For Bio.SeqIO, we're also in pretty good shape, as far as I can tell.
> From what I remember, the remaining issues were
> 1) Which functionality to include, in particular
> a) if functions should accept file names in addition to file handles;
I have decided to follow Michiel's stance on this issue: handles only.
> b) if functions should infer the file format from the file extension,
> the file content, or otherwise.
Right now the file format string is optional and if omitted the file
extension (via handle.name) is used to try and guess.
It would be trivial to remove this functionality and make format a
required argument.
We could at a later date choose to add limited support for format
guessing based on file contents without altering the function parameters
(i.e. the API).
Both these features would be nice to have (speaking as a user), but then
again, am I prepared to support the headaches they may cause later on?
I'm wavering on this issue (having previously been in favour of
including the format guessing).
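For readers weighing the trade-off, a minimal sketch of extension-based guessing via handle.name might look like the following. The function name guess_format and the extension table are assumptions made for illustration; they are not the code in CVS.

```python
import os

# Hypothetical extension table for illustration; not taken from Bio.SeqIO.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gbk": "genbank",
    ".aln": "clustal",
    ".phy": "phylip",
}


def guess_format(handle, format=None):
    """Return the explicit format if given, else guess from handle.name."""
    if format is not None:
        return format
    # Not every handle has a usable name (e.g. in-memory buffers, sockets).
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        raise ValueError("Cannot guess the format: handle has no file name")
    extension = os.path.splitext(name)[1].lower()
    if extension not in EXTENSION_TO_FORMAT:
        raise ValueError("Cannot guess the format from extension %r" % extension)
    return EXTENSION_TO_FORMAT[extension]
```

The two error branches are where the support headaches would come from: handles without a name and extensions that are missing or ambiguous. Making format a required argument removes both.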
Item 1(c) on Michiel's list could have been whether we need the three
"helper functions" which turned a file into a SeqRecord list, dictionary or
alignment.
Again, I have come round to Michiel's view and removed these as they
were just simple wrappers for list, SequencesToDictionary and
SequencesToAlignment.
> 2) What are the best names for the functions that the user will see.
The good news is that after that little spring clean there are fewer
functions to name - just these four really:
SequenceIterator, once known as FileToSequenceIterator and before that
File2SequenceIterator. Now takes just an input file handle and an
optional file format. Returns a SeqRecord iterator.
SequencesToDictionary - takes SeqRecord iterator or list, plus an
optional function to define the keys, and returns a dictionary.
SequencesToAlignment - takes SeqRecord iterator or list, and returns an
alignment object. Perhaps this functionality should be included in the
alignment class itself...
WriteSequences, once known as SequencesToFile - takes a SeqRecord
iterator or list, an output handle, and a format string. Intended for
use on a whole file at once (i.e. the general case where there may be
headers/footers etc). This does not let you do incremental writes, one
per record (which would be possible for some formats like GenBank
or fasta).
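As an illustration of the second function in the list, here is a rough sketch of a SequencesToDictionary-style helper. It is not the code in CVS: the namedtuple stands in for SeqRecord so the example runs on its own, and the parameter name key_function and the duplicate-key check are assumptions.

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord so the sketch is self-contained.
Record = namedtuple("Record", ["id", "seq"])


def sequences_to_dictionary(records, key_function=None):
    """Build {key: record} from any iterable of records (list or iterator)."""
    if key_function is None:
        # Default key: the record identifier.
        key_function = lambda record: record.id
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            # Assumed policy: refuse to silently overwrite an earlier record.
            raise ValueError("Duplicate key %r" % (key,))
        result[key] = record
    return result


records = iter([Record("alpha", "ACGT"), Record("beta", "GGCC")])
by_id = sequences_to_dictionary(records)
by_seq = sequences_to_dictionary(by_id.values(), key_function=lambda r: r.seq)
```

Because the function only iterates once over its input, it works equally well with a list or with the iterator returned by a parser.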
Peter
From mdehoon at c2b2.columbia.edu Mon Jan 15 18:00:16 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 15 Jan 2007 18:00:16 -0500
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45ABDE52.1030300@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu>
<45ABDE52.1030300@maubp.freeserve.co.uk>
Message-ID: <45AC0780.9060306@c2b2.columbia.edu>
Peter wrote:
> WriteSequences, once known as SequencesToFile - takes a SeqRecord
> iterator or list, an output handle, and a format string. Intended for
> use on a whole file at once (i.e. the general case where there may be
> headers/footers etc). This does not let you do incremental writes, one
> per record (which would be possible for some formats like GenBank
> or fasta).
At the end of WriteSequences, the file is closed:
def WriteSequences(sequences, handle, format) :
    ...
    handle.close() #just in case the writer object forgot
Why would it be a problem if the handle is not closed?
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 06:10:21 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 11:10:21 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45AC0780.9060306@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk>
<45AC0780.9060306@c2b2.columbia.edu>
Message-ID: <45ACB29D.2010002@maubp.freeserve.co.uk>
Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>> WriteSequences, once known as SequencesToFile - takes a SeqRecord
>> iterator or list, an output handle, and a format string. Intended for
>> use on a whole file at once (i.e. the general case where there may be
>> headers/footers etc). This does not let you do incremental writes, one
>> per record (which would be possible for some formats like GenBank
>> or fasta).
>
> At the end of WriteSequences, the file is closed:
>
> def WriteSequences(sequences, handle, format) :
>     ...
>     handle.close() #just in case the writer object forgot
>
> Why would it be a problem if the handle is not closed?
OK, I've fixed that.
That issue was on my mind too - in particular it would stop Bio.SeqIO
from creating concatenated phylip alignments which are used in
bootstrapping. Reading this sort of file is a different issue, which I
am also currently thinking about.
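A small sketch of why leaving the handle open matters: if the writing function does not close the handle, the caller can append several blocks to the same file. The function write_fasta below is a hypothetical stand-in, and FASTA is used instead of PHYLIP only to keep the example short.

```python
import io


def write_fasta(records, handle):
    """Write (name, sequence) pairs as FASTA text; the caller owns the handle."""
    count = 0
    for name, sequence in records:
        handle.write(">%s\n%s\n" % (name, sequence))
        count += 1
    # Deliberately no handle.close() here: the caller decides when to close.
    return count


handle = io.StringIO()
write_fasta([("a", "ACGT")], handle)
write_fasta([("b", "TTGA")], handle)  # would fail if the first call had closed it
output = handle.getvalue()
handle.close()
```

The same pattern would allow several alignments to be written one after another into a single file, as in the bootstrapping case mentioned above.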
Peter
From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 07:48:49 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 12:48:49 +0000
Subject: [Biopython-dev] Bio.SeqIO - Output
In-Reply-To: <45ACB29D.2010002@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk> <45AC0780.9060306@c2b2.columbia.edu>
<45ACB29D.2010002@maubp.freeserve.co.uk>
Message-ID: <45ACC9B1.6010709@maubp.freeserve.co.uk>
I've been thinking about sequence output (i.e. writing sequence files),
and have come to the conclusion that my writer classes in
Bio/SeqIO/Interfaces.py are probably too complicated.
My current Bio.SeqIO output implementation tries to be very flexible -
if you look beyond the top level function WriteSequences (aka
SequencesToFile) then the individual writer classes have a confusing
range of capabilities.
New Idea
========
I was thinking that we should only support two cases for sequence output:
(*) simple sequential file formats
- record by record, or file at once
- can use a SeqRecord iterator (or a list)
(*) all other file formats
- file at once only
- probably needs a list of SeqRecords (not an iterator)
For the sequential file formats such as fasta, genbank and swiss there
are no headers or footers - and a single sequence alone would be a valid
file.
For all other file formats (e.g. clustal, stockholm, phylip, anything in
XML, ...) we would only offer the "file at once" option.
When implementing a writer for a new file format, you just have to
implement a "write file" function or a "write record" function which
takes the record(s) and a handle. The implementation details are up to you.
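The two cases could be sketched roughly as follows. Everything here is hypothetical (the function names, the two registries, and the much simplified PHYLIP-like layout); it only illustrates the split between "write record" and "write file" functions, not an actual Bio.SeqIO design.

```python
import io


def write_fasta_record(record, handle):
    """'Write record' function: one record at a time, no header or footer."""
    name, sequence = record
    handle.write(">%s\n%s\n" % (name, sequence))


def write_phylip_file(records, handle):
    """'Write file' function: the header needs counts, so a list is required."""
    records = list(records)
    length = len(records[0][1]) if records else 0
    handle.write(" %d %d\n" % (len(records), length))
    for name, sequence in records:
        # Simplified layout: name padded or cut to 10 characters, then sequence.
        handle.write("%-10s%s\n" % (name[:10], sequence))


RECORD_WRITERS = {"fasta": write_fasta_record}  # simple sequential formats
FILE_WRITERS = {"phylip": write_phylip_file}    # all other formats


def write_sequences(records, handle, format):
    """Dispatch to a per-record writer or a whole-file writer."""
    if format in RECORD_WRITERS:
        for record in records:  # works with an iterator
            RECORD_WRITERS[format](record, handle)
    elif format in FILE_WRITERS:
        FILE_WRITERS[format](records, handle)
    else:
        raise ValueError("Unknown format %r" % (format,))
```

With this split, supporting a new format means writing one function and adding it to one of the two registries.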
Drawbacks
=========
There are some sequential file formats where, under the scheme above,
you would be forced to write the file in one go...
However, I can only think of one irrelevant example, so this may not
matter. Can anyone suggest some other examples? Some sort of simple
tabular file with a header row maybe?
For example simple Stockholm files (if you ignore the PFAM style
annotation) have a generic header, followed by sequential records and a
generic footer.
The point here is that the header does not contain anything about the
records which will follow it. e.g. The number of records, or if they
are protein or nucleotides.
For files like this it would be possible to write the file record by
record given an iterator - provided you also write the header and footer.
Right now this is the only file format I can think of that has this
property - and I don't currently even support this (instead like BioPerl
I create Stockholm files with PFAM style annotations).
Stockholm files with PFAM style annotation do not qualify, because the
header contains the number of records. Similarly for non-interlaced PHYLIP.
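To illustrate the simple Stockholm case, a sketch of a writer that streams records between a generic header and footer is shown below. The function name and the (name, sequence) tuples are assumptions; annotation lines are ignored entirely.

```python
import io


def write_simple_stockholm(records, handle):
    """Stream (name, sequence) pairs between a fixed header and footer."""
    handle.write("# STOCKHOLM 1.0\n")  # header says nothing about the records
    for name, sequence in records:     # single pass, so an iterator is enough
        handle.write("%s %s\n" % (name, sequence))
    handle.write("//\n")               # generic footer


def generate_records():
    """Yield records one at a time, as a parser iterator would."""
    yield ("seq1", "ACGT-A")
    yield ("seq2", "ACGTTA")


output = io.StringIO()
write_simple_stockholm(generate_records(), output)
stockholm_text = output.getvalue()
```

A format whose header states the number of records cannot be written this way without first collecting the records into a list.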
Peter
From mdehoon at c2b2.columbia.edu Tue Jan 16 12:51:23 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:51:23 -0500
Subject: [Biopython-dev] [BioPython] Next release plans;
was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
<45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD109B.4030009@c2b2.columbia.edu>
Peter wrote:
> Regarding the fix checked in on bug 1970 I still would prefer we call
> the new XML iterator NCBIXML.Iterator(handle) rather than
> NCBIXML.parse(handle) but I'll live ;)
>
I chose "parse" because it is used in the old (Biopython release 1.42)
Blast XML parser:
Old:
>>> from Bio.Blast import NCBIXML
>>> b_parser = NCBIXML.BlastParser()
>>> b_record = b_parser.parse(blast_out)
New:
>>> from Bio.Blast import NCBIXML
>>> b_records = NCBIXML.parse(blast_out)
>>> b_record = b_records.next() # Repeat to get subsequent Blast records
While I am not dead set on "parse", it agrees with similar functions
in Python:
1) Function name is a verb, not a noun
2) Function name describes what the function does, not what the function
returns
3) Function names are short, and start with a lower case letter.
For example, to read a file line-by-line in Python:
>>> inputfile = open("somefunnyfile")
# "open"; not "Iterator", nor "FileToLineIterator",
# even though "open" returns an iterator:
>>> for line in inputfile:
...     print line
To read an image file with the Python Imaging Library:
>>> import Image
>>> im = Image.open("lena.ppm")
# "open"; not "Image", nor "FileNameToImage".
To read a Python object from a pickled file:
>>> import pickle
>>> inputfile = open("somepickledfile")
>>> myobject = pickle.load(inputfile)
# "load"; not "FileToObject".
>>> inputfile.close()
To parse an XML file with the sax parser framework in Python:
>>> from xml.sax.handler import ContentHandler
>>> from xml import sax
>>> handler = SomeSubclassOfContentHandler()
>>> inputfile = open("myxmlfile.xml")
>>> sax.parse(inputfile, handler)
# "parse", same as in the new Bio.Blast.NCBIXML
>>> inputfile.close()
So, for Bio.Blast.NCBIXML, good names would be "load", "read", "parse",
or something similar. "Iterator" would not be consistent; besides, until
recently I didn't know what an iterator was, so I doubt that new users
would know.
What we could do is to have two functions in Bio.Blast.NCBIXML, perhaps
one called "read" and the other "iterate", where the former returns a
single Blast record (for an XML file containing only one Blast result),
and the latter an iterator over multiple Blast records.
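A sketch of how such a "read" function could be layered on top of any function that returns an iterator of records is given below. The helper is hypothetical and generic; applying it to the result of NCBIXML.parse(handle) is an assumption, not existing Biopython code.

```python
def read(records):
    """Return the single record from an iterable; raise if there is not exactly one."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found")
    try:
        next(iterator)
    except StopIteration:
        # Exactly one record was available.
        return first
    raise ValueError("More than one record found; iterate over the records instead")


single = read(iter(["only record"]))
```

Users with a single Blast result would then never need to deal with an iterator, while multi-record files remain available through the iterating function.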
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From mdehoon at c2b2.columbia.edu Tue Jan 16 12:47:17 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:47:17 -0500
Subject: [Biopython-dev] [BioPython] Next release plans;
was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
<45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD0FA5.3060008@c2b2.columbia.edu>
Peter wrote:
> In general, I agree that the Blast XML parser in CVS looks in good shape
> - but we really need to update the documentation for using Blast for the
> next release.
>
Yeah I know, I've been holding off on updating the documentation so it
is consistent with the latest Biopython release 1.42. I'll update it
together with the next release.
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From bugzilla-daemon at portal.open-bio.org Sat Jan 27 18:01:07 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Jan 2007 18:01:07 -0500
Subject: [Biopython-dev] [Bug 1963] Adding __str__ method to codon tables
and translators
In-Reply-To:
Message-ID: <200701272301.l0RN17l6026463@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1963
------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-01-27 18:01 -------
> Question One:
> Is this worth adding to BioPython or not?
Yes, definitely.
> Question Two:
> What is the preferred behaviour for ambiguous tables? Just a 4x4x4 table as
> for the unambiguous tables? Or the full 15x15x15 table? I have implemented
> both (see commented out code)
My feeling is that 15x15x15 would become too large to be clearly visible on the
screen. So I'd prefer 4x4x4, maybe with a reminder printed at the end as to
what each ambiguous codon may represent.
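A minimal sketch of the 4x4x4 layout, assuming the standard genetic code (NCBI table 1) written in the usual TCAG order. The function name format_codon_table and the exact column layout are illustrative assumptions, not the patch attached to this bug.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def format_codon_table(table, bases=BASES):
    """Return a text grid: one block per first base, one column per second base."""
    lines = []
    for first in bases:
        for third in bases:
            cells = []
            for second in bases:
                codon = first + second + third
                cells.append("%s %s" % (codon, table[codon]))
            lines.append("  ".join(cells))
        lines.append("")  # blank line between the four blocks
    return "\n".join(lines).rstrip()


print(format_codon_table(CODON_TABLE))
```

Sixteen rows of four columns fit on one screen, which is the readability argument for 4x4x4; a reminder about ambiguous codons could be appended as extra lines after the grid.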
> Question Three:
> Is there a standard BioPython function to convert from one letter amino acid
> sequences into three letter names? i.e. like one_to_three from
> Bio.PDB.Polypeptide but more general. That function does not cope with
> ambiguous names.
There is the function seq3 in Bio/SeqUtils. If it is not complete, it can be
extended easily, and seems to be a better place for this general function than
Bio/PDB/Polypeptide.
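For illustration, a general one-letter to three-letter conversion that also copes with ambiguous codes might look like the sketch below. It is not the implementation of seq3 in Bio/SeqUtils; the function name one_to_three_general and the choice of "Xaa" for unrecognised letters are assumptions.

```python
# IUPAC one-letter codes, including the ambiguous B, Z and X, plus "*" for stop.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa", "*": "Ter",
}


def one_to_three_general(sequence, unknown="Xaa"):
    """Convert a one-letter amino acid string to concatenated three-letter names."""
    return "".join(THREE_LETTER.get(letter.upper(), unknown) for letter in sequence)


print(one_to_three_general("MBZ*"))
```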
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:15:00 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:15:00 -0500
Subject: [Biopython-dev] [Bug 2174] New: FDist Support in BioPython
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
Summary: FDist Support in BioPython
Product: Biopython
Version: 1.24
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: tiagoantao at gmail.com
This is an enhancement bug to submit code related to fdist2
http://www.rubic.rdg.ac.uk/~mab/software.html
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:15:18 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:15:18 -0500
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200701031315.l03DFIGn007058@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
tiagoantao at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:16:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 08:16:06 -0500
Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython
In-Reply-To:
Message-ID: <200701031316.l03DG6qL007102@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2174
------- Comment #1 from tiagoantao at gmail.com 2007-01-03 08:16 -------
Created an attachment (id=532)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=532&action=view)
Code support fdist
From tiagoantao at gmail.com Wed Jan 3 13:16:30 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 3 Jan 2007 13:16:30 +0000
Subject: [Biopython-dev] FDist: more Population Genetics code
Message-ID: <6d941f120701030516m1adb3daeh6e4645121ba8679d@mail.gmail.com>
Hi!
I have submitted another enhancement bug, with support for FDist. It
allows generating and parsing FDist files and controlling the fdist
applications. There are also a couple of utility functions. FDist is a
niche application (mainly used to detect selection in animal
genetics). It is not the most fundamental one to support, but it is
the one I am currently working on, hence the code.
Regarding my submitted code for GenePop, I have submitted a different
version on Bugzilla. The main difference is that I moved everything
from Bio to Bio.PopGen.
Before I continue putting code on bugzilla I would like to know if it
is worthwhile doing it... Any opinions on the code submitted or if any
changes are required? I would really like to continue converting my
code to BioPython, but only if it has any possibility of ending up
being useful/included in distribution somewhere in the future... ;)
I am currently working on code related to SimCoal2, Arlequin and
general statistics (Fst, heterozygosity, ...), which will probably be
ready quite soon (i.e., within the next two weeks). This is more
mainstream than FDist.
I have some other code lying around mainly related to HapMap, but I
will only submit it after reviewing and reusing it again. This is more
distant future ... like a couple of months.
Tiago
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:38:39 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:38:39 -0500
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200701032138.l03Lcdji028402@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2007-01-03 16:38 -------
> Regardless, I do still see a
> number of inconsistencies.
Please submit a separate bug report (including your patch) for these
inconsistencies. The current bug report is titled
"XML Blast parser unusable with multiple queries and recent (2.2.13) blast -
patch attached"
With Peter's patch, we can now parse multiple blast queries, so I'd like to
close this bug report.
For future bug reports and patches:
Try to handle separate bugs in separate bug reports and patches. For
developers, when looking at a patch handling several issues at the same time,
it's difficult to understand which parts of the patch are essential, which are
good but non-essential, and which are code cleanup. Speaking for myself, I
would probably have considered this patch earlier if it had been less
convoluted.
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:48:45 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:48:45 -0500
Subject: [Biopython-dev] [Bug 2176] New: XML Blast parser: miscellaneous bug
fixes and cleanup
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
Summary: XML Blast parser: miscellaneous bug fixes and cleanup
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: jmjoseph at andrew.cmu.edu
This follows the discussion started in bug 2051. The blast XML parser does now
work (Thanks!), but could still use a little work. Here's a list of the issues
I can see now. I'll follow with patches to correct a few.
In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
defined as (None,None) tuples. However, in NCBIXML.py, these
variables are set as integers. I don't see a point of a tuple at all,
at least for NCBIXML. (I realize it is used in NCBIStandalone.py).
Most importantly, the inconsistency makes it difficult to handle cases
when the parameter is not set. It seems easiest, though, to just
retain the tuple format.
In the past, I worried that the order of tuple building for
self._blast.gap_penalties or ka_params could cause the tuple to have
an incorrect ordering. I seem to remember hitting an issue where the
tuple was built with the wrong length, but I can't be specific. In
general, it remains odd to me to not just use a list and set each
element respectively. If necessary, one could convert to a tuple when
finished or use some other approach that does not rely upon order.
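The order-independent approach suggested above could be sketched like this. The class name PenaltyBuilder and the slot constants are hypothetical; the parser itself does not contain this code.

```python
GAP_OPEN, GAP_EXTEND = 0, 1  # fixed slot for each value, independent of parse order


class PenaltyBuilder:
    """Collect values into fixed positions, then freeze them as a tuple."""

    def __init__(self, size=2):
        self._slots = [None] * size

    def set(self, index, value):
        self._slots[index] = value

    def as_tuple(self):
        # The length is always `size`, even if some values were never seen.
        return tuple(self._slots)


builder = PenaltyBuilder()
builder.set(GAP_EXTEND, 1)   # elements may arrive in any order
builder.set(GAP_OPEN, 11)
print(builder.as_tuple())
```

Because each value has a named slot, a missing element shows up as None in a tuple of the expected length rather than as a shorter or misordered tuple.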
Why not use query_len, as defined in the XML file, or query_length
instead of query_letters as a variable name? In
BlastParser._end_Iteration, self._blast.query_letters is set. This is
not defined/documented in the Parameters class in Record.py. Rather,
query_length is defined there. In the Header class, though, the name
query_letters is used. There also seems to be some confusion between
num_letters_in_database, num_sequences_in_database, database_letters,
and database_sequences. Note that even if this naming is not
corrected, NCBIXML.py:186 is wrong with "self._blast_query_letters"
rather than "self._blast.query_letters".
Similarly, why store the bit score and E-value as 'bits' and
'_hsp.expect'/'descr.e' rather than just using bit_score and
evalue, as in the blast XML output?
I make use of <Hsp_align-len> in 2.2.13. This value is missing
entirely.
The parsing of <Hit_id> and <Hit_def> is confusing. For example,
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|0</Hit_id>
<Hit_def>3377250</Hit_def>
...
results in _hit.title set to "gnl|BL_ORD_ID|0 3377250". I would
rather they remain separate (or both methods be used).
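A small sketch of keeping both pieces of information, assuming the hit is described by the Hit_id and Hit_def elements of the BLAST XML output. The function parse_hit and the dictionary keys are illustrative, not the names used in NCBIXML.py.

```python
import xml.etree.ElementTree as ET

# Sample hit using the values from the example above (element names assumed).
HIT_XML = """<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
</Hit>"""


def parse_hit(xml_text):
    """Keep the hit id and definition separate, and also build the joined title."""
    hit = ET.fromstring(xml_text)
    hit_id = hit.findtext("Hit_id")
    hit_def = hit.findtext("Hit_def")
    return {
        "hit_id": hit_id,
        "hit_def": hit_def,
        "title": "%s %s" % (hit_id, hit_def),  # the combined form kept for old scripts
    }


print(parse_hit(HIT_XML))
```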
This is certainly not an exhaustive list. I'm happy to provide
another patch correcting many of these inconsistencies. At the
very least, the variable names defined in Record.py should be
used in NCBIXML.py. May I modify at least the above names to
correspond more closely to the names used in the XML? I know
I've found this particularly confusing.
-Jacob
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:50:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:50:33 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701032150.l03LoXp4028921@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #1 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 -------
Created an attachment (id=533)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=533&action=view)
Patch to NCBIXML.py
These patches to NCBIXML and Record:
* replace query_letters with query_length,
* use tuples for _hsp.identities, positives, and gaps
* store _hsp.align_length
* separate the hit id and hit def elements. title is retained
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:50:53 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:50:53 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701032150.l03Lorvn028958@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #2 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 -------
Created an attachment (id=534)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=534&action=view)
Patch to Record.py
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:53:02 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jan 2007 16:53:02 -0500
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200701032153.l03Lr2Th029085@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
jmjoseph at andrew.cmu.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution| |FIXED
------- Comment #14 from jmjoseph at andrew.cmu.edu 2007-01-03 16:53 -------
Michiel, I have started bug 2176. Thank you for your assistance.
-Jacob
From tiagoantao at gmail.com Fri Jan 5 10:35:59 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Fri, 5 Jan 2007 10:35:59 +0000
Subject: [Biopython-dev] biopython-dev
In-Reply-To:
References:
Message-ID: <6d941f120701050235p437e9283sfad21772401baefa@mail.gmail.com>
Hi Ralph,
Thanks for the info, let me see if I can sum up what I have and what I
am planning to do...
I currently work with microsatellite and SNP data (already isolated
ones, not retrieved from sequences that I have). I have code (parsers,
controllers... it varies from case to case; the quality also varies)
related to GenePop, fdist2, SimCoal2 and Arlequin. I also have
preliminary code to work with HapMap and the UCSC table browser.
I have code implementing some statistics like Fst (Cockerham and Weir),
expected/observed heterozygosity, ...
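As a rough illustration of the two heterozygosity statistics mentioned above (not the submitted Bio.PopGen code), the sketch below assumes diploid genotypes given as pairs of allele labels at a single locus and uses the simple, uncorrected estimator 1 - sum(p_i^2) for the expected value. The function names are hypothetical.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


def expected_heterozygosity(genotypes):
    """1 - sum of squared allele frequencies (no small-sample correction)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Four individuals typed at one microsatellite locus (allele sizes as labels).
genotypes = [(120, 120), (120, 124), (124, 124), (120, 124)]
print(observed_heterozygosity(genotypes), expected_heterozygosity(genotypes))
```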
In the medium term, I will be quite interested in all the sequence
part (Tajima's D, Fu and Li's D, and e.g. the new statistic in the Voight
2006 paper). Also, linkage disequilibrium is very high on my priority
list.
I have been thinking quite a bit about the representation of markers and
populations (especially in a genomic context). E.g., I have noticed
that you use a couple of arrays, one with names, the other with
sequences, to represent population data. I am currently scratching my
head over representation on a genomic scale (i.e., multi-marker, mainly
because of LD). But I think this will come smoothly when I really
start to do LD studies...
This is all in the context of detecting selection, disentangling
selection from population structure and, hopefully in the near future,
coevolution in the context of host/parasite (diseases...).
I have set aside some time to ensure that all the code that I write
can be reused by the community. I plan to build and maintain
this code over the coming years (I am funded until 2010 with a PhD
grant).
Regards,
Tiago
On 1/4/07, Ralph Haygood wrote:
> Tiago,
>
> Yes, I do still read biopython-dev. But at the moment, I have even
> less time than usual, because I'm at a conference. If there's
> something you want to ask me, go ahead, but unless the answer is
> trivial, it may take me several days.
>
> You're right that my stuff is very sequence oriented. In fact, it's
> very alignment oriented. It can analyze simple insertion/deletion as
> well as single-nucleotide variation. Here's a typical use case, to
> give you the flavor:
>
> alignment = phylip_file_to_alignment("sm50PromoterSpurAfra.phy")
> populations = {'Spur': range(20), 'Afra': [20]}
> statistics = Statistics(alignment, populations)
> print "ungapped length: %d" % statistics.ungapped_length()
> print "K SNPs: %d" % statistics.get_K('Spur')
> print "K simple indels: %d" % statistics.get_K_simple_indel('Spur')
> print "theta_W SNPs: %g" % statistics.get_theta_W('Spur')
> print "theta_W simple indels: %g" % statistics.get_theta_W_simple_indel('Spur')
> print "pi SNPs: %g" % statistics.get_pi('Spur')
> print "pi simple indels: %g" % statistics.get_pi_simple_indel('Spur')
> print "D_T SNPs: %g" % statistics.get_D_T('Spur')
> print "D_T simple indels: %g" % statistics.get_D_T_simple_indel('Spur')
> print "D_FL SNPs: %g" % statistics.get_D_FL('Spur', 'Afra')
> print "D_FL simple indels: %g" % statistics.get_D_FL_simple_indel('Spur', 'Afra')
> etc.
>
> Spur is Strongylocentrotus purpuratus and Afra is Allocentrotus
> fragilis, two closely related species of sea urchin. In this example,
> I have 20 sequences of a certain region from Spur and one from Afra,
> so I'm analyzing the population genetics of the region within Spur,
> with Afra as an outgroup for doing things like inferring which allele
> is ancestral at a polymorphism within Spur. K is the number of
> polymorphisms, theta_W is Watterson's estimator of 4 x effective
> population size x neutral mutation rate, pi is the average number of
> pairwise differences between alleles, D_T is Tajima's D, D_FL is Fu
> and Li's D (which requires an outgroup), etc. The software can do
> more elaborate things like permutation tests for assessing whether a
> statistic differs between two alignments, which might be something
> like known transcription factor binding sites versus other nucleotide
> sites in a promoter. The canned software DnaSP can't do that, which
> is one of the reasons why I wrote my stuff.
>
> Ralph
>
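The statistics K, theta_W and pi described in the quoted message can be sketched as below. This is not Ralph's Statistics class: the function names are hypothetical, the input is assumed to be a list of equal-length, ungapped aligned sequences from one population, and K is counted as the number of alignment columns with more than one state.

```python
from itertools import combinations


def segregating_sites(seqs):
    """K: number of alignment columns showing more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """theta_W = K / a_n, with a_n = 1 + 1/2 + ... + 1/(n-1) for n sequences."""
    a_n = sum(1.0 / i for i in range(1, len(seqs)))
    return segregating_sites(seqs) / a_n


def nucleotide_diversity(seqs):
    """pi: average number of pairwise differences between sequences."""
    pairs = list(combinations(seqs, 2))
    differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return differences / len(pairs)


sample = ["ACGT", "ACGA", "ACCA", "ACGT"]
print(segregating_sites(sample), watterson_theta(sample), nucleotide_diversity(sample))
```

Both theta_W and pi are computed here per region rather than per site; dividing by the ungapped length would give per-site values.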
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From bugzilla-daemon at portal.open-bio.org Sun Jan 7 21:25:18 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 7 Jan 2007 16:25:18 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701072125.l07LPIiS032620@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-01-07 16:25 -------
> In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
> defined as (None,None) tuples. However, in NCBIXML.py, these
> variables are set as integers. I don't see a point of a tuple at all,
> at least for NCBIXML. (I realize it is used in NCBIStandalone.py).
> Most importantly, the inconsistency makes it difficult to handle cases
> when the parameter is not set. It seems easiest, though, to just
> retain the tuple format.
I don't see a good reason for a tuple either -- though it may have seemed like
a good idea back in the days that Blast only produced plain-text output.
Instead of making NCBIXML also use a tuple, I'd rather set
HSP.identities|gaps|positives to None instead of (None, None) in Record.py.
This may break some code for people using NCBIStandalone. On the other hand, it
doesn't break Biopython's test suite.
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 15:24:20 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Jan 2007 10:24:20 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701081524.l08FOKFn008935@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 10:24 -------
Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps
- I would like all the parsers NCBIStandalone and NCBIXML (and ideally the HTML
parser too) to return identical record objects.
To do this, we could either:
(a) change NCBIXML to use tuples instead of integers (as suggested by Jacob)
or,
(b) change NCBIStandalone to use simple integers instead of tuples (is this
what you meant in comment 3 Michiel?)
Choice (b) would seem simpler in the long term - but would probably break more
existing code. Also, users of NCBIXML are going to have to update their
scripts anyway after bug 2051, so choice (a) would disrupt fewer people.
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 16:14:36 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Jan 2007 11:14:36 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
Swiss-Prot version (RX and OH lines are broken)
In-Reply-To:
Message-ID: <200701081614.l08GEaMm011511@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2043
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 11:14 -------
Checked in support for line type OH (Organism Host) for viral hosts based
on code from Kristian Rother. These lines were just being ignored.
From bugzilla-daemon at portal.open-bio.org Tue Jan 9 16:10:06 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Jan 2007 11:10:06 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701091610.l09GA6Wm004669@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-01-09 11:10 -------
> Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps
> - I would like all the parsers NCBIStandalone and NCBIXML (and ideally the
> HTML parser too) to return identical record objects.
Even after patch #2090, the NCBIStandalone parser is broken for multiple Blast
records, and will probably be broken for single Blast records also when a new
Blast version comes out. I haven't tried the HTML parser, but I'd be surprised
if it can parse HTML output from recent versions of Blast. So whereas I agree
in principle that the three parsers should return identical record objects, in
practice it's hardly relevant given that two of the three parsers either don't
work or cannot work reliably.
> To do this, we could either:
>
> (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob)
All three of us agree that there's no good reason for tuples. Option (a)
implies copying a bad design choice from a semi-broken parser to a functioning
parser.
> or,
>
> (b) change NCBIStandalone to use simple integers instead of tuples (is this
> what you meant in comment 3 Michiel?)
>
> Choice (b) would seem simpler in the long term - but would probably break more
> existing code. Also, users of NCBIXML are going to have to update their
> scripts anyway after bug 2051, so choice (a) would disrupt fewer people.
Both option (a) and (b) break existing code. So let me suggest option (c):
(c) Don't do anything.
This doesn't break any code. In the near term, people who use both the
plain-text parser and the XML parser will have to deal with differences in the
Blast records the two parsers produce. But how many such people are there
anyway? Most likely, not enough to justify option (a). In the long term,
assuming that both the plain-text parser and the HTML parser will be
deprecated, there will be no more inconsistencies.
My question to Jacob:
Why do you need to use the plain-text Blast parser? Is there something it can
do that the XML parser cannot?
From bugzilla-daemon at portal.open-bio.org Tue Jan 9 17:29:01 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Jan 2007 12:29:01 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701091729.l09HT1Vi009189@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-09 12:29 -------
Tuples/Integers for HSP.identities, HSP.gaps, and HSP.positives
---------------------------------------------------------------
Michiel's option (c) of doing nothing is very pragmatic.
If we go for this, I think we should at least update the record object's
documentation to say it will be a tuple (when used with NCBIStandalone) or an
integer (when used with NCBIXML). Perhaps we should also change the default in
a new record object too...
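Until the record types are unified, a script that has to accept records from either parser could normalise the value itself. The helper below is a hypothetical sketch; it assumes the plain-text parser stores a (count, aligned_length) tuple, with (None, None) as the unset default, while the XML parser stores a plain integer.

```python
def count_of(value):
    """Return the count from either an int or a (count, total) tuple, or None."""
    if value is None:
        return None
    if isinstance(value, tuple):
        return value[0]  # first element assumed to be the count
    return value


# Tuple form (assumed NCBIStandalone style), integer form (NCBIXML style),
# and the unset default from Record.py.
print(count_of((10, 20)), count_of(10), count_of((None, None)))
```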
query_letters versus query_length
---------------------------------
Another of Jacob's suggestions was to rename the record.query_letters (short for
number of letters in query?) to something like query_length (which is closer to
the actual text of query_len used in the XML file). I personally am not
inclined to change this even though it would be slightly clearer.
Note that I have corrected the error on line 186 of NCBIXML.py in CVS - well
spotted Jacob. This mistake was my fault - recently introduced as part of the
changes I made on bug 2051.
From bugzilla-daemon at portal.open-bio.org Wed Jan 10 20:47:59 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Jan 2007 15:47:59 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701102047.l0AKlxX6027453@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-01-10 15:47 -------
> Another of Jacob's suggestions was to rename the record.query_letters (short for
> number of letters in query?) to something like query_length (which is closer to
> the actual text of query_len used in the XML file). I personally am not
> inclined to change this even though it would be slightly clearer.
In principle I agree with Jacob on this one. But as Jacob also indicates, there
are probably more variable names that are less than ideal. So if we change
these variable names, it's better to change all of them at the same time. This,
however, will break a lot of existing code. With all the other changes to the
Blast parsers, now doesn't seem to be the best time for such a change. However,
let's get back to this point once the dust settles with the Blast parsers.
With hit_id, hit_def, and hsp.align_length, I see no problems with Jacob's
suggestion. Objections, anybody?
From bugzilla-daemon at portal.open-bio.org Wed Jan 10 21:05:23 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Jan 2007 16:05:23 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701102105.l0AL5Nht028299@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #8 from jmjoseph at andrew.cmu.edu 2007-01-10 16:05 -------
> My question to Jacob:
> Why do you need to use the plain-text Blast parser? Is there something it can
> do that the XML parser cannot?
I use only the XML parser. My greatest concern is not that the plain-text and
XML parsers are different, but rather that the XML parser is not consistent
with Record.py. An example that I consider completely broken is the definition
of query_length in Record.py, but the use of self._blast.query_letters in
NCBIXML.py.
To avoid breaking the existing plain-text parser code, would it be too
objectionable to use a new class, Record-XML.py, with definitions that exactly
match the usage in NCBIXML.py? Since few people are likely to use both
parsers, and any using the XML parser have required recent code updates anyway,
perhaps this separation would be easiest.
Thanks. -Jacob
From bugzilla-daemon at portal.open-bio.org Thu Jan 11 05:18:17 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 11 Jan 2007 00:18:17 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701110518.l0B5IHuD018624@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-01-11 00:18 -------
> To avoid breaking the existing plain-text parser code, would it be too
> objectionable to use a new class, Record-XML.py, with definitions that exactly
> match the usage in NCBIXML.py?
Go ahead, but don't add it to Biopython ;-).
It would just add to the confusion, without a real benefit to users.
From kosa at genesilico.pl Thu Jan 11 08:52:27 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 09:52:27 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
Message-ID: <45A5FACB.40109@genesilico.pl>
Hi,
Is anyone developing, or planning to develop, an Alignment class in
Biopython as powerful as, for example, SimpleAlign in Bioperl? See,
for instance, the methods available in Bioperl:
http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
The reason I am asking is that I do not know if I should start working
on a more functional subclass of the Biopython Alignment class (I do not
want to go back to Perl ;-)...
Regards,
Janek
From fkauff at duke.edu Thu Jan 11 09:27:25 2007
From: fkauff at duke.edu (Frank)
Date: Thu, 11 Jan 2007 10:27:25 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
In-Reply-To: <45A5FACB.40109@genesilico.pl>
References: <45A5FACB.40109@genesilico.pl>
Message-ID: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
Hi Janek,
The Nexus parser in Biopython (for which I still haven't written any
documentation...) basically holds an alignment, and has some methods
that deal with basic alignment functionality. If you're going to work on
a more sophisticated alignment class, maybe we should try to get the Nexus
class and the alignment class to work smoothly together.
Frank
On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote:
> Hi,
>
> Is anyone going to develop or developing now an Alignment class in
> Biopython as powerful as for example SimpleAlign in Bioperl? Look here
> for instance for methods available in Bioperl
> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
>
> The reason I am asking is that I do not know if I should start working
> on more functional subclass of biopython Alignment class (I do not want
> to come back to Perl ;-)...
>
> Regards,
> Janek
From tiagoantao at gmail.com Thu Jan 11 10:36:35 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 11 Jan 2007 10:36:35 +0000
Subject: [Biopython-dev] PopGen code
Message-ID: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
Hi,
A couple of weeks ago I put code related to population genetics on
bugzilla, namely parsers for GenePop and Fdist files, plus
code to control fdist.
I have had no feedback whatsoever: no comments on the quality of
the code, on whether there is interest in adding it to BioPython in the
future, etc...
I have much more code that I could start converting to BioPython
format, some of which is a bit more complicated to convert (e.g., an
Arlequin format parser). Before I start doing it I would like to know
if there will be any feedback at all or if I am just wasting my
time...
Regards,
Tiago
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From kosa at genesilico.pl Thu Jan 11 11:34:50 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 12:34:50 +0100
Subject: [Biopython-dev] powerful Alignment class in Biopython?
In-Reply-To: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
References: <45A5FACB.40109@genesilico.pl>
<1168507645.2888.3.camel@osiris.biologie.uni-kl.de>
Message-ID: <45A620DA.4000302@genesilico.pl>
Hi,
I have a feeling that it would be better to write all methods similar to
the BioPerl ones directly for the BioPython Alignment class. The main reason is
that this class is not tied to any format like Fasta, Clustal or
Nexus. It stores SeqRecords, which are also not tied to Fasta or any other format.
This would make many things easier. For instance, I can write all my
functions which do something with alignments so that they accept general
Alignment objects (and not necessarily FastaAlignment or
ClustalAlignment objects). Would it not be better to write all the code which
does general things with alignments (column counting, column
selection/removal etc.) so that it works with the general Alignment class
rather than with a class for alignments in one specific file format?
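A minimal sketch of what such format-independent helpers could look like, assuming an alignment is nothing more than a list of equal-length gapped strings. The function names are hypothetical and are not part of the Biopython Alignment class; the code is written for current Python rather than the Python of this thread.

```python
# Hypothetical helpers: an "alignment" here is just a list of
# equal-length gapped strings, not a real Biopython Alignment object.

def get_column(rows, index):
    """Return the characters found in one alignment column."""
    return "".join(row[index] for row in rows)


def gap_only_columns(rows, gap="-"):
    """Return the indices of columns that contain nothing but gaps."""
    length = len(rows[0]) if rows else 0
    return [i for i in range(length)
            if all(row[i] == gap for row in rows)]


def remove_columns(rows, unwanted):
    """Return new rows with the given column indices dropped."""
    unwanted = set(unwanted)
    return ["".join(c for i, c in enumerate(row) if i not in unwanted)
            for row in rows]


rows = ["AC-GT", "A--GT", "AC-G-"]
cleaned = remove_columns(rows, gap_only_columns(rows))
```

Because none of these functions care where the rows came from, the same code would serve alignments read from Fasta, Clustal or Nexus files.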
Janek
Frank wrote:
> Hi Janek,
>
> the Nexus parser in Biopython (for which I still haven't written any
> documentation yet...) basically holds an alignment, and has some methods
> that deal with basic alignment functionality. If you're going to work on
> a more sophisticated alignment class, maybe we should try to get Nexus
> class and alignment class work smoothly together.
>
> Frank
>
>
> On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote:
>
>> Hi,
>>
>> Is anyone going to develop or developing now an Alignment class in
>> Biopython as powerful as for example SimpleAlign in Bioperl? Look here
>> for instance for methods available in Bioperl
>> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html.
>>
>> The reason I am asking is that I do not know if I should start working
>> on more functional subclass of biopython Alignment class (I do not want
>> to come back to Perl ;-)...
>>
>> Regards,
>> Janek
From kosa at genesilico.pl Thu Jan 11 12:11:11 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 13:11:11 +0100
Subject: [Biopython-dev] [BioPython] what to use for working with fasta
sequences and alignments?
In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
<45A50D24.1090906@maubp.freeserve.co.uk>
Message-ID: <45A6295F.5030103@genesilico.pl>
Are you going to fix this in the new SeqIO?
When using Bio.SeqIO.FASTA.FastaReader, the names of the sequences are
truncated at the first space.
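To illustrate the behaviour in question, here is a small sketch of one way a reader could keep the short name and the full title line side by side. The helper name is hypothetical and this is not the FastaReader code itself.

```python
def split_fasta_title(line):
    """Split a FASTA '>' line into (identifier, full title).

    Hypothetical helper: the identifier is the first word, while the
    second value keeps the whole title so nothing after the space is lost.
    """
    title = line[1:].strip() if line.startswith(">") else line.strip()
    identifier = title.split(None, 1)[0] if title else ""
    return identifier, title


name, description = split_fasta_title(">seq1 hypothetical protein [E. coli]")
```

Keeping both values lets code that wants a short key use the identifier while the text after the first space stays available.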
Janek
Peter (BioPython List) wrote:
> Jan Kosinski wrote:
>> Hi,
>>
>> I am quite new in BioPython and I am a little bit confused when
>> trying to use BioPython for working with fasta sequences and alignments.
>>
>> For instance, I can read and parse fasta files with Bio.Fasta, return
>> records (as Fasta.record class), iterate and so on. But then I am
>> going to Bio.Fasta.FastaAlign module which offers FastaAlignment
>> (subclass of Alignment class) class. However, this class has very
>> limited methods and get_all_seqs and get_seq_by_num return SeqRecord
>> object instead of Fasta.record (why??) what makes it hard to use
>> Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta
>> (with Fasta.record) for sequences. Maybe I am wrong but Biopython
>> seems to be full of incompatibilities. Or one should know which
>> modules and classes should not be used?
>>
>> Could you recommend me what should I use for my work with fasta
>> sequences and alignments? Which BioPython modules and classes?
>
> You can use Bio.Fasta to read in files either as Fasta.Record objects,
> or as SeqRecord objects. I would use SeqRecord objects - they are
> more general should you ever want to use a different input file format
> - plus as you have noticed, the alignment object also uses SeqRecord
> objects to hold each (gapped) sequence.
>
> There are other options if you search the code - but Bio.Fasta is the
> best documented and most used.
>
> If you are brave, then you might have a look at the new code in
> Bio.SeqIO which you can get from CVS. This is still in a state of
> flux however... but the Fasta parsing is much faster. See this page
> and the mailing list archives for more:
>
> http://www.biopython.org/wiki/SeqIO
>
> > Or should I use other packages like CoreBio?
>
> You could do - it has the advantage of having started recently from a
> clean slate, and having much less "old code".
>
>> Thank you in advance for any guidelines,
>> Janek Kosinski
>
> Peter
From biopython-dev at maubp.freeserve.co.uk Fri Jan 12 12:36:56 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Jan 2007 12:36:56 +0000
Subject: [Biopython-dev] PopGen code
In-Reply-To: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
Message-ID: <45A780E8.2070803@maubp.freeserve.co.uk>
Tiago Antão wrote:
> Hi,
>
> A couple of weeks ago I have put on bugzilla code related to
> population genetics, namely parsing of GenePop and Fdist files, plus
> code to control fdist.
> I have had no feedback whatsoever, namely comments to the quality of
> the code, if there is interest in adding it in the future to
> BioPython, etc...
>
> I have much more code that I could start converting to BioPython
> format, some of which is a bit more complicated to convert (e.g., an
> Arlequin format parser). Before I start doing it I would like to know
> if there will be any feedback at all or if I am just wasting my
> time...
I suppose I/we would be able to read your code from a general
perspective (coding style, clarity of comments, etc). I haven't made
time for this.
I suspect BioPython currently has no active developers who feel
qualified to interpret your population genetics code. I was hoping that
you and Ralph Haygood would combine forces - if you are both happy with
some code that does bode well. Any comments Michiel?
Regarding population genetic file formats - from a very quick search
about Arlequin it sounds like this file format can hold lots of
different types of data. I would encourage you to try and come up with
a generic population record data object that could hold this or
information from GenePop or Fdist as well. I have no idea how easy this
would be...
Peter
From tiagoantao at gmail.com Fri Jan 12 14:16:53 2007
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Fri, 12 Jan 2007 14:16:53 +0000
Subject: [Biopython-dev] PopGen code
In-Reply-To: <45A780E8.2070803@maubp.freeserve.co.uk>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
<45A780E8.2070803@maubp.freeserve.co.uk>
Message-ID: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
Hi,
Thanks for the answer.
> I suspect BioPython currently has no active developers who feel
> qualified to interpret your population genetics code. I was hoping that
> you and Ralph Haygood would combine forces - if you are both happy with
> some code that does bode well. Any comments Michiel?
I think Ralph (who subscribes to this list, and thus can comment) has
strong time constraints, and will probably have little available time
in the near future...
> Regarding population genetic file formats - from a very quick search
> about Arlequin it sounds like this file format can hold lots of
> different types of data. I would encourage you to try and come up with
> a generic population record data object that could hold this or
> information from GenePop or Fdist as well. I have no idea how easy this
> would be...
I have been thinking a lot about a generic data structure to hold
population genomic (ie not only genetic) data. I have, in fact,
implemented (in CAML, not Python) quite a few different data
representations. I was not happy with any of them. Different kinds of
markers (that sometimes overlap - e.g. sequences and SNPs), linkage
disequilibrium (thus relations between markers...), ploidy (no need to
think of different organisms; think of mitochondria, nuclear chromosomes,
the Y chromosome), ... make a general solution non-trivial.
As I see it, there are a few options:
1. Have a grand, unified structure, but that will take time to mature
2. Assume that there will be different representations for different
scopes, accept that this is a bad thing and live with it
3. Assume that there will be different representations, and that this
is good, in the sense that a one-size-fits-all approach in this case
has lots of problems
I think the pragmatic approach for now is not to have a generic
representation. I would lean more to let things mature (develop
statistics, parsers, ...) and after there is more experience (and,
hopefully, user feedback) then reassess the issue of a general
representation. I am aware that this will entail each part of code
having a different calling data structure, but I think that with care
and common sense that won't be very problematic.
I don't mind having the code on an alpha branch for as long as you see
fit. I just want to be sure that whatever effort I put into converting
my code to BioPython (or writing new code) is not lost; that is why I
would like feedback on what will happen to the code that I am
submitting. I am willing to accommodate any reasonable requirements
regarding code quality and development process...
Regards,
Tiago
--
Good judgment comes from experience.
Experience comes from bad judgment.
- Unknown author
From bsouthey at gmail.com Fri Jan 12 14:30:26 2007
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 12 Jan 2007 08:30:26 -0600
Subject: [Biopython-dev] PopGen code
In-Reply-To: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
<45A780E8.2070803@maubp.freeserve.co.uk>
<6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
Message-ID:
Hi,
While I do have a remote interest in this, I do not have any time to
look at this at present. As I mentioned in a previous email,
John Cole is doing related work in Python, but not as part of BioPython.
It would probably be good to have some unified approach and direction
because of the overlaps that occur, and so that other pieces of code can be
easily added.
Regards
Bruce
On 1/12/07, Tiago Antão wrote:
> Hi,
>
> Thanks for the answer.
>
> > I suspect BioPython currently has no active developers who feel
> > qualified to interpret your population genetics code. I was hoping that
> > you and Ralph Haygood would combine forces - if you are both happy with
> > some code that does bode well. Any comments Michiel?
>
> I think Ralph (who subscribes to this list, and thus can comment) has
> strong time constraints, and will probably have little available time
> in the near future...
>
> > Regarding population genetic file formats - from a very quick search
> > about Arlequin it sounds like this file format can hold lots of
> > different types of data. I would encourage you to try and come up with
> > a generic population record data object that could hold this or
> > information from GenePop or Fdist as well. I have no idea how easy this
> > would be...
>
> I have been thinking a lot about a generic data structure to hold
> population genomic (ie not only genetic) data. I have, in fact,
> implemented (in CAML, not Python) quite a few different data
> representations. I was not happy with any of them. Different kinds of
> markers (that sometimes overlap - eg sequences and SNPs), linkage
> disequilibrium (thus relations between markers...), ploidy (no need to
> think on different organisms, think mitochondria, nuclear chromosomes,
> Y chromosome), ... make a general solution not trivial.
> As I see it, there are a few options:
> 1. Have a grand, unified structure, but that will take time to mature
> 2. Assume that there will be different representations for different
> scopes, assume that that is a bad thing and live with that
> 3. Assume that there will be different representations, and that that
> is good, in the sense that a one size, fits all approach in this case
> has lots of problems
>
> I think the pragmatic approach for now is not to have a generic
> representation. I would lean more to let things mature (develop
> statistics, parsers, ...) and after there is more experience (and,
> hopefully, user feedback) then reassess the issue of a general
> representation. I am aware that this will entail each part of code
> having a different calling data structure, but I think that with care
> and common sense that won't be very problematic.
>
> I don't mind having the code on an alpha branch for as long as you see
> fit, I just want to be sure that whatever effort I put in converting
> (or creating new) my code to BioPython is not lost, that is why I
> would like feedback on what will happen to the code that I am
> submitting. I am willing to accommodate any reasonable requirements
> regarding code quality and development process...
>
> Regards,
> Tiago
>
> --
> Good judgment comes from experience.
> Experience comes from bad judgment.
> - Unknown author
From bugzilla-daemon at portal.open-bio.org Sat Jan 13 00:27:11 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Jan 2007 19:27:11 -0500
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200701130027.l0D0RBlR027978@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2007-01-12 19:27 -------
I've committed the code to handle hit_id, hit_def, and hsp.align_length to CVS.
Let's keep this bug report open for now to remind ourselves to revisit the
issues with variable names at some point.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:04:24 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:04:24 -0500
Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current
Swiss-Prot version (RX and OH lines are broken)
In-Reply-To:
Message-ID: <200701151104.l0FB4OdQ015531@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2043
k_rother at yahoo.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |k_rother at yahoo.de
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:07:33 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:07:33 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151107.l0FB7Xdb015724@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
------- Comment #2 from k_rother at yahoo.de 2007-01-15 06:07 -------
Created an attachment (id=543)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=543&action=view)
new date() method handling new style DT lines
new to bugzilla. don't know whether this is the proper way to commit code.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:17:55 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:17:55 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151117.l0FBHtd1016589@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
------- Comment #3 from k_rother at yahoo.de 2007-01-15 06:17 -------
Created an attachment (id=544)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=544&action=view)
SProt.py that digests all 250,000 Uniprot entries successfully.
also checked the data record contents whether the dates and version numbers of
the first few entries are correct.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:19:59 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 06:19:59 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151119.l0FBJxt9016846@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
k_rother at yahoo.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |k_rother at yahoo.de
Status|NEW |RESOLVED
Resolution| |WORKSFORME
------- Comment #4 from k_rother at yahoo.de 2007-01-15 06:19 -------
i think this should finish the bug unless someone wants to beautify the code.
KR
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 12:26:57 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:26:57 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151226.l0FCQvNE022159@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |biopython-bugzilla at maubp.freeserve.co.uk
Status|RESOLVED |REOPENED
Resolution|WORKSFORME |
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:26 -------
Reopening - it's not fixed until we have updated the code in CVS.
However, I will try and have a look at your code.
By the way - in general developers prefer patches rather than chunks of code or
edited copies of the original.
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 12:51:46 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:51:46 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new
DT lines
In-Reply-To:
Message-ID: <200701151251.l0FCpk8H023891@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1956
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|REOPENED |RESOLVED
Resolution| |FIXED
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:51 -------
I have updated CVS with a slightly modified version of your code Kristian.
See revision 1.36, web version here:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython
It passes the old unit test, test_SProt.py, but if you could double check this
on the latest release that would be great.
Thanks very much.
From biopython-dev at maubp.freeserve.co.uk Mon Jan 15 20:04:34 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Jan 2007 20:04:34 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45A94BFD.5080209@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
Message-ID: <45ABDE52.1030300@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> In my opinion, the new Bio.SeqIO code is a huge improvement to
> Biopython, so I'd be happy to make a new release for it.
>
> ...
>
> For Bio.SeqIO, we're also in pretty good shape, as far as I can tell.
> From what I remember, the remaining issues were
> 1) Which functionality to include, in particular
> a) if functions should accept file names in addition to file handles;
I have decided to follow Michiel's stance on this issue: handles only.
> b) if functions should infer the file format from the file extension,
> the file content, or otherwise.
Right now the file format string is optional and if omitted the file
extension (via handle.name) is used to try and guess.
It would be trivial to remove this functionality and make format a
required argument.
We could at a later date choose to add limited support for format
guessing based on file contents without altering the function parameters
(i.e. the API).
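A rough sketch of the extension-based guessing described above, assuming the handle exposes a name attribute as file objects do. The function name and the extension table are illustrative assumptions, not the actual Bio.SeqIO code.

```python
import os

# Hypothetical extension table; the real mapping is an assumption.
_EXTENSION_TO_FORMAT = {
    ".fasta": "fasta", ".fa": "fasta",
    ".gbk": "genbank", ".gb": "genbank",
    ".aln": "clustal", ".phy": "phylip", ".sth": "stockholm",
}


def guess_format(handle, format=None):
    """Return the explicit format, or guess one from handle.name."""
    if format is not None:
        return format.lower()
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        # In-memory handles, pipes etc. often have no usable name.
        raise ValueError("No format given and the handle has no file name")
    extension = os.path.splitext(name)[1].lower()
    if extension not in _EXTENSION_TO_FORMAT:
        raise ValueError("Cannot guess format from extension %r" % extension)
    return _EXTENSION_TO_FORMAT[extension]
```

Making format a required argument would amount to deleting everything after the first return, which is why removing the feature later would be cheap.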
Both these features would be nice to have (speaking as a user), but then
again, am I prepared to support the headaches they may cause later on?
I'm wavering on this issue (having previously been in favour of
including the format guessing).
Item 1(c) on Michiel's list could have been whether we need the three "helper
functions" which turned a file into a SeqRecord list, dictionary or
alignment.
Again, I have come round to Michiel's view and removed these, as they
were just simple wrappers for list, SequencesToDictionary and
SequencesToAlignment.
> 2) What are the best names for the functions that the user will see.
The good news is that after that little spring clean there are fewer
functions to name - just these four really:
SequenceIterator, once known as FileToSequenceIterator and before that
File2SequenceIterator. Now takes just an input file handle and an
optional file format. Returns a SeqRecord iterator.
SequencesToDictionary - takes SeqRecord iterator or list, plus an
optional function to define the keys, and returns a dictionary.
SequencesToAlignment - takes SeqRecord iterator or list, and returns an
alignment object. Perhaps this functionality should be included in the
alignment class itself...
WriteSequences, once known as SequencesToFile - takes a SeqRecord
iterator or list, an output handle, and a format string. Intended for
use on a whole file at once (i.e. the general case where there may be
headers/footers etc). This does not let you do incremental writes, one
for each record (which would be possible for some formats like GenBank
or fasta).
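As an illustration of the second function in that list, a dictionary builder with an optional key function might look roughly like this. It is a sketch under assumptions: the lower-case name, the namedtuple record and the refusal of duplicate keys are all hypothetical choices, not a description of the code in CVS.

```python
from collections import namedtuple

# Stand-in for SeqRecord, for illustration only.
Record = namedtuple("Record", ["id", "seq"])


def sequences_to_dictionary(records, key_function=None):
    """Index records by key; accepts a list or any iterator."""
    if key_function is None:
        key_function = lambda record: record.id
    index = {}
    for record in records:
        key = key_function(record)
        if key in index:
            # Assumed policy: silently overwriting would hide data loss.
            raise ValueError("Duplicate key %r" % key)
        index[key] = record
    return index


records = [Record("gi|1|alpha", "ACGT"), Record("gi|2|beta", "TTGA")]
by_id = sequences_to_dictionary(iter(records))
by_name = sequences_to_dictionary(records, lambda r: r.id.split("|")[-1])
```

The key function is what lets a caller pick a shorter or more convenient key than the full record identifier.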
Peter
From mdehoon at c2b2.columbia.edu Mon Jan 15 23:00:16 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 15 Jan 2007 18:00:16 -0500
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45ABDE52.1030300@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu>
<45ABDE52.1030300@maubp.freeserve.co.uk>
Message-ID: <45AC0780.9060306@c2b2.columbia.edu>
Peter wrote:
> WriteSequences, once known as SequencesToFile - takes a SeqRecord
> iterator or list, and output handle, and a format string. Intended for
> use on a whole file at once (i.e. the general case where there may be
> headers/footers etc). This does not let you do incremental writes one
> for each record (which would be possible for some formats like GenBank
> or fasta)
At the end of WriteSequences, the file is closed:
    def WriteSequences(sequences, handle, format) :
        ...
        handle.close() #just in case the writer object forgot
Why would it be a problem if the handle is not closed?
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 11:10:21 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 11:10:21 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45AC0780.9060306@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk>
<45AC0780.9060306@c2b2.columbia.edu>
Message-ID: <45ACB29D.2010002@maubp.freeserve.co.uk>
Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>> WriteSequences, once known as SequencesToFile - takes a SeqRecord
>> iterator or list, and output handle, and a format string. Intended for
>> use on a whole file at once (i.e. the general case where there may be
>> headers/footers etc). This does not let you do incremental writes one
>> for each record (which would be possible for some formats like GenBank
>> or fasta)
>
> At the end of WriteSequences, the file is closed:
>
> def WriteSequences(sequences, handle, format) :
> ...
> handle.close() #just in case the writer object forgot
>
> Why would it be a problem if the handle is not closed?
OK, I've fixed that.
That issue was on my mind too - in particular it would stop Bio.SeqIO
from creating concatenated phylip alignments which are used in
bootstrapping. Reading this sort of file is a different issue, which I
am also currently thinking about.
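A small sketch of why leaving the handle open matters, using fasta rather than phylip to keep it short. The writer below is a hypothetical stand-in, not the Bio.SeqIO implementation; records are assumed to be plain (name, sequence) pairs.

```python
import io


def write_fasta(records, handle):
    """Write (name, sequence) pairs; the caller keeps ownership of handle."""
    count = 0
    for name, sequence in records:
        handle.write(">%s\n%s\n" % (name, sequence))
        count += 1
    return count  # deliberately no handle.close() here


handle = io.StringIO()
write_fasta([("a", "ACGT")], handle)
# A second write to the same handle only works because the first
# call did not close it.
write_fasta([("b", "TTGA")], handle)
text = handle.getvalue()
```

The same pattern applies to concatenating several alignments into one file: each call appends to whatever the caller has already written.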
Peter
From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 12:48:49 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 12:48:49 +0000
Subject: [Biopython-dev] Bio.SeqIO - Output
In-Reply-To: <45ACB29D.2010002@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk> <45AC0780.9060306@c2b2.columbia.edu>
<45ACB29D.2010002@maubp.freeserve.co.uk>
Message-ID: <45ACC9B1.6010709@maubp.freeserve.co.uk>
I've been thinking about sequence output (i.e. writing sequence files),
and have come to the conclusion that my writer classes in
Bio/SeqIO/Interfaces.py are probably too complicated.
My current Bio.SeqIO output implementation tries to be very flexible -
if you look beyond the top level function WriteSequences (aka
SequencesToFile) then the individual writer classes have a confusing
range of capabilities.
New Idea
========
I was thinking that we should only support two cases for sequence output:
(*) simple sequential file formats
- record by record, or file at once
- can use a SeqRecord iterator (or a list)
(*) all other file formats
- file at once only
- probably needs a list of SeqRecords (not an iterator)
For the sequential file formats such as fasta, genbank and swiss there
are no headers or footers - and a single sequence alone would be a valid
file.
For all other file formats (e.g. clustal, stockholm, phylip, anything in
XML, ...) we would only offer the "file at once" option.
When implementing a writer for a new file format, you just have to
implement a "write file" function or a "write record" function which
takes the record(s) and a handle. The implementation details are up to you.
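The two cases could be sketched as follows. All names here are hypothetical, records are plain (name, sequence) pairs, and the "counted" format is a toy invented for this example to stand for any format whose header depends on the full set of records.

```python
import io


def write_fasta_record(record, handle):
    """'Write record' style: one record at a time, no header or footer."""
    name, sequence = record
    handle.write(">%s\n%s\n" % (name, sequence))


def write_counted_file(records, handle):
    """'Write file' style: the header needs the record count up front."""
    length = len(records[0][1]) if records else 0
    handle.write("%d %d\n" % (len(records), length))
    for name, sequence in records:
        handle.write("%s %s\n" % (name, sequence))


RECORD_WRITERS = {"fasta": write_fasta_record}
FILE_WRITERS = {"counted": write_counted_file}


def write_sequences(records, handle, format):
    """Dispatch to a per-record or a whole-file writer."""
    if format in RECORD_WRITERS:
        count = 0
        for record in records:          # an iterator is enough here
            RECORD_WRITERS[format](record, handle)
            count += 1
        return count
    if format in FILE_WRITERS:
        records = list(records)         # whole-file writers need a list
        FILE_WRITERS[format](records, handle)
        return len(records)
    raise ValueError("Unknown format %r" % format)


data = [("a", "ACGT"), ("b", "TTGA")]
fasta_out, counted_out = io.StringIO(), io.StringIO()
write_sequences(iter(data), fasta_out, "fasta")
write_sequences(iter(data), counted_out, "counted")
```

Someone adding a new format would then only write one small function and register it in the appropriate table.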
Drawbacks
=========
There are some sequential file formats where, under the scheme above,
you would be forced to write the file in one go...
However, I can only think of one irrelevant example, so this may not
matter. Can anyone suggest some other examples? Some sort of simple
tabular file with a header row maybe?
For example simple Stockholm files (if you ignore the PFAM style
annotation) have a generic header, followed by sequential records and a
generic footer.
The point here is that the header does not contain anything about the
records which will follow it. e.g. The number of records, or if they
are protein or nucleotides.
For files like this it would be possible to write the file record by
record given an iterator - provided you also write the header and footer.
Right now this is the only file format I can think of that has this
property - and I don't currently even support this (instead like BioPerl
I create Stockholm files with PFAM style annotations).
Stockholm files with PFAM style annotation do not qualify, because the
header contains the number of records. Similarly for non-interlaced PHYLIP.
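For the simple Stockholm case, writing record by record from an iterator might look like the sketch below. It assumes (name, sequence) pairs and emits only the fixed "# STOCKHOLM 1.0" header line and the "//" terminator, with no PFAM style annotation; the function name is hypothetical.

```python
import io


def write_simple_stockholm(records, handle):
    """Stream records between a fixed header and a fixed footer."""
    handle.write("# STOCKHOLM 1.0\n")   # header says nothing about the records
    count = 0
    for name, sequence in records:      # consumed one at a time
        handle.write("%s %s\n" % (name, sequence))
        count += 1
    handle.write("//\n")                # footer closes the alignment
    return count


def generate_records():
    """A generator, to show that no list is ever built."""
    yield ("seq1", "AC-GT")
    yield ("seq2", "ACCGT")


output = io.StringIO()
written = write_simple_stockholm(generate_records(), output)
```

Adding a record count to the header would break this, since the count is only known once the iterator is exhausted.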
Peter
From mdehoon at c2b2.columbia.edu Tue Jan 16 17:51:23 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:51:23 -0500
Subject: [Biopython-dev] [BioPython] Next release plans;
was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
<45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD109B.4030009@c2b2.columbia.edu>
Peter wrote:
> Regarding the fix checked in on bug 1970 I still would prefer we call
> the new XML iterator NCBIXML.Iterator(handle) rather than
> NCBIXML.parse(handle) but I'll live ;)
>
I chose "parse" because it is used in the old (Biopython release 1.42)
Blast XML parser:
Old:
>>> from Bio.Blast import NCBIXML
>>> b_parser = NCBIXML.BlastParser()
>>> b_record = b_parser.parse(blast_out)
New:
>>> from Bio.Blast import NCBIXML
>>> b_records = NCBIXML.parse(blast_out)
>>> b_record = b_records.next() # Repeat to get subsequent Blast records
While I am not dead set on "parse", it agrees with similar functions
in Python:
1) Function name is a verb, not a noun
2) Function name describes what the function does, not what the function
returns
3) Function names are short, and start with a lower case letter.
For example, to read a file line-by-line in Python:
>>> inputfile = open("somefunnyfile")
# "open"; not "Iterator", nor "FileToLineIterator",
# even though "open" returns an iterator:
>>> for line in inputfile:
... print line
To read an image file with the Python Imaging Library:
>>> import Image
>>> im = Image.open("lena.ppm")
# "open"; not "Image", nor "FileNameToImage".
To read a Python object from a pickled file:
>>> import pickle
>>> inputfile = open("somepickledfile")
>>> myobject = pickle.load(inputfile)
# "load"; not "FileToObject".
>>> inputfile.close()
To parse an XML file with the sax parser framework in Python:
>>> from xml.sax.handler import ContentHandler
>>> from xml import sax
>>> handler = SomeSubclassOfContentHandler()
>>> inputfile = open("myxmlfile.xml")
>>> sax.parse(inputfile, handler)
# "parse", same as in the new Bio.Blast.NCBIXML
>>> inputfile.close()
So, for Bio.Blast.NCBIXML, good names would be "load", "read", "parse",
or something similar. "Iterator" would not be consistent; besides, until
recently I didn't know what an iterator was, so I doubt that new users
would know.
What we could do is to have two functions in Bio.Blast.NCBIXML, perhaps
one called "read" and the other "iterate", where the former returns a
single Blast record (for an XML file containing only one Blast result),
and the latter an iterator over multiple Blast records.
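A rough sketch of how such a pair of functions could relate to each other, with "read" built on top of "iterate". This is not the Bio.Blast.NCBIXML code: the toy "record" here is simply a non-blank line, and the error handling is an assumption about what would be sensible.

```python
import io


def iterate(handle):
    """Yield records one at a time (toy format: one record per non-blank line)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle):
    """Return the only record in the handle; fail if there are none or several."""
    records = iterate(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        return first  # exactly one record, as expected
    raise ValueError("More than one record found in handle")


single = read(io.StringIO("hit1\n"))
many = list(iterate(io.StringIO("hit1\nhit2\n")))
```

With this split, a user who knows the file holds a single result never has to deal with an iterator at all, while "iterate" stays available for multi-record files.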
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From mdehoon at c2b2.columbia.edu Tue Jan 16 17:47:17 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:47:17 -0500
Subject: [Biopython-dev] [BioPython] Next release plans;
was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
<45A94BFD.5080209@c2b2.columbia.edu>
<45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD0FA5.3060008@c2b2.columbia.edu>
Peter wrote:
> In general, I agree that the Blast XML parser in CVS looks in good shape
> - but we really need to update the documentation for using Blast for the
> next release.
>
Yeah I know, I've been holding off on updating the documentation so it
is consistent with the latest Biopython release 1.42. I'll update it
together with the next release.
--Michiel.
From bugzilla-daemon at portal.open-bio.org Sat Jan 27 23:01:07 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Jan 2007 18:01:07 -0500
Subject: [Biopython-dev] [Bug 1963] Adding __str__ method to codon tables
and translators
In-Reply-To:
Message-ID: <200701272301.l0RN17l6026463@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1963
------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-01-27 18:01 -------
> Question One:
> Is this worth adding to BioPython or not?
Yes, definitely.
> Question Two:
> What is the preferred behaviour for ambiguous tables? Just a 4x4x4 table as
> for the unambiguous tables? Or the full 15x15x15 table? I have implemented
> both (see commented out code)
My feeling is that 15x15x15 would become too large to be clearly visible on the
screen. So I'd prefer 4x4x4, maybe with a reminder printed at the end as to
what each ambiguous codon may represent.
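One possible shape for the 4x4x4 layout with a reminder at the end, as a minimal sketch. It does not use the Bio.Data.CodonTable classes: the standard genetic code is hard-coded, "format_table" is a hypothetical name, and the reminder simply lists the IUPAC ambiguity letters rather than every ambiguous codon.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each unambiguous codon to its one-letter amino acid.
FORWARD = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# IUPAC nucleotide ambiguity codes, used only for the reminder line.
AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def format_table(forward=FORWARD, reminder=True):
    """Return a 4x4x4 text table: first base on the left, second base
    across the top, third base on the right."""
    lines = ["    " + "  ".join(f"{b2:<6}" for b2 in BASES)]
    for b1 in BASES:
        for b3 in BASES:
            cells = [f"{b1 + b2 + b3} {forward[b1 + b2 + b3]}" for b2 in BASES]
            lines.append(f"{b1}   " + "   ".join(cells) + f"   {b3}")
    if reminder:
        codes = ", ".join(f"{k}={v}" for k, v in sorted(AMBIGUOUS.items()))
        lines.append("")
        lines.append("Ambiguity codes: " + codes)
    return "\n".join(lines)


text = format_table()
```

Printing "text" gives sixteen rows of four codons each, which stays readable on a normal terminal, unlike a full 15x15x15 layout.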
> Question Three:
> Is there a standard BioPython function to convert from one letter amino acid
> sequences into three letter names? i.e. like one_to_three from
> Bio.PDB.Polypeptide but more general. That function does not cope with
> ambiguous names.
There is the function seq3 in Bio/SeqUtils. If it is not complete, it can be
extended easily, and seems to be a better place for this general function than
Bio/PDB/Polypeptide.
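As an illustration of what a more general conversion could look like, here is a small self-contained sketch. It is not the seq3 implementation from Bio/SeqUtils: the name "one_to_three_general" and the choice of "Xaa" for unrecognised letters are assumptions made for this example.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    # Ambiguous and special codes
    "B": "Asx",  # Asp or Asn
    "Z": "Glx",  # Glu or Gln
    "X": "Xaa",  # any or unknown residue
    "*": "Ter",  # translation stop
}


def one_to_three_general(sequence, unknown="Xaa"):
    """Convert a one-letter protein string to concatenated three-letter codes.

    Letters missing from the table are replaced by 'unknown' instead of
    raising an error (an assumption of this sketch).
    """
    return "".join(THREE_LETTER.get(letter, unknown) for letter in sequence.upper())


converted = one_to_three_general("MKB*")
```

Keeping the mapping in a plain dictionary makes it easy to extend with further codes if the existing function turns out to be incomplete.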