From bugzilla-daemon at portal.open-bio.org Sun Oct 1 02:10:37 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 1 Oct 2006 02:10:37 -0400
Subject: [Biopython-dev] [Bug 1939] Doc/Makefile does not build pdf, html,
txt files completely correctly
In-Reply-To:
Message-ID: <200610010610.k916Ab3S003487@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=1939
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #19 from mdehoon at ims.u-tokyo.ac.jp 2006-10-01 02:10 -------
I've taken bits and pieces of the patch to get the recursive behavior for make.
Getting html output from biopdb_faq is not essential, and does not warrant
adding a hack to Biopython.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chris.lasher at gmail.com Mon Oct 9 22:55:17 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Mon, 9 Oct 2006 22:55:17 -0400
Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster
Message-ID: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com>
I just checked out the latest CVS and setup.py failed on installation
during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't
be found. Any suggestions? Was this file supposed to be in the CVS?
The checkout notes indicate that it has been replaced with something
else.
Thanks,
Chris
From mdehoon at c2b2.columbia.edu Tue Oct 10 00:10:22 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 10 Oct 2006 00:10:22 -0400
Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster
In-Reply-To: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com>
References: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com>
Message-ID: <452B1D2E.4060703@c2b2.columbia.edu>
There was some confusion about the license status of ranlib, so I
removed it from Bio.Cluster and replaced it with a new random number
generator written from scratch. Apparently I forgot to update setup.py
in CVS accordingly. I have done that now, so if you get the new setup.py
from CVS the compilation should work. You could also edit your local
copy of setup.py and remove ranlib.c, linpack.c, and com.c.
Sorry for the confusion.
--Michiel.
Chris Lasher wrote:
> I just checked out the latest CVS and setup.py failed on installation
> during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't
> be found. Any suggestions? Was this file supposed to be in the CVS?
> The checkout notes indicate that it has been replaced with something
> else.
>
> Thanks,
> Chris
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From chris.lasher at gmail.com Tue Oct 10 00:46:29 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 10 Oct 2006 00:46:29 -0400
Subject: [Biopython-dev] Subversion Repository
Message-ID: <128a885f0610092146y5a184ccfw31d433d228a9b05d@mail.gmail.com>
Anybody know if BioPython (I suppose all Open Bio projects) will
switch over to Subversion, and if so, when? I think the merits and
advantages of Subversion over CVS speak for themselves. It's certainly
become my revision control system of preference. Anybody else's?
Curious,
Chris
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:14:23 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:14:23 -0400
Subject: [Biopython-dev] [Bug 2014] Bio/Blast/NCBIStandalone.py parsing of
psiblast fails
In-Reply-To:
Message-ID: <200610190414.k9J4ENQm025952@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2014
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:14 -------
Fixed in CVS, thanks.
Note though that the parser for plain-text blast output is very difficult to
maintain, because the output format keeps changing with different versions of
blast. I'd encourage you to use the XML parser instead, as it is much more
stable.
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:21:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:21:40 -0400
Subject: [Biopython-dev] [Bug 2032] query_to and sbjct_to added in parsed
NCBI-Blast XML
In-Reply-To:
Message-ID: <200610190421.k9J4LeGH026582@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2032
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:21 -------
Fixed in CVS following same bug report on the mailing list.
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 00:44:57 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:44:57 -0400
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200610190444.k9J4ivGv028629@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |mdehoon at ims.u-tokyo.ac.jp
------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:44 -------
A new blast version (2.2.15) came out recently, so I tried the XML parser with
its output for a multiple query. I didn't notice a problem except that all
alignments are put into one list, which is annoying because then we have to
find out which alignment corresponds to which query.
So, which specific problem with the XML parser are you trying to solve? And do
these problems still occur with blast 2.2.15? (as far as I can tell, its XML
output is the same as for blast 2.2.14, so it's probably here to stay).
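To make the per-query grouping issue concrete, here is a minimal sketch (Python 3, standard library only) that keeps hits separated by query. It does not use the Biopython parser. The sample document is invented, and the element names (`Iteration`, `Iteration_query-def`, `Hit_def`) are an assumption about the NCBI BLAST XML layout that should be checked against real output.

```python
import xml.etree.ElementTree as ET

# Invented two-query sample; element names assume the NCBI BLAST XML layout.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_one</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject_A</Hit_def></Hit>
        <Hit><Hit_def>subject_B</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_two</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_def>subject_C</Hit_def></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hits_by_query(xml_text):
    """Return a dict mapping each query definition to its hit definitions."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hits = [hit.findtext("Hit_def") for hit in iteration.iter("Hit")]
        grouped.setdefault(query, []).extend(hits)
    return grouped


grouped = hits_by_query(SAMPLE_XML)
for query, hits in grouped.items():
    print(query, hits)
```

Keying the result on the query definition avoids having to work out afterwards which alignment belongs to which query.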
From chris.lasher at gmail.com Tue Oct 24 22:22:06 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 24 Oct 2006 22:22:06 -0400
Subject: [Biopython-dev] [BioPython] Martel-based parsing of Unigene
flat files
In-Reply-To: <453975D5.4070701@mail.nih.gov>
References: <453975D5.4070701@mail.nih.gov>
Message-ID: <128a885f0610241922h5db02fbfod1a83cfeade29801@mail.gmail.com>
Hi Sean,
FWIW this should probably have been posted to BioPython-dev, but I
don't think that would improve your chances of getting a response. I
am cross-posting it there, anyways. Unfortunately for you, I do not
have an answer for you. :-(
I, myself, would be interested in a response to this question from the
Devs, as I would like to write a parser for PTT files. Last I saw
there was a lot of chatter about the Martel parsers being incredibly
slow compared to straightforward solutions. It seems that standard
format parsers would be one of the easiest ways for BioPython newbies
to contribute to developing the BioPython project, however, there
isn't very much in the way of documentation on the BioPython way to do
so, let alone developer documentation at all. I would like to know
what can be done to get some dev docs going on the wiki.
Chris
On 10/20/06, Sean Davis wrote:
> I am relatively new to python and biopython (coming from perl side of
> things). I would like to make a parser for Unigene flat file format.
> However, after digging through the LocusLink parsing code (as probably
> the most similar format, etc.), I'm still at a loss for how Martel-based
> parsing works. I understand the big picture (converting an re-based
> parsing of a file into events), but it is the detail that I am missing.
> I know about pydoc, but the pydoc for much of Martel is not very helpful
> to me, at least not in my current state of knowledge. Any suggestions
> on how to get started?
>
> Thanks,
> Sean
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
From sdavis2 at mail.nih.gov Thu Oct 26 08:09:43 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:09:43 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
Message-ID: <200610260809.43245.sdavis2@mail.nih.gov>
Let me start off by saying that I am a python newbie after working in perl for
the last few years. I am working on a Unigene flat file parser. In my
scanner, I have a construct that looks like:
for line in handle:
    tag = line.split(' ')[0]
    line = line.rstrip()
    if tag=='ID':
        consumer.ID(line)
    if tag=='GENE':
        consumer.GENE(line)
    if tag=='TITLE':
        consumer.TITLE(line)
    if tag=='EXPRESS':
        consumer.EXPRESS(line)
    ....
Since I am setting things up so that there is a 1:1 correspondence between the
"tag" and the consumer method, is there an easy way to reduce this long set
of IF statements to a simple mapping procedure that maps a tag to the correct
method?
Sorry for the naive question....
Thanks,
Sean
From sdavis2 at mail.nih.gov Thu Oct 26 08:30:08 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:30:08 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene
parser
In-Reply-To: <200610260809.43245.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
Message-ID: <200610260830.08326.sdavis2@mail.nih.gov>
On Thursday 26 October 2006 08:09, Sean Davis wrote:
> Let me start off by saying that I am a python newbie after working in perl
> for the last few years. I am working on a Unigene flat file parser. In my
> scanner, I have a construct that looks like:
>
> for line in handle:
>     tag = line.split(' ')[0]
>     line = line.rstrip()
>     if tag=='ID':
>         consumer.ID(line)
>     if tag=='GENE':
>         consumer.GENE(line)
>     if tag=='TITLE':
>         consumer.TITLE(line)
>     if tag=='EXPRESS':
>         consumer.EXPRESS(line)
>     ....
>
> Since I am setting things up so that there is a 1:1 correspondence between
> the "tag" and the consumer method, is there an easy way to reduce this long
> set of IF statements to a simple mapping procedure that maps a tag to the
> correct method?
>
> Sorry for the naive question....
Even more apologies. I answered my own question. Something like this seems
to work:
exec('consumer.'+tag+'(line)')
which replaces all the IF statements quite nicely.
Sean
From james.balhoff at duke.edu Thu Oct 26 09:46:34 2006
From: james.balhoff at duke.edu (Jim Balhoff)
Date: Thu, 26 Oct 2006 09:46:34 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene
parser
In-Reply-To: <200610260830.08326.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
<200610260830.08326.sdavis2@mail.nih.gov>
Message-ID:
Hi Sean,
On Oct 26, 2006, at 8:30 AM, Sean Davis wrote:
> On Thursday 26 October 2006 08:09, Sean Davis wrote:
>> Let me start off by saying that I am a python newbie after working
>> in perl
>> for the last few years. I am working on a Unigene flat file
>> parser. In my
>> scanner, I have a construct that looks like:
>>
>> for line in handle:
>>     tag = line.split(' ')[0]
>>     line = line.rstrip()
>>     if tag=='ID':
>>         consumer.ID(line)
>>     if tag=='GENE':
>>         consumer.GENE(line)
>>     if tag=='TITLE':
>>         consumer.TITLE(line)
>>     if tag=='EXPRESS':
>>         consumer.EXPRESS(line)
>>     ....
>>
>> Since I am setting things up so that there is a 1:1 correspondence
>> between
>> the "tag" and the consumer method, is there an easy way to reduce
>> this long
>> set of IF statements to a simple mapping procedure that maps a tag
>> to the
>> correct method?
>>
>> Sorry for the naive question....
>
> Even more apologies. I answered my own question. Something like
> this seems
> to work:
>
> exec('consumer.'+tag+'(line)')
>
> which replaces all the IF statements quite nicely.
Alternatively, you may want to look at getattr(). There is a good
description here:
Jim
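A minimal sketch of the getattr() approach, written in Python 3 syntax. The `Consumer` class, the `feed` function, and the sample lines are hypothetical stand-ins for the scanner quoted above, and the upper-case check on the tag is an added assumption about which tags are valid.

```python
class Consumer:
    """Hypothetical stand-in for the consumer used by the scanner."""

    def __init__(self):
        self.seen = []

    def ID(self, line):
        self.seen.append(("ID", line))

    def GENE(self, line):
        self.seen.append(("GENE", line))

    def TITLE(self, line):
        self.seen.append(("TITLE", line))


def feed(handle, consumer):
    """Dispatch each line to the consumer method named after its tag."""
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        tag = line.split(" ")[0]
        # Only plain upper-case tags are dispatched, so names such as
        # "__init__" can never be reached through the input file.
        if not (tag.isalpha() and tag.isupper()):
            continue
        # Look the method up by name; an unknown tag gives None rather
        # than being evaluated as code, which is the risk with exec().
        method = getattr(consumer, tag, None)
        if callable(method):
            method(line)


consumer = Consumer()
feed(["ID Hs.1", "GENE ABC1", "UNKNOWN ignored", "TITLE Example gene"], consumer)
print(consumer.seen)
```

Unlike the exec() version, a malformed or unexpected tag is skipped here instead of being executed as Python source.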
From sdavis2 at mail.nih.gov Thu Oct 26 10:56:25 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 10:56:25 -0400
Subject: [Biopython-dev] Unigene flat file parser
Message-ID: <200610261056.25883.sdavis2@mail.nih.gov>
I have put together a parser for the Unigene flat file format described here:
ftp://ftp.ncbi.nih.gov/repository/UniGene/README
under the Hs.data section. The actual .data files are included in the various
organism-specific directories.
Is there any interest in including this in biopython? If so, I would
appreciate some input on the code and details of contributions, etc. The
current code is available here:
http://watson.nci.nih.gov/pressa/~sdavis/Unigene.py
Use it like so, and note that the ugrecord holds much more information (in fact,
all information is captured) than is shown by its __repr__.
#!/usr/bin/python
import Unigene
fh = file('Hs.data') #downloaded previously from ftp, or whatever
ugparser = Unigene.Iterator(fh,Unigene.RecordParser())
for ugrecord in ugparser:
    print ugrecord
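For readers unfamiliar with this kind of flat file, the following sketch (Python 3) shows one way a record iterator of this sort might work. It is not the code in Unigene.py. It assumes that each line starts with a tag followed by whitespace and a value, and that records end with a line containing only "//". The sample data is invented.

```python
import io

# Invented sample in the assumed "TAG value" layout with "//" terminators.
SAMPLE = """\
ID          Hs.0001
TITLE       Invented example gene
GENE        EXG1
//
ID          Hs.0002
TITLE       Another invented gene
//
"""


def iter_records(handle):
    """Yield one dict per record, mapping each tag to a list of its values."""
    record = {}
    for line in handle:
        line = line.rstrip()
        if line == "//":  # assumed end-of-record marker
            if record:
                yield record
            record = {}
        elif line:
            tag, _, value = line.partition(" ")
            record.setdefault(tag, []).append(value.strip())
    if record:  # tolerate a missing final terminator
        yield record


records = list(iter_records(io.StringIO(SAMPLE)))
for record in records:
    print(record["ID"][0], record["TITLE"][0])
```

Because the iterator yields one record at a time, concatenated files for several organisms can be read without holding everything in memory.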
From mdehoon at c2b2.columbia.edu Thu Oct 26 14:01:24 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 26 Oct 2006 14:01:24 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <200610261056.25883.sdavis2@mail.nih.gov>
References: <200610261056.25883.sdavis2@mail.nih.gov>
Message-ID: <4540F7F4.2050003@c2b2.columbia.edu>
Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:
Perhaps a silly question from a non-Unigene user, but what is the
relation between your parser and the one in Bio/UniGene/__init__.py? The
latter seems to parse HTML files (see the example in
Tests/test_unigene.py) instead of flat files. Is your parser intended as
a replacement for Bio/UniGene/__init__.py?
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From sdavis2 at mail.nih.gov Thu Oct 26 16:15:52 2006
From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E])
Date: Thu, 26 Oct 2006 16:15:52 -0400
Subject: [Biopython-dev] Unigene flat file parser
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
Message-ID: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
Michiel,
It looks to me like it parses an HTML file downloaded from the NCBI website containing a single unigene record of interest--potentially useful if one knows what one needs.
I, on the other hand, have always just used the flat files as the source for unigene, as I typically want ALL the data for one or several species available. A single flat file is available for each organism and contains ALL the unigene entries and their associated information for that organism. By concatenating several files (they are simple text files), one can parse the entire unigene database.
So, in short, I don't see this unigene parser as a replacement for the current module. They fill different needs; this one fills a need that I have and is useful for whole-genome, multiple-species work or microarray analyses. Whether and where it fits into biopython is really up to the community.
Just a quick comment on speed for the parser--it parses Hs.data (the largest flat file in unigene, 84,000 entries, with just under 7,000,000 sequence entries, 150 Mb file size) in just under 5 minutes on my Xeon desktop.
Sean
-----Original Message-----
From: Michiel Jan Laurens de Hoon [mailto:mdehoon at c2b2.columbia.edu]
Sent: Thu 10/26/2006 2:01 PM
To: Davis, Sean (NIH/NCI) [E]
Cc: biopython-dev at lists.open-bio.org
Subject: Re: [Biopython-dev] Unigene flat file parser
Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:
Perhaps a silly question from a non-Unigene user, but what is the
relation between your parser and the one in Bio/UniGene/__init__.py? The
latter seems to parse HTML files (see the example in
Tests/test_unigene.py) instead of flat files. Is your parser intended as
a replacement for Bio/UniGene/__init__.py?
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From biopython-dev at maubp.freeserve.co.uk Fri Oct 27 15:08:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 27 Oct 2006 20:08:21 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
Message-ID: <45425925.8090607@maubp.freeserve.co.uk>
Hello list,
I've checked in a somewhat cleaned up (and more tested) version of the
earlier attachments to bug 2059.
And I've updated the wiki page:
http://biopython.org/wiki/SeqIO
Has anyone got any tips on formatting python code on Wiki? Maybe I
should just write the docs in LaTeX like the cook book etc.
Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
objects, it would be a good idea to make them slightly more user-friendly:
http://bugzilla.open-bio.org/show_bug.cgi?id=2057
(I would like to check this in before writing too much of the SeqIO
documentation)
If any of you want to check this out and have a look, I'd be pleased to
get some feedback.
There should be no impact on the rest of BioPython, or existing scripts.
Peter
-----------------------------------------------------------------
Link to view CVS,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython
Old files, not touched:
Bio/SeqIO/FASTA.py
Bio/SeqIO/generic.py
Bio/SeqIO/__init__.py (replaces almost empty old file)
======================
* the helper functions (i.e. the functions I expect people to use)
* mappings from file types to parsers and writers
* mappings from file extensions to file types
* large self test suite (which does not need any input files, but will
create a temp file in the current directory)
Bio/SeqIO/Interfaces.py
=======================
Base classes for readers/writers
Bio/SeqIO/FastaIO.py
====================
Uses a generator function for the reader.
Uses a sub-class of SequentialSequenceWriter for the writer.
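As an illustration of the generator-function style (not the actual FastaIO code), here is a minimal sketch in Python 3. The `Record` namedtuple is a hypothetical stand-in for SeqRecord, and taking the first word of the title line as the id is an assumption.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord: only an id and a sequence string.
Record = namedtuple("Record", ["id", "seq"])


def fasta_iterator(handle):
    """Yield a Record for each '>' entry read from an open handle."""
    record_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield Record(record_id, "".join(chunks))
            # The id is taken to be the first word of the title line.
            words = line[1:].split()
            record_id = words[0] if words else ""
            chunks = []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield Record(record_id, "".join(chunks))


handle = io.StringIO(">alpha first\nACGT\nAC\n>beta\nGGTT\n")
records = list(fasta_iterator(handle))
for record in records:
    print(record.id, record.seq)
```

A generator keeps only one record in memory at a time, which matters for large sequence files.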
Bio/SeqIO/ClustalIO.py
======================
Uses a generator function for the reader, based on the old class in
Bio/SeqIO/generic.py
Bio/SeqIO/PhylipIO.py
=====================
Reads and writes phylip files with names strictly truncated at 10
characters.
Uses a generator function for the reader, subclasses SequenceWriter
Bio/SeqIO/StockholmIO.py
========================
Uses subclasses from Interfaces.py
Unlike prior code attached to bug 2059, this code contains just one
writer and parser, which expects the Stockholm file to follow the PFAM
conventions. It should read other files fine - but what happens to the
annotation is less well defined. This is what BioPerl does:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c10
Bio/SeqIO/GenBankIO.py
======================
Uses a generator function for the reader, which just calls Bio.GenBank
to do the work. See also bug 2059 comment 11 on my thoughts about how
to include EMBL support:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c11
Bio/SeqIO/NexusIO.py
====================
Uses a generator function for the reader, which just calls Bio.Nexus to
do the parsing and then extracts the sequences. Has not been tested much.
Peter
From mdehoon at c2b2.columbia.edu Sat Oct 28 01:40:02 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 28 Oct 2006 01:40:02 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
<014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
Message-ID: <4542ED32.8060702@c2b2.columbia.edu>
OK, that's fine then.
Is anybody actually using the current Bio/UniGene stuff? I couldn't find
documentation for it and it hasn't been updated in more than two years,
so it may be some dead code sitting around. If so, we can remove this
code; Bio/UniGene would be a nice place to put Sean's code (even though
it is doing something different from the current Bio/UniGene).
--Michiel.
Davis, Sean (NIH/NCI) [E] wrote:
> So, in short, I don't see this unigene parser as a replacement for
> the current module. They fill different needs; this one fills a need
> that I have and is useful for whole-genome, multiple species work, or
> microarray analyses and whether and where it fits into biopython is
> really up to the community.
>
> Michiel wrote:
>> Perhaps a silly question from a non-Unigene user, but what is the
>> relation between your parser and the one in
>> Bio/UniGene/__init__.py? The latter seems to parse HTML files (see
>> the example in Tests/test_unigene.py) instead of flat files. Is
>> your parser intended as a replacement for Bio/UniGene/__init__.py?
From mdehoon at c2b2.columbia.edu Sat Oct 28 01:56:51 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 28 Oct 2006 01:56:51 -0400
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4542F123.9050106@c2b2.columbia.edu>
Thanks, Peter!
It looks very nice. Actually, I have been using an earlier version of
the new SeqIO module (from your code on Bugzilla) and found it to work
quite well. A few short comments:
To parse a Fasta file using the new SeqIO looks like this:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.fasta") :
    print record.id
    print record.seq
I would rather have something like this:
from Bio.SeqIO import Fasta
for record in Fasta.parse(open("example.fasta")):
    print record.id
    print record.seq
where Fasta.parse returns a FastaIterator object, and the argument is
either a file object or a file name. You can in addition have a function
Bio.SeqIO.parse that guesses the file type from the file name extension
(as you have now for File2SequenceIterator), though that wouldn't work
for file handles.
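A sketch of extension-based format guessing, in Python 3. The mapping table and the name `guess_format` are illustrative assumptions and are not taken from Bio/SeqIO/__init__.py.

```python
import os

# Illustrative mapping only; the real table in Bio.SeqIO may differ.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".aln": "clustal",
    ".phy": "phylip",
    ".nex": "nexus",
    ".nexus": "nexus",
    ".sth": "stockholm",
    ".gbk": "genbank",
}


def guess_format(filename):
    """Return a format name based on the file extension, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_FORMAT.get(extension)


for name in ["example.fasta", "EXAMPLE.NEX", "nexus_seqs.txt"]:
    print(name, guess_format(name))
```

Returning None for an unknown extension leaves the caller free to ask for an explicit format, which is also what a bare file handle would require.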
On a related note, I don't think we need the SequenceList and
SequenceDict class. To make a list, one can do
from Bio.SeqIO import Fasta
records = [record for record in Fasta.parse(open("example.fasta"))]
To convert an iterator to a dictionary takes one line more, and is
probably more straightforward than SequenceDict.
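A sketch of that conversion, in Python 3. The `Record` namedtuple is a hypothetical stand-in for SeqRecord, and the duplicate-id check is an added design choice rather than something proposed in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records):
    """Index records by id, refusing to overwrite duplicate ids silently."""
    record_dict = {}
    for record in records:
        if record.id in record_dict:
            raise ValueError("Duplicate record id: %s" % record.id)
        record_dict[record.id] = record
    return record_dict


records = iter([Record("alpha", "ACGT"), Record("beta", "GGTT")])
record_dict = records_to_dict(records)
print(record_dict["beta"].seq)
```

Without the duplicate check, the whole function collapses to `dict((record.id, record) for record in records)`.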
--Michiel.
Peter (BioPython Dev) wrote:
> Hello list,
>
> I've checked in a somewhat cleaned up (and more tested) version of the
> earlier attachments to bug 2059.
>
> And I've updated the wiki page:
> http://biopython.org/wiki/SeqIO
>
> Has anyone got any tips on formatting python code on Wiki? Maybe I
> should just write the docs in LaTeX like the cook book etc.
>
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)
>
> If any of you want to check this out and have a look, I'd be pleased to
> get some feedback.
From biopython-dev at maubp.freeserve.co.uk Sat Oct 28 07:59:13 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Oct 2006 12:59:13 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4542F123.9050106@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>
<4542F123.9050106@c2b2.columbia.edu>
Message-ID: <45434611.1040708@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of
> the new SeqIO module (from your code on Bugzilla) and found it to work
> quite well.
Thank you - and good to hear the (old version) is working OK.
> A few short comments:
>
> To parse a Fasta file using the new SeqIO looks like this:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta") :
>     print record.id
>     print record.seq
>
> I would rather have something like this:
>
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>     print record.id
>     print record.seq
>
> where Fasta.parse returns a FastaIterator object, and the argument is
> either a file object or a file name.
I think you have raised two issues - file names/handles (discussed
below), and the use of a generic function versus a format specific one
(or at least the naming conventions).
I like the idea of a generic function File2SequenceIterator() which can
be used on lots of different file formats, just by changing the
arguments. However, there is nothing to stop you using the underlying
format specific iterators directly:
from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
    print record.id
    print record.seq
(which is similar to your suggestion above)
As long as you don't need to use any file format specific options, then
for every file format the style of the code is the same - but switching
file formats takes a little more work:
from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
    print record.id
    print record.seq
versus:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus") :
    print record.id
    print record.seq
or, to give an example where the file extension is no use and the format
must be explicitly stated:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
    print record.id
    print record.seq
I expect the "helper functions" like File2SequenceIterator() to be used
for the simple cases where the user does not care about the minor
options we might offer for individual file formats (this would cover
beginners).
They are also nice for writing multiple file format test cases ;)
I see later in your email you suggested a generic Bio.SeqIO.parse(file)
function which would cope with multiple file formats. Was your point
more about what we call things?
I'm happy to go from File2SequenceIterator() to something like
SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() -
with matching versions like SeqList() and SeqDict()
However, I'm not so keen on "parse()" because it gives no clue as to
what it will return.
---
On the other point, filenames/handles. Right now, the individual
iterators only take a handle. This was a simplification I made to make
my life as straightforward as possible.
The File2SequenceIterator() function (and friends) can take a filename,
handle, or a string containing the contents of a file (in addition to
the format). However, these are done as three separate arguments.
I could have one argument that takes a file name or handle, and works it
out on its own. Bio.Nexus tries to do this for example. Having the
individual iterators also do this trick would be pretty simple (using a
shared utility function).
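A minimal sketch of such a shared utility, in Python 3. The names `as_handle` and `count_lines` are hypothetical, and how Bio.Nexus actually does this is not shown here.

```python
import io
import os
import tempfile


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a file name or an already-open handle."""
    if isinstance(source, str):
        return open(source, mode), True
    if hasattr(source, "read"):
        return source, False
    raise TypeError("Expected a file name or a file-like object")


def count_lines(source):
    """Example reader that accepts either a file name or a handle."""
    handle, opened_here = as_handle(source)
    try:
        return sum(1 for _ in handle)
    finally:
        # Only close handles this function opened; leave the caller's alone.
        if opened_here:
            handle.close()


# A handle built from a string, then a real file addressed by name.
from_handle = count_lines(io.StringIO(">a\nACGT\n"))

with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">a\nACGT\n>b\nGG\n")
from_name = count_lines(tmp.name)
os.remove(tmp.name)

print(from_handle, from_name)
```

Tracking whether the utility opened the file itself keeps it from closing a handle that the caller still intends to use.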
The "contents of a file" string argument was handy when testing, but I
imagine this is not going to be a common situation. If people need
this, they can use python's StringIO module to turn their data string
into a handle easily enough.
> You can in addition have a function
> Bio.SeqIO.parse that guesses the file type from the file name extension
> (as you have now for File2SequenceIterator), though that wouldn't work
> for file handles.
When dealing with a file handle, converting it to an undo file handle
would probably work - if we had code to guess the file format. I have
tried to raise a syntax error when a parser is given an invalid file -
which would mean we could just try some common file formats in order
until one works without a syntax error.
But I felt this was not needed right away, so I put it off.
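For illustration only, here is a sketch of the "try each parser until one accepts the input" idea in Python 3. The two toy parsers and the name `guess_and_parse` are hypothetical. Instead of an undo file handle, this version simply reads the whole input into a string so that each parser can start from the beginning.

```python
import io


def parse_fasta(text):
    """Toy parser: raise SyntaxError unless the text looks like FASTA."""
    if not text.startswith(">"):
        raise SyntaxError("FASTA input must start with '>'")
    return [line[1:].strip() for line in text.splitlines()
            if line.startswith(">")]


def parse_clustal(text):
    """Toy parser: raise SyntaxError unless the text has a CLUSTAL header."""
    if not text.startswith("CLUSTAL"):
        raise SyntaxError("Clustal input must start with 'CLUSTAL'")
    ids = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) == 2 and fields[0] not in ids:
            ids.append(fields[0])
    return ids


# Order matters: the first parser that accepts the input wins.
CANDIDATES = [("fasta", parse_fasta), ("clustal", parse_clustal)]


def guess_and_parse(handle):
    """Read the handle once, then try each candidate parser in turn."""
    text = handle.read()
    for name, parser in CANDIDATES:
        try:
            return name, parser(text)
        except SyntaxError:
            continue
    raise ValueError("Could not determine the file format")


fmt, ids = guess_and_parse(
    io.StringIO("CLUSTAL W (1.83)\n\nalpha ACGT\nbeta  AC-T\n"))
print(fmt, ids)
```

This only works if every parser rejects foreign input promptly, which is why raising a syntax error on invalid files matters.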
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do
>
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]
Currently that would be written:
from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]
Or even just the following, which I find simpler:
from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))
Versus the alternatives:
from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")
from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")
> To convert an iterator to a dictionary takes one line more, and is
> probably more straightforward than SequenceDict.
That was one thing I wanted to discuss - having a SequenceDict and
SequenceList class would let us add doc strings and perhaps methods like
maxlength, minlength, totallength, ...
Or, I can just use simple list and dict objects in the functions
File2SequenceList and File2SequenceDict.
I have no strong preference on this issue - so unless someone else
speaks up, I'll go back to simple lists and dictionaries - keeps things
simple.
Peter
From sdavis2 at mail.nih.gov Sat Oct 28 12:47:03 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sat, 28 Oct 2006 12:47:03 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <4542ED32.8060702@c2b2.columbia.edu>
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
<014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
<4542ED32.8060702@c2b2.columbia.edu>
Message-ID: <45438987.1070403@mail.nih.gov>
Michiel de Hoon wrote:
> OK, that's fine then.
>
> Is anybody actually using the current Bio/UniGene stuff? I couldn't
> find documentation for it and it hasn't been updated in more than two
> years, so it may be some dead code sitting around. If so, we can
> remove this code; Bio/UniGene would be a nice place to put Sean's code
> (even though it is doing something different from the current
> Bio/UniGene).
I haven't looked into it much, but for dynamic queries of individual
Unigene entries, it seems that Eutils might be the better way to go,
anyway.
Sean
From mdehoon at c2b2.columbia.edu Sun Oct 29 01:09:14 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 01:09:14 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45434611.1040708@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
<4542F123.9050106@c2b2.columbia.edu>
<45434611.1040708@maubp.freeserve.co.uk>
Message-ID: <4544458A.5000102@c2b2.columbia.edu>
Well let's first decide which functions we want in Bio.SeqIO, and then
decide how to name them.
I'm fine with the idea of having a function that can guess the file
format from the extension. I also agree that a parser that can guess the
file format from the file contents is not needed at this point.
> That was one thing I wanted to discuss - having a SequenceDict and
> SequenceList class would let us add doc strings and perhaps methods
> like maxlength, minlength, totallength, ...
>
> Or, I can just use simple list and dict objects in the functions
> File2SequenceList and File2SequenceDict.
>
> I have no strong preference on this issue - so unless someone else
> speaks up, I'll go back to simple lists and dictionaries - keeps
> things simple.
If we go back to simple lists and dictionaries, do we still need the
functions File2SequenceList and File2SequenceDict? I'd like to avoid
software bloat as much as possible, so if we don't need these two
functions, so much the better.
About file handles:
> The File2SequenceIterator() function (and friends) can take a
> filename, handle, or a string containing the contents of a file (in
> addition to the format). However, these are done as three separate
> arguments.
>
> I could have one argument that takes a file name or handle, and works
> it out on its own. Bio.Nexus tries to do this for example. Having
> the individual iterators also do this trick would be pretty simple
> (using a shared utility function).
>
> The "contents of a file" string argument was handy when testing, but I
> imagine this is not going to be a common situation. If people need
> this, they can use python's StringIO module to turn their data string
> into a handle easily enough.
I like the idea of one argument that takes a file name or handle. I
believe that that is how other Biopython functions work.
--Michiel.
Peter wrote:
> Michiel de Hoon wrote:
>> Thanks, Peter!
>> It looks very nice. Actually, I have been using an earlier version of
>> the new SeqIO module (from your code on Bugzilla) and found it to work
>> quite well.
>
> Thank you - and good to hear the (old version) is working OK.
>
>> A few short comments:
>>
>> To parse a Fasta file using the new SeqIO looks like this:
>>
>> from Bio.SeqIO import File2SequenceIterator
>> for record in File2SequenceIterator("example.fasta") :
>>     print record.id
>>     print record.seq
>>
>> I would rather have something like this:
>>
>> from Bio.SeqIO import Fasta
>> for record in Fasta.parse(open("example.fasta")):
>>     print record.id
>>     print record.seq
>>
>> where Fasta.parse returns a FastaIterator object, and the argument is
>> either a file object or a file name.
>
> I think you have raised two issues - file names/handles (discussed
> below), and the use of a generic function versus a format specific one
> (or at least the naming conventions).
>
> I like the idea of a generic function File2SequenceIterator() which can
> be used on lots of different file formats, just by changing the
> arguments. However, there is nothing to stop you using the underlying
> format specific iterators directly:
>
> from Bio.SeqIO.FastaIO import FastaIterator
> for record in FastaIterator(open("example.fasta")):
>     print record.id
>     print record.seq
>
> (which is similar to your suggestion above)
>
> As long as you don't need to use any file format specific options, then
> for every file format the style of the code is the same - but switching
> file formats takes a little more work:
>
> from Bio.SeqIO.NexusIO import NexusIterator
> for record in NexusIterator(open("example.nexus")):
>     print record.id
>     print record.seq
>
> versus:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.nexus") :
>     print record.id
>     print record.seq
>
> or, to give an example where the file extension is no use and the format
> must be explicitly stated:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
>     print record.id
>     print record.seq
>
> I expect the "helper functions" like File2SequenceIterator() to be used
> for the simple cases where the user does not care about the minor
> options we might offer for individual file formats (this would cover
> beginners).
>
> They are also nice for writing multiple file format test cases ;)
>
> I see later in your email you suggested a generic Bio.SeqIO.parse(file)
> function which would cope with multiple file formats. Was your point
> more about what we call things?
>
> I'm happy to go from File2SequenceIterator() to something like
> SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() -
> with matching versions like SeqList() and SeqDict()
>
> However, I'm not so keen on "parse()" because it gives no clue as to
> what it will return.
>
> ---
>
> On the other point, filenames/handles. Right now, the individual
> iterators only take a handle. This was a simplification I made to make
> my life as straightforward as possible.
>
> The File2SequenceIterator() function (and friends) can take a filename,
> handle, or a string containing the contents of a file (in addition to
> the format). However, these are done as three separate arguments.
>
> I could have one argument that takes a file name or handle, and works it
> out on its own. Bio.Nexus tries to do this for example. Having the
> individual iterators also do this trick would be pretty simple (using a
> shared utility function).
>
> The "contents of a file" string argument was handy when testing, but I
> imagine this is not going to be a common situation. If people need
> this, they can use python's StringIO module to turn their data string
> into a handle easily enough.
>
>> You can in addition have a function
>> Bio.SeqIO.parse that guesses the file type from the file name
>> extension (as you have now for File2SequenceIterator), though that
>> wouldn't work for file handles.
>
> When dealing with a file handle, converting it to an undo file handle
> would probably work - if we had code to guess the file format. I have
> tried to raise a syntax error when a parser is given an invalid file -
> which would mean we could just try some common file formats in order
> until one works without a syntax error.
>
> But I felt this was not needed right away, so I put it off.
>
>> On a related note, I don't think we need the SequenceList and
>> SequenceDict class. To make a list, one can do
>>
>> from Bio.SeqIO import Fasta
>> records = [record for record in Fasta.parse(open("example.fasta"))]
>
> Currently that would be written:
>
> from Bio.SeqIO.FastaIO import FastaIterator
> records = [record for record in FastaIterator(open("example.fasta"))]
>
> Or even just the following, which I find simpler:
>
> from Bio.SeqIO.FastaIO import FastaIterator
> records = list(FastaIterator(open("example.fasta")))
>
> Versus the alternatives:
>
> from Bio.SeqIO import File2SequenceList
> records = File2SequenceList("example.fasta")
>
> from Bio.SeqIO import File2SequenceDict
> record_dict = File2SequenceDict("example.fasta")
>
>> To convert an iterator to a dictionary takes one line more, and is
>> probably more straightforward than SequenceDict.
>
> That was one thing I wanted to discuss - having a SequenceDict and
> SequenceList class would let us add doc strings and perhaps methods like
> maxlength, minlength, totallength, ...
>
> Or, I can just use simple list and dict objects in the functions
> File2SequenceList and File2SequenceDict.
>
> I have no strong preference on this issue - so unless someone else
> speaks up, I'll go back to simple lists and dictionaries - keeps things
> simple.
>
> Peter
>
From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 06:25:35 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 29 Oct 2006 11:25:35 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4544458A.5000102@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk>
<4544458A.5000102@c2b2.columbia.edu>
Message-ID: <45448FAF.1090104@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> Well let's first decide which functions we want in Bio.SeqIO, and then
> decide how to name them.
Agreed.
One point against names like File2SequenceIterator is that the pun on
"two" versus "to" (i.e. convert) will not be so obvious to non-native
English speakers.
> > That was one thing I wanted to discuss - having a SequenceDict and
> > SequenceList class would let us add doc strings and perhaps methods
> > like maxlength, minlength, totallength, ...
> >
> > Or, I can just use simple list and dict objects in the functions
> > File2SequenceList and File2SequenceDict.
> >
> > I have no strong preference on this issue - so unless someone else
> > speaks up, I'll go back to simple lists and dictionaries - keeps
> > things simple.
>
> If we go back to simple lists and dictionaries, do we still need the
> functions File2SequenceList and File2SequenceDict? I'd like to avoid
> software bloat as much as possible, so if we don't need these two
> functions, so much the better.
I think there is some benefit to having File2SequenceDict included as
converting from a SeqRecord iterator to a dictionary of SeqRecords isn't
completely trivial.
There are at least two important questions: What to use as the
dictionary key (e.g. record.id) and how to deal with duplicate keys
(e.g. use first/last record with that id, or simply abort).
Consider this line of code as an alternative to File2SequenceDict:
iterator = File2SequenceList(...)
d = dict([record.id, record] for record in iterator)
I don't think it's very readable or intuitive (and it could scare
beginners). Part of my aim with Bio.SeqIO was to make the interface simple.
More importantly, if there are records with duplicate ids then with this
code the resulting dictionary will have only the last record.
Personally I would want duplicate keys to cause an exception.
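The silent overwrite is easy to demonstrate. The snippet below is only an illustration: Record is a made-up stand-in for SeqRecord, and the ids and sequences are invented.

```python
from collections import namedtuple

# Stand-in for SeqRecord, with just the two attributes used in this thread.
Record = namedtuple("Record", ["id", "seq"])

records = [
    Record("a", "ACGT"),
    Record("b", "GGCC"),
    Record("a", "TTTT"),  # same id as the first record
]

# The one-liner keeps whichever record with a given id comes last.
d = dict((record.id, record) for record in records)

print(len(d))      # 2, not 3
print(d["a"].seq)  # TTTT - the first "a" record has been dropped
```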
Rewriting File2SequenceDict() to use a simple dict would give something
like this, where record2key is an optional user supplied function.
def File2SequenceDict(..., record2key=None) :
    iterator = File2SequenceIterator(...)
    if record2key is None : record2key = lambda record : record.id
    answer = dict()
    for record in iterator :
        key = record2key(record)
        assert key not in answer, "Duplicate key"
        answer[key] = record
    return answer
The record2key function is perhaps not needed - I was trying to make the
function flexible. The duplicate key behaviour could also be an option.
The other function, File2SequenceList isn't really needed if we are
using simple lists. It's basically a wrapper for
list(File2SequenceIterator(...)) or some other one liner.
The main reason I invented File2SequenceList() was for completeness -
given I already had File2SequenceDict() and File2SequenceIterator().
> About file handles:
>
> > The File2SequenceIterator() function (and friends) can take a
> > filename, handle, or a string containing the contents of a file (in
> > addition to the format). However, these are done as three separate
> > arguments.
> >
> > I could have one argument that takes a file name or handle, and works
> > it out on its own. Bio.Nexus tries to do this for example. Having
> > the individual iterators also do this trick would be pretty simple
> > (using a shared utility function).
> >
> > The "contents of a file" string argument was handy when testing, but I
> > imagine this is not going to be a common situation. If people need
> > this, they can use python's StringIO module to turn their data string
> > into a handle easily enough.
>
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.
OK then - I'll do that.
Peter
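The single "file name or handle" argument agreed on above could be handled by a small shared helper along these lines. This is a sketch only: as_handle and read_lines are hypothetical names, not functions from Bio.SeqIO or Bio.Nexus, and it is written for current Python, where StringIO lives in the io module.

```python
import io


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a file name or an already-open handle."""
    if isinstance(source, str):
        return open(source, mode), True
    # Anything else is assumed to be a file-like object owned by the caller.
    return source, False


def read_lines(source):
    """Read all lines from a file name or handle, closing only what we opened."""
    handle, opened_here = as_handle(source)
    try:
        return [line.rstrip("\n") for line in handle]
    finally:
        if opened_here:
            handle.close()


# A data string can still be used by wrapping it in a handle first.
lines = read_lines(io.StringIO(">seq1\nACGT\n"))
```

The helper closes only the handles it opened itself, so a caller who passes in an open handle keeps control of it.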
From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 19:13:57 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 00:13:57 +0000
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
Message-ID: <454543C5.1080209@maubp.freeserve.co.uk>
Hello all,
I've been looking at writing multiple sequence alignments in Nexus
format for the new Bio.SeqIO code, and came up with the following little
problem:
Given one or more Seq objects, how can I reliably decide if they are
protein, DNA, or RNA?
(These are the relevant choices in a Nexus file's format datatype=...
header.)
I'm resigned to the fact that if the Seq object has the generic alphabet
this boils down to looking at the sequence strings and making an
educated guess (probably following an established algorithm from an
alignment program). Does any such code already exist in BioPython?
However - is there a nice/official way to ask an alphabet object what it
is (protein, DNA, RNA)?
Looking over the code in Bio.Alphabet the only thing I can think of is
to get the class name as a string and search it(!) We can't look at the
letters property as this is None for the base classes like ProteinAlphabet.
If we are prepared to meddle with the alphabet system we might add
attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
base classes. Or simply have a "sequence_type" method, which the
subclasses can re-define as required.
(I wasn't meaning to reopen the whole "do we need alphabets"
conversation last discussed in July 2006. At least, not yet...)
Peter
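The "sequence_type" method idea might be sketched as below. The classes are self-contained stand-ins that only borrow the flavour of the Bio.Alphabet names; they are not the real Biopython classes, and nexus_datatype is a hypothetical helper.

```python
class Alphabet:
    """Stand-in for a generic alphabet: the sequence type is unknown."""

    def sequence_type(self):
        return None


class ProteinAlphabet(Alphabet):
    def sequence_type(self):
        return "protein"


class NucleotideAlphabet(Alphabet):
    def sequence_type(self):
        return "nucleotide"


class DNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "dna"


class RNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "rna"


def nexus_datatype(alphabet):
    """Map an alphabet to a Nexus datatype, refusing to guess if it is unclear."""
    sequence_type = alphabet.sequence_type()
    if sequence_type not in ("protein", "dna", "rna"):
        raise ValueError("Cannot choose a Nexus datatype for %r" % (sequence_type,))
    return sequence_type
```

Subclasses inherit the answer of their parent unless they override it, which avoids searching class names as strings.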
From fkauff at duke.edu Sun Oct 29 19:48:39 2006
From: fkauff at duke.edu (Frank)
Date: Sun, 29 Oct 2006 19:48:39 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <1162169319.12941.5.camel@cpe-071-077-002-012.nc.res.rr.com>
Hi all,
On Mon, 2006-10-30 at 00:13 +0000, Peter (BioPython Dev) wrote:
> Hello all,
>
> I've been looking at writing multiple sequence alignments in Nexus
> format for the new Bio.SeqIO code, and came up with the following little
> problem:
>
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?
>
I'm not aware of any such code - however, an educated guess would be
easy (more or less ACGTNX only, ACGUNX only, everything else...?). With
NEXUS it becomes tricky, as a dataset could potentially be partitioned
into a mix of all types. And there is no "official" way to indicate this
in the datatype= option.
Frank
> However - is there a nice/official way to ask an alphabet object what it
> is (protein, DNA, RNA)?
>
> Looking over the code in Bio.Alphabet the only thing I can think of is
> to get the class name as a string and search it(!) We can't look at the
> letters property as this is None for the base classes like ProteinAlphabet.
>
> If we are prepared to meddle with the alphabet system we might add
> attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
> base classes. Or simply have a "sequence_type" method, which the
> subclasses can re-define as required.
>
> (I wasn't meaning to reopen the whole "do we need alphabets"
> conversation last discussed in July 2006. At least, not yet...)
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
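Frank's rule of thumb (ACGTNX only, ACGUNX only, everything else) could be written as the sketch below. The function name and the choice to ignore "-", "?" and "." are assumptions made for the illustration. It is a heuristic: a short peptide made up only of A, C, G and T would be reported as DNA.

```python
def guess_sequence_type(sequence):
    """Guess "dna", "rna" or "protein" from the letters used in a sequence."""
    # Assumption: gap and missing-data characters carry no information.
    letters = set(sequence.upper()) - set("-?.")
    if not letters:
        raise ValueError("No letters to guess from")
    if letters <= set("ACGTNX"):
        return "dna"
    if letters <= set("ACGUNX"):
        return "rna"
    return "protein"
```

A sequence with neither T nor U (for example "ACGN") is reported as DNA, because the DNA test runs first.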
From mdehoon at c2b2.columbia.edu Sun Oct 29 22:20:48 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:20:48 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <45456F90.1090005@c2b2.columbia.edu>
Peter (BioPython Dev) wrote:
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?
Something similar exists in Bio.Seq in the complement,
reverse_complement methods of Seq objects, but it only distinguishes
between DNA and RNA. I don't know of any official way to do that in
Biopython.
--Michiel.
From mdehoon at c2b2.columbia.edu Sun Oct 29 22:42:44 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:42:44 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45448FAF.1090104@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk>
<4544458A.5000102@c2b2.columbia.edu>
<45448FAF.1090104@maubp.freeserve.co.uk>
Message-ID: <454574B4.3050407@c2b2.columbia.edu>
Peter wrote:
> There are at least two important questions: What to use as the
> dictionary key (e.g. record.id) and how to deal with duplicate keys
> (e.g. use first/last record with that id, or simply abort).
>
> Rewriting File2SequenceDict() to use a simple dict would give something
> like this, where record2key is an optional user supplied function.
>
> def File2SequenceDict(..., record2key=None) :
>     iterator = File2SequenceIterator(...)
>     if record2key is None : record2key = lambda record : record.id
>     answer = dict()
>     for record in iterator :
>         key = record2key(record)
>         assert key not in answer, "Duplicate key"
>         answer[key] = record
>     return answer
>
> The record2key function is perhaps not needed - I was trying to make the
> function flexible. The duplicate key behaviour could also be an option.
>
I am using File2SequenceIterator in one of my scripts (thanks by the way
for that, my script is a lot faster now. I didn't do a rigorous timing,
but it's about a zillion times faster), and convert the iterator to a
dictionary using plain Python. If I were to use File2SequenceDict
instead, I would need the record2key argument, because in my application
I want only part of record.id as the key.
In the File2SequenceDict above, answer[key] contains the complete
record. Some people will want that. However, in my application I only
want to store the record.seq part in answer[key]. Somebody else may want
str(record.seq). So we'd also need a record2value argument.
For duplicate keys, there are at least four possibilities (raise an
exception, store only one of the keys, store neither of the keys and
don't raise an exception, store both after modifying one of the keys).
So this should also be an option.
You'll end up with a File2SequenceDict function that is more complicated
than the plain Python solution.
--Michiel.
From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 05:54:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 10:54:41 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454574B4.3050407@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk>
<454574B4.3050407@c2b2.columbia.edu>
Message-ID: <4545D9F1.2040902@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...
I've updated the new code in Bio.SeqIO to remove SequenceDict and
SequenceList and use the standard dictionary and list instead.
Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way
> for that, my script is a lot faster now. I didn't do a rigorous timing,
> but it's about a zillion times faster), and convert the iterator to a
> dictionary using plain Python. If I were to use File2SequenceDict
> instead, I would need the record2key argument, because in my application
> I want only part of record.id as the key.
With such a speed up, I'd guess you were using Bio.Fasta before. I've
noticed the same thing. Are you dealing with NCBI style fasta
identifiers made up of several fields separated by "|" characters?
> In the File2SequenceDict above, answer[key] contains the complete
> record. Some people will want that. However, in my application I only
> want to store the record.seq part in answer[key]. Somebody else may want
> str(record.seq). So we'd also need a record2value argument.
It does slightly undermine the "you only get SeqRecord objects"
principle. On the other hand, it's a simple addition that is easy to
explain and implement. I'm happy to add this.
> For duplicate keys, there are at least four possibilities (raise an
> exception, store only one of the keys, store neither of the keys and
> don't raise an exception, store both after modifying one of the keys).
> So this should also be an option.
Supporting all these options with an easy to understand interface looks
too hard.
In my opinion if someone is trying to build a dictionary using repeated
keys they have made a mistake (either in their datafile, or their
record2key function) - so raising an exception is a reasonable default
behaviour (and is easy to code).
Apart from the "exception" option, which of these actions do you
generally find most appropriate?
> You'll end up with a File2SequenceDict function that is more complicated
> than the plain Python solution.
Yes. Trying to do everything would be bad - complicated to implement,
and probably complicated to use as well.
Peter
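With record2key and record2value both optional and duplicates always raising an exception, the function discussed above might look like this sketch. It takes any iterable of records rather than a file, Record is a made-up stand-in for SeqRecord, records_to_dict is a hypothetical name, and the "|"-separated ids are invented for the example.

```python
from collections import namedtuple

# Stand-in for SeqRecord, with just the two attributes used in this thread.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, record2key=None, record2value=None):
    """Build a dict from records, raising ValueError on a duplicate key."""
    if record2key is None:
        def record2key(record):
            return record.id
    if record2value is None:
        def record2value(record):
            return record
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record2value(record)
    return answer


records = [Record("gi|1|ref|X1|", "ACGT"), Record("gi|2|ref|X2|", "GGCC")]

# Use the second "|"-separated field as the key and keep only the sequence.
by_gi = records_to_dict(
    records,
    record2key=lambda record: record.id.split("|")[1],
    record2value=lambda record: record.seq,
)
```

Raising ValueError rather than using assert means the check still runs when Python is started with -O, which strips assert statements.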
From mdehoon at c2b2.columbia.edu Mon Oct 30 17:02:34 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 30 Oct 2006 17:02:34 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4546767A.70302@c2b2.columbia.edu>
Peter (BioPython Dev) wrote:
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)
Looks good to me. Thanks!
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From mdehoon at c2b2.columbia.edu Tue Oct 10 04:10:22 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 10 Oct 2006 00:10:22 -0400
Subject: [Biopython-dev] ranlib.c missing in Bio/Cluster
In-Reply-To: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com>
References: <128a885f0610091955k63527ebcke165ede0f0afce3e@mail.gmail.com>
Message-ID: <452B1D2E.4060703@c2b2.columbia.edu>
There was some confusion about the license status of ranlib, so I
removed it from Bio.Cluster and replaced it with a new random number
generator written from scratch. Apparently I forgot to update setup.py
in CVS accordingly. I have done that now, so if you get the new setup.py
from CVS the compilation should work. You could also edit your local
copy of setup.py and remove ranlib.c, linpack.c, and com.c.
Sorry for the confusion.
--Michiel.
Chris Lasher wrote:
> I just checked out the latest CVS and setup.py failed on installation
> during gcc compilation of ranlib.o since Bio/Cluster/ranlib.c couldn't
> be found. Any suggestions? Was this file supposed to be in the CVS?
> The checkout notes indicate that it has been replaced with something
> else.
>
> Thanks,
> Chris
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From chris.lasher at gmail.com Tue Oct 10 04:46:29 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 10 Oct 2006 00:46:29 -0400
Subject: [Biopython-dev] Subversion Repository
Message-ID: <128a885f0610092146y5a184ccfw31d433d228a9b05d@mail.gmail.com>
Anybody know if BioPython (I suppose all Open Bio projects) will
switch over to Subversion, and if so, when? I think the merits and
advantages of Subversion over CVS speak for themselves. It's certainly
become my revision control system of preference. Anybody else's?
Curious,
Chris
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:14:23 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:14:23 -0400
Subject: [Biopython-dev] [Bug 2014] Bio/Blast/NCBIStandalone.py parsing of
psiblast fails
In-Reply-To:
Message-ID: <200610190414.k9J4ENQm025952@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2014
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:14 -------
Fixed in CVS, thanks.
Note though that the parser for plain-text blast output is very difficult to
maintain, because the output format keeps changing with different versions of
blast. I'd encourage you to use the XML parser instead, as it is much more
stable.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:21:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:21:40 -0400
Subject: [Biopython-dev] [Bug 2032] query_to and sbjct_to added in parsed
NCBI-Blast XML
In-Reply-To:
Message-ID: <200610190421.k9J4LeGH026582@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2032
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:21 -------
Fixed in CVS following same bug report on the mailing list.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Oct 19 04:44:57 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 19 Oct 2006 00:44:57 -0400
Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple
queries and recent (2.2.13) blast - patch attached
In-Reply-To:
Message-ID: <200610190444.k9J4ivGv028629@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2051
mdehoon at ims.u-tokyo.ac.jp changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |mdehoon at ims.u-tokyo.ac.jp
------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2006-10-19 00:44 -------
A new blast version (2.2.15) came out recently, so I tried the XML parser with
its output for a multiple query. I didn't notice a problem except that all
alignments are put into one list, which is annoying because then we have to
find out which alignment corresponds to which query.
So, which specific problem with the XML parser are you trying to solve? And do
these problems still occur with blast 2.2.15? (as far as I can tell, its XML
output is the same as for blast 2.2.14, so it's probably here to stay).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chris.lasher at gmail.com Wed Oct 25 02:22:06 2006
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 24 Oct 2006 22:22:06 -0400
Subject: [Biopython-dev] [BioPython] Martel-based parsing of Unigene
flat files
In-Reply-To: <453975D5.4070701@mail.nih.gov>
References: <453975D5.4070701@mail.nih.gov>
Message-ID: <128a885f0610241922h5db02fbfod1a83cfeade29801@mail.gmail.com>
Hi Sean,
FWIW this should probably have been posted to BioPython-dev, but I
don't think that would improve your chances of getting a response. I
am cross-posting it there, anyways. Unfortunately for you, I do not
have an answer for you. :-(
I, myself, would be interested in a response to this question from the
Devs, as I would like to write a parser for PTT files. Last I saw
there was a lot of chatter about the Martel parsers being incredibly
slow compared to straightforward solutions. It seems that standard
format parsers would be one of the easiest ways for BioPython newbies
to contribute to developing the BioPython project. However, there
isn't very much in the way of documentation on the BioPython way to do
so, let alone developer documentation at all. I would like to know
what can be done to get some dev docs going on the wiki.
Chris
On 10/20/06, Sean Davis wrote:
> I am relatively new to python and biopython (coming from perl side of
> things). I would like to make a parser for Unigene flat file format.
> However, after digging through the LocusLink parsing code (as probably
> the most similar format, etc.), I'm still at a loss for how Martel-based
> parsing works. I understand the big picture (converting an re-based
> parsing of a file into events), but it is the detail that I am missing.
> I know about pydoc, but the pydoc for much of Martel is not very helpful
> to me, at least not in my current state of knowledge. Any suggestions
> on how to get started?
>
> Thanks,
> Sean
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
From sdavis2 at mail.nih.gov Thu Oct 26 12:09:43 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:09:43 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene parser
Message-ID: <200610260809.43245.sdavis2@mail.nih.gov>
Let me start off by saying that I am a python newbie after working in perl for
the last few years. I am working on a Unigene flat file parser. In my
scanner, I have a construct that looks like:
for line in handle:
    tag = line.split(' ')[0]
    line = line.rstrip()
    if tag=='ID':
        consumer.ID(line)
    if tag=='GENE':
        consumer.GENE(line)
    if tag=='TITLE':
        consumer.TITLE(line)
    if tag=='EXPRESS':
        consumer.EXPRESS(line)
    ....
Since I am setting things up so that there is a 1:1 correspondence between the
"tag" and the consumer method, is there an easy way to reduce this long set
of IF statements to a simple mapping procedure that maps a tag to the correct
method?
Sorry for the naive question....
Thanks,
Sean
From sdavis2 at mail.nih.gov Thu Oct 26 12:30:08 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 08:30:08 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene
parser
In-Reply-To: <200610260809.43245.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
Message-ID: <200610260830.08326.sdavis2@mail.nih.gov>
On Thursday 26 October 2006 08:09, Sean Davis wrote:
> Let me start off by saying that I am a python newbie after working in perl
> for the last few years. I am working on a Unigene flat file parser. In my
> scanner, I have a construct that looks like:
>
> for line in handle:
>     tag = line.split(' ')[0]
>     line = line.rstrip()
>     if tag=='ID':
>         consumer.ID(line)
>     if tag=='GENE':
>         consumer.GENE(line)
>     if tag=='TITLE':
>         consumer.TITLE(line)
>     if tag=='EXPRESS':
>         consumer.EXPRESS(line)
>     ....
>
> Since I am setting things up so that there is a 1:1 correspondence between
> the "tag" and the consumer method, is there an easy way to reduce this long
> set of IF statements to a simple mapping procedure that maps a tag to the
> correct method?
>
> Sorry for the naive question....
Even more apologies. I answered my own question. Something like this seems
to work:
exec('consumer.'+tag+'(line)')
which replaces all the IF statements quite nicely.
Sean
From james.balhoff at duke.edu Thu Oct 26 13:46:34 2006
From: james.balhoff at duke.edu (Jim Balhoff)
Date: Thu, 26 Oct 2006 09:46:34 -0400
Subject: [Biopython-dev] Basic python question with regard to Unigene
parser
In-Reply-To: <200610260830.08326.sdavis2@mail.nih.gov>
References: <200610260809.43245.sdavis2@mail.nih.gov>
<200610260830.08326.sdavis2@mail.nih.gov>
Message-ID:
Hi Sean,
On Oct 26, 2006, at 8:30 AM, Sean Davis wrote:
> On Thursday 26 October 2006 08:09, Sean Davis wrote:
>> Let me start off by saying that I am a python newbie after working
>> in perl
>> for the last few years. I am working on a Unigene flat file
>> parser. In my
>> scanner, I have a construct that looks like:
>>
>> for line in handle:
>>     tag = line.split(' ')[0]
>>     line = line.rstrip()
>>     if tag=='ID':
>>         consumer.ID(line)
>>     if tag=='GENE':
>>         consumer.GENE(line)
>>     if tag=='TITLE':
>>         consumer.TITLE(line)
>>     if tag=='EXPRESS':
>>         consumer.EXPRESS(line)
>>     ....
>>
>> Since I am setting things up so that there is a 1:1 correspondence
>> between
>> the "tag" and the consumer method, is there an easy way to reduce
>> this long
>> set of IF statements to a simple mapping procedure that maps a tag
>> to the
>> correct method?
>>
>> Sorry for the naive question....
>
> Even more apologies. I answered my own question. Something like
> this seems
> to work:
>
> exec('consumer.'+tag+'(line)')
>
> which replaces all the IF statements quite nicely.
Alternatively, you may want to look at getattr(). There is a good
description here:
Jim
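A minimal sketch of the getattr() approach suggested above, written for
current Python. The UnigeneConsumer class, the feed() helper and the sample
lines are hypothetical stand-ins, not Biopython code. The point is only that
getattr() looks a method up by name, so neither exec nor a chain of if
statements is needed, and unknown tags can be skipped instead of executed.

```python
class UnigeneConsumer(object):
    """Hypothetical consumer: one method per tag, as in the question above."""

    def __init__(self):
        self.events = []

    def ID(self, line):
        self.events.append(("ID", line))

    def GENE(self, line):
        self.events.append(("GENE", line))


def feed(handle, consumer):
    """Dispatch each line to the consumer method named after its tag."""
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        tag = line.split(' ')[0]
        # Never dispatch to private/special names such as __init__.
        if tag.startswith('_'):
            continue
        # Look the method up by name; None if the consumer has no such tag.
        method = getattr(consumer, tag, None)
        if callable(method):
            method(line)


# Made-up sample lines in the "TAG value" layout described in the thread.
lines = ["ID          Hs.2\n", "GENE        NAT2\n", "UNKNOWN     x\n", "//\n"]
consumer = UnigeneConsumer()
feed(lines, consumer)
```

Unlike exec, this never evaluates text from the input file as code: a line
whose tag has no matching method is simply ignored.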
From sdavis2 at mail.nih.gov Thu Oct 26 14:56:25 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 26 Oct 2006 10:56:25 -0400
Subject: [Biopython-dev] Unigene flat file parser
Message-ID: <200610261056.25883.sdavis2@mail.nih.gov>
I have put together a parser for the Unigene flat file format described here:
ftp://ftp.ncbi.nih.gov/repository/UniGene/README
under the Hs.data section. The actual .data files are included in the various
organism-specific directories.
Is there any interest in including this in biopython? If so, I would
appreciate some input on the code and details of contributions, etc. The
current code is available here:
http://watson.nci.nih.gov/pressa/~sdavis/Unigene.py
Use it like so, and note that the ugrecord holds much more information (in
fact, all information is captured) than is shown by its __repr__.
#!/usr/bin/python
import Unigene
fh = file('Hs.data') #downloaded previously from ftp, or whatever
ugparser = Unigene.Iterator(fh,Unigene.RecordParser())
for ugrecord in ugparser:
    print ugrecord
From mdehoon at c2b2.columbia.edu Thu Oct 26 18:01:24 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 26 Oct 2006 14:01:24 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <200610261056.25883.sdavis2@mail.nih.gov>
References: <200610261056.25883.sdavis2@mail.nih.gov>
Message-ID: <4540F7F4.2050003@c2b2.columbia.edu>
Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:
Perhaps a silly question from a non-Unigene user, but what is the
relation between your parser and the one in Bio/UniGene/__init__.py? The
latter seems to parse HTML files (see the example in
Tests/test_unigene.py) instead of flat files. Is your parser intended as
a replacement for Bio/UniGene/__init__.py?
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032
From sdavis2 at mail.nih.gov Thu Oct 26 20:15:52 2006
From: sdavis2 at mail.nih.gov (Davis, Sean (NIH/NCI) [E])
Date: Thu, 26 Oct 2006 16:15:52 -0400
Subject: [Biopython-dev] Unigene flat file parser
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
Message-ID: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
Michiel,
It looks to me like it parses an HTML file downloaded from the NCBI website containing a single unigene record of interest--potentially useful if one knows what one needs.
I, on the other hand, have always just used the flat files as the source for unigene, as I typically want ALL the data for one or several species available. A single flat file is available for each organism and contains ALL the unigene entries and their associated information for that organism. By concatenating several files (they are simple text files), one can parse the entire unigene database.
So, in short, I don't see this unigene parser as a replacement for the current module. They fill different needs: this one fills a need that I have and is useful for whole-genome, multiple-species work or microarray analyses. Whether and where it fits into biopython is really up to the community.
Just a quick comment on speed for the parser--it parses Hs.data (the largest flat file in unigene, 84,000 entries, with just under 7,000,000 sequence entries, 150 Mb file size) in just under 5 minutes on my Xeon desktop.
Sean
From biopython-dev at maubp.freeserve.co.uk Fri Oct 27 19:08:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 27 Oct 2006 20:08:21 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
Message-ID: <45425925.8090607@maubp.freeserve.co.uk>
Hello list,
I've checked in a somewhat cleaned up (and more tested) version of the
earlier attachments to bug 2059.
And I've updated the wiki page:
http://biopython.org/wiki/SeqIO
Has anyone got any tips on formatting python code on Wiki? Maybe I
should just write the docs in LaTeX like the cook book etc.
Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
objects, it would be a good idea to make them slightly more user-friendly:
http://bugzilla.open-bio.org/show_bug.cgi?id=2057
(I would like to check this in before writing too much of the SeqIO
documentation)
If any of you want to check this out and have a look, I'd be pleased to
get some feedback.
There should be no impact on the rest of BioPython, or existing scripts.
Peter
-----------------------------------------------------------------
Link to view CVS,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython
Old files, not touched:
Bio/SeqIO/FASTA.py
Bio/SeqIO/generic.py

Bio/SeqIO/__init__.py (replaces almost empty old file)
======================
* the helper functions (i.e. the functions I expect people to use)
* mappings from file types to parsers and writers
* mappings from file extensions to file types
* large self test suite (which does not need any input files, but will
create a temp file in the current directory)
Bio/SeqIO/Interfaces.py
=======================
Base classes for readers/writers
Bio/SeqIO/FastaIO.py
====================
Uses a generator function for the reader.
Uses a sub-class of SequentialSequenceWriter for the writer.
Bio/SeqIO/ClustalIO.py
======================
Uses a generator function for the reader, based on the old class in
Bio/SeqIO/generic.py
Bio/SeqIO/PhylipIO.py
=====================
Reads and writes phylip files with names strictly truncated at 10
characters.
Uses a generator function for the reader, and a subclass of SequenceWriter
for the writer.
Bio/SeqIO/StockholmIO.py
========================
Uses subclasses from Interfaces.py
Unlike prior code attached to bug 2059, this code contains just one
writer and parser, which expects the Stockholm file to follow the PFAM
conventions. It should read other files fine - but what happens to the
annotation is less well defined. This is what BioPerl does
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c10
Bio/SeqIO/GenBankIO.py
======================
Uses a generator function for the reader, which just calls Bio.GenBank
to do the work. See also bug 2059 comment 11 on my thoughts about how
to include EMBL support:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c11
Bio/SeqIO/NexusIO.py
====================
Uses a generator function for the reader, which just calls Bio.Nexus to
do the parsing and then extracts the sequences. Has not been tested much.
Peter
From mdehoon at c2b2.columbia.edu Sat Oct 28 05:40:02 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 28 Oct 2006 01:40:02 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
<014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
Message-ID: <4542ED32.8060702@c2b2.columbia.edu>
OK, that's fine then.
Is anybody actually using the current Bio/UniGene stuff? I couldn't find
documentation for it and it hasn't been updated in more than two years,
so it may be some dead code sitting around. If so, we can remove this
code; Bio/UniGene would be a nice place to put Sean's code (even though
it is doing something different from the current Bio/UniGene).
--Michiel.
Davis, Sean (NIH/NCI) [E] wrote:
> So, in short, I don't see this unigene parser as a replacement for
> the current module. They fill different needs; this one fills a need
> that I have and is useful for whole-genome, multiple species work, or
> microarray analyses and whether and where it fits into biopython is
> really up to the community.
>
> Michiel wrote:
>> Perhaps a silly question from a non-Unigene user, but what is the
>> relation between your parser and the one in
>> Bio/UniGene/__init__.py? The latter seems to parse HTML files (see
>> the example in Tests/test_unigene.py) instead of flat files. Is
>> your parser intended as a replacement for Bio/UniGene/__init__.py?
From mdehoon at c2b2.columbia.edu Sat Oct 28 05:56:51 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 28 Oct 2006 01:56:51 -0400
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4542F123.9050106@c2b2.columbia.edu>
Thanks, Peter!
It looks very nice. Actually, I have been using an earlier version of
the new SeqIO module (from your code on Bugzilla) and found it to work
quite well. A few short comments:
To parse a Fasta file using the new SeqIO looks like this:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.fasta") :
    print record.id
    print record.seq
I would rather have something like this:
from Bio.SeqIO import Fasta
for record in Fasta.parse(open("example.fasta")):
    print record.id
    print record.seq
where Fasta.parse returns a FastaIterator object, and the argument is
either a file object or a file name. You can in addition have a function
Bio.SeqIO.parse that guesses the file type from the file name extension
(as you have now for File2SequenceIterator), though that wouldn't work
for file handles.
On a related note, I don't think we need the SequenceList and
SequenceDict class. To make a list, one can do
from Bio.SeqIO import Fasta
records = [record for record in Fasta.parse(open("example.fasta"))]
To convert an iterator to a dictionary takes one line more, and is
probably more straightforward than SequenceDict.
--Michiel.
Peter (BioPython Dev) wrote:
> Hello list,
>
> I've checked in a somewhat cleaned up (and more tested) version of the
> earlier attachments to bug 2059.
>
> And I've updated the wiki page:
> http://biopython.org/wiki/SeqIO
>
> Has anyone got any tips on formatting python code on Wiki? Maybe I
> should just write the docs in LaTeX like the cook book etc.
>
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing to much of the SeqIO
> documentation)
>
> If any of you want to check this out and have a look, I'd be pleased to
> get some feedback.
From biopython-dev at maubp.freeserve.co.uk Sat Oct 28 11:59:13 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Oct 2006 12:59:13 +0100
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4542F123.9050106@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk>
<4542F123.9050106@c2b2.columbia.edu>
Message-ID: <45434611.1040708@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of
> the new SeqIO module (from your code on Bugzilla) and found it to work
> quite well.
Thank you - and good to hear the (old version) is working OK.
> A few short comments:
>
> To parse a Fasta file using the new SeqIO looks like this:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta") :
>     print record.id
>     print record.seq
>
> I would rather have something like this:
>
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>     print record.id
>     print record.seq
>
> where Fasta.parse returns a FastaIterator object, and the argument is
> either a file object or a file name.
I think you have raised two issues - file names/handles (discussed
below), and the use of a generic function versus a format specific one
(or at least the naming conventions).
I like the idea of a generic function File2SequenceIterator() which can
be used on lots of different file formats, just by changing the
arguments. However, there is nothing to stop you using the underlying
format specific iterators directly:
from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
    print record.id
    print record.seq
(which is similar to your suggestion above)
As long as you don't need to use any file format specific options, then
for every file format the style of the code is the same - but switching
file formats takes a little more work:
from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
    print record.id
    print record.seq
versus:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus") :
    print record.id
    print record.seq
or, to give an example where the file extension is no use and the format
must be explicitly stated:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
    print record.id
    print record.seq
I expect the "helper functions" like File2SequenceIterator() to be used
for the simple cases where the user does not care about the minor
options we might offer for individual file formats (this would cover
beginners).
They are also nice for writing multiple file format test cases ;)
I see later in your email you suggested a generic Bio.SeqIO.parse(file)
function which would cope with multiple file formats. Was your point
more about what we call things?
I'm happy to go from File2SequenceIterator() to something like
SequenceIterator(), SequenceIter(), SeqRecordIter, or just SeqIter() -
with matching versions like SeqList() and SeqDict()
However, I'm not so keen on "parse()" because it gives no clue as to
what it will return.
---
On the other point, filenames/handles. Right now, the individual
iterators only take a handle. This was a simplification I made to make
my life as straightforward as possible.
The File2SequenceIterator() function (and friends) can take a filename,
handle, or a string containing the contents of a file (in addition to
the format). However, these are done as three separate arguments.
I could have one argument that takes a file name or handle, and works it
out on its own. Bio.Nexus tries to do this for example. Having the
individual iterators also do this trick would be pretty simple (using a
shared utility function).
The "contents of a file" string argument was handy when testing, but I
imagine this is not going to be a common situation. If people need
this, they can use python's StringIO module to turn their data string
into a handle easily enough.
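As a small illustration of the StringIO route mentioned above: in current
Python the class lives in the io module (the stand-alone StringIO module is
the Python 2 spelling). The read_fasta_ids() function is a toy stand-in for
any handle-based parser, not part of Bio.SeqIO.

```python
from io import StringIO


def read_fasta_ids(handle):
    """Toy handle-based reader: collect the identifier of each FASTA record."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


# The data is already in memory as a string...
data = ">alpha first sequence\nACGT\n>beta\nGGCC\n"

# ...so wrap it in a handle and pass it to code that expects a file object.
handle = StringIO(data)
ids = read_fasta_ids(handle)
```

With this available, a separate "contents of a file" argument is not really
needed on each parser.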
> You can in addition have a function
> Bio.SeqIO.parse that guesses the file type from the file name extension
> (as you have now for File2SequenceIterator), though that wouldn't work
> for file handles.
When dealing with a file handle, converting it to an undo file handle
would probably work - if we had code to guess the file format. I have
tried to raise a syntax error when a parser is given an invalid file -
which would mean we could just try some common file formats in order
until one works without a syntax error.
But I felt this was not needed right away, so I put it off.
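For what it is worth, the "try each format in turn" idea could look roughly
like the sketch below. Everything in it is an assumption made for
illustration: the two toy parsers are not the Bio.SeqIO ones, ValueError
stands in for the syntax error mentioned above, and reading the whole input
into memory stands in for an undo handle so that each attempt can start again
from the beginning.

```python
from io import StringIO


def toy_fasta_parser(handle):
    """Toy FASTA reader; raises ValueError if the data does not look like FASTA."""
    records = []
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("FASTA data must start with '>'")
    if not records:
        raise ValueError("no FASTA records found")
    return [tuple(record) for record in records]


def toy_clustal_parser(handle):
    """Toy Clustal reader; raises ValueError if the header line is missing."""
    if not handle.readline().startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header")
    order, seqs = [], {}
    for line in handle:
        # Skip blank lines and the conservation lines, which start with spaces.
        if line[:1].isspace():
            continue
        fields = line.split()
        if len(fields) != 2:
            continue
        name, chunk = fields
        if name not in seqs:
            order.append(name)
            seqs[name] = ""
        seqs[name] += chunk
    return [(name, seqs[name]) for name in order]


# Formats are tried in this order; the list itself is a made-up example.
PARSERS = [("fasta", toy_fasta_parser), ("clustal", toy_clustal_parser)]


def guess_and_parse(handle, parsers=PARSERS):
    """Return (format_name, records) from the first parser that accepts the data."""
    text = handle.read()  # buffer once so every parser sees the full input
    for name, parser in parsers:
        try:
            return name, parser(StringIO(text))
        except ValueError:
            continue
    raise ValueError("could not parse the data with any known format")


fmt, records = guess_and_parse(
    StringIO("CLUSTAL W (1.83)\n\nseq1  ACGT\nseq2  AC-T\n"))
```

This only works if every parser is strict about rejecting files in the wrong
format, which is why the error handling in the individual parsers matters.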
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do
>
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]
Currently that would be written:
from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]
Or even just the following, which I find simpler:
from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))
Versus the alternatives:
from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")
from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")
> To convert an iterator to a dictionary takes one line more, and is
> probably more straightforward than SequenceDict.
That was one thing I wanted to discuss - having a SequenceDict and
SequenceList class would let us add doc strings and perhaps methods like
maxlength, minlength, totallength, ...
Or, I can just use simple list and dict objects in the functions
File2SequenceList and File2SequenceDict.
I have no strong preference on this issue - so unless someone else
speaks up, I'll go back to simple lists and dictionaries - keeps things
simple.
Peter
From sdavis2 at mail.nih.gov Sat Oct 28 16:47:03 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sat, 28 Oct 2006 12:47:03 -0400
Subject: [Biopython-dev] Unigene flat file parser
In-Reply-To: <4542ED32.8060702@c2b2.columbia.edu>
References: <200610261056.25883.sdavis2@mail.nih.gov>
<4540F7F4.2050003@c2b2.columbia.edu>
<014DBF86B19310419F0DF8910FC56457240CEA@nihcesmlbx10.nih.gov>
<4542ED32.8060702@c2b2.columbia.edu>
Message-ID: <45438987.1070403@mail.nih.gov>
Michiel de Hoon wrote:
> OK, that's fine then.
>
> Is anybody actually using the current Bio/UniGene stuff? I couldn't
> find documentation for it and it hasn't been updated in more than two
> years, so it may be some dead code sitting around. If so, we can
> remove this code; Bio/UniGene would be a nice place to put Sean's code
> (even though it is doing something different from the current
> Bio/UniGene).
I haven't looked into it much, but for dynamic queries of individual
Unigene entries, it seems that Eutils might be the better way to go,
anyway.
Sean
From mdehoon at c2b2.columbia.edu Sun Oct 29 06:09:14 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 01:09:14 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45434611.1040708@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
<4542F123.9050106@c2b2.columbia.edu>
<45434611.1040708@maubp.freeserve.co.uk>
Message-ID: <4544458A.5000102@c2b2.columbia.edu>
Well let's first decide which functions we want in Bio.SeqIO, and then
decide how to name them.
I'm fine with the idea of having a function that can guess the file
format from the extension. I also agree that a parser that can guess the
file format from the file contents is not needed at this point.
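A rough sketch of guessing the format from the extension, with an explicit
override for cases like "nexus_seqs.txt". The mapping table and the
guess_format() name are hypothetical, chosen only for illustration; the
actual table in Bio/SeqIO/__init__.py may well differ.

```python
import os

# Hypothetical extension-to-format table (illustrative entries only).
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".nexus": "nexus",
    ".nex": "nexus",
    ".aln": "clustal",
    ".phy": "phylip",
}


def guess_format(filename, format=None):
    """Return the explicit format if given, else guess it from the extension."""
    if format is not None:
        return format
    extension = os.path.splitext(filename)[1].lower()
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError("cannot guess the format from extension %r; "
                         "please pass format=..." % extension)


guessed = guess_format("example.fasta")
explicit = guess_format("nexus_seqs.txt", format="nexus")
```

Raising an error for an unknown extension, rather than falling back to some
default format, keeps a wrong guess from going unnoticed.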
> That was one thing I wanted to discuss - having a SequenceDict and
> SequenceList class would let us add doc strings and perhaps methods
> like maxlength, minlength, totallength, ...
>
> Or, I can just use simple list and dict objects in the functions
> File2SequenceList and File2SequenceDict.
>
> I have no strong preference on this issue - so unless someone else
> speaks up, I'll go back to simple lists and dictionaries - keeps
> things simple.
If we go back to simple lists and dictionaries, do we still need the
functions File2SequenceList and File2SequenceDict? I'd like to avoid
software bloat as much as possible, so if we don't need these two
functions, so much the better.
About file handles:
> The File2SequenceIterator() function (and friends) can take a
> filename, handle, or a string containing the contents of a file (in
> addition to the format). However, these are done as three separate
> arguments.
>
> I could have one argument that takes a file name or handle, and works
> it out on its own. Bio.Nexus tries to do this for example. Having
> the individual iterators also do this trick would be pretty simple
> (using a shared utility function).
>
> The "contents of a file" string argument was handy when testing, but I
> imagine this is not going to be a common situation. If people need
> this, they can use python's StringIO module to turn their data string
> into a handle easily enough.
I like the idea of one argument that takes a file name or handle. I
believe that that is how other Biopython functions work.
--Michiel.
From biopython-dev at maubp.freeserve.co.uk Sun Oct 29 11:25:35 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 29 Oct 2006 11:25:35 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <4544458A.5000102@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk>
<4544458A.5000102@c2b2.columbia.edu>
Message-ID: <45448FAF.1090104@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> Well let's first decide which functions we want in Bio.SeqIO, and then
> decide how to name them.
Agreed.
One point against names like File2SequenceIterator is that the pun on
"2" versus "to" (i.e. convert) will not be so obvious to non-native
English speakers.
> > That was one thing I wanted to discuss - having a SequenceDict and
> > SequenceList class would let us add doc strings and perhaps methods
> > like maxlength, minlength, totallength, ...
> >
> > Or, I can just use simple list and dict objects in the functions
> > File2SequenceList and File2SequenceDict.
> >
> > I have no strong preference on this issue - so unless someone else
> > speaks up, I'll go back to simple lists and dictionaries - keeps
> > things simple.
>
> If we go back to simple lists and dictionaries, do we still need the
> functions File2SequenceList and File2SequenceDict? I'd like to avoid
> software bloat as much as possible, so if we don't need these two
> functions, so much the better.
I think there is some benefit to having File2SequenceDict included as
converting from a SeqRecord iterator to a dictionary of SeqRecords isn't
completely trivial.
There are at least two important questions: What to use as the
dictionary key (e.g. record.id) and how to deal with duplicate keys
(e.g. use first/last record with that id, or simply abort).
Consider this code as an alternative to File2SequenceDict:
iterator = File2SequenceList(...)
d = dict([record.id, record] for record in iterator)
I don't think it's very readable, or intuitive (and could scare
beginners). Part of my aim with Bio.SeqIO was to make the interface simple.
More importantly, if there are records with duplicate ids then with this
code the resulting dictionary will have only the last record.
Personally I would want duplicate keys to cause an exception.
Rewriting File2SequenceDict() to use a simple dict would give something
like this, where record2key is an optional user supplied function.
def File2SequenceDict(..., record2key=None) :
    iterator = File2SequenceIterator(...)
    if record2key is None : record2key = lambda record : record.id
    answer = dict()
    for record in iterator :
        key = record2key(record)
        assert key not in answer, "Duplicate key"
        answer[key] = record
    return answer
The record2key function is perhaps not needed - I was trying to make the
function flexible. The duplicate key behaviour could also be an option.
The other function, File2SequenceList isn't really needed if we are
using simple lists. It's basically a wrapper for
list(File2SequenceIterator(...)) or some other one liner.
The main reason I invented File2SequenceList() was for completeness -
given I already had File2SequenceDict() and File2SequenceIterator()
> About file handles:
>
> > The File2SequenceIterator() function (and friends) can take a
> > filename, handle, or a string containing the contents of a file (in
> > addition to the format). However, these are done as three separate
> > arguments.
> >
> > I could have one argument that takes a file name or handle, and works
> > it out on its own. Bio.Nexus tries to do this for example. Having
> > the individual iterators also do this trick would be pretty simple
> > (using a shared utility function).
> >
> > The "contents of a file" string argument was handy when testing, but I
> > imagine this is not going to be a common situation. If people need
> > this, they can use python's StringIO module to turn their data string
> > into a handle easily enough.
>
> I like the idea of one argument that takes a file name or handle. I
> believe that that is how other Biopython functions work.
OK then - I'll do that.
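A shared utility for the "file name or handle" argument might look
roughly like this. It is only a sketch: as_handle is a hypothetical name,
and the rule "a string is a file name, anything else is a handle" is an
assumption, not existing Biopython behaviour.

```python
import io


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a file name or an existing handle.

    Hypothetical helper: a string is treated as a file name and opened;
    anything else is assumed to be a file-like object and returned as-is.
    The flag tells the caller whether it is responsible for closing it.
    """
    if isinstance(source, str):
        return open(source, mode), True
    return source, False


# A data string can still be used by wrapping it in StringIO first.
handle, opened_here = as_handle(io.StringIO(">seq1\nACGT\n"))
first_line = handle.readline()
```

Returning the opened_here flag lets each iterator close only the files it
opened itself and leave caller-supplied handles alone.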
Peter
From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 00:13:57 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 00:13:57 +0000
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
Message-ID: <454543C5.1080209@maubp.freeserve.co.uk>
Hello all,
I've been looking at writing multiple sequence alignments in Nexus
format for the new Bio.SeqIO code, and came up with the following little
problem:
Given one or more Seq objects, how can I reliably decide if they are
protein, DNA, or RNA?
(These are the relevant choices in a Nexus file's format datatype=...
header.)
I'm resigned to the fact that if the Seq object has the generic alphabet
this boils down to looking at the sequence strings and making an
educated guess (probably following an established algorithm from an
alignment program). Does any such code already exist in BioPython?
However - is there a nice/official way to ask an alphabet object what it
is (protein, DNA, RNA)?
Looking over the code in Bio.Alphabet the only thing I can think of is
to get the class name as a string and search it(!) We can't look at the
letters property as this is None for the base classes like ProteinAlphabet.
If we are prepared to meddle with the alphabet system we might add
attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
base classes. Or simply have a "sequence_type" method, which the
subclasses can re-define as required.
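As a rough sketch of the "sequence_type" idea: the classes below are
simplified, hypothetical stand-ins that only borrow names from
Bio.Alphabet, and the method and its return strings are assumptions.

```python
# Hypothetical classes mirroring the idea; this is not the real Bio.Alphabet code.
class Alphabet:
    def sequence_type(self):
        # Generic alphabet: the type is unknown.
        return None


class ProteinAlphabet(Alphabet):
    def sequence_type(self):
        return "protein"


class NucleotideAlphabet(Alphabet):
    def sequence_type(self):
        return "nucleotide"


class DNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "dna"


class RNAAlphabet(NucleotideAlphabet):
    def sequence_type(self):
        return "rna"
```

More specific alphabets would inherit the answer from their base class,
so nothing has to search class names as strings.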
(I wasn't meaning to reopen the whole "do we need alphabets"
conversation last discussed in July 2006. At least, not yet...)
Peter
From fkauff at duke.edu Mon Oct 30 00:48:39 2006
From: fkauff at duke.edu (Frank)
Date: Sun, 29 Oct 2006 19:48:39 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <1162169319.12941.5.camel@cpe-071-077-002-012.nc.res.rr.com>
Hi all,
On Mon, 2006-10-30 at 00:13 +0000, Peter (BioPython Dev) wrote:
> Hello all,
>
> I've been looking at writing multiple sequence alignments in Nexus
> format for the new Bio.SeqIO code, and came up with the following little
> problem:
>
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?
>
I'm not aware of any such code - however, an educated guess would be
easy, (more or less ACGTNX only, ACGUNX only, everything else...?). With
NEXUS it becomes tricky, as a dataset could potentially be partitioned
into a mix of all types. And there is no "official" way to indicate this
in the datatype= option.
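A minimal sketch of such a guess, assuming a strict letter-set test (no
"more or less" tolerance) and assuming that "-", "?" and "." are gap or
missing-data characters to ignore. The function name is hypothetical.

```python
def guess_sequence_type(sequences):
    """Guess "dna", "rna" or "protein" from the letters used.

    Rough heuristic only: only ACGTNX means DNA, only ACGUNX means RNA,
    anything else is taken to be protein.
    """
    letters = set()
    for seq in sequences:
        letters.update(str(seq).upper())
    # Assumed gap / missing-data characters; adjust for the file format.
    letters -= set("-?.")
    if not letters:
        raise ValueError("No letters to examine")
    if letters <= set("ACGTNX"):
        return "dna"
    if letters <= set("ACGUNX"):
        return "rna"
    return "protein"
```

Sequences containing neither T nor U come out as "dna" here, and a short
protein made up only of A, C, G and T would be misclassified, so this is
a fallback for the generic alphabet and no more than that.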
Frank
> However - is there a nice/official way to ask an alphabet object what it
> is (protein, DNA, RNA)?
>
> Looking over the code in Bio.Alphabet the only thing I can think of is
> to get the class name as a string and search it(!) We can't look at the
> letters property as this is None for the base classes like ProteinAlphabet.
>
> If we are prepared to meddle with the alphabet system we might add
> attributes like "isProtein", "isNucleotide", "isRNA", "isDNA" to these
> base classes. Or simply have a "sequence_type" method, which the
> subclasses can re-define as required.
>
> (I wasn't meaning to reopen the whole "do we need alphabets"
> conversation last discussed in July 2006. At least, not yet...)
>
> Peter
>
From mdehoon at c2b2.columbia.edu Mon Oct 30 03:20:48 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:20:48 -0500
Subject: [Biopython-dev] Determining if seq alphabet is protein/dna/rna
In-Reply-To: <454543C5.1080209@maubp.freeserve.co.uk>
References: <454543C5.1080209@maubp.freeserve.co.uk>
Message-ID: <45456F90.1090005@c2b2.columbia.edu>
Peter (BioPython Dev) wrote:
> Given one or more Seq objects, how can I reliably decide if they are
> protein, DNA, or RNA?
>
> (These are the relevant choices in a Nexus file's format datatype=...
> header.)
>
> I'm resigned to the fact that if the Seq object has the generic alphabet
> this boils down to looking at the sequence strings and making an
> educated guess (probably following an established algorithm from an
> alignment program). Does any such code already exist in BioPython?
Something similar exists in Bio.Seq in the complement,
reverse_complement methods of Seq objects, but it only distinguishes
between DNA and RNA. I don't know of any official way to do that in
Biopython.
--Michiel.
From mdehoon at c2b2.columbia.edu Mon Oct 30 03:42:44 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 29 Oct 2006 22:42:44 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45448FAF.1090104@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk>
<4544458A.5000102@c2b2.columbia.edu>
<45448FAF.1090104@maubp.freeserve.co.uk>
Message-ID: <454574B4.3050407@c2b2.columbia.edu>
Peter wrote:
> There are at least two important questions: What to use as the
> dictionary key (e.g. record.id) and how to deal with duplicate keys
> (e.g. use first/last record with that id, or simply abort).
>
> Rewriting File2SequenceDict() to use a simple dict would give something
> like this, where record2key is an optional user supplied function.
>
> def File2SequenceDict(..., record2key=None) :
>     iterator = File2SequenceIterator(...)
>     if record2key is None : record2key = lambda record : record.id
>     answer = dict()
>     for record in iterator :
>         key = record2key(record)
>         assert key not in answer, "Duplicate key"
>         answer[key] = record
>     return answer
>
> The record2key function is perhaps not needed - I was trying to make the
> function flexible. The duplicate key behaviour could also be an option.
>
I am using File2SequenceIterator in one of my scripts (thanks by the way
for that, my script is a lot faster now. I didn't do a rigorous timing,
but it's about a zillion times faster), and convert the iterator to a
dictionary using plain Python. If I were to use File2SequenceDict
instead, I would need the record2key argument, because in my application
I want only part of record.id as the key.
In the File2SequenceDict above, answer[key] contains the complete
record. Some people will want that. However, in my application I only
want to store the record.seq part in answer[key]. Somebody else may want
str(record.seq). So we'd also need a record2value argument.
For duplicate keys, there are at least four possibilities (raise an
exception, store only one of the keys, store neither of the keys and
don't raise an exception, store both after modifying one of the keys).
So this should also be an option.
You'll end up with a File2SequenceDict function that is more complicated
than the plain Python solution.
--Michiel.
From biopython-dev at maubp.freeserve.co.uk Mon Oct 30 10:54:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 30 Oct 2006 10:54:41 +0000
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <454574B4.3050407@c2b2.columbia.edu>
References: <45425925.8090607@maubp.freeserve.co.uk> <4542F123.9050106@c2b2.columbia.edu> <45434611.1040708@maubp.freeserve.co.uk> <4544458A.5000102@c2b2.columbia.edu> <45448FAF.1090104@maubp.freeserve.co.uk>
<454574B4.3050407@c2b2.columbia.edu>
Message-ID: <4545D9F1.2040902@maubp.freeserve.co.uk>
Michiel de Hoon wrote:
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do ...
I've updated the new code in Bio.SeqIO to remove SequenceDict and
SequenceList and use the standard dictionary and list instead.
Michiel de Hoon wrote:
> I am using File2SequenceIterator in one of my scripts (thanks by the way
> for that, my script is a lot faster now. I didn't do a rigorous timing,
> but it's about a zillion times faster), and convert the iterator to a
> dictionary using plain Python. If I were to use File2SequenceDict
> instead, I would need the record2key argument, because in my application
> I want only part of record.id as the key.
With such a speed up, I'd guess you were using Bio.Fasta before. I've
noticed the same thing. Are you dealing with NCBI style fasta
identifiers made up of several fields separated by "|" characters?
> In the File2SequenceDict above, answer[key] contains the complete
> record. Some people will want that. However, in my application I only
> want to store the record.seq part in answer[key]. Somebody else may want
> str(record.seq). So we'd also need a record2value argument.
It does slightly undermine the "you only get SeqRecord objects"
principle. On the other hand, it's a simple addition that is easy to
explain and implement. I'm happy to add this.
> For duplicate keys, there are at least four possibilities (raise an
> exception, store only one of the keys, store neither of the keys and
> don't raise an exception, store both after modifying one of the keys).
> So this should also be an option.
Supporting all these options with an easy to understand interface looks
too hard.
In my opinion, if someone is trying to build a dictionary using repeated
keys they have made a mistake (either in their datafile or in their
record2key function), so raising an exception is a reasonable default
behaviour (and is easy to code).
Apart from the "exception" option, which of these actions do you
generally find most appropriate?
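For reference, a sketch of the function with both optional arguments and
the exception as the only duplicate-key behaviour. The name
records_to_dict, the Record stand-in, and the example ids are all
hypothetical; it takes any iterable of records rather than a file.

```python
from collections import namedtuple


def records_to_dict(records, record2key=None, record2value=None):
    """Build a dict from an iterable of records, refusing duplicate keys."""
    if record2key is None:
        record2key = lambda record: record.id
    if record2value is None:
        record2value = lambda record: record
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            # Unlike an assert, this still fires when Python runs with -O.
            raise ValueError("Duplicate key %r" % (key,))
        answer[key] = record2value(record)
    return answer


# Hypothetical stand-in for SeqRecord with "|"-separated ids.
Record = namedtuple("Record", ["id", "seq"])
records = [Record("gi|1|ref|A", "ACGT"), Record("gi|2|ref|B", "GGCC")]

# Use the second field of the id as the key and store only the sequence.
seqs = records_to_dict(
    records,
    record2key=lambda r: r.id.split("|")[1],
    record2value=lambda r: r.seq,
)
```

Anyone wanting a different policy for duplicates can still loop over the
iterator in plain Python.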
> You'll end up with a File2SequenceDict function that is more complicated
> than the plain Python solution.
Yes. Trying to do everything would be bad: complicated to implement,
and probably complicated to use as well.
Peter
From mdehoon at c2b2.columbia.edu Mon Oct 30 22:02:34 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 30 Oct 2006 17:02:34 -0500
Subject: [Biopython-dev] New Bio.SeqIO code
In-Reply-To: <45425925.8090607@maubp.freeserve.co.uk>
References: <45425925.8090607@maubp.freeserve.co.uk>
Message-ID: <4546767A.70302@c2b2.columbia.edu>
Peter (BioPython Dev) wrote:
> Can I check in bug 2057 too? Given the SeqIO system produces SeqRecord
> objects, it would be a good idea to make them slightly more user-friendly:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> (I would like to check this in before writing too much of the SeqIO
> documentation)
Looks good to me. Thanks!
--Michiel.
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032