[Biopython-dev] New Bio.SeqIO code
Peter
biopython-dev at maubp.freeserve.co.uk
Sat Oct 28 11:59:13 UTC 2006
Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of
> the new SeqIO module (from your code on Bugzilla) and found it to work
> quite well.
Thank you - and good to hear the old version is working OK.
> A few short comments:
>
> To parse a Fasta file using the new SeqIO looks like this:
>
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta"):
>     print record.id
>     print record.seq
>
> I would rather have something like this:
>
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>     print record.id
>     print record.seq
>
> where Fasta.parse returns a FastaIterator object, and the argument is
> either a file object or a file name.
I think you have raised two issues - file names/handles (discussed
below), and the use of a generic function versus a format specific one
(or at least the naming conventions).
I like the idea of a generic function File2SequenceIterator() which can
be used on lots of different file formats, just by changing the
arguments. However, there is nothing to stop you using the underlying
format specific iterators directly:
from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
    print record.id
    print record.seq
(which is similar to your suggestion above)
As long as you don't need to use any file format specific options, then
for every file format the style of the code is the same - but switching
file formats takes a little more work:
from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
    print record.id
    print record.seq
versus:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus"):
    print record.id
    print record.seq
or, to give an example where the file extension is no use and the format
must be explicitly stated:
from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus"):
    print record.id
    print record.seq
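For illustration, the dispatch a generic helper like this performs could look roughly like the sketch below. It is written in modern Python 3 using only the standard library; pick_iterator, ITERATORS, EXTENSIONS and the toy fasta_iterator are hypothetical names made up for this example, not the Bio.SeqIO code under discussion.

```python
import io
import os


def fasta_iterator(handle):
    """Toy FASTA parser yielding (id, sequence) tuples (illustration only)."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Hypothetical registries; a real module would list many more formats.
ITERATORS = {"fasta": fasta_iterator}
EXTENSIONS = {".fasta": "fasta", ".fas": "fasta", ".fa": "fasta"}


def pick_iterator(filename, format=None):
    """Return the iterator function for a file, guessing the format if needed."""
    if format is None:
        extension = os.path.splitext(filename)[1].lower()
        if extension not in EXTENSIONS:
            raise ValueError("Cannot guess the format of %r" % filename)
        format = EXTENSIONS[extension]
    if format not in ITERATORS:
        raise ValueError("Unsupported format %r" % format)
    return ITERATORS[format]


# The extension is enough here...
parser = pick_iterator("example.fasta")
# ...but not here, so the format has to be stated explicitly.
same_parser = pick_iterator("fasta_seqs.txt", format="fasta")

records = list(parser(io.StringIO(">id1 demo\nACGT\nTTAA\n>id2\nGG\n")))
```

The caller's code stays the same whichever format is chosen; only the lookup changes.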
I expect the "helper functions" like File2SequenceIterator() to be used
for the simple cases where the user does not care about the minor
options we might offer for individual file formats (this would cover
beginners).
They are also nice for writing multiple file format test cases ;)
I see later in your email you suggested a generic Bio.SeqIO.parse(file)
function which would cope with multiple file formats. Was your point
more about what we call things?
I'm happy to go from File2SequenceIterator() to something like
SequenceIterator(), SequenceIter(), SeqRecordIter(), or just SeqIter() -
with matching versions like SeqList() and SeqDict()
However, I'm not so keen on "parse()" because it gives no clue as to
what it will return.
---
On the other point, filenames/handles. Right now, the individual
iterators only take a handle. This was a simplification I made to make
my life as straightforward as possible.
The File2SequenceIterator() function (and friends) can take a filename,
handle, or a string containing the contents of a file (in addition to
the format). However, these are done as three separate arguments.
I could have one argument that takes a file name or handle, and works it
out on its own. Bio.Nexus tries to do this for example. Having the
individual iterators also do this trick would be pretty simple (using a
shared utility function).
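A minimal sketch of such a shared utility, assuming modern Python 3; the name handle_from is hypothetical and this is not how Bio.Nexus does it. It opens (and later closes) the file when given a name, and passes an existing handle straight through.

```python
import contextlib
import io


@contextlib.contextmanager
def handle_from(source, mode="r"):
    """Yield a handle for source, which is either a file name or a handle."""
    if isinstance(source, str):
        # We opened the file ourselves, so we also close it on exit.
        with open(source, mode) as handle:
            yield handle
    else:
        # Assume it is already a handle; closing it stays the caller's job.
        yield source


# With an existing handle the same object comes back and is left open.
buffer = io.StringIO(">id1\nACGT\n")
with handle_from(buffer) as handle:
    first_line = handle.readline()
```

Each format-specific iterator could then wrap its argument with this helper and work with a handle from there on.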
The "contents of a file" string argument was handy when testing, but I
imagine this is not going to be a common situation. If people need
this, they can use Python's StringIO module to turn their data string
into a handle easily enough.
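For example (a small sketch in modern Python 3, where the class lives in the io module rather than in a separate StringIO module; the data string is made up):

```python
import io

data = ">seq1 first record\nACGT\n>seq2\nGGCC\n"

# Wrap the string so it behaves like a file opened for reading.
handle = io.StringIO(data)

# Anything that only reads or iterates over a handle will accept it.
ids = [line[1:].split()[0] for line in handle if line.startswith(">")]
```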
> You can in addition have a function
> Bio.SeqIO.parse that guesses the file type from the file name extension
> (as you have now for File2SequenceIterator), though that wouldn't work
> for file handles.
When dealing with a file handle, converting it to an undo file handle
would probably work - if we had code to guess the file format. I have
tried to raise a syntax error when a parser is given an invalid file -
which would mean we could just try some common file formats in order
until one works without a syntax error.
But I felt this was not needed right away, so I put it off.
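As a rough sketch of the "try each format until one parses" idea, assuming modern Python 3: the two toy parsers, the PARSERS list and guess_and_parse are all hypothetical, ValueError stands in for the syntax error mentioned above, and the input is simply buffered in memory instead of being wrapped in an undo handle.

```python
import io


def fasta_records(handle):
    """Toy strict FASTA parser; raises ValueError on non-FASTA input."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:], []
        elif seq_id is None:
            raise ValueError("FASTA data must start with '>'")
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def tab_records(handle):
    """Toy strict parser for 'id<TAB>sequence' lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("Expected exactly two tab-separated fields")
        yield fields[0], fields[1]


# Formats are tried in this order.
PARSERS = [("fasta", fasta_records), ("tab", tab_records)]


def guess_and_parse(handle):
    """Return (format_name, records) for the first parser that succeeds."""
    text = handle.read()  # buffer so every attempt starts from the beginning
    for name, parser in PARSERS:
        try:
            # list() runs the generator to completion, so errors surface here.
            return name, list(parser(io.StringIO(text)))
        except ValueError:
            continue
    raise ValueError("No known format matched")


guessed = guess_and_parse(io.StringIO("id1\tACGT\nid2\tTT\n"))
```

This only works if every parser is strict about rejecting input it does not understand, and the order of PARSERS decides ties (empty input, for instance, is accepted by the first parser here).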
> On a related note, I don't think we need the SequenceList and
> SequenceDict class. To make a list, one can do
>
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]
Currently that would be written:
from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]
Or even just the following, which I find simpler:
from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))
Versus the alternatives:
from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")
from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")
> To convert an iterator to a dictionary takes one line more, and is
> probably more straightforward than SequenceDict.
That was one thing I wanted to discuss - having a SequenceDict and
SequenceList class would let us add doc strings and perhaps methods like
maxlength, minlength, totallength, ...
Or, I can just use simple list and dict objects in the functions
File2SequenceList and File2SequenceDict.
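To make the first option concrete, here is a hedged sketch of what a list subclass with such methods might look like in modern Python 3. The Record namedtuple is a stand-in for a real record object, and the method names are only guesses at the maxlength/minlength/totallength idea.

```python
from collections import namedtuple

# Stand-in for a real record object; only .id and .seq are assumed.
Record = namedtuple("Record", ["id", "seq"])


class SequenceList(list):
    """A plain list of records plus a few summary helpers (sketch only)."""

    def max_length(self):
        """Length of the longest sequence, or 0 for an empty list."""
        return max((len(record.seq) for record in self), default=0)

    def min_length(self):
        """Length of the shortest sequence, or 0 for an empty list."""
        return min((len(record.seq) for record in self), default=0)

    def total_length(self):
        """Sum of all sequence lengths."""
        return sum(len(record.seq) for record in self)


records = SequenceList([Record("id1", "ACGT"), Record("id2", "TT")])
longest = records.max_length()
```

Because it subclasses list, indexing, slicing and iteration behave exactly as with the simple-list alternative.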
I have no strong preference on this issue - so unless someone else
speaks up, I'll go back to simple lists and dictionaries - keeps things
simple.
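With plain objects, building the dictionary from any record iterator is short. A sketch in modern Python 3, where Record and records_to_dict are hypothetical names and the duplicate check is an added assumption rather than something decided above:

```python
from collections import namedtuple

# Stand-in for a real record object; only .id and .seq are assumed.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records):
    """Build a plain dict keyed on record.id, refusing duplicate ids."""
    result = {}
    for record in records:
        if record.id in result:
            raise ValueError("Duplicate record id %r" % record.id)
        result[record.id] = record
    return result


iterator = iter([Record("id1", "ACGT"), Record("id2", "TT")])
record_dict = records_to_dict(iterator)
```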
Peter