[Biopython-dev] New Bio.SeqIO code

Sat Oct 28 11:59:13 UTC 2006

Michiel de Hoon wrote:
> Thanks, Peter!
> It looks very nice. Actually, I have been using an earlier version of 
> the new SeqIO module (from your code on Bugzilla) and found it to work 
> quite well.

Thank you - and good to hear the old version is working OK.

> A few short comments:
> 
> To parse a Fasta file using the new SeqIO looks like this:
> 
> from Bio.SeqIO import File2SequenceIterator
> for record in File2SequenceIterator("example.fasta") :
>      print record.id
>      print record.seq
> 
> I would rather have something like this:
> 
> from Bio.SeqIO import Fasta
> for record in Fasta.parse(open("example.fasta")):
>      print record.id
>      print record.seq
> 
> where Fasta.parse returns a FastaIterator object, and the argument is 
> either a file object or a file name.

I think you have raised two issues - file names/handles (discussed 
below), and the use of a generic function versus a format specific one 
(or at least the naming conventions).

I like the idea of a generic function File2SequenceIterator() which can 
be used on lots of different file formats, just by changing the 
arguments.  However, there is nothing to stop you using the underlying 
format specific iterators directly:

from Bio.SeqIO.FastaIO import FastaIterator
for record in FastaIterator(open("example.fasta")):
      print record.id
      print record.seq

(which is similar to your suggestion above)

As long as you don't need to use any file format specific options, then 
for every file format the style of the code is the same - but switching 
file formats takes a little more work:

from Bio.SeqIO.NexusIO import NexusIterator
for record in NexusIterator(open("example.nexus")):
      print record.id
      print record.seq

versus:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("example.nexus") :
      print record.id
      print record.seq

or, to give an example where the file extension is no use and the format 
must be explicitly stated:

from Bio.SeqIO import File2SequenceIterator
for record in File2SequenceIterator("nexus_seqs.txt", format="nexus") :
      print record.id
      print record.seq

I expect the "helper functions" like File2SequenceIterator() to be used 
for the simple cases where the user does not care about the minor 
options we might offer for individual file formats (this would cover 
beginners).

They are also nice for writing multiple file format test cases ;)

I see later in your email you suggested a generic Bio.SeqIO.parse(file) 
function which would cope with multiple file formats.  Was your point 
more about what we call things?

I'm happy to go from File2SequenceIterator() to something like 
SequenceIterator(), SequenceIter(), SeqRecordIter(), or just SeqIter() - 
with matching versions like SeqList() and SeqDict().

However, I'm not so keen on "parse()" because it gives no clue as to 
what it will return.

                                ---

On the other point, filenames/handles.  Right now, the individual 
iterators only take a handle.  This was a simplification I made to make 
my life as straightforward as possible.

The File2SequenceIterator() function (and friends) can take a filename, 
handle, or a string containing the contents of a file (in addition to 
the format).  However, these are done as three separate arguments.

I could have one argument that takes a file name or handle, and works it 
out on its own.  Bio.Nexus tries to do this for example.  Having the 
individual iterators also do this trick would be pretty simple (using a 
shared utility function).
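
As a rough sketch of what such a shared utility function might look like 
(the names as_handle() and first_line() are hypothetical and not part of 
the Bio.SeqIO code under discussion; the syntax is modern Python 3 rather 
than the Python 2 used elsewhere in this thread):

```python
import io


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a file name or an open handle.

    Hypothetical helper for illustration; not part of Bio.SeqIO.
    """
    if isinstance(source, str):
        # A string is assumed to be a file name, so open it here.
        return open(source, mode), True
    # Anything else is assumed to be a file-like object; use it as given.
    return source, False


def first_line(source):
    """Read the first line from either a file name or a handle."""
    handle, opened_here = as_handle(source)
    try:
        return handle.readline().rstrip("\n")
    finally:
        # Only close handles that this function opened itself.
        if opened_here:
            handle.close()


# Usage with an in-memory handle; a file name string would work the same way.
demo = io.StringIO(">seq1\nACGT\n")
demo_title = first_line(demo)
```

Returning the opened_here flag means the caller's own handle is never 
closed behind their back, while a file opened from a name is cleaned up.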

The "contents of a file" string argument was handy when testing, but I 
imagine this is not going to be a common situation.  If people need 
this, they can use Python's StringIO module to turn their data string 
into a handle easily enough.
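
For illustration, a minimal sketch of that approach.  It assumes Python 3, 
where the class lives in the io module (in the Python 2 of this thread it 
would be "from StringIO import StringIO"), and iter_fasta_titles() is a 
toy stand-in for a real iterator such as FastaIterator:

```python
from io import StringIO


def iter_fasta_titles(handle):
    """Yield the title of each FASTA record read from a handle.

    Toy stand-in for a real record iterator; illustration only.
    """
    for line in handle:
        if line.startswith(">"):
            yield line[1:].strip()


data = ">alpha\nACGT\n>beta\nGGCC\n"
handle = StringIO(data)  # wrap the string in a file-like object
titles = list(iter_fasta_titles(handle))
```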

> You can in addition have a function
> Bio.SeqIO.parse that guesses the file type from the file name extension 
> (as you have now for File2SequenceIterator), though that wouldn't work 
> for file handles.

When dealing with a file handle, converting it to an undo file handle 
would probably work - if we had code to guess the file format.  I have 
tried to raise a syntax error when a parser is given an invalid file - 
which would mean we could just try some common file formats in order 
until one works without a syntax error.

But I felt this was not needed right away, so I put it off.
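
A sketch of how the "try each format in turn" idea could work, under 
several assumptions: the two parsers below are toy examples rather than 
the real Bio.SeqIO ones, they signal invalid input with ValueError, and 
the data is buffered in a string and re-wrapped for each attempt instead 
of using an undo handle.  All names are hypothetical.

```python
from io import StringIO


def parse_fasta(handle):
    """Toy FASTA parser returning a list of (title, sequence) tuples."""
    records = []
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is None:
            raise ValueError("sequence data before first '>' line")
        else:
            chunks.append(line)
    if title is None:
        raise ValueError("no FASTA records found")
    records.append((title, "".join(chunks)))
    return records


def parse_tab(handle):
    """Toy parser for lines of the form 'title<TAB>sequence'."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError("expected exactly two tab-separated fields")
        records.append((parts[0], parts[1]))
    if not records:
        raise ValueError("no records found")
    return records


# Formats are tried in this order; the first parser to succeed wins.
PARSERS = [("fasta", parse_fasta), ("tab", parse_tab)]


def guess_and_parse(handle):
    """Return (format_name, records) for the first parser that accepts the data."""
    text = handle.read()  # buffer once so each parser starts from the top
    for name, parser in PARSERS:
        try:
            return name, parser(StringIO(text))
        except ValueError:
            continue
    raise ValueError("no known format matched")
```

The order of PARSERS matters: a permissive parser listed early could 
accept a file that really belongs to a later format.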

> On a related note, I don't think we need the SequenceList and 
> SequenceDict class. To make a list, one can do
> 
> from Bio.SeqIO import Fasta
> records = [record for record in Fasta.parse(open("example.fasta"))]

Currently that would be written:

from Bio.SeqIO.FastaIO import FastaIterator
records = [record for record in FastaIterator(open("example.fasta"))]

Or even just the following, which I find simpler:

from Bio.SeqIO.FastaIO import FastaIterator
records = list(FastaIterator(open("example.fasta")))

Versus the alternatives:

from Bio.SeqIO import File2SequenceList
records = File2SequenceList("example.fasta")

from Bio.SeqIO import File2SequenceDict
record_dict = File2SequenceDict("example.fasta")

> To convert an iterator to a dictionary takes one line more, and is 
> probably more straightforward than SequenceDict.

That was one thing I wanted to discuss - having a SequenceDict and 
SequenceList class would let us add doc strings and perhaps methods like 
maxlength, minlength, totallength, ...

Or, I can just use simple list and dict objects in the functions 
File2SequenceList and File2SequenceDict.
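
To make the two options concrete, here is a rough sketch.  Record is a 
namedtuple stand-in for a real sequence record, and to_dict() and this 
SequenceDict are hypothetical illustrations, not the actual code:

```python
from collections import namedtuple

# Stand-in for a real sequence record with .id and .seq attributes.
Record = namedtuple("Record", ["id", "seq"])


def to_dict(records):
    """Option 1: build a plain dict keyed on record id."""
    result = {}
    for record in records:
        if record.id in result:
            raise ValueError("duplicate id: %s" % record.id)
        result[record.id] = record
    return result


class SequenceDict(dict):
    """Option 2: a dict subclass that adds summary methods."""

    def maxlength(self):
        # Raises ValueError if the dictionary is empty.
        return max(len(record.seq) for record in self.values())

    def minlength(self):
        # Raises ValueError if the dictionary is empty.
        return min(len(record.seq) for record in self.values())

    def totallength(self):
        return sum(len(record.seq) for record in self.values())


records = [Record("a", "ACGT"), Record("b", "GG"), Record("c", "TTTAAA")]
plain = to_dict(iter(records))
fancy = SequenceDict(to_dict(iter(records)))
```

With the plain dict, the same figures are one-liners at the call site, 
e.g. max(len(r.seq) for r in plain.values()).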

I have no strong preference on this issue - so unless someone else 
speaks up, I'll go back to simple lists and dictionaries - keeps things 
simple.

Peter