[Biopython-dev] New Bio.SeqIO code

Peter (BioPython Dev) biopython-dev at maubp.freeserve.co.uk
Fri Nov 3 11:48:17 UTC 2006


My apologies for this somewhat long email.

Handles and Filenames
=====================

Currently the individual format specific iterators just require a handle
(and not a filename).  Are we all happy with this?

Michiel de Hoon wrote:
>> While it does sound like a nice idea for the end user, the idea of 
>> filenames and handles is pretty important in python, and maybe we 
>> shouldn't worry about forcing newcomers to deal with handles.  After
>> all, the SeqIO system will make them deal with iterators and
>> SeqRecords which I think are far more complicated!
>> 
>> What do you think Michiel?
> 
> My preferred solution would be for File2SequenceIterator to take
> handles only.

Assuming we keep the non-ambiguous file extension to file format
mappings, allowing a filename as a possible argument to
File2SequenceIterator (and any variants) makes good sense.

Note that most handle objects have a "name" attribute to get the
filename, which could be used to determine the file extension, i.e. we
can still do the file extension to file format mapping using just a file
handle (instead of a filename).
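
As a rough sketch of that idea (not the actual Bio.SeqIO code), the lookup
could work from the handle's "name" attribute.  The function name
guess_format_from_handle and the mapping table are hypothetical, and the
table only lists the extensions used in the examples later in this email:

```python
import io
import os

# Hypothetical mapping, for illustration only; not the real Bio.SeqIO table.
_EXTENSION_TO_FORMAT = {
    ".faa": "fasta",
    ".fasta": "fasta",
    ".gbk": "genbank",
    ".sth": "stockholm",
}


def guess_format_from_handle(handle):
    """Return a format name based on handle.name, or None if unknown."""
    name = getattr(handle, "name", None)
    # Some handles have no name (StringIO), or a non-string name
    # (files opened from a file descriptor), so check before using it.
    if not isinstance(name, str):
        return None
    extension = os.path.splitext(name)[1].lower()
    return _EXTENSION_TO_FORMAT.get(extension)


# An in-memory handle has no "name", so no format can be inferred.
in_memory_guess = guess_format_from_handle(io.StringIO(">id\nACGT\n"))
```

Note the fallback to None: a handle without a usable name would still need
an explicit format argument.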

Currently File2SequenceIterator has separate named arguments for a
handle, filename and format.  If no handle is provided, it will open one
using the filename provided.
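
The argument handling described here can be sketched roughly as follows.
This is only an illustration under my own assumptions: resolve_handle is a
hypothetical helper, not a function in Bio.SeqIO, and the real code may
treat the arguments differently.

```python
import os
import tempfile


def resolve_handle(handle=None, filename=None):
    """Return (handle, opened_here) given either a handle or a filename."""
    if handle is not None:
        return handle, False      # the caller owns this handle
    if filename is None:
        raise ValueError("Need either a handle or a filename")
    return open(filename), True   # opened here, so it should be closed here


# Usage: write a small temporary file, then resolve it by filename.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">demo\nACGT\n")

handle, opened_here = resolve_handle(filename=tmp.name)
try:
    first_line = handle.readline()
finally:
    if opened_here:
        handle.close()
os.remove(tmp.name)
```

Tracking who opened the handle matters because a function that opens a
file from a filename is also the one that should close it.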

We could make the handle and format the first arguments as a compromise?

If we drop the extension to file format mapping (see below), then I
agree File2SequenceIterator could just expect a handle and not a filename.

Guessing File Formats
=====================

>> Chris Lasher wrote:
>>> Which brings me to the issue of "guessing" a file's format.
>>> Yikes, again! I'd expect that kind of "magickery" from Perl, but
>>> once again, explicit is better than implicit. I honestly think
>>> it's not too much to expect the user to know what filetype
>>> they're expecting BioPython to deal with. Could you guys please
>>> explain the motivation behind this to me?

Michiel de Hoon wrote:
> I am leaning towards Chris' opinion. File type guessing (from
> extension or file contents) doesn't seem really necessary. At least,
> I don't remember a user asking for it. The benefits of file type
> guessing from the extension are minimal (since a user can probably do
> that more reliably himself, knowing the file names he's likely to
> encounter). And since file type guessing will not be foolproof, it
> may even be confusing. Once file type guessing is available in
> Biopython though, we're committed to it and we'll have to support it.
> So I'd be happier without the file type guessing functionality.
> 
> That said, if somebody really wants it, I can live with it.

I agree that we shouldn't implement file format guessing based on the
contents of a file (unless, as you say, we get strong feedback wanting it).

I personally want the file extension to format mapping, but then I am
fairly disciplined about using file extensions.  As I seem to be the
only voice advocating this, it looks like I may have to give in...

Is it worth asking on the main discussion list to canvass opinion?

Maybe we should settle on the function names before doing that - it
would be better to replace the current function names now, before too many
people are used to them.

Functions and Naming
====================

This is where I think things stand for Bio/SeqIO/__init__.py

We have functions to do the following, where "file" may mean just a
handle, or perhaps the choice of a handle or filename (see above):

(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File

Possible names without the digit two: FileToSequenceIterator,
SequencesToDict, SequencesToAlignment, and SequencesToFile

I think Michiel wanted to drop the following "wrapper functions" as code
bloat:

(*) File to list of SeqRecord objects, currently File2SequenceList
     Just use list(File2SequenceIterator(...)) instead

(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
     Just use SequenceIter2Dict(File2SequenceIterator(...)) instead

(*) File to alignment, currently File2Alignment
     Just use Iter2Alignment(File2SequenceIterator(...))
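
To show how thin such wrappers are, here is a self-contained toy version.
None of this is the Bio.SeqIO code: toy_fasta_iterator yields plain
(identifier, sequence) tuples rather than SeqRecord objects, and all four
function names are made up for the illustration.

```python
from io import StringIO


def toy_fasta_iterator(handle):
    """Yield (identifier, sequence) pairs from a FASTA handle (toy parser)."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first word of the title line as the identifier.
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def records_to_dict(records):
    """Build a dictionary keyed by identifier, rejecting duplicate keys."""
    result = {}
    for identifier, sequence in records:
        if identifier in result:
            raise ValueError("Duplicate key %r" % identifier)
        result[identifier] = sequence
    return result


def file_to_list(handle):
    """Wrapper: just list() applied to the iterator."""
    return list(toy_fasta_iterator(handle))


def file_to_dict(handle):
    """Wrapper: just records_to_dict() applied to the iterator."""
    return records_to_dict(toy_fasta_iterator(handle))


example = ">alpha first record\nACGT\nAC\n>beta\nGGTT\n"
as_list = file_to_list(StringIO(example))
as_dict = file_to_dict(StringIO(example))
```

Each wrapper is a one-line composition, which is the code bloat argument;
the counter-argument is only about convenience for the caller.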

The reason I invented the above three examples was so I could do things
like this in one line (assuming my files have valid known extensions):

rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align    = File2Alignment(filename="demo.sth")

or perhaps:

align    = File2Alignment(filename="demo.aln", format="clustal")

The alternative suggestions seem to lead to using file handles and an
explicit format, with a second function to convert from an iterator if
required.  While this can be done in one line, I find the following
much less straightforward:

rec_iter = File2SequenceIterator(open("demo.faa"), "fasta")

rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank"))

rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"),
                                                    "fasta"))

align = Iter2Alignment(File2SequenceIterator(open("demo.sth"),
                                              "stockholm"))


Peter





