[Biopython-dev] New Bio.SeqIO code
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Fri Nov 3 11:48:17 UTC 2006
My apologies for this somewhat long email.
Handles and Filenames
=====================
Currently the individual format specific iterators just require a handle
(and not a filename). Are we all happy with this?
Michiel de Hoon wrote:
>> While it does sound like a nice idea for the end user, the idea of
>> filenames and handles is pretty important in python, and maybe we
>> shouldn't worry about forcing newcomers to deal with handles. After
>> all, the SeqIO system will make them deal with iterators and
>> SeqRecords which I think are far more complicated!
>>
>> What do you think Michiel?
>
> My preferred solution would be for File2SequenceIterator to take
> handles only.
Assuming we keep the non-ambiguous file extension to file format
mappings, allowing a filename as a possible argument to
File2SequenceIterator (and any variants) makes good sense.
Note that most handle objects have a "name" attribute giving the
filename, which could be used to determine the file extension. In other
words, we can still do the file extension to file format mapping using
just a file handle (instead of a filename).
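To make the handle-based lookup concrete, here is a rough sketch. It is not the current Bio.SeqIO code: the function name format_from_handle and the extension table are hypothetical and only illustrate the idea. Handles without a usable name (for example StringIO objects) simply give no answer.

```python
import os

# Hypothetical extension-to-format table; the entries are illustrative only.
EXTENSION_TO_FORMAT = {
    ".faa": "fasta",
    ".fasta": "fasta",
    ".gbk": "genbank",
    ".sth": "stockholm",
    ".aln": "clustal",
}


def format_from_handle(handle):
    """Return a format name based on handle.name, or None if unknown."""
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        # Some handles have no name at all, and a handle opened from a
        # file descriptor carries an integer name rather than a path.
        return None
    extension = os.path.splitext(name)[1].lower()
    return EXTENSION_TO_FORMAT.get(extension)
```

A caller would still be free to pass an explicit format and ignore this lookup entirely.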
Currently File2SequenceIterator has separate named arguments for a
handle, filename and format. If no handle is provided, it will open one
using the filename provided.
We could make the handle and format the first arguments as a compromise?
If we drop the extension to file format mapping (see below), then I
agree File2SequenceIterator could just expect a handle and not a filename.
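As a sketch of the compromise argument order (handle and format first, filename as an optional keyword), something like the following could work. The helper name resolve_handle and its return value are my own invention for illustration, not part of the existing module.

```python
import io


def resolve_handle(handle=None, format=None, filename=None):
    """Return (handle, format, opened_here) for the given arguments.

    Hypothetical helper: a handle is used as given, otherwise the
    filename is opened. The format is passed through unchanged.
    """
    if handle is not None and filename is not None:
        raise ValueError("Give a handle or a filename, not both")
    if handle is None:
        if filename is None:
            raise ValueError("Give either a handle or a filename")
        # We opened the file ourselves, so the caller is told to close it.
        return open(filename), format, True
    return handle, format, False


# Positional use with a handle and an explicit format:
handle, fmt, opened_here = resolve_handle(io.StringIO(">demo\nACGT\n"), "fasta")
```

The third return value records who opened the file, which matters if the function is later expected to close handles it created.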
Guessing File Formats
=====================
>> Chris Lasher wrote:
>>> Which brings me to the issue of "guessing" a file's format.
>>> Yikes, again! I'd expect that kind of "magickery" from Perl, but
>>> once again, explicit is better than implicit. I honestly think
>>> it's not too much to expect the user to know what filetype
>>> they're expecting BioPython to deal with. Could you guys please
>>> explain the motivation behind this to me?
Michiel de Hoon wrote:
> I am leaning towards Chris' opinion. File type guessing (from
> extension or file contents) doesn't seem really necessary. At least,
> I don't remember a user asking for it. The benefits of file type
> guessing from the extension are minimal (since a user can probably do
> that more reliably himself, knowing the file names he's likely to
> encounter). And since file type guessing will not be foolproof, it
> may even be confusing. Once file type guessing is available in
> Biopython though, we're committed to it and we'll have to support it.
> So I'd be happier without the file type guessing functionality.
>
> That said, if somebody really wants it, I can live with it.
I agree that we shouldn't implement file format guessing based on the
contents of a file (unless, as you say, we get strong feedback wanting it).
I personally want the file extension to format mapping, but then I am
fairly disciplined about using file extensions. As I seem to be the
only voice advocating this, it looks like I may have to give in...
Is it worth asking on the main discussion list to canvass opinion?
Maybe we should settle on the function names before doing that - it
would be better to replace the current function names now, before too
many people are used to them.
Functions and Naming
====================
This is where I think things stand for Bio/SeqIO/__init__.py
We have functions to do the following, where "file" may mean just a
handle, or perhaps the choice of a handle or filename (see above):
(*) File to SeqRecord iterator, currently File2SequenceIterator
(*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
(*) SeqRecord iterator/list to alignment, currently Iter2Alignment
(*) Write SeqRecord iterator/list to a file, currently Sequences2File
Possible names without the digit two: FileToSequenceIterator,
SequencesToDict, SequencesToAlignment, and SequencesToFile
I think Michiel wanted to drop the following "wrapper functions" as code
bloat:
(*) File to list of SeqRecord objects, currently File2SequenceList
Just use list(File2SequenceIterator(...)) instead
(*) File to dictionary of SeqRecord objects, currently File2SequenceDict
Just use SequenceIter2Dict(File2SequenceIterator(...)) instead
(*) File to alignment, currently File2Alignment
Just use Iter2Alignment(File2SequenceIterator(...))
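To show how thin these wrappers are, here is a self-contained sketch of the composition. It does not use the real Bio.SeqIO functions: fasta_iterator is a simplified stand-in that yields (identifier, sequence) tuples instead of SeqRecord objects, and records_to_dict is a hypothetical equivalent of SequenceIter2Dict.

```python
import io


def fasta_iterator(handle):
    """Yield (identifier, sequence) tuples from a FASTA handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first word after ">" as the identifier.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def records_to_dict(records):
    """Build a dictionary keyed on identifier, refusing duplicates."""
    result = {}
    for identifier, sequence in records:
        if identifier in result:
            raise ValueError("Duplicate identifier: %s" % identifier)
        result[identifier] = sequence
    return result


EXAMPLE = ">alpha first record\nACGT\nTTGA\n>beta\nGGCC\n"

# The "list" wrapper is just list() applied to the iterator.
rec_list = list(fasta_iterator(io.StringIO(EXAMPLE)))
# The "dict" wrapper is one more function call around the iterator.
rec_dict = records_to_dict(fasta_iterator(io.StringIO(EXAMPLE)))
```

Each wrapper saves one nested call at most, which is the code bloat argument in a nutshell.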
The reason I invented the above three examples was so I could do things
like this in one line (assuming my files have valid known extensions):
rec_iter = File2SequenceIterator(filename="demo.faa")
rec_list = File2SequenceList(filename="demo.gbk")
rec_dict = File2SequenceDict(filename="demo.fasta")
align = File2Alignment(filename="demo.sth")
or perhaps:
align = File2Alignment(filename="demo.aln", format="clustal")
The alternative suggestions seem to lead to using file handles and an
explicit format, with a second function to convert from an iterator if
required. While this can be done in one line, I find the following
much less straightforward:
rec_iter = File2SequenceIterator(open("demo.faa"), "fasta")
rec_list = list(File2SequenceIterator(open("demo.gbk"), "genbank"))
rec_dict = SequenceIter2Dict(File2SequenceIterator(open("demo.fasta"),
                                                   "fasta"))
align = Iter2Alignment(File2SequenceIterator(open("demo.sth"),
                                             "stockholm"))
Peter