[Biopython-dev] New Bio.SeqIO code

Iddo Friedberg idoerg at burnham.org
Fri Nov 10 07:30:17 UTC 2006

Michiel de Hoon wrote:
> Peter (BioPython Dev) wrote:
>> Currently the individual format specific iterators just require a handle
>> (and not a filename).  Are we all happy with this?
> Happy.

I second that.

I have two arguments against accepting filenames:

1) It is standard practice in Biopython to pass file handles as arguments 
to a parser rather than filenames. If we break this, we would have to 
start thinking about which parser takes a handle and which a filename. 
Things will be a mess.

2) Also, what if you are not passing a real file? E.g. I have 
applications that pass StringIO streams into the parser. You are 
lumping two levels of IO into one, and IMHO that is bad practice. In 
other words, a filehandle can always be generated from a file, easily:
 >>> filefunc(open('myfile'))

but you cannot generate a file from a filehandle type of data. OK, you 
can programmatically generate a tmp file for reading, but that places a 
burden on the user.
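
To illustrate the point, here is a minimal sketch. The function 
count_fasta_records is hypothetical (it is not part of Biopython) and 
only stands in for any parser that takes a handle. The same function 
serves an in-memory StringIO stream and a real file, and only the caller 
decides where the data comes from.

```python
import io
import os
import tempfile


def count_fasta_records(handle):
    """Hypothetical handle-based parser: count the FASTA title lines
    in any file-like object that yields text lines."""
    return sum(1 for line in handle if line.startswith(">"))


data = ">seq1\nACGT\n>seq2\nGGCC\n"

# In-memory stream: nothing has to exist on disk.
from_string = count_fasta_records(io.StringIO(data))

# Real file: the caller opens it and passes the handle along.
# The temporary file exists only to give this sketch something to open.
with tempfile.NamedTemporaryFile("w", suffix=".tfa", delete=False) as tmp:
    tmp.write(data)
with open(tmp.name) as handle:
    from_file = count_fasta_records(handle)
os.remove(tmp.name)
```

Had the parser insisted on a filename, the StringIO case would have 
needed a temporary file written by the user first.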

3) The last argument against rigid filename extensions is 
interoperability with other applications that generate those files. 
Suppose you have one application that generates fasta files with a .tfa 
extension, and another with a .fa extension and yet a third with .pfa 
extensions... and those extensions are important to you for other 
reasons, like knowing which is a nucleic acid file and which is protein. 
Actually, all the NCBI genomic files are built like this... :)

OK, three arguments. I think that relying on filename extensions for 
content is rather DOS-ish and places an extra burden on the user. I'm 
suffering enough on my Windows machine with Rasmol trying to open all my 
.pdb files. Including those where pdb stands for "Palm Pilot database" 
rather than Protein Data Bank.

>> We could make the handle and format the first arguments as a compromise?
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise, 
> than to remove one.

We could add the format as an OPTIONAL keyword argument, with a "None" 
default value, and have the parser recognize the format from a lookahead 
using a magic regexp for each format. The user-passed format overrides 
the parser guesswork. Shouldn't be too hard to implement, as file 
formats are very distinct.
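
A rough sketch of what I mean, under some assumptions: the function name 
guess_format, the MAGIC_PATTERNS table and the three patterns in it are 
made up for illustration, not existing Biopython code, and the handle is 
assumed to be seekable so that the lookahead can be undone.

```python
import io
import re

# Illustrative "magic" patterns, matched against the first non-blank line.
# A real table would need one carefully tested entry per supported format.
MAGIC_PATTERNS = {
    "fasta": re.compile(r"^>"),
    "genbank": re.compile(r"^LOCUS\s"),
    "clustal": re.compile(r"^CLUSTAL"),
}


def guess_format(handle, format=None):
    """Return a format name for the data on ``handle``.

    An explicit ``format`` from the user overrides the guesswork.
    Assumes a seekable handle; the read position is restored afterwards.
    """
    if format is not None:
        return format
    start = handle.tell()
    line = handle.readline()
    # Skip leading blank lines before testing the patterns.
    while line and not line.strip():
        line = handle.readline()
    handle.seek(start)  # undo the lookahead
    for name, pattern in MAGIC_PATTERNS.items():
        if pattern.match(line):
            return name
    raise ValueError("could not guess the file format")
```

For example, guess_format(io.StringIO(">seq1\nACGT\n")) gives "fasta", 
while passing format="genbank" for the same data returns "genbank" 
without looking at the content. A stream that cannot seek (a pipe, say) 
would need the lookahead lines buffered instead.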

>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions.  As I seem to be the
>> only voice advocating this, it looks like I may have to give in...
>> Is it worth asking on the main discussion list to canvass opinion?
> Sure, go ahead. But ask for *why* a user wants file extension to format 
> mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to 
> know which usage case that we haven't thought about yet warrants file 
> extension to format mapping.
>> We have functions to do the following, where "file" may mean just a
>> handle, or perhaps the choice of a handle or filename (see above):
>> (*) File to SeqRecord iterator, currently File2SequenceIterator
>> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
>> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
>> (*) Write SeqRecord iterator/list to a file, currently Sequences2File
> If:
>    File2SequenceIterator doesn't infer the file format from the extension
> and
>    File2SequenceIterator takes handles only, so no file names,
> then:
>    Why do we need the File2SequenceIterator function?
> Btw, we should make a new Biopython release once the dust settles.
> --Michiel.

Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
T: +1 858 646 3100 x3516
