[Biopython-dev] New Bio.SeqIO code
Iddo Friedberg
idoerg at burnham.org
Fri Nov 10 07:30:17 UTC 2006
Michiel de Hoon wrote:
> Peter (BioPython Dev) wrote:
>> Currently the individual format specific iterators just require a handle
>> (and not a filename). Are we all happy with this?
>
> Happy.
I second that.
I have two arguments against accepting filenames:
1) It is standard practice in biopython to pass file handles as arguments
to a parser rather than filenames. If we break this, we would have to
start remembering which parser takes a handle and which a filename.
Things will be a mess.
2) Also, what if you are not passing a real file? E.g. I have
applications that pass StringIO streams into the parser. You are
lumping two levels of IO into one, and IMHO that is bad practice. In
other words, a filehandle can always be generated from a file, easily
>>> filefunc(open('myfile'))
but you cannot generate a file from a filehandle type of data. OK, you
can programmatically generate a tmp file for reading, but that places a
burden on the user.
3) The last argument against rigid filename extensions is
interoperability with other applications that generate those files.
Suppose you have one application that generates fasta files with a .tfa
extension, and another with a .fa extension and yet a third with .pfa
extensions... and those extensions are important to you for other
reasons, like knowing which is a nucleic acid file and which is protein.
Actually, all the NCBI genomic files are built like this... :)
OK, three arguments. I think that relying on filename extensions for
content is rather DOS-ish and places an extra burden on the user. I'm
suffering enough on my Windows machine with Rasmol trying to open all my
.pdb files. Including those where pdb stands for "Palm Pilot database"
rather than Protein Data Bank.
>
>> We could make the handle and format the first arguments as a compromise?
>
> If in doubt, don't add it to Biopython!
> It's much easier to add a functionality later, should the need arise,
> than to remove one.
We could add the format as an OPTIONAL keyword argument, with a "None"
default value, and have the parser recognize the format from a lookahead
using a magic regexp for each format. A user-supplied format overrides
the parser guesswork. Shouldn't be too hard to implement, as file
formats are very distinct.
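A rough sketch of that lookahead idea follows. The function name sniff,
the MAGIC table and the three patterns in it are assumptions for
illustration, not existing Biopython code, and a real implementation
would need more formats and more careful patterns. The first line is read
for the guess and then chained back in front of the rest, so the trick
also works on handles that cannot seek.

```python
import re
from io import StringIO
from itertools import chain

# Hypothetical "magic" patterns, tested against the first line only.
MAGIC = [
    ("fasta", re.compile(r"^>")),
    ("genbank", re.compile(r"^LOCUS\s")),
    ("clustal", re.compile(r"^CLUSTAL")),
]


def sniff(handle, format=None):
    """Return (format, lines), where lines still starts at the first line.

    If format is given it is used as-is; otherwise it is guessed from
    the first line of the handle.
    """
    lines = iter(handle)
    first = next(lines, "")
    if first:
        # Put the line we peeked at back in front of the others.
        lines = chain([first], lines)
    if format is None:
        for name, pattern in MAGIC:
            if pattern.match(first):
                format = name
                break
        else:
            raise ValueError("could not recognise the file format")
    return format, lines


guessed, lines = sniff(StringIO(">seq1\nACGT\n"))
forced, _ = sniff(StringIO(">seq1\nACGT\n"), format="genbank")
```

Here guessed comes from the content alone, while forced shows the
user-supplied format winning over the guess.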
>
>> I personally want the file extension to format mapping, but then I am
>> fairly disciplined about using file extensions. As I seem to be the
>> only voice advocating this, it looks like I may have to give in...
>>
>> Is it worth asking on the main discussion list to canvass opinion?
>
> Sure, go ahead. But ask for *why* a user wants file extension to format
> mapping (so just "Yeah, I'd like that..." doesn't count). I'd like to
> know which usage case that we haven't thought about yet warrants file
> extension to format mapping.
>
>> We have functions to do the following, where "file" may mean just a
>> handle, or perhaps the choice of a handle or filename (see above):
>>
>> (*) File to SeqRecord iterator, currently File2SequenceIterator
>> (*) SeqRecord iterator/list to dictionary, currently SequenceIter2Dict
>> (*) SeqRecord iterator/list to alignment, currently Iter2Alignment
>> (*) Write SeqRecord iterator/list to a file, currently Sequences2File
>
> If:
> File2SequenceIterator doesn't infer the file format from the extension
> and
> File2SequenceIterator takes handles only, so no file names,
> then:
> Why do we need the File2SequenceIterator function?
>
> Btw, we should make a new Biopython release once the dust settles.
>
> --Michiel.
--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org