[Biopython-dev] New Bio.SeqIO code
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Tue Nov 14 00:49:02 UTC 2006
Iddo Friedberg wrote:
> 3) The last argument against rigid filename extensions is
> interoperability with other applications that generate those files.
> Suppose you have one application that generates fasta files with a
> .tfa extension, and another with a .fa extension and yet a third with
> .pfa extensions... and those extensions are important to you for
> other reasons, like knowing which is a nucleic acid file and which is
> protein. Actually, all the NCBI genomic files are built like this...
> :)
Interesting tidbit.
If you are using "exotic" file extensions, then you would have to
explicitly tell my Bio.SeqIO code the file's format.
Although "fa" is currently a known extension mapped to fasta format in
Bio.SeqIO, your other examples are not. Are these other extensions used
outside the internal systems of the NCBI?
> OK, three arguments. I think that relying on filename extensions for
> content is rather DOS-ish and places an extra burden on the user.
I'm not trying to force anyone into using specific filename extensions -
I'm trying to make life easier for people who already do this (or who
download their data from online sources like the NCBI or PFAM - which do
seem to be consistent in their naming conventions).
> I'm suffering enough on my Windows machine with Rasmol trying to open
> all my .pdb files. Including those where pdb stands for "Palm Pilot
> database" rather than Protein Data Bank.
Yes - multiple interpretations of a given file extension are a problem.
I've noticed that same PDB extension clash too (but I don't use a Palm
Pilot any more).
Can anyone think of any common extensions used for more than one file
format? I know Clustal uses *.aln for its alignments which is perhaps
asking for trouble...
> We could add the format as an OPTIONAL keyword argument, with a "None"
> default value. And have the parser recognize the format from a
> lookahead using a magic regexp for each format. The user passed
> format overrides the parser guesswork. Shouldn't be too hard to
> implement, as file formats are very distinct.
Currently the format is an optional keyword argument defaulting to None.
When it is omitted, I use a limited mapping of filename extensions to
formats (assuming the filename is available) to guess the format.
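
To make the behaviour described above concrete, here is a minimal sketch
of an extension-based guess with an explicit override. It is not the
actual Bio.SeqIO code: the function name guess_format, the table name
EXTENSION_TO_FORMAT and the "fasta" entry are hypothetical, and only the
"fa" to fasta mapping is taken from this thread.

```python
import os

# Hypothetical extension-to-format table. Only "fa" -> "fasta" is
# mentioned in this thread; the "fasta" entry is an assumption.
EXTENSION_TO_FORMAT = {
    "fa": "fasta",
    "fasta": "fasta",
}


def guess_format(filename=None, format=None):
    """Return a format name, preferring an explicit format over a guess."""
    # An explicitly supplied format always overrides the guesswork.
    if format is not None:
        return format.lower()
    if filename is None:
        raise ValueError("No format given and no filename to guess from")
    # Take the text after the last dot, without the dot, in lower case.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    try:
        return EXTENSION_TO_FORMAT[extension]
    except KeyError:
        raise ValueError(
            "Unknown extension %r, please specify the format" % extension
        )
```

With this sketch, guess_format("seqs.fa") gives "fasta", while a file
with an "exotic" extension such as "genome.tfa" raises ValueError unless
the caller passes format="fasta" explicitly.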
Peter