[Biopython-dev] New Bio.SeqIO code
Iddo Friedberg
idoerg at burnham.org
Tue Nov 14 17:19:14 UTC 2006
Peter (BioPython Dev) wrote:
> Iddo Friedberg wrote:
>> 3) The last argument against rigid filename extensions is
>> interoperability with other applications that generate those files.
>> Suppose you have one application that generates fasta files with a
>> .tfa extension, and another with a .fa extension and yet a third with
>> .pfa extensions... and those extensions are important to you for
>> other reasons, like knowing which is a nucleic acid file and which is
>> protein. Actually, all the NCBI genomic files are built like this...
>> :)
>
> Interesting tidbit.
>
> If you are using "exotic" file extensions, then you would have to
> explicitly tell my Bio.SeqIO code the file's format.
>
> Although "fa" is currently a known extension mapped to fasta format in
> Bio.SeqIO, your other examples are not. Are these other extensions used
> outside the internal systems of the NCBI?
I would not call it a tidbit, or exotic. It is very prevalent: NCBI's GenBank
genomic repositories are very much deferred to. The point is, since NCBI uses
one standard of file extensions for its genomic databases, TIGR another
(actually, TIGR points to GenBank for completed genomes), and UCSC a third,
relying on file suffixes may not be such a great idea.
See for example the E. coli genome:
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12
Several of the files there are in fasta format but have different contents:
whole genome, noncoding RNA, protein. The same goes for those in GenBank
format. So the NCBI suffixes denote not only the file format, but the
biological content as well.
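
To make the "format plus content" point concrete, here is a small sketch of such a suffix table. The suffixes are ones commonly seen in NCBI bacterial genome directories, but the table itself, the content descriptions, and the describe() helper are illustrative assumptions and not part of Bio.SeqIO:

```python
import os

# Illustrative table only (an assumption, not an official NCBI specification):
# each suffix implies both a file format and a kind of biological content.
NCBI_SUFFIXES = {
    ".fna": ("fasta", "whole genome nucleotide sequence"),
    ".ffn": ("fasta", "nucleotide sequences of protein-coding genes"),
    ".frn": ("fasta", "nucleotide sequences of RNA genes"),
    ".faa": ("fasta", "protein sequences"),
    ".gbk": ("genbank", "annotated genome record"),
}


def describe(filename):
    """Return (format, content) for a known suffix, or None if unknown.

    Hypothetical helper, named here only for illustration.
    """
    suffix = os.path.splitext(filename)[1].lower()
    return NCBI_SUFFIXES.get(suffix)


faa_info = describe("example.faa")
txt_info = describe("example.txt")
```

Four different suffixes map to the same parser format here, which is why a plain extension-to-format mapping loses information that the file name was meant to carry.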
Also, for the reasons I gave in my previous email, I think we should
stick with passing file handles, not file names.
There is no real need to pass a filename rather than a file handle.
If you need information from the filename, you can read the filename
from the file handle:
>>> foo = open('foo')
>>> print(foo.name)
foo
And the functions could still accept StringIO streams if needed.
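
A minimal sketch of the handle-based approach follows. The function names read_fasta_titles() and handle_name() are hypothetical and are not the Bio.SeqIO API; the point is only that the same code serves real files and in-memory streams:

```python
import io


def read_fasta_titles(handle):
    """Yield the title of each fasta record from any open text handle.

    Hypothetical helper: it only needs the handle to be iterable by line.
    """
    for line in handle:
        if line.startswith(">"):
            yield line[1:].rstrip()


def handle_name(handle):
    """Return the handle's filename if it has one, else None."""
    return getattr(handle, "name", None)


# An in-memory stream works exactly like an opened file.
stream = io.StringIO(">seq1 first\nACGT\n>seq2 second\nGGCC\n")
titles = list(read_fasta_titles(stream))
stream_name = handle_name(stream)  # StringIO objects carry no filename
```

Code that wants the filename can still ask the handle for it, while callers holding a stream with no name (a pipe, a socket, a StringIO) are not locked out.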
>
>>
>
> I'm not trying to force anyone into using specific filename extensions -
> I'm trying to make life easier for people who already do this (or who
> download their data from online sources like the NCBI or PFAM - which do
> seem to be consistent in their naming conventions).
>
You cannot rely on such consistency prevailing. Especially not with NCBI. ;)
>
>> We could add the format as an OPTIONAL keyword argument, with a "None"
>> default value, and have the parser recognize the format from a
>> lookahead using a magic regexp for each format. The user-passed
>> format overrides the parser guesswork. Shouldn't be too hard to
>> implement, as file formats are very distinct.
>
> Currently the format is an optional keyword argument defaulting to None.
> When it is omitted, I currently use a limited filename extension to
> format mapping (assuming the filename is available) to deduce/guess the
> format.
>
Ideally, the data format should be supplied by the user. Second best is
inferring it from parsing the first line or so of the file. Third is the
filename extension. But neither the second nor the third option is very
good practice, IMHO.
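
The order of preference above can be sketched as follows. This is only an illustration under stated assumptions: guess_format(), the SIGNATURES patterns, and the EXTENSIONS table are hypothetical and are not the Bio.SeqIO implementation (only the "fa" to fasta mapping is mentioned in this thread):

```python
import io
import os
import re

# Assumed first-line signatures, one "magic regexp" per format.
SIGNATURES = [
    ("fasta", re.compile(r"^>")),
    ("genbank", re.compile(r"^LOCUS\s")),
    ("embl", re.compile(r"^ID\s")),
    ("clustal", re.compile(r"^CLUSTAL")),
]

# Assumed, deliberately small extension table used only as a last resort.
EXTENSIONS = {".fa": "fasta", ".fasta": "fasta", ".gbk": "genbank"}


def guess_format(handle, format=None):
    """Resolve a format name: explicit argument, then lookahead, then suffix."""
    # 1) A user-supplied format always overrides any guesswork.
    if format is not None:
        return format
    # 2) Look ahead at the first line, then restore the read position.
    if handle.seekable():
        start = handle.tell()
        first_line = handle.readline()
        handle.seek(start)
        for name, pattern in SIGNATURES:
            if pattern.match(first_line):
                return name
    # 3) Fall back on the filename extension, if the handle has a filename.
    filename = getattr(handle, "name", None)
    if isinstance(filename, str):
        suffix = os.path.splitext(filename)[1].lower()
        return EXTENSIONS.get(suffix)
    return None


fasta_handle = io.StringIO(">seq1\nACGT\n")
guessed = guess_format(fasta_handle)
forced = guess_format(io.StringIO("anything"), format="genbank")
```

Steps 2 and 3 are guesses and can be wrong, for example when a file starts with blank lines or the stream is not seekable, which is why the explicit argument takes priority.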
> Peter
>
>
--
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org