[Biopython-dev] New Bio.SeqIO code

Tue Nov 14 17:19:14 UTC 2006

Peter (BioPython Dev) wrote:
> Iddo Friedberg wrote:
>> 3) The last argument against rigid filename extensions is 
>> interoperability with other applications that generate those files. 
>> Suppose you have one application that generates fasta files with a
>> .tfa extension, and another with a .fa extension and yet a third with
>> .pfa extensions... and those extensions are important to you for
>> other reasons, like knowing which is a nucleic acid file and which is
>> protein. Actually, all the NCBI genomic files are built like this...
>> :)
> 
> Interesting tidbit.
> 
> If you are using "exotic" file extensions, then you would have to
> explicitly tell my Bio.SeqIO code the file's format.
> 
> Although "fa" is currently a known extension mapped to fasta format in
> Bio.SeqIO, your other examples are not.  Are these other extensions used
> outside the internal systems of the NCBI?

I wouldn't call it a tidbit, or exotic. It is very prevalent: NCBI's GenBank 
genomic repositories are very much deferred to. The point is, since NCBI uses 
one standard of file extensions for its genomic databases, TIGR another 
(actually, TIGR points to GenBank for completed genomes), UCSC a third... 
then maybe relying on file suffixes is not such a great idea.

See for example the E. coli genome:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K12

Some are in fasta format, but with different contents: whole genome, 
noncoding RNA, protein. The same goes for those in GenBank format. So the 
NCBI suffixes denote not only the file format, but the biological 
content as well.

Also, for the reasons I gave in my previous email, I think we should 
stick with passing file handles, not file names.

There is no real need to pass a filename rather than a file handle. 
If you need information from the filename, you can read the filename 
from the file handle:

 >>> foo = open('foo')

 >>> foo.name
'foo'

And the functions could still accept StringIO streams if needed.
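
As a rough sketch of that point (the helper name describe_handle is made up for illustration and is not part of Bio.SeqIO), a function that takes a handle can still recover the filename when there is one, and works unchanged on a StringIO stream, which has no name attribute:

```python
import io
import os
import tempfile


def describe_handle(handle):
    """Return (filename or None, first line) for any file-like object.

    Hypothetical helper, for illustration only; not part of Bio.SeqIO.
    """
    # Real files carry a .name attribute; StringIO objects do not.
    name = getattr(handle, "name", None)
    first_line = handle.readline().rstrip("\n")
    return name, first_line


# In-memory stream: no filename is available, but reading still works.
stream_result = describe_handle(io.StringIO(">seq1\nACGT\n"))

# File on disk: the filename can be read back from the handle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tfa")
    with open(path, "w") as out:
        out.write(">seq1\nACGT\n")
    with open(path) as handle:
        file_result = describe_handle(handle)
```

The caller decides what to open and how; the function only needs something with a readline method.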

> 
> I'm not trying to force anyone into using specific filename extensions -
> I'm trying to make life easier for people who already do this (or who
> download their data from online sources like the NCBI or PFAM - which do
> seem to be consistent in their naming conventions).
> 

You cannot rely on such consistency prevailing. Especially not with NCBI. ;)

> 
>> We could add the format as an OPTIONAL keyword argument, with a "None"
>> default value. And have the parser recognize the format from a
>> lookahead using a magic regexp for each format. The user-passed
>> format overrides the parser guesswork. Shouldn't be too hard to
>> implement, as file formats are very distinct.
> 
> Currently the format is an optional keyword argument defaulting to None.
> When it is omitted, I currently use a limited filename extension to
> format mapping (assuming the filename is available) to deduce/guess the
> format.
> 

Ideally, the data format should be supplied by the user. Second best is 
inferring it from the first line or so of the file. Third is the 
filename extension. But neither the second nor the third option is very 
good practice, IMHO.
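
A minimal sketch of that order of preference, under stated assumptions: the function name guess_format, both lookup tables, and the ".gbk" entry are invented here for illustration and are not the actual Bio.SeqIO code. It relies only on fasta records starting with ">" and GenBank flat files starting with "LOCUS".

```python
import io
import os

# Illustrative tables only; a real implementation would need many more
# formats and more careful checks than a first-line prefix.
FIRST_LINE_PREFIXES = {
    ">": "fasta",
    "LOCUS": "genbank",
}
EXTENSION_TO_FORMAT = {
    ".fa": "fasta",
    ".gbk": "genbank",  # assumed extension, for illustration
}


def guess_format(handle, format=None):
    """Pick a format name: user-supplied, then first line, then extension."""
    # 1) A format passed by the user always overrides any guesswork.
    if format is not None:
        return format

    # 2) Look ahead at the first line, then rewind so the parser
    #    still sees the whole file. Only possible on seekable handles.
    if handle.seekable():
        start = handle.tell()
        first_line = handle.readline()
        handle.seek(start)
        for prefix, name in FIRST_LINE_PREFIXES.items():
            if first_line.startswith(prefix):
                return name

    # 3) Last resort: the filename extension, if the handle has a name.
    filename = getattr(handle, "name", None)
    if isinstance(filename, str):
        extension = os.path.splitext(filename)[1].lower()
        return EXTENSION_TO_FORMAT.get(extension)
    return None
```

For example, guess_format(io.StringIO(">seq1\nACGT\n")) falls through to the lookahead step, while passing format="genbank" skips the guessing altogether.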

> Peter
> 

-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org