[Biopython-dev] Bio.SeqIO - Output

Tue Jan 16 12:48:49 UTC 2007

I've been thinking about sequence output (i.e. writing sequence files), 
and have come to the conclusion that my writer classes in 
Bio/SeqIO/Interfaces.py are probably too complicated.

My current Bio.SeqIO output implementation tries to be very flexible - 
if you look beyond the top level function WriteSequences (aka 
SequencesToFile) then the individual writer classes have a confusing 
range of capabilities.

New Idea
========
I was thinking that we should only support two cases for sequence output:

(*) simple sequential file formats
     - record by record, or file at once
     - can use a SeqRecord iterator (or a list)
(*) all other file formats
     - file at once only
     - probably needs a list of SeqRecords (not an iterator)

For the sequential file formats such as fasta, genbank and swiss there 
are no headers or footers - and a single sequence alone would be a valid 
file.

For all other file formats (e.g. clustal, stockholm, phylip, anything in 
XML, ...) we would only offer the "file at once" option.

When implementing a writer for a new file format, you just have to 
implement a "write file" function or a "write record" function which 
takes the record(s) and a handle.  The implementation details are up to you.
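
As a rough illustration of the two entry points (not the actual Bio.SeqIO code), here is a minimal sketch. The Record tuple is a hypothetical stand-in for SeqRecord, and the function names write_fasta_record and write_sequential_file are invented for this example:

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, with only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_fasta_record(record, handle):
    """Write one record; FASTA has no header or footer, so this is a valid file."""
    handle.write(">%s\n%s\n" % (record.id, record.seq))


def write_sequential_file(records, handle, write_record):
    """File-at-once writer built on a record writer; accepts any iterable."""
    count = 0
    for record in records:
        write_record(record, handle)
        count += 1
    return count


records = [Record("alpha", "ACGT"), Record("beta", "GGCC")]
handle = io.StringIO()
# An iterator works just as well as a list for a sequential format.
written = write_sequential_file(iter(records), handle, write_fasta_record)
fasta_text = handle.getvalue()
```

For a sequential format only the "write record" function needs to be written by hand; the "write file" version falls out of a simple loop.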

Drawbacks
=========
There are some sequential file formats where, under the scheme above, 
you would be forced to write the file in one go...

However, I can only think of one relevant example, so this may not
matter.  Can anyone suggest some other examples?  Some sort of simple
tabular file with a header row maybe?

For example, simple Stockholm files (if you ignore the PFAM style
annotation) have a generic header, followed by sequential records and a
generic footer.

The point here is that the header does not contain anything about the
records which will follow it, e.g. the number of records, or whether
they are protein or nucleotide.

For files like this it would be possible to write the file record by 
record given an iterator - provided you also write the header and footer.
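
A minimal sketch of that idea, assuming a bare-bones Stockholm layout ("# STOCKHOLM 1.0" header, one "name sequence" line per record, "//" footer) with no annotation at all. The Record tuple and the function name write_simple_stockholm are hypothetical, made up for this example:

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, with only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_simple_stockholm(records, handle):
    """Stream records between a fixed header and footer.

    Nothing in the header depends on the records, so an iterator is enough.
    """
    handle.write("# STOCKHOLM 1.0\n")
    count = 0
    for record in records:
        handle.write("%s %s\n" % (record.id, record.seq))
        count += 1
    handle.write("//\n")
    return count


def generate_records():
    # A generator, to show that the writer never needs len() or indexing.
    yield Record("alpha", "ACGT")
    yield Record("beta", "GGCC")


handle = io.StringIO()
written = write_simple_stockholm(generate_records(), handle)
stockholm_text = handle.getvalue()
```

The catch is that the caller cannot write such a file one record at a time through a plain "write record" function, since something has to emit the header first and the footer last.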

Right now this is the only file format I can think of that has this
property - and I don't currently even support this (instead, like
BioPerl, I create Stockholm files with PFAM style annotations).

Stockholm files with PFAM style annotation do not qualify, because the 
header contains the number of records.  Similarly for non-interlaced PHYLIP.
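
To show why such formats need the whole list up front, here is a simplified sketch of a non-interlaced PHYLIP-like writer. It assumes a header line giving the record count and alignment length, and names padded or truncated to 10 characters. The Record tuple and write_phylip_like are hypothetical names for illustration only:

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, with only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_phylip_like(records, handle):
    """File-at-once writer: the header needs the count before any record."""
    records = list(records)  # an iterator must be consumed in full first
    if not records:
        raise ValueError("Need at least one record")
    length = len(records[0].seq)
    if any(len(record.seq) != length for record in records):
        raise ValueError("Sequences must all be the same length")
    handle.write(" %d %d\n" % (len(records), length))
    for record in records:
        # Names are truncated or padded to exactly 10 characters.
        handle.write("%s%s\n" % (record.id[:10].ljust(10), record.seq))
    return len(records)


handle = io.StringIO()
written = write_phylip_like(
    [Record("alpha", "ACGT"), Record("beta", "GGCC")], handle
)
phylip_text = handle.getvalue()
```

Passing an iterator still works here, but only because the function turns it into a list before writing anything.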

Peter