[Biopython-dev] Bio.SeqIO - Output
Peter
biopython-dev at maubp.freeserve.co.uk
Tue Jan 16 12:48:49 UTC 2007
I've been thinking about sequence output (i.e. writing sequence files),
and have come to the conclusion that my writer classes in
Bio/SeqIO/Interfaces.py are probably too complicated.
My current Bio.SeqIO output implementation tries to be very flexible -
if you look beyond the top level function WriteSequences (aka
SequencesToFile) then the individual writer classes have a confusing
range of capabilities.
New Idea
========
I was thinking that we should only support two cases for sequence output:
(*) simple sequential file formats
- record by record, or file at once
- can use a SeqRecord iterator (or a list)
(*) all other file formats
- file at once only
- probably needs a list of SeqRecords (not an iterator)
For the sequential file formats such as fasta, genbank and swiss there
are no headers or footers - and a single sequence alone would be a valid
file.
For all other file formats (e.g. clustal, stockholm, phylip, anything in
XML, ...) we would only offer the "file at once" option.
When implementing a writer for a new file format, you just have to
implement a "write file" function or a "write record" function which
takes the record(s) and a handle. The implementation details are up to you.
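As a rough sketch of the sequential case (not existing Bio.SeqIO code), a "write record" function for FASTA and a "write file" function built on top of it might look like the following. The Record tuple and both function names are made up for illustration; a real writer would take SeqRecord objects and wrap long sequence lines.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, holding only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_fasta_record(record, handle):
    """Write one record; a single record alone is already a valid file."""
    handle.write(">%s\n%s\n" % (record.id, record.seq))


def write_fasta_file(records, handle):
    """Write a whole file from a list or an iterator of records."""
    count = 0
    for record in records:
        write_fasta_record(record, handle)
        count += 1
    return count  # number of records written


if __name__ == "__main__":
    out = io.StringIO()
    write_fasta_file(iter([Record("a", "ACGT"), Record("b", "GGCC")]), out)
    print(out.getvalue(), end="")
```

Because there is no header or footer, the "file at once" function is just a loop over the "record by record" function, and it never needs to hold more than one record in memory.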
Drawbacks
=========
There are some sequential file formats where, under the scheme above,
you would be forced to write the file in one go...
However, I can only think of one relevant example, so this may not
matter. Can anyone suggest some other examples? Some sort of simple
tabular file with a header row maybe?
For example simple Stockholm files (if you ignore the PFAM style
annotation) have a generic header, followed by sequential records and a
generic footer.
The point here is that the header does not contain anything about the
records which will follow it, e.g. the number of records, or whether
they are protein or nucleotide sequences.
For files like this it would be possible to write the file record by
record given an iterator - provided you also write the header and footer.
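To illustrate, here is a minimal sketch of writing a simple Stockholm file from an iterator. It assumes the simplest layout only: a fixed "# STOCKHOLM 1.0" header line, one "name sequence" line per record, and a "//" footer. The Record tuple and the function name are hypothetical, and no annotation lines are written.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, holding only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_simple_stockholm(records, handle):
    """Stream records between a generic header and a generic footer."""
    handle.write("# STOCKHOLM 1.0\n")  # header says nothing about the records
    for record in records:  # works with a list or a one-pass iterator
        handle.write("%s %s\n" % (record.id, record.seq))
    handle.write("//\n")  # footer


if __name__ == "__main__":
    out = io.StringIO()
    records = (Record(name, "ACGT") for name in ["a", "b"])
    write_simple_stockholm(records, out)
    print(out.getvalue(), end="")
```

The header can be written before any record has been seen, which is what makes record-by-record output possible here.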
Right now this is the only file format I can think of that has this
property - and I don't currently even support this (instead, like
BioPerl, I create Stockholm files with PFAM style annotations).
Stockholm files with PFAM style annotation do not qualify, because the
header contains the number of records. Similarly for non-interlaced PHYLIP.
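By contrast, once the header has to state the number of records, the writer must see every record before it writes the first line. The sketch below assumes the count is carried in a "#=GF SQ" annotation line; that detail, the Record tuple and the function name are illustrative assumptions rather than a description of my current code.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, holding only the fields used here.
Record = namedtuple("Record", ["id", "seq"])


def write_stockholm_with_count(records, handle):
    """Write the file at once; the header needs the record count."""
    records = list(records)  # an iterator must be consumed in full first
    handle.write("# STOCKHOLM 1.0\n")
    handle.write("#=GF SQ %d\n" % len(records))  # assumed count annotation
    for record in records:
        handle.write("%s %s\n" % (record.id, record.seq))
    handle.write("//\n")


if __name__ == "__main__":
    out = io.StringIO()
    write_stockholm_with_count([Record("a", "ACGT"), Record("b", "GGCC")], out)
    print(out.getvalue(), end="")
```

The call to list() is the whole difference: it is why formats like this fall into the "file at once only" case.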
Peter