[Biojava-dev] SeqIO maintenance
Keith James
kdj@sanger.ac.uk
04 Nov 2002 16:36:10 +0000
I've been looking through the SeqIO code and found a few things
which I think need attention prior to the 1.3 release.
1. Writing complex formats, e.g. EMBL, is too slow and seems to have
become slower (no numbers to back this up, though). There are big
pauses for GC during writing. I think it needs attention.
2. The SeqFileFormer runtime class loading stuff I wrote is both
unnecessary and confusing. I think I can and should kill it (without
affecting the interfaces).
3. We are referring to sequence formats by String names (e.g. Embl,
Swissprot) in the interfaces, apart from in one method of both
MSFAlignmentFormat and FastaAlignmentFormat which takes an
int. However, the required int field is class private and in the
case of MSFAlignmentFormat is different from the public int field
for the same format in SeqIOTools.
SeqIOTools uses Nimesh Singh's int fields to identify formats (or
aspects thereof). Personally, I prefer this nomenclature. What do
others prefer? At least we should provide a map between the two
systems. (Also, the int fields probably belong in the
SequenceFormat interface as this is the convention used
elsewhere. Right now they're all in SeqIOTools.)
It might also be nice to have SequenceFormat.FASTADNA equal to
(SequenceFormat.FASTA | SequenceFormat.DNA) etc.
4. There is now almost identical file format guessing code
cut'n'pasted in SeqIOTools, SeqAlignReadWrite, MSFAlignmentFormat and
FastaAlignmentFormat. I'd like to move all this to a package-private
class.
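
To make point 3 a bit more concrete, here is a rough sketch of how
composable int fields and a String-to-int map might look. Everything
in it is an assumption for illustration only: the class name
FormatFlags, the constant names and their values are hypothetical and
are not the existing SeqIOTools or SequenceFormat fields. Note that
the combination has to use bitwise OR, since the AND of a format
value and a residue-type bit that share no bits is zero.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names or values are existing BioJava API.
final class FormatFlags {
    // The file format occupies the low byte (values are illustrative).
    static final int FASTA = 1;
    static final int EMBL = 2;
    static final int SWISSPROT = 3;
    static final int FORMAT_MASK = 0xFF;

    // Each residue type gets its own bit above the format byte.
    static final int DNA = 1 << 8;
    static final int RNA = 1 << 9;
    static final int PROTEIN = 1 << 10;

    // Combined identifiers are built with bitwise OR.
    static final int FASTA_DNA = FASTA | DNA;
    static final int EMBL_DNA = EMBL | DNA;
    static final int SWISSPROT_PROTEIN = SWISSPROT | PROTEIN;

    // Map from lower-cased String names to the int format identifiers.
    private static final Map<String, Integer> BY_NAME =
        new HashMap<String, Integer>();
    static {
        BY_NAME.put("fasta", FASTA);
        BY_NAME.put("embl", EMBL);
        BY_NAME.put("swissprot", SWISSPROT);
    }

    private FormatFlags() {}

    // Strips the residue-type bits, leaving only the file format part.
    static int formatOf(int flags) {
        return flags & FORMAT_MASK;
    }

    static boolean isDna(int flags) {
        return (flags & DNA) != 0;
    }

    // Looks up the int identifier for a String name, ignoring case.
    static int forName(String name) {
        Integer id = BY_NAME.get(name.toLowerCase(Locale.ROOT));
        if (id == null) {
            throw new IllegalArgumentException("Unknown format name: " + name);
        }
        return id.intValue();
    }

    // Reverse lookup: the String name for the format part of an identifier.
    static String nameOf(int flags) {
        int format = formatOf(flags);
        for (Map.Entry<String, Integer> e : BY_NAME.entrySet()) {
            if (e.getValue().intValue() == format) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("Unknown format id: " + format);
    }
}
```

With something like this, FormatFlags.forName("Embl") would give the
same int as FormatFlags.EMBL, and nameOf(FormatFlags.EMBL_DNA) would
give back "embl", so code using either naming system could be
translated to the other.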
Can anyone think of more while I'm at it?
Keith
--
- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -