[Biojava-dev] Bug in SeqIOTools
Keith James
kdj at sanger.ac.uk
Tue Mar 4 10:12:57 EST 2003
>>>>> "Mark" == Schreiber, Mark <mark.schreiber at agresearch.co.nz> writes:
Mark> Hi - The new way of getting an int to identify your file
Mark> type in SeqIOTools is somewhat buggy. The problem seems to
Mark> stem from the use of the method
Mark> SeqIOTools.identifyFormat(String formatName, String
Mark> alphabetName). This method returns an int by doing some
Mark> bitwise operations that should equal one of the constants in
Mark> SeqIOConstants.
Mark> There seems to be a problem, however, with formats like
Mark> Genbank. If you supply the formatName "genbank" then the DNA
Mark> alphabet is implied; however, you have to give an alphabetName
Mark> as an argument. If you give the name DNA then the returned
Mark> int no longer matches the SeqIOConstants value for GenBank, so
Mark> you can't use fileToBioJava() type methods, i.e. it doesn't
Mark> recognize the genbank | dna operation. If you use an empty
Mark> string for the alphabetName it defaults to "Unknown", which
Mark> again won't work. If you put null as the second argument you
Mark> get a null pointer exception.
Well, I had to hack it on the fly in Singapore in order to get the
OBDA stuff working. There's a bunch of methods which are now broken. I
found a couple of hours last night to fix more OBDA, but the
fileToBiojava etc in SeqIOTools is down below that on my list.
I originally mapped the name "genbank" to GENBANK | DNA, but then
GENBANK | RNA is also valid. Plus you can coerce a sequence of any
alphabet into just about any format with EMBOSS (e.g. GENBANK | AA).
So the current state is that swissprot, genpept and pdb imply AA,
phred implies DNA and all others make no assumption. It would be more
consistent to make no assumptions at all about format name implying an
alphabet.
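
To illustrate the format | alphabet scheme discussed above, here is a minimal sketch. The class name FormatFlags, the helper methods and all bit values are hypothetical and chosen only for illustration; they are not the actual values or layout used in SeqIOConstants. It shows why an int built as GENBANK | DNA cannot compare equal to a format-only constant, and how masking recovers each part.

```java
/** Hypothetical bit-flag layout; NOT the real SeqIOConstants values. */
final class FormatFlags {
    // Low bits carry the alphabet (illustrative values).
    static final int DNA = 1 << 0;
    static final int RNA = 1 << 1;
    static final int AA  = 1 << 2;
    static final int ALPHABET_MASK = DNA | RNA | AA;

    // Higher bits carry the file format (illustrative values).
    static final int FASTA   = 1 << 8;
    static final int GENBANK = 1 << 9;
    static final int FORMAT_MASK = FASTA | GENBANK;

    private FormatFlags() {}

    /** Extracts only the format bits from a combined identifier. */
    static int formatOf(int id) {
        return id & FORMAT_MASK;
    }

    /** Extracts only the alphabet bits from a combined identifier. */
    static int alphabetOf(int id) {
        return id & ALPHABET_MASK;
    }
}
```

With this layout, code that compares the combined int directly against a format-only constant will never match, whereas comparing formatOf(id) against the format constant works for any alphabet, including GENBANK | RNA and GENBANK | AA.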
Mark> To be really robust we should probably have an overloaded
Mark> identifyFormat() method: one that takes just the format
Mark> name and complains if it really needs an alphabet (like for
Mark> Fasta), and one that takes two and complains if your
Mark> combination makes no sense, e.g. GenBank and RNA or
Mark> something. We need to at least get it working before 1.3.
But GENBANK | RNA does make sense, e.g. gb:HSA299431.
You're right, though. I need to check in a test for all format/alpha
combinations for each method. I can't do this in work hours - it'll
take me a few days to find the necessary time.
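
As a rough sketch of the overloaded identifyFormat() idea, something along these lines might work. Everything here is an assumption for illustration: the class FormatIdSketch, the constant values, the name tables and the choice of IllegalArgumentException are hypothetical and are not the SeqIOTools implementation. The one-argument form succeeds only for formats whose name implies an alphabet; the two-argument form accepts any known format/alphabet pair (so GENBANK | RNA is fine) and rejects null or unknown names instead of failing with a NullPointerException. It uses Map.of, so it needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; names and values are NOT those of SeqIOTools. */
final class FormatIdSketch {
    static final int DNA = 1 << 0, RNA = 1 << 1, AA = 1 << 2;
    static final int FASTA = 1 << 8, GENBANK = 1 << 9,
                     SWISSPROT = 1 << 10, PHRED = 1 << 11;

    private static final Map<String, Integer> FORMATS = Map.of(
        "fasta", FASTA, "genbank", GENBANK,
        "swissprot", SWISSPROT, "phred", PHRED);

    private static final Map<String, Integer> ALPHABETS = Map.of(
        "dna", DNA, "rna", RNA, "aa", AA, "protein", AA);

    // Formats whose name alone implies an alphabet.
    private static final Map<Integer, Integer> IMPLIED = Map.of(
        SWISSPROT, AA, PHRED, DNA);

    private FormatIdSketch() {}

    /** Format name only: fails if the format does not imply an alphabet. */
    static int identifyFormat(String formatName) {
        int format = lookup(FORMATS, formatName, "format");
        Integer implied = IMPLIED.get(format);
        if (implied == null) {
            throw new IllegalArgumentException(
                "Format '" + formatName + "' needs an explicit alphabet");
        }
        return format | implied;
    }

    /** Format and alphabet names: any known pair is combined. */
    static int identifyFormat(String formatName, String alphabetName) {
        int format = lookup(FORMATS, formatName, "format");
        int alphabet = lookup(ALPHABETS, alphabetName, "alphabet");
        return format | alphabet;
    }

    // Case-insensitive lookup that rejects null and unknown names.
    private static int lookup(Map<String, Integer> table,
                              String name, String kind) {
        if (name == null) {
            throw new IllegalArgumentException(kind + " name must not be null");
        }
        Integer value = table.get(name.toLowerCase(Locale.ROOT));
        if (value == null) {
            throw new IllegalArgumentException("Unknown " + kind + ": " + name);
        }
        return value;
    }
}
```

A test over all format/alphabet combinations could then loop over both name tables and check that each returned int equals the OR of the two expected constants.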
cheers,
Keith
--
- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -