[Biojava-l] ah hah. Now I know why: it's a bug, perhaps in MSFAlignment

Fri, 1 Mar 2002 15:29:19 -0800

I wrote that original code, and yes, it is a bug: I had not thought about correcting for gaps. The Type keyword is often missing, and I have seen it written in a variety of formats. I think the following might be the best approach:

1) regex the type and *try* to match it to some of the "Types" we have seen
2) as well as counting a's, t's, g's, c's, and u's, also count the gap characters ., *, and -
   if sum(atgcu) / (TotalChars - sum(.*-)) > .90 then it is DNA
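
A minimal sketch of the gap-corrected ratio in 2), to make the arithmetic concrete. This is not the actual MSFAlignmentFormat code: the class name GapAwareTypeGuess and the method looksLikeNucleotide are made up for illustration, and skipping whitespace and returning false for an all-gap input are assumptions of this sketch.

```java
// Hypothetical helper for illustration; not part of BioJava.
final class GapAwareTypeGuess {

    /**
     * Returns true if more than `threshold` of the non-gap characters
     * are a, t, g, c or u (case-insensitive).
     */
    static boolean looksLikeNucleotide(String residues, double threshold) {
        int nucleotides = 0;
        int gaps = 0;
        int total = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toLowerCase(residues.charAt(i));
            if (Character.isWhitespace(c)) {
                continue; // assumption: whitespace is layout, not a symbol
            }
            total++;
            switch (c) {
                case 'a': case 't': case 'g': case 'c': case 'u':
                    nucleotides++;
                    break;
                case '.': case '*': case '-':
                    gaps++; // gap characters are excluded from the denominator
                    break;
                default:
                    break; // any other residue counts only toward the total
            }
        }
        int nonGap = total - gaps;
        if (nonGap == 0) {
            return false; // assumption: nothing but gaps, so no decision is possible
        }
        return (double) nucleotides / nonGap > threshold;
    }

    public static void main(String[] args) {
        // 8 nucleotides and 12 dots: 8/20 = 0.40 without the correction,
        // 8/8 = 1.00 once the gaps are removed from the denominator.
        String row = "AC..GT....AC..GU....";
        System.out.println(looksLikeNucleotide(row, 0.90));
    }
}
```

With the example row, the uncorrected ratio is 0.40 and would fail the 90% test, while the corrected ratio is 1.00 and passes it.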

-Robin

-----Original Message-----
From: Guoneng Zhong [mailto:Guoneng.Zhong@med.nyu.edu]
Sent: Friday, March 01, 2002 12:44 PM
To: biojava-l@biojava.org
Subject: [Biojava-l] ah hah. Now I know why: it's a bug, perhaps in
MSFAlignment

Relating to the previous email I posted, I believe this might be a bug.
In MSFAlignmentFormat, lines 137 to 155 test whether the given report
is DNA, protein, or RNA. The method is to go through the entire report
and count the a's, t's, g's, c's, and u's. If the ratio of these
nucleotide-looking characters to the total number of monomers is greater
than 90% (line 157), the report is treated as a polynucleotide;
otherwise it is treated as a protein. That would make sense if this were
not an alignment report. In an alignment report there are many gaps,
represented by a dash or a dot. In my case, about 60% of the characters
are dots, making the nucleotides only 30% of the whole collection of
Symbols, even though they make up 100% of the non-gap symbols.

So is this a correct interpretation?  If so, is this a bug?  Why doesn't
the parser just check the "Type" keyword in the report, which, at least
in mine, says "N"?  I suppose if that doesn't work then one could use
the methodology above to guess.  But I think the guess is flawed, no?
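
A rough sketch of checking the "Type" keyword first, so that the counting guess is only needed when the keyword is missing or unrecognized. The class MsfTypeKeyword, the enum SeqType, and the method fromHeader are hypothetical names, not BioJava API. The sketch assumes a header line of the form "MSF: 40  Type: N  Check: 1234 ..", and treating any value that starts with "n" or "p" as nucleotide or protein is also an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of BioJava.
final class MsfTypeKeyword {

    enum SeqType { NUCLEOTIDE, PROTEIN, UNKNOWN }

    // Captures the word that follows "Type:", case-insensitively.
    private static final Pattern TYPE =
            Pattern.compile("Type:\\s*([A-Za-z]+)", Pattern.CASE_INSENSITIVE);

    /** Maps the Type keyword of a header line to a SeqType, or UNKNOWN if absent. */
    static SeqType fromHeader(String headerLine) {
        Matcher m = TYPE.matcher(headerLine);
        if (!m.find()) {
            return SeqType.UNKNOWN; // no keyword: the caller has to guess
        }
        String value = m.group(1).toLowerCase();
        if (value.startsWith("n")) {
            return SeqType.NUCLEOTIDE;
        }
        if (value.startsWith("p")) {
            return SeqType.PROTEIN;
        }
        return SeqType.UNKNOWN; // unrecognized value: the caller has to guess
    }

    public static void main(String[] args) {
        System.out.println(fromHeader(" MSF: 40  Type: N  Check: 1234 .."));
        System.out.println(fromHeader(" MSF: 40  Check: 1234 .."));
    }
}
```

A parser built this way would only fall back to counting characters when fromHeader returns UNKNOWN.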

G
