[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?
Richard Holland
holland at eaglegenomics.com
Fri Nov 13 15:04:27 UTC 2009
I've applied the patch to the trunk of biojava-live. Thanks!
Richard
On 9 Nov 2009, at 16:26, Carl Mäsak wrote:
> Richard (>):
>> Ah OK I see what's going on.
>>
>> The convenience method you're using, RichSequence.IOTools.readStream(), uses
>> FastaFormat to try and guess the alphabet to use based on the first line of
>> the input sequence.
>>
>> In FastaFormat, it does this by searching for matching non-DNA symbols. The
>> search is case-sensitive:
>>
>> protected static final Pattern aminoAcids =
>> Pattern.compile(".*[FLIPQE].*");
>>
>> FastaFormat needs patching to make this pattern non-case-sensitive.
>
> Patch attached.
>
> I also took the opportunity to remove the occurrences of .* in the
> Pattern above. Generally, one should be using Matcher.find() when one
> is interested in matching a part of a string. This is more efficient
> than using Matcher.matches() and surrounding the desired regular
> expression with .*, since the latter will cause a lot of unnecessary
> backtracking and make the search quadratic.
>
> This effect only shows up for very long strings, but long strings can
> and do happen in bioinformatics. The below measurements show the
> quadratic behaviour of the former approach.
>
> $ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
> real 0m0.371s
> real 0m0.367s
> real 0m0.577s
> real 0m2.735s
> real 0m25.275s
>
> $ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
> real 0m0.309s
> real 0m0.361s
> real 0m0.468s
> real 0m1.184s
> real 0m9.703s
>
> Kindly,
> // Carl
> <aminoAcids.patch><WithDotStar.java><WithoutDotStar.java>
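
For readers without the attachments, here is a minimal sketch of the two approaches discussed above. It is not the actual patch: the class name AminoAcidGuess and both helper methods are hypothetical, and only the original pattern string is taken from the thread. It assumes the fix amounts to dropping the .* wrappers, adding Pattern.CASE_INSENSITIVE, and calling Matcher.find() instead of Matcher.matches().

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of BioJava.
class AminoAcidGuess {
    // Pattern as quoted in the thread: case-sensitive, wrapped in .* so it
    // can be used with Matcher.matches().
    static final Pattern WITH_DOT_STAR = Pattern.compile(".*[FLIPQE].*");

    // Assumed shape of the fix: no .* wrappers, case-insensitive matching.
    static final Pattern AMINO_ACIDS =
            Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    // Old behaviour: the whole line must match, and only uppercase counts.
    static boolean looksLikeProteinOld(String line) {
        return WITH_DOT_STAR.matcher(line).matches();
    }

    // Sketched behaviour: find() succeeds as soon as any non-DNA symbol
    // occurs anywhere in the line, in either case.
    static boolean looksLikeProtein(String line) {
        return AMINO_ACIDS.matcher(line).find();
    }

    public static void main(String[] args) {
        String lowerProtein = "mkvlaaqe";
        String upperProtein = "MKVLAAQE";
        String lowerDna = "acgtacgtnn";

        System.out.println(looksLikeProteinOld(lowerProtein)); // false
        System.out.println(looksLikeProtein(lowerProtein));    // true
        System.out.println(looksLikeProteinOld(upperProtein)); // true
        System.out.println(looksLikeProtein(upperProtein));    // true
        System.out.println(looksLikeProtein(lowerDna));        // false
    }
}
```

The lowercase protein line is the case from the original question: the old pattern misses it, so the alphabet guess falls through to DNA, while the case-insensitive find() catches it.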
--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/