[Biojava-l] Masked regions, X's, No Call's N's
Kevin T. Pedretti
pedretti@eng.uiowa.edu
Thu, 5 Oct 2000 10:29:30 -0500 (CDT)
Hi Matthew,
I did some more experimentation last night and wrote a little program
that output a similar list to the one you gave below. At the time I sent the
email, I didn't think there was any notion of ambiguity symbols yet in
the code -- looking through the list archives you had sent some messages
about adding an AmbiguitySymbol interface but I didn't see that in the API
docs. Anyway, in our lab we use runs of Xs to signify repetitive regions
and regions of low complexity as detected by programs like seg and
repeatmasker. Ns signify no calls. X=N for all practical purposes, but
the distinction is that the N's are put in the sequence by the base
caller while the X's are added at a later stage in the pipeline and
typically appear in runs. I also found this on the NCBI BLAST faq:
Q: After running a search why do I see a string of "X"s (or "N"s) in my
query sequence that I did not put there?
You are seeing the result of automatic filtering of your query for
low-complexity sequence that is performed to prevent artifactual hits. The
filter substitutes any low-complexity sequence that it finds with the
letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter
"X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can
result in high scores that reflect compositional bias rather than
significant position-by-position alignment (Wootton & Federhen, 1996).
Filter programs can eliminate these potentially confounding matches from
the blast reports, leaving regions whose BLAST statistics reflect the
specificity of their pairwise alignment. Queries searched with the blastn
program are filtered with DUST. The other BLAST programs use SEG.
So maybe the standard protocol is to use Ns... maybe the right thing for
me to do is preprocess any fasta files to replace Xs with Ns. Or do you
think it would be better to add an X to the alphabet?
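
If preprocessing turns out to be the way to go, something like the following would probably do. This is only a minimal sketch, assuming nucleotide FASTA input (in protein FASTA, X is a legitimate residue code and should not be touched). The class name FastaXToN and its maskLine method are hypothetical names made up for illustration and are not part of BioJava. Header lines starting with ">" are passed through unchanged, and case is preserved.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/** Hypothetical helper (not part of BioJava): rewrites X/x as N/n in FASTA sequence lines. */
class FastaXToN {

    /** Returns a header line unchanged; otherwise maps X to N and x to n. */
    static String maskLine(String line) {
        if (line.startsWith(">")) {
            return line; // description lines may legitimately contain the letter X
        }
        return line.replace('X', 'N').replace('x', 'n');
    }

    /** Reads FASTA from standard input and writes the converted text to standard output. */
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(maskLine(line));
        }
    }
}
```

It would be run as "java FastaXToN < masked.fasta > converted.fasta". The drawback is that the converted file no longer distinguishes masked regions from no-calls, so the original file would need to be kept if that distinction matters later in the pipeline.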
Kevin
On Thu, 5 Oct 2000, Matthew Pocock wrote:
> Hi Kevin,
>
> Which version of BioJava are you working from? Could you check that you
> either have the 1.01 release, or a fresh check-out from the anonymous CVS
> repository. It is probably best to grab the source code release as it has
> some demos included:
>
> http://biojava.org/download/source/biojava-1.01.zip
>
> N characters should appear in the DNA alphabet as ambiguity symbols, so you
> should be fine parsing them. I don't think that X is in there - what does X
> represent? Could you try running the demo called:
>
> symbol.TestAmbiguity
>
> You need to have built biojava, and have biojava.jar and xerces.jar in your
> classpath, and then cd into the demos directory and type:
>
> java symbol.TestAmbiguity
>
> This should print out a list of lines something like:
>
> adenine -> {adenine}
> guanine -> {guanine}
> cytosine -> {cytosine}
> thymine -> {thymine}
> ag -> {adenine, guanine}
> ct -> {thymine, cytosine}
> ac -> {adenine, cytosine}
> GT -> {thymine, guanine}
> gc -> {cytosine, guanine}
> at -> {thymine, adenine}
> act -> {thymine, adenine, cytosine}
> gtc -> {thymine, cytosine, guanine}
> gac -> {adenine, cytosine, guanine}
> gat -> {thymine, adenine, guanine}
> agct -> {thymine, adenine, cytosine, guanine}
> gap -> {}
>
> If none of this helps, then mail the list and/or me, and we will see if we
> can get you up-and-running.
>
> Matthew
>
> "Kevin T. Pedretti" wrote:
>
> > Hello All,
> > I'm just starting to use biojava and have already hit a roadblock. The
> > DNA alphabet doesn't contain symbols for X's or N's so when I try to read
> > in a fasta sequence with these characters, it throws a bunch of
> > exceptions. Can I extend the alphabet to contain these characters and
> > expect everything else to still work? Sorry if this has been covered
> > before and thanks for your help.
> >
> > Kevin
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l@biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l
>
> --
> Joon: You're out of your tree
> Sam: It wasn't my tree
> (Benny & Joon)