[Biojava-l] diploid alphabet
Thomas Down
td2@sanger.ac.uk
Mon, 21 Oct 2002 19:52:01 +0100
On Mon, Oct 21, 2002 at 09:20:40AM -0700, Doug Passey wrote:
> hi all,
> we are faced with the problem of representing heterozygous indels in diploid
> resequenced data. normal heterozygotes (SNPs) in a diploid sequence can be
> represented with the various ambiguity symbols, but in my cursory look at
> the symbol/alphabet stuff in the biojava API docs, i did not see any way of
> representing ambiguities of the form: A/-, C/-, G/-, or T/- ... which are
> the four forms of a single base heterozygous indel in diploid data. is
> somebody working on this, and if not, does someone have a suggestion about how
> to add this to the whole alphabet/symbol scheme of biojava? i am a relative
> novice at biojava; so if i have to implement this, i might need a little
> guidance to make sure that it is implemented in the correct way.
Hi...
You can't represent an ambiguity matching either a nucleotide
or a `standard' gap in the BioJava scheme. The reason is that
gaps are represented as the empty set, in a world where normal
symbols are singleton sets, and ambiguities are sets with more
than one member. The gap symbol is an explicit `there's nothing
here', as you would get in a gapped alignment.
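To make the set picture concrete, here is a minimal sketch in plain Java.
It does not use the BioJava API at all: the class SymbolSetSketch, the enum
Base and the helper methods are hypothetical names invented for this
illustration. It only shows why "A or gap" cannot be a distinct symbol when
a gap is modelled as the empty set.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

public class SymbolSetSketch {
    // Hypothetical stand-in for the four atomic DNA symbols (not BioJava types).
    enum Base { A, C, G, T }

    // A gap is modelled as the empty set: "there's nothing here".
    static Set<Base> gap() {
        return Collections.unmodifiableSet(EnumSet.noneOf(Base.class));
    }

    // A normal symbol is a singleton set.
    static Set<Base> atomic(Base b) {
        return EnumSet.of(b);
    }

    // An ambiguity symbol is a set with more than one member.
    static Set<Base> ambiguity(Base first, Base... rest) {
        return EnumSet.of(first, rest);
    }

    // Combining two symbols means taking the union of their member sets.
    static Set<Base> union(Set<Base> x, Set<Base> y) {
        EnumSet<Base> out = EnumSet.noneOf(Base.class);
        out.addAll(x);
        out.addAll(y);
        return out;
    }

    public static void main(String[] args) {
        // {A} united with {} is just {A}, so "A or gap" is
        // indistinguishable from plain A in this model.
        Set<Base> aOrGap = union(atomic(Base.A), gap());
        System.out.println(aOrGap.equals(atomic(Base.A))); // prints "true"

        // An ordinary SNP ambiguity such as A/G keeps both members.
        System.out.println(ambiguity(Base.A, Base.G).size()); // prints "2"
    }
}
```

Because the union with the empty set adds nothing, the gap contributes no
information to an ambiguity built this way.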
One way to represent indel polymorphisms is by making the sequence
a profile hidden Markov model. That's the `ideal' of what you're
really trying to represent, but I can see that it may not be
terribly efficient or practical for your application.
A reasonable alternative would be to create a new 5-symbol alphabet
containing the 4 DNA symbols plus an extra one called `indel'
(an easy way to create alphabets is to edit the AlphabetManager.xml
file in the resources/ tree of the biojava source code). In this
alphabet, you can then create ambiguity symbols such as
[adenine indel].
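As a rough illustration of the five-symbol idea, again in plain Java rather
than the BioJava API (the class IndelAlphabetSketch, the enum Sym and the
helper methods are hypothetical names; a real implementation would define
the alphabet through AlphabetManager.xml as described above):

```java
import java.util.EnumSet;
import java.util.Set;

public class IndelAlphabetSketch {
    // Hypothetical five-symbol alphabet: the 4 DNA symbols plus an explicit
    // 'indel' symbol. Unlike a gap, INDEL is a real member of the alphabet.
    enum Sym { ADENINE, CYTOSINE, GUANINE, THYMINE, INDEL }

    // Build the ambiguity for a single-base heterozygous indel, e.g. A/-.
    static Set<Sym> heterozygousIndel(Sym base) {
        if (base == Sym.INDEL) {
            throw new IllegalArgumentException("base must be a nucleotide");
        }
        return EnumSet.of(base, Sym.INDEL);
    }

    // An ambiguity matches an observed symbol if the symbol is a member.
    static boolean matches(Set<Sym> ambiguity, Sym observed) {
        return ambiguity.contains(observed);
    }

    public static void main(String[] args) {
        Set<Sym> aOrIndel = heterozygousIndel(Sym.ADENINE);
        System.out.println(aOrIndel);                         // prints "[ADENINE, INDEL]"
        System.out.println(matches(aOrIndel, Sym.INDEL));     // prints "true"
        System.out.println(matches(aOrIndel, Sym.CYTOSINE));  // prints "false"
    }
}
```

Since INDEL is an ordinary member of the alphabet, [adenine indel] is a
two-member set and stays distinct from plain adenine, which is exactly what
the empty-set gap cannot give you.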
Does this make sense?
Thomas.