[BioPython] Re: prototype for Seq Gui
Andrew Dalke
dalke@acm.org
Fri, 05 May 2000 19:56:29 -0600
I'm responding to a couple of Cayte's messages:
> He needs to code inspect to find the answer, but he doesn't
> remember whether the code was in IUPAC.py or in Codon.py
> or which directory to search.
Part of that is my fault. I don't remember where they are
either. I used my "ohh, this feels reasonable" method which
often works for about an hour or two. :)
> I think the translate utilities need a pick list.
Yes, they do. It's complicated by the fact that there can be
many types of translations -- the simple ones I did for each
coding type, and the ones you list below.
The idea of my PropertyManager is to serve as a central
repository of available objects. It currently has a bunch
of terms like:
translator.id.1
translator.name.Yeast Mitochondrial
With a bit more (*ahem* to self) work, there should be
methods to get all attributes with names starting
"translator.id" that work on DNA alphabets. New translation
methods just need to add themselves to the property manager.
Implicit in this is a tree structure, eg, that you want to
see translation tables by their id, or their name or ... .
As your messages show, you can also get the list of translation
tables from the Data/CodonTable.py via, eg, the keys of
the "ambiguous_dna_by_id" table.
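A minimal sketch of the prefix lookup described above, in present-day Python. The class name PropertyRegistry, its methods, and the stored values are hypothetical placeholders; this is not the actual PropertyManager API.

```python
class PropertyRegistry:
    """Hypothetical stand-in for a central registry of dotted property names."""

    def __init__(self):
        self._items = {}

    def register(self, name, value):
        # A new translation method would call this to announce itself.
        self._items[name] = value

    def find(self, prefix):
        # Return (name, value) pairs below the prefix, sorted for stable output.
        dotted = prefix + "."
        return sorted((name, value) for name, value in self._items.items()
                      if name.startswith(dotted))


registry = PropertyRegistry()
# The values are placeholder strings; real entries would be translator objects.
registry.register("translator.id.1", "standard-translator")
registry.register("translator.id.3", "yeast-mito-translator")
registry.register("translator.name.Yeast Mitochondrial", "yeast-mito-translator")

by_id = registry.find("translator.id")
```

The dotted names give the implicit tree: find("translator.id") and find("translator.name") are two views onto the same set of tables.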
> Also, I think the translation from protein to DNA needs
> several options.
> To present a list of possible candidate DNA sequences.
> To sort the candidate sequences on the basis of probability.
> Probability distributions based on the type of organism.
These are also useful. The backtranslate code I wrote only
chooses one triplet per amino acid (though it tries to be clever
about getting the right ambiguity codon for an ambiguous amino
acid).
I lack the data for doing probability distributions. I could
only find one for the standard translation table. I am also
not sure about how to track the random seed (essential for
reproducibility). Maybe as a derived class? Or a still
hypothetical SeqRecord?
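One possible way to keep the seed with the result, sketched in present-day Python. The function name and the four-residue table fragment are illustrative assumptions, and the draw is uniform because no organism-specific frequencies are available here; a weighted draw would need that data.

```python
import random

# Fragment of the standard genetic code: amino acid -> synonymous DNA codons.
BACK_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTC", "TTT"],
    "W": ["TGG"],
}


def random_back_translate(protein, seed):
    """Pick one codon per residue uniformly; the seed makes the draw repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    dna = "".join(rng.choice(BACK_CODONS[aa]) for aa in protein)
    return dna, seed  # hand the seed back so it can be stored with the sequence


first, used_seed = random_back_translate("MKFW", seed=42)
again, _ = random_back_translate("MKFW", seed=used_seed)  # same seed, same result
```

Returning the seed alongside the sequence is the simplest form of the idea; a record object holding both would serve the same purpose.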
Also, when there are multiple back codons, I choose the
first (last?) one listed alphabetically. I know this is
wrong, but I need that likelihood data. I chose alphabetically,
BTW, so that the order wouldn't change if the hash table
implementation ever changed.
Sorting based on probability is ungainly, since there are about
3 possibilities per amino acid, so a protein sequence of 10
letters maps to 3**10 nucleotide sequences. I can imagine choosing
the top N most probable, or some markup to indicate higher
variability, but I lack the domain experience to know what
people want, so I base what I do on what bioperl and others
do.
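To make the size of the problem concrete: 3**10 is 59049 candidates for a ten-residue protein. The sketch below (present-day Python, hypothetical function names, a small fragment of the standard table) counts the candidates exactly and enumerates them lazily, so a caller can take only the first few instead of building the full list. It does no probability ranking, since that needs frequency data not available here.

```python
from itertools import islice, product
from math import prod

# Fragment of the standard genetic code: amino acid -> synonymous DNA codons.
BACK_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTC", "TTT"],
    "W": ["TGG"],
}


def count_back_translations(protein):
    # The count is the product of the number of codons for each residue.
    return prod(len(BACK_CODONS[aa]) for aa in protein)


def iter_back_translations(protein):
    # Yield candidates one at a time, in a fixed order, without storing them all.
    for codons in product(*(BACK_CODONS[aa] for aa in protein)):
        yield "".join(codons)


total = count_back_translations("MKFW")  # 1 * 2 * 2 * 1
first_three = list(islice(iter_back_translations("MKFW"), 3))
```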
> USER STORY: Mary Rainforest has found an interesting
> protein in an orchid. She would like to find the DNA sequence
> and find if similar plants have a similar sequence. She would
> like to find the sequence that is correct for this plant and
> doesn't think it is the default back translation.
The way to compare protein to DNA sequence is using tfasta
or the BLAST equivalent, instead of doing a back translate
and compare the DNAs. Those alignment methods know about the
different encodings (another note to self - check if they
also have probability tables) and take them into account when
doing the alignment score.
It is possible to have a tool help out making the list of
possible reverse translations. The current code in CodonTable
is:
  def make_back_table(table, default_stop_codon):
      # ONLY RETURNS A SINGLE CODON
      # Do the sort so changes in the hash implementation won't affect
      # the result when one amino acid is coded by more than one codon.
      back_table = {}
      keys = table.keys() ; keys.sort()
      for key in keys:
          back_table[table[key]] = key
      back_table[None] = default_stop_codon
      return back_table
To get a list of all possible codons, replace
      for key in keys:
          back_table[table[key]] = key
with
      for key in keys:
          back_table[table[key]] = back_table.get(table[key], []) + [key]
or
      for key in keys:
          try:
              back_table[table[key]].append(key)
          except KeyError:
              back_table[table[key]] = [key]
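For comparison, a sketch of both variants in present-day Python 3, where table.keys() no longer returns a sortable list. The name make_back_table_all is hypothetical, and storing the stop codon as a one-element list is an assumption made here so that every value has the same type. Run on a five-codon fragment of the standard table, the single-codon version keeps the alphabetically last codon for each amino acid, because later keys overwrite earlier ones.

```python
def make_back_table(table, default_stop_codon):
    # One codon per amino acid; iterating in sorted order keeps the result stable.
    back_table = {}
    for codon in sorted(table):
        back_table[table[codon]] = codon  # later codons overwrite earlier ones
    back_table[None] = default_stop_codon
    return back_table


def make_back_table_all(table, default_stop_codon):
    # Every codon per amino acid, collected in sorted order.
    back_table = {}
    for codon in sorted(table):
        back_table.setdefault(table[codon], []).append(codon)
    back_table[None] = [default_stop_codon]  # assumption: list, like other values
    return back_table


# Five codons from the standard genetic code (codon -> amino acid).
forward = {"AAA": "K", "AAG": "K", "TTT": "F", "TTC": "F", "ATG": "M"}
single = make_back_table(forward, "TAA")
multi = make_back_table_all(forward, "TAA")
```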
Cayte:
> bio.perl.org/pub/katel/biopython/Bio. It is a prototype
> for a Tkinter GUI wrapper for Andrew's Seq code. It is NOT
> ready but I am looking for suggestions
Will the input sequence be ambiguous or unambiguous DNA or
RNA? It can autodetect -- and biopython needs a good autodetect
function for distinguishing
    ambiguous & unambiguous DNA
    ambiguous & unambiguous RNA
    all 4 types of nucleotides
    ambiguous & unambiguous protein
    all 4 types of nucleotides + protein
I've also seen an algorithm which tries to detect a wider
range of alphabets by finding the alphabet with the smallest
set of letters which completely describes the input set.
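A rough sketch of that smallest-covering-alphabet idea, in present-day Python. The alphabet table and function name are assumptions for illustration: the nucleotide sets use the IUPAC ambiguity codes, the protein set is just the 20 standard amino acids, and ties between equally small alphabets are broken alphabetically by name, which is arbitrary.

```python
# Candidate alphabets; an assumption for this sketch, not a Biopython structure.
ALPHABETS = {
    "unambiguous DNA": set("ACGT"),
    "unambiguous RNA": set("ACGU"),
    "ambiguous DNA": set("ACGTRYSWKMBDHVN"),
    "ambiguous RNA": set("ACGURYSWKMBDHVN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def guess_alphabet(sequence):
    """Return the name of the smallest alphabet covering every letter, or None."""
    letters = set(sequence.upper())
    candidates = [(len(chars), name) for name, chars in ALPHABETS.items()
                  if letters <= chars]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest size first, then name as tie-break


guesses = [guess_alphabet(s) for s in ("ACGT", "ACGU", "ACGTN", "MKFW", "ACGTU")]
```

Short inputs show why this gets hairy: "ACGT" is also a valid protein string, and only the smallest-set rule makes it come out as DNA.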
Autodetection gets hairy.
For now you can either have user selection or autodetect,
or let the user select autodetect and show the guessed
result, perhaps as
  alphabet [Ambiguous DNA]   autodetect <*>
                ^^^^                    ^^^
        (pulldown of types)         (checkbox)
The autodetect code for nucleotides is:
If it contains a U, it's RNA.
If it contains a T, it's DNA.
If it doesn't contain either, it's DNA.
If it contains both, it's neither.
If it contains characters other than "ATCGU" it's ambiguous.
Otherwise it's unambiguous.
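The rules above, written out as a small present-day Python function. The function name and the (kind, ambiguous) return shape are assumptions for illustration. As in the rules, any character outside "ATCGU" counts as ambiguous; the sketch does not check that it is a valid IUPAC code.

```python
def classify_nucleotides(sequence):
    """Return (kind, ambiguous): kind is "DNA", "RNA", or None for neither."""
    letters = set(sequence.upper())
    has_t = "T" in letters
    has_u = "U" in letters
    if has_t and has_u:
        kind = None       # contains both T and U: neither DNA nor RNA
    elif has_u:
        kind = "RNA"
    else:
        kind = "DNA"      # contains T, or neither T nor U
    ambiguous = bool(letters - set("ATCGU"))
    return kind, ambiguous


results = [classify_nucleotides(s) for s in ("ACGT", "ACGU", "ACG", "ACGTU", "ACGTN")]
```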
Will you support backtranslation of protein to DNA? I
noticed, BTW, that nothing handles back translation to RNA
separately from DNA, so if both are needed there may need
to be separate "back translate to DNA" and "..to RNA" options.
Jeff, I recall, doesn't think back translations are really
useful. Again, there's that domain knowledge :)
The output sequence alphabet, although obvious, should be
labeled. It should have a title like "translated protein
sequence" or "transcribed DNA sequence" so people don't have
to remember what the last action was.
As for the GUI itself, the top of the list of codon tables
should be aligned with the top of the sequence input box.
The list also needs a scrollbar to show there are more items
than listed there (they can be accessed by selecting and
dragging the mouse down, but that is not obvious).
The inside of the sequence entry box should be recessed,
as a clue that that is a writable area. The lower box
should not be editable. There may need to be scrollbars for
the sequence boxes, though I think people will just copy
and paste (or save to file?) as need be.
Not being much of a Tk programmer, I don't know how to
change those.
For the code part, instead of "w.insert( END, ...) " you
could look at the list directly in Data/CodonTable.py, but
I suspect you did things this way so the layout could be
mocked up independent of having a working biopython
distribution. It definitely made it easy for me to run
without needing to set my PYTHONPATH!
As a GUI design in general, I've been thinking about having
a display like:
  [list of   ] [input area for        ]
  [available ] [selecting parameters  ]
  [operations] [related to the        ]
  [translate ] [selected operation, eg]
  [transcribe] [list of codon tables  ]
  [        sequence input area        ]
  [                                   ]
  [                                   ]
  [                                   ]
  [                                   ]
  [            output area            ]
  [                                   ]
  [                                   ]
  [                                   ]
The two fixed areas are: the sequence input box and the
list of operations. When the operation changes, the list
of parameters associated with that operation changes (that's
the box in the upper right). As well, the output area
may change, if the output isn't a sequence.
For example, suppose there's an "alpha helix propensity"
operation. One possible parameter is the window size,
which could be a slider bar in the upper right.
The output area would be a graph, along with some scaling
controls to zoom in on different areas.
Hmm, OTOH, this isn't very useful since I would like to
see different propensity predictions simultaneously, or
even several of the same propensity predictions with
different parameters. Hmm, hmmm! That's like what you wanted
when seeing different possible back translations at the same
time.
So just think of this as a general idea I like (similar
to tabbed entry areas, which I don't like).
Andrew
dalke@acm.org