[BioPython] Re: prototype for Seq Gui

Cayte katel@worldpath.net
Sat, 6 May 2000 16:09:08 -0700


----- Original Message -----
From: Andrew Dalke <dalke@acm.org>
To: <biopython@biopython.org>
Sent: Friday, May 05, 2000 6:56 PM
Subject: [BioPython] Re: prototype for Seq Gui


> I lack the data for doing probability distributions.  I could
> only find one for the standard translation table.  I am also
> not sure about how to track the random seed (essential for
> reproducibility).  Maybe as a derived class?  Or a still
> hypothetical SeqRecord?
>
> Also, when there are multiple back codons, I choose the
> first (last?) one listed alphabetically.  I know this is
> wrong, but I need that likelihood data.  I chose alphabetically,
> BTW, so that the order wouldn't change if the hash table
> implementation ever changed.
>
> Sorting based on probability is ungainly, since there are about
> 3 possibilities per amino acid, so a protein sequence of 10
> letters maps to about 3**10 nucleotide sequences.  I can imagine choosing
> the top N most probable, or some markup to indicate higher
> variability, but I lack the domain experience to know what
> people want, so I base what I do on what bioperl and others
> do.
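
A minimal sketch of the "first codon alphabetically" rule and of how fast the number of back translations grows. BACK_TABLE is a hand-typed subset of the standard genetic code used only for illustration, and both function names are hypothetical; neither is taken from Andrew's CodonTable code.

```python
from math import prod

# Illustrative subset of the standard genetic code (amino acid -> DNA codons).
# Assumption: a real tool would take this from a complete codon table.
BACK_TABLE = {
    "M": ["ATG"],
    "W": ["TGG"],
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}


def back_translate(protein):
    """Pick the alphabetically first codon for each residue.

    The choice is deterministic, so it does not depend on hash table order.
    """
    return "".join(min(BACK_TABLE[aa]) for aa in protein)


def count_back_translations(protein):
    """Number of distinct DNA sequences that encode the protein."""
    return prod(len(BACK_TABLE[aa]) for aa in protein)


print(back_translate("MKLF"))           # ATGAAACTATTC
print(count_back_translations("MKLF"))  # 1 * 2 * 6 * 2 = 24
```

Replacing min() with a lookup into a codon usage table would give the "most probable" back translation, once that data is available.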

   Actually, lack of domain knowledge is BOTH an advantage and a
disadvantage.  The disadvantage is obvious.  The advantage is that you are
looking at things in a fresh way.  For this reason children and people who
switch fields sometimes come up with surprisingly creative ideas.  I lack
domain knowledge too, but I don't think we should automatically reject new
approaches and copy bioperl.
>
> > USER STORY:       Mary Rainforest has found an interesting
> > protein in an orchid.  She would like to find the DNA sequence
> > and find if similar plants have a similar sequence.  She would
> > like to find the sequence that is correct for this plant and
> > doesn't think it is the default back translation.
>
> The way to compare protein to DNA sequence is using tfasta
> or the BLAST equivalent, instead of doing a back translate
> and compare the DNAs.  Those alignment methods know about the
> different encodings (another note to self - check if they
> also have probability tables) and take them into account when
> doing the alignment score.
>
    But I could see where a field biologist might just have a laptop and no
internet connection.  He might want to do a quick check.  I don't have a
strong opinion about this.
> It is possible to have a tool help out making the list of
> possible reverse translations.  The current code in CodonTable
> Cayte:
> > bio.perl.org/pub/katel/biopython/Bio.  It is a prototype
> > for a Tkinter GUI wrapper for Andrew's Seq code.  It is NOT
> > ready but I am looking for suggestions
>
> Will the input sequence be ambiguous or unambiguous DNA or
> RNA?  It can autodetect -- and biopython needs a good autodetect
> function for distinguishing
>   ambiguous & unambiguous DNA
>     "       &    "        RNA
>   all 4 types of nucleotides
>   ambiguous & unambiguous protein
>   all 4 types of nucleotides + protein
>
> I've also seen one algorithm which tries to detect a wider
> range of alphabets by finding the alphabet with the smallest
> set of letters which completely describes the input set.
> Autodetection gets hairy.
>
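
The "smallest alphabet that covers the input" idea can be sketched as below. The letter sets are the usual IUPAC codes; the ALPHABETS dictionary and the function name are assumptions made for this example, not an existing biopython API.

```python
# Candidate alphabets, listed in tie-break order (DNA before RNA).
ALPHABETS = {
    "unambiguous DNA": set("ACGT"),
    "unambiguous RNA": set("ACGU"),
    "ambiguous DNA": set("ACGTRYSWKMBDHVN"),
    "ambiguous RNA": set("ACGURYSWKMBDHVN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def smallest_alphabet(seq):
    """Return the name of the smallest alphabet containing every letter
    of seq, or None if no candidate alphabet covers it."""
    letters = set(seq.upper())
    fitting = [name for name, alpha in ALPHABETS.items() if letters <= alpha]
    if not fitting:
        return None
    # min() keeps the first of equally small alphabets, so "ACG" gives DNA.
    return min(fitting, key=lambda name: len(ALPHABETS[name]))


print(smallest_alphabet("ACGTN"))  # ambiguous DNA
print(smallest_alphabet("MKLE"))   # protein
print(smallest_alphabet("MKV"))    # ambiguous DNA
```

The last call shows why autodetection gets hairy: M, K and V are all valid ambiguous nucleotide codes, so a short peptide can be mistaken for DNA.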
> For now you can either have user selection or autodetect,
> or let the user select autodetect, and show the guessed
> result perhaps as
>
>     alphabet  [Ambiguous DNA]   autodetect <*>
>                    ^^^^                    ^^^
>              (pulldown of types)        (checkbox)
>
> The autodetect code for nucleotides is:
>   If it contains a U, it's RNA.
>   If it contains a T, it's DNA.
>   If it doesn't contain either, it's DNA.
>   If it contains both, it's neither.
>   If it contains characters other than "ATCGU" it's ambiguous.
>   Otherwise it's unambiguous.
>
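
The nucleotide rules listed above translate almost line for line into Python. This is only a sketch: the function name and the returned strings are made up for the example, and None stands in for "neither".

```python
def guess_nucleotide_alphabet(seq):
    """Apply the T/U and "ATCGU" rules to a nucleotide sequence.

    Returns e.g. "unambiguous DNA", or None when the sequence contains
    both T and U and so is neither DNA nor RNA.
    """
    letters = set(seq.upper())
    has_t = "T" in letters
    has_u = "U" in letters
    if has_t and has_u:
        return None                      # contains both: neither
    kind = "RNA" if has_u else "DNA"     # no T and no U defaults to DNA
    # Any character outside "ATCGU" marks the sequence as ambiguous.
    ambiguous = bool(letters - set("ATCGU"))
    return ("ambiguous " if ambiguous else "unambiguous ") + kind


print(guess_nucleotide_alphabet("ACGU"))  # unambiguous RNA
print(guess_nucleotide_alphabet("ATGN"))  # ambiguous DNA
print(guess_nucleotide_alphabet("ATU"))   # None
```
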
> Will you support backtranslation of protein to DNA?  I
> noticed, BTW, that there's no work for back translation to
> RNA different than to DNA, so if both are needed there may
> need to be a "back translate to DNA" and "..to RNA" options.
> Jeff, I recall, doesn't think back translations are really
> useful.  Again, there's that domain knowledge :)
>
> The output sequence alphabet, although obvious, should be
> labeled.  It should have a title like "translated protein
> sequence" or "transcribed DNA sequence" so people don't have
> to remember what the last action was.
>
> As for the GUI itself, the top of the list of codon tables
> should be aligned with the top of the sequence input box.
> The list also needs a scrollbar to show there are more items
> than listed there (they can be accessed by selecting and
> dragging the mouse down, but that is not obvious).
>
> The inside of the sequence entry box should be recessed,
> as a clue that that is a writable area.  The lower box
> should not be editable.  There may need to be scrollbars for
> the sequence boxes, though I think people will just copy
> and paste (or save to file?) as need be.
>
> Not being much of a Tk programmer, I don't know how to
> change those.
>
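
For the layout changes suggested above, a rough Tkinter sketch follows. It assumes Python 3 (where the module is named tkinter rather than Tkinter); the function name and widget sizes are invented for illustration, and this is not the prototype's actual code.

```python
def build_seq_window(codon_table_names):
    """Build a mock-up window and return its root widget (hypothetical helper)."""
    import tkinter as tk  # imported here so the module loads without a display

    root = tk.Tk()
    root.title("Seq GUI sketch")

    # Codon table list with a scrollbar, top-aligned with the input box.
    list_frame = tk.Frame(root)
    list_frame.grid(row=0, column=0, sticky="n")
    scrollbar = tk.Scrollbar(list_frame, orient=tk.VERTICAL)
    tables = tk.Listbox(list_frame, height=6, yscrollcommand=scrollbar.set)
    scrollbar.config(command=tables.yview)
    for name in codon_table_names:
        tables.insert(tk.END, name)
    tables.pack(side=tk.LEFT)
    scrollbar.pack(side=tk.RIGHT, fill=tk.Y)

    # A recessed (sunken) border marks the writable sequence entry area.
    seq_in = tk.Text(root, height=6, width=50, relief=tk.SUNKEN, borderwidth=2)
    seq_in.grid(row=0, column=1, sticky="n")

    # Output box: flat border and disabled state so it cannot be edited.
    seq_out = tk.Text(root, height=6, width=50, relief=tk.FLAT, state=tk.DISABLED)
    seq_out.grid(row=1, column=1)
    return root
```

To try it, call build_seq_window(["Standard", "Vertebrate Mitochondrial"]).mainloop(). Writing a result into the disabled output box means switching its state to NORMAL, inserting the text, and switching it back to DISABLED.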

Absolutely.  I'm going to have to get some serious Tkinter documentation,
though.  Doesn't it seem like the free stuff on the web just gives you
enough to get your feet wet, then invites you to buy a book? :)
>
> For the code part, instead of "w.insert( END, ...) " you
> could look at the list directly in Data/CodonTable.py, but
> I suspect you did things this way so the layout could be
> mocked up independent of having a working biopython
> distribution.  It definitely made it easy for me to run
> without needing to set my PYTHONPATH!
>
>
> As a GUI design in general, I've been thinking about having
> a display like:
>
>    [list of   ]     [input area for        ]
>    [available ]     [selecting parameters  ]
>    [operations]     [related to the        ]
>    [translate ]     [selected operation, eg]
>    [transcribe]     [list of codon tables  ]
>
>    [ sequence input area                            ]
>    [                                                ]
>    [                                                ]
>    [                                                ]
>    [                                                ]
>
>    [ output area                                    ]
>    [                                                ]
>    [                                                ]
>    [                                                ]
>
> The two fixed areas are: the sequence input box and the
> list of operations.  When the operation changes, the list
> of parameters associated with that operation changes (that's
> the box in the upper right).  As well, the output area
> may change, if the output isn't a sequence.
>
> For example, suppose there's an "alpha helix propensity"
> operation.  One possible parameter is the window size,
> which could be a slider bar in the upper right.
>
> The output area would be a graph, along with some scaling
> controls to zoom in on different areas.
>
> Hmm, OTOH, this isn't very useful since I would like to
> see different propensity predictions simultaneously, or
> even several of the same propensity predictions with
> different parameters. Hmm, hmmm!  That's like what you wanted
> when seeing different possible back translations at the same
> time.
>
> So just think of this as a general idea I like (similar
> to tabbed entry areas, which I don't like).
>
  These sound like great ideas.  My only question is when?  I think the next
steps should be to clean up the GUI, as you suggested, and add a logging
feature.  The logging is essential.  Other features, such as graphs, can be
added incrementally as we learn the needs of the bioinformatics community.
Of course, if the program is too minimal, it will be boring, so we need to
strike a balance.  But I like to start the dialog with the user early (after
bugs are removed, but before the full functionality is implemented).

   Also, I wonder if we should have an advanced button.  It can be confusing
to have too many options in the dialog.


Cayte