Bioperl: text vs. bio seq recognition

Thu, 27 May 1999 10:33:56 -0700

At 09:41 +0100 5/27/99, Tony Cox wrote:
>
> It's pretty straight forward to spot DNA but does anyone have any advise
>on how
> to differentiate reliably between a text query and a protein sequence (or
>maybe
> a DNA sequence with ambiguities). Somebody out there must have done this
> before.....

I recently was working on the backend code to a form where the user is
expected to paste in a sequence and then specify type (nucleotide or
protein) for the blast search.  Since I don't trust the user not to
accidentally specify the wrong type, or forget to specify and use the
default (and the results would be ... inappropriate otherwise) my code does
some basic checking to see if the data matches the specified type.

Note that the Perl code below (clipped from my script) is based on the
assumption that the user tells the program what the type should be (or the
program thinks it knows).

For your purposes, that should be changed -- e.g. you might compare the
input first to nucleotide, then to protein, and lastly to text, setting a
flag for each match. If there is more than one match, you either do more
analysis or complain.

  $NPCT = 20;  # maximum allowable percent of non [ATCGN] letters in
               # valid nucleotide sequence
  $PPCT = 5;   # minimum allowable percent of non [ATCGN] letters in
               # valid protein sequence

  $seq = shift;  # the "sequence" (in theory).

  $seq =~ tr/a-z/A-Z/;         # flatten case
  $len = length($seq);

  if ($type =~ /^n/i) {        # nucleotide
      $iupac = $seq =~ tr/ATCGN\n//c;  # count non-ATCGN characters
      $pct = ($iupac / $len) * 100;
      if ($pct >= $NPCT) {
           $ERR = "Supposed nucleotide sequence contains too many" .
                  "letters not in set [ATCGN]";
            return(0);
      }
  } else {
      # must be protein :-)
      $iupac = $seq =~ tr/ABCDEFGHIKLMNPQRSTVWXY//c;  # count non-IUPAC
chars
      if ($iupac) {
        $ERR = "Supposed protein sequence contains out of bound letters" .
               " not in IUPAC set";
        return(0);
      }
      $iupac = $seq =~ tr/ATCGN\n//c;  # count non-ATCGN characters
      $pct = ($iupac / $len) * 100;
      if ($pct < $PPCT) {
         $ERR = "Supposed protein sequence contains too few letters" .
				" outside set [ATCGN]";
         return(-1);
        }

  }

This is a very simple algorithm based on several factors:

   1) The IUPAC alphabet for nucleotide is (almost) a proper subset of the
alphabet for Protein. (There's no U in the protein alphabet, but unless you
expect to have RNA sequence this information probably won't help you)

   2) Nevertheless, nucleotide sequence is most likely to be (primarily)
[ATCGN]
and possibly X.  Thus the use of the $NPCT setting... it's _possible_ to
have, e.g. W, Y, R... in nucleotide sequence, but not as _likely_, so we
set a threshold for the minimum percentage of non-[ATCGN] characters we
accept in nucleotide sequence.

   3) Protein sequence is similarly unlikely to be composed primarily of
[ATCGN], so we set another threshold for "unliklely protein" -- maximum
acceptable percentage of [ATCGN]

   4) Text has spaces in it :-) i.e. text of more than one word is not
composed solely of letters from the IUPAC alphabet. Also, reasonably sized
runs of text are also likely to contain the letter O.

If you want to get trickier, you can make some more educated algorithms
based on character frequency. That is, a, e, i, o, u appear often in text,
but are the corresponding amino acids as common in proteins? (I don't know
the English word<->letter frequency table off the top of my head but I'm
certain it's findable - I don't know the AA frequency table either, but I'd
ask a biologist :-]

Testing length of sequence is probably also a good idea (although I didn't
bother). While a single word of text could be made up of only IUPAC
letters, a single word is usually VERY short for a proper sequence.

- Vicki
-----
 //=\   Vicki Brown <vlb@deltagen.com>
 \=//    Journeyman Sourcerer: Scripts & Philtres
  //=\    (Mac)Perl, awk, sed, *sh..., occasional C
  \=//     A little web-gardening on the weekends
   //=\
   \=//      Deltagen, Inc.
    //=\     1031 Bing St, San Carlos, CA 94070
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================