[Biopython-dev] Sequence object allows non-alphabet characters

Mon Dec 19 12:49:50 UTC 2011

Peter Cock <p.j.a.cock <at> googlemail.com> writes:

> 
> On Sunday, December 18, 2011, Markus Piotrowski <
> Markus.Piotrowski <at> ruhr-uni-bochum.de> wrote:
> > Dear Biopython developers,
> >
> > I wonder why the following code does not throw an exception:
> >
> > >>> from Bio.Seq import Seq
> > >>> from Bio.Alphabet import IUPAC
> > >>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
> > >>> mySeq
> > Seq('GATC1234YWSK', IUPACUnambiguousDNA())
> >
> > I expected that trying to generate a sequence object containing
> > non-alphabet characters would either throw an exception/warning or
> > "downgrade" the alphabet, if possible.
> >

> 
> See https://redmine.open-bio.org/issues/2597
> 
> To me the obvious approach is to validate this in the Seq object
> __init__ if and only if the alphabet selected has a letters
> attribute with the valid characters given. However, this will
> slow things down and probably break a number of existing
> scripts. Perhaps a global setting for ignore (current behaviour),
> warning, or exception? We could make the default the
> warning for a release or two and then switch to an error.
> 
> Peter
> 
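For illustration, the ignore/warning/exception setting described above could
be sketched roughly as follows. This is only a sketch under assumptions:
VALIDATION_POLICY and check_letters are hypothetical names, not part of
Biopython, and the "letters" argument stands in for the alphabet's letters
attribute (None when the alphabet defines no valid characters).

```python
import warnings

# Hypothetical global setting: "ignore" (current behaviour), "warn" or "raise".
VALIDATION_POLICY = "warn"


def check_letters(data, letters, policy=None):
    """Check data against the valid letters and apply the chosen policy.

    Returns the set of characters in data that are not in letters.
    Nothing is checked when letters is None or the policy is "ignore".
    """
    if policy is None:
        policy = VALIDATION_POLICY
    if policy not in ("ignore", "warn", "raise"):
        raise ValueError("unknown validation policy: %r" % (policy,))
    if letters is None or policy == "ignore":
        return set()
    bad = set(data) - set(letters)
    if bad:
        message = "characters not in alphabet: %s" % "".join(sorted(bad))
        if policy == "raise":
            raise ValueError(message)
        warnings.warn(message)
    return bad
```

A Seq-like __init__ would call check_letters(data, alphabet_letters) once,
so the cost is a single pass over the string and only when a letters
attribute is available.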

What about an additional optional argument to the sequence object, such as
"validate=True/False", with False as the default? This would not break existing
code and would not affect speed (if validate=False), but it would make it
possible to have the sequence validated against the selected alphabet. In
addition, validate=True without a selected alphabet would allow basic sequence
polishing, such as converting to uppercase and removing whitespace and digits
(any non-alphabetic characters?).
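
To make the idea concrete, here is a rough sketch of how such an option might
behave. SketchSeq is a hypothetical stand-in class, not the real Bio.Seq.Seq,
and its "letters" argument is assumed to play the role of the alphabet's
letters attribute. The polishing step removes only whitespace and digits,
since whether other non-alphabetic characters should go is left open above.

```python
class SketchSeq:
    """Hypothetical stand-in for a Seq object with an optional validate flag."""

    def __init__(self, data, letters=None, validate=False):
        if validate:
            if letters is None:
                # No alphabet selected: basic polishing only.
                # Drop whitespace and digits, then convert to uppercase.
                data = "".join(
                    ch for ch in data if not (ch.isspace() or ch.isdigit())
                ).upper()
            else:
                # Alphabet selected: reject characters outside its letters.
                bad = sorted(set(data) - set(letters))
                if bad:
                    raise ValueError(
                        "characters not in alphabet: %s" % "".join(bad)
                    )
        # With validate=False (the default) the data is stored untouched.
        self.data = data
        self.letters = letters

    def __repr__(self):
        return "SketchSeq(%r)" % (self.data,)
```

With the default validate=False, SketchSeq("GATC1234YWSK", "GATC") is accepted
as before; with validate=True the same call raises ValueError, and
SketchSeq("gatc 12 yw sk", validate=True) stores the polished string.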