[Biopython-dev] Sequence object allows non-alphabet characters

Sun Dec 18 14:03:10 UTC 2011

On Sunday, December 18, 2011, Markus Piotrowski <
Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> Dear Biopython developers,
>
> I wonder why the following code does not throw an exception:
>
>>>> from Bio.Seq import Seq
>>>> from Bio.Alphabet import IUPAC
>>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
>>>> mySeq
> Seq('GATC1234YWSK', IUPACUnambiguousDNA())
>
> I expected that trying to generate a sequence object containing non-alphabet
> characters would either throw an exception/warning or "downgrade" the
> alphabet, if possible.
>
> Another facet of the same problem is whitespace:
>
>>>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna)
>>>> mySeq
> Seq('GATC GATC', IUPACUnambiguousDNA())
>>>> len(mySeq)
> 9
>
> Which is problematic when the sequence length is required (calculating GC
> content, calculating melting temperature, etc.)
>
> While it could be argued that checking the integrity of the sequence data is
> related to parsing, I think that the sequence in the sequence object should
> never contain whitespace, and if an alphabet is assigned it should not
> contain non-alphabet characters. So this should be handled by the sequence
> object itself?
>

See https://redmine.open-bio.org/issues/2597

To me the obvious approach is to validate this in the Seq object's
__init__ if and only if the selected alphabet has a letters
attribute listing the valid characters. However, this will
slow things down and probably break a number of existing
scripts. Perhaps a global setting for ignore (current behaviour),
warning, or exception? We could make the default the
warning for a release or two and then switch to an error.
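
Roughly, the check could look like the sketch below. This is a
standalone illustration, not Biopython code: the alphabet stand-ins,
CheckedSeq, DEFAULT_MODE and the mode argument are all hypothetical
names, and the case-sensitive comparison is an assumption.

```python
import warnings

# Hypothetical global policy: "ignore" (current behaviour), "warn", or "error".
DEFAULT_MODE = "warn"


class GenericAlphabet:
    """Stand-in for an alphabet with no declared letters (never validated)."""
    letters = None


class UnambiguousDNA(GenericAlphabet):
    """Stand-in for an alphabet that declares its valid characters."""
    letters = "GATC"


class CheckedSeq:
    """Sequence wrapper that validates data against alphabet.letters."""

    def __init__(self, data, alphabet=None, mode=None):
        if mode is None:
            mode = DEFAULT_MODE
        # Only validate when the alphabet actually declares its letters.
        letters = getattr(alphabet, "letters", None)
        if letters and mode != "ignore":
            # Assumption: case-sensitive match, so "gatc" would be rejected.
            bad = sorted(set(data) - set(letters))
            if bad:
                msg = "Characters %r are not in alphabet letters %r" % (
                    "".join(bad), letters)
                if mode == "error":
                    raise ValueError(msg)
                warnings.warn(msg)
        self.data = data
        self.alphabet = alphabet

    def __len__(self):
        return len(self.data)
```

With mode="error", both CheckedSeq("GATC1234YWSK", UnambiguousDNA())
and CheckedSeq("GATC GATC", UnambiguousDNA()) would raise ValueError,
while mode="ignore" keeps today's behaviour. The set difference costs
one pass over the data, which is the slowdown mentioned above.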

Peter