[Biopython-dev] Sequence object allows non-alphabet characters

Peter Cock p.j.a.cock at googlemail.com
Sun Dec 18 14:03:10 UTC 2011

On Sunday, December 18, 2011, Markus Piotrowski <
Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> Dear Biopython developers,
> I wonder why the following code does not throw an exception:
> >>> from Bio.Seq import Seq
> >>> from Bio.Alphabet import IUPAC
> >>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
> >>> mySeq
> Seq('GATC1234YWSK', IUPACUnambiguousDNA())
> I expected that trying to generate a sequence object containing non-alphabet
> characters would either throw an exception/warning or "downgrade" the alphabet
> if possible.
> Another facet of the same problem is whitespace:
> >>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna)
> >>> mySeq
> Seq('GATC GATC', IUPACUnambiguousDNA())
> >>> len(mySeq)
> 9
> Which is problematic when the sequence length is required (calculating GC
> content, calculating melting temperature, etc.)
> While it could be argued that checking the integrity of the sequence data is
> related to parsing, I think that the sequence in the sequence object should
> never contain whitespace, and if an alphabet is assigned it should not contain
> non-alphabet characters. So this should be handled by the sequence object.

See https://redmine.open-bio.org/issues/2597

To me the obvious approach is to validate this in the Seq object
__init__ if and only if the alphabet selected has a letters
attribute with the valid characters given. However, this will
slow things down and probably break a number of existing
scripts. Perhaps a global setting for ignore (current behaviour),
warning, or exception? We could make the default the
warning for a release or two and then switch to an error.
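
As a rough sketch of the idea only, not actual Biopython code: the
SketchAlphabet and SketchSeq classes and the SETTINGS dictionary below
are hypothetical stand-ins invented for illustration. The check runs in
__init__ only when the alphabet has a non-empty letters attribute, and a
global setting chooses between ignore, warn and error.

```python
import warnings

# Hypothetical global setting: "ignore" (current behaviour), "warn" or "error".
SETTINGS = {"validation": "warn"}


class SketchAlphabet:
    """Stand-in alphabet; letters is None when no letter set is defined."""

    def __init__(self, letters=None):
        self.letters = letters


class SketchSeq:
    """Stand-in sequence class that validates its data on construction."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet
        mode = SETTINGS["validation"]
        letters = getattr(alphabet, "letters", None)
        # Validate if and only if the alphabet lists its valid characters.
        if mode == "ignore" or not letters:
            return
        # Case-sensitive comparison; whitespace and digits count as invalid.
        invalid = sorted(set(data) - set(letters))
        if not invalid:
            return
        message = "Sequence contains characters not in the alphabet: %r" % invalid
        if mode == "error":
            raise ValueError(message)
        warnings.warn(message)


unambiguous_dna = SketchAlphabet("GATC")
seq = SketchSeq("GATC GATC", unambiguous_dna)  # warns in the default mode
```

Building a set of the data keeps the check to a single pass over the
sequence, but it is still extra work on every construction, which is
the slowdown concern mentioned above.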

