[Biopython] Seq object ungap method

Fri Nov 20 09:29:25 EST 2009

Hi all,

Something we discussed last year was adding an ungap method
to the Seq object. e.g.
http://lists.open-bio.org/pipermail/biopython/2008-September/004523.html
http://lists.open-bio.org/pipermail/biopython/2008-September/004527.html

As mentioned earlier this month on the dev mailing list,
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006983.html
I actually made the time to implement this, and posted it on a
github branch - you can see the updated Bio/Seq.py file here:
http://github.com/peterjc/biopython/blob/ungap/Bio/Seq.py

I've included a copy of the proposed docstring for the new Seq object
ungap method at the end of this email, which tries to illustrate how this
would be used.

I'd like some comments - is this worth including in Biopython?

Thanks,

Peter

--

This is the proposed docstring for the new Seq object ungap method,
the examples double as doctest unit tests:

<quote>
Return a copy of the sequence without the gap character(s).

The gap character can be specified in two ways - as an explicit argument,
or via the sequence's alphabet. For example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna)
>>> my_dna
Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet())
>>> my_dna.ungap("-")
Seq('ATATGAAATTTGAAAA', DNAAlphabet())

If the gap character is not given as an argument, it will be taken from
the sequence's alphabet (if defined). Notice that the returned sequence's
alphabet is adjusted since it no longer requires a gapped alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, Gapped, HasStopCodon
>>> my_pro = Seq("MVVLE=AD*", HasStopCodon(Gapped(IUPAC.protein, "=")))
>>> my_pro
Seq('MVVLE=AD*', HasStopCodon(Gapped(IUPACProtein(), '='), '*'))
>>> my_pro.ungap()
Seq('MVVLEAD*', HasStopCodon(IUPACProtein(), '*'))

Or, with a simpler gapped DNA example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, Gapped
>>> my_seq = Seq("CGGGTAG=AAAAAA", Gapped(IUPAC.unambiguous_dna, "="))
>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap()
Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA())

As long as it is consistent with the alphabet, although it is redundant,
you can stil supply the gap character as an argument to this method:

>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap("=")
Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA())

However, if the gap character given as the argument disagrees with that
declared in the alphabet, an exception is raised:

>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap("-")
Traceback (most recent call last):
   ...
ValueError: Gap '-' does not match '=' from alphabet

Finally, if a gap character is not supplied, and the alphabet does not
define one, an exception is raised:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna)
>>> my_dna
Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet())
>>> my_dna.ungap()
Traceback (most recent call last):
   ...
ValueError: Gap character not given and not defined in alphabet

</quote>