[Biopython-dev] Seq object join method
Peter
biopython at maubp.freeserve.co.uk
Fri Nov 20 11:11:43 EST 2009
Hello all,
Some more code to evaluate, again on a branch in github:
http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5
This adds a join method to the Seq object, essentially an
alphabet-aware version of the Python string join method. Recall
that for strings:
sep.join([a,b,c]) == a + sep + b + sep + c
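As a quick check of that identity in plain Python (the variable names
here are only for illustration):

```python
parts = ["ACGT", "ACRGT", "TTAA"]
sep = "-"

# str.join places the separator between consecutive items only.
joined = sep.join(parts)
by_hand = parts[0] + sep + parts[1] + sep + parts[2]
assert joined == by_hand

# With a single item there is nothing to separate, so the
# separator does not appear in the result at all.
single = sep.join(["ACGT"])
assert single == "ACGT"
```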
This leads to a common idiom for concatenating a list of strings:
"".join([a,b,c]) == a + "" + b + "" + c == a + b + c
That is fine for strings, but not necessarily for Seq objects since even
a zero length sequence has an alphabet. Consider this example:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna
>>> unamb_dna_seq = Seq("ACGT", unambiguous_dna)
>>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna)
>>> unamb_dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> ambig_dna_seq
Seq('ACRGT', IUPACAmbiguousDNA())
If we add two sequences, one with the unambiguous and one with the
ambiguous IUPAC DNA alphabet, the result has the ambiguous IUPAC DNA
alphabet:
>>> unamb_dna_seq + ambig_dna_seq
Seq('ACGTACRGT', IUPACAmbiguousDNA())
However, if a sequence with the default generic alphabet is included,
even an empty one, the result falls back to the generic alphabet:
>>> unamb_dna_seq + Seq("") + ambig_dna_seq
Seq('ACGTACRGT', Alphabet())
Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]).
Should it follow the addition behaviour (giving the default generic
alphabet) or "do the sensible thing" and preserve the IUPAC alphabet
(here, ambiguous DNA)?
As written, Seq("").join(...) is handled as a special case, and
the alphabet of the empty sequence is ignored. To me this is a
case of "practicality beats purity": it is much nicer than being
forced to write Seq("", ambiguous_dna).join(...), where the empty
sequence is explicitly given a suitable alphabet.
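To make the trade-off concrete, here is a toy sketch in plain Python.
It is not the code from the commit above and does not use Bio.Seq or
Bio.Alphabet: ToySeq, merge and the three alphabet classes are
hypothetical stand-ins, and the merge rule (take the more general of
two alphabets) is a simplifying assumption rather than Biopython's
actual consensus logic. It only illustrates the special case being
proposed.

```python
from functools import reduce


class Alphabet:
    """Toy generic alphabet (a stand-in, not Bio.Alphabet.Alphabet)."""

    def __repr__(self):
        return type(self).__name__ + "()"


class AmbiguousDNA(Alphabet):
    """Toy ambiguous DNA alphabet."""


class UnambiguousDNA(AmbiguousDNA):
    """Toy unambiguous DNA alphabet, modelled as a special case of AmbiguousDNA."""


def merge(a, b):
    """Return the more general of two alphabets (assumed rule, see above)."""
    if isinstance(a, type(b)):
        return b
    if isinstance(b, type(a)):
        return a
    return Alphabet()  # unrelated alphabets fall back to generic


class ToySeq:
    """Minimal sequence type: a string plus an alphabet."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet if alphabet is not None else Alphabet()

    def __repr__(self):
        return "ToySeq(%r, %r)" % (self.data, self.alphabet)

    def __add__(self, other):
        return ToySeq(self.data + other.data,
                      merge(self.alphabet, other.alphabet))

    def join(self, seqs):
        seqs = list(seqs)
        alphabets = [s.alphabet for s in seqs]
        # Special case: an empty separator with the plain generic
        # alphabet does not contribute its alphabet to the result.
        ignore_sep = self.data == "" and type(self.alphabet) is Alphabet
        if not ignore_sep:
            alphabets.append(self.alphabet)
        alphabet = reduce(merge, alphabets) if alphabets else Alphabet()
        return ToySeq(self.data.join(s.data for s in seqs), alphabet)


unamb = ToySeq("ACGT", UnambiguousDNA())
ambig = ToySeq("ACRGT", AmbiguousDNA())

added = unamb + ToySeq("") + ambig                      # pure addition
joined = ToySeq("").join([unamb, ambig])                # special case
explicit = ToySeq("", AmbiguousDNA()).join([unamb, ambig])

print(added)
print(joined)
print(explicit)
```

In this sketch the chained addition ends up with the generic alphabet,
while both join calls keep the toy ambiguous DNA alphabet, which is the
behaviour argued for above.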
So, what do people think?
Peter