[Biopython-dev] Seq object join method

Eric Talevich eric.talevich at gmail.com
Fri Nov 20 14:28:42 EST 2009


On Fri, Nov 20, 2009 at 11:11 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Hello all,
>
> Some more code to evaluate, again on a branch in github:
> http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5
>
> This adds a join method to the Seq object, basically an alphabet
> aware version of the Python string join method. Recall that for
> strings:
>
> sep.join([a,b,c]) == a + sep + b + sep + c
>
> This leads to a common idiom for concatenating a list of strings,
>
> "".join([a,b,c]) == a + "" + b + "" + c == a + b + c
>
> [...]
>
> Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]),
> should it follow the addition behaviour (giving a default alphabet)
> or "do the sensible thing" and preserve the IUPAC alphabet?
>
> As written, Seq("").join(...) is handled as a special case, and
> the alphabet of the empty string is ignored. To me this is a
> case of "practicality beats purity", it is much nicer than being
> forced to do Seq("", ambiguous_dna).join(...) where the empty
> sequence is given a suitable alphabet.
>
> So, what do people think?
>
> Peter
>
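
For reference, the two string identities Peter quotes can be checked
in plain Python. This is only a sketch of str.join itself; the sample
strings and the "-" separator are made up for illustration, and no
Biopython code is involved.

```python
# Plain str.join; the example values are arbitrary.
a, b, c = "ACG", "TTA", "GGC"
sep = "-"

joined = sep.join([a, b, c])        # separator placed between items
concatenated = "".join([a, b, c])   # empty separator: plain concatenation

assert joined == a + sep + b + sep + c
assert concatenated == a + "" + b + "" + c == a + b + c
```

The empty separator contributes nothing to the result, which is the
intuition behind ignoring the alphabet of Seq("") in the proposed
special case.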

Thoughts:

1. Why doesn't Alphabet._consensus_alphabet raise a
TypeError("Incompatable alphabets") in the cases where
_check_type_compatibility would fail, at least when an optional
argument asks for it? Probably because it's a private function.
Should it be a public function, with a friendlier interface?

2. This might cause massive compatibility problems now, but would it
be better for Seq() to use an "unknown_alphabet" by default instead of
"generic"? Then _consensus_alphabet could safely ignore those
sequences with unspecified alphabets, and Seq.join wouldn't need that
special case.

3. Alternatively, how much code would break if _consensus_alphabet
simply treated generic_alphabet as unspecified, and ignored it when
calculating the consensus alphabet? This effect could be limited to
just Seq.join by dropping the test that the sequence length is 0, but
it might be useful to have the same behavior for addition.
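
To make points 1 and 3 concrete, here is a rough toy model. It is not
Biopython code: the classes, the "family" attribute, the
consensus_alphabet function and its "strict" flag are all hypothetical
stand-ins, and the rule of picking the alphabet with the most letters
is an assumption made for the sketch.

```python
# Toy model only. None of these names come from Bio.Alphabet.

class Alphabet:
    """Generic alphabet; family None marks it as unspecified."""
    family = None
    letters = ""


class UnambiguousDNA(Alphabet):
    family = "dna"
    letters = "GATC"


class AmbiguousDNA(Alphabet):
    family = "dna"
    letters = "GATCRYWSMKHBVDN"


class Protein(Alphabet):
    family = "protein"
    letters = "ACDEFGHIKLMNPQRSTVWY"


def consensus_alphabet(alphabets, strict=False):
    """Pick one alphabet for a group, skipping generic ones (point 3).

    With strict=True, mixed families raise TypeError (point 1);
    otherwise they fall back to the generic alphabet.
    """
    specified = [a for a in alphabets if a.family is not None]
    if not specified:
        return Alphabet()
    families = {a.family for a in specified}
    if len(families) > 1:
        if strict:
            raise TypeError("Incompatible alphabets")
        return Alphabet()
    # Same family: assume the alphabet with the most letters wins.
    return max(specified, key=lambda a: len(a.letters))


# A generic alphabet (as in Seq("")) no longer affects the result.
result = consensus_alphabet([Alphabet(), UnambiguousDNA(), AmbiguousDNA()])
print(type(result).__name__)
```

Under these assumptions a join on Seq("") would need no special case,
because the generic alphabet of the separator is skipped like any
other unspecified alphabet.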

Cheers,
Eric

