[Biopython] Sequence object "find" is still case specific?

Sun Mar 3 19:13:34 UTC 2013

Thanks for your replies on Google Plus, Iddo Friedberg and Chris Lasher.
I am reproducing them here.

On Google Plus Iddo wrote:
Good question. There is no default strict checking. You may also want to
see the manual, Section 3.6:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc173.6.

On Google Plus Chris Lasher wrote:
Hmm, well, lower case nucleotides have often represented "masked regions"
of sequences. It seems that Biopython sequences were meant to be
case-sensitive (e.g.,
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22). From the
documentation there, it seems like you've discovered a bug in the API; I
feel that Seq should raise a ValueError when instantiating with lower-case
nucleotides and unambiguous_dna.

I suppose my suggestion would be to always normalize to upper-case if
you're not dealing with masked regions.

So I understand that in most cases I am better off
....just treating my Sequence objects as strings,
....imposing strict checking while creating them,
....or force-converting to upper case during instantiation.

Would it not make sense to have one of the following behaviors?

seq = Seq("atgCTCGAGcatcatcat", IUPAC.unambiguous_dna) throws an error,
since mixed case is used, which is not allowed

or

It just silently converts it all to the case of the Unambiguous_DNA
specification, and then all "find" and "search" calls work regardless of
case on this internal representation, which is just "DNA".
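
The two behaviors can be sketched with plain strings. This is only a
minimal illustration, not Biopython's API: strict_dna and normalized_dna
are hypothetical names, and the allowed letters "ACGT" are an assumption
standing in for the unambiguous DNA alphabet.

```python
def strict_dna(raw, letters="ACGT"):
    """Hypothetical strict check: reject anything outside upper-case ACGT."""
    bad = sorted(set(raw) - set(letters))
    if bad:
        raise ValueError("letters not allowed: " + "".join(bad))
    return raw


def normalized_dna(raw, letters="ACGT"):
    """Hypothetical lenient variant: upper-case first, then validate."""
    return strict_dna(raw.upper(), letters)


raw = "atgCTCGAGcatcatcat"

# Behavior 1: mixed case is refused outright.
try:
    strict_dna(raw)
except ValueError as err:
    print("rejected:", err)

# Behavior 2: case is normalized silently, so the search succeeds.
print(normalized_dna(raw).find("CTCGAG"))  # 3
```

The string returned by either function could then be handed to Seq, so
the choice between raising and normalizing stays in one place.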

*But for now I will just force case to upper when instantiating*
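
One caveat with forcing upper case: in a stored string such as
"atgCTCGAGcatcatcat" the upper-case run marks the restriction site, and
that marking is lost after conversion. A minimal sketch, assuming the
positions are worth keeping, that records them before converting;
uppercase_spans is a hypothetical helper name, not part of Biopython.

```python
import re


def uppercase_spans(raw):
    """Return (start, end) index pairs for each run of upper-case letters."""
    return [match.span() for match in re.finditer(r"[A-Z]+", raw)]


raw = "atgCTCGAGcatcatcat"
spans = uppercase_spans(raw)   # remember where the marked region was
upper = raw.upper()            # this is what would be passed to Seq

for start, end in spans:
    print(start, end, upper[start:end])
```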

Thanks for your help
Hari

So in the examples:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True

However, find still fails, which is counter-intuitive.

>>> dna_seq.find("GTAC")
-1
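
Until the library decides on this, a case-insensitive search can be done
by comparing upper-cased copies. The sketch below is a hypothetical helper
(find_ignore_case is not a Biopython function). It works on plain strings,
and it should also accept any sequence object whose str() returns its
letters.

```python
def find_ignore_case(sequence, motif, start=0):
    """Hypothetical helper: like str.find, but ignores letter case.

    Both arguments go through str(), so plain strings work directly.
    The original object is left untouched; only copies are upper-cased.
    """
    return str(sequence).upper().find(str(motif).upper(), start)


raw = "atgCTCGAGcatcatcatcatcat"
print(find_ignore_case(raw, "ctcgag"))       # 3
print(find_ignore_case(raw, "CTCGAG"))       # 3
print(find_ignore_case("acgtACGT", "GTAC"))  # 2
```

Because the index refers to the unchanged original, the mixed-case string
can still be sliced at that position afterwards.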

On Sun, Mar 3, 2013 at 1:34 PM, hari jayaram <harijay at gmail.com> wrote:
>
> I am relatively new to Biopython, having not used it for a while. I have
> the "bad" habit of storing sequences in an internal database as mixed-case
> strings, i.e. "atgCTCGAGcatcatcat", where the upper-case letters are a
> restriction site I normally use for cloning purposes.
>
> I am interested in using Biopython to write a PDF-based (using ReportLab)
> plasmid vector map drawing utility for all the sequences in my database.
>
>
> I am just getting started and was wondering why the Sequence object's
> "find" still behaves like an ordinary Python string find, e.g.:
>
>
> >>> from Bio.Seq import Seq
> >>> raw_seq_mixed_case = "atgCTCGAGcatcatcatcatcat"
> >>> from Bio.Alphabet import IUPAC
> >>> my_seq = Seq(raw_seq_mixed_case, IUPAC.unambiguous_dna)
> >>> my_seq.find("ctcgag")
> -1
> >>> my_seq.find("CTCGAG")
> 3
>
> Along these lines, this does not work either:
> >>> search_sequence = Seq("ctcgag",IUPAC.unambiguous_dna)
> >>> my_seq.find(search_sequence)
> -1
> >>> my_seq.find(search_sequence.tostring())
> -1
> >>> my_seq.find(search_sequence.tostring().upper())
> 3
>
> I wonder if I am doing something wrong.
>
> It seems strange that the Seq object would behave like a Python string
> after going through the process of telling it that it is
> "unambiguous_dna". I didn't want to roll my own solution for handling
> sequences etc. and would prefer playing along with Biopython conventions.
>
> Thanks for your help
> Hari
>