[Biopython-dev] [Bug 2601] Seq find() method: proposal

Mon Sep 29 20:47:50 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2601

------- Comment #4 from bsouthey at gmail.com  2008-09-29 16:47 EST -------
(In reply to comment #3)
> Make a cup of tea, this is a long one... ;)
> 
> Peter:
> >Note that any Seq.find() method should be as like the string find method as
> >possible for consistency. 
> Bruce:
> > While I
> > see the point of the reverse complement and overlapping matches, these are
> > inconsistent with re module
> 
> I see your points, but I'm not /entirely/ in agreement, here.  

Good, as where would the fun be otherwise?  :-)

> that it is nearly always a good thing that the input arguments and returned
> results match those that are expected for same-named functions in similar
> classes, I think that we may still take the opportunity to implement useful
> behaviour that is relevant to biological sequences where the intent doesn't
> stray too far from what you'd expect for a string.  For example, the ability to
> accommodate ambiguous alphabets or regular expressions - not part of
> string.find() - would be useful.  I think that this approach implements
> additional functionality of which the string.find() method's functionality is a
> subset, and so could be implemented without breaking the apparent identical
> operation of string.find() and Seq.find().  This would facilitate the use of
> string-specific third-party modules that could be useful for analysis of
> biological sequences, while extending functionality.
> 
> Where I begin to disagree is on whether it is always desirable to constrain the
> behaviour of these functions for the sake of consistency with other modules,
> while still taking time to make them behave differently at all, rather than
> just implementing that exact same behaviour, and handling the
> biologically-useful stuff in a different method altogether.  I like the idea of
> making Seq.py more string-like, in part because when I first started using
> Biopython, I missed being able to slice, and other conveniently string-y
> things.

Okay, so what is still missing with these new changes?

> 
> By way of contrast: 
> string.find() has the behaviour of only returning a single match - that which
> is closest to the string start.  This might be useful to some (in ORF-finding,
> perhaps), but I expect I would use a finditer() method that returned all
> matches (for which there is no equivalent string method) almost exclusively, if
> available.  I expect that I could cope quite happily with find() doing
> different things on pure strings and on Seq objects, but I'd be OK with a
> nonstandard finditer() alongside a 100% string-compatible find() as an
> alternative to this, though I'd want finditer() to return overlapping matches.
> 

It is not correct to compare finditer (a re method) to find (a string method),
or for that matter to re.match or re.search. (I do notice some confusion between
these similar but different functions, but there are numerous web pages that
discuss when one or the other should be used.) I do understand the interest, but
you raised two different points in this bug. The first is finding one match
(as re.search or re.match do) versus finding all matches (as re.findall or
re.finditer do). I fully agree with having both. The second is overlapping
matches: I definitely think that the user, not the developer, has to decide
whether or not they want them. There is no such option under this
implementation.
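
A minimal sketch of the distinction, using plain Python strings rather than Seq
objects (the sequence and motif are made up for illustration):

```python
import re

seq = "ATATATGC"  # made-up example sequence
motif = "ATA"     # occurs at positions 0 and 2, overlapping

# str.find returns only the index of the first occurrence (or -1).
first_index = seq.find(motif)

# re.search returns a match object for the first match anywhere.
first_match = re.search(motif, seq)

# re.match only tries to match at the very start of the string.
anchored = re.match("TAT", seq)

# re.finditer yields every non-overlapping match, so the occurrence
# at position 2 (which overlaps the one at 0) is not reported.
all_starts = [m.start() for m in re.finditer(motif, seq)]
```

Here find and search both stop at the first hit, while finditer walks the whole
string but skips the second, overlapping occurrence.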

> Such overlapping matches, however, do not match re.finditer() behaviour.  But,
> in this case, the re method's behaviour is constrained for good reasons related
> to regular expression implementation, and not reasons related to biological
> good sense.  I think that there is sufficient reason not to be consistent here,
> and instead to return biologically-useful overlapping matches.

I am neither for nor against having a method that returns overlapping matches;
rather, I am against returning overlapping matches being the only choice.
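
One way to leave that choice to the user could look like the sketch below. The
function name find_all and its overlapping argument are hypothetical, not part
of the proposed patch or of Biopython; it works on plain strings and treats the
motif literally.

```python
import re

def find_all(sequence, motif, overlapping=False):
    """Return the start positions of motif in sequence (hypothetical helper)."""
    pattern = re.escape(motif)  # treat the motif as a literal string
    if overlapping:
        # A zero-width lookahead consumes no characters, so the scan
        # advances one position at a time and overlapping hits are kept.
        pattern = "(?=" + pattern + ")"
    return [m.start() for m in re.finditer(pattern, sequence)]

non_overlapping = find_all("ATATATGC", "ATA")
with_overlaps = find_all("ATATATGC", "ATA", overlapping=True)
```

The default follows re.finditer, and the overlapping behaviour is opt-in rather
than imposed.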

> 
> The core of my argument here is that we're not just working with strings, but
> with string representations of biological objects; that's exactly why we have
> this specialised library, and don't just use strings in the first place.  I
> think that there will be occasions when we should break some syntactic
> expectations, where it is appropriate for the problem domain, and that this
> *might* (note equivocation) be one of them.
> 
> Peter:
> >One enhancement is that it might be worth checking
> >the search string is valid against the Seq object's alphabet (see also Bug
> >2597).
> 
> Good point.  In the implementation I put up here, if there are any invalid
> characters then the string just won't be found, which may be overgenerous to
> user error ;)  Raising a ValueError or some such to let the user know that the
> search alphabet wasn't valid would be very helpful.
> 
> Peter:
> > To determine if a sequence has a nucleotide alphabet, use the fact that
> > any well defined nucleotide alphabet object should be a subclass of
> > Bio.Alphabet.NucleotideAlphabet() rather than checking a predefined list.
> 
> Fair enough - I didn't know that NucleotideAlphabet existed... I got as far up
> the hierarchy as DNAAlphabet and RNAAlphabet, and stopped at working code ;)
> 
> Peter:
> > However, there is no way of knowing if the sequence is double stranded or
> > single sided, so personally I don't like the way your suggested function
> > automatically searches the reverse complement strand too.
> 
> It just suited my purpose at the time.  Whether or not the nucleotide sequence
> is single- or double-stranded, people might still want to search for a
> complementary sequence; e.g. microarray/PCR/siRNA probes, etc.  The method as
> written reports the strand on which the match can be found, and the user is
> free to discard results as they see fit, which again suited me at the time.  A
> 'strand' argument to the method of 'forward', 'reverse', or 'both', or just
> assuming 'both' if not specified would be better, I agree.  
> 
> What drove my implementation above was that, while nucleotide sequence matches
> may or may not be of interest in either direction, reverse matches to protein
> sequences are definitely (AFAIAC) not that interesting ;)
> 
> Bruce:
> >I do think that any general function involving regular expressions should
> > conform to the Python re module. The reasoning follows Peter's point that a
> > user should not have to convert the Seq object into a Python string. 
> 
> I don't think I understand this point.  Would you prefer an re.search() like
> implementation that takes a Seq object as its query argument?  I don't think
> I'd find that as useful, myself, as a method that just takes a string.  Such a
> method could also maybe parse arguments so as to compile the regex from the
> Seq.data attribute though, fulfilling your requirement.

What I mean is that a user should be able to specify either the pattern or a
regular expression object. In either case, the optional flags that are often
useful to have, like ignorecase, are currently ignored.
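
As a rough sketch of that idea on plain strings: the helper name
search_sequence and its signature are assumptions for illustration only. It
accepts either a pattern string (with optional flags) or an already compiled
regular expression object.

```python
import re

def search_sequence(sequence, pattern, flags=0):
    """Hypothetical helper: pattern is a string or a compiled regex object."""
    if isinstance(pattern, str):
        regex = re.compile(pattern, flags)
    else:
        # A compiled pattern carries its own flags, so extra ones are refused.
        if flags:
            raise ValueError("cannot pass flags with a compiled pattern")
        regex = pattern
    return regex.search(sequence)

seq = "ATGCATGC"  # made-up example sequence
hit_with_flag = search_sequence(seq, "gca", re.IGNORECASE)
hit_without_flag = search_sequence(seq, "gca")
hit_compiled = search_sequence(seq, re.compile("gca", re.IGNORECASE))
```

With the flag the lower-case pattern is found; without it the search returns
None, which is why dropping the flags silently would be surprising.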

> 
> I used regular expression based searching in my implementation for speed, and
> strictly speaking a string is also a regular expression, even if it doesn't
> have special characters - I didn't see any inconsistency there.  My docstring
> is maybe a bit misleading about that but, when I wrote it, it wasn't intended
> for anyone but me to use.  Sorry about that.
> 
> Also, I disagree regarding conformance to the re module, particularly as our
> use of re is likely to be less general than the re module itself - see above.
> 
> > So I think it would be more valuable to implement
> > specific methods from the re modules. In this case, the functions should accept
> > regular expression.
> 
> I would quite like to have a 'true' regular expression search method myself,
> with wildcards for nucleotide symbols, but this would have to be implemented
> differently to my attempt above:  e.g., for proper reverse complement searches,
> you'd have to reverse complement the wildcards as well as ambiguity codes.
> 
> > I also do not see the gain for the reverse complement because this is just
> > another pattern. 
> 
> The gain was that I needed matches to my patterns of interest on the sequence
> in either direction, and I only cared which strand they lay on for reasons of
> locating them.  Reverse complementing the query is usually quicker than reverse
> complementing the genome on which you search.  Assuming you're searching on a
> genome, of course ;)
> 
> > Also it is potentially confusing because the direction is not
> > immediately apparent without further computation. 
> 
> I'm not sure I understand you: in the above code, the method returns the strand
> on which the match is found, along with all the other data.  The computation
> required to handle this is the same as that to find the start and end points:
> parse an integer from the tuple.  I'm not intending that the return type should
> be set in stone and, as I mentioned, it was just a handy step in the creation
> of SeqFeatures in the parent SeqRecord.

Regardless of what a user actually wants, they must wait for two searches along
the sequence. After that finishes, the user must examine each and every entry
(due to the match_locations.sort()) to find the strand, regardless of what they
want to do. I do not see any advantage in this over someone calling the function
twice to get match_locations and rev_locations, then doing 'match_locations +=
rev_locations' and match_locations.sort().
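
The explicit two-call version might look something like this sketch. The
helpers reverse_complement and find_starts are illustrative stand-ins on plain
strings (unambiguous DNA only), not the code attached to this bug.

```python
def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA string (illustrative helper)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_starts(sequence, motif):
    """Start positions of motif in sequence, overlapping hits included."""
    starts = []
    pos = sequence.find(motif)
    while pos != -1:
        starts.append(pos)
        pos = sequence.find(motif, pos + 1)
    return starts

genome = "AAGCTTGGATCC"  # made-up sequence
query = "GGAT"

# First call: forward strand only. A user who wants just this stops here.
match_locations = [(p, +1) for p in find_starts(genome, query)]

# Second call, made explicitly: search for the reverse complement of the query.
rev_locations = [(p, -1) for p in find_starts(genome, reverse_complement(query))]

# Merge and sort only if both strands are actually wanted.
match_locations += rev_locations
match_locations.sort()
```

The decision to search the other strand is then made by the caller before the
search, and someone who wants a single strand pays for a single pass.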

> 
> > In this case I think that
> > 'explicit is better than implicit' (The Zen of Python) so I think the decision
> > to use the reverse complement must come prior to the use of this method.
> 
> In the spirit of quoted arguments from authority: "A foolish consistency is the
> hobgoblin of little minds" (Python Style Guide) ;)   
> 
Okay, then more Zen:
"In the face of ambiguity, refuse the temptation to guess."
