[Biopython-dev] [Bug 2601] Seq find() method: proposal

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Oct 1 04:42:39 EDT 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2601





------- Comment #5 from lpritc at scri.sari.ac.uk  2008-10-01 04:42 EST -------
(In reply to comment #4)
> (In reply to comment #3)

> Good, as where is the fun otherwise?  :-)

<grin> 

I think that the discussion has been useful.

> > I like the idea of
> > making Seq.py more string-like, in part because when I first started using
> > Biopython, I missed being able to slice, and other conveniently string-y
> > things.
> 
> Okay, so what is still missing with these new changes?

I like the new, and proposed, changes to Seq.  "When I first started" was
nearly eight years ago, now...

> > string.find() has the behaviour of only returning a single match - that which
> > is closest to the string start.  This might be useful to some (in ORF-finding,
> > perhaps), but I expect I would use a finditer() method that returned all
> > matches (for which there is no equivalent string method) almost exclusively
> 
> It is not correct to compare finditer (a re method) to find (a string method)
> or for that matter re.match or re.search. 

I think that it's perfectly valid to compare pretty much anything to pretty
much anything else, especially now, when we have an opportunity to get the
pattern-finding functionality we want/need into the Seq object.  Substitution
(e.g. using string.find() in place of re.finditer()) is a different matter.

To me, string.find() and re.search() are pretty much equivalent, except for
their internal implementation, query argument type and return value. 
re.match() is like string.startswith(), with the same caveats.  re.finditer()
has no string.method() equivalent, but I would still find such a method useful.
I think the abstract distinction between search types here is:

1) Find match at start of sequence (re.match() and string.startswith())
2) Find first match in sequence (re.search() and string.find())
3) Find all non-overlapping matches in sequence (re.finditer() and
re.findall(); no string method)
4) Find all overlapping matches in sequence (neither re nor string)
1a) 2a) 3a) 4a) The same, but in the reverse complement.
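
To make the four categories concrete, here is a small sketch using only the
standard library.  The sequence and pattern are made up for illustration.
There is no single re or string call for (4); the zero-width lookahead shown
here is one possible workaround, not an existing Seq feature.

```python
import re

seq = "ATATATGC"      # illustrative sequence, not from the bug report
pattern = "ATA"

# (1) Match at the start of the sequence
at_start_str = seq.startswith(pattern)
at_start_re = re.match(pattern, seq) is not None

# (2) First match in the sequence
first_str = seq.find(pattern)
first_re = re.search(pattern, seq).start()

# (3) All non-overlapping matches
non_overlapping = [m.start() for m in re.finditer(pattern, seq)]

# (4) All overlapping matches: wrap the (escaped) pattern in a lookahead so
# each match consumes no characters and the scan advances one base at a time
lookahead = "(?=" + re.escape(pattern) + ")"
overlapping = [m.start() for m in re.finditer(lookahead, seq)]

print(non_overlapping)  # [0]
print(overlapping)      # [0, 2]
```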

Moving down the list, the problem becomes more general.  The type of search I
need most often in biological sequences is number (4a), or (4) for proteins. 
Each of search types (1) to (3) (a or not) has a theoretically faster
implementation than doing (4) then filtering the results.  I don't mind having
more than one search method with different names, or having to specify
arguments to get a particular kind of search.  I do mind not having (4a) as an
option...

BTW, for reverse complement searches, I'm happy for this to be an optional
argument - when I wrote the code above, I didn't need anything but two-strand
searches.

> I definitely think that the user has to decide whether or not they
> want overlapping matches not the developer. There is no option under this
> implementation.

There is no option in string or re, either - not because the developer has
guessed that the user always wants it, but because they have effectively
guessed that the user *never* wants it (or that, if they do, they'll generalise
the search themselves).  This is probably because they were writing more
general libraries with different use cases (and, in the case of re, actual
implementation restrictions) than the Seq object.  We have an opportunity to
have the find()/search()/whatever() method be biologically-relevant, and I
think we should take it.

Overlapping matches are biologically informative, and I see no reason not to
report them other than consistency with the re module (which is constrained
for reasons that do not apply to biological sequences).  I therefore think we
should make the default behaviour to find overlapping matches, and provide an
option to exclude overlaps (which will probably make internal implementation
faster).
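
As a rough sketch of that default-plus-option idea, on plain strings: the
name find_all and the overlap keyword are hypothetical, chosen only for
illustration, and are not an existing Seq method.

```python
def find_all(seq, pattern, overlap=True):
    """Return start positions of every exact match of pattern in seq.

    With overlap=True (the proposed default) the scan resumes one position
    after each hit; with overlap=False it resumes after the end of the hit.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    step = 1 if overlap else len(pattern)
    hits = []
    pos = seq.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(pattern, pos + step)
    return hits


print(find_all("AAAA", "AA"))                 # [0, 1, 2]
print(find_all("AAAA", "AA", overlap=False))  # [0, 2]
```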

> I am not for or against having a method that returns overlapping matches;
> rather, I am against returning overlapping matches being the only
> choice. 

I'm actually in full agreement with you on this.

> > I don't think I understand this point.  Would you prefer an re.search() like
> > implementation that takes a Seq object as its query argument?  I don't think
> > I'd find that as useful, myself, as a method that just takes a string.  Such a
> > method could also maybe parse arguments so as to compile the regex from the
> > Seq.data attribute though, fulfilling your requirement.
> 
> What I mean is that a user should be able to either specify the pattern or
> specify a regular expression object. In either case the optional flags that are
> often useful to have like ignorecase are ignored.

Ah, I see.  I think that, because we are working with a restricted symbol set,
we do not strictly need the full functionality that is present in re.  We would
need as a minimum for a domain-specific re-a-like syntax:

o symbols in the sequence alphabet, including correctly-interpreted ambiguity
codes
o .*+$^ etc. wildcards
o {m,n} - like syntax for repeats
o [] and [^] set notation
o lookahead and lookbehind

All of which, except for correct interpretation of ambiguity codes, is already
in re and with a few tweaks we could just use re methods internally for this. 
The ambiguity codes could perhaps be implemented by substitution of sets of
symbols for each ambiguity code, and the conformance of the regular expression
to the sequence alphabet ensured by a filter on the query.  Having a method
that intelligently accepts both strings and compiled regexes would suit me.
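
A minimal sketch of the substitution idea, assuming uppercase IUPAC
nucleotide codes.  The helper name compile_query is hypothetical.  The sketch
expands each ambiguity code to a character set, rejects letters outside the
alphabet, and passes compiled regexes through unchanged.  It deliberately does
not handle ambiguity codes inside [] sets or backslash escapes.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

_PATTERN_TYPE = type(re.compile(""))


def compile_query(query):
    """Accept a string or a compiled regex and return a compiled regex."""
    if isinstance(query, _PATTERN_TYPE):
        return query  # already compiled: use as given
    parts = []
    for ch in query:
        if ch.isalpha():
            if ch not in IUPAC_DNA:
                # conformance filter: reject symbols outside the alphabet
                raise ValueError("symbol %r is not in the alphabet" % ch)
            bases = IUPAC_DNA[ch]
            parts.append(bases if len(bases) == 1 else "[" + bases + "]")
        else:
            parts.append(ch)  # regex metacharacters pass through untouched
    return re.compile("".join(parts))


regex = compile_query("GAYTC")
print(regex.pattern)                                       # GA[CT]TC
print([m.start() for m in regex.finditer("GACTCGATTC")])   # [0, 5]
```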

I suggest reverse-complementing the query rather than the subject sequence,
because reverse-complementing larger sequences is likely to take a
comparatively long time...
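
To illustrate why the two approaches are interchangeable for exact matches,
here is a sketch on plain strings (helper names and the example sequence are
hypothetical).  Searching for the reverse-complemented query in the original
sequence gives the same forward-strand coordinates as searching the
reverse-complemented sequence and converting the positions back.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of an unambiguous uppercase DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


def find_all(seq, pat):
    """Start positions of all (overlapping) exact matches of pat in seq."""
    hits = []
    pos = seq.find(pat)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(pat, pos + 1)
    return hits


seq = "AAGCTTGCTT"
query = "AAGC"

# Option A: reverse-complement the (possibly very long) subject sequence,
# then map each hit back to forward-strand coordinates
rc_seq = reverse_complement(seq)
hits_a = sorted(len(seq) - p - len(query) for p in find_all(rc_seq, query))

# Option B: reverse-complement only the short query
hits_b = find_all(seq, reverse_complement(query))

print(hits_a, hits_b)  # [2, 6] [2, 6]
```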

> Regardless of what a user actually wants, they must wait for two searches along
> the sequence. After that finishes the user must examine each and every entry
> (due to the match_locations.sort()) to find the strand regardless of what they
> want to do. 

In my code, yes - because that was the functionality I wanted when searching
whole genomes for exact pattern matches.  It may not have come across in my
first post, but I was proposing the code as a potential starting point (for
discussion as much as for an implementation), not as the finished article. 

> I do not see any advantage in this over someone calling the function
> twice to get match_locations and rev_locations, doing 'match_locations +=
> rev_locations' and match_locations.sort(). 

Assuming that the return value was the same as in my code above then yes, there
is no particular computational advantage (except the negligible ones of making
one instead of two function calls, and fewer calls/lines of code implying less
opportunity for user error).  But, and again I stress this, I wrote the code
with a particular purpose in mind and not as an enhancement for all possible
uses of the Seq object.  Had I needed to perform single-strand searches on
nucleotide sequences, I'd probably have hacked the code in the way you've been
suggesting, with strandedness as an optional argument.
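
For the sake of discussion, strandedness as an optional argument might look
something like the sketch below.  The function name search and the strand
keyword (None for both strands, 1 for forward, -1 for reverse) are
hypothetical, and this is not the code from my earlier comment.  Searching
both strands in one call gives the same result as merging and sorting two
single-strand calls.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of an unambiguous uppercase DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


def _scan(seq, pat, strand):
    """All overlapping exact matches of pat, tagged with the given strand."""
    hits = []
    pos = seq.find(pat)
    while pos != -1:
        hits.append((pos, strand))
        pos = seq.find(pat, pos + 1)
    return hits


def search(seq, query, strand=None):
    """Return sorted (position, strand) tuples in forward coordinates."""
    if strand not in (None, 1, -1):
        raise ValueError("strand must be None, 1 or -1")
    hits = []
    if strand in (None, 1):
        hits += _scan(seq, query, 1)
    if strand in (None, -1):
        # reverse-complement the query, not the subject sequence
        hits += _scan(seq, reverse_complement(query), -1)
    hits.sort()
    return hits


both = search("AAGCTT", "AAGC")
merged = sorted(search("AAGCTT", "AAGC", strand=1)
                + search("AAGCTT", "AAGC", strand=-1))
print(both)            # [(0, 1), (2, -1)]
print(both == merged)  # True
```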

> Okay, then more Zen:
> "In the face of ambiguity, refuse the temptation to guess."

Damn!  I'm out of quotes... ;) Time to ask the question on the
Biopython-users/BiP lists?



