[Biopython-dev] [Biopython (old issues only) - Bug #2601] (Closed) Seq find() method: proposal

Thu Nov 10 02:31:17 UTC 2016

Issue #2601 has been updated by Vincent Davis.

Description updated
Status changed from New to Closed
Assignee changed from Biopython Dev Mailing List to Vincent Davis
% Done changed from 0 to 100

A find method was implemented over eight years ago, and no additional feedback has been submitted. Closing issue.

----------------------------------------
Bug #2601: Seq find() method: proposal
https://redmine.open-bio.org/issues/2601#change-15352

* Author: Leighton Pritchard
* Status: Closed
* Priority: Normal
* Assignee: Vincent Davis
* Category: Main Distribution
* Target version: Not Applicable
* URL: 
----------------------------------------
A find() method for the Seq object was recently proposed on the mailing list.  I have extended Seq locally to include a find method that uses the re module and the reverse_complement function from Bio.Seq, and is described below.  In the original implementation, the search was meant to be called from the parent SeqRecord object, which populated itself with features describing the search results.

I'm proposing this as a potential starting point for the implementation of a Seq.find() method.  

Note that the loop of re.search() calls is necessary to obtain the set of overlapping matches, because re.finditer() returns only non-overlapping matches.  The two functions that search in the forward-only and reverse-only directions could probably be combined, with the direction selected by a keyword argument, for neater code.
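
To illustrate the point about overlapping matches, here is a minimal standalone sketch on plain strings. The function names non_overlapping_starts and overlapping_starts are hypothetical and are not part of the proposed patch; only the standard re module is used.

```python
import re


def non_overlapping_starts(pattern, text):
    """Start offsets (0-based) as reported by re.finditer()."""
    return [hit.start() for hit in re.finditer(pattern, text)]


def overlapping_starts(pattern, text):
    """Start offsets (0-based) found by restarting the search one
    position after the start of each previous hit."""
    regex = re.compile(pattern)
    starts = []
    pos = 0
    while pos < len(text):
        hit = regex.search(text, pos)
        if hit is None:
            break
        starts.append(hit.start())
        # Step forward by one character only, so that a later hit may
        # overlap the one just recorded.
        pos = hit.start() + 1
    return starts


print(non_overlapping_starts("AA", "AAAA"))  # [0, 2]
print(overlapping_starts("AA", "AAAA"))      # [0, 1, 2]
```

With the motif AA in AAAA, re.finditer() skips the hit that starts at offset 1, while the restart loop reports it.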

####

    def find_regexes(self, pattern):
        """ find_regexes(self, pattern)

            pattern           String, regular expression to search for

            Finds all occurrences of the passed regular expression in the
            sequence, and returns a list of tuples in the format:
            (start, end, match, strand).

            If the sequence is a nucleotide sequence, the reverse strand is
            also searched
        """
        # Find forward matches; positions are reported 1-based, with an
        # inclusive end
        match_locations = [(hit.start()+1, hit.end(),
                            self.data[hit.start():hit.end()], 1)
                           for hit in self.__find_overlapping_regexes(pattern)]
        # If the sequence is a nucleotide sequence, look on the reverse
        # strand, too
        if self.alphabet.__class__ in [Alphabet.DNAAlphabet,
                                       Alphabet.RNAAlphabet,
                                       IUPAC.ExtendedIUPACDNA,
                                       IUPAC.IUPACAmbiguousDNA,
                                       IUPAC.IUPACUnambiguousDNA,
                                       IUPAC.IUPACAmbiguousRNA,
                                       IUPAC.IUPACUnambiguousRNA]:
            # Reverse-strand hits are labelled with strand -1
            rev_locations = [(hit.start()+1, hit.end(),
                              self.data[hit.start():hit.end()], -1)
                             for hit in
                             self.__find_overlapping_regexes_rev(pattern)]
            match_locations += rev_locations
        match_locations.sort()
        return match_locations

    def __find_overlapping_regexes(self, pattern):
        """ Finds all overlapping regexes matching the passed pattern in the
            sequence, and returns a list of re.SRE_Match objects describing
            them.
        """
        hits = []
        pos = 0
        regex = re.compile(pattern)
        while pos < len(self.data):
            hit = regex.search(self.data, pos=pos)
            if hit is None:
                break
            hits.append(hit)
            pos = hit.start()+1
        return hits

    def __find_overlapping_regexes_rev(self, pattern):
        """ Finds all overlapping regexes matching the passed pattern in the
            sequence, and returns a list of re.SRE_Match objects describing
            them, as hits positioned in the forward direction - i.e. start and
            end read in the forward sense.
        """
        hits = []
        pos = 0
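        # re.compile() needs a string, so convert the reverse-complemented
        # Seq object before compiling
        rev_pattern = str(reverse_complement(Seq(pattern, self.alphabet)))
        regex = re.compile(rev_pattern)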
        while pos < len(self.data):
            hit = regex.search(self.data, pos=pos)
            if hit is None:
                break
            hits.append(hit)
            pos = hit.start()+1
        return hits
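
The suggestion above to combine the forward and reverse searches behind a keyword could be sketched roughly as follows. This is a standalone illustration on plain strings, not the proposed Seq method: find_overlapping, reverse_complement_str and the strand keyword are hypothetical names, the complement table covers unambiguous DNA letters only, and the pattern is assumed to be a plain motif, since reverse-complementing a pattern that contains regex metacharacters would not generally give a valid reverse-strand pattern.

```python
import re

# Hypothetical helper table: unambiguous DNA letters only.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_str(motif):
    """Reverse complement of a plain DNA motif given as a string."""
    return motif.translate(_COMPLEMENT)[::-1]


def find_overlapping(sequence, pattern, strand=1):
    """Return (start, end, match, strand) tuples for one strand.

    start is 1-based and end is inclusive, both read in the forward
    sense, following the convention of the listing above.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if strand == -1:
        # Assumes a plain motif (no regex metacharacters).
        pattern = reverse_complement_str(pattern)
    regex = re.compile(pattern)
    hits = []
    pos = 0
    while pos < len(sequence):
        hit = regex.search(sequence, pos)
        if hit is None:
            break
        hits.append((hit.start() + 1, hit.end(), hit.group(), strand))
        # Advance by one character so overlapping hits are kept.
        pos = hit.start() + 1
    return hits


forward = find_overlapping("AAAA", "AA")
reverse = find_overlapping("GATTC", "GAAT", strand=-1)
```

A caller wanting both strands would concatenate the two result lists and sort them, as find_regexes does.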
