[Biopython-dev] [Bug 2601] New: Seq find() method: proposal
bugzilla-daemon at portal.open-bio.org
bugzilla-daemon at portal.open-bio.org
Mon Sep 29 12:35:34 UTC 2008
http://bugzilla.open-bio.org/show_bug.cgi?id=2601
Summary: Seq find() method: proposal
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: lpritc at scri.sari.ac.uk
A find() method for the Seq object was recently proposed on the mailing list.
I have extended Seq locally to include a find method that uses the re module
and the reverse_complement function from Bio.Seq, and is described below. In
the original implementation, the search was meant to be called from the parent
SeqRecord object, which populated itself with features describing the search
results.
I'm proposing this as a potential starting point for the implementation of a
Seq.find() method.
Note that the loop of re.search() calls was necessary to obtain the set of
overlapping matches, as re.finditer() only returns non-overlapping matches.
The two functions searching in forward-only and reverse-only directions could
probably be combined, and behaviour distinguished on keyword, for neater code.
####
def find_regexes(self, pattern):
""" find_regexes(self, pattern)
pattern String, regular expression to search for
Finds all occurrences of the passed regular expression in the
sequence, and returns a list of tuples in the format:
(start, end, match, strand).
If the sequence is a nucleotide sequence, the reverse strand is
also searched
"""
# Find forward matches
match_locations = [(hit.start()+1, hit.end(), \
self.data[hit.start():hit.end()], 1) \
for hit in self.__find_overlapping_regexes(pattern)]
# If the sequence is a nucleotide sequence, look on the reverse
# strand, too
if self.alphabet.__class__ in [Alphabet.DNAAlphabet,
Alphabet.RNAAlphabet,
IUPAC.ExtendedIUPACDNA,
IUPAC.IUPACAmbiguousDNA,
IUPAC.IUPACUnambiguousDNA,
IUPAC.IUPACAmbiguousRNA,
IUPAC.IUPACUnambiguousRNA]:
rev_locations = [(hit.start()+1, hit.end(), \
self.data[hit.start():hit.end()], 1) \
for hit in \
self.__find_overlapping_regexes_rev(pattern)]
match_locations += rev_locations
match_locations.sort()
return match_locations
def __find_overlapping_regexes(self, pattern):
""" Finds all overlapping regexes matching the passed pattern in the
sequence, and returns a list of re.SRE_Match objects describing
them.
"""
hits = []
pos = 0
regex = re.compile(pattern)
while pos < len(self.data):
hit = regex.search(self.data, pos=pos)
if hit is None:
break
hits.append(hit)
pos = hit.start()+1
return hits
def __find_overlapping_regexes_rev(self, pattern):
""" Finds all overlapping regexes matching the passed pattern in the
sequence, and returns a list of re.SRE_Match objects describing
them, as hits positioned in the forward direction - i.e. start and
end read in the forward sense.
"""
hits = []
pos = 0
regex = re.compile(reverse_complement(Seq(pattern, self.alphabet)))
while pos < len(self.data):
hit = regex.search(self.data, pos=pos)
if hit is None:
break
hits.append(hit)
pos = hit.start()+1
return hits
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
More information about the Biopython-dev
mailing list