[Biopython-dev] Prosite
Mark Lambrecht
mark at acoma.Stanford.EDU
Thu Jan 24 17:48:39 EST 2002
Jeff,
Everything works fine now. You saved my day: I needed the info in the
prosite.dat file.
Thanks,
Mark
On Thu, 24 Jan 2002, Jeffrey Chang wrote:
> Yep, it looks like Release 17 from last month introduced some format
> changes that broke the parser. I've updated the parser to handle the
> new lines -- __init__.py is attached. Please try this out and let me
> know how it works. Thanks for the report and the patch!
>
> Jeff
>
>
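A format change that reorders or interleaves line types (CC lines between MA lines, for instance) breaks a scanner that expects a fixed sequence. As a minimal sketch of a different, order-tolerant strategy, the lines of one record can be grouped by their two-letter line code. This is Python 3, not the Python 2 module quoted below; group_record_lines is a hypothetical helper that is not part of Bio.Prosite, and the sample lines are made up for illustration.

```python
def group_record_lines(lines):
    """Group the lines of one record by their two-letter line code.

    Hypothetical helper, not part of Bio.Prosite. Reading stops at the
    '//' terminator. The order of line types does not matter.
    """
    groups = {}
    for line in lines:
        if line.startswith("//"):
            break
        code = line[:2]
        # Data starts in column 6, as in the quoted module's _clean().
        groups.setdefault(code, []).append(line[5:].rstrip())
    return groups


# Made-up sample record with a CC line between two MA lines.
sample = [
    "ID   EXAMPLE; MATRIX.",
    "MA   first matrix line",
    "CC   /MATRIX_TYPE=protein_domain;",
    "MA   second matrix line",
    "//",
]
groups = group_record_lines(sample)
```

The trade-off is that grouping by code gives up the strict validation of line order that the fixed-sequence scanner provides.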
> On Wed, Jan 23, 2002 at 12:36:51PM -0800, Mark Lambrecht wrote:
> > Hi,
> >
> > Thanks for all the excellent Biopython code.
> > I used the Prosite parser and it breaks on a number of CC and MA lines.
> > Maybe there is a new version of the prosite.dat file?
> > We added some code to Bio/Prosite/__init__.py and marked it with
> > ## (lambrecht/dyoo)
> > Then everything works again, but the parser possibly doesn't use the
> > information in these lines.
> > I attached the __init__.py.
> > Could you take a look?
> >
> > Thanks !!
> >
> > Mark
> >
> >
> > --------------------------------------------------------------------------
> > Mark Lambrecht
> > Postdoctoral Research Fellow
> > The Arabidopsis Information Resource FAX: (650) 325-6857
> > Carnegie Institution of Washington Tel: (650) 325-1521 ext.397
> > Department of Plant Biology URL: http://arabidopsis.org/
> > 260 Panama St.
> > Stanford, CA 94305
> > --------------------------------------------------------------------------
>
> > # Copyright 1999 by Jeffrey Chang. All rights reserved.
> > # This code is part of the Biopython distribution and governed by its
> > # license. Please see the LICENSE file that should have been included
> > # as part of this package.
> >
> > # Copyright 2000 by Jeffrey Chang. All rights reserved.
> > # This code is part of the Biopython distribution and governed by its
> > # license. Please see the LICENSE file that should have been included
> > # as part of this package.
> >
> > """Prosite
> >
> > This module provides code to work with the prosite.dat file from
> > Prosite.
> > http://www.expasy.ch/prosite/
> >
> > Tested with:
> > Release 15.0, July 1998
> > Release 16.0, July 1999
> >
> >
> > Classes:
> > Record Holds Prosite data.
> > PatternHit Holds data from a hit against a Prosite pattern.
> > Iterator Iterates over entries in a Prosite file.
> > Dictionary Accesses a Prosite file using a dictionary interface.
> > ExPASyDictionary Accesses Prosite records from ExPASy.
> > RecordParser Parses a Prosite record into a Record object.
> >
> > _Scanner Scans Prosite-formatted data.
> > _RecordConsumer Consumes Prosite data to a Record object.
> >
> >
> > Functions:
> > scan_sequence_expasy Scan a sequence for occurrences of Prosite patterns.
> > index_file Index a Prosite file for a Dictionary.
> > _extract_record Extract Prosite data from a web page.
> > _extract_pattern_hits Extract Prosite patterns from a web page.
> >
> > """
> > __all__ = [
> > 'Pattern',
> > 'Prodoc',
> > ]
> > from types import *
> > import os
> > import string
> > import re
> > import sgmllib
> > from Bio import File
> > from Bio import Index
> > from Bio.ParserSupport import *
> > from Bio.WWW import ExPASy
> > from Bio.WWW import RequestLimiter
> >
> > class Record:
> > """Holds information from a Prosite record.
> >
> > Members:
> > name ID of the record. e.g. ADH_ZINC
> > type Type of entry. e.g. PATTERN, MATRIX, or RULE
> > accession e.g. PS00387
> > created Date the entry was created. (MMM-YYYY)
> > data_update Date the 'primary' data was last updated.
> > info_update Date data other than 'primary' data was last updated.
> > pdoc ID of the PROSITE DOCumentation.
> >
> > description Free-format description.
> > pattern The PROSITE pattern. See docs.
> > matrix List of strings that describes a matrix entry.
> > rules List of rule definitions. (strings)
> >
> > NUMERICAL RESULTS
> > nr_sp_release SwissProt release.
> > nr_sp_seqs Number of seqs in that release of Swiss-Prot. (int)
> > nr_total Number of hits in Swiss-Prot. tuple of (hits, seqs)
> > nr_positive True positives. tuple of (hits, seqs)
> > nr_unknown Could be positives. tuple of (hits, seqs)
> > nr_false_pos False positives. tuple of (hits, seqs)
> > nr_false_neg False negatives. (int)
> > nr_partial False negatives, because they are fragments. (int)
> >
> > COMMENTS
> > cc_taxo_range Taxonomic range. See docs for format
> > cc_max_repeat Maximum number of repetitions in a protein
> > cc_site Interesting site. list of tuples (pattern pos, desc.)
> > cc_skip_flag Can this entry be ignored?
> >
> > DATA BANK REFERENCES - The following are all
> > lists of tuples (swiss-prot accession,
> > swiss-prot name)
> > dr_positive
> > dr_false_neg
> > dr_false_pos
> > dr_potential Potential hits, but fingerprint region not yet available.
> > dr_unknown Could possibly belong
> >
> > pdb_structs List of PDB entries.
> >
> > """
> > def __init__(self):
> > self.name = ''
> > self.type = ''
> > self.accession = ''
> > self.created = ''
> > self.data_update = ''
> > self.info_update = ''
> > self.pdoc = ''
> >
> > self.description = ''
> > self.pattern = ''
> > self.matrix = []
> > self.rules = []
> >
> > self.nr_sp_release = ''
> > self.nr_sp_seqs = ''
> > self.nr_total = (None, None)
> > self.nr_positive = (None, None)
> > self.nr_unknown = (None, None)
> > self.nr_false_pos = (None, None)
> > self.nr_false_neg = None
> > self.nr_partial = None
> >
> > self.cc_taxo_range = ''
> > self.cc_max_repeat = ''
> > self.cc_site = []
> > self.cc_skip_flag = ''
> >
> > self.dr_positive = []
> > self.dr_false_neg = []
> > self.dr_false_pos = []
> > self.dr_potential = []
> > self.dr_unknown = []
> >
> > self.pdb_structs = []
> >
> > class PatternHit:
> > """Holds information from a hit against a Prosite pattern.
> >
> > Members:
> > name ID of the record. e.g. ADH_ZINC
> > accession e.g. PS00387
> > pdoc ID of the PROSITE DOCumentation.
> > description Free-format description.
> > matches List of tuples (start, end, sequence) where
> > start and end are indexes of the match, and sequence is
> > the sequence matched.
> >
> > """
> > def __init__(self):
> > self.name = None
> > self.accession = None
> > self.pdoc = None
> > self.description = None
> > self.matches = []
> > def __str__(self):
> > lines = []
> > lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
> > lines.append(self.description)
> > lines.append('')
> > if len(self.matches) > 1:
> > lines.append("Number of matches: %s" % len(self.matches))
> > for i in range(len(self.matches)):
> > start, end, seq = self.matches[i]
> > range_str = "%d-%d" % (start, end)
> > if len(self.matches) > 1:
> > lines.append("%7d %10s %s" % (i+1, range_str, seq))
> > else:
> > lines.append("%7s %10s %s" % (' ', range_str, seq))
> > return string.join(lines, '\n')
> >
> > class Iterator:
> > """Returns one record at a time from a Prosite file.
> >
> > Methods:
> > next Return the next record from the stream, or None.
> >
> > """
> > def __init__(self, handle, parser=None):
> > """__init__(self, handle, parser=None)
> >
> > Create a new iterator. handle is a file-like object. parser
> > is an optional Parser object to change the results into another form.
> > If set to None, then the raw contents of the file will be returned.
> >
> > """
> > if type(handle) is not FileType and type(handle) is not InstanceType:
> > raise ValueError, "I expected a file handle or file-like object"
> > self._uhandle = File.UndoHandle(handle)
> > self._parser = parser
> >
> > def next(self):
> > """next(self) -> object
> >
> > Return the next Prosite record from the file. If no more records,
> > return None.
> >
> > """
> > # Skip the copyright info, if it's the first record.
> > line = self._uhandle.peekline()
> > if line[:2] == 'CC':
> > while 1:
> > line = self._uhandle.readline()
> > if not line:
> > break
> > if line[:2] == '//':
> > break
> > if line[:2] != 'CC':
> > raise SyntaxError, \
> > "Oops, where's the copyright?"
> >
> > lines = []
> > while 1:
> > line = self._uhandle.readline()
> > if not line:
> > break
> > lines.append(line)
> > if line[:2] == '//':
> > break
> >
> > if not lines:
> > return None
> >
> > data = string.join(lines, '')
> > if self._parser is not None:
> > return self._parser.parse(File.StringHandle(data))
> > return data
> >
> > class Dictionary:
> > """Accesses a Prosite file using a dictionary interface.
> >
> > """
> > __filename_key = '__filename'
> >
> > def __init__(self, indexname, parser=None):
> > """__init__(self, indexname, parser=None)
> >
> > Open a Prosite Dictionary. indexname is the name of the
> > index for the dictionary. The index should have been created
> > using the index_file function. parser is an optional Parser
> > object to change the results into another form. If set to None,
> > then the raw contents of the file will be returned.
> >
> > """
> > self._index = Index.Index(indexname)
> > self._handle = open(self._index[Dictionary.__filename_key])
> > self._parser = parser
> >
> > def __len__(self):
> > return len(self._index)
> >
> > def __getitem__(self, key):
> > start, len = self._index[key]
> > self._handle.seek(start)
> > data = self._handle.read(len)
> > if self._parser is not None:
> > return self._parser.parse(File.StringHandle(data))
> > return data
> >
> > def __getattr__(self, name):
> > return getattr(self._index, name)
> >
> > class ExPASyDictionary:
> > """Access PROSITE at ExPASy using a read-only dictionary interface.
> >
> > """
> > def __init__(self, delay=5.0, parser=None):
> > """__init__(self, delay=5.0, parser=None)
> >
> > Create a new Dictionary to access PROSITE. parser is an optional
> > parser (e.g. Prosite.RecordParser) object to change the results
> > into another form. If set to None, then the raw contents of the
> > file will be returned. delay is the number of seconds to wait
> > between each query.
> >
> > """
> > self.parser = parser
> > self.limiter = RequestLimiter(delay)
> >
> > def __len__(self):
> > raise NotImplementedError, "Prosite contains lots of entries"
> > def clear(self):
> > raise NotImplementedError, "This is a read-only dictionary"
> > def __setitem__(self, key, item):
> > raise NotImplementedError, "This is a read-only dictionary"
> > def update(self):
> > raise NotImplementedError, "This is a read-only dictionary"
> > def copy(self):
> > raise NotImplementedError, "You don't need to do this..."
> > def keys(self):
> > raise NotImplementedError, "You don't really want to do this..."
> > def items(self):
> > raise NotImplementedError, "You don't really want to do this..."
> > def values(self):
> > raise NotImplementedError, "You don't really want to do this..."
> >
> > def has_key(self, id):
> > """has_key(self, id) -> bool"""
> > try:
> > self[id]
> > except KeyError:
> > return 0
> > return 1
> >
> > def get(self, id, failobj=None):
> > try:
> > return self[id]
> > except KeyError:
> > return failobj
> > raise "How did I get here?"
> >
> > def __getitem__(self, id):
> > """__getitem__(self, id) -> object
> >
> > Return a Prosite entry. id is either the id or accession
> > for the entry. Raises a KeyError if there's an error.
> >
> > """
> > # First, check to see if enough time has passed since my
> > # last query.
> > self.limiter.wait()
> >
> > try:
> > handle = ExPASy.get_prosite_entry(id)
> > except IOError:
> > raise KeyError, id
> > try:
> > handle = File.StringHandle(_extract_record(handle))
> > except ValueError:
> > raise KeyError, id
> >
> > if self.parser is not None:
> > return self.parser.parse(handle)
> > return handle.read()
> >
> > class RecordParser(AbstractParser):
> > """Parses Prosite data into a Record object.
> >
> > """
> > def __init__(self):
> > self._scanner = _Scanner()
> > self._consumer = _RecordConsumer()
> >
> > def parse(self, handle):
> > self._scanner.feed(handle, self._consumer)
> > return self._consumer.data
> >
> > class _Scanner:
> > """Scans Prosite-formatted data.
> >
> > Tested with:
> > Release 15.0, July 1998
> >
> > """
> > def feed(self, handle, consumer):
> > """feed(self, handle, consumer)
> >
> > Feed in Prosite data for scanning. handle is a file-like
> > object that contains prosite data. consumer is a
> > Consumer object that will receive events as the report is scanned.
> >
> > """
> > if isinstance(handle, File.UndoHandle):
> > uhandle = handle
> > else:
> > uhandle = File.UndoHandle(handle)
> >
> > while 1:
> > line = uhandle.peekline()
> > if not line:
> > break
> > elif is_blank_line(line):
> > # Skip blank lines between records
> > uhandle.readline()
> > continue
> > elif line[:2] == 'ID':
> > self._scan_record(uhandle, consumer)
> > elif line[:2] == 'CC':
> > self._scan_copyrights(uhandle, consumer)
> > else:
> > raise SyntaxError, "There doesn't appear to be a record"
> >
> > def _scan_copyrights(self, uhandle, consumer):
> > consumer.start_copyrights()
> > self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
> > self._scan_terminator(uhandle, consumer)
> > consumer.end_copyrights()
> >
> > def _scan_record(self, uhandle, consumer):
> > consumer.start_record()
> > for fn in self._scan_fns:
> > fn(self, uhandle, consumer)
> >
> > # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
> > # the 3D lines, instead of the other way around.
> > # Thus, I'll give the 3D lines another chance after the DO lines
> > # are finished.
> > if fn is self._scan_do.im_func:
> > self._scan_3d(uhandle, consumer)
> > consumer.end_record()
> >
> > def _scan_line(self, line_type, uhandle, event_fn,
> > exactly_one=None, one_or_more=None, any_number=None,
> > up_to_one=None):
> > # Callers must set exactly one of exactly_one, one_or_more,
> > # any_number, or up_to_one to a true value. I do not explicitly
> > # check to make sure this function is called correctly.
> >
> > # This does not guarantee any parameter safety, but I
> > # like the readability. The other strategy I tried was have
> > # parameters min_lines, max_lines.
> >
> > if exactly_one or one_or_more:
> > read_and_call(uhandle, event_fn, start=line_type)
> > if one_or_more or any_number:
> > while 1:
> > if not attempt_read_and_call(uhandle, event_fn,
> > start=line_type):
> > break
> > if up_to_one:
> > attempt_read_and_call(uhandle, event_fn, start=line_type)
> >
> > def _scan_id(self, uhandle, consumer):
> > self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)
> >
> > def _scan_ac(self, uhandle, consumer):
> > self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)
> >
> > def _scan_dt(self, uhandle, consumer):
> > self._scan_line('DT', uhandle, consumer.date, exactly_one=1)
> >
> > def _scan_de(self, uhandle, consumer):
> > self._scan_line('DE', uhandle, consumer.description, exactly_one=1)
> >
> > def _scan_pa(self, uhandle, consumer):
> > self._scan_line('PA', uhandle, consumer.pattern, any_number=1)
> >
> > def _scan_ma(self, uhandle, consumer):
> > # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
> > # contain a CC line buried within an 'MA' line. Need to check
> > # for that.
> > while 1:
> > if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'):
> > line1 = uhandle.readline()
> > line2 = uhandle.readline()
> > uhandle.saveline(line2)
> > uhandle.saveline(line1)
> > if line1[:2] == 'CC' and line2[:2] == 'MA':
> > read_and_call(uhandle, consumer.comment, start='CC')
> > else:
> > break
> >
> > def _scan_ru(self, uhandle, consumer):
> > self._scan_line('RU', uhandle, consumer.rule, any_number=1)
> >
> > def _scan_nr(self, uhandle, consumer):
> > self._scan_line('NR', uhandle, consumer.numerical_results,
> > any_number=1)
> >
> > def _scan_cc(self, uhandle, consumer):
> > self._scan_line('CC', uhandle, consumer.comment, any_number=1)
> >
> > def _scan_dr(self, uhandle, consumer):
> > self._scan_line('DR', uhandle, consumer.database_reference,
> > any_number=1)
> >
> > def _scan_3d(self, uhandle, consumer):
> > self._scan_line('3D', uhandle, consumer.pdb_reference,
> > any_number=1)
> >
> > def _scan_do(self, uhandle, consumer):
> > self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)
> >
> > def _scan_terminator(self, uhandle, consumer):
> > self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)
> >
> > _scan_fns = [
> > _scan_id,
> > _scan_ac,
> > _scan_dt,
> > _scan_de,
> > _scan_pa,
> > _scan_ma,
> > _scan_ru,
> > _scan_nr,
> > _scan_ma, ## (lambrecht/dyoo) is this right?
> > _scan_nr, ## (lambrecht/dyoo) is this right?
> > _scan_cc,
> > _scan_dr,
> > _scan_3d,
> > _scan_do,
> > _scan_terminator
> > ]
> >
> > class _RecordConsumer(AbstractConsumer):
> > """Consumer that converts a Prosite record to a Record object.
> >
> > Members:
> > data Record with Prosite data.
> >
> > """
> > def __init__(self):
> > self.data = None
> >
> > def start_record(self):
> > self.data = Record()
> >
> > def end_record(self):
> > self._clean_record(self.data)
> >
> > def identification(self, line):
> > cols = string.split(line)
> > if len(cols) != 3:
> > raise SyntaxError, "I don't understand identification line\n%s" % \
> > line
> > self.data.name = self._chomp(cols[1]) # don't want ';'
> > self.data.type = self._chomp(cols[2]) # don't want '.'
> >
> > def accession(self, line):
> > cols = string.split(line)
> > if len(cols) != 2:
> > raise SyntaxError, "I don't understand accession line\n%s" % line
> > self.data.accession = self._chomp(cols[1])
> >
> > def date(self, line):
> > uprline = string.upper(line)
> > cols = string.split(uprline)
> >
> > # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
> > if cols[2] != '(CREATED);' or \
> > cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
> > cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
> > raise SyntaxError, "I don't understand date line\n%s" % line
> >
> > self.data.created = cols[1]
> > self.data.data_update = cols[3]
> > self.data.info_update = cols[6]
> >
> > def description(self, line):
> > self.data.description = self._clean(line)
> >
> > def pattern(self, line):
> > self.data.pattern = self.data.pattern + self._clean(line)
> >
> > def matrix(self, line):
> > self.data.matrix.append(self._clean(line))
> >
> > def rule(self, line):
> > self.data.rules.append(self._clean(line))
> >
> > def numerical_results(self, line):
> > cols = string.split(self._clean(line), ';')
> > for col in cols:
> > if not col:
> > continue
> > qual, data = map(string.lstrip, string.split(col, '='))
> > if qual == '/RELEASE':
> > release, seqs = string.split(data, ',')
> > self.data.nr_sp_release = release
> > self.data.nr_sp_seqs = int(seqs)
> > elif qual == '/FALSE_NEG':
> > self.data.nr_false_neg = int(data)
> > elif qual == '/PARTIAL':
> > self.data.nr_partial = int(data)
> > ## (lambrecht/dyoo) added temporary fix for qual //MATRIX_TYPE in CC
> > elif qual =='/MATRIX_TYPE':
> > pass
> > elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
> > m = re.match(r'(\d+)\((\d+)\)', data)
> > if not m:
> > raise SyntaxError, "Broken data %s in comment line\n%s" % \
> > (repr(data), line)
> > hits = tuple(map(int, m.groups()))
> > if(qual == "/TOTAL"):
> > self.data.nr_total = hits
> > elif(qual == "/POSITIVE"):
> > self.data.nr_positive = hits
> > elif(qual == "/UNKNOWN"):
> > self.data.nr_unknown = hits
> > elif(qual == "/FALSE_POS"):
> > self.data.nr_false_pos = hits
> > else:
> > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
> > (repr(qual), line)
> >
> > def comment(self, line):
> > cols = string.split(self._clean(line), ';')
> > for col in cols:
> > # DNAJ_2 in Release 15 has a non-standard comment line:
> > # CC Automatic scaling using reversed database
> > # Throw it away. (Should I keep it?)
> > if not col or col[:17] == 'Automatic scaling':
> > continue
> > qual, data = map(string.lstrip, string.split(col, '='))
> > if qual in ('/MATRIX_TYPE', '/SCALING_DB', '/AUTHOR',
> > '/FT_KEY', '/FT_DESC'):
> > continue ## (lambrecht/dyoo) This is a temporary fix until we know what
> > ## to do here
> > if qual == '/TAXO-RANGE':
> > self.data.cc_taxo_range = data
> > elif qual == '/MAX-REPEAT':
> > self.data.cc_max_repeat = data
> > elif qual == '/SITE':
> > pos, desc = string.split(data, ',')
> > self.data.cc_site.append((int(pos), desc))
> > elif qual == '/SKIP-FLAG':
> > self.data.cc_skip_flag = data
> > else:
> > raise SyntaxError, "Unknown qual %s in comment line\n%s" % \
> > (repr(qual), line)
> >
> > def database_reference(self, line):
> > refs = string.split(self._clean(line), ';')
> > for ref in refs:
> > if not ref:
> > continue
> > acc, name, type = map(string.strip, string.split(ref, ','))
> > if type == 'T':
> > self.data.dr_positive.append((acc, name))
> > elif type == 'F':
> > self.data.dr_false_pos.append((acc, name))
> > elif type == 'N':
> > self.data.dr_false_neg.append((acc, name))
> > elif type == 'P':
> > self.data.dr_potential.append((acc, name))
> > elif type == '?':
> > self.data.dr_unknown.append((acc, name))
> > else:
> > raise SyntaxError, "I don't understand type flag %s" % type
> >
> > def pdb_reference(self, line):
> > cols = string.split(line)
> > for id in cols[1:]: # get all but the '3D' col
> > self.data.pdb_structs.append(self._chomp(id))
> >
> > def documentation(self, line):
> > self.data.pdoc = self._chomp(self._clean(line))
> >
> > def terminator(self, line):
> > pass
> >
> > def _chomp(self, word, to_chomp='.,;'):
> > # Remove the punctuation at the end of a word.
> > if word[-1] in to_chomp:
> > return word[:-1]
> > return word
> >
> > def _clean(self, line, rstrip=1):
> > # Clean up a line.
> > if rstrip:
> > return string.rstrip(line[5:])
> > return line[5:]
> >
> > def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
> > """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
> > list of PatternHit's
> >
> > Search a sequence for occurrences of Prosite patterns. You can
> > specify either a sequence in seq or a SwissProt/trEMBL ID or accession
> > in id. Only one of those should be given. If exclude_frequent
> > is true, then the patterns with the high probability of occurring
> > will be excluded.
> >
> > """
> > if (seq and id) or not (seq or id):
> > raise ValueError, "Please specify either a sequence or an id"
> > handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
> > return _extract_pattern_hits(handle)
> >
> > def _extract_pattern_hits(handle):
> > """_extract_pattern_hits(handle) -> list of PatternHit's
> >
> > Extract hits from a web page. Raises a ValueError if there
> > was an error in the query.
> >
> > """
> > class parser(sgmllib.SGMLParser):
> > def __init__(self):
> > sgmllib.SGMLParser.__init__(self)
> > self.hits = []
> > self.broken_message = 'Some error occurred'
> > self._in_pre = 0
> > self._current_hit = None
> > self._last_found = None # Save state of parsing
> > def handle_data(self, data):
> > if string.find(data, 'try again') >= 0:
> > self.broken_message = data
> > return
> > elif data == 'illegal':
> > self.broken_message = 'Sequence contains illegal characters'
> > return
> > if not self._in_pre:
> > return
> > elif not string.strip(data):
> > return
> > if self._last_found is None and data[:4] == 'PDOC':
> > self._current_hit.pdoc = data
> > self._last_found = 'pdoc'
> > elif self._last_found == 'pdoc':
> > if data[:2] != 'PS':
> > raise SyntaxError, "Expected accession but got:\n%s" % data
> > self._current_hit.accession = data
> > self._last_found = 'accession'
> > elif self._last_found == 'accession':
> > self._current_hit.name = data
> > self._last_found = 'name'
> > elif self._last_found == 'name':
> > self._current_hit.description = data
> > self._last_found = 'description'
> > elif self._last_found == 'description':
> > m = re.findall(r'(\d+)-(\d+) (\w+)', data)
> > for start, end, seq in m:
> > self._current_hit.matches.append(
> > (int(start), int(end), seq))
> >
> > def do_hr(self, attrs):
> > # <HR> inside a <PRE> section means a new hit.
> > if self._in_pre:
> > self._current_hit = PatternHit()
> > self.hits.append(self._current_hit)
> > self._last_found = None
> > def start_pre(self, attrs):
> > self._in_pre = 1
> > self.broken_message = None # Probably not broken
> > def end_pre(self):
> > self._in_pre = 0
> > p = parser()
> > p.feed(handle.read())
> > if p.broken_message:
> > raise ValueError, p.broken_message
> > return p.hits
> >
> >
> >
> >
> > def index_file(filename, indexname, rec2key=None):
> > """index_file(filename, indexname, rec2key=None)
> >
> > Index a Prosite file. filename is the name of the file.
> > indexname is the name of the dictionary. rec2key is an
> > optional callback that takes a Record and generates a unique key
> > (e.g. the accession number) for the record. If not specified,
> > the id name will be used.
> >
> > """
> > if not os.path.exists(filename):
> > raise ValueError, "%s does not exist" % filename
> >
> > index = Index.Index(indexname, truncate=1)
> > index[Dictionary._Dictionary__filename_key] = filename
> >
> > iter = Iterator(open(filename), parser=RecordParser())
> > while 1:
> > start = iter._uhandle.tell()
> > rec = iter.next()
> > length = iter._uhandle.tell() - start
> >
> > if rec is None:
> > break
> > if rec2key is not None:
> > key = rec2key(rec)
> > else:
> > key = rec.name
> >
> > if not key:
> > raise KeyError, "empty key was produced"
> > elif index.has_key(key):
> > raise KeyError, "duplicate key %s found" % key
> >
> > index[key] = start, length
> >
> > def _extract_record(handle):
> > """_extract_record(handle) -> str
> >
> > Extract PROSITE data from a web page. Raises a ValueError if no
> > data was found in the web page.
> >
> > """
> > # All the data appears between tags:
> > # <pre width = 80>ID NIR_SIR; PATTERN.
> > # </PRE>
> > class parser(sgmllib.SGMLParser):
> > def __init__(self):
> > sgmllib.SGMLParser.__init__(self)
> > self._in_pre = 0
> > self.data = []
> > def handle_data(self, data):
> > if self._in_pre:
> > self.data.append(data)
> > def do_br(self, attrs):
> > if self._in_pre:
> > self.data.append('\n')
> > def start_pre(self, attrs):
> > self._in_pre = 1
> > def end_pre(self):
> > self._in_pre = 0
> > p = parser()
> > p.feed(handle.read())
> > if not p.data:
> > raise ValueError, "No data found in web page."
> > return string.join(p.data, '')
> >
>
>
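The temporary fix in comment() above skips qualifiers such as /MATRIX_TYPE, so their values are lost. A minimal sketch of one way to keep them, assuming the same "qualifier=value;" layout that the quoted code splits on: this is Python 3 rather than the Python 2 of the attachment, and parse_cc_line and KNOWN_QUALIFIERS are hypothetical names that are not part of Bio.Prosite.

```python
# Qualifiers that the quoted comment() method stores on the Record.
KNOWN_QUALIFIERS = {"/TAXO-RANGE", "/MAX-REPEAT", "/SITE", "/SKIP-FLAG"}


def parse_cc_line(line):
    """Split one CC line into (known, extra) lists of (qualifier, value).

    Hypothetical helper, not part of Bio.Prosite. Unknown qualifiers are
    collected in 'extra' instead of being dropped or raising an error.
    Free-text comments without '=' are ignored.
    """
    known, extra = [], []
    # Data starts in column 6, as in the quoted module's _clean().
    for col in line[5:].rstrip().split(";"):
        col = col.strip()
        if not col or "=" not in col:
            continue
        qual, data = col.split("=", 1)
        pair = (qual.strip(), data.strip())
        if pair[0] in KNOWN_QUALIFIERS:
            known.append(pair)
        else:
            extra.append(pair)
    return known, extra


# Illustrative input line, not copied from a specific prosite.dat release.
known, extra = parse_cc_line("CC   /MATRIX_TYPE=protein_domain; /SKIP-FLAG=FALSE;")
```

Lists of pairs are used rather than dictionaries because a qualifier such as /SITE can appear more than once in a record.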
--------------------------------------------------------------------------
Mark Lambrecht
Postdoctoral Research Fellow
The Arabidopsis Information Resource FAX: (650) 325-6857
Carnegie Institution of Washington Tel: (650) 325-1521 ext.397
Department of Plant Biology URL: http://arabidopsis.org/
260 Panama St.
Stanford, CA 94305
--------------------------------------------------------------------------