[Biopython-dev] SwissProt parser
Brad Chapman
chapmanb at arches.uga.edu
Sat Nov 11 12:49:10 EST 2000
Hello all;
While writing docs for SwissProt, I noticed the parser breaking on
some of the sequences I was playing around with. I don't normally use
SwissProt, so I have no idea whether these entries are representative
of the format, but the following entries gave me problems: '023729',
'023730', '023731' (some nice chalcone synthases).
The problem is that these entries contain a reference to the NCBI
taxonomy ID, on a line type the parser wasn't looking for. It occurs
right after the organism info and looks like:
OX   NCBI_TaxID=41205;
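As a rough illustration of what has to be pulled out of such a line, here is a minimal standalone sketch in modern Python. The function name `parse_ox_line` is hypothetical and not part of the parser; it strips the line code and surrounding whitespace instead of relying on fixed columns, and it assumes a single `NCBI_TaxID=` value per line.

```python
def parse_ox_line(line):
    """Return the taxonomy id from one OX line as a string.

    Hypothetical helper for illustration only; assumes the form
    'OX   NCBI_TaxID=<id>;' with exactly one '=' on the line.
    """
    if not line.startswith("OX"):
        raise ValueError("not an OX line: %r" % line)
    # Drop the two-letter line code, then padding and the trailing ';'.
    data = line[2:].strip().rstrip(";")
    key, sep, value = data.partition("=")
    if not sep or key != "NCBI_TaxID":
        raise ValueError("unexpected OX content: %r" % data)
    return value.strip()


tax_id = parse_ox_line("OX   NCBI_TaxID=41205;")
```

The id is kept as a string, matching the patch, which also stores whatever text follows the `=`.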
Anyway, I modified the parser so that it accepts this line, and added
the taxonomy ID to the sequence class. It seems to work okay with the
entries I mentioned, and still passes the regression tests. The patch
for this is attached.
Please let me know if there are any problems with the patch. Thanks!
Brad
-------------- next part --------------
*** SProt.py.orig Sun Jul 16 19:18:57 2000
--- SProt.py Sat Nov 11 12:30:16 2000
***************
*** 61,66 ****
--- 61,67 ----
      organelle         The origin of the sequence.
      organism_classification  The taxonomy classification. List of strings.
                               (http://www.ncbi.nlm.nih.gov/Taxonomy/)
+     taxonomy_id       NCBI taxonomy id
      references        List of Reference objects.
      comments          List of strings.
      cross_references  List of tuples (db, id1[, id2][, id3]). See the docs.
***************
*** 89,94 ****
--- 90,96 ----
          self.organism = ''
          self.organelle = ''
          self.organism_classification = []
+         self.taxonomy_id = ''
          self.references = []
          self.comments = []
          self.cross_references = []
***************
*** 391,396 ****
--- 393,402 ----
          self._scan_line('OC', uhandle, consumer.organism_classification,
                          one_or_more=1)
  
+     def _scan_ox(self, uhandle, consumer):
+         self._scan_line('OX', uhandle, consumer.taxonomy_id,
+                         one_or_more=1)
+ 
      def _scan_reference(self, uhandle, consumer):
          while 1:
              if safe_peekline(uhandle)[:2] != 'RN':
***************
*** 462,467 ****
--- 468,474 ----
          _scan_os,
          _scan_og,
          _scan_oc,
+         _scan_ox,
          _scan_reference,
          _scan_cc,
          _scan_dr,
***************
*** 540,545 ****
--- 547,557 ----
          cols = string.split(line, ';')
          for col in cols:
              self.data.organism_classification.append(string.lstrip(col))
+ 
+     def taxonomy_id(self, line):
+         line = self._chomp(string.rstrip(line[5:]))
+         descr, tax_id = string.split(line, '=')
+         self.data.taxonomy_id = tax_id
  
      def reference_number(self, line):
          rn = string.rstrip(line[5:])