[Biopython-dev] SwissProt parser

Sat Nov 11 12:49:10 EST 2000

Hello all;
      I was writing docs for SwissProt, and noticed the parser
breaking with some of the sequences I was playing around with. I don't 
normally use SwissProt, so I have no idea if these entries are
representative of the format or anything, but the following entries
gave me problems: 'O23729', 'O23730', 'O23731' (some nice chalcone
synthases).

The problem was that these entries contain a line referencing the NCBI
taxonomy id, which the parser wasn't expecting. It occurs right after
the organism info and looks like:

OX   NCBI_TaxID=41205;

Anyway, I modified the parser so that it accepts this line, and added
a taxonomy_id attribute to the sequence class to hold the id. It seems
to work okay with the entries I mentioned, and still passes the
regression tests. The patch for this is attached.
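
In case it helps to see the idea outside the parser, here is a rough
standalone sketch of what the new consumer method does with an OX line.
The function name parse_ox_line is made up for illustration and is not
part of SProt.py, and it uses plain string methods in place of the
parser's own helpers, so treat it as an approximation of the patch
rather than the patch itself.

```python
def parse_ox_line(line):
    """Return the NCBI taxonomy id from a SwissProt 'OX' line, as a string.

    Hypothetical helper for illustration only; not part of SProt.py.
    """
    if not line.startswith("OX"):
        raise ValueError("not an OX line: %r" % line)
    # Skip the two-letter line code plus padding (the first 5 columns),
    # then drop trailing whitespace and the terminating semicolon.
    body = line[5:].rstrip().rstrip(";")
    # What remains should look like 'NCBI_TaxID=41205'.
    descr, tax_id = body.split("=", 1)
    if descr != "NCBI_TaxID":
        raise ValueError("unexpected OX qualifier: %r" % descr)
    return tax_id


if __name__ == "__main__":
    print(parse_ox_line("OX   NCBI_TaxID=41205;"))
```

Like the patch, this keeps the id as a string and assumes a single
NCBI_TaxID entry per line.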

Please let me know if there are any problems with the patch. Thanks!

Brad

-------------- next part --------------
*** SProt.py.orig	Sun Jul 16 19:18:57 2000
--- SProt.py	Sat Nov 11 12:30:16 2000
***************
*** 61,66 ****
--- 61,67 ----
      organelle         The origin of the sequence.
      organism_classification  The taxonomy classification.  List of strings.
                               (http://www.ncbi.nlm.nih.gov/Taxonomy/)
+     taxonomy_id       NCBI taxonomy id
      references        List of Reference objects.
      comments          List of strings.
      cross_references  List of tuples (db, id1[, id2][, id3]).  See the docs.
***************
*** 89,94 ****
--- 90,96 ----
          self.organism = ''
          self.organelle = ''
          self.organism_classification = []
+         self.taxonomy_id = ''
          self.references = []
          self.comments = []
          self.cross_references = []
***************
*** 391,396 ****
--- 393,402 ----
          self._scan_line('OC', uhandle, consumer.organism_classification,
                          one_or_more=1)

+     def _scan_ox(self, uhandle, consumer):
+         self._scan_line('OX', uhandle, consumer.taxonomy_id,
+                         one_or_more=1)
+ 
      def _scan_reference(self, uhandle, consumer):
          while 1:
              if safe_peekline(uhandle)[:2] != 'RN':
***************
*** 462,467 ****
--- 468,474 ----
          _scan_os,
          _scan_og,
          _scan_oc,
+         _scan_ox,
          _scan_reference,
          _scan_cc,
          _scan_dr,
***************
*** 540,545 ****
--- 547,557 ----
          cols = string.split(line, ';')
          for col in cols:
              self.data.organism_classification.append(string.lstrip(col))
+ 
+     def taxonomy_id(self, line):
+         line = self._chomp(string.rstrip(line[5:]))
+         descr, tax_id = string.split(line, '=')
+         self.data.taxonomy_id = tax_id

      def reference_number(self, line):
          rn = string.rstrip(line[5:])