[Biopython-dev] Re: Parsing Protein GenBank Records

Tue Sep 18 21:25:32 EDT 2001

Hi Joung;
(ccing this to biopython-dev since this is relevant to everyone)

> I'm having trouble parsing GenBank records obtained from the protein
> database. The parser works fine for nucleotide GenBank records , but not for
> protein records. I would appreciate it very much if you can guide me in
> right direction for parsing such records.
> 
> Here is the code and the error that I get back.
> 
> >>> parser = GenBank.RecordParser()
> >>> ncbi = GenBank.NCBIDictionary(database='Protein')
> >>> rec = ncbi['6754304']

The parser does work for proteins in general, but does fail badly on
this particular REFSEQ sequence. In the past, REFSEQ stuff has been
only "sort of" GenBank format, and this record is no exception. It
has a lot of formatting problems (has no identifier for the sequence
type in the LOCUS line, has extra DBSOURCE tag, has non-standard
feature table types and keys (Protein, Region, region_name)).
Anyways, it is a big non-standard formatting mess.

I've fixed the GenBank parser to be able to handle this, and checked
the changes into CVS. Diffs to the relevant files (Record.py,
__init__.py and genbank_format.py in Bio.GenBank) are also attached
to this file in case you don't have CVS access.

Thanks for the bug report. Hope this works for you!

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** Record.py.orig	Sat May 19 15:31:16 2001
--- Record.py	Tue Sep 18 21:02:18 2001
***************
*** 106,112 ****
--- 106,114 ----
      o date - The date of submission of the record, in a form like '28-JUL-1998'
      o accession - list of all accession numbers for the sequence.
      o nid - Nucleotide identifier number.
+     o pid - Proteint identifier number
      o version - The accession number + version (ie. AB01234.2)
+     o db_source - Information about the database the record came from
      o gi - The NCBI gi identifier for the record.
      o keywords - A list of keywords related to the record.
      o segment - If the record is one of a series, this is info about which
***************
*** 153,159 ****
--- 155,163 ----
          self.definition = ''
          self.accession = []
          self.nid = ''
+         self.pid = ''
          self.version = ''
+         self.db_source = ''
          self.gi = ''
          self.keywords = []
          self.segment = ''
***************
*** 185,191 ****
--- 189,197 ----
          output += self._accession_line()
          output += self._version_line()
          output += self._nid_line()
+         output += self._pid_line()
          output += self._keywords_line()
+         output += self._db_source_line()
          output += self._segment_line()
          output += self._source_line()
          output += self._organism_line()
***************
*** 210,216 ****
          output += "%-9s" % self.locus
          output += " " # 22 space
          output += "%7s" % self.size
!         output += " bp "

          # treat circular types differently, since they'll have long residue
          # types
--- 216,225 ----
          output += "%-9s" % self.locus
          output += " " # 22 space
          output += "%7s" % self.size
!         if self.residue_type.find("PROTEIN") >= 0:
!             output += " aa"
!         else:
!             output += " bp "

          # treat circular types differently, since they'll have long residue
          # types
***************
*** 272,277 ****
--- 281,296 ----
              output = ""
          return output

+     def _pid_line(self):
+         """Output for PID line. Presumedly, PID usage is also obsolete.
+         """
+         if self.pid:
+             output = Record.BASE_FORMAT % "PID"
+             output += "%s\n" % self.pid
+         else:
+             output = ""
+         return output
+ 
      def _keywords_line(self):
          """Output for the KEYWORDS line.
          """
***************
*** 288,293 ****
--- 307,322 ----
              output += _wrapped_genbank(keyword_info,
                                         Record.GB_BASE_INDENT)

+         return output
+ 
+     def _db_source_line(self):
+         """Output for DBSOURCE line.
+         """
+         if self.db_source:
+             output = Record.BASE_FORMAT % "DBSOURCE"
+             output += "%s\n" % self.db_source
+         else:
+             output = ""
          return output

      def _segment_line(self):
-------------- next part --------------
*** __init__.py.orig	Sat Jul 28 12:02:25 2001
--- __init__.py	Tue Sep 18 21:13:48 2001
***************
*** 98,112 ****
      def __getitem__(self, key):
          """Retrieve an item from the dictionary.
          """
!         print "keys:", self._index.keys()
          # get the location of the record of interest in the file
          start, len = self._index[key]
!         print "start:", start, "len:", len

          # read through and get the data from the file
          self._handle.seek(start)
          data = self._handle.read(len)
!         print "data:", data

          # run the data through the parser if one is specified
          if self._parser is not None:
--- 98,112 ----
      def __getitem__(self, key):
          """Retrieve an item from the dictionary.
          """
!         # print "keys:", self._index.keys()
          # get the location of the record of interest in the file
          start, len = self._index[key]
!         # print "start:", start, "len:", len

          # read through and get the data from the file
          self._handle.seek(start)
          data = self._handle.read(len)
!         # print "data:", data

          # run the data through the parser if one is specified
          if self._parser is not None:
***************
*** 434,439 ****
--- 434,442 ----
      def nid(self, content):
          self.data.annotations['nid'] = content

+     def pid(self, content):
+         self.data.annotations['pid'] = content
+ 
      def version(self, version_id):
          """Set the version to overwrite the id.

***************
*** 443,448 ****
--- 446,454 ----
          """
          self.data.id = version_id

+     def db_source(self, content):
+         self.data.annotations['db_source'] = content.rstrip()
+ 
      def gi(self, content):
          self.data.annotations['gi'] = content

***************
*** 485,510 ****
          (bases 1 to 86436)
          (sites)
          (bases 1 to 105654; 110423 to 111122)
          """
!         # first remove the parentheses
          ref_base_info = content[1:-1]

          all_locations = []
!         # only attempt to get out information if we find the words
!         # 'bases' and 'to'
          if (string.find(ref_base_info, 'bases') != -1 and
              string.find(ref_base_info, 'to') != -1):
              # get rid of the beginning 'bases'
              ref_base_info = ref_base_info[5:]
!             # split possibly multiple locations using the ';'
!             all_base_info = string.split(ref_base_info, ';')
! 
!             for base_info in all_base_info:
!                 start, end = string.split(base_info, 'to')
!                 this_location = \
!                   SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
!                 all_locations.append(this_location)

          # make sure if we are not finding information then we have
          # the string 'sites' or the string 'bases'
--- 491,516 ----
          (bases 1 to 86436)
          (sites)
          (bases 1 to 105654; 110423 to 111122)
+         1  (residues 1 to 182)
          """
!         # first remove the parentheses or other junk
          ref_base_info = content[1:-1]

          all_locations = []
!         # parse if we've got 'bases' and 'to'
          if (string.find(ref_base_info, 'bases') != -1 and
              string.find(ref_base_info, 'to') != -1):
              # get rid of the beginning 'bases'
              ref_base_info = ref_base_info[5:]
!             locations = self._split_reference_locations(ref_base_info)
!             all_locations.extend(locations)
!         elif (ref_base_info.find("residues") >= 0 and
!               ref_base_info.find("to") >= 0):
!             residues_start = ref_base_info.find("residues")
!             # get only the information after "residues"
!             ref_base_info = ref_base_info[(residues_start + len("residues ")):]
!             locations = self._split_reference_locations(ref_base_info)
!             all_locations.extend(locations)

          # make sure if we are not finding information then we have
          # the string 'sites' or the string 'bases'
***************
*** 517,523 ****
                               (ref_base_info, self.data.id))

          self._current_ref.location = all_locations
!                 
      def authors(self, content):
          self._current_ref.authors = content

--- 523,551 ----
                               (ref_base_info, self.data.id))

          self._current_ref.location = all_locations
! 
!     def _split_reference_locations(self, location_string):
!         """Get reference locations out of a string of reference information
!         
!         The passed string should be of the form:
! 
!             1 to 20; 20 to 100
! 
!         This splits the information out and returns a list of location objects
!         based on the reference locations.
!         """
!         # split possibly multiple locations using the ';'
!         all_base_info = location_string.split(';')
! 
!         new_locations = []
!         for base_info in all_base_info:
!             start, end = base_info.split('to')
!             this_location = \
!               SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
!             new_locations.append(this_location)
!         return new_locations
! 
      def authors(self, content):
          self._current_ref.authors = content

***************
*** 905,913 ****
--- 933,947 ----
      def nid(self, content):
          self.data.nid = content

+     def pid(self, content):
+         self.data.pid = content
+ 
      def version(self, content):
          self.data.version = content

+     def db_source(self, content):
+         self.data.db_source = content.rstrip()
+ 
      def gi(self, content):
          self.data.gi = content

***************
*** 1070,1076 ****
          # in the MartelParser
          self.interest_tags = ["locus", "size", "residue_type",
                                "data_file_division", "date",
!                               "definition", "accession", "nid", "version",
                                "gi", "keywords", "segment",
                                "source", "organism",
                                "taxonomy", "reference_num",
--- 1104,1111 ----
          # in the MartelParser
          self.interest_tags = ["locus", "size", "residue_type",
                                "data_file_division", "date",
!                               "definition", "accession", "nid", 
!                               "pid", "version", "db_source",
                                "gi", "keywords", "segment",
                                "source", "organism",
                                "taxonomy", "reference_num",
-------------- next part --------------
*** genbank_format.py.orig	Thu May 10 17:42:43 2001
--- genbank_format.py	Tue Sep 18 21:07:11 2001
***************
*** 142,147 ****
--- 142,156 ----
                          nid +
                          Martel.AnyEol())

+ # PID         g6754304
+ pid = Martel.Group("pid",
+                    Martel.Re("[\w\d]+"))
+ pid_line = Martel.Group("pid_line", 
+                         Martel.Str("PID") +
+                         blank_space +
+                         pid + 
+                         Martel.AnyEol())
+ 
  # version and GI line
  # VERSION     AC007323.5  GI:6587720
  version = Martel.Group("version",
***************
*** 159,164 ****
--- 168,181 ----
                              gi +
                              Martel.AnyEol())

+ # DBSOURCE    REFSEQ: accession NM_010510.1
+ db_source = Martel.Group("db_source",
+                          Martel.ToEol())
+ db_source_line = Martel.Group("db_source_line",
+                               Martel.Str("DBSOURCE") +
+                               blank_space +
+                               db_source) 
+ 
  # keywords line
  # KEYWORDS    antifreeze protein homology; cold-regulated gene; cor6.6 gene;
  #             KIN1 homology.
***************
*** 312,319 ****
--- 329,338 ----
      "primer",           # Primer binding region used with PCR  XXX not in 
                          #   http://www.ncbi.nlm.nih.gov/collab/FT/index.html
      "promoter",         # A region involved in transcription initiation
+     "Protein",          # A REFSEQ invention for referring to a protein
      "protein_bind",     # Non-covalent protein binding site on DNA or RNA
      "RBS",              # Ribosome binding site
+     "Region",           # Another REFSEQ invention that doesn't make any sense
      "rep_origin",       # Replication origin for duplex DNA
      "repeat_region",    # Sequence containing repeated subsequences
      "repeat_unit",      # One repeated unit of a repeat_region
***************
*** 424,429 ****
--- 443,449 ----
                        #   evidence for a feature
      "clone_lib",      # Clone library from which the sequence was obtained
      "clone",          # Clone from which the sequence was obtained
+     "coded_by",       # REFSEQ invention to specify a crossreference
      "codon_start",    # Indicates the first base of the first complete codon
                        #   in a CDS (as 1 or 2 or 3)
      "codon",          # Specifies a codon that is different from any found
***************
*** 505,510 ****
--- 525,531 ----
      "rearranged",     # If the sequence shown is DNA and a member of the
                        #   immunoglobulin family, this qualifier is used to
                        #   denote that the sequence is from rearranged DNA
+     "region_name",    # REFSEQ invention to go with their Region Type
      "replace",        # Indicates that the sequence identified a feature's
                        #   intervals is replaced by the  sequence shown in
                        #   "text"
***************
*** 624,630 ****
--- 645,653 ----
                        definition_block + \
                        accession_block + \
                        Martel.Opt(nid_line) + \
+                       Martel.Opt(pid_line) + \
                        Martel.Opt(version_line) + \
+                       Martel.Opt(db_source_line) + \
                        keywords_block + \
                        Martel.Opt(segment_line) + \
                        source_block + \
***************
*** 633,639 ****
                        Martel.Opt(comment_block) + \
                        features_line + \
                        Martel.Rep1(feature) + \
!                       base_count_line + \
                        sequence_entry + \
                        record_end)

--- 656,662 ----
                        Martel.Opt(comment_block) + \
                        features_line + \
                        Martel.Rep1(feature) + \
!                       Martel.Opt(base_count_line) + \
                        sequence_entry + \
                        record_end)