[Biopython-dev] Proposed Incompatible Changes to GenBank SeqFeatures

Brad Chapman chapmanb at arches.uga.edu
Wed Sep 19 06:53:36 EDT 2001


Hello all;
Recently I've been slugging through the biopython implementation of
the new BioCorba spec, and have been forced to come back into some
bad coding I did on the parts of the GenBank parser that converts
GenBank into SeqRecord and SeqFeature objects. Most of the mistakes
that I'm working on fixing are detailed by Andrew:

http://www.biopython.org/pipermail/biopython-dev/2001-July/000451.html

(unfortunately he sent this right before ISMB when I was crazily
coding to get my poster done :-<)

Anyways, I'd like to try and fix the problems he mentions here (and
which are also a problem for biopython-corba), and have got the
changes together. The issue is that some of the changes are
back-incompatible, so I'd like to talk about them here and get
people's thoughts on whether the changes will severely affect their
existing code. Note that his only influences parsing GenBank in
SeqRecord and SeqFeature objects, not in GenBank Record objects (ie.
the FeatureParser, not the RecordParser).

I'll go through the changes step by step, and attach the diffs
(against current CVS) which implement the changes. I'll start with
incompatible changes, and then go on to the more benign back-compatible
changes. Comments are very welcome. Okay, here we go:

Back-incompatible changes
-------------------------
1. 
Andrew:
> feature qualifiers shouldn't really be a dictionary

I have been storing feature qualifiers as a dictionary with the keys
equal to the qualifier keys and the values equal to the qualifier
values. The problem comes when there are multiple keys of the same
name, ie with db_xref in the following CDS Feature:

     CDS             50..250
                     /gene="cor6.6"
                     /db_xref="GI:16230"
                     /db_xref="SWISS-PROT:P31169"

What I did was this hideous hack to add numbers to the end of the
qualifiers to make them unique, so the feature above would be:

{"gene" : "cor6.6",
 "db_xref" : "GI:16230",
 "db_xref1" : "SWISS-PROT:P31169"}

As Andrew points out, this is hideous and all around bad news. I
would like to propose to use a combination dictionary/list structure
to hold multiple keys, so the above now looks like:

{"gene" : ["cor6.6"],
 "db_xref" : [""GI:16230", "SWISS-PROT:P31169"]}

This affects all people using qualifiers (since even single items
are now in a list), but I think it is easier to always have the
values be lists (otherwise you'll always have to check is the key is
of type("") or type([]).

2. 
> - What's the numbering system of the FeatureLocation?

There is a difference between "biological" coordinates used in
GenBank and python/biopython coordinates. In "biological"
coordinates 1 is the first base in a sequence, and if you want to
get 1 to 50, that includes both the first and 50th base.

In python, 0 is the first base in the sequence, and 1 to 50 includes
1, but not 50.

Previously I did no conversion of GenBank "biological" coordinates
to python coordinates, but Andrew argues (and I agree) that I should
do this to make things less confusing. The new implementation does
the conversion, so this will give an "off-by-one" type error to
people using the current numbers.

3. 
Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:
CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that. 

I'd like to propose adding a location_operator attribute to
SeqFeature (already done in CVS) and have the top level SeqFeature
be type "CDS" with location_operator "join", and all sub_features
also be of the same type and location_operator. This will only
affect people who relied on the previous (fairly ugly)
type/location_operator concatenation mechanism.

4. strand information

Previous I had the default strand for a SeqFeature be 0, which I now
think is a mistake. In my mind, the strand information is:

None -- No strand information, or not relevant (ie. protein)
1 -- The top strand
-1 -- The bottom strand
0 -- both strands

So, I changed the default strand info to be None. Hopefully this is
a relatively minor change.

Back-compatible changes
=======================
1. 
Andrew:
> The biopython SeqFeature currently must be used like:
>    feature = SeqFeature()
>    feature.type = "variation"
>    ...

> I would much rather prefer allowing the values to be set
> through the constructor, as in
>     feature = SeqFeature(type = "variation", ...)
      
I agree with Andrew on this and have implemented it. The only
current issue is that if I try to set sub_features or annotations in
the constructor I'll get infinate recursion problems when printing
out the features (ie. try going this and running test_GenBank). I
haven't been able to figure out why that is yet.

2. New attributes in SeqFeature

I added location_operator and id attributes to SeqFeatures (to help
support BioCorba), which can now be used. These shouldn't affect
anyones old code and you can now use 'em if you want.


Whew, I think that is everything :-). Thanks for reading all of the
way through this. I'd really like comments on whether these changes
are good/bad, and most importantly whether I can check 'em in or
should do something different. Thanks much!

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** __init__.py	Wed Sep 19 05:56:29 2001
--- __init__.py.orig	Tue Sep 18 21:13:48 2001
***************
*** 374,397 ****
          """
          return text.replace(" ", "")
  
-     def _convert_to_python_numbers(self, start, end):
-         """Convert a start and end range to python notation.
- 
-         In GenBank, starts and ends are defined in "biological" coordinates,
-         where 1 is the first base and [i, j] means to include both i and j.
- 
-         In python, 0 is the first base and [i, j] means to include i, but
-         not j. 
- 
-         So, to convert "biological" to python coordinates, we need to 
-         subtract 1 from the start, and leave the end and things should
-         be converted happily.
-         """
-         new_start = start - 1
-         new_end = end
- 
-         return new_start, new_end
- 
  class _FeatureConsumer(_BaseGenBankConsumer):
      """Create a SeqRecord object with Features to return.
  
--- 374,379 ----
***************
*** 558,567 ****
          new_locations = []
          for base_info in all_base_info:
              start, end = base_info.split('to')
!             new_start, new_end = \
!               self._convert_to_python_numbers(int(start.strip()),
!                                               int(end.strip()))
!             this_location = SeqFeature.FeatureLocation(new_start, new_end)
              new_locations.append(this_location)
          return new_locations
  
--- 540,548 ----
          new_locations = []
          for base_info in all_base_info:
              start, end = base_info.split('to')
!             this_location = \
!               SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
              new_locations.append(this_location)
          return new_locations
  
***************
*** 707,717 ****
          # current feature, then get the information for this feature
          for inner_element in function.args:
              new_sub_feature = SeqFeature.SeqFeature()
!             # inherit the type from the parent
!             new_sub_feature.type = cur_feature.type 
!             # add the join or order info to the location_operator
!             cur_feature.location_operator = function.name
!             new_sub_feature.location_operator = function.name
              # inherit references and strand from the parent feature
              new_sub_feature.ref = cur_feature.ref
              new_sub_feature.ref_db = cur_feature.ref_db
--- 688,695 ----
          # current feature, then get the information for this feature
          for inner_element in function.args:
              new_sub_feature = SeqFeature.SeqFeature()
!             # add _join or _order to the name to make the type clear
!             new_sub_feature.type = cur_feature.type + '_' + function.name
              # inherit references and strand from the parent feature
              new_sub_feature.ref = cur_feature.ref
              new_sub_feature.ref_db = cur_feature.ref_db
***************
*** 726,735 ****
          # set the location of the top -- this should be a combination of
          # the start position of the first sub_feature and the end position
          # of the last sub_feature
- 
-         # these positions are already converted to python coordinates 
-         # (when the sub_features were added) so they don't need to
-         # be converted again
          feature_start = cur_feature.sub_features[0].location.start
          feature_end = cur_feature.sub_features[-1].location.end
          cur_feature.location = SeqFeature.FeatureLocation(feature_start,
--- 704,709 ----
***************
*** 788,796 ****
          # check if we just have a single base
          if not(isinstance(range_info, LocationParser.Range)):
              pos = self._get_position(range_info)
!             # move the single position back one to be consistent with how
!             # python indexes numbers (starting at 0)
!             pos.position = pos.position  - 1
              return SeqFeature.FeatureLocation(pos, pos)
          # otherwise we need to get both sides of the range
          else:
--- 762,768 ----
          # check if we just have a single base
          if not(isinstance(range_info, LocationParser.Range)):
              pos = self._get_position(range_info)
! 
              return SeqFeature.FeatureLocation(pos, pos)
          # otherwise we need to get both sides of the range
          else:
***************
*** 798,807 ****
              start_pos = self._get_position(range_info.low)
              end_pos = self._get_position(range_info.high)
  
-             start_pos.position, end_pos.position = \
-               self._convert_to_python_numbers(start_pos.position,
-                                               end_pos.position)
- 
              return SeqFeature.FeatureLocation(start_pos, end_pos)
  
      def _get_position(self, position):
--- 770,775 ----
***************
*** 854,867 ****
          # if we've got a key from before, add it to the dictionary of
          # qualifiers
          if self._cur_qualifier_key:
!             key = self._cur_qualifier_key
!             value = self._cur_qualifier_value
!             # if the qualifier name exists, append the value
!             if self._cur_feature.qualifiers.has_key(key):
!                 self._cur_feature.qualifiers[key].append(value)
!             # otherwise start a new list of the key with its values
!             else:
!                 self._cur_feature.qualifiers[key] = [value]
  
      def qualifier_key(self, content):
          """When we get a qualifier key, use it as a dictionary key.
--- 822,837 ----
          # if we've got a key from before, add it to the dictionary of
          # qualifiers
          if self._cur_qualifier_key:
!             # get a unique name
!             unique_name = self._cur_qualifier_key
!             counter = 1
!             while self._cur_feature.qualifiers.has_key(unique_name):
!                 unique_name = self._cur_qualifier_key + str(counter)
!                 counter = counter + 1
!                 
!                 
!             self._cur_feature.qualifiers[unique_name] = \
!                                                       self._cur_qualifier_value
  
      def qualifier_key(self, content):
          """When we get a qualifier key, use it as a dictionary key.
-------------- next part --------------
*** SeqFeature.py.orig	Tue Sep 18 20:35:03 2001
--- SeqFeature.py	Wed Sep 19 01:30:26 2001
***************
*** 69,75 ****
      40 and 50 to 60, respectively.
      """
      def __init__(self, location = None, type = '', location_operator = '',
!                  strand = 0, id = "<unknown id>", 
                   qualifiers = {}, sub_features = [],
                   ref = None, ref_db = None):
          """Initialize a SeqFeature on a Sequence.
--- 69,75 ----
      40 and 50 to 60, respectively.
      """
      def __init__(self, location = None, type = '', location_operator = '',
!                  strand = None, id = "<unknown id>", 
                   qualifiers = {}, sub_features = [],
                   ref = None, ref_db = None):
          """Initialize a SeqFeature on a Sequence.


More information about the Biopython-dev mailing list