[Biopython-dev] [Bug 2622] Parsing between position locations like 5933^5934 in GenBank/EMBL files

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Tue Oct 21 08:07:01 EDT 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2622





------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2008-10-21 08:07 EST -------
Part of the problem is in Bio/GenBank/__init__.py around line 793,

        # case 4 -- we've got 100^101
        elif isinstance(position, LocationParser.Between):
            final_pos = SeqFeature.BetweenPosition(position.low.val,
                                                 position.high.val)
        # case 5 -- we've got (100.101)
        elif isinstance(position, LocationParser.TwoBound):
            final_pos = SeqFeature.WithinPosition(position.low.val,
                                                position.high.val)

The BetweenPosition and WithinPosition objects expect the (low) position and
the extension, not the low position and the high position.  So the code should
instead read:

        # case 4 -- we've got 100^101 => position 100, extension 1
        elif isinstance(position, LocationParser.Between):
            final_pos = SeqFeature.BetweenPosition(position.low.val,
                                 position.high.val-position.low.val)
        # case 5 -- we've got (100.101) => position 100, extension 1
        elif isinstance(position, LocationParser.TwoBound):
            final_pos = SeqFeature.WithinPosition(position.low.val,
                                position.high.val-position.low.val)
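
As a rough, standalone illustration of the arithmetic in that change (not Biopython code), the conversion from a low/high pair to a position/extension pair might be sketched as below. The helper name position_and_extension is hypothetical and used only for this example; it assumes the high bound is never below the low bound.

```python
def position_and_extension(low, high):
    """Convert (low, high) bounds into a (position, extension) pair.

    Hypothetical helper for illustration only; it mirrors the subtraction
    in the proposed fix and is not part of Biopython.
    """
    if high < low:
        raise ValueError("high (%d) must not be less than low (%d)" % (high, low))
    return low, high - low


# 100^101 and (100.101) both give position 100 with extension 1.
print(position_and_extension(100, 101))
# A wider fuzzy range such as (100.105) gives extension 5.
print(position_and_extension(100, 105))
```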

However, even with this change, things still don't seem quite right with the
SeqFeature.location object: the same position object is used for both the start
and the end, so both are shown with the same zero-based coordinates:

==================================================
     variation       5933^5934
     variation       5933^5934
     variation       8529^8530
==================================================
NC_005816.1
type: variation
location: [(5932^5933):(5932^5933)]
ref: None:None
strand: 1
qualifiers: 
        Key: note, Value: ['compared to AL109969']
        Key: replace, Value: ['a']

type: variation
location: [(5932^5933):(5932^5933)]
ref: None:None
strand: 1
qualifiers: 
        Key: note, Value: ['compared to AF053945']
        Key: replace, Value: ['aa']

type: variation
location: [(8528^8529):(8528^8529)]
ref: None:None
strand: 1
qualifiers: 
        Key: note, Value: ['compared to AL109969']
        Key: replace, Value: ['tt']


Note that in Biopython a location string "5933..5934" (2bp) becomes a typical
range between two exact positions, representing the slice [5932:5934] (2bp).
Perhaps locations like 5933^5934 (0bp) should be held similarly, akin to the
empty slice [5933:5933] (0bp).

e.g. for a sequence "ACTG...", a location string "2^3" means between "AC" and
"TG...", or in Python speak the empty slice [2:2].

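The slice mapping suggested above can be sketched in plain Python. The function location_to_slice is a hypothetical helper written for this example (it is not how the GenBank parser works), and it assumes only the three simplest location forms: "N", "N..M" and adjacent "N^M".

```python
def location_to_slice(location):
    """Map a simple one-based location string to a (start, end) slice pair.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    if "^" in location:
        low, high = (int(x) for x in location.split("^"))
        if high != low + 1:
            raise ValueError("expected adjacent bases, got %r" % location)
        # The site after base `low` is the empty slice [low:low].
        return low, low
    if ".." in location:
        low, high = (int(x) for x in location.split(".."))
        # One-based inclusive low..high becomes zero-based half-open [low-1:high].
        return low - 1, high
    pos = int(location)
    return pos - 1, pos


seq = "ACTG"
start, end = location_to_slice("2^3")
# The sequence before the site, the (empty) site itself, and the rest.
print(repr(seq[:start]), repr(seq[start:end]), repr(seq[end:]))
```

With this mapping "5933..5934" gives (5932, 5934) and "5933^5934" gives (5933, 5933), so both kinds of location share the same start/end representation.
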
The GenBank release notes do say:
> 3. A site between two bases;
> ...
> A site between two residues, such as an endonuclease cleavage site, is
> indicated by listing the two bases separated by a carat (e.g., 23^24).

I think they implicitly mean two neighbouring bases - after all "23^25" can
just be written as "24", and "23^26" as "24..25".  The need for the caret
notation is a result of the one-based counting system - it is avoided in
Python slice notation.
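
To make the neighbouring-bases argument concrete, here is a small sketch that reads "a^b" as "the bases lying strictly between a and b". That reading is an assumption made for illustration, and bases_between is a hypothetical helper, not anything from the GenBank specification or Biopython.

```python
def bases_between(low, high):
    """Return the one-based bases lying strictly between low and high.

    Hypothetical helper for illustration only.
    """
    return list(range(low + 1, high))


# Adjacent bases: nothing lies between them, so the caret is really needed.
print(bases_between(23, 24))
# Non-adjacent bases: the same region can be named without a caret.
print(bases_between(23, 25))
print(bases_between(23, 26))
```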

Finally, it is not clear to me from the GenBank release notes whether locations
like "23^24" can be joined as part of a more complex location or not.

