[Biopython] iterating over FeatureLocation

Tue Jan 14 11:18:12 UTC 2014

On Tue, Jan 14, 2014 at 10:20 AM, Michael Thon <mike.thon at gmail.com> wrote:
> Hi Peter - Thanks for your help.  Here is another problem.  Here is the
> block of features in my GenBank file for a gene:
>
>      gene            complement(1..588)
>                      /Source="maker"
>                      /ID="CFIO01_14176"
>
> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>                      -gene-0.30"
>                      /label="CFIO01_14176"
>      CDS             order(complement(200..588),complement(1..124))
>                      /Source="maker"
>                      /codon_start=1
>                      /ID="CFIO01_14176-RA:cds"
>                      /label="CDS"
>      mRNA            complement(1..588)
>                      /Source="maker"
>                      /ID="CFIO01_14176-RA"
>
> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>                      -gene-0.30-mRNA-1"
>                      /_AED="0.06"
>                      /_QI="0|0|0|1|1|1|2|0|171"
>                      /_eAED="0.06"
>                      /label="CFIO01_14176-RA"
>
> Now, here is the CDS feature after is was parsed by BioPython:
>
>
> (Pdb) feat.location.parts
> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1),
> FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]
>
> Note that two positions have changed.  The CDS segments are
> (complement(200..588),complement(1..124)) but the positions in SeqFeature
> object are 0..124 and 199..588
>
> I checked some other features too and it looks like BioPython adds 1 to the
> start of each segment.  For the features on the complementary strand it
> subtracts 1.

Not quite, no.

The Biopython SeqFeature location system uses Python counting as
in string slicing etc. This means that effectively all the start coordinates
you see are one less than the start coordinates in GenBank/EMBL
format files.

> When I translate the feature into a protein sequence like this:
> str(feat.extract(seq).seq.translate()) , the sequence is correct so this
> must not be a bug. so, how to I access the exact values that are in the
> genbank formatted file?

You must convert back from Python counting to GenBank/EMBL counting,

location.start + 1
location.end

However, for many things the Python counting is more natural once
you are used to it ;)

Peter