[Biopython] iterating over FeatureLocation

Tue Jan 14 10:20:00 UTC 2014

Hi Peter - Thanks for your help.  Here is another problem.  Here is the block of features in my GenBank file for a gene:

     gene            complement(1..588)
                     /Source="maker"
                     /ID="CFIO01_14176"
                     /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
                     -gene-0.30"
                     /label="CFIO01_14176"
     CDS             order(complement(200..588),complement(1..124))
                     /Source="maker"
                     /codon_start=1
                     /ID="CFIO01_14176-RA:cds"
                     /label="CDS"
     mRNA            complement(1..588)
                     /Source="maker"
                     /ID="CFIO01_14176-RA"
                     /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
                     -gene-0.30-mRNA-1"
                     /_AED="0.06"
                     /_QI="0|0|0|1|1|1|2|0|171"
                     /_eAED="0.06"
                     /label="CFIO01_14176-RA"

Now, here is the CDS feature after is was parsed by BioPython:

(Pdb) feat.location.parts
[FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]

Note that two positions have changed.  The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588

I checked some other features too and it looks like BioPython adds 1 to the start of each segment.  For the features on the complementary strand it subtracts 1.  

When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file?  

On Jan 13, 2014, at 5:18 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon <mike.thon at gmail.com> wrote:
>> Here are two examples from the GenBank format file (not from GenBank though)
>> 
>> 
>>     CDS             order(6621..6658,6739..6985)
>>                     /Source="maker"
>>                     /codon_start=1
>>                     /ID="CFIO01_14847-RA:cds"
>>                     /label=“CDS"
>> 
>>     CDS             419..2374
>>                     /Source="maker"
>>                     /codon_start=1
>>                     /ID="CFIO01_05899-RA:cds"
>>                     /label=“CDS"
>> 
>> if the feature is a simple feature, then I just need to access its start and end.
>> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>> 
>> What I am doing at the moment is this:
>> 
>> if feat._sub_features:
>>        for sf in feat.sub_features:
>>                start = sf.location.start
>>                …
>> else:
>>        start = feat.location.start
>>        …
>> 
>> it works, I think.  Is there a better way?
> 
> Don't do that :) Python variables/methods/etc starting with a single
> underscore are by convention private and should not generally be
> used. In this case, ._sub_features is an internal detail for the behind
> the scenes backwards compatibility for the now deprecated property
> .sub_features (don't use that either).
> 
> Instead use the location object itself directly, it now holds any
> sub-location information using a CompoundLocation object.
> See the .parts attribute, which gives a list of simple locations.
> 
> e.g.
> 
> for part in feat.location.parts:
>    start = part.start
>    ...
> 
>> 
>> Also, is there an easy way to get the sequence represented by the seqfeature,
>> if it is made up of CompoundLocations?  These features are CDSs where each
>> sub-feature is an exon.  I need to splice them all together and get the translation.
>> 
> 
> Yes, where `feat` is a SubFeature object use `feat.extract(the_parent_sequence)`
> to get the spliced sequence, which you can then translate. See the section
> "Sequence described by a feature or location" in the Tutorial,
> 
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
> 
> On reflection, the Tutorial could do with a bit more detail on how to use
> a CompoundLocation, but I did try to cover this in the docstrings.
> 
> Regards,
> 
> Peter