[Biopython] iterating over FeatureLocation
Michael Thon
mike.thon at gmail.com
Tue Jan 14 10:20:00 UTC 2014
Hi Peter - Thanks for your help. Here is another problem. Here is the block of features in my GenBank file for a gene:
gene complement(1..588)
/Source="maker"
/ID="CFIO01_14176"
/Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
-gene-0.30"
/label="CFIO01_14176"
CDS order(complement(200..588),complement(1..124))
/Source="maker"
/codon_start=1
/ID="CFIO01_14176-RA:cds"
/label="CDS"
mRNA complement(1..588)
/Source="maker"
/ID="CFIO01_14176-RA"
/Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
-gene-0.30-mRNA-1"
/_AED="0.06"
/_QI="0|0|0|1|1|1|2|0|171"
/_eAED="0.06"
/label="CFIO01_14176-RA"
Now, here is the CDS feature after is was parsed by BioPython:
(Pdb) feat.location.parts
[FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]
Note that two positions have changed. The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588
I checked some other features too and it looks like BioPython adds 1 to the start of each segment. For the features on the complementary strand it subtracts 1.
When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file?
On Jan 13, 2014, at 5:18 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon <mike.thon at gmail.com> wrote:
>> Here are two examples from the GenBank format file (not from GenBank though)
>>
>>
>> CDS order(6621..6658,6739..6985)
>> /Source="maker"
>> /codon_start=1
>> /ID="CFIO01_14847-RA:cds"
>> /label=“CDS"
>>
>> CDS 419..2374
>> /Source="maker"
>> /codon_start=1
>> /ID="CFIO01_05899-RA:cds"
>> /label=“CDS"
>>
>> if the feature is a simple feature, then I just need to access its start and end.
>> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>>
>> What I am doing at the moment is this:
>>
>> if feat._sub_features:
>> for sf in feat.sub_features:
>> start = sf.location.start
>> …
>> else:
>> start = feat.location.start
>> …
>>
>> it works, I think. Is there a better way?
>
> Don't do that :) Python variables/methods/etc starting with a single
> underscore are by convention private and should not generally be
> used. In this case, ._sub_features is an internal detail for the behind
> the scenes backwards compatibility for the now deprecated property
> .sub_features (don't use that either).
>
> Instead use the location object itself directly, it now holds any
> sub-location information using a CompoundLocation object.
> See the .parts attribute, which gives a list of simple locations.
>
> e.g.
>
> for part in feat.location.parts:
> start = part.start
> ...
>
>>
>> Also, is there an easy way to get the sequence represented by the seqfeature,
>> if it is made up of CompoundLocations? These features are CDSs where each
>> sub-feature is an exon. I need to splice them all together and get the translation.
>>
>
> Yes, where `feat` is a SubFeature object use `feat.extract(the_parent_sequence)`
> to get the spliced sequence, which you can then translate. See the section
> "Sequence described by a feature or location" in the Tutorial,
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> On reflection, the Tutorial could do with a bit more detail on how to use
> a CompoundLocation, but I did try to cover this in the docstrings.
>
> Regards,
>
> Peter
More information about the Biopython
mailing list