[Biopython] iterating over FeatureLocation

Tue Jan 21 16:39:06 UTC 2014

Here’s another question.  I have this GenBank formatted feature:

     CDS             order(complement(3448..3635),complement(2617..3256))
                     /Source="maker"
                     /codon_start=1
                     /ID="CFIO01_05457-RA:cds"
                     /label=“CDS"

When I extract the sequence I get this:

(Pdb) str(feat.extract(seq).seq)
'ACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGAATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGG'

This is supposed to be a CDS which can be translated to a protein coding sequence starting with M and ending with a stop codon.  the above sequence isn’t correct - the exons are in the wrong order.  When I reverse the order of the exons I get the correct order and get a CDS sequence that can be translated:

(Pdb) feat.location.parts.reverse()
(Pdb) str(feat.extract(seq).seq)
'ATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGGACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGA'
(Pdb) str(feat.extract(seq).seq.translate())
'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*'

So my question is, is there something wrong with the file I’m parsing?

On Jan 14, 2014, at 12:22 PM, Heath O'Brien <Heath.OBrien at bristol.ac.uk> wrote:

> Finally a question that I’m confident I can answer…
> 
> Genbank uses one-based numbering and closed intervals while python uses zero-based numbering and half-open intervals, so it’s necessary to convert the coordinates. See https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi
> 
> You can convert back to one-based coordinates by adding 1 to the start coordinate.
> 
> all good things,
> Heath
> 
> On 14 Jan 2014, at 10:20, Michael Thon <mike.thon at gmail.com> wrote:
> 
>> Hi Peter - Thanks for your help.  Here is another problem.  Here is the block of features in my GenBank file for a gene:
>> 
>>    gene            complement(1..588)
>>                    /Source="maker"
>>                    /ID="CFIO01_14176"
>>                    /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>>                    -gene-0.30"
>>                    /label="CFIO01_14176"
>>    CDS             order(complement(200..588),complement(1..124))
>>                    /Source="maker"
>>                    /codon_start=1
>>                    /ID="CFIO01_14176-RA:cds"
>>                    /label="CDS"
>>    mRNA            complement(1..588)
>>                    /Source="maker"
>>                    /ID="CFIO01_14176-RA"
>>                    /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>>                    -gene-0.30-mRNA-1"
>>                    /_AED="0.06"
>>                    /_QI="0|0|0|1|1|1|2|0|171"
>>                    /_eAED="0.06"
>>                    /label="CFIO01_14176-RA"
>> 
>> Now, here is the CDS feature after is was parsed by BioPython:
>> 
>> 
>> (Pdb) feat.location.parts
>> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]
>> 
>> Note that two positions have changed.  The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588
>> 
>> I checked some other features too and it looks like BioPython adds 1 to the start of each segment.  For the features on the complementary strand it subtracts 1.  
>> 
>> When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file?  
>> 
>> On Jan 13, 2014, at 5:18 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> 
>>> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon <mike.thon at gmail.com> wrote:
>>>> Here are two examples from the GenBank format file (not from GenBank though)
>>>> 
>>>> 
>>>>   CDS             order(6621..6658,6739..6985)
>>>>                   /Source="maker"
>>>>                   /codon_start=1
>>>>                   /ID="CFIO01_14847-RA:cds"
>>>>                   /label=“CDS"
>>>> 
>>>>   CDS             419..2374
>>>>                   /Source="maker"
>>>>                   /codon_start=1
>>>>                   /ID="CFIO01_05899-RA:cds"
>>>>                   /label=“CDS"
>>>> 
>>>> if the feature is a simple feature, then I just need to access its start and end.
>>>> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>>>> 
>>>> What I am doing at the moment is this:
>>>> 
>>>> if feat._sub_features:
>>>>      for sf in feat.sub_features:
>>>>              start = sf.location.start
>>>>              …
>>>> else:
>>>>      start = feat.location.start
>>>>      …
>>>> 
>>>> it works, I think.  Is there a better way?
>>> 
>>> Don't do that :) Python variables/methods/etc starting with a single
>>> underscore are by convention private and should not generally be
>>> used. In this case, ._sub_features is an internal detail for the behind
>>> the scenes backwards compatibility for the now deprecated property
>>> .sub_features (don't use that either).
>>> 
>>> Instead use the location object itself directly, it now holds any
>>> sub-location information using a CompoundLocation object.
>>> See the .parts attribute, which gives a list of simple locations.
>>> 
>>> e.g.
>>> 
>>> for part in feat.location.parts:
>>>  start = part.start
>>>  ...
>>> 
>>>> 
>>>> Also, is there an easy way to get the sequence represented by the seqfeature,
>>>> if it is made up of CompoundLocations?  These features are CDSs where each
>>>> sub-feature is an exon.  I need to splice them all together and get the translation.
>>>> 
>>> 
>>> Yes, where `feat` is a SubFeature object use `feat.extract(the_parent_sequence)`
>>> to get the spliced sequence, which you can then translate. See the section
>>> "Sequence described by a feature or location" in the Tutorial,
>>> 
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>> 
>>> On reflection, the Tutorial could do with a bit more detail on how to use
>>> a CompoundLocation, but I did try to cover this in the docstrings.
>>> 
>>> Regards,
>>> 
>>> Peter
>> 
>> 
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>