[Biopython] iterating over FeatureLocation
Michael Thon
mike.thon at gmail.com
Tue Jan 21 16:39:06 UTC 2014
Here’s another question. I have this GenBank formatted feature:
CDS order(complement(3448..3635),complement(2617..3256))
/Source="maker"
/codon_start=1
/ID="CFIO01_05457-RA:cds"
/label=“CDS"
When I extract the sequence I get this:
(Pdb) str(feat.extract(seq).seq)
'ACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGAATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGG'
This is supposed to be a CDS which can be translated to a protein coding sequence starting with M and ending with a stop codon. the above sequence isn’t correct - the exons are in the wrong order. When I reverse the order of the exons I get the correct order and get a CDS sequence that can be translated:
(Pdb) feat.location.parts.reverse()
(Pdb) str(feat.extract(seq).seq)
'ATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGGACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGA'
(Pdb) str(feat.extract(seq).seq.translate())
'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*'
So my question is, is there something wrong with the file I’m parsing?
On Jan 14, 2014, at 12:22 PM, Heath O'Brien <Heath.OBrien at bristol.ac.uk> wrote:
> Finally a question that I’m confident I can answer…
>
> Genbank uses one-based numbering and closed intervals while python uses zero-based numbering and half-open intervals, so it’s necessary to convert the coordinates. See https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi
>
> You can convert back to one-based coordinates by adding 1 to the start coordinate.
>
> all good things,
> Heath
>
> On 14 Jan 2014, at 10:20, Michael Thon <mike.thon at gmail.com> wrote:
>
>> Hi Peter - Thanks for your help. Here is another problem. Here is the block of features in my GenBank file for a gene:
>>
>> gene complement(1..588)
>> /Source="maker"
>> /ID="CFIO01_14176"
>> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>> -gene-0.30"
>> /label="CFIO01_14176"
>> CDS order(complement(200..588),complement(1..124))
>> /Source="maker"
>> /codon_start=1
>> /ID="CFIO01_14176-RA:cds"
>> /label="CDS"
>> mRNA complement(1..588)
>> /Source="maker"
>> /ID="CFIO01_14176-RA"
>> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>> -gene-0.30-mRNA-1"
>> /_AED="0.06"
>> /_QI="0|0|0|1|1|1|2|0|171"
>> /_eAED="0.06"
>> /label="CFIO01_14176-RA"
>>
>> Now, here is the CDS feature after is was parsed by BioPython:
>>
>>
>> (Pdb) feat.location.parts
>> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]
>>
>> Note that two positions have changed. The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588
>>
>> I checked some other features too and it looks like BioPython adds 1 to the start of each segment. For the features on the complementary strand it subtracts 1.
>>
>> When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file?
>>
>> On Jan 13, 2014, at 5:18 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>>> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon <mike.thon at gmail.com> wrote:
>>>> Here are two examples from the GenBank format file (not from GenBank though)
>>>>
>>>>
>>>> CDS order(6621..6658,6739..6985)
>>>> /Source="maker"
>>>> /codon_start=1
>>>> /ID="CFIO01_14847-RA:cds"
>>>> /label=“CDS"
>>>>
>>>> CDS 419..2374
>>>> /Source="maker"
>>>> /codon_start=1
>>>> /ID="CFIO01_05899-RA:cds"
>>>> /label=“CDS"
>>>>
>>>> if the feature is a simple feature, then I just need to access its start and end.
>>>> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>>>>
>>>> What I am doing at the moment is this:
>>>>
>>>> if feat._sub_features:
>>>> for sf in feat.sub_features:
>>>> start = sf.location.start
>>>> …
>>>> else:
>>>> start = feat.location.start
>>>> …
>>>>
>>>> it works, I think. Is there a better way?
>>>
>>> Don't do that :) Python variables/methods/etc starting with a single
>>> underscore are by convention private and should not generally be
>>> used. In this case, ._sub_features is an internal detail for the behind
>>> the scenes backwards compatibility for the now deprecated property
>>> .sub_features (don't use that either).
>>>
>>> Instead use the location object itself directly, it now holds any
>>> sub-location information using a CompoundLocation object.
>>> See the .parts attribute, which gives a list of simple locations.
>>>
>>> e.g.
>>>
>>> for part in feat.location.parts:
>>> start = part.start
>>> ...
>>>
>>>>
>>>> Also, is there an easy way to get the sequence represented by the seqfeature,
>>>> if it is made up of CompoundLocations? These features are CDSs where each
>>>> sub-feature is an exon. I need to splice them all together and get the translation.
>>>>
>>>
>>> Yes, where `feat` is a SubFeature object use `feat.extract(the_parent_sequence)`
>>> to get the spliced sequence, which you can then translate. See the section
>>> "Sequence described by a feature or location" in the Tutorial,
>>>
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>>
>>> On reflection, the Tutorial could do with a bit more detail on how to use
>>> a CompoundLocation, but I did try to cover this in the docstrings.
>>>
>>> Regards,
>>>
>>> Peter
>>
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
More information about the Biopython
mailing list