[Biopython] iterating over FeatureLocation
Peter Cock
p.j.a.cock at googlemail.com
Wed Jan 22 11:19:49 UTC 2014
On Tue, Jan 21, 2014 at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon <mike.thon at gmail.com> wrote:
>>
>> Here’s another question. I have this GenBank formatted feature:
>>
>> CDS order(complement(3448..3635),complement(2617..3256))
>> /Source="maker"
>> /codon_start=1
>> /ID="CFIO01_05457-RA:cds"
>> /label=“CDS"
>>
>> When I extract the sequence I get this:
>>
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>>
>> This is supposed to be a CDS which can be translated to a protein coding
>> sequence starting with M and ending with a stop codon. the above sequence
>> isn’t correct - the exons are in the wrong order. When I reverse the order
>> of the exons I get the correct order and get a CDS sequence that can be
>> translated:
>>
>> (Pdb) feat.location.parts.reverse()
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>> (Pdb) str(feat.extract(seq).seq.translate())
>>
>> 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*'
>>
>> So my question is, is there something wrong with the file I’m parsing?
>
>
> Possibly - the 'order' tag actually means the order of the parts is unknown.
> If the order is known, it should be 'join' instead:
>
> join(complement(3448..3635),complement(2617..3256))
>
> What's the accession/URL for the full file this example came from?
>
> Peter
Thanks for sending me the file. I don't think Biopython is really at
fault, rather something is going wrong in the production of this
GenBank format file. It appears to be a tricky case of trans-splicing.
However, thinking about this, it might be reasonable for Biopython
to give an error or warning when extracting an "order" location because
this means the order of the sub-parts is not determined (and thus
could be stitched together wrongly - as you have seen).
The following variants of the location string all give the (nonsensical)
sequence you are seeing:
CDS order(complement(3448..3635),complement(2617..3256))
CDS join(complement(3448..3635),complement(2617..3256))
CDS complement(join(3448..3635,2617..3256))
Extracting and translating gives this sequence with multiple in
frame stop codons, but lacking a terminal stop codon. i.e.
TSRLRQDSSDARPLPRPARKILHRRRHKRHLHP*GRR...YWR (ends)
Surprisingly, what I think the annotation is trying to say is that this case
the exons appear to be trans-spliced, rather than being in the typical
order you would expect from the strand. These "work" and give the
protein sequence you wanted,
CDS complement(join(2617..3256,3448..3635))
CDS join(complement(2617..3256),complement(3448..3635))
CDS order(complement(2617..3256),complement(3448..3635))
For GenBank format it would be nice to also add the /trans_splicing
tag as well. I would recommend you (or the team) go back to the original
annotation to check what was the intended meaning here.
Regards,
Peter
More information about the Biopython
mailing list