[Biopython] iterating over FeatureLocation

Wed Jan 22 11:19:49 UTC 2014

On Tue, Jan 21, 2014 at 4:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon <mike.thon at gmail.com> wrote:
>>
>> Here’s another question.  I have this GenBank formatted feature:
>>
>>      CDS             order(complement(3448..3635),complement(2617..3256))
>>                      /Source="maker"
>>                      /codon_start=1
>>                      /ID="CFIO01_05457-RA:cds"
>>                      /label=“CDS"
>>
>> When I extract the sequence I get this:
>>
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>>
>> This is supposed to be a CDS which can be translated to a protein coding
>> sequence starting with M and ending with a stop codon.  the above sequence
>> isn’t correct - the exons are in the wrong order.  When I reverse the order
>> of the exons I get the correct order and get a CDS sequence that can be
>> translated:
>>
>> (Pdb) feat.location.parts.reverse()
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>> (Pdb) str(feat.extract(seq).seq.translate())
>>
>> 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*'
>>
>> So my question is, is there something wrong with the file I’m parsing?
>
>
> Possibly - the 'order' tag actually means the order of the parts is unknown.
> If the order is known, it should be 'join' instead:
>
> join(complement(3448..3635),complement(2617..3256))
>
> What's the accession/URL for the full file this example came from?
>
> Peter

Thanks for sending me the file. I don't think Biopython is really at
fault, rather something is going wrong in the production of this
GenBank format file. It appears to be a tricky case of trans-splicing.

However, thinking about this, it might be reasonable for Biopython
to give an error or warning when extracting an "order" location because
this means the order of the sub-parts is not determined (and thus
could be stitched together wrongly - as you have seen).

The following variants of the location string all give the (nonsensical)
sequence you are seeing:

     CDS             order(complement(3448..3635),complement(2617..3256))
     CDS             join(complement(3448..3635),complement(2617..3256))
     CDS             complement(join(3448..3635,2617..3256))

Extracting and translating gives this sequence with multiple in
frame stop codons, but lacking a terminal stop codon. i.e.

TSRLRQDSSDARPLPRPARKILHRRRHKRHLHP*GRR...YWR (ends)

Surprisingly, what I think the annotation is trying to say is that this case
the exons appear to be trans-spliced, rather than being in the typical
order you would expect from the strand. These "work" and give the
protein sequence you wanted,

     CDS             complement(join(2617..3256,3448..3635))
     CDS             join(complement(2617..3256),complement(3448..3635))
     CDS             order(complement(2617..3256),complement(3448..3635))

For GenBank format it would be nice to also add the /trans_splicing
tag as well. I would recommend you (or the team)  go back to the original
annotation to check what was the intended meaning here.

Regards,

Peter