[Biopython] sequence coordinate mapping

Peter biopython at maubp.freeserve.co.uk
Wed Jun 23 14:00:26 UTC 2010


On Tue, Jun 22, 2010 at 9:49 PM, Laurent <lgautier at gmail.com> wrote:
>
> Peter wrote:
>>
>> This second branch now implements two methods for mapping
>> between feature coordinates and the parent sequence coordinates.
>>
>> http://github.com/peterjc/biopython/tree/feature-coords
>>
>> In the case where due to overlapping sub-features a parent letter
>> has more than one possible feature coordinate, this returns the
>> lowest feature coordinate. This is slightly faster since we don't
>> have to check all the sub-features. However, perhaps doing so
>> and raising an exception is preferable to avoid silent errors in
>> this corner case?
>
> Exception is better.
>
> In the worst case raising an exception will take a split second to fix,
> while silent logic twists can in the best case take time and frustration to
> find and fix (in the worst case it can lead to wrong results undetected).

Agreed.

I've made get_local_coord give an exception now for ambiguous
mappings, and introduced get_local_coords (with a trailing s for
plural) which gives a list of the local coordinates. That seems to
cover the typical case nicely and makes dealing with the special
case fairly easy.

http://github.com/peterjc/biopython/tree/feature-coords
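
To illustrate the behaviour described above, here is a minimal sketch. It is
not the code on the branch. The function names follow the method names
mentioned in this message, but the signatures are assumptions made for
illustration: a feature is represented as a plain list of (start, end)
sub-feature spans on the forward strand, zero-based and end-exclusive, and
the local coordinate counts letters along the sub-features joined in order.

```python
def get_local_coords(parts, parent_pos):
    """Return all local (feature) coordinates for a parent position.

    parts is a hypothetical stand-in for a feature: a list of
    (start, end) sub-feature spans, zero-based, end-exclusive,
    forward strand only.
    """
    coords = []
    offset = 0  # total length of the sub-features walked so far
    for start, end in parts:
        if start <= parent_pos < end:
            coords.append(offset + parent_pos - start)
        offset += end - start
    return coords


def get_local_coord(parts, parent_pos):
    """Return the single local coordinate, or raise if there is not exactly one."""
    coords = get_local_coords(parts, parent_pos)
    if not coords:
        raise ValueError("Parent position %i is outside the feature" % parent_pos)
    if len(coords) > 1:
        # Overlapping sub-features: refuse to guess rather than fail silently.
        raise ValueError("Parent position %i has an ambiguous mapping: %r"
                         % (parent_pos, coords))
    return coords[0]


# Two overlapping sub-features; parent positions 15 to 19 are covered twice.
parts = [(10, 20), (15, 30)]
unique = get_local_coord(parts, 12)
ambiguous = get_local_coords(parts, 17)
```

Checking every sub-feature costs a little more than stopping at the first
match, but it is what makes the ambiguous case detectable at all.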

>> Note this does not handle the third case of amino acid coordinates
>> (which only applies where the parent sequence is nucleotides and
>> the feature is something like a CDS or mature peptide entry).
>
> Also, I followed that distantly but wouldn't it make sense to abstract
> everything into a system of nested relative coordinates ?
> I guess that putting a bit of code together would be the easiest to
> demonstrate what I have in mind (obviously after checking that other
> packages around do not already have something similar, and after
> I spare some time aside for that).

That might be a good idea (I'm not quite sure what you are
suggesting).

Another related problem is going from gapped to ungapped coordinates
(also described as padded and unpadded) when working with sequence
alignments.
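
As a rough sketch of that gapped/ungapped mapping, the following uses plain
Python strings only. The function names, the "-" gap character and the
choice to return None for a gap column are all assumptions for illustration,
not an existing Biopython API. Positions are zero-based and non-negative.

```python
def gapped_to_ungapped(gapped_seq, gapped_pos, gap="-"):
    """Map a column in a gapped (padded) sequence to its ungapped position.

    Returns None if the column itself holds a gap character.
    """
    if gapped_seq[gapped_pos] == gap:
        return None
    # Subtract the number of gap characters before this column.
    return gapped_pos - gapped_seq.count(gap, 0, gapped_pos)


def ungapped_to_gapped(gapped_seq, ungapped_pos, gap="-"):
    """Map an ungapped (unpadded) position to its column in the gapped sequence."""
    seen = -1  # index of the last non-gap letter seen so far
    for column, letter in enumerate(gapped_seq):
        if letter != gap:
            seen += 1
            if seen == ungapped_pos:
                return column
    raise IndexError("Ungapped position %i is out of range" % ungapped_pos)


row = "AC--GT-A"  # ungapped this is "ACGTA"
g_to_u = gapped_to_ungapped(row, 4)  # the G
u_to_g = ungapped_to_gapped(row, 4)  # the final A
```

Both helpers scan the sequence on every call; for repeated lookups on a long
alignment row, precomputing a lookup table once would be the obvious
refinement.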

Peter


