[Biopython] sequence coordinate mapping

Tue Jun 22 20:49:41 UTC 2010

On 22/06/10 18:00, biopython-request at lists.open-bio.org wrote:
> On Fri, Jun 18, 2010 at 7:10 PM, Peter<biopython at maubp.freeserve.co.uk>  wrote:
>> On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields<cjfields at illinois.edu>  wrote:
>>> On Jun 18, 2010, at 8:39 AM, Peter wrote:
>>>> Perhaps one approach would be to do this in the SeqFeature. If we
>>>> define a SeqFeature's length in the natural way, then we have
>>>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)).
>>>> Now we have two coordinates systems, 0 to len(SeqFeature) and
>>>> the regions it describes on the parent sequence. Then we could
>>>> discuss a pair of methods on the SeqFeature for converting
>>>> between the two coordinate systems. Once you have that, the
>>>> special case of amino acid coordinates is much easier to do
>>>> (account for where the start codon is, divide by three).
>>>>
>>>> I've made another commit on the __contains__ branch to
>>>> also implement __len__ for the SeqFeature:
>>>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f
>
> Note I found an off by one error with the end point, fixed now:
>
> http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2
>
> With a few more unit tests, I think that could be merged to the trunk.
>
>> ... I had a go at
>> doing one of the mappings - from feature coordinates to parent
>> coordinates (e.g. CDS back to genome, or PFAM domain back
>> to protein if your SeqRecord is for a protein sequence):
>>
>> http://github.com/peterjc/biopython/tree/feature-coords
>>
>> As you can tell from the rather lengthy docstring and doctests,
>> this is quite hairy and difficult to explain. ...
>>
>> I'm pretty sure we can do the reverse mapping from a
>> parent sequence coordinate to a feature coordinate,
>> although I can come up with pathological examples where
>> this is not a one to one mapping, but one to many. e.g. a
>> ribosomal slippage where a base gets used twice. In this
>> case we could raise an error, or maybe more simply take
>> the first match.
>
> This second branch now implements two methods for mapping
> between feature coordinates and the parent sequence coordinates.
>
> http://github.com/peterjc/biopython/tree/feature-coords
>
> In the case where due to overlapping sub-features a parent letter
> has more than one possible feature coordindate, this returns the
> lowest feature coordinate. This is slightly faster since we don't
> have to check all the sub-features. However, perhaps doing so
> and raising an exception is preferable to avoid silent errors in
> this corner case?

Exception is better.

In the worst case raising an exception will take a split second to fix, 
while silent logic twists can in the best case take time and frustration 
to find and fix (in the worst case it can lead to wrong results undetected).

> Note this does not handle the third case of amino acid coordinates
> (which only applies where the parent sequence is nucleotides and
> the feature is something like a CDS or mature peptide entry).

Also, I followed that distantly but wouldn't it make sense to abstract 
everything into a system of nested relative coordinates ?
I guess that putting a bit of code together would be the easiest to 
demonstrate what I have in mind (obviously after checking that other 
packages around do not already have something similar, and after I spare 
some time aside for that).

L.

> Peter
>
>
> ------------------------------
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
> End of Biopython Digest, Vol 90, Issue 15
> *****************************************