[Biopython] sequence coordinate mapping
Peter
biopython at maubp.freeserve.co.uk
Mon Jun 21 17:59:59 UTC 2010
On Fri, Jun 18, 2010 at 7:10 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields <cjfields at illinois.edu> wrote:
>> On Jun 18, 2010, at 8:39 AM, Peter wrote:
>>> Perhaps one approach would be to do this in the SeqFeature. If we
>>> define a SeqFeature's length in the natural way, then we have
>>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)).
>>> Now we have two coordinates systems, 0 to len(SeqFeature) and
>>> the regions it describes on the parent sequence. Then we could
>>> discuss a pair of methods on the SeqFeature for converting
>>> between the two coordinate systems. Once you have that, the
>>> special case of amino acid coordinates is much easier to do
>>> (account for where the start codon is, divide by three).
>>>
>>> I've made another commit on the __contains__ branch to
>>> also implement __len__ for the SeqFeature:
>>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f
Note I found an off by one error with the end point, fixed now:
http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2
With a few more unit tests, I think that could be merged to the trunk.
> ... I had a go at
> doing one of the mappings - from feature coordinates to parent
> coordinates (e.g. CDS back to genome, or PFAM domain back
> to protein if your SeqRecord is for a protein sequence):
>
> http://github.com/peterjc/biopython/tree/feature-coords
>
> As you can tell from the rather lengthy docstring and doctests,
> this is quite hairy and difficult to explain. ...
>
> I'm pretty sure we can do the reverse mapping from a
> parent sequence coordinate to a feature coordinate,
> although I can come up with pathological examples where
> this is not a one to one mapping, but one to many. e.g. a
> ribosomal slippage where a base gets used twice. In this
> case we could raise an error, or maybe more simply take
> the first match.
This second branch now implements two methods for mapping
between feature coordinates and the parent sequence coordinates.
http://github.com/peterjc/biopython/tree/feature-coords
In the case where due to overlapping sub-features a parent letter
has more than one possible feature coordindate, this returns the
lowest feature coordinate. This is slightly faster since we don't
have to check all the sub-features. However, perhaps doing so
and raising an exception is preferable to avoid silent errors in
this corner case?
Note this does not handle the third case of amino acid coordinates
(which only applies where the parent sequence is nucleotides and
the feature is something like a CDS or mature peptide entry).
Peter
More information about the Biopython
mailing list