[Biopython] sequence coordinate mapping

Peter biopython at maubp.freeserve.co.uk
Fri Jun 18 10:01:56 UTC 2010


On Fri, Jun 18, 2010 at 12:13 AM, Reece Hart <reece at berkeley.edu> wrote:
> Hi All-
>
> I'm looking for code in Python (preferably already in BioPython) to map
> between genomic, CDS, and protein coordinates. For example, map position
> 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein.
>
> It's not difficult and I've already written a crude version, but I'm a
> little surprised that it's not there and I don't want to reinvent.
>
> I'm looking for something akin to Bio::Coordinate::GeneMapper, for those
> from BioPerl.
>
> Thanks,
> Reece

The Bio::Coordinate::GeneMapper stuff looks quite complicated just
from the documentation - maybe I'm looking in the wrong place but
some examples would help to understand the full scope of it.

There isn't anything quite like this built into Biopython at the moment.
Your question also sounds hard in the general case. What about where
a single base on the genome maps to multiple genes (overlapping
genes are common in bacteria and viruses)? What about where
a single base on the genome maps to an intron in a gene - would
you want any values back? What about where a gene has a fuzzy
boundary? What about a ribosomal slippage where a single bp
ends up coding for two residues in the protein?

It can be broken down into two steps: (1) finding the list of features
covering a position on the genome, and (2) for each CDS feature,
converting that position into an amino acid position (which would
require checking the codon start position, if the annotation
specifies one).
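
To make the two steps concrete, here is a minimal sketch in plain
Python. It does not use Biopython: Feature, features_covering,
genomic_to_cds and cds_to_protein are hypothetical names made up for
illustration, and the coordinates are toy values, not the real
AB026906.1 annotation. It assumes 1-based inclusive exon coordinates
on the forward strand only, and it ignores fuzzy boundaries and
ribosomal slippage entirely.

```python
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    """Hypothetical stand-in for an annotated feature (not a Biopython class)."""
    name: str
    kind: str                     # e.g. "CDS"
    exons: List[Tuple[int, int]]  # 1-based inclusive (start, end), forward strand
    codon_start: int = 1          # GenBank-style /codon_start value: 1, 2 or 3


def features_covering(features: List[Feature], pos: int) -> List[Feature]:
    """Step 1: features with at least one exon containing the genomic position."""
    return [f for f in features if any(s <= pos <= e for s, e in f.exons)]


def genomic_to_cds(feature: Feature, pos: int) -> Optional[int]:
    """1-based position in the spliced CDS, or None for an intron or outside."""
    offset = 0  # number of CDS bases in the exons already passed
    for start, end in feature.exons:
        if start <= pos <= end:
            return offset + (pos - start) + 1
        offset += end - start + 1
    return None


def cds_to_protein(cds_pos: int, codon_start: int = 1) -> Optional[int]:
    """Step 2: 1-based residue number, or None before the first complete codon."""
    in_frame = cds_pos - codon_start  # 0-based offset from the first full codon
    if in_frame < 0:
        return None
    return in_frame // 3 + 1


# Toy annotation: a two-exon CDS and an overlapping single-exon CDS.
gene_a = Feature("geneA", "CDS", [(101, 200), (301, 400)])
gene_b = Feature("geneB", "CDS", [(150, 350)])
features = [gene_a, gene_b]

for feature in features_covering(features, 310):
    cds_pos = genomic_to_cds(feature, 310)
    print(feature.name, cds_pos, cds_to_protein(cds_pos, feature.codon_start))
```

With codon_start equal to 1, cds_to_protein(274) gives 92, which
agrees with the CDS and protein positions in the original example.
Position 250 falls in the intron of geneA, so only geneB is returned
for it, and the caller never has to decide what an intronic position
should map to.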

Just thinking out loud, implementing "in" and/or sorting on our
FeatureLocation (and perhaps SeqFeature) objects (i.e. implementing
the special __contains__ method, __lt__ method, etc.) could be
useful syntactic sugar for this kind of work.
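
As a rough illustration of that idea, the sketch below defines a toy
class with __contains__ and __lt__. SimpleLocation is a hypothetical
name and is not Biopython's FeatureLocation. It assumes a Python-style
0-based, half-open interval and exact (non-fuzzy) integer boundaries,
and the ordering by (start, end) is just one possible choice.

```python
from functools import total_ordering


@total_ordering
class SimpleLocation:
    """Toy location object (hypothetical; not Biopython's FeatureLocation)."""

    def __init__(self, start: int, end: int):
        # Half-open interval, 0-based: start <= position < end
        self.start = start
        self.end = end

    def __repr__(self) -> str:
        return f"SimpleLocation({self.start}, {self.end})"

    def __contains__(self, position: int) -> bool:
        # Enables: position in location
        return self.start <= position < self.end

    def __eq__(self, other) -> bool:
        if not isinstance(other, SimpleLocation):
            return NotImplemented
        return (self.start, self.end) == (other.start, other.end)

    def __lt__(self, other) -> bool:
        # Enables sorted(); total_ordering derives <=, > and >= from this
        if not isinstance(other, SimpleLocation):
            return NotImplemented
        return (self.start, self.end) < (other.start, other.end)

    def __hash__(self) -> int:
        # Defining __eq__ removes the default hash, so restore one
        return hash((self.start, self.end))


locations = [SimpleLocation(300, 400), SimpleLocation(100, 200),
             SimpleLocation(150, 350)]

# Locations covering position 160, in genome order
hits = [loc for loc in sorted(locations) if 160 in loc]
print(hits)
```

The same pair of methods on a feature object would let step (1) be
written as a one-line list comprehension over a record's features.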

Peter


