[Biopython] sequence coordinate mapping

Brad Chapman chapmanb at 50mail.com
Fri Jun 18 08:58:03 EDT 2010


Reece and Peter;

> > I'm looking for code in Python (preferably already in BioPython) to map
> > between genomic, CDS, and protein coordinates. For example, map position
> > 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein.
> >
> > It's not difficult and I've already written a crude version, but I'm a
> > little surprised that it's not there and I don't want to reinvent.
> >
> > I'm looking for something akin to Bio::Coordinate::GeneMapper, for those
> > from BioPerl.

A general implementation like this would be really useful. Here is a
stab I took at the problem a while back:

http://bitbucket.org/chapmanb/synbio/src/tip/SynBio/Codons/CodingRegion.py

This represents a sequence as a set of Exon/Intron objects and then
lets you manipulate directly on the coding region.

Here's a representation inspired by that for SNP calling:

http://github.com/chapmanb/bcbb/blob/master/biopython/CodingRegion.py

The tricky part of this problem is getting exon/intron coordinates parsed 
and in the right format. I'm not sure I ever really got this
correct, but hopefully those implementations help.

> Your question also sounds hard in general. What about where a
> single base on the genome maps to multiple genes (overlapping
> genes are common in bacteria and viruses). What about where
> a single base on the genome maps to an intron in a gene - would
> you want any values back? What about where a gene has a fuzzy
> boundary? What about a ribosomal slippage where a single bp
> ends up coding for two residues in the protein?

You'd want to catch and raise errors in pathological cases,
but this would be useful for the standard cases. If the target is
SNP calling, you'd want to be able to have a genomic coordinate and
find out if it's in a gene; if so, is it in a coding region?; if so,
what is the protein at that position? It is handy to be able to
pull out each of those representations so you can ask questions
about the location in the coding sequence or amino acid change
caused by a SNP.

> Something like this? This implements __contains__ on the SeqFeature
> so that you can check if a simple location (integer) is within a feature.
> http://github.com/peterjc/biopython/tree/feature-in
> 
> There is a docstring with examples, just look at the diff here:
> http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6

That's nice. The next part would be remapping the coordinates so
once you have the feature you can easily address the relative
position you are interested in.

Brad


More information about the Biopython mailing list