[Biopython-dev] Start/end co-ordinates in SeqRecord objects?

Peter Cock p.j.a.cock at googlemail.com
Tue May 22 05:48:18 EDT 2012


On Tue, May 22, 2012 at 10:44 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> ...
> I would like the SeqRecord slicing (done by the alignment object)
> to be able to deduce these new start/end co-ordinates from the
> original co-ordinates.
>
> ...
>
> This is doable - but a sensible question is how common a
> use case is it to slice alignments (or SeqRecord objects) and
> care about their co-ordinates? This may actually be more
> important for classical multiple sequence alignments like
> Stockholm and MAF than for SearchIO.

I was struggling to come up with a simple, self-contained
motivating example. Here is a possible example with BLAST
(although you can do similar things with multiple sequence
alignments), but it is actually a larger or different problem.

Suppose you have a domain of interest in a larger protein,
and you want to pull out similar domains from similar proteins
in a BLAST database. So, you do the BLAST search, and filter
the results (e.g. use a minimum match length to ensure you
are looking at full proteins). You then want to pull out just the
region of the matched protein corresponding to your domain
of interest.

To solve this task, a SeqRecord location property is just a
step in the right direction - but what this really boils down to
is mapping between the three different co-ordinate systems:
ungapped query sequence <-> aligned columns (i.e. the
common gapped sequence co-ordinates) <-> ungapped
match sequence. Maybe that would be some nice
functionality to add... the API needs a lot of thought though.
Perhaps a specialized GappedSeq object (which could
let us deprecate the current gapped alphabet class)?
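
As a rough illustration of that three-way mapping, here is a
minimal sketch in plain Python. It is not an existing Biopython
API: the function names (ungapped_to_column, column_to_ungapped,
map_region), the offset arguments and the toy alignment are all
made up for this example, and it assumes 0-based half-open
co-ordinates, a single "-" gap character, and a pairwise
alignment given as two gapped strings of equal length.

```python
def ungapped_to_column(gapped, pos, gap="-"):
    """Return the alignment column holding ungapped position pos (0-based)."""
    count = -1
    for col, letter in enumerate(gapped):
        if letter != gap:
            count += 1
            if count == pos:
                return col
    raise IndexError(f"position {pos} is beyond the end of the sequence")


def column_to_ungapped(gapped, col, gap="-"):
    """Return the number of non-gap letters to the left of column col.

    If column col holds a letter, this is that letter's ungapped position;
    if it holds a gap, the value is still usable as a slice boundary.
    """
    prefix = gapped[:col]
    return len(prefix) - prefix.count(gap)


def map_region(query_aln, match_aln, start, end, query_offset=0, match_offset=0):
    """Map the half-open query range [start, end) onto the match sequence.

    The offsets (hypothetical parameters) are the ungapped positions at
    which the aligned fragments begin in the full query and match sequences.
    """
    # Ungapped query -> alignment columns.
    first_col = ungapped_to_column(query_aln, start - query_offset)
    last_col = ungapped_to_column(query_aln, end - 1 - query_offset)
    # Alignment columns -> ungapped match.
    match_start = column_to_ungapped(match_aln, first_col)
    match_end = column_to_ungapped(match_aln, last_col + 1)
    return match_start + match_offset, match_end + match_offset


# Toy pairwise alignment (invented data, not real BLAST output).
query_aln = "MKT-AYIAKQR"
match_aln = "MRTGAY--KQR"

# Domain of interest: query residues [2, 6), i.e. "TAYI".
m_start, m_end = map_region(query_aln, match_aln, 2, 6)
print(m_start, m_end)                             # 2 6
print(match_aln.replace("-", "")[m_start:m_end])  # TGAY
```

For a real BLAST hit the two offsets would come from the
reported start positions of the aligned region in the query
and the match, so that the returned range can be used to slice
the full-length match sequence.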

Peter

