[Biopython] Transferring SeqFeatures between aligned sequences

Peter Cock p.j.a.cock at googlemail.com
Fri Mar 11 09:53:16 UTC 2011


On Thu, Mar 10, 2011 at 11:25 PM, Uri Laserson <laserson at mit.edu> wrote:
>> I'm not quite sure I followed that figure.
>
> I think you understood perfectly.

Good - your text was clearer for me.

>> In your case the start and end co-ordinates may be shifted
>> by different amounts (since you are doing gapped alignments).
>> This worries me as the length of your features can change.
>> For any gene or CDS features that is a problem (frame shifts).
>> Have you thought about that? Perhaps you're dealing with
>> non-coding features only?
>
> That's exactly the complication here.  I have one reference sequence that is
> highly annotated, and I have a read that I want to align to it and transfer
> over the annotations to the corresponding positions.

OK - and do you want to worry about spotting frameshifts,
and updating the translation for CDS features?

> One way I can handle this situation is that when I actually build the
> pairwise gapped alignment (which I do manually), in addition to the actual
> gapped-sequence strings, I can generate two lists that contain the ungapped
> coordinates of each sequence (in my diagram, this is the numbering above and
> below).  Figuring out the new coords from the old coordinates is then a
> matter of matching the positions in the lists.  (Though perhaps it's easier
> to implement using dictionaries, so I don't have to search the lists I
> generated.)

Yes, that kind of technique is also useful for  mapping between
gapped and ungapped coordinates in assembly files.

> Eitherway, in order to move the SeqFeature to the new sequence, should I
> make a deep copy of it and then manually modify the start and end coords?
> Uri

You could do, or create a new SeqFeature, or "steal" the old one and
modify it. The later technique would probably be fastest since there
are no new objects to create, just a few integer attributes changes
(location positions), but is perhaps a bit risky if you don't comment
it clearly. If you do that, perhaps do this by popping the features
from the old SeqRecord's feature list, modify them, and add them
to the new SeqRecord's feature list.

If all your current annotation uses simple exact locations, life is
easier. If there are fuzzy locations, then using the location object's
private _shift method might be simplest.

Another query, are you going to look for inversions? In such
cases the strand needs flipping and the start/end interchanged.
The SeqRecord reverse complement method has to do this,
and therefore the SeqFeature and its location and position
classes all have a private _flip method.

[If you find these private methods useful, perhaps we can make
them public? Let us know]

Thanks,

Peter




More information about the Biopython mailing list