[Biopython-dev] SeqRecord locations; was: Beta code in the official releases?

Thu Sep 6 07:16:41 UTC 2012

On Thu, Sep 6, 2012 at 6:57 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi guys,
>
> To add my two cents, I am in favor of creating a dynamic SeqRecord
> coordinate system using SeqFeature. However, I think it would also be
> good if we set some limitations as there are so many ways that slicing
> and addition could be used to create new SeqRecords, and anticipating
> all these scenarios may create an over-engineered (and probably
> slower) SeqRecord.
>
> Some scenarios that I can think now:
>
> 1. Slicing SeqRecord objects using step values > 1
> (e.g. new_seq = seq[1:120:3])

Absolutely - here I would expect to lose the location information.
We already have similar restrictions in the SeqRecord slicing
for how SeqFeatures are handled.

> 2. Adding two or more SeqRecord objects with noncontiguous coordinate
> (i.e. end coordinate of the first sequence is not directly followed by
> the second sequence's start coordinate), and then slice the resulting
> object

Adding *could* be done via the CompoundLocation, although that
in itself might want to consider if nicely-abutting locations should
be merged, e.g. in GenBank notation 100..201 and 202..300 could
be 100.300 rather than join(100..201,202..300) which is what my
CompoundLocation code currently does.

> So maybe some limitations that we could set are:
>
> 1. Only update the coordinates if slicing step is 1 (or -1), otherwise
> discard it.

Yep.

> 2. Only update the coordinates if addition is between contiguous
> coordinates, otherwise discard it.

That does seem simple - especially as the primary driver for this
is multiple sequence alignments and those only support simple
continuous locations with a start and end.

> Personally, I think this would cover most use cases for slicing while
> allowing us to keep it simple.

That is perhaps a good balance (and as a bonus means we
don't have to link this to the CompoundLocation unless we
want to).

> As for the name, 'region' sounds better than 'location'. Maybe
> 'coverage'? I don't have any strong preference between these, but
> 'subregion' doesn't feel that nice.

Region seems fine.

> Finally, for the coordinate system, I imagine it will use Python's
> coordinate system, too? (zero-based, half-open, and the parsers /
> writers should do the conversion).

Yes. I'm suggesting using the FeatureLocation object (from
Bio.SeqFeatures), which does this.

> Should we also reverse the
> coordinates if the objects are sliced in reverse (e.g.
> seqrecord[::-1]) or simply inverse the strand value but keep the
> coordinates unchanged?

The strand changes, and the start/end must also be recalculated
from the length of the parent sequence. The FeatureLocation
has a (private) _flip method to do this. In some cases we won't
have the parent sequence length, so would have to drop the
location.

I'll have a go at implementing this on a branch in the next
few hours (unless something more pressing comes up at
the BioHackathon). As it happens this overlaps nicely with
some of the group discussion about how to represent feature
locations in RDF.

Regards,

Peter