[Biopython-dev] SeqRecord locations; was: Beta code in the official releases?

Thu Sep 6 01:34:50 UTC 2012

On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> In my mind, the main technical issue regarding MAF and AlignIO
> and the common alignment object is the lack of a common way
> of handling the idea of start/end (and sometimes strand) for
> each sequence (in a consistent co-ordinate system using Python
> counting). Evidently I haven't manage to adequately convey my
> interpretation/concern.
>
> Some file formats like EMBOSS' have these number explicitly
> but we're not parsing them:
> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>
> In the case of "fasta-m10" the numbers are stored in private
> properties as a 'short term' hack:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>
> Others like Stockholm have identifier/start-end as a combined
> names (but this is not mandatory). Here the start and end are
> being stored in the annotations dictionary (as unparsed strings,
> still using 1-based co-ordinates).
>
> In MAF the start/end are explicit and much more important.
> It would be near pointless to parse the the file ignoring these.
> Maybe your approach is good enough for MAF, and we
> should have adopted it as is, and delayed better integration
> with the other AlignIO formats?
>
> i.e. This is a general limitation in AlignIO and the object
> model, somewhat annoying in the formats already supported,
> but information critical to the MAF format.
>
> I was expecting a convention for this to fall out of Bow's GSoC
> work for 'pairwise alignments' in SearchIO - but the object
> model he came up with was not SeqRecord based (many
> of the file formats he was using didn't include sequences).
>
> Right now my inclination is still to add a location property to
> the SeqRecord, usually a FeatureLocation, but it could also
> be the proposed CompoundLocation for more complex cases.
> The question then is if/when this would be propagated, e.g.
> SeqRecord slicing/addition.
> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>
> So the wheels are turning, but slowly. I have not had as
> much time to dedicate to this as I would like - but other
> smaller or less inter-connected things are much easer to
> review and merge.

To expand on the SeqRecord.location property idea, I am
thinking about (in the typical use cases) using a normal
FeatureLocation object (from Bio.SeqFeature) where the
start, end or strand are in the same co-ordinate system
as the sequence of the SeqRecord.

i.e. For a protein fragment, they would be in amino acids.
For a nucleotide fragment, they would be in base pairs.

Note that you might want to describe the CDS region
for a protein sequence (which would be possible even
for a join using the proposed CompoundLocation), so
maybe 'location' is the wrong name here, perhaps
'fragment' or 'subregion', or something is clearer?

When I talked about adding SeqRecords, and what would
the combined SeqRecord's location be, we could use
FeatureLocation addition (as defined on the branch for
CompoundLocation objects).

For slicing a SeqRecord, provided len(record.location)
== len(record), this is well defined. However, I expect
that quite often if used for alignments, what we will have
instead is len(record.location) = len(record.seq.ungapped())
so we might be able to update the sub-record's location
if we count the gap characters and factor them in. This
equality could be verified in the SeqRecord __init__
(which would require the gap character, but the AlignIO
parsers should all set that).

I would like slicing to update the start/end because
slicing alignment objects seems to be a quite common
operation - so if you started from an alignment file
using start/end (like Stockholm or MAF) it would be
good to update these fields for the sub-alignment.

This feels like it would work, but would it be useful or
just over engineering? Would a simple static location
property which is not automatically propagated in
SeqRecord manipulations be enough (at least initially)?

If so, is Brad's suggestion to just use special values in
the annotations dictionary a simpler way forward (where
we already have policies in place for handling generic
annotation during SeqRecord annotation - in general
dropping it)?

If so, would this be keys 'start', 'end', 'strand' for
integer start and end using Python counting, and
a strand value of +1 or -1 for forward and reverse?
[We could use strand None for unavailable as in
the SeqFeature location object, but I think no entry
in the dictionary is nicer here].

Peter