[Biopython] Additions to the SeqRecord

Thu Nov 12 09:08:46 EST 2009

On Thu, Nov 12, 2009 at 1:47 PM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>
> Hi,
>
> To avoid issues with the inadvertent propagation of inappropriate
> annotation, I'd be more comfortable with it being an optional feature of the
> slice - to be used when appropriate and with caution - than the default
> behaviour.

Better safe than sorry?

> One counterexample I can think of is the slicing of a sequence for which a
> feature or annotation applies only to a subregion of the SeqRecord.  This is
> not an uncommon property of modular proteins.  If I were to slice the
> N-terminal domains of a set of sequences with distinct N- and C-terminal
> domains, I would not want to carry through annotation for the C-terminal
> domains. If I did this without noticing, there may be a danger of, say,
> downstream use inferring inappropriate class membership if I wanted to
> generate a set of sequences containing that C-terminal domain, and I did
> this automatically based on the annotation of a SeqRecord.
>
> Another counterexample would be propagated inappropriate class membership
> for annotations that require a complete sequence for context.  For example,
> many bacterial CDS annotations feature reports of BLAST matches to other
> databases.  These are results derived from the full length feature, and the
> BLAST match obtained from the slice result is likely to differ.

Both good examples.

> Having seen first-hand the propagation of faulty annotations (e.g. presence
> of a signal peptide and other functionally-related motifs) through to
> cloning - and the resultant waste of time, money and other resources - I
> would seek to avoid this kind of behaviour.  As it is, the propagation of
> sequence ID and description without modification to indicate that a copy and
> potential change has been done is potentially dangerous, and needs to be
> done with some care to avoid 'poisoning the well'.

Yes - as already noted in the documentation, the id/name/description
may not apply to the sliced record, and some caution is advisable.

> The behaviour you describe makes most sense in the context of
> per-letter-annotation (as this is the natural granularity of the changes),
> and for relatively small changes to a large sequence containing multiple
> features whose annotations are reasonably self-contained. I too would like
> to be able to treat these specially on occasion, conserving much of the
> annotation.  However, I think the potential pitfalls are pretty significant
> and would not want this to be default behaviour.

OK. So the current behaviour on the trunk is acceptable (for annotation
where we know the location), but the proposed change for location-less
annotation is too risky.

> A third way might be only to include those annotations with location data
> where the region covered by the annotation is not disrupted by the slicing.
> For example, a slice/addition that removed sites 200-300 would retain
> features/annotations that ran from 120-199 and 301-350, but not carry
> forward features that ran from 120-201, or from 250-301.  Features and
> annotations that span the full record length would not be carried forward
> under this proposal.

Exactly - SeqFeatures entirely within the sliced region are kept. Those
outside the sliced region (or crossing the boundary) are lost. As a result,
because GenBank-style source features span the whole sequence, they
are lost on slicing to a sub-sequence. This is the current behaviour and
I wasn't suggesting any changes.
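
As a rough illustration of that rule, here is a plain-Python sketch. It is
not Biopython's actual implementation; the function name and the
(name, start, end) tuple layout are made up for this example, and the
coordinates are assumed to be Python-style half-open positions.

```python
def features_after_slice(features, start, end):
    """Return the features that survive slicing [start:end].

    `features` is a list of (name, feat_start, feat_end) tuples
    (a hypothetical stand-in for SeqFeature objects). Surviving
    features have their coordinates shifted to the new record.
    """
    kept = []
    for name, feat_start, feat_end in features:
        # Keep only features lying entirely inside the slice; anything
        # outside it, or crossing either boundary, is dropped.
        if start <= feat_start and feat_end <= end:
            kept.append((name, feat_start - start, feat_end - start))
    return kept


features = [
    ("source", 0, 350),      # spans the whole record, so lost on any sub-slice
    ("domain_A", 120, 199),
    ("domain_B", 250, 301),
]
print(features_after_slice(features, 100, 200))  # [('domain_A', 20, 99)]
```

Note how the whole-record "source" entry disappears as soon as the slice is
shorter than the record, which is the behaviour described above.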

General annotation in the SeqRecord's annotation dictionary has no
location information - it may apply to the whole sequence (e.g. from
organism X) or just part (e.g. a text note that it contains an XXX domain).
Likewise the database cross reference list.

The dbxref list and annotations dict are thus the hardest to handle -
the only practical automatic actions on slicing are to discard them
(the current behaviour in Biopython 1.50 to date), or keep them all
as per my suggestion (which, as you stress, is risky).

In light of Leighton's valid concerns, and weighing this against the
limited benefits which only apply in special cases like the examples
I gave, let's leave things as they are. In other words, explicit is better
than implicit (Zen of Python): if you want to propagate the annotations dict
and dbxrefs to a sliced record, you must continue to do it explicitly.
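
A minimal sketch of doing that explicitly, assuming only that the record
supports slicing and has `annotations` (a dict) and `dbxrefs` (a list)
attributes, as a SeqRecord does. The helper name is hypothetical and not
part of Biopython.

```python
import copy


def slice_keeping_annotations(record, start, end):
    """Slice a SeqRecord-like object, then copy over the location-less data.

    The caller takes responsibility for the copied annotations and
    cross references still being valid for the sub-sequence.
    """
    sub = record[start:end]
    # Copy rather than share, so later edits to the sliced record
    # do not silently change the parent record (or vice versa).
    sub.annotations = copy.deepcopy(record.annotations)
    sub.dbxrefs = list(record.dbxrefs)
    return sub
```

With Biopython this would be called as, for example,
`sub = slice_keeping_annotations(record, 0, 100)` on a SeqRecord. Whether
entries such as BLAST hits or domain notes still make sense for the
fragment is left for the caller to judge.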

Thanks for the feedback!

Peter