[Biopython] Additions to the SeqRecord

Thu Nov 12 08:47:20 EST 2009

Hi,

To avoid issues with the inadvertent propagation of inappropriate
annotation, I'd be more comfortable with it being an optional feature of the
slice - to be used when appropriate and with caution - than the default
behaviour.
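
To make the "optional" idea concrete, here is a minimal sketch.  The names
slice_record and keep_annotations are hypothetical and not part of Biopython,
and DemoRecord is only a stand-in so the example runs without a real
SeqRecord; it drops dbxrefs and annotations on slicing, as described for
Biopython 1.50 in the quoted message.

```python
import copy


class DemoRecord:
    """Stand-in for a SeqRecord (assumption: slicing drops dbxrefs/annotations)."""

    def __init__(self, seq, dbxrefs=None, annotations=None):
        self.seq = seq
        self.dbxrefs = dbxrefs or []
        self.annotations = annotations or {}

    def __getitem__(self, index):
        # Only the sequence is sliced; nothing else is carried over.
        return DemoRecord(self.seq[index])


def slice_record(record, start, end, keep_annotations=False):
    """Slice a record, copying dbxrefs/annotations only when asked to."""
    daughter = record[start:end]
    if keep_annotations:
        # Copies, so later edits to the daughter do not alter the parent.
        daughter.dbxrefs = list(record.dbxrefs)
        daughter.annotations = copy.deepcopy(record.annotations)
    return daughter


# Made-up example data.
parent = DemoRecord("MKTAYIAKQR", ["Example:0001"], {"source": "demo"})
plain = slice_record(parent, 0, 5)
annotated = slice_record(parent, 0, 5, keep_annotations=True)
print(plain.annotations)      # {}
print(annotated.annotations)  # {'source': 'demo'}
```

The caller has to ask for the annotation explicitly, so nothing is carried
through by accident.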

One counterexample I can think of is the slicing of a sequence for which a
feature or annotation applies only to a subregion of the SeqRecord.  This is
not an uncommon property of modular proteins.  If I were to slice the
N-terminal domains of a set of sequences with distinct N- and C-terminal
domains, I would not want to carry through annotation for the C-terminal
domains. If that annotation were carried through without my noticing, there
would be a danger of, say, downstream use inferring inappropriate class
membership: for example, if I later generated a set of sequences containing
that C-terminal domain automatically, based on the annotation of a SeqRecord.

Another counterexample would be the propagation of annotations that require
the complete sequence for context.  For example, many bacterial CDS
annotations report BLAST matches against other databases.  These results are
derived from the full-length feature, and a BLAST search with the sliced
sequence is likely to give a different match.

Having seen first-hand the propagation of faulty annotations (e.g. presence
of a signal peptide and other functionally-related motifs) through to
cloning - and the resultant waste of time, money and other resources - I
would seek to avoid this kind of behaviour.  As it is, propagating the
sequence ID and description unmodified, with nothing to indicate that the
record is a copy and may have been changed, is potentially dangerous.  Any
such propagation needs to be done with some care to avoid 'poisoning the
well'.
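
One way to flag a daughter record as derived, sketched here with plain
strings: the function name and the "id/start-end" labelling are made-up
conventions for illustration, not anything Biopython does.

```python
def derived_labels(record_id, description, start, end):
    """Build an ID and description that show a record is a slice.

    start and end are Python slice indices (0-based, end exclusive);
    the labels report them as 1-based inclusive positions.
    """
    span = f"{start + 1}-{end}"
    new_id = f"{record_id}/{span}"
    new_description = f"{description} (slice {span} of {record_id})"
    return new_id, new_description


# Made-up identifiers, for illustration only.
new_id, new_description = derived_labels("ABC123", "hypothetical protein", 0, 100)
print(new_id)           # ABC123/1-100
print(new_description)  # hypothetical protein (slice 1-100 of ABC123)
```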

The behaviour you describe makes most sense in the context of
per-letter-annotation (as this is the natural granularity of the changes),
and for relatively small changes to a large sequence containing multiple
features whose annotations are reasonably self-contained.  I too would like
to be able to treat these specially on occasion, conserving much of the
annotation.  However, I think the potential pitfalls are pretty significant
and would not want this to be default behaviour.

A third way might be only to include those annotations with location data
where the region covered by the annotation is not disrupted by the slicing.
For example, a slice/addition that removed sites 200-300 would retain
features/annotations that ran from 120-199 and 301-350, but not carry
forward features that ran from 120-201, or from 250-301.  Features and
annotations that span the full record length would not be carried forward
under this proposal.
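
As a rough sketch of this rule, assuming features are reduced to plain
(start, end) pairs in the 1-based inclusive coordinates used above (the
function name is hypothetical, and this is not Biopython code):

```python
def survives_deletion(feature, deleted):
    """Return True if the feature lies wholly outside the deleted region.

    Both arguments are (start, end) pairs, 1-based and inclusive.
    """
    f_start, f_end = feature
    d_start, d_end = deleted
    return f_end < d_start or f_start > d_end


deleted = (200, 300)
# The four features from the example, plus one spanning a 500-site record.
features = [(120, 199), (301, 350), (120, 201), (250, 301), (1, 500)]
kept = [f for f in features if survives_deletion(f, deleted)]
print(kept)  # [(120, 199), (301, 350)]
```

A full-length feature necessarily overlaps the deleted region, so it is
dropped without needing a special case.  The sketch only decides what to
keep; it does not shift the coordinates of retained features that lie
downstream of the deletion.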

Best,

L.

On 12/11/2009 12:04, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> Hello all,
> 
> Something we added in Biopython 1.50 was the ability to slice a SeqRecord,
> which tries to do something sensible with all the annotation - in particular
> per-letter-annotation (like quality scores) and features (which have
> locations) are handled as you would naturally expect.
> 
> Something you can look forward to in our next release (assuming no
> major issues crop up in testing) is adding SeqRecord objects together.
> Again, this will try and do something unambiguous with the annotation.
> 
> I have two motivational examples in mind which combine slicing and
> addition of SeqRecord objects to edit a record while preserving as much
> annotation as possible. For example, removing a section of sequence,
> say letters from 100 to 200:
> 
> from Bio import SeqIO
> record = SeqIO.read(...)
> deletion_mutant = record[:100] + record[200:]
> 
> (The above would make sense for both protein and nucleotide records).
> Or, for a circular nucleotide sequence (like a plasmid or many small
> genomes), you might want to shift the origin, e.g. by 150 bases:
> 
> shifted = record[150:] + record[:150]
> 
> You can already do both these examples with the latest (unreleased) code.
> However, the situation with the annotation isn't ideal. When slicing a record,
> for non-location based annotation there is no way to know for sure if the
> annotation still applies to the daughter sequence. Therefore in the face of
> this ambiguity, when we added SeqRecord slicing in Biopython 1.50, we
> did not copy the dbxrefs and annotations dictionary to the daughter record.
> i.e. You currently have to do this manually (if required), for example:
> 
> deletion_mutant = record[:100] + record[200:]
> deletion_mutant.dbxrefs = record.dbxrefs[:]
> deletion_mutant.annotations = record.annotations.copy()
> 
> I would like to propose changing the SeqRecord slice behaviour to
> blindly copy the dbxrefs list and annotations dict to the daughter record
> (just like the id, name and description are already blindly copied even
> though they may not make sense for the daughter record). Then these
> slicing+addition examples will "just work" without the user having to
> explicitly copy the dbxrefs and annotations dict.
> 
> This is a non-backwards compatible change, but with hindsight is
> perhaps a more natural behaviour. We would of course highlight this
> in the release notes (maybe with some worked examples on the blog).
> 
> Does changing SeqRecord slicing like this seem like a good idea?
> 
> Peter
> 
> P.S. The code changes required are very small (two extra lines), see
> this commit on my experimental branch on github for details - most
> of the changes are documentation and unit tests for this work:
> http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c06d4f7

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
