[Biopython-dev] [BioPython] about the SeqRecord slicing
Peter
biopython at maubp.freeserve.co.uk
Fri Mar 27 06:29:10 EDT 2009
On Fri, Mar 27, 2009 at 8:22 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Thursday 26 March 2009 16:32:23 Peter wrote:
>
>> You'd also want the SeqRecord to support __add__ (and __radd__) so
>> that two SeqRecord objects can be added together. I have thought
>> about this before, and it is a *much* more complicated issue due to
>> the meta data. In general the only safe and unambiguous choice is to
>> exclude it from the combined record:
>> * sequence - just add (using normal rules for adding Seq objects)
>> * name/id/description - if the two agree, use that? Otherwise default
>> to a blank value?
>> * annotations - for each keyed value, you could combine the entries?
>> Or just throw them all away?
>> * letter_annotations - if an entry is present in both you can combine
>> it. Otherwise throw them away?
>> * features - these could be combined, adjusting the locations for one
>> record's features as appropriate
>
> As I said before I think that the same problem is presented when you do a
> slice. If I have the sequence of a gene named X with some annotations and I
> slice a part, should it still be named geneX? Should the annotations be kept?
The problems about the annotation when slicing a SeqRecord are similar, but
I think things are worse when adding two SeqRecords together.
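To make the addition rules quoted above concrete, here is a minimal sketch.
It uses a toy Record class (hypothetical, not the real SeqRecord) and picks
one of the debated options for each field: blank the id/name/description
unless both records agree, throw the annotations dictionary away, combine
only the per-letter annotations present in both, and shift the second
record's feature locations. The names Record, Feature and add_records are
made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    start: int  # 0-based, inclusive
    end: int    # exclusive
    type: str = ""


@dataclass
class Record:
    """Toy stand-in for a sequence record (hypothetical, not Bio.SeqRecord)."""
    seq: str
    id: str = ""
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)
    letter_annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


def add_records(a, b):
    """Combine two records using one possible set of conservative rules."""
    offset = len(a.seq)
    combined = Record(seq=a.seq + b.seq)
    # id/name/description: keep a value only if both records agree on it,
    # otherwise leave the default blank value.
    for attr in ("id", "name", "description"):
        if getattr(a, attr) == getattr(b, attr):
            setattr(combined, attr, getattr(a, attr))
    # annotations: deliberately dropped here (the "throw them away" option).
    # letter_annotations: combine only entries present in both records.
    for key in a.letter_annotations.keys() & b.letter_annotations.keys():
        combined.letter_annotations[key] = (
            list(a.letter_annotations[key]) + list(b.letter_annotations[key])
        )
    # features: keep all, shifting the second record's locations.
    combined.features = [Feature(f.start, f.end, f.type) for f in a.features]
    combined.features += [
        Feature(f.start + offset, f.end + offset, f.type) for f in b.features
    ]
    return combined


left = Record("ACGT", id="x", name="gene", annotations={"note": "left half"},
              letter_annotations={"q": [1, 2, 3, 4], "only_left": [0, 0, 0, 0]},
              features=[Feature(0, 2, "f1")])
right = Record("GG", id="y", name="gene",
               letter_annotations={"q": [5, 6]},
               features=[Feature(0, 1, "f2")])
both = add_records(left, right)
```

Here the combined record keeps the shared name but gets a blank id, loses
the "only_left" per-letter annotation, and the second feature moves to
positions 4..5. Every one of those choices is debatable, which is the point.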
For slicing, there are a few sub-cases:
- per-letter-annotation can be sliced too - easy.
- features - we retain only features fully inside the new sub-sequence (the
borderline features which cross the slice boundary are a small problem -
excluding them is the simplest solution to code and explain).
- id/name - debatable. Currently kept.
- description - debatable. Consider a description which says "whole genome",
that doesn't really apply to a partial sequence. On the other hand, it may.
Currently kept for the sub-record.
- annotations - again debatable. Without context information, we can't guess.
The only sensible options are keep it all (as in CVS) or none of it.
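The slicing cases listed above can be sketched roughly as follows. This is
only an illustration using a toy Record class (hypothetical, not the real
SeqRecord code), with a keep_annotations switch standing in for the
"keep it all or none of it" choice. The names Record, Feature and
slice_record are made up for this example, and the behaviour of the real
SeqRecord may differ between Biopython versions.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    start: int  # 0-based, inclusive
    end: int    # exclusive
    type: str = ""


@dataclass
class Record:
    """Toy stand-in for a sequence record (hypothetical, not Bio.SeqRecord)."""
    seq: str
    id: str = ""
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)
    letter_annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


def slice_record(rec, start, end, keep_annotations=True):
    """Return the sub-record rec[start:end] under the rules listed above."""
    # Normalise the bounds the same way Python slicing would.
    start, end, _ = slice(start, end).indices(len(rec.seq))
    end = max(start, end)
    # id/name/description: kept as they are (debatable).
    sub = Record(seq=rec.seq[start:end], id=rec.id, name=rec.name,
                 description=rec.description)
    # per-letter-annotation: sliced along with the sequence.
    sub.letter_annotations = {
        key: values[start:end] for key, values in rec.letter_annotations.items()
    }
    # features: keep only those fully inside the slice, shifting locations;
    # anything crossing the slice boundary is excluded.
    sub.features = [
        Feature(f.start - start, f.end - start, f.type)
        for f in rec.features
        if start <= f.start and f.end <= end
    ]
    # annotations: all or nothing.
    if keep_annotations:
        sub.annotations = dict(rec.annotations)
    return sub


gene = Record(
    seq="ACGTACGTAC", id="geneX", description="whole gene",
    annotations={"note": "example"},
    letter_annotations={"quality": [10, 20, 30, 40, 50, 60, 70, 80, 90, 99]},
    features=[Feature(2, 5, "domain"), Feature(4, 9, "exon")],
)
sub = slice_record(gene, 1, 6)
```

In this example the "domain" feature survives with its location shifted to
1..4, while the "exon" feature crosses the right-hand boundary and is
dropped. The id and description are carried over unchanged, which is
exactly the debatable part.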
I think it is worth keeping the id/name in general (consider typical use cases
like cropping a domain from a gene, or cropping columns off an alignment).
I would be OK with dropping the contents of the annotations dictionary and
description in order to avoid ambiguity, but this would prevent certain tasks.
Peter