[Biopython] SeqRecord slicing bug and fix
Peter Cock
p.j.a.cock at googlemail.com
Wed May 29 05:26:25 EDT 2013
On Wed, May 29, 2013 at 2:22 AM, Mark Budde <markbudde at gmail.com> wrote:
> On Tue, May 28, 2013 at 4:09 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> On Tue, May 28, 2013 at 11:03 PM, Mark Budde <markbudde at gmail.com> wrote:
>> > There is a bug in the SeqRecord slicing behavior. The bug crops up on
>> > circular records with a feature spanning the beginning and end of the
>> > plasmid. Any slice outside of the feature will return the feature, and
>> > the
>> > feature.location.end is negative.
>> >
>> >>>> record = SeqIO.read('pUC19_mod.gb', 'genbank')
>> >>>> record
>> >
>> > SeqRecord(seq=Seq('TCGCGCGTTTCGGTGATGACGGTGAAAACCTCTGACACATGCAGCTCCCGGAGA...GTC',
>> > IUPACAmbiguousDNA()), id='pUC19', name='pUC19', description='',
>> > dbxrefs=[])
>> >>>> record.features
>> > [SeqFeature(FeatureLocation(ExactPosition(2299), ExactPosition(200),
>> > strand=1), type='misc_feature')]
>>
>> The issue is you've got start > end, which arguably should
>> raise an exception (there is a TODO in the code for that)
>>
>> Is this a circular record and a feature spanning the origin?
>
> Yes, it is a circular plasmid with a feature spanning the origin. There are
> legitimate reasons to have features span the origin,
The SeqFeature system is very heavily influenced by the GenBank/EMBL
feature table, meaning you'd write this like join(2300...3000,1..200) where
for the sake of argument I've assumed the genome is 3000 long.
Currently join features are handled with sub_features, but this is about
to change in the forthcoming Biopython 1.62 release which introduces
CompoundLocation objects instead. If you fancy trying the latest
Biopython from github, you should be able create this example with:
wrap_location = FeatureLocation(2299, 3000) + FeatureLocation(0, 200)
(Doing the equivalent with the current system in Biopython 1.61 or
older is far more complicated)
> ... so please do not raise
> an exception. I think the provided code is the best solution to the problem
> (and completely fixes the problems within my personal code when this is an
> issue), but would be interested in hearing other suggestions.
An exception if start > end would prevent downstream surprises in
code like __getitem__ which assumes this. There are other problems
which prevent allowing this - for example it is impossible to calculate
the length of the location (and therefore the SeqFeature) without
also knowing the circular genome's length. Likewise __contains__
(which is used for testing if an integer position is in a feature)
and __iter__ (which is used for iterating over the integer positions
within a feature) would break.
Unfortunately your initial code suggestion would only solve a small
subset of the feature location functionality.
Basically trying to support wrapped features like this would require
a *lot* of special case code, but add very little over the current
join-based approach which also handles many other biological
annotations nicely like spliced genes.
>> > On a related note, is there an appropriate way to modify the position of
>> > a SeqFeature? I have been doing "feature.location._end =
>> > ExactPosition(newEnd)" , but I was under the impression that I shouldn't
>> > modify objects beginning with an underscore.
>>
>> Yes, things starting with a single underscore should be
>> regarded as private and not used. Currently that appears
>> to be setup as a read only property (which you can change
>> directly using feature.location._end = new_value) and
>> right now I'm not sure why that was done, but it has been
>> read only for since Bio.SeqFeature was first written 12
>> years ago. Maybe no one has asked till now?
>
> Well the alternative is to make a new feature and import all of the other
> atributes, but that seems like a lot of work for no practical gain.
No, you'd only need to create a new FeatureLocation and
then feature.location = new_location but this is still a hassle,
and definitely worth looking at changing. Thanks for raising
this.
One reason a read-only FeatureLocation *could* be nice is if
we had any clever indexing code which could break if the
locations were liable to change. But we don't have anything
like that within Biopython at the moment.
We should probably continue this on the biopython-dev list,
in case there is a more practical downside to allowing the
FeatureLocation start and end to be updated which I'm
currently missing.
Regards,
Peter
More information about the Biopython
mailing list