[Biopython-dev] [Biopython] SeqRecord reverse complement method?

Thu Oct 1 09:04:17 UTC 2009

On Wed, Sep 30, 2009 at 4:27 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> This has meant that generally the current status quo isn't
> a problem (at least for me). However, what prompted me
> to work on this issue was a real world example.
>
> We have a draft genome where after doing a basic
> annotation, it would make sense to flip the strands. I
> want to be able to load our current GenBank file, apply
> the reverse complement, and have all the annotated
> features recalculated to match. With more and more
> sequencing projects, this isn't such an odd thing to
> want to do.

The github branch has SeqRecord reverse complement
working pretty well (with plenty of tests covering the fuzzy
locations), and a first attempt at SeqRecord addition too:
http://github.com/peterjc/biopython/commits/seqrecords

This lets me solve my motivating example like so:

from Bio import SeqIO
old_record = SeqIO.parse(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
SeqIO.write([old_record.reverse_complement(...)], handle, "gb")
handle.close()

If I wanted to shift the origin, this would be possible by
combining SeqRecord slicing and addition:

from Bio import SeqIO
cut = 3765
old_record = SeqIO.parse(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = old_record[cut:] + old_record[:cut]
SeqIO.write([new_record], handle, "gb")
handle.close()

And of course you can do both (which is probably what I
will be using for the real task from work that this example
is based on):

from Bio import SeqIO
cut = 3765
old_record = SeqIO.parse(open("pBAD30.gb"), "gb")
handle = open("pBAD30_rc.gb", "w")
new_record = (old_record[cut:] + old_record[:cut]).reverse_complement(...)
SeqIO.write([new_record], handle, "gb")
handle.close()

The general scheme is nice and simple I think, but
the trouble is in the details.

For this particular example, it makes sense for all the
annotation to preserved. For the reverse complement
this is possible (although currently on my branch, this
is not the default - hence the dot dot dot in the example
above where right now this need to be requested
explicitly).

However, currently on SeqRecord slicing we take the
cautious approach to the annotation, and the annotation
dictionary and dbxrefs list are lost. On reflection, perhaps
the more liberal straight forward approach is more useful:
copy all the annotation (and leave it to the user to remove
anything that becomes inappropriate). Then this code
would "work":

new_record = old_record[cut:] + old_record[:cut]

Right now, based on the current slicing in the trunk,
you have to copy these annotations manually:

new_record = old_record[cut:] + old_record[:cut]
new_record.annotations = old_record.annotations.copy()
new_record.dbxrefs = old_record.dbxrefs[:]

The question is which is preferable? The current slicing
makes the user think about their annotation explicitly.
The alternative is to blindly copy it, knowing that in some
cases it will not be appropriate to the sub-record.

Peter

P.S. For those of you interested in multiple sequence
alignments, once SeqRecord addition is dealt with,
adding alignments becomes practical. i.e. taking
two gene alignments for N species, and then
concatenating them as discussed on Bug 2552:
http://bugzilla.open-bio.org/show_bug.cgi?id=2552