[Biopython] EMBL DNA records with locations referencing other sequences
Adam Sjøgren
asjo at koldfront.dk
Mon Sep 30 10:32:59 UTC 2019
Hi,
I'm trying to figure out how to handle EMBL DNA records that have
locations that reference other sequences, using BioPython.
An examples is BX465837, which I download like this:
$ curl -o BX465837.embl 'https://www.ebi.ac.uk/ena/data/view/BX465837&display=text&download=txt&filename=BX465837.txt'
Looking at the first three features, we see references to other records
in the third location:
FH Key Location/Qualifiers
FH
FT source 1..206154
FT /organism="Danio rerio"
FT /chromosome="19"
FT /mol_type="genomic DNA"
FT /clone_lib="DanioKey"
FT /clone="DKEY-27C15"
FT /db_xref="taxon:7955"
FT misc_feature 1..204154
FT /note="annotated region of clone"
FT mRNA join(BX537337.9:201339..201471,BX537337.9:212287..212438,
FT BX537337.9:214424..214601,1605..1701,2016..2091,5411..5554,
FT 5720..5970,6471..6553,6656..6804,6972..7074,7152..7350,
FT 7497..7589,7714..8254,8415..8662,9113..9410,9493..9715,
FT 9874..13517)
FT /locus_tag="DKEY-27C15.6-001"
(the "BX537337.9"-part). Later features also have these references.
I can read this record in with SeqIO:
>>> from Bio import SeqIO
>>> with open("BX465837.embl", "r") as fh:
... rec = SeqIO.read(fh, "embl")
...
For simplicity I will just grab the first three features, here is a
print of those:
>>> for feature in first_three:
... print("{} {}".format(feature.type, feature.location))
...
source [0:206154](+)
misc_feature [0:204154](+)
mRNA join{BX537337.9[201338:201471](+), BX537337.9[212286:212438](+), BX537337.9[214423:214601](+), [1604:1701](+), [2015:2091](+), [5410:5554](+), [5719:5970](+), [6470:6553](+), [6655:6804](+), [6971:7074](+), [7151:7350](+), [7496:7589](+), [7713:8254](+), [8414:8662](+), [9112:9410](+), [9492:9715](+), [9873:13517](+)}
Now, if I try to extract the sequence of any of them, I get an
exception immedately:
>>> for feature in first_three:
... print("{} {}".format(feature.type, feature.location))
... seq = feature.extract(rec)
...
source [0:206154](+)
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 369, in extract
return self.location.extract(parent_sequence)
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 954, in extract
f_seq = parent_sequence[self.nofuzzy_start:self.nofuzzy_end]
File "/usr/lib/python3/dist-packages/Bio/SeqRecord.py", line 459, in __getitem__
answer.features.append(f._shift(-start))
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 303, in _shift
answer = SeqFeature(location=self.location._shift(offset),
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in _shift
return CompoundLocation([loc._shift(offset) for loc in self.parts],
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in <listcomp>
return CompoundLocation([loc._shift(offset) for loc in self.parts],
File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 873, in _shift
raise ValueError("Feature references another sequence.")
ValueError: Feature references another sequence.
(which makes sense - how would BioPython get that other referenced
sequence?)
How can I make .extract() (and eventually .translate() on CDS features)
work here? Can I provide the referenced sequences to BioPython somehow,
or how do you usually deal with these kind of locations?
I found this issue: https://github.com/biopython/biopython/issues/808
which seems to have petered out.
Best regards,
Adam
--
"The world isn't fair, Calvin." Adam Sjøgren
"I know, but why isn't it ever unfair in my favor?" asjo at koldfront.dk
More information about the Biopython
mailing list