[Biopython] EMBL DNA records with locations referencing other sequences

Mon Sep 30 11:11:52 UTC 2019

Hi Adam,

You have found the relevant issue,
https://github.com/biopython/biopython/issues/808 - which has the
outlines of a strategy. I would first download the referenced
sequences, and store them in a dictionary keyed by the accession.
Then, you need a new "extract" function (which might ultimately
replace the current extract method) taking as arguments the complex
location, main sequence, and this dictionary of external sequences.
This code needs to special case the extract for any FeatureLocation
with .ref set, see e.g.

https://github.com/biopython/biopython/issues/808#issuecomment-209364333

Does this all look too horrible/challenging?

I've not had to deal with one of these sequences (or indeed seen one)
in a long time. It may be easier to use the feature's accession to
look up the gene or protein sequence directly online - rather than
trying to reconstruct it yourself?

Peter

On Mon, Sep 30, 2019 at 11:37 AM Adam Sjøgren <asjo at koldfront.dk> wrote:
>
>   Hi,
>
>
> I'm trying to figure out how to handle EMBL DNA records that have
> locations that reference other sequences, using BioPython.
>
> An examples is BX465837, which I download like this:
>
>   $ curl -o BX465837.embl 'https://www.ebi.ac.uk/ena/data/view/BX465837&display=text&download=txt&filename=BX465837.txt'
>
> Looking at the first three features, we see references to other records
> in the third location:
>
>   FH   Key             Location/Qualifiers
>   FH
>   FT   source          1..206154
>   FT                   /organism="Danio rerio"
>   FT                   /chromosome="19"
>   FT                   /mol_type="genomic DNA"
>   FT                   /clone_lib="DanioKey"
>   FT                   /clone="DKEY-27C15"
>   FT                   /db_xref="taxon:7955"
>   FT   misc_feature    1..204154
>   FT                   /note="annotated region of clone"
>   FT   mRNA            join(BX537337.9:201339..201471,BX537337.9:212287..212438,
>   FT                   BX537337.9:214424..214601,1605..1701,2016..2091,5411..5554,
>   FT                   5720..5970,6471..6553,6656..6804,6972..7074,7152..7350,
>   FT                   7497..7589,7714..8254,8415..8662,9113..9410,9493..9715,
>   FT                   9874..13517)
>   FT                   /locus_tag="DKEY-27C15.6-001"
>
> (the "BX537337.9"-part). Later features also have these references.
>
> I can read this record in with SeqIO:
>
>   >>> from Bio import SeqIO
>   >>> with open("BX465837.embl", "r") as fh:
>   ...     rec = SeqIO.read(fh, "embl")
>   ...
>
> For simplicity I will just grab the first three features, here is a
> print of those:
>
>   >>> for feature in first_three:
>   ...     print("{} {}".format(feature.type, feature.location))
>   ...
>   source [0:206154](+)
>   misc_feature [0:204154](+)
>   mRNA join{BX537337.9[201338:201471](+), BX537337.9[212286:212438](+), BX537337.9[214423:214601](+), [1604:1701](+), [2015:2091](+), [5410:5554](+), [5719:5970](+), [6470:6553](+), [6655:6804](+), [6971:7074](+), [7151:7350](+), [7496:7589](+), [7713:8254](+), [8414:8662](+), [9112:9410](+), [9492:9715](+), [9873:13517](+)}
>
> Now, if I try to extract the sequence of any of them, I get an
> exception immedately:
>
>   >>> for feature in first_three:
>   ...     print("{} {}".format(feature.type, feature.location))
>   ...     seq = feature.extract(rec)
>   ...
>   source [0:206154](+)
>   Traceback (most recent call last):
>     File "<stdin>", line 3, in <module>
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 369, in extract
>       return self.location.extract(parent_sequence)
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 954, in extract
>       f_seq = parent_sequence[self.nofuzzy_start:self.nofuzzy_end]
>     File "/usr/lib/python3/dist-packages/Bio/SeqRecord.py", line 459, in __getitem__
>       answer.features.append(f._shift(-start))
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 303, in _shift
>       answer = SeqFeature(location=self.location._shift(offset),
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in _shift
>       return CompoundLocation([loc._shift(offset) for loc in self.parts],
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in <listcomp>
>       return CompoundLocation([loc._shift(offset) for loc in self.parts],
>     File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 873, in _shift
>       raise ValueError("Feature references another sequence.")
>   ValueError: Feature references another sequence.
>
> (which makes sense - how would BioPython get that other referenced
> sequence?)
>
> How can I make .extract() (and eventually .translate() on CDS features)
> work here? Can I provide the referenced sequences to BioPython somehow,
> or how do you usually deal with these kind of locations?
>
> I found this issue: https://github.com/biopython/biopython/issues/808
> which seems to have petered out.
>
>
>   Best regards,
>
>     Adam
>
> --
>  "The world isn't fair, Calvin."                              Adam Sjøgren
>  "I know, but why isn't it ever unfair in my favor?"     asjo at koldfront.dk
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> https://mailman.open-bio.org/mailman/listinfo/biopython