[Biopython] EMBL DNA records with locations referencing other sequences

Mon Sep 30 10:32:59 UTC 2019

  Hi,

I'm trying to figure out how to handle EMBL DNA records that have
locations that reference other sequences, using BioPython.

An examples is BX465837, which I download like this:

  $ curl -o BX465837.embl 'https://www.ebi.ac.uk/ena/data/view/BX465837&display=text&download=txt&filename=BX465837.txt'

Looking at the first three features, we see references to other records
in the third location:

  FH   Key             Location/Qualifiers
  FH
  FT   source          1..206154
  FT                   /organism="Danio rerio"
  FT                   /chromosome="19"
  FT                   /mol_type="genomic DNA"
  FT                   /clone_lib="DanioKey"
  FT                   /clone="DKEY-27C15"
  FT                   /db_xref="taxon:7955"
  FT   misc_feature    1..204154
  FT                   /note="annotated region of clone"
  FT   mRNA            join(BX537337.9:201339..201471,BX537337.9:212287..212438,
  FT                   BX537337.9:214424..214601,1605..1701,2016..2091,5411..5554,
  FT                   5720..5970,6471..6553,6656..6804,6972..7074,7152..7350,
  FT                   7497..7589,7714..8254,8415..8662,9113..9410,9493..9715,
  FT                   9874..13517)
  FT                   /locus_tag="DKEY-27C15.6-001"

(the "BX537337.9"-part). Later features also have these references.

I can read this record in with SeqIO:

  >>> from Bio import SeqIO
  >>> with open("BX465837.embl", "r") as fh:
  ...     rec = SeqIO.read(fh, "embl")
  ... 

For simplicity I will just grab the first three features, here is a
print of those:

  >>> for feature in first_three:
  ...     print("{} {}".format(feature.type, feature.location))
  ... 
  source [0:206154](+)
  misc_feature [0:204154](+)
  mRNA join{BX537337.9[201338:201471](+), BX537337.9[212286:212438](+), BX537337.9[214423:214601](+), [1604:1701](+), [2015:2091](+), [5410:5554](+), [5719:5970](+), [6470:6553](+), [6655:6804](+), [6971:7074](+), [7151:7350](+), [7496:7589](+), [7713:8254](+), [8414:8662](+), [9112:9410](+), [9492:9715](+), [9873:13517](+)}

Now, if I try to extract the sequence of any of them, I get an
exception immedately:

  >>> for feature in first_three:
  ...     print("{} {}".format(feature.type, feature.location))
  ...     seq = feature.extract(rec)
  ... 
  source [0:206154](+)
  Traceback (most recent call last):
    File "<stdin>", line 3, in <module>
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 369, in extract
      return self.location.extract(parent_sequence)
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 954, in extract
      f_seq = parent_sequence[self.nofuzzy_start:self.nofuzzy_end]
    File "/usr/lib/python3/dist-packages/Bio/SeqRecord.py", line 459, in __getitem__
      answer.features.append(f._shift(-start))
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 303, in _shift
      answer = SeqFeature(location=self.location._shift(offset),
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in _shift
      return CompoundLocation([loc._shift(offset) for loc in self.parts],
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 1181, in <listcomp>
      return CompoundLocation([loc._shift(offset) for loc in self.parts],
    File "/usr/lib/python3/dist-packages/Bio/SeqFeature.py", line 873, in _shift
      raise ValueError("Feature references another sequence.")
  ValueError: Feature references another sequence.

(which makes sense - how would BioPython get that other referenced
sequence?)

How can I make .extract() (and eventually .translate() on CDS features)
work here? Can I provide the referenced sequences to BioPython somehow,
or how do you usually deal with these kind of locations?

I found this issue: https://github.com/biopython/biopython/issues/808
which seems to have petered out.

  Best regards,

    Adam

-- 
 "The world isn't fair, Calvin."                              Adam Sjøgren
 "I know, but why isn't it ever unfair in my favor?"     asjo at koldfront.dk