[Biopython] Handling records referencing other records

Fri Sep 18 15:17:01 UTC 2015

On Fri, Sep 18, 2015 at 3:30 PM, Athey, John * <John.Athey at fda.hhs.gov> wrote:
> Hello all,
>
> I’m looking for advice on how to handle Genbank records that reference other
> records as part of their location. My program iterates through large
> Genbank-formatted files with SeqIO.parse and extracts the CDS for subsequent
> analysis, using feat.extract(). However, upon hitting a record where the
> feature location references another record, it SOMETIMES fails. For example,
> http://www.ncbi.nlm.nih.gov/nuccore/DQ100169 seems to be handled correctly,
> while http://www.ncbi.nlm.nih.gov/nuccore/DQ100170 gives a “ValueError:
> Feature references another sequence.” Curiously, in both cases the CDS
> feature itself doesn’t specify another record, only the parent gene does.
>
> My questions about this are:
>
> 1)      Why does the extraction fail on some records but not on all of them?

In http://www.ncbi.nlm.nih.gov/nuccore/DQ100169 the CDS has location:

<1..>587

i.e. all of the 587 bp in the current record DQ100170, but noting the
start and end are incomplete.

In http://www.ncbi.nlm.nih.gov/nuccore/DQ100170 the CDS has location:

order(DQ100169.1:<1..587,1..>3167)

This explicitly references DQ100169.1 as well as implicitly referencing
region 1..>3167 of itself.

As you've seen, the SeqFeature's .extract method does not support
this (yet). I've previously wondered about extending the .extract
method to add an optional dictionary argument to lookup any externally
referenced sequences. e.g.

feature_seq = my_feature.extract(my_sequences, other_seq_dict)

i.e. Optionally provide a dictionary with the other sequence IDs like
"DQ100169.1" as keys. Ideally this could be used with the dictionary
like object from Bio.SeqIO.index_db(...)

However, I've never had a compelling use case.

> 2)      Is there a way to extract the data I’m looking for without causing
> this error?

Not easily - these look like partial gene sequence anyway, so is it
worth the effort?

You can of course pull the NCBI provided amino acid sequence
from the feature's translation qualifier value.

> 3)      If the answer to (2) is no, is there some other way to check whether
> the sequence will cause this error, skip extracting that sequence, and
> exclude that record from the analysis?

You would use a Python try/except to catch this ValueError.

Peter