[Biopython] Handling records referencing other records

Fri Sep 18 21:07:36 UTC 2015

Hi John,
I don't know if this will help, but I recently had a list of proteins for
which I wanted the mRNA or CDS of each one so that I could use the RNA.
(`mRNA` means someone entered a specific corresponding GenBank entry
described as the mRNA, and CDS means the sequence was extracted using the
`coded_by` information.) I ran into some of the same issues you seem to be
describing and worked out ways around them, I think. The program tries more
aggressive and less efficient means as it gets to the tougher and tougher
ones to extract. I tried to make it so it doesn't give up. It probably isn't
perfect yet but at the time it would easily get several hundred starting
from the NCBI-sourced fasta sequences for the protein. (The sequence itself
isn't important but the description line actually is. It extracts an id
from there.) It even validates them to make sure they encode the original
protein using the correct one of the 24 genetic codes.
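
As a rough illustration of why the description line matters, here is a
minimal sketch of pulling an id out of it. This is not taken from the linked
script; the function name and the example deflines are made up, and it
assumes either the legacy NCBI `gi|number|db|accession|` layout or a plain
accession as the first token.

```python
def protein_id_from_defline(defline):
    """Return the accession-like id from a FASTA description line, or None.

    Hypothetical helper for illustration only.
    """
    line = defline.strip().lstrip(">").strip()
    if not line:
        return None
    first_token = line.split()[0]
    fields = first_token.split("|")
    # Legacy NCBI style: gi|<number>|<db>|<accession.version>|
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    # Otherwise assume the first token is already the accession.
    return first_token


if __name__ == "__main__":
    # Made-up deflines, not real records.
    print(protein_id_from_defline(">gi|12345|ref|NP_000001.1| example protein"))
    print(protein_id_from_defline(">NP_000001.1 example protein [Homo sapiens]"))
```

The returned id could then be used to look up the protein record and, from
there, the `coded_by` information.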

You can check the code out at
https://github.com/fomightez/sequencework/blob/master/RetrieveSeq/GetmRNAorCDSforProtein.py.
The description is at
https://github.com/fomightez/sequencework/tree/master/RetrieveSeq .

Feel free to adapt it, or let me know if you'd like some help testing it
with your data or adapting it to what you have as starting material.

Wayne

Date: Fri, 18 Sep 2015 14:30:05 +0000
From: "Athey, John *" <John.Athey at fda.hhs.gov>
To: "biopython at mailman.open-bio.org" <biopython at mailman.open-bio.org>
Subject: [Biopython] Handling records referencing other records

Hello all,

I'm looking for advice on how to handle Genbank records that reference
other records as part of their location. My program iterates through large
Genbank-formatted files with SeqIO.parse and extracts the CDS for
subsequent analysis, using feat.extract(). However, upon hitting a record
where the feature location references another record, it SOMETIMES fails.
For example, http://www.ncbi.nlm.nih.gov/nuccore/DQ100169 seems to be
handled correctly, while http://www.ncbi.nlm.nih.gov/nuccore/DQ100170 gives
a "ValueError: Feature references another sequence." Curiously, in both
cases the CDS feature itself doesn't specify another record, only the
parent gene does.

My questions about this are:

1) Why does the extraction fail on some records but not on all of them?

2) Is there a way to extract the data I'm looking for without causing
this error?

3) If the answer to (2) is no, is there some other way to check
whether the sequence will cause this error, skip extracting that sequence,
and exclude that record from the analysis?
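
One possible shape for the check asked about in (3), as a minimal sketch
rather than a tested recipe: it assumes each part of a Biopython feature
location carries a `ref` attribute that is set only when the part points at
another record, and that both simple and compound locations expose `parts`.
The helper names `references_other_record` and `extract_local_cds` are made
up for this example.

```python
def references_other_record(feature):
    """Return True if any part of the feature's location has a non-empty ref."""
    location = feature.location
    if location is None:
        return False
    # Fall back to treating the location as a single part if it has no .parts.
    parts = getattr(location, "parts", [location])
    return any(getattr(part, "ref", None) for part in parts)


def extract_local_cds(records):
    """Yield (record_id, cds_sequence) for CDS features that extract locally."""
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            if references_other_record(feature):
                # Needs sequence from another accession; leave it out.
                continue
            try:
                yield record.id, feature.extract(record.seq)
            except ValueError:
                # Safety net in case extract() still refuses the location.
                continue


# Intended use with Biopython (the file name is a placeholder):
#   from Bio import SeqIO
#   for rec_id, cds in extract_local_cds(SeqIO.parse("input.gb", "genbank")):
#       print(rec_id, len(cds))
```

Checking the location up front makes the skipped records easy to log or
count, while the `try`/`except` keeps the loop going if a case is missed.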

Thanks for any help you can provide!