[Biopython] Parsing problem

Tue Dec 8 18:52:13 UTC 2009

Hi all,

I am having a small problem parsing a GenBank (or rather GenPept) file
with Biopython. I am trying to extract the position on the genome from
the "coded_by" qualifier of the CDS feature of a protein.

The "coded_by" string in this specific case looks like this:

'complement(NC_012967.1:3622110..3624728)'

Now, when I run

Bio.GFF.easy.LocationFromString('complement(NC_012967.1:3622110..3624728)' )

I get

File "/usr/lib/pymodules/python2.6/Bio/GFF/easy.py", line 419, in __init__
   list.__init__(self, [int(location_str)-1]) # zero based, nip it in the bud
ValueError: invalid literal for int() with base 10: 'NC_012967.1:3622110..3624728'

Is there another way to parse this location string or do I have to cook up
some kind of custom RegExp?

Iwan

P.S.: Code snippet:

from Bio import Entrez
from Bio import SeqIO
from Bio import GFF

gi = 254163455
# Fetch the protein record from NCBI
handle = Entrez.efetch(db="protein", id=gi, rettype="gb")
record = SeqIO.read(handle, "genbank")
handle.close()

for feature in record.features:
    if feature.type == "CDS" and "coded_by" in feature.qualifiers:
        coded_by = feature.qualifiers["coded_by"][0]
        print coded_by,
        # This call raises the ValueError shown above
        loc = GFF.easy.LocationFromString(coded_by)
        print loc.start(), loc.end(), loc.complement
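
For reference, the custom regular expression route asked about above could look roughly like the sketch below. It is written for Python 3 and is not part of Biopython: the names parse_coded_by and CodedBy are made up for illustration. It assumes the qualifier is a single range of the form ACCESSION:START..END, optionally wrapped in complement(...); compound locations such as join(...) are not handled and raise a ValueError. Coordinates are returned exactly as written in the record (1-based, inclusive), with no conversion to zero-based indexing.

```python
import re
from collections import namedtuple

# Hypothetical container for the parsed pieces (not a Biopython class).
CodedBy = namedtuple("CodedBy", "accession start end complement")

_CODED_BY_RE = re.compile(
    r"""^
        (?P<comp>complement\()?          # optional complement wrapper
        (?P<acc>[A-Za-z0-9_.]+):         # accession.version, e.g. NC_012967.1
        (?P<start>\d+)\.\.(?P<end>\d+)   # START..END as written in the record
        (?(comp)\))                      # closing bracket only if wrapper opened
    $""",
    re.VERBOSE,
)


def parse_coded_by(text):
    """Parse a simple coded_by string into a CodedBy tuple."""
    # Drop all whitespace so values wrapped across lines still match.
    compact = "".join(text.split())
    match = _CODED_BY_RE.match(compact)
    if match is None:
        raise ValueError("unsupported coded_by location: %r" % text)
    return CodedBy(
        accession=match.group("acc"),
        start=int(match.group("start")),
        end=int(match.group("end")),
        complement=match.group("comp") is not None,
    )


if __name__ == "__main__":
    print(parse_coded_by("complement(NC_012967.1:3622110..3624728)"))
    # CodedBy(accession='NC_012967.1', start=3622110, end=3624728, complement=True)
```

Stripping whitespace first means a value that was wrapped across two lines in the flat file parses the same way as the single-line form.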