[Biopython] Access Entrez gene DB using rettype 'gb'

Brad Chapman chapmanb at 50mail.com
Fri Dec 3 21:57:36 UTC 2010


David;
[Bacterial download example]
> http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true

Thanks for the additional explanation. The problem is that the fasta
file download doesn't have the fliC identifier in the header line.
It's not that Biopython doesn't present it to you correctly; rather,
NCBI didn't include it in the fasta output. 

Thanks to Peter's tip about efetch supporting slicing, here is a
script that replicates what you're getting by hand. The tricky part is 
that the gene record is returned as XML instead of their native record
format, so it needs to be parsed for the items of interest.

https://gist.github.com/727625

import xml.etree.ElementTree as ET
from Bio import Entrez

def fetch_gene_coordinates(search_term):
    handle = Entrez.esearch(db="gene", term=search_term)
    rec = Entrez.read(handle)
    gene_id = rec["IdList"][0] # assuming best match works

    handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
    gene_locus = ET.parse(handle).getroot().find("Entrezgene/Entrezgene_locus")
    region = gene_locus.find("Gene-commentary/Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval")
    # add 1 to coordinates from XML to match what Entrez expected (0 to 1 based)
    start = int(region.find("Seq-interval_from").text) + 1
    end = int(region.find("Seq-interval_to").text) + 1
    gi_id = region.find("Seq-interval_id/Seq-id/Seq-id_gi").text
    strand = region.find("Seq-interval_strand/Na-strand").get("value")
    return gi_id, start, end, strand

def get_fasta_seq(gi_id, start, end, strand):
    strand = 2 if strand.lower() == "minus" else 1
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
            seq_start=start, seq_stop=end, strand=strand)
    return handle.read()

Entrez.email = "yours at mail.com"
search_term = "fliC ct18"

gi_id, start, end, strand = fetch_gene_coordinates(search_term)
print get_fasta_seq(gi_id, start, end, strand)


Hope this works for you,
Brad



More information about the Biopython mailing list