[Biopython] Access Entrez gene DB using rettype 'gb'

Brad Chapman chapmanb at 50mail.com
Fri Dec 10 13:10:26 UTC 2010


Michiel, David and Peter;

> > That's right. The gene database doesn't give back the "native" XML
> > format which Entrez.read deals with. Instead it's got custom XML output:
> > 
> > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene
> 
> Really? What's the difference between "native" and "custom" XML? The
> Entrez Gene XML example from this linked gets parsed perfectly fine by
> Entrez.read.

Nice one. I seriously underestimated the power of your code to deal
with that nastiness. Here's an updated version that uses Entrez.read
instead of the custom XML parsing with ElementTree:

https://gist.github.com/727625

from Bio import Entrez

def fetch_gene_coordinates(search_term):
    handle = Entrez.esearch(db="gene", term=search_term)
    rec = Entrez.read(handle)
    gene_id = rec["IdList"][0] # assuming best match works

    handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
    rec = Entrez.read(handle)[0]
    gene_locus = rec["Entrezgene_locus"][0]
    region = gene_locus["Gene-commentary_seqs"][0]["Seq-loc_int"]["Seq-interval"]
    start = int(region["Seq-interval_from"]) + 1
    end = int(region["Seq-interval_to"]) + 1
    gi_id = region["Seq-interval_id"]["Seq-id"]["Seq-id_gi"]
    strand = region["Seq-interval_strand"]["Na-strand"].attributes["value"]
    return gi_id, start, end, strand

def get_fasta_seq(gi_id, start, end, strand):
    strand = 2 if strand.lower() == "minus" else 1
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
            seq_start=start, seq_stop=end, strand=strand)
    return handle.read()

Entrez.email = "yours at mail.com"
search_term = "fliC ct18"

gi_id, start, end, strand = fetch_gene_coordinates(search_term)
print get_fasta_seq(gi_id, start, end, strand)

Brad



More information about the Biopython mailing list