[Biopython] Access Entrez gene DB using rettype 'gb'
Brad Chapman
chapmanb at 50mail.com
Fri Dec 3 21:57:36 UTC 2010
David;
[Bacterial download example]
> http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true
Thanks for the additional explanation. The problem is that the fasta
file download doesn't have the fliC identifier in the header line.
It's not that Biopython doesn't present it to you correctly; rather,
NCBI didn't include it in the fasta output.
Thanks to Peter's tip about efetch supporting slicing, here is a
script that replicates what you're getting by hand. The tricky part is
that the gene record is returned as XML instead of their native record
format, so it needs to be parsed for the items of interest.
https://gist.github.com/727625
import xml.etree.ElementTree as ET
from Bio import Entrez
def fetch_gene_coordinates(search_term):
handle = Entrez.esearch(db="gene", term=search_term)
rec = Entrez.read(handle)
gene_id = rec["IdList"][0] # assuming best match works
handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
gene_locus = ET.parse(handle).getroot().find("Entrezgene/Entrezgene_locus")
region = gene_locus.find("Gene-commentary/Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval")
# add 1 to coordinates from XML to match what Entrez expected (0 to 1 based)
start = int(region.find("Seq-interval_from").text) + 1
end = int(region.find("Seq-interval_to").text) + 1
gi_id = region.find("Seq-interval_id/Seq-id/Seq-id_gi").text
strand = region.find("Seq-interval_strand/Na-strand").get("value")
return gi_id, start, end, strand
def get_fasta_seq(gi_id, start, end, strand):
strand = 2 if strand.lower() == "minus" else 1
handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
seq_start=start, seq_stop=end, strand=strand)
return handle.read()
Entrez.email = "yours at mail.com"
search_term = "fliC ct18"
gi_id, start, end, strand = fetch_gene_coordinates(search_term)
print get_fasta_seq(gi_id, start, end, strand)
Hope this works for you,
Brad
More information about the Biopython
mailing list