[Biopython] Access Entrez gene DB using rettype 'gb'

Sean Davis sdavis2 at mail.nih.gov
Thu Dec 2 21:07:14 UTC 2010


On Thu, Dec 2, 2010 at 3:42 PM, David Jacobs <
developer at allthingsprogress.com> wrote:

> I want to do something obvious but can't find a good way to do it. Maybe I'm
> looking in the wrong places. Anyway, I figured I'd ask here. (Bear with me,
> I'm new to Python and Biopython.)
>
> My question is: What's the easiest way to find and parse DNA sequences from
> the gene database?
>
> I'd like to use something like:
>
> handle = Entrez.efetch(db='gene', id='2', rettype='gb')
> handle.read()
>
> But this doesn't work. After poking around, I've learned you can do this
> query on the nucleotide database, but not on the gene database. Instead, I
> have to do this:
>
> handle = Entrez.efetch(db='gene', id='2', retmode='gb')
>
> I get back something like this:
>
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=gb
>
> That isn't easily parseable, at least as far as I can tell. So what's the
> best way for me to find my sequence? And is there a parser for the string I
> get from retmode='gb'?
>
>
Hi, David.

Genes (in the sense used in Entrez Gene) do not have sequences themselves.
Their transcripts do, however, and in general a gene can have multiple
transcripts. I think you would therefore have to query for the gene of
interest and then link to the nucleotide database to get the sequences of
the associated transcripts. If you want to do this for many genes, it may
be easier to download the entire RefSeq collection for your species of
interest and simply load the sequences into memory or index the FASTA file.
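
A rough sketch of the gene-to-nucleotide route might look like the
following. It is untested against the live service and rests on a few
assumptions: that the ELink link name "gene_nuccore_refseqrna" is the one
you want (it should select RefSeq RNA records linked to the gene), and
that the parsed ELink result has the usual LinkSetDb / Link / Id layout.
The function names linked_ids and fetch_transcripts are made up for this
example.

```python
def linked_ids(elink_result, linkname):
    """Collect target IDs for one link name from a parsed ELink result."""
    ids = []
    for linkset in elink_result:
        for linksetdb in linkset.get("LinkSetDb", []):
            if linksetdb["LinkName"] == linkname:
                ids.extend(link["Id"] for link in linksetdb["Link"])
    return ids


def fetch_transcripts(gene_id, email, linkname="gene_nuccore_refseqrna"):
    """Return SeqRecords for the nucleotide records linked to a gene ID."""
    # Imported here so linked_ids() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: follow the link from Entrez Gene to the nucleotide database.
    handle = Entrez.elink(dbfrom="gene", db="nuccore", id=gene_id,
                          linkname=linkname)
    result = Entrez.read(handle)
    handle.close()

    ids = linked_ids(result, linkname)
    if not ids:
        return []

    # Step 2: fetch the linked records as GenBank text and parse them.
    handle = Entrez.efetch(db="nuccore", id=",".join(ids),
                           rettype="gb", retmode="text")
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()
    return records
```

Calling fetch_transcripts("2", "you@example.org") should then give a list
of SeqRecord objects, one per linked transcript, each with its sequence
in the .seq attribute. For the bulk approach, Bio.SeqIO.index() on a
downloaded FASTA file is probably the simplest way to get dictionary-like
access without loading everything into memory.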

Sean


