[Biopython] Access Entrez gene DB using rettype 'gb'

Sean Davis sdavis2 at mail.nih.gov
Fri Dec 3 01:15:28 UTC 2010


On Thu, Dec 2, 2010 at 4:42 PM, David Jacobs <
developer at allthingsprogress.com> wrote:

> Hi Sean,
>
> Thanks for the info. I didn't realize the gene database wasn't concerned
> with sequences. (The distinction isn't so clear when you're using the web
> interface.) So now I'm trying to query nucleotide. My scripting approach has
> been:
>
> 1. Get list of gene names from a file
> 2. Query nucleotide for gene ID
> 3. Use that gene ID to download the proper nucleotide entry
>
>
Hi, David.

Perhaps you can give a concrete example.  What is the starting value (gene
name, HUGO gene symbol, or Entrez Gene ID)?  What is the expected output?  You
mention the "proper nucleotide entry", but there will likely be more than one
for a given gene.  You also mention that you are interested in a specific
region of the genome--do you want the gene locus, the transcripts, the
CDS, or something else?  Finally, how many genes are we talking about here:
5-10 or thousands?

Sean


> However, every time I get an ID from nucleotide, it's for an entire genome.
> How can I specify either a) a specific gene (as identified in the gene
> database) or b) a specific region of the genome?
>
> David
>
> On Thu, Dec 2, 2010 at 4:07 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>
>>
>> Hi, David.
>>
>> Genes (in the sense used in Entrez Gene) do not have sequences.  Their
>> respective transcripts do, however, and there can be, in general, multiple
>> transcripts per gene.  Therefore, I think you would have to do a query for
>> the gene of interest and then link to nucleotide to get the sequences for
>> the associated transcripts.  If you want to do this for many genes, it may
>> be easier to download the entire RefSeq collection for your species of
>> interest and simply load it into memory or index the FASTA file.
>>
>> Sean
>>
>>
>
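
For reference, the route described in the quoted message above (look up the
gene, then follow its links into nucleotide to reach the transcripts) could be
sketched with Bio.Entrez roughly as follows.  This is only a sketch under
stated assumptions: the function names and the email argument are hypothetical,
and the link name "gene_nuccore_refseqrna" is my assumption for restricting the
links to RefSeq transcripts; check it against NCBI's list of Entrez links
before relying on it.

```python
def linked_ids(linksets, linkname=None):
    """Collect target IDs from a parsed ELink result.

    `linksets` is the list-of-dicts structure returned by Entrez.read()
    for an ELink call. If `linkname` is given, only that link type is kept.
    """
    ids = []
    for linkset in linksets:
        for linkset_db in linkset.get("LinkSetDb", []):
            if linkname is None or linkset_db.get("LinkName") == linkname:
                ids.extend(str(link["Id"]) for link in linkset_db["Link"])
    return ids


def fetch_transcripts_for_gene(gene_id, email,
                               linkname="gene_nuccore_refseqrna"):
    """Return GenBank text for the nucleotide records linked to one gene.

    Hypothetical helper: `linkname` is an assumed Entrez link name that
    should limit results to RefSeq RNA records rather than whole genomes.
    """
    # Imported here so linked_ids() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: follow the links from the Entrez Gene record into nucleotide.
    handle = Entrez.elink(dbfrom="gene", db="nuccore", id=gene_id,
                          linkname=linkname)
    linksets = Entrez.read(handle)
    handle.close()

    ids = linked_ids(linksets, linkname)
    if not ids:
        return ""

    # Step 2: download the linked records in GenBank format.
    handle = Entrez.efetch(db="nuccore", id=",".join(ids),
                           rettype="gb", retmode="text")
    text = handle.read()
    handle.close()
    return text
```

A call would look like fetch_transcripts_for_gene("<Entrez Gene ID>",
"you@example.org"), with both arguments filled in by the caller.  For
thousands of genes, the bulk RefSeq download mentioned above is likely the
better route than one ELink/EFetch round trip per gene.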


