[Biopython] Access Entrez gene DB using rettype 'gb'

David Jacobs developer at allthingsprogress.com
Fri Dec 3 21:19:13 UTC 2010


>
> > Thanks for the info. There seems to be a huge disconnect between what I want
> > to do and what this library is letting me do. It seems like there should be
> > a really simple way to look up bacterial gene sequences by their names, and
> > it's disappointing that that's not the case.
> >
> > Every workaround I've tried has also failed.
> >
> > For example, I've downloaded the full CT18 genome from the FTP server and
> > parsed it using SeqIO. The problem is that SeqRecord doesn't give me an
> > accessor to the "name" attribute of the sequence, as it would appear in the
> > gene database.
>
> You'll have to give me more to go on - what did you download by FTP,
> a FASTA file, GenBank? How about giving the URL and an example
> of the "name" you want to use.
>

In this case, I downloaded the file Brad listed earlier:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn

The "name" I'd like back is the name listed as "symbol" at
http://www.ncbi.nlm.nih.gov/gene when I query "ct18 fliC"--in other words, I
want to search the genome file I have for "fliC", and not just the
human-readable description. It seems to me like the "name" attribute for
SeqRecord would be a useful place to put this, especially since right now,
"name" is just a duplicate of the information in "id".

This already works for protein entries. See:

https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L322

The thing is, the human-readable description of each gene is already
annotated in the genome FASTA file I downloaded. I just need the symbol, as
it's easily searchable and more canonical.

> > What's more, if I search the gene database for a name, I do,
> > in fact, get an ID back. But that ID has no information about the start and
> > stop indices for my sequence, so I can't use that information in conjunction
> > with my downloaded genome.
>
> Have you looked at ELink? It is for cross-referencing between the different
> Entrez databases.
>

I'll have a look.


> > Further still, if I try to query the gene
> > database for my gene's full information (using the ID that I grabbed from
> > esearch(db=gene ...)), I get back data formatted in a way that BioPython
> > can't parse.
>
> Are you talking about using EFetch here? Which database? The valid
> combinations of retmode and rettype change according to this. See e.g.:
> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html


I'm using ESearch. As I say, I'm trying to query the "gene" database using a
name ("ct18 fliC"). I do get back just one entry, and it gives me the
correct GID. When I try to query "gene" using this ID--in order to get the
start and stop indices--the best I get is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=1248507&retmode=txt

which I don't know how to parse. (Again, I just want the start and stop
positions.)
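
One possible way to get at the start and stop positions is to request the same record with retmode=xml and pull the intervals out with the standard library. This is only a sketch: the element names (Seq-interval, Seq-interval_from, Seq-interval_to, Na-strand) and the assumption that the stored positions are 0-based come from recollection of the Entrezgene XML layout, not from anything in this thread, and genomic_intervals is a made-up helper name. A gene record usually holds several intervals (gene, products, and so on), so picking the right one is left open here.

```python
import xml.etree.ElementTree as ET

# Same record as the URL above, but asking for XML instead of text.
EFETCH_GENE_XML = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=gene&id=1248507&retmode=xml"
)


def genomic_intervals(xml_text):
    """Return (start, stop, strand) for every Seq-interval found in the XML.

    Assumption: the XML stores 0-based positions, so 1 is added to both ends
    to get 1-based inclusive coordinates. strand is the Na-strand value
    (for example "plus" or "minus"), or None if the element is absent.
    """
    root = ET.fromstring(xml_text)
    intervals = []
    for interval in root.iter("Seq-interval"):
        start = interval.findtext("Seq-interval_from")
        stop = interval.findtext("Seq-interval_to")
        if start is None or stop is None:
            continue  # skip incomplete intervals
        strand_el = interval.find("Seq-interval_strand/Na-strand")
        strand = strand_el.get("value") if strand_el is not None else None
        intervals.append((int(start) + 1, int(stop) + 1, strand))
    return intervals


# Usage (needs network access):
#   from urllib.request import urlopen
#   with urlopen(EFETCH_GENE_XML) as handle:
#       print(genomic_intervals(handle.read()))
```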

> > This is a touch aggravating.
> >
> > What am I missing?
> >
>
> The NCBI Entrez documentation is definitely sparse :(
>
> If all you want to do is get the nucleotide sequence for bacterial
> genes then I do suspect working with the FASTA or GenBank files
> would be easier than using Entrez (as Brad suggested earlier).
>

I'd rather not do this manually, though. It seems like Biopython should make
tedious tasks like this easy and sustainable.
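
For reference, the GenBank-file route suggested above could look roughly like the sketch below. It assumes a GenBank file for the same genome (called NC_003198.gbk here, a guess at the file name) in which each CDS feature carries a /gene qualifier holding the symbol; find_cds_by_gene is a hypothetical helper, not part of Biopython. It relies only on the standard SeqRecord and SeqFeature attributes (features, type, qualifiers, location, extract).

```python
def find_cds_by_gene(records, symbol):
    """Yield (record id, location, sequence) for CDS features named `symbol`.

    `records` is any iterable of SeqRecord-like objects, for example the
    iterator returned by Bio.SeqIO.parse(path, "genbank").
    """
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            # Qualifier values are lists; a feature without /gene yields [].
            if symbol in feature.qualifiers.get("gene", []):
                yield record.id, feature.location, feature.extract(record.seq)


# Usage (requires Biopython and the downloaded GenBank file):
#   from Bio import SeqIO
#   records = SeqIO.parse("NC_003198.gbk", "genbank")
#   for rec_id, location, seq in find_cds_by_gene(records, "fliC"):
#       print(rec_id, location, len(seq))
```

The location returned for each hit carries the start, end, and strand, so this would cover both the symbol lookup and the coordinates in one pass, provided the annotation really does include the symbol.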


> Can you give a specific example - couple of gene names you want,
> and desired answer (the sequence you want to find for them)? Sean did
> ask earlier - this really would help, and we'd be better able to help you.


For the example I've given over the last couple of e-mails, this is the
sequence I want:

http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true

It's linked directly from the fliC page in the "gene" database (from
the link labeled "FASTA").
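
Once the coordinates are known, the same region can probably be requested through EFetch rather than the web page. The sketch below builds such a URL with the seq_start, seq_stop, and strand parameters of EFetch. It assumes that strand=true in the web link corresponds to the minus strand (strand=2 in EFetch); subsequence_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def subsequence_url(accession, start, stop, minus_strand=False):
    """Build an EFetch URL for a FASTA subsequence (1-based, inclusive)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if minus_strand else 1,  # 1 = plus, 2 = minus
    }
    return EFETCH + "?" + urlencode(params)


# Coordinates taken from the link above; minus strand is an assumption.
url = subsequence_url("NC_003198", 2011173, 2012693, minus_strand=True)
print(url)
```

With Biopython, the same keyword arguments should be accepted by Bio.Entrez.efetch (db="nuccore", id="NC_003198", rettype="fasta", retmode="text", seq_start=2011173, seq_stop=2012693, strand=2), and the returned handle can then be read with SeqIO.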

Does that make things clearer?

David


