[Biopython] Access Entrez gene DB using rettype 'gb'

David Jacobs developer at allthingsprogress.com
Fri Dec 3 19:38:59 UTC 2010


Thanks for the info. There seems to be a huge disconnect between what I want
to do and what this library is letting me do. It seems like there should be
a really simple way to look up bacterial gene sequences by their names, and
it's disappointing that that's not the case.

Every workaround I've tried has also failed.

For example, I've downloaded the full CT18 genome from the FTP server and
parsed it using SeqIO. The problem is that SeqRecord doesn't give me an
accessor to the "name" attribute of the sequence, as it would appear in the
gene database. What's more, if I search the gene database for a name, I do,
in fact, get an ID back. But that ID has no information about the start and
stop indices for my sequence, so I can't use that information in conjunction
with my downloaded genome. Further still, if I try to query the gene
database for my gene's full information (using the ID that I grabbed from
esearch(db=gene ...)), I get back data formatted in a way that Biopython
can't parse.

This is a touch aggravating.

What am I missing?
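
One possible way around this, sketched under assumptions rather than offered as the library's recommended method: the GenBank flat file for a genome (unlike the plain FASTA file) usually carries gene names as /gene qualifiers on its features, and SeqIO exposes those through record.features. The helper name find_gene, the toy record, and the file name NC_003198.gbk in the closing comment are illustrative assumptions, and the lookup only works if the annotation actually uses the name you search for.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def find_gene(record, gene_name):
    """Return (feature, sequence) for the first gene/CDS feature whose
    /gene qualifier equals gene_name, or (None, None) if nothing matches."""
    for feature in record.features:
        if feature.type not in ("gene", "CDS"):
            continue
        # Qualifier values are stored as lists of strings.
        if gene_name in feature.qualifiers.get("gene", []):
            # extract() slices the parent sequence and reverse-complements
            # it when the feature lies on the minus strand.
            return feature, feature.extract(record.seq)
    return None, None


# Toy record standing in for a real genome (hypothetical data).
toy = SeqRecord(Seq("AAAACCCGTTTT"), id="toy")
toy.features.append(
    SeqFeature(FeatureLocation(4, 8, strand=-1), type="gene",
               qualifiers={"gene": ["fliC"]})
)
feature, gene_seq = find_gene(toy, "fliC")
print(feature.location, gene_seq)

# With a real file (the file name is an assumption):
#   from Bio import SeqIO
#   record = SeqIO.read("NC_003198.gbk", "genbank")
#   feature, gene_seq = find_gene(record, "fliC")
```

The returned feature also carries the start and stop coordinates in feature.location, which is the piece of information the gene ID alone did not provide.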

On Fri, Dec 3, 2010 at 9:44 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Fri, Dec 3, 2010 at 1:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > David;
> >
> >> As an example of what I'm doing: say I want to find the sequence for
> >> fliC in Salmonella Typhi CT18. Since nucleotide is only giving me entire
> >> genomes, I've queried the gene database instead. The query is "fliC ct18"
> >> and it gives me one entry:
> >>
> >> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18
> >>
> >> Now I want the raw sequence for that gene. The sequence that shows up
> >> when I click "FASTA" on the above page:
> >>
> >> http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true
> >
> > The best approach here might be to download the FASTA files for your
> > bacteria of interest, and then extract the sequences you need that
> > way. For your example, this file has the genes pre-sliced:
> >
> >
> > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn
> >
>
> For bacteria (well, prokaryotes) I agree the files on the NCBI FTP
> site (FASTA files and others like the GenBank flat files) are very
> handy. This is certainly worth looking at.
>
> > Using EUtils is hard here because there isn't an official identifier
> > for the sequence you are interested in. In this case you'll have to
> > pull down the genome and then subset it yourself based on the
> > coordinates.
>
> Actually you should be able to get just the subsequence of interest
> via EFetch, see the seq_start and seq_stop parameters:
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
>
> Peter
>
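
Peter's EFetch suggestion could look roughly like the following sketch. It only builds the request URL, using the accession and coordinates from the FASTA link quoted above, and does not contact NCBI. The function name build_efetch_url is made up for illustration, and requesting the minus strand (strand=2) is an assumption based on strand=true in that link.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, start, stop, minus_strand=False):
    """Build an EFetch URL for a FASTA subsequence (1-based, inclusive)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        # EFetch uses strand=1 for the plus strand and strand=2 for minus.
        "strand": 2 if minus_strand else 1,
    }
    return EFETCH_URL + "?" + urlencode(params)


# Coordinates taken from the nuccore link in the quoted message.
url = build_efetch_url("NC_003198", 2011173, 2012693, minus_strand=True)
print(url)
```

Bio.Entrez.efetch forwards extra keyword arguments to EFetch, so the same seq_start, seq_stop and strand values should be usable there as well, with the returned handle read via SeqIO.read(handle, "fasta").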


