[Biopython] Access Entrez gene DB using rettype 'gb'

Fri Dec 3 14:44:58 UTC 2010

On Fri, Dec 3, 2010 at 1:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> David;
>
>> As an example of what I'm doing: say I want to find the sequence for fliC in
>> Salmonella Typhi CT18. Since nucleotide is only giving me entire genomes,
>> I've queried the gene database instead. The query is "fliC ct18" and it
>> gives me one entry:
>>
>> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18
>>
>> Now I want the raw sequence for that gene. The sequence that shows up when I
>> click "FASTA" on the above page:
>>
>> http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true
>
> The best approach here might be to download the FASTA files for your
> bacteria of interest, and then extract the sequences you need that
> way. For your example, this file has the genes pre-sliced:
>
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn
>

For bacteria (well, prokaryotes) I agree the files on the NCBI FTP
site (FASTA files and others like the GenBank flat files) are very
handy. This is certainly worth looking at.
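
As a rough sketch of the FTP route: once NC_003198.ffn has been
downloaded, Bio.SeqIO can scan it for the gene you want. The helper
names and the search text here are made up for illustration, and I am
assuming each gene's FASTA header contains something you can match on
(coordinates or a product name), so look at a few headers first.

```python
def records_matching(records, text):
    """Yield records whose FASTA description line contains `text`."""
    for record in records:
        if text in record.description:
            yield record


def find_in_ffn(path, text):
    """Return all genes in a downloaded .ffn file whose header contains `text`.

    `path` is wherever you saved the file; `text` is a hypothetical
    search string that depends on how the headers are written.
    """
    # Imported here so records_matching() can be reused without Biopython.
    from Bio import SeqIO

    return list(records_matching(SeqIO.parse(path, "fasta"), text))
```

Something like find_in_ffn("NC_003198.ffn", "2011173") would then give
a list of SeqRecord objects, provided the headers do include the
coordinates.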

> Using EUtils is hard here because there isn't an official identifier
> for the sequence you are interested in. In this case you'll have to
> pull down the genome and then subset it yourself based on the
> coordinates.

Actually, you should be able to get just the subsequence of interest
via EFetch. See the seq_start and seq_stop parameters (and strand):
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
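
A minimal sketch of that with Bio.Entrez, which passes extra keyword
arguments straight through to EFetch. The function names and the
email address are placeholders, and I am guessing that strand=true in
the web URL above corresponds to the minus strand (strand=2 in
EFetch), so check the result against the FASTA shown on the web page.

```python
def efetch_region_params(accession, start, stop, minus_strand=False):
    """Build EFetch parameters for one region of a nucleotide record."""
    return {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,   # 1-based, inclusive
        "seq_stop": stop,     # 1-based, inclusive
        "strand": 2 if minus_strand else 1,  # 1 = plus, 2 = minus
    }


def fetch_region(accession, start, stop, minus_strand=False,
                 email="you@example.org"):
    """Fetch the region from NCBI and return it as a SeqRecord."""
    # Imported here so the parameter helper works without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # placeholder; NCBI asks for a real address
    handle = Entrez.efetch(
        **efetch_region_params(accession, start, stop, minus_strand))
    try:
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()
```

For the fliC example that would be
fetch_region("NC_003198", 2011173, 2012693, minus_strand=True), and
the sequence should come back 1521 bases long (stop - start + 1).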

Peter