[Biopython] Access Entrez gene DB using rettype 'gb'

Peter biopython at maubp.freeserve.co.uk
Fri Dec 3 22:42:05 UTC 2010


On Fri, Dec 3, 2010 at 9:19 PM, David Jacobs wrote:
>
> In this case, I downloaded the file Brad listed earlier:
>
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn
>
> The "name" I'd like back is the name listed as "symbol" at
> http://www.ncbi.nlm.nih.gov/gene when I query "ct18 fliC"--in other words, I
> want to search the genome file I have for "fliC", and not just the
> human-readable description. It seems to me like the "name" attribute for
> SeqRecord would be a useful place to put this, especially since right now,
> "name" is just a duplicate of the information in "id".
>
> This already works for protein entries. See:
>
> https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L322

That's not an example from parsing a FASTA file, its a record
constructed "by hand" - not really a fair comparison. The trouble
with FASTA files is there is no standard way to structure the
information in the ">" line, other than the first word is the identifier.

> The thing is, the human-readable description of each gene is already
> annotated in the genome FASTA file I downloaded. I just need the
> symbol, as it's easily searchable and more canonical.

As Brad pointed out, sadly the gene name "fliC" is not in that
FASTA file anywhere. However, it is in the GenBank file:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.gbk

You can loop over all the features, filter on type (e.g. gene or CDS)
and look at the annotation (qualifiers is a dictionary, entries are
lists of strings) for features with the gene name (or locus tag, or
database cross reference) of interest:

from Bio import SeqIO
genome = SeqIO.read("NC_003198.gbk", "gb")
for feature in genome.features:
    if feature.type=="CDS" \
    and "fliC" in feature.qualifiers.get('gene',[]):
        print feature
        print feature.extract(genome.seq)

Also have a look at this example for another way to pick out the
feature of interest:

http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

For the online approach with Entrez, Brad has replied already.

Regards,

Peter



More information about the Biopython mailing list