[Biopython] Access Entrez gene DB using rettype 'gb'

David developer at allthingsprogress.com
Tue Dec 7 20:52:29 UTC 2010


> As Brad pointed out, sadly the gene name "fliC" is not in that
> FASTA file anywhere.

I believe I mentioned that, too. For some reason, only the
human-readable descriptions are available. Since gene symbols are at
least as important as the descriptions, it's confusing to me that NCBI
doesn't include them. I guess that's neither here nor there.

> You can loop over all the features, filter on type (e.g. gene or CDS)
> and look at the annotation (qualifiers is a dictionary, entries are
> lists of strings) for features with the gene name (or locus tag, or
> database cross reference) of interest:
> 
> from Bio import SeqIO
> genome = SeqIO.read("NC_003198.gbk", "gb")
> for feature in genome.features:
>     # qualifiers maps names to lists of strings, so test membership
>     if feature.type == "CDS" and "fliC" in feature.qualifiers.get("gene", []):
>         print(feature)
>         print(feature.extract(genome.seq))
> 
> Also have a look at this example for another way to pick out the
> feature of interest:
> 
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
> 
> For the online approach with Entrez, Brad has replied already.
> 
> Regards,
> Peter

Thanks, Brad and Peter, for your solutions. I'm going to use a modified
version of Brad's Internet-based script to work with my genes of
interest. I'll keep your solution in mind, Peter, in case I need to use
the script offline in the future.
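
For the offline case, here is a minimal sketch of how I might wrap
Peter's loop so it can be reused for several genes. The function names
find_features and index_by_gene are my own, not part of Biopython; the
code only assumes a record whose features have .type and .qualifiers,
as in Peter's example.

```python
def find_features(record, gene, feature_type="CDS"):
    """Return features of one type whose 'gene' qualifier includes `gene`."""
    return [
        f for f in record.features
        if f.type == feature_type and gene in f.qualifiers.get("gene", [])
    ]


def index_by_gene(record, feature_type="CDS"):
    """Map each gene symbol to the list of features that carry it."""
    index = {}
    for f in record.features:
        if f.type != feature_type:
            continue
        # qualifier values are lists of strings, so a feature may have several names
        for name in f.qualifiers.get("gene", []):
            index.setdefault(name, []).append(f)
    return index


# Intended use with Biopython (same file as in Peter's example):
#   from Bio import SeqIO
#   genome = SeqIO.read("NC_003198.gbk", "gb")
#   for feature in find_features(genome, "fliC"):
#       print(feature.extract(genome.seq))
```

The index is only worth building when many genes are looked up in the
same record; for a single gene the plain filter is enough.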

It turns out one of the problems I was having was simply parsing XML
with Python (I am used to Ruby). Brad's example script has worked well
to solve that problem.

One question that this has prompted (for me) is: Why not extend
Biopython to support queries that NCBI does not, itself, support? I am a
newcomer, but it seems like the Entrez module is simply a wrapper for
the NCBI interfaces and protocols. But there is potential for more.

Is there any interest in extending the Entrez module to support common
tasks like bacterial gene lookups?
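
To make the question concrete, here is a rough sketch of the kind of
helper I have in mind. The names build_gene_query and lookup_gene_ids
are hypothetical, not existing Biopython functions, and the organism
name in the usage comment is only an illustration. The sketch assumes
the [Gene Name] and [Organism] search fields of the gene database and
simply wraps Entrez.esearch and Entrez.read.

```python
def build_gene_query(symbol, organism):
    """Build an Entrez Gene search term for one gene symbol in one organism."""
    return '%s[Gene Name] AND "%s"[Organism]' % (symbol, organism)


def lookup_gene_ids(symbol, organism, email, retmax=20):
    """Hypothetical helper: return Entrez Gene IDs for symbol + organism."""
    # Imported here so the query builder above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address with each request
    handle = Entrez.esearch(
        db="gene", term=build_gene_query(symbol, organism), retmax=retmax
    )
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return result["IdList"]


# Intended use (needs network access):
#   ids = lookup_gene_ids("fliC", "Salmonella enterica", "you@example.org")
```

Something like this would still be a thin layer over what NCBI already
offers; the point is only that the common case would take one call.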

David



