[Biopython] How do I retrieve information regarding isolation_source and clones using Entrez

Young Song youngcsong at gmail.com
Wed Sep 7 22:36:26 UTC 2011


Hi,

   I have spent a few hours reading the Biopython manual, and I am currently
trying to write a script that retrieves the isolation_source and clone name
for certain sequences.  This is the code that I have now:

>>from Bio import Entrez

>>Entrez.email = "youngcsong at gmail.com"
>>entrez_handle = Entrez.efetch(db="protein", id="ADJ51069", retmode="xml")
>>entrez_record = Entrez.read(entrez_handle)

>>print entrez_record[0].keys()

Then I get the following:

[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version',
u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table',
u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division',
u'GBSeq_taxonomy', u'GBSeq_comment', u'GBSeq_source-db',
u'GBSeq_references', u'GBSeq_update-date', u'GBSeq_organism',
u'GBSeq_locus']

I used the key "GBSeq_feature-table" to see what sort of values are stored
there:

>>print entrez_record[0]["GBSeq_feature-table"]

Then I get the following, which seems rather confusing:

[{u'GBFeature_quals': [{u'GBQualifier_name': 'organism',
u'GBQualifier_value': 'uncultured prokaryote'}, {u'GBQualifier_name':
'isolation_source', u'GBQualifier_value': 'contaminated river sediment'},
{u'GBQualifier_name': 'db_xref', u'GBQualifier_value':
'taxon:198431'}, {u'GBQualifier_name':
'clone', u'GBQualifier_value': 'Arthur_Kill_OTU4'}, {u'GBQualifier_name':
'environmental_sample'}, {u'GBQualifier_name': 'country',
u'GBQualifier_value': 'USA: New Jersey'}], u'GBFeature_key': 'source',
u'GBFeature_intervals': [{u'GBInterval_from': '1', u'GBInterval_to': '218',
u'GBInterval_accession': 'ADJ51069.1'}], u'GBFeature_location': '1..218'},
{u'GBFeature_quals': [{u'GBQualifier_name': 'product', u'GBQualifier_value':
'alkylsuccinate synthase'}], u'GBFeature_intervals': [{u'GBInterval_from':
'1', u'GBInterval_to': '218', u'GBInterval_accession': 'ADJ51069.1'}],
u'GBFeature_location': '<1..>218', u'GBFeature_key': 'Protein',
u'GBFeature_partial5': StringElement('', attributes={u'value': u'true'}),
u'GBFeature_partial3': StringElement('', attributes={u'value': u'true'})},
{u'GBFeature_quals': [{u'GBQualifier_name': 'region_name',
u'GBQualifier_value': 'RNR_PFL'}, {u'GBQualifier_name': 'note',
u'GBQualifier_value': 'Ribonucleotide reductase and Pyruvate formate lyase;
cl09939'}, {u'GBQualifier_name': 'db_xref', u'GBQualifier_value':
'CDD:186877'}], u'GBFeature_intervals': [{u'GBInterval_from': '1',
u'GBInterval_to': '218', u'GBInterval_accession': 'ADJ51069.1'}],
u'GBFeature_location': '<1..>218', u'GBFeature_key': 'Region',
u'GBFeature_partial5': StringElement('', attributes={u'value': u'true'}),
u'GBFeature_partial3': StringElement('', attributes={u'value': u'true'})},
{u'GBFeature_quals': [{u'GBQualifier_name': 'gene', u'GBQualifier_value':
'assA'}, {u'GBQualifier_name': 'coded_by', u'GBQualifier_value':
'GU453639.1:<1..>658'}, {u'GBQualifier_name': 'codon_start',
u'GBQualifier_value': '3'}, {u'GBQualifier_name': 'transl_table',
u'GBQualifier_value': '11'}], u'GBFeature_intervals': [{u'GBInterval_from':
'1', u'GBInterval_to': '218', u'GBInterval_accession': 'ADJ51069.1'}],
u'GBFeature_location': '1..218', u'GBFeature_key': 'CDS',
u'GBFeature_partial5': StringElement('', attributes={u'value': u'true'}),
u'GBFeature_partial3': StringElement('', attributes={u'value': u'true'})}]

It seems there are keys called GBQualifier_name and GBQualifier_value in
each qualifier, but I am not sure how to use them to get the
isolation_source and clone values (i.e. contaminated river sediment and
Arthur_Kill_OTU4).  Your help here would be very much appreciated.  Thank
you in advance.
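
For illustration, one possible way to get at those values is to find the
feature whose GBFeature_key is 'source' and turn its list of qualifiers into
a name-to-value dictionary.  This is only a sketch under assumptions: the
helper name source_qualifiers is hypothetical (it is not part of Biopython),
and the sample data is a trimmed copy of the output above, with plain dicts
standing in for the objects returned by Entrez.read.

```python
# Trimmed copy of the feature table printed above. Plain dicts and lists
# stand in for the parsed Entrez result (an assumption for this sketch).
feature_table = [
    {
        "GBFeature_key": "source",
        "GBFeature_location": "1..218",
        "GBFeature_quals": [
            {"GBQualifier_name": "organism",
             "GBQualifier_value": "uncultured prokaryote"},
            {"GBQualifier_name": "isolation_source",
             "GBQualifier_value": "contaminated river sediment"},
            {"GBQualifier_name": "clone",
             "GBQualifier_value": "Arthur_Kill_OTU4"},
            # Flag-style qualifier: it has a name but no value.
            {"GBQualifier_name": "environmental_sample"},
        ],
    },
    {
        "GBFeature_key": "Protein",
        "GBFeature_location": "<1..>218",
        "GBFeature_quals": [
            {"GBQualifier_name": "product",
             "GBQualifier_value": "alkylsuccinate synthase"},
        ],
    },
]


def source_qualifiers(features):
    """Return {qualifier name: value} for the first 'source' feature.

    Hypothetical helper. Qualifiers without a value map to None.
    """
    for feature in features:
        if feature["GBFeature_key"] == "source":
            return {
                qual["GBQualifier_name"]: qual.get("GBQualifier_value")
                for qual in feature.get("GBFeature_quals", [])
            }
    return {}


quals = source_qualifiers(feature_table)
print(quals.get("isolation_source"))
print(quals.get("clone"))
```

With the real record, the same helper would presumably be called on
entrez_record[0]["GBSeq_feature-table"].  Using .get() for the value avoids
a KeyError on qualifiers such as environmental_sample, which carry no
GBQualifier_value.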

Young
-- 
Young C. Song
Masters Student
Graduate Program in Bioinformatics
The University of British Columbia
Department of Microbiology and Immunology
2350 Health Science Mall
Vancouver, BC V6T 1Z4, Canada


