[Biopython] processing genbank file

Peter Cock p.j.a.cock at googlemail.com
Thu Jun 16 11:52:02 UTC 2011


On Thu, Jun 16, 2011 at 12:43 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> Hi to all,
> From a GenBank file I want to extract certain information. Here is my code
>
> #---------------------------------------------------------------------------------------------------------
> from Bio import SeqIO
> handle = open('NP_954888.1.gb', "rU")
> for gb_record in SeqIO.parse(handle, 'gb'):

If you've only got one record in the file, you can get rid of one loop:

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')

Since there will in general be many features in a GenBank file,
you do need this loop to look at each potential gene:

>     for gb_feature in gb_record.features:
>         if gb_feature.type == 'CDS':
>             gene = gb_feature.qualifiers['gene'][0]
>             db_xref = gb_feature.qualifiers['db_xref']

Note that not all CDS features will have a gene or db_xref qualifier,
so the lookups above may raise a KeyError exception with some files.
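
One way to guard against that is dict.get() with a default. This is only
a sketch: a plain dict stands in for gb_feature.qualifiers, the sample
contents and the "<no gene>" placeholder are made up for illustration,
and it is written for Python 3.

```python
# Stand-in for gb_feature.qualifiers: each qualifier maps to a list of
# strings. The contents here are hypothetical; note there is no 'gene' key.
qualifiers = {"db_xref": ["GeneID:309165"]}

# .get() returns the supplied default instead of raising KeyError
gene = qualifiers.get("gene", ["<no gene>"])[0]
db_xref = qualifiers.get("db_xref", [])

print(gene, db_xref)
```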

>             print gene, db_xref
>
> print gb_record.annotations['organism']
>
> #====================================================
>
> Is there any simple way to print information like gene name, GeneID etc. or
> I have to use this loop method :( for an example to print organism name I
> need to do only gb_record.annotations['organism'] while to print 'gene' id I
> need the for loop !!!!

You will need some loops in general: One single GenBank file can hold
multiple records, each of which can hold multiple features, each of which
can have multiple names and database cross-references.
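
The nesting can at least be hidden inside a small generator. The helper
name iter_cds_qualifiers and the stand-in objects below are hypothetical;
the sketch assumes only that each record has .features and each feature
has .type and .qualifiers, as in the code quoted above.

```python
from types import SimpleNamespace


def iter_cds_qualifiers(records):
    """Yield (record, qualifiers) for every CDS feature in every record."""
    for record in records:
        for feature in record.features:
            if feature.type == "CDS":
                yield record, feature.qualifiers


# Tiny stand-in record (made-up data) with the same attributes as above
demo = SimpleNamespace(
    id="demo",
    features=[
        SimpleNamespace(type="source", qualifiers={}),
        SimpleNamespace(type="CDS", qualifiers={"gene": ["abc"]}),
    ],
)

found = [(rec.id, quals) for rec, quals in iter_cds_qualifiers([demo])]
print(found)
```

With a real file you would pass SeqIO.parse('NP_954888.1.gb', 'gb') as
the records argument instead of the stand-in list.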

> Another problem is the db_xref=gb_feature.qualifiers['db_xref'] gives me
> all /db_xref entries in CDS field while I want only /db_xref="GeneID:309165"
> (or only the GeneID)...how to do that
>
> Thanks in Advance

Since you can get multiple /db_xref (or other qualifiers), when the parser
was designed a list was used for the values. You could filter that list on
what each entry starts with, e.g. by testing x.startswith("GeneID:") for
each entry x in db_xref.
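
A minimal sketch of that filter, assuming db_xref is a list of strings
as in the code quoted above. The function name gene_ids and the
"OtherDB:12345" entry are made up for illustration.

```python
def gene_ids(db_xrefs):
    """Return the text after 'GeneID:' for each entry that has that prefix."""
    prefix = "GeneID:"
    return [x[len(prefix):] for x in db_xrefs if x.startswith(prefix)]


# Hypothetical list, as might come from gb_feature.qualifiers['db_xref']
db_xref = ["OtherDB:12345", "GeneID:309165"]
print(gene_ids(db_xref))
```

Returning a list rather than a single value keeps the case of zero or
several GeneID entries explicit.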

Peter



