[Biopython] processing genbank file

Sheila the angel from.d.putto at gmail.com
Thu Jun 16 12:28:34 UTC 2011


So if I have only one file which contains only one record (say
'NP_954888.1.gb') and I want to extract information like the 'gene' name, I
can't do it in one line, e.g.
#------------------------------------------------------------------------------
# something like this will not work:
gene=gb_record.features['CDS'].qualifiers['gene'][0]
#-----------------------------------------------------------------------------

But I have to use a loop, like this:
#-----------------------------------------------------------------------------
from Bio import SeqIO

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
for gb_feature in gb_record.features:
    if gb_feature.type == 'CDS':
        gene = gb_feature.qualifiers['gene'][0]
        print(gene)
#-----------------------------------------------------------------------------

Is that right?
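
For reference, the loop can be folded into a single expression with a
generator and the built-in next(). The sketch below is only an illustration:
first_qualifier is a hypothetical helper, not part of Biopython, and the
SimpleNamespace objects are stand-ins with made-up values so the example runs
without a GenBank file. It assumes only that each feature has a .type string
and a dict-like .qualifiers, as SeqFeature objects do. With a real record you
would pass gb_record.features.

```python
from types import SimpleNamespace


def first_qualifier(features, feature_type, key, default=None):
    """Return the first value of `key` on the first matching feature.

    Features that lack the qualifier are skipped, so no KeyError is raised.
    """
    return next(
        (f.qualifiers[key][0]
         for f in features
         if f.type == feature_type and f.qualifiers.get(key)),
        default,
    )


# Stand-in features (made-up values); with Biopython, use gb_record.features.
features = [
    SimpleNamespace(type="source", qualifiers={}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["ExampleGene"]}),
]

gene = first_qualifier(features, "CDS", "gene")
print(gene)
```

The loop is still there, just hidden inside the generator; the default value
covers records with no matching feature.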


On Thu, Jun 16, 2011 at 1:52 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Jun 16, 2011 at 12:43 PM, Sheila the angel
> <from.d.putto at gmail.com> wrote:
> > Hi to all,
> > From a GenBank file I want to extract certain information. Here is my
> > code:
> >
> > #-----------------------------------------------------------------------------
> > from Bio import SeqIO
> > handle = open('NP_954888.1.gb', "rU")
> > for gb_record in SeqIO.parse(handle, 'gb'):
>
> If you've only got one record in the file, you can get rid of one loop:
>
> gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
>
> Since there will generally be many features in a GenBank file,
> you do need this loop to look at each potential gene:
>
> >     for gb_feature in gb_record.features:
> >         if gb_feature.type == 'CDS':
> >             gene=gb_feature.qualifiers['gene'][0]
> >             db_xref=gb_feature.qualifiers['db_xref']
>
> Note in the above not all CDS features will have a gene or db_xref
> qualifier - you may get a KeyError exception with some files.
>
> >             print gene, db_xref
> >
> > print gb_record.annotations['organism']
> >
> > #====================================================
> >
> > Is there any simple way to print information like the gene name, GeneID,
> > etc., or do I have to use this loop method? :( For example, to print the
> > organism name I only need gb_record.annotations['organism'], while to
> > print the 'gene' ID I need the for loop!
>
> You will need some loops in general: One single GenBank file can hold
> multiple records, each of which can hold multiple features, each of which
> can have multiple names and database cross-references.
>
> > Another problem is that db_xref=gb_feature.qualifiers['db_xref'] gives me
> > all /db_xref entries in the CDS field, while I want only
> > /db_xref="GeneID:309165" (or only the GeneID). How can I do that?
> >
> > Thanks in advance
>
> Since you can get multiple /db_xref (or other qualifiers), when the parser
> was designed a list was used for the values. You could filter on what the
> entries start with, e.g. by testing each entry in the list with
> entry.startswith("GeneID:")
>
> Peter
>
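
The filtering suggested above could look roughly like the following. This is
a minimal sketch: gene_ids is a hypothetical helper name, and the example list
only imitates the shape of gb_feature.qualifiers['db_xref']. The GeneID value
is the one from this thread; the other entry is made up.

```python
def gene_ids(db_xrefs):
    """Return the ID part of every entry of the form 'GeneID:<id>'."""
    prefix = "GeneID:"
    return [x[len(prefix):] for x in db_xrefs if x.startswith(prefix)]


# Shaped like gb_feature.qualifiers['db_xref']; "ExampleDB:0000" is made up.
db_xref = ["ExampleDB:0000", "GeneID:309165"]
print(gene_ids(db_xref))  # ['309165']
```

Because the result is a list, a feature with no GeneID entry gives an empty
list rather than an error.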


