[Biopython] processing genbank file

Thu Jun 16 13:24:02 UTC 2011

On Thu, Jun 16, 2011 at 1:28 PM, Sheila the angel
<from.d.putto at gmail.com> wrote:
> So if I have only one file which contains only 1 record (say
> 'NP_954888.1.gb' ) and I want to extract information like
>  'gene' name I can't do it in one line e.g.
> #------------------------------------------------------------------------------
> gene=gb_record.features['CDS'].qualifiers['gene'][0]
> # or something similar to this will not work

Even if there were a neat built-in way to filter the features by type,
in general there would still be multiple CDS features (often thousands),
so you'd still need to choose among them.
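
For a record with many CDS features, one way to choose is to index them by
gene name first. This is a rough sketch only: `genes_by_name` is a name made
up here (it is not a Biopython function), and the toy record is built from
`SimpleNamespace` stand-ins so the example runs without any GenBank file. It
assumes only that each feature has `.type` and a dict-like `.qualifiers`, as
Biopython's SeqFeature objects do.

```python
from types import SimpleNamespace


def genes_by_name(record):
    """Map each gene name to the list of CDS features carrying it."""
    index = {}
    for feature in record.features:
        if feature.type != "CDS":
            continue
        # Not every CDS has a 'gene' qualifier, so default to an empty list.
        for name in feature.qualifiers.get("gene", []):
            index.setdefault(name, []).append(feature)
    return index


# Hypothetical stand-in for a parsed record with several CDS features.
toy_record = SimpleNamespace(features=[
    SimpleNamespace(type="gene", qualifiers={"gene": ["geneA"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneA"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["geneB"]}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["hypothetical protein"]}),
])

cds_index = genes_by_name(toy_record)
print(sorted(cds_index))
```

The values are lists because nothing guarantees a gene name appears on only
one CDS feature.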

> #-----------------------------------------------------------------------------
> But I have to use loop as
> #-----------------------------------------------------------------------------
> gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
> for gb_feature in gb_record.features:
>       if gb_feature.type == 'CDS':
>             gene=gb_feature.qualifiers['gene'][0]
>             print(gene)
> #-----------------------------------------------------------------------------
> ?????

I've checked your example NP_954888 and it is actually a GenPept
file (a protein GenBank file), and it does have just one CDS feature.

Do you prefer this syntax?

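from Bio import SeqIO

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
# Keep only the CDS features; this GenPept record should have exactly one.
cds_features = [f for f in gb_record.features if f.type == "CDS"]
assert len(cds_features) == 1
print(cds_features[0].qualifiers['gene'][0])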
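
If you need this in more than one place, the same idea could be wrapped in a
small helper. Again this is only a sketch: `features_of_type` and
`single_qualifier` are names invented here, not part of Biopython, and the
stand-in record exists purely so the example runs without the GenPept file.
A real SeqRecord from SeqIO.read should work the same way, since the helper
touches only `.features`, `.type` and `.qualifiers`.

```python
from types import SimpleNamespace


def features_of_type(record, feature_type):
    """Return the record's features whose .type equals feature_type."""
    return [f for f in record.features if f.type == feature_type]


def single_qualifier(record, feature_type, key):
    """Return the first value of `key` from the only feature of that type.

    Raises ValueError unless exactly one such feature exists.
    """
    matches = features_of_type(record, feature_type)
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one %s feature, found %d"
            % (feature_type, len(matches))
        )
    return matches[0].qualifiers[key][0]


# Hypothetical stand-in for a single-CDS protein record.
demo_record = SimpleNamespace(features=[
    SimpleNamespace(type="source", qualifiers={"organism": ["Example organism"]}),
    SimpleNamespace(type="CDS", qualifiers={"gene": ["exampleGene"]}),
])

demo_gene = single_qualifier(demo_record, "CDS", "gene")
print(demo_gene)
```

Raising ValueError rather than using assert keeps the check active even when
Python is run with -O, which strips assert statements.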

Peter