[Biopython] Bio.GenBank .Scanner ?

Tue Oct 12 08:46:13 UTC 2010

On Mon, Oct 11, 2010 at 11:30 PM, Ara Kooser <akooser at unm.edu> wrote:
> Hello all,
>
>  I found a partial answer to my question. I've downloaded all the GenBank
> files for Strep. sp. AA4. I am using SeqIO to look at the information in the
> files. The documentation recommends using SeqIO. I am searching for the tag
> that will only extract:
>     CDS             1..5256
>                     /locus_tag="StAA4_010100030484"
>                     /coded_by="complement(NZ_ACEV01000078.1:25146..40916)"
>                     /note="COG3321 Polyketide synthase modules and related
>                     proteins"
>                     /transl_table=11
>                     /db_xref="CDD:33130"
>
> this /coded_by="complement(NZ_ACEV01000078.1:25146..40916)" line from the
> GenBank files.
> The api documentation on-line discusses the parse_feature which is what I
> think I need. I am not sure the best way to pull out that one line.

I would not recommend using Bio.GenBank.Scanner directly for this task.
If you did want to do this, you would create your own consumer class
(probably as a subclass of BaseGenBankConsumer) and use this with
the GenBankScanner object. Your consumer would ignore most of the
parsing events, and focus on the CDS coded_by qualifier information.

> My current code is:
> from Bio import SeqIO
> gb_file = "sequences.gp"
> for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank"):
>    gb_feature = gb_record.features[2]
>    print gb_feature
>
>
> Thank you for your time and help.
> Ara

Try something along these lines:

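from Bio import SeqIO
gb_file = "sequences.gp"
for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank"):
    for gb_feature in gb_record.features:
        # Skip anything that is not a CDS feature
        if gb_feature.type != "CDS":
            continue
        # qualifiers is a dictionary, e.g. with keys locus_tag and coded_by
        print(gb_feature.qualifiers)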

Now you will need some way to identify *which* of the potentially
many CDS features present in the GenBank file is the one you
care about. I would guess you got StAA4_010100030484 from
the BLAST hits, so you should filter on the locus_tag qualifier.
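
As a rough sketch of that filtering step (not tested against your files):
the helper name find_coded_by is made up for illustration, and it assumes
the qualifier values are lists of strings, which is how the GenBank parser
stores them. It only relies on each record having a .features list whose
entries have .type and .qualifiers.

```python
def find_coded_by(records, locus_tag):
    """Collect coded_by values from CDS features with the given locus_tag.

    records is any iterable of SeqRecord-like objects, for example the
    iterator returned by SeqIO.parse(...). The function name and return
    shape are illustrative choices, not part of Biopython.
    """
    matches = []
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            # Assumption: qualifier values are lists of strings.
            if locus_tag in feature.qualifiers.get("locus_tag", []):
                matches.extend(feature.qualifiers.get("coded_by", []))
    return matches

# Possible usage, assuming Biopython is installed and sequences.gp exists:
# from Bio import SeqIO
# records = SeqIO.parse(open("sequences.gp", "r"), "genbank")
# for location in find_coded_by(records, "StAA4_010100030484"):
#     print(location)
```

Using .get() with an empty list as the default means CDS features that lack
a locus_tag or coded_by qualifier are skipped rather than raising a KeyError.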

There is a related example here,
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features

Peter