[Biopython] GFF Parsing

Brad Chapman chapmanb at 50mail.com
Fri Aug 14 16:51:13 EDT 2009


Hi all;
Peter, thanks for forwarding this along.

Vipin:
> By executing this script I am able to extract gene, mRNA and exon annotation
> from specified GFF file. But I am unable to extract the CDS related
> information from GFF file.
> It will be great if you can suggest me an idea to include gene, mRNA, exon
> and CDS information in a single stretch of parsing of GFF file.

Sure, the CDS features are present in two places within the feature
tree. The first is as sub-sub features of genes:

gene -> mRNA -> CDS

the second is as sub features of proteins:

protein -> CDS

It's a bit of a confusing way to do it, in my opinion, but this
is the nesting defined in the Arabidopsis GFF file, so the
parser respects it and puts them where they are supposed to be.
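
As a rough illustration of why the same CDS can show up in two places:
in GFF3 a feature may list more than one ID in its Parent attribute,
and it then belongs under each of those parents. The sketch below uses
only the standard library. The three sample lines and their IDs are
made up for illustration (loosely modeled on the Arabidopsis layout),
and children_by_parent is a hypothetical helper, not part of the
BCBio parser.

```python
from collections import defaultdict

# Illustrative GFF3 lines (tab-separated); the IDs are made up for this sketch.
SAMPLE = "\n".join([
    "Chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "Chr1\tsrc\tprotein\t200\t800\t.\t+\t.\tID=prot1",
    "Chr1\tsrc\tCDS\t200\t400\t.\t+\t0\tParent=mrna1,prot1",
])

def parse_attributes(column):
    """Turn 'key=a,b;key2=c' into {'key': ['a', 'b'], 'key2': ['c']}."""
    attrs = {}
    for pair in column.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value.split(",")
    return attrs

def children_by_parent(gff_text):
    """Map each parent ID to the (type, start, end) of its direct children."""
    children = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        # A feature with several parents is recorded under each of them.
        for parent in attrs.get("Parent", []):
            children[parent].append((cols[2], int(cols[3]), int(cols[4])))
    return dict(children)

tree = children_by_parent(SAMPLE)
print(tree["mrna1"])   # the CDS, reached via gene -> mRNA -> CDS
print(tree["prot1"])   # the same CDS, reached via protein -> CDS
```

Here the single CDS line ends up listed under both the mRNA and the
protein, which is the double nesting described above.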

Below is an updated script which should demonstrate where the CDS
features are; you can reach them by either route, since the same CDSs
are present under both features.

The script also uses the updated parsing API, which is much cleaner
and will hopefully be what ends up in Biopython. There is some initial
documentation here:

http://www.biopython.org/wiki/GFF_Parsing

Hope this helps,
Brad

import sys

from BCBio.GFF import GFFParser

in_file = sys.argv[1]
parser = GFFParser()

# Only load the feature types of interest, and only from chromosome 1.
limit_info = dict(
        gff_type = ["protein", "gene", "mRNA", "CDS", "exon"],
        gff_id = ["Chr1"],
        )

with open(in_file) as in_handle:
    for rec in parser.parse(in_handle, limit_info=limit_info):
        print(rec.id)
        for feature in rec.features:
            # Route 1: protein -> CDS
            if feature.type == "protein":
                print(feature.type, feature.id)
                for sub in feature.sub_features:
                    if sub.type == "CDS":
                        print(sub.type)
            # Route 2: gene -> mRNA -> CDS
            elif feature.type == "gene":
                for sub in feature.sub_features:
                    if sub.type == "mRNA":
                        print(sub.type, sub.id)
                        for sub_sub in sub.sub_features:
                            if sub_sub.type == "CDS":
                                print(sub_sub.type)
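
If the nesting depth is not known ahead of time, the explicit loops
above could be replaced with a recursive walk. This is only a sketch:
Feature is a minimal stand-in class written for this example (it is
not the feature class the parser returns), and iter_by_type is a
hypothetical helper. It assumes only that each feature has type and
sub_features attributes, as used in the script.

```python
class Feature(object):
    """Minimal stand-in for a parsed feature (hypothetical, for this sketch)."""
    def __init__(self, type, id=None, sub_features=None):
        self.type = type
        self.id = id
        self.sub_features = sub_features or []

def iter_by_type(features, wanted):
    """Yield every feature of type `wanted`, at any nesting depth."""
    for feature in features:
        if feature.type == wanted:
            yield feature
        # Descend into children regardless of the parent's own type.
        for found in iter_by_type(feature.sub_features, wanted):
            yield found

# One CDS attached under both an mRNA and a protein, as in the email.
cds = Feature("CDS", "cds1")
top_level = [
    Feature("gene", "gene1", [Feature("mRNA", "mrna1", [cds])]),
    Feature("protein", "prot1", [cds]),
]

found = list(iter_by_type(top_level, "CDS"))
print(len(found))
```

Note that a CDS nested under both a gene's mRNA and a protein is
yielded once per route, so de-duplicate (for example by location) if
each CDS should be counted only once.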

