[Biopython] GFF parsing: getting features of specific proteins in gff

Brad Chapman chapmanb at 50mail.com
Wed Jan 22 01:25:04 UTC 2014


Philipp;
Thanks for the e-mail about GFF parsing and sorry for the delay in
getting back with you. I've merged your second off-list e-mail with this
and copied back to the mailing list in case other folks have
comments/thoughts to share as well.

> I just started exploring the GFF parser for some Augustus derived gff3
> files, but running into trouble when trying to collect information for
> a specific protein. Ultimately my goal is to get introns and exons for
> a specific set of genes.
[...]
> However now I'd like not to print all rec.features, but only for a
> specific gene.
>
> I found that in principle I can do something like“
> ```for rec in GFF.parse(in_handle, limit_info=limit_info):
>             if 'g1' in rec.features[0].qualifiers:
>                 GFF.write([rec], out_handle)```
>
> However this does not really solve my problem. For once it gives me
> all the genes on a contig if the search string is in
> rec.features[0]. I guess I could somehow just write the first then,
> but what seems more important if a gene I am looking for is in
> rec.features[1] or higher index

To do this you'd want to also loop over the features, so do:

for rec in GFF.parse(in_handle, limit_info=limit_info):
   for feature in rec.features:
      if 'g1' in f.qualifiers:
          GFF.write([rec], out_handle)
          break

This is definitely sub-optimal since it's a brute force loop over all of
the items in the GFF, but would work for what you need.

If speed becomes an issue, Ryan Dale's GFFUtils may be useful:

https://github.com/daler/gffutils
http://pythonhosted.org/gffutils/

It creates a SQLite database based on the GFF, so enables faster query
access by gene than the line-based parser. It doesn't yet integrate with
Biopython (that is on my overdue todo list) but provides a nice Python
API with examples in the documentation.

Hope this helps,
Brad




More information about the Biopython mailing list