[Biopython] GFF parsing: getting features of specific proteins in gff
Brad Chapman
chapmanb at 50mail.com
Wed Jan 22 01:25:04 UTC 2014
Philipp;
Thanks for the e-mail about GFF parsing and sorry for the delay in
getting back to you. I've merged your second off-list e-mail with this
one and copied the reply back to the mailing list in case other folks
have comments/thoughts to share as well.
> I just started exploring the GFF parser for some Augustus derived gff3
> files, but running into trouble when trying to collect information for
> a specific protein. Ultimately my goal is to get introns and exons for
> a specific set of genes.
[...]
> However now I'd like not to print all rec.features, but only for a
> specific gene.
>
> I found that in principle I can do something like:
> ```
> for rec in GFF.parse(in_handle, limit_info=limit_info):
>     if 'g1' in rec.features[0].qualifiers:
>         GFF.write([rec], out_handle)
> ```
>
> However, this does not really solve my problem. For one thing, it
> gives me all the genes on a contig if the search string is in
> rec.features[0]. I guess I could somehow just write the first one
> then, but more importantly, what if a gene I am looking for is in
> rec.features[1] or at a higher index?
To do this you'd want to also loop over the features, so do:

    for rec in GFF.parse(in_handle, limit_info=limit_info):
        for feature in rec.features:
            if 'g1' in feature.qualifiers:
                # Write the whole record once, then go to the next record
                GFF.write([rec], out_handle)
                break

This is definitely sub-optimal since it's a brute force loop over all of
the items in the GFF, but would work for what you need.
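
Note that the loop above still writes the whole record, so every gene on a matching contig ends up in the output. A minimal sketch of picking out just the matching features follows. It assumes the gene name is stored as a list value under the "ID" qualifier (e.g. {"ID": ["g1"]}); matching_features is a hypothetical helper, not part of Biopython or the GFF parser, and the stand-in objects only mimic the .features and .qualifiers attributes so the example runs without a GFF file:

```python
from types import SimpleNamespace


def matching_features(records, wanted_ids):
    """Yield (record, feature) pairs whose "ID" qualifier is in wanted_ids."""
    wanted = set(wanted_ids)
    for rec in records:
        for feature in rec.features:
            # Assumption: qualifier values are lists of strings,
            # e.g. {"ID": ["g1"]}; features without an ID are skipped.
            if wanted.intersection(feature.qualifiers.get("ID", [])):
                yield rec, feature


# Stand-in objects so the sketch runs on its own; real records would
# come from GFF.parse(in_handle, limit_info=limit_info).
g1 = SimpleNamespace(type="gene", qualifiers={"ID": ["g1"]})
g2 = SimpleNamespace(type="gene", qualifiers={"ID": ["g2"]})
contig = SimpleNamespace(id="contig1", features=[g2, g1])

hits = [(rec.id, f.qualifiers["ID"][0])
        for rec, f in matching_features([contig], ["g1"])]
print(hits)  # [('contig1', 'g1')]
```

Since qualifiers is a dictionary, `'g1' in feature.qualifiers` tests its keys, which is why the sketch compares against the values stored under "ID" instead. To write only the hit, one option would be to set rec.features = [feature] before calling GFF.write([rec], out_handle), bearing in mind that this modifies the record in place.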
If speed becomes an issue, Ryan Dale's GFFUtils may be useful:
https://github.com/daler/gffutils
http://pythonhosted.org/gffutils/
It creates a SQLite database based on the GFF, so enables faster query
access by gene than the line-based parser. It doesn't yet integrate with
Biopython (that is on my overdue todo list) but provides a nice Python
API with examples in the documentation.
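
To illustrate why a database helps here, below is a rough sketch of the general idea using only Python's standard sqlite3 module. This is not gffutils's schema or API: the table, column names, and build_index helper are hypothetical, and the GFF lines are made-up toy records with a simplified gene/exon hierarchy. Features are loaded once into an indexed table, after which a lookup by gene is a single query rather than a loop over the whole file:

```python
import sqlite3

# Made-up toy GFF3 records (tab-separated); not real Augustus output.
GFF_TEXT = """\
contig1\tAUGUSTUS\tgene\t1\t900\t.\t+\t.\tID=g1
contig1\tAUGUSTUS\texon\t1\t300\t.\t+\t.\tParent=g1
contig1\tAUGUSTUS\tintron\t301\t600\t.\t+\t.\tParent=g1
contig1\tAUGUSTUS\texon\t601\t900\t.\t+\t.\tParent=g1
contig1\tAUGUSTUS\tgene\t1500\t2000\t.\t-\t.\tID=g2
"""


def build_index(gff_text):
    """Load GFF lines into an in-memory SQLite table indexed by gene."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features "
        "(gene TEXT, ftype TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.execute("CREATE INDEX idx_gene ON features (gene)")
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if p)
        # Child features point to their gene via Parent; genes use ID.
        gene = attrs.get("Parent") or attrs.get("ID")
        conn.execute(
            "INSERT INTO features VALUES (?, ?, ?, ?)",
            (gene, cols[2], int(cols[3]), int(cols[4])),
        )
    conn.commit()
    return conn


conn = build_index(GFF_TEXT)
rows = conn.execute(
    "SELECT ftype, start_pos, end_pos FROM features "
    "WHERE gene = ? AND ftype IN ('exon', 'intron') ORDER BY start_pos",
    ("g1",),
).fetchall()
print(rows)  # [('exon', 1, 300), ('intron', 301, 600), ('exon', 601, 900)]
```

For real files, the gffutils documentation linked above is the place to look; it handles multi-level hierarchies (gene, transcript, exon) that this toy version ignores.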
Hope this helps,
Brad