[Biopython] GFF parsing: getting features of specific proteins in gff

Philipp Schiffer philipp.schiffer at gmail.com
Wed Jan 22 13:44:52 UTC 2014


Hi Brad,  

thanks for coming back to me on this. Works (well of course). Also thanks for the GFFUtils link. I have actually been aware of that, but wanted to figure out my own way (kind off). Well, eh, failed there I guess. But I surely learnt something, which is always the point. Also I wanted to integrate this in a larger script where I get the genes of interest from a clustering output first. Anyway, in the end it might really make sense to use the GFFUtils on lists I prepared first.   

Thanks again

Philipp

--  
Philipp Schiffer
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, 22 January 2014 at 02:25, Brad Chapman wrote:

>  
> Philipp;
> Thanks for the e-mail about GFF parsing and sorry for the delay in
> getting back with you. I've merged your second off-list e-mail with this
> and copied back to the mailing list in case other folks have
> comments/thoughts to share as well.
>  
> > I just started exploring the GFF parser for some Augustus derived gff3
> > files, but running into trouble when trying to collect information for
> > a specific protein. Ultimately my goal is to get introns and exons for
> > a specific set of genes.
> >  
>  
> [...]
> > However now I'd like not to print all rec.features, but only for a
> > specific gene.
> >  
> > I found that in principle I can do something like“
> > ```for rec in GFF.parse(in_handle, limit_info=limit_info):
> > if 'g1' in rec.features[0].qualifiers:
> > GFF.write([rec], out_handle)```
> >  
> > However this does not really solve my problem. For once it gives me
> > all the genes on a contig if the search string is in
> > rec.features[0]. I guess I could somehow just write the first then,
> > but what seems more important if a gene I am looking for is in
> > rec.features[1] or higher index
> >  
>  
>  
> To do this you'd want to also loop over the features, so do:
>  
> for rec in GFF.parse(in_handle, limit_info=limit_info):
> for feature in rec.features:
> if 'g1' in f.qualifiers:
> GFF.write([rec], out_handle)
> break
>  
> This is definitely sub-optimal since it's a brute force loop over all of
> the items in the GFF, but would work for what you need.
>  
> If speed becomes an issue, Ryan Dale's GFFUtils may be useful:
>  
> https://github.com/daler/gffutils
> http://pythonhosted.org/gffutils/
>  
> It creates a SQLite database based on the GFF, so enables faster query
> access by gene than the line-based parser. It doesn't yet integrate with
> Biopython (that is on my overdue todo list) but provides a nice Python
> API with examples in the documentation.
>  
> Hope this helps,
> Brad
>  
>  






More information about the Biopython mailing list