[Biopython] GFF parsing

Brad Chapman chapmanb at 50mail.com
Fri Feb 26 13:28:34 UTC 2010


Hi John;

> The GFF page on the BioPython wiki (http://www.biopython.org/wiki/GFF_Parsing) 
[...]
> As far as I can work out if I have biopython 1.53 and I want to parse
> GFF, I should get the latest version of the parser from:
> http://github.com/chapmanb/bcbb/tree/master/gff

That's absolutely right. The GFF parser is still under development,
so it hasn't been rolled into Biopython proper yet, and we're working
on getting the documentation together. Sorry for any confusion.

> I've tried using this to parse my 40Mb GFF file and it takes a long 
> time. From inspecting my GFF file I thought it should be able to parse 
> the records independently or does it need to parse the whole file before 
> outputting the first record?

If you call GFF.parse without any additional arguments, it parses the
entire file, building up Record and Feature objects for everything
contained there, and only then returns the organized records.

There are two different ways to limit the parsing to one section of the
file at a time: either limit by the number of lines or by the features
you are interested in. I added some text to the documentation examples
on the wiki to try to help explain the usage. Could you give it a
look now that it's better explained and see if this is helpful?
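
To make the two limiting strategies concrete, here is a minimal
standard-library sketch of the general idea. It is not the parser's own
code or API: the function name iter_gff_chunks, its chunk_size and
feature_types parameters, and the demo data are all made up for
illustration. The real arguments to GFF.parse are the ones described on
the wiki page.

```python
import io
from itertools import islice

# Tiny made-up GFF3 snippet, used only for illustration.
DEMO_GFF = (
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
    "chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "chr1\tdemo\texon\t1\t50\t.\t+\t.\tID=exon1;Parent=mrna1\n"
    "chr2\tdemo\tgene\t5\t80\t.\t-\t.\tID=gene2\n"
)


def iter_gff_chunks(handle, chunk_size=None, feature_types=None):
    """Yield lists of parsed GFF rows without reading the whole file first.

    Hypothetical helper, not part of the bcbb GFF parser.
    chunk_size    -- maximum number of kept rows per yielded list
                     (None means a single list with every kept row)
    feature_types -- optional set of column-3 values to keep
    """
    def rows():
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip blank lines, directives and comments.
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) != 9:
                continue  # not a well-formed feature line
            if feature_types is not None and cols[2] not in feature_types:
                continue  # filtered out before any object is built
            yield {
                "seqid": cols[0],
                "source": cols[1],
                "type": cols[2],
                "start": int(cols[3]),
                "end": int(cols[4]),
                "attributes": cols[8],
            }

    row_iter = rows()
    while True:
        chunk = list(islice(row_iter, chunk_size))
        if not chunk:
            return
        yield chunk


# Limit by size: at most two rows are held in memory at a time.
for chunk in iter_gff_chunks(io.StringIO(DEMO_GFF), chunk_size=2):
    print([row["type"] for row in chunk])

# Limit by feature: only gene lines are turned into rows.
for chunk in iter_gff_chunks(io.StringIO(DEMO_GFF), feature_types={"gene"}):
    print([row["attributes"] for row in chunk])
```

Either way the caller sees the first results after reading only part of
the file, which is the point of both limits. One caveat with size-based
chunks: a parent and its children can land in different chunks, so a
real parser has to take care where it breaks.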

Alternatively, there could be something especially hard about the
particular GFF file you are using. If you are still having issues
and could pass along the code and the file you are parsing, I can take
a deeper look.

Thanks for the feedback. It's really helpful: we are currently working
through use cases and designing an API for accessing GFF in the most
intuitive way. Another approach we have been discussing is having a
high-level index of the GFF file which allows retrieval by IDs, features
and locations. See the comments by Brent Pedersen and me here:

http://chapmanb.posterous.com/link-potpourri-large-file-indexing-and-analys
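
As a rough illustration of the indexing idea, the sketch below records
the byte offset of each line carrying an ID attribute, so a single
feature line can be fetched later with a seek instead of a full parse.
This is only an assumption about how such an index might look, not the
design discussed at the link above. It covers lookup by ID only (not by
feature type or location), and build_id_index, fetch_line and the demo
data are hypothetical names.

```python
import io

# Tiny made-up GFF3 snippet, as bytes so that offsets are exact.
DEMO_GFF_BYTES = (
    b"##gff-version 3\n"
    b"chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
    b"chr1\tdemo\texon\t1\t50\t.\t+\t.\tID=exon1;Parent=gene1\n"
    b"chr2\tdemo\tgene\t5\t80\t.\t-\t.\tID=gene2\n"
)


def build_id_index(handle):
    """Map each ID attribute to the byte offset of its line.

    Hypothetical helper. The handle must be opened in binary mode.
    If an ID appears on several lines, the last occurrence wins.
    """
    index = {}
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b"#") or not line.strip():
            continue
        cols = line.rstrip(b"\r\n").split(b"\t")
        if len(cols) != 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        for pair in cols[8].split(b";"):
            key, _, value = pair.partition(b"=")
            if key.strip() == b"ID" and value:
                index[value.decode()] = offset
                break
    return index


def fetch_line(handle, index, feature_id):
    """Seek straight to the indexed line and return it as text."""
    handle.seek(index[feature_id])
    return handle.readline().decode().rstrip("\r\n")


demo_handle = io.BytesIO(DEMO_GFF_BYTES)
demo_index = build_id_index(demo_handle)
print(fetch_line(demo_handle, demo_index, "gene2"))
```

Building the index still takes one pass over the file, but it does no
object construction, and the offsets could be saved to disk and reused
across runs.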

Thanks again,
Brad


