[Biopython-dev] Bio.GFF and Brad's code

Sat Apr 11 11:29:47 UTC 2009

Hi Brad,

Thanks for the examples; that clarified it a lot.
I have a couple of suggestions for making the GFF parser more generally usable and more consistent with the other parsers in Biopython.
Looking at your first example:

> from BCBio.GFF.GFFParser import GFFAddingIterator
> 
> gff_iterator = GFFAddingIterator()
> rec_dict = gff_iterator.get_all_features(gff_file)
> 
> The returned dictionary is like a dictionary from
> SeqIO.to_dict;
> keys are ids and values are SeqRecords.

It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:

from Bio import GFF
handle = open("my_gff_file.gff")
for line in handle:
    # call the appropriate GFF function on the line
    pass  # placeholder so that the loop body is valid Python
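
To make this concrete, here is a minimal sketch of what I mean by line-by-line processing. The helper name parse_gff_line is hypothetical (it is not an existing Biopython function), and the sample lines are invented; I am only assuming the usual nine tab-separated GFF columns.

```python
import io

def parse_gff_line(line):
    """Hypothetical helper: split one GFF line into its nine columns.

    Returns None for blank lines and for comment/directive lines
    (those starting with '#').
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    return fields

# Invented example data; a real script would use open("my_gff_file.gff").
handle = io.StringIO(
    "##gff-version 3\n"
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1\n"
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tParent=gene1\n"
)

rows = []
for line in handle:  # Python's own line iterator does the iterating
    fields = parse_gff_line(line)
    if fields is not None:
        rows.append(fields)
```

The point is only that the file handle already provides the iteration, so the GFF-specific code can be a plain function that handles one line at a time.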

The second point is about GFFAddingIterator.get_all_features. If this is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
Then the code looks as follows:

from Bio import GFF
handle = open("my_gff_file.gff")
rec_dict = GFF.to_dict(handle)

Another thing to consider is that IDs in a GFF file do not need to be unique. For example, consider a GFF file that stores the genome mapping locations of short sequences kept in a FASTA file. Since each sequence can map to more than one location, the GFF file can contain multiple lines for a single sequence ID, and a dictionary keyed on ID with one value per key cannot represent all of them.
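
A small sketch of the problem, using invented mapping data and plain Python rather than any existing GFF code: with an ordinary dictionary keyed on the ID, a later line silently replaces an earlier one, whereas collecting a list per ID keeps every mapping location.

```python
from collections import defaultdict

# Invented example: read_1 maps to two genomic locations.
gff_lines = [
    "read_1\tmapper\tmatch\t100\t135\t.\t+\t.\t.",
    "read_1\tmapper\tmatch\t5200\t5235\t.\t-\t.\t.",
    "read_2\tmapper\tmatch\t300\t335\t.\t+\t.\t.",
]

by_id_plain = {}                    # one value per ID: later lines overwrite earlier ones
by_id_grouped = defaultdict(list)   # a list per ID: every location is kept

for line in gff_lines:
    fields = line.split("\t")
    seq_id = fields[0]
    location = (int(fields[3]), int(fields[4]), fields[6])  # start, end, strand
    by_id_plain[seq_id] = location
    by_id_grouped[seq_id].append(location)
```

Here by_id_plain ends up with only the second location for read_1, while by_id_grouped retains both.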

The last point is about storing SeqRecords in rec_dict. A GFF file typically does not store sequences, and if it does, it is not clear which field of the GFF file would hold them. Conversely, a SeqRecord often does not contain the chromosomal location, which is exactly what the GFF file stores. So why use a SeqRecord for GFF information?

Sorry for bringing up so many issues, but I think a GFF parser will be heavily used, so we should get its design right as far as we can.

Best,

--Michiel.