[Biopython-dev] Bio.GFF and Brad's code

Brad Chapman chapmanb at 50mail.com
Mon Apr 13 13:35:39 UTC 2009


Michiel and Peter;
Thanks for your comments on this. I'm definitely open to modifying
the interface and am happy to have y'all giving feedback.

In reading through your comments, there is a bit of a disconnect
between what you are expecting the parser to do and how it is
designed right now. You are both thinking of the GFF parser as a
line-oriented parser that emits an object, like a SeqFeature, for
each line in the file. This is one way to do it, but the downsides are:

- Many features, like coding regions, are actually represented over
  multiple lines.
- As Michiel pointed out, almost all files have many repeated
  IDs (the first column). Ideally you want all of these features
  consolidated into a single SeqRecord.

The parser therefore takes a higher-level view and assumes the
user will want those two things done for them. It is designed as
an "adder" that puts features onto SeqRecord objects. A normal
use case would be:

- Use SeqIO to parse a FASTA file with the sequences => SeqRecords
- Use the GFFParser to add features from a separate GFF file to the
  SeqRecords. These are SeqFeatures, added to the right records and
  nested in a parent/child relationship as appropriate.
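To make that flow concrete, here is a minimal plain-Python sketch of the "adder" idea. Dicts stand in for SeqRecord/SeqFeature objects and the function names are illustrative, not the real API; the point is the grouping and parent/child nesting, not the exact interface:

```python
# Hedged sketch: plain dicts stand in for SeqRecord/SeqFeature so the
# "adder" idea is visible without Biopython. GFF columns are:
# seqid, source, type, start, end, score, strand, phase, attributes.

def parse_attributes(field):
    """Parse a GFF3 attribute string like 'ID=gene1;Parent=x' into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)

def add_features(records, gff_lines):
    """Attach features to the matching record, nesting children under parents."""
    by_id = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        feat = {"type": cols[2], "start": int(cols[3]), "end": int(cols[4]),
                "sub_features": []}
        if "ID" in attrs:
            by_id[attrs["ID"]] = feat
        parent = attrs.get("Parent")
        if parent and parent in by_id:
            by_id[parent]["sub_features"].append(feat)  # nest under its parent
        else:
            # create an empty record if the user did not supply one
            rec = records.setdefault(cols[0], {"id": cols[0], "features": []})
            rec["features"].append(feat)  # top-level feature on the record
    return records

gff = [
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tCDS\t100\t200\t.\t+\t0\tParent=gene1",
    "chr1\tsrc\tCDS\t300\t500\t.\t+\t0\tParent=gene1",
]
recs = add_features({}, gff)
```

Here the two multi-line CDS entries end up nested under the single gene feature on the "chr1" record, rather than as three disconnected per-line objects.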

Ideally you would parse the entire GFF file and do all this feature
adding at once. For big files this fails due to memory issues, which
is why the filtering and iterating features were introduced.

Okay, so that is the top level view. I will try to hit some of the
specifics:

> Why are the functions _gff_line_map() and _gff_line_reduce() private
> (leading underscores)?  I had thought you wanted to make the
> map/reduce approach available to people trying to parse GFF files on
> multiple threads (e.g. using disco) which would require them to use
> these two functions, wouldn't it?  If so, they should be part of the
> public API.

I don't think a standard user would want to deal with these
directly. They just parse lines into their components and build an
intermediate dictionary object. To parallelize the job, the
GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
then runs the job in parallel.
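The shape of that split can be sketched in plain Python (illustrative names only, not the actual _gff_line_map/_gff_line_reduce signatures). Because each line maps independently, the map step is what a framework like disco can farm out in parallel, with the reduce consolidating per-record results:

```python
def line_map(line):
    """Map step: one GFF line -> (record_id, intermediate feature dict)."""
    cols = line.rstrip("\n").split("\t")
    return cols[0], {"type": cols[2], "start": int(cols[3]), "end": int(cols[4])}

def line_reduce(mapped):
    """Reduce step: consolidate per-line results under each record ID."""
    out = {}
    for rec_id, feature in mapped:
        out.setdefault(rec_id, []).append(feature)
    return out

lines = ["chr1\tsrc\tCDS\t1\t10\t.\t+\t0\tID=a",
         "chr1\tsrc\tCDS\t20\t30\t.\t+\t0\tID=a"]
features = line_reduce(line_map(l) for l in lines)
```

Note how the repeated ID in the first column, which would produce two separate objects in a purely line-oriented design, is collapsed into one record entry by the reduce.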

> I don't see any support for the optional FASTA block in a GFF file.
> Is this something you intend to add later Brad?  See also my thoughts
> below for Bio.SeqIO integration.

I haven't added anything for parsing header and footer directives, but
it is on the to-do list and I have a good idea of how to handle them. If
you have a file that uses these, definitely pass it along and we can work on it.

> > I have a couple of suggestions of how to make the GFF parser more
> > generally usable, and more consistent with other parsers in Biopython.
[...]
> > It's not clear to me why we need an iterator for GFF files. Can't we
> > just use Python's line iterator instead? I would expect code like this:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > for line in handle:
> >    # call the appropriate GFF function on the line

Right, so this was tackled in the top level overview above. Michiel,
does the design make more sense now?

> > The second point is about GFFAddingIterator.get_all_features. If this
> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> > Then the code looks as follows:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > rec_dict = GFF.to_dict(handle)

Yes, except in the more common case you are adding to a dictionary
of records as opposed to generating one from scratch. My thought was
that copying the SeqIO behavior made it more confusing because it
doesn't do quite the same thing. After my explanation, what are your
thoughts?

> > Another thing to consider is that IDs in the GFF file do not need to be unique.
> > For example, consider a GFF file that stores genome mapping locations for
> > short sequences stored in a Fasta file. Since each sequence can have more
> > than one mapping location, we can have multiple lines in the GFF file for one
> > sequence ID.

Yes, this goes back to my explanation above and is why the
parser works differently from the standard SeqIO parsers. GFF ends
up being a different beast. I think it makes sense to copy useful
patterns we already have, but I don't want to confuse users with
close-but-not-the-same functionality.

> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
> > does not store sequences; if it does, it's not clear which field in the GFF file
> > does. On the other hand, a SeqRecord often does not contain the
> > chromosomal location, which is what the GFF file stores. So why use a
> > SeqRecord for GFF information?

Hopefully the SeqRecords make more sense now. What the parser is really
doing is adding SeqFeatures to SeqRecords. When the user doesn't provide
a record, it creates an empty SeqRecord with the appropriate ID and
adds SeqFeatures to it.

> If you look at the NCBI FTP site, they often provide genome sequences
> in a range of file formats including GenBank and GFF.
[...]
> Their GFF3 file only contains the features:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
> 
> Some GFF files will include the sequence too, in this case we can
> fetch it in FASTA format:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna

Right on. You would first parse the FASTA file with SeqIO's
to_dict functionality, and then feed that dictionary to the
GFF parser to add the features.
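A toy version of that two-step flow, with a minimal FASTA reader standing in for SeqIO.to_dict (illustrative code, not the real API):

```python
# Hedged sketch of the two-step flow: a tiny FASTA reader builds the
# record dict (standing in for SeqIO.to_dict), and a GFF feature adder
# would then attach features to those records by ID.

def fasta_to_dict(text):
    """Minimal FASTA parser: id -> {'id', 'seq', 'features'}."""
    records, current = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            rec_id = line[1:].split()[0]  # ID is the first word after '>'
            current = records[rec_id] = {"id": rec_id, "seq": "", "features": []}
        elif current is not None:
            current["seq"] += line.strip()
    return records

recs = fasta_to_dict(">chr1 test record\nACGTACGT\nACGT\n")
# a GFF feature adder would now append to recs["chr1"]["features"]
```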

> In principle, you could parse this FASTA file and the GFF3 file and
> put together a GenBank file - or vice versa.

Yes.

> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
> give me a SeqRecord with lots of SeqFeature objects.  If the sequence
> is present in the file, it should use that (not the case for these
> NCBI GFF3 files).  Otherwise, we wouldn't necessarily know the actual
> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
> object.  However, we can infer from the maximum feature coordinates a
> minimum sequence length.  For these NCBI GFF3 files, as there is a
> source feature this does actually give us the genome length, so this
> should work very nicely.

Using UnknownSeq is a good idea, and I will do that.
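That inference is simple to sketch (plain Python; in the real code the result would become the length of an UnknownSeq):

```python
def min_seq_length(features):
    """Lower bound on sequence length: the maximum feature end coordinate.

    When a full-length 'source' feature is present, as in the NCBI GFF3
    files, this bound is the exact sequence length.
    """
    return max((f["end"] for f in features), default=0)

feats = [{"type": "source", "start": 1, "end": 10000},
         {"type": "gene", "start": 190, "end": 255}]
length = min_seq_length(feats)  # 10000, from the source feature
```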

Whew. Michiel and Peter -- hopefully the high level intentions are a
bit more clear. Thanks for your input so far; let's hash this out so
it makes sense to everyone.

Brad
