[Biopython-dev] Bio.GFF and Brad's code

Mon Apr 13 14:19:54 UTC 2009

> Okay, so that is the top level view. I will try to hit some of the
> specifics:
>
>> Why are the functions _gff_line_map() and _gff_line_reduce() private
>> (leading underscores)?  I had thought you wanted to make the
>> map/reduce approach available to people trying to parse GFF files on
>> multiple threads (e.g. using disco) which would require them to use
>> these two functions, wouldn't it?  If so, they should be part of the
>> public API.
>
> I don't think a standard user would want to deal with these
> directly. They just parse lines into their components and build an
> intermediate dictionary object. To parallelize the job, the
> GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
> then runs the job in parallel.

Are you aware of any alternatives to disco for doing map/reduce on
Python, and does that impact your design choices?

>> I don't see any support for the optional FASTA block in a GFF file.
>> Is this something you intend to add later Brad?  See also my thoughts
>> below for Bio.SeqIO integration.
>
> I haven't added anything for parsing header and footer directives but
> it is on the to do list and I have a good idea how to handle them. Definitely
> pass along a file that uses these you want to parse and we can work on it.

There are some partial examples here:
http://www.sequenceontology.org/gff3.shtml
We should have a peep at BioPerl's unit tests and/or ask Lincoln directly.

>> > I have a couple of suggestions of how to make the GFF parser more
>> > generally usable, and more consistent with other parsers in Biopython.
> [...]
>> > It's not clear to me why we need an iterator for GFF files. Can't we
>> > just use Python's line iterator instead? I would expect code like this:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > for line in handle:
>> >    # call the appropriate GFF function on the line
>
> Right, so this was tackled in the top level overview above. Michiel,
> does the design make more sense now?
>
>> > The second point is about GFFAddingIterator.get_all_features. If this
>> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
>> > Then the code looks as follows:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > rec_dict = GFF.to_dict(handle)
>
> Yes, except in the more common cases you are adding to a dictionary
> of records as opposed to generating one from scratch. My thought was
> that copying the SeqIO behavior made it more confusing because it
> doesn't do quite the same thing. After my explanation, what are your
> thoughts?

Maybe there is a role for a to_dict() function for when you start from
scratch, but as you say, it does sound like there is a general need to
add to an existing dict.

>> > Another thing to consider is that IDs in the GFF file do not need to be unique.
>> > For example, consider a GFF file that stores genome mapping locations for
>> > short sequences stored in a Fasta file. Since each sequence can have more
>> > than one mapping location, we can have multiple lines in the GFF file for one
>> > sequence ID.
>
> Yes, this goes back to my explanation above and is why the
> parser works differently than the standard SeqIO parsers. GFF ends
> up being a different beast. I think it makes sense to copy useful
> patterns we have already, but don't want to confuse users with close
> by not the same functionality.
>
>> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
>> > does not store sequences; if it does, it's not clear which field in the GFF file
>> > does. On the other hand, a SeqRecord often does not contain the
>> > chromosomal location, which is what the GFF file stores. So why use a
>> > SeqRecord for GFF information?
>
> Hopefully the SeqRecords make more sense now. What it is really doing is
> adding SeqFeatures to SeqRecords. When the user doesn't provide one,
> it creates an empty SeqRecord with the appropriate ID to use and
> adds SeqFeatures to it.
>
>> If you look at the NCBI FTP site, they often provide genome sequences
>> in a range of file formats including GenBank and GFF.
>> [...]
>> Their GFF3 file only contains the features:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>>
>> Some GFF files will include the sequence too, in this case we can
>> fetch it in FASTA format:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> Right on. So you would first parse the Fasta file with the SeqIO
> parser to_dict functionality, and then feed this dictionary to the
> GFF parser to add the features.

Hmm.  I'm with you on the idea that you may need to parse a GFF file
and a separate second file to get the actual sequence (e.g. a FASTA
file), but there is more than one way to combine the two.  For a
single sequence, I was thinking more along the lines of:

from Bio import SeqIO
record = SeqIO.read(open("NC_000913.fna"),"fasta")
record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features

Or, depending on what other annotation you can extract, perhaps the
other way round would be best:

from Bio import SeqIO
record = SeqIO.read(open("NC_000913.gff"),"gff3")
record.seq = SeqIO.read(open("NC_000913.fna"),"fasta").seq

The above is pretty trivial I think, as long as we include examples of
this in our documentation.  This kind of manipulation is also file
format neutral - it would work equally well with a FASTA file and a
PTT file (assuming we add parsing NCBI protein tables to Bio.SeqIO as
outlined in my earlier email).  Or for another example, perhaps an
annotated GenBank file without the sequence (e.g. just a CONTIG
assembly line) plus a FASTA file for the full nucleotide sequence.

If the FASTA and GFF file apply to multiple sequences (e.g. a set of
contigs, rather than a single chromosome), and you have enough memory,
then something using dictionaries should work:

from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.fna"),"fasta"))
for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
    records[temp_rec.id].features = temp_rec.features

or,

from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.gff"),"gff3"))
for temp_rec in SeqIO.parse(open("NC_000913.fna"),"fasta") :
    records[temp_rec.id].seq = temp_rec.seq

(You may need to massage the keys to match up, I'm assuming here that
isn't required).

i.e. It can all be done from Bio.SeqIO without needing to dive into
Bio.GFF unless you need to do something special (e.g. filtering the
features).

>> In principle, you could parse this FASTA file and the GFF3 file and
>> put together a GenBank file - or vice versa.
>
> Yes.
>
>> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
>> give me a SeqRecord with lots of SeqFeature objects.  If the sequence
>> is present in the file, it should use that (not the case for these
>> NCBI GFF3 files).  Otherwise, we wouldn't necessarily know the actual
>> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
>> object.  However, we can infer from the maximum feature coordinates a
>> minimum sequence length.  For these NCBI GFF3 files, as there is a
>> source feature this does actually give use the genome length, so this
>> should work very nicely.
>
> Using UnknownSeq is a good idea, and I will do.

Great.

> Whew. Michiel and Peter -- hopefully the high level intentions are a
> bit more clear. Thanks for your input so far; let's hash this out so
> it makes sense to everyone.

Good plan :)

As you can probably tell, I am concentrating on getting this to match
up well with the Bio.SeqIO framework.  It will be nice to know the
underlying Bio.GFF module has more options, but I expect most people
to start with reading in a GFF file using Bio.SeqIO, and being able to
transfer their existing knowledge of SeqFeature objects learnt from
using Bio.SeqIO to read in GenBank files.

Peter