[Biopython-dev] Bio.GFF and Brad's code

Brad Chapman chapmanb at 50mail.com
Fri Apr 17 13:23:34 UTC 2009


Peter, Michiel and Jared;
Thanks for the comments. My apologies for the late reply; I've been
sick the past few days and am trying to catch back up. All your
points from the different posts are consolidated below.

[Michiel]
> First, let's discuss how to represent the information contained in
> a GFF file. SeqRecords are good if the GFF file is associated with
> a Fasta file (or contains the sequence itself), but if not it seems
> to be a bit awkward. How about the following (and I think Peter was
> hinting at the same idea):
>
> The actual parser lives in Bio.GFF, and produces Bio.GFF.Record
> objects that closely resemble the GFF file structure. For example, we
> use the GFF specified fields (<seqname> <source> <feature> <start>
> <end> <score> <strand> <frame> [attributes] [comments]) as attributes
> to Bio.GFF.Record objects.

The GFF parser right now is really generating SeqFeature objects for
each GFF line; the top-level SeqRecords are simply containers that
hold the individual features. The SeqFeature object is pretty
similar to both GFF and the generic object you are proposing. For
instance, here is a GFF line and the relevant SeqFeature attributes
for that line:

I	Orfeome	PCR_product	12759747	12764936	.	-	.	PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1

type: PCR_product
location: [12759746:12764936]
strand: -1
qualifiers: 
        Key: amplified, Value: ['1']
        Key: pcr_product, Value: ['mv_B0019.1']
        Key: source, Value: ['Orfeome']

Things are a bit more generalized as key/value pairs in qualifiers,
but the mapping is straightforward. My only suggestion would be that
we add 'start' and 'end' accessors to SeqFeature that map to
feature.location.nofuzzy_start and feature.location.nofuzzy_end,
respectively. SeqFeature is more generalized to cope with GenBank
location nastiness, but we should make the common simple case
simpler.
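
As a rough sketch, those accessors could be simple read-only
properties delegating to the location object (the placement and
exact names here are illustrative, not a final API):

class SeqFeature(object):
    # ... existing attributes: location, type, strand, qualifiers ...

    @property
    def start(self):
        # plain integer start, ignoring GenBank-style fuzziness
        return self.location.nofuzzy_start

    @property
    def end(self):
        # plain integer end, one past the last base (Python style)
        return self.location.nofuzzy_end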

> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information
> in the appropriate fields of a SeqRecord. Here, we have to think
> about two cases: Simply creating a SeqRecord based on the GFF file,
> and adding the information in the GFF file as annotations to a
> pre-existing set of SeqRecords.

Yes. Both of these cases are handled now -- a user can supply a
seed dictionary of SeqRecords to which SeqFeatures are added.
Alternatively, a new SeqRecord is created for features if one is not
provided.
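
In usage terms, something like this (parse_gff_into and the file
names are placeholders for whatever the entry point ends up being):

from Bio import SeqIO

# seed records from the FASTA file; parsed SeqFeatures get attached
# to the matching record by the GFF seqname column
seed = SeqIO.to_dict(SeqIO.parse(open("worm.fa"), "fasta"))
recs = parse_gff_into(open("worm.gff"), base=seed)  # hypothetical
# any seqname without a seed record gets a fresh, sequence-less
# SeqRecord created for it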

> Users then have a choice to use Bio.SeqIO to get SeqRecords, or
> Bio.GFF to see the "raw" GFF data, depending on their needs.
>
> How does that sound?

So we could have two ways to access the GFF file:

- An iterator that returns SeqFeature objects for each line in the
  file. No other processing is done. 

- The higher level interface that we have been discussing, which
  adds them to records and nests features (both levels are sketched
  below).
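
In code, the two levels might look something like this (the
function names are illustrative only):

# low level: one SeqFeature per GFF line, in file order, no nesting
for feature in gff_line_iterator(open("worm.gff")):
    print feature.type, feature.location

# high level: features collected into SeqRecords, with parent/child
# relationships (gene -> mRNA -> CDS) nested via sub_features
for rec in gff_record_parser(open("worm.gff")):
    for gene in rec.features:
        print gene.type, [sub.type for sub in gene.sub_features]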

My only question concerns nested features, like coding sequences.
This is a very common GFF case (see the Canonical Gene section of
the GFF3 specification at http://www.sequenceontology.org/gff3.shtml).
A raw parser iterator cannot handle these, as it needs to read
multiple lines to build up the nested feature. Is this still useful
for the use cases you were thinking of?

[Peter]
> Hmm.  I'm with you on the idea that you may need to parse a GFF file
> and a separate second file to get the actual sequence (e.g. a FASTA
> file), but there is more than one way to combine the two.  For a
> single sequence, I was thinking more along the lines of:
> 
> from Bio import SeqIO
> record = SeqIO.read(open("NC_000913.fna"),"fasta")
> record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features

Makes sense, but this only works for the case where you have a
single FASTA sequence and a single GFF file describing one record.
This is a special case for bacterial genomes and GFF from NCBI, but
doesn't work for eukaryotic GFFs and SOLiD GFF files. Do we want
different ways to use the parser for custom cases?

> If the FASTA and GFF file apply to multiple sequences (e.g. a set of
> contigs, rather than a single chromosome), and you have enough memory,
> then something using dictionaries should work:
> 
> from Bio import SeqIO
> records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta"))
> for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
>     records[temp_rec.id].features = temp_rec.features

Your intention makes good sense here, and this is more or less what
the parser is doing under the covers. Could we think about expanding
SeqIO to have functionality for this "adding to a record" case?
Something like:

from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta"))
records = SeqIO.add_to_dict(records, open("NC_000913.gff"), "gff3")

This exposes less of the actual implementation details to the user.
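
Internally that helper could just be a thin wrapper around the loop
from your example; a sketch (the helper itself is hypothetical):

from Bio import SeqIO

def add_to_dict(records, handle, format):
    # merge features from a second file into existing SeqRecords,
    # keyed on record id; unseen ids become new records
    for rec in SeqIO.parse(handle, format):
        if rec.id in records:
            records[rec.id].features.extend(rec.features)
        else:
            records[rec.id] = rec
    return records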

> As you can probably tell, I am concentrating on getting this to match
> up well with the Bio.SeqIO framework.  It will be nice to know the
> underlying Bio.GFF module has more options, but I expect most people
> to start with reading in a GFF file using Bio.SeqIO, and being able to
> transfer their existing knowledge of SeqFeature objects learnt from
> using Bio.SeqIO to read in GenBank files.

I'm really glad you are thinking about it from this angle. The limit
cases will be pretty common in real life work; most of the
eukaryotic GFF dumps from Ensembl or wherever are quite large and
are going to need some intelligent parsing to avoid running into
memory issues. I worry that if we try to put this right on top of
the existing SeqIO functionality, which deals with many different
kinds of files, we are going to clutter the interface.
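
One way to keep memory bounded is to hold only one record's worth
of lines at a time; a sketch of the idea, assuming the GFF file is
sorted by seqname (which is not guaranteed in general):

import itertools

def lines_by_seqname(handle):
    # skip comment/directive lines, then group the rest on the
    # seqname column, yielding one group at a time
    lines = (line for line in handle if not line.startswith("#"))
    for seqname, group in itertools.groupby(
            lines, lambda line: line.split("\t")[0]):
        yield seqname, list(group)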

> I have pondered a "paired file iterator" function for Bio.SeqIO for
> dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take
> TWO file handles and return SeqRecord objects.  Interestingly all the
> examples thus far are FASTA+other.  Anyway, this could be added later
> if need be.

I like the way you did this for FASTA/Qual files but am not sure it
would map nicely to GFF, for the memory reasons mentioned above.
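
For reference, a generic paired iterator along those lines might
look like the sketch below; it assumes both files list records in
the same order and one-to-one, which is exactly what breaks down
for the big GFF cases:

import itertools
from Bio import SeqIO

def paired_iterator(handle1, format1, handle2, format2):
    pairs = itertools.izip(SeqIO.parse(handle1, format1),
                           SeqIO.parse(handle2, format2))
    for rec1, rec2 in pairs:
        assert rec1.id == rec2.id, "records out of sync"
        rec1.features = rec2.features  # e.g. the FASTA+GFF case
        yield rec1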

[MapReduce]
> Are you aware of any alternatives to disco for doing map/reduce on
> Python, and does that impact your design choices?

Jared is right on; Hadoop is another MapReduce framework in wide use.
More generally, I agree with you; the distributed portion needs to
be generalized. Let's lock down the interface and local parsing, and
then I will circle around on that again.

Thanks all again for the thoughts,
Brad


