[Bioperl-l] Re: problems with Bio::Tools::GFF

Mon Nov 3 15:47:13 EST 2003

On Mon, 3 Nov 2003, Scott Cain wrote:

> On Mon, 2003-11-03 at 14:13, Jason Stajich wrote:
> > Feel free to fix it to spec Scott.
>
> Will do--I mentioned it because I am always concerned that I am
> misinterpreting the spec; if I codify my misinterpretations, that would
> kind of shoot the idea of standard out the window.

Well, I only just found the published spec online:
 http://song.sourceforge.net/gff3.shtml

I had been basing things on Lincoln's earlier emails, so I really
didn't pay much attention to all of that.

I am a bit wary of splitting on whitespace with respect to the last
column, so we'll have to cook up some test cases to make sure it goes
through okay.
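
To make the concern concrete, here is a rough sketch (in Python rather than
Perl, purely for illustration) of one such test case.  It assumes columns are
tab-separated and that the last column is split only on ";" and "=", with
URL-style escapes decoded afterwards.  The function name parse_gff3_line and
the sample line are made up for this example; this is not what
Bio::Tools::GFF does.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its 8 fixed columns and an attribute dict.

    Assumption: columns are separated by tabs only, so a literal space in
    the ninth column does not create extra columns.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attrs = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        tag, _, value = pair.partition("=")
        # Multiple values for one tag are comma-separated; decode %XX escapes last.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return cols[:8], attrs


# Hypothetical test line: a space inside Name and an escaped ';' inside Note.
line = "ctg123\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mrna0001;Name=sonic hedgehog;Note=a%3Bb"
fixed, attrs = parse_gff3_line(line)
```

Splitting the same line on any whitespace would yield ten fields instead of
nine, which is exactly the failure a test case should catch.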

> >
> > Note that I have also made no attempt to parse/write the Gap or Alignment
> > stuff in any sort of special way - I basically made it so it supports what
> > GFF2 currently looks like only in GFF3 flavor.  Perhaps it makes sense to
> > do all of that work on Chris's Unflattener though rather than in
> > Tools::GFF.  A SeqFeature::Tools::Flattener is probably in order as well to
> > turn HSPs and other paired sequences into GFF3 Alignments.
>
> I'm not sure it's necessary to move to Unflattener.  Since the format is
> fairly simple, it is only really necessary to split the information in
> the groups column to tag value pairs and let the user decide what to do
> with the information.  The only thing that I am somewhat at a loss to
> deal with is cigar line info, but I don't think that is being parsed by
> Bio::DB::GFF yet either.

One day I could imagine us building Gene/Transcript objects from the GFF3.
Actually I was thinking we'd need a Flattener to turn the Gene object back
into flattened features.  Likewise with HSP objects and alignments.  I
can't produce CIGAR lines from HSPs currently - I'm still a little
confused about how to construct them, which probably means I need to read
the spec a little more.
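
As a rough illustration of what constructing one might look like (Python,
not BioPerl code), the sketch below run-length encodes a pair of gapped,
aligned strings.  It assumes the Gap attribute convention of the current GFF3
spec: M for an aligned column, D where the target has a gap, I where the
reference has a gap.  Frameshift codes are ignored, and gap_attribute is a
made-up helper name.

```python
def gap_attribute(ref_aln, tgt_aln):
    """Build a GFF3-style Gap string from two gapped, equal-length strings.

    Assumed convention: M = both sequences have a residue, D = gap in the
    target, I = gap in the reference.  Nucleotide alignments only.
    """
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned strings must have the same length")
    runs = []  # list of [operation, run length]
    for r, t in zip(ref_aln, tgt_aln):
        if r == "-" and t == "-":
            continue  # a column of two gaps carries no information
        op = "I" if r == "-" else "D" if t == "-" else "M"
        if runs and runs[-1][0] == op:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([op, 1])  # start a new run
    return " ".join("%s%d" % (op, n) for op, n in runs)


gap = gap_attribute("CAAGACCTAAACTGGAT-TCCAAT",
                    "CAAGACCT---CTGGATATCCAAT")
print(gap)  # M8 D3 M6 I1 M6
```

An HSP's query and hit strings could presumably be fed in the same way, as
long as it is clear which of the two plays the role of the reference.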

> >
> > As for the seq stuff - will likely need a Bio::SeqIO::gff3 for that.
> >
> Ouch--I was afraid you were going to suggest that.  I suppose if we make
> it a read-only module, I guess that should be ok.  The thought of making
> it write makes my head hurt.

Writing multiple sequences could be pretty ugly.  Either some
caching OR a special write_seq which takes an arrayref.  Maybe not a SeqIO
after all....  unless GFF3 lets a new set start with
##gff-version 3
so you could interleave them?
##gff-version 3
...
##FASTA
>oneseq.1
CAGT
##gff-version 3
...
##FASTA
>oneseq.2
GATC

For reading sequences, next_seq will have to parse the entire GFF file
at once and then iterate through an internal array, I guess.  Not that
hard, I hope...
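
A minimal sketch of that read path, again in Python and only as an
illustration: it assumes a single ##FASTA directive marks the start of the
sequence section and that everything after it is plain FASTA.  The name
split_gff3 is hypothetical and not part of any existing module.

```python
def split_gff3(lines):
    """Read a whole GFF3 stream; return (feature_lines, [(seq_id, sequence), ...]).

    Assumption: one '##FASTA' directive separates features from sequences.
    """
    features = []
    seqs = []  # list of [seq_id, list of sequence chunks]
    in_fasta = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_fasta:
            if line.strip() == "##FASTA":
                in_fasta = True
            elif line and not line.startswith("#"):
                features.append(line)  # keep raw feature lines for later parsing
        elif line.startswith(">"):
            header = line[1:].split()
            seqs.append([header[0] if header else "", []])
        elif seqs and line.strip():
            seqs[-1][1].append(line.strip())  # sequence may span several lines
    return features, [(name, "".join(chunks)) for name, chunks in seqs]


example = [
    "##gff-version 3",
    "ctg123\t.\tgene\t1\t4\t.\t+\t.\tID=gene0001",
    "##FASTA",
    ">oneseq.1",
    "CA",
    "GT",
    ">oneseq.2",
    "GATC",
]
features, sequences = split_gff3(example)
seq_iter = iter(sequences)  # a next_seq-style call would just advance this
first = next(seq_iter)
```

The whole input is consumed before the first sequence is handed back, which
is the cost of having the sequences at the bottom of the file.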

>
> > Anyone is welcome to add these changes - I don't think I'll be able to
> > make many contributions until December so it would be best if someone else
> > took it on.
> >
> > -jason
> >
> > On Mon, 3 Nov 2003, Scott Cain wrote:
> >
> > > Hi Jason and Lincoln,
> > >
> > > I have a few concerns with Bio::Tools::GFF. The first is with the method
> > > _from_gff3_string, which does a split on \t to separate columns.  I
> > > think the GFF3 spec says it can be space delimited, so that should
> > > probably be \s+.  Additionally, to split the groups column, it uses
> > > \s*;\s*, but I think that spaces have to be escaped, therefore, it
> > > should only split on ; and spaces would indicate a problem (especially
> > > if one splits on spaces as indicated above).
> > >
> > > Finally, it doesn't provide a method of accessing the sequence that is
> > > optionally at the bottom of the file.  I am not exactly sure how to
> > > implement that (or I would), but I suspect it will have to be handled in
> > > the next_feature method.  Of course, the problem with handling it there
> > > is that it is not a feature.
> > >
> > > Scott
> > >
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu