[BioRuby] GFF3

Pjotr Prins pjotr.public14 at thebird.nl
Thu Aug 12 12:10:16 EDT 2010


On Fri, Aug 13, 2010 at 12:12:05AM +0900, Naohisa GOTO wrote:
> Could you please tell me the complete URL of Lincoln's
> test data?  The reason I'd like to know the origin is:
> I submitted test.gff3 to the GFF3 Validator
> (http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online)
> and it was reported as "Invalid". So, I'd like to know if this
> is intended or not, and the best way to know that is to see
> the file's development history.

The proper test data for the module, by Lincoln, is here:

http://github.com/bioperl/bioperl-live/blob/master/t/data/biodbgff/test.gff3

> > This data contains empty lines - so I modified
> > the GFF3 parser to ignore those.
> 
> How to treat empty lines is undefined in the GFF3 spec.
> (http://www.sequenceontology.org/gff3.shtml)
> It may be good to ignore empty lines.

I think so.
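
For illustration, skipping empty lines could look roughly like the
sketch below. This is plain Ruby, not the actual change to the BioRuby
GFF3 parser; the method name gff3_content_lines and the sample record
are made up for the example.

```ruby
# Hypothetical helper (not part of BioRuby): returns the lines of a
# GFF3 text with empty and whitespace-only lines dropped.
def gff3_content_lines(text)
  text.each_line.map(&:chomp).reject { |line| line.strip.empty? }
end

# Made-up sample: a header, an empty line, one record, a blank-ish line.
sample = "##gff-version 3\n\nctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001\n   \n"

gff3_content_lines(sample).each { |line| puts line }
```

Whitespace-only lines are treated as empty here too; that is an
assumption on my side, since the spec says nothing either way.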

> > Before I continue, I also wonder about the wisdom of including a
> > Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
> > @definition with @entry_id. Not only that, the sequence contains white space,
> > which does not match GFF's positioning data:
> > 
> > #<Bio::Sequence:0xb7c2b354 @entry_id="test01",
> > @source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
> > @data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
> > @definition="test01">>
> 
> You can see that FastaFormat object is stored in the @source_data.
> It will be parsed only when the sequence is really needed.
> This is a kind of lazy evaluation.

Very lazy ;)

But it duplicates the ID and carries extraneous information around,
which is not so efficient with space. We may want to change that.

The main problem is that it is not intuitive to have a FastaFormat
inside a Sequence object. But that could be just me.
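
To make the lazy evaluation idea concrete, here is a minimal sketch of
keeping the raw text and only cleaning it up on first access. It is
not how Bio::Sequence is implemented; the class LazySequence and its
methods are hypothetical, and stripping all white space is my
assumption of what is needed to match GFF positions.

```ruby
# Hypothetical class, for illustration only (not BioRuby code).
class LazySequence
  attr_reader :entry_id

  def initialize(entry_id, raw_data)
    @entry_id = entry_id
    @raw_data = raw_data  # unparsed text, stored as-is
    @seq = nil            # filled in on first call to #seq
  end

  # Parse on first access only, then reuse the cached string.
  # All white space is removed so that positions line up.
  def seq
    @seq ||= @raw_data.gsub(/\s+/, "")
  end

  # True once the raw data has been parsed.
  def parsed?
    !@seq.nil?
  end
end

s = LazySequence.new("test01", "\nACGAAGATTT\nGTATGACTGA\n\n")
puts s.parsed?     # false, nothing parsed yet
puts s.seq[0..11]  # ACGAAGATTTGT
puts s.parsed?     # true
```

Note that only one ID is stored in this sketch, which is the kind of
change I have in mind.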

> Please execute
>   puts gff3.sequences[0][0..100]
> and report what sequence is shown.
> 
> > To print FASTA I now do:
> > 
> >   gff3.sequences.each do | item |
> >     print item.to_fasta(item.entry_id, 70)
> >   end
> 
>  gff3.sequences.each do | item |
>    print item.output(:fasta)
>  end

I should have known ;)

Pj.

