[BioRuby] GFF3

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Thu Aug 12 11:12:05 EDT 2010


Hi,

On Thu, 12 Aug 2010 16:30:12 +0200
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> I intend to use GFF3 and document its use.
> 
> In my gff3 github branch (see http://github.com/pjotrp/bioruby/tree/gff3) I
> have just added a first example for fetching sequence data from GFF3. First I
> took an example from Lincoln Stein (in his BioPerl repository) and stuck that
> in ./test/data/gff/test.gff3.

Could you please tell me the complete URL of Lincoln's
test data?  The reason I'd like to know its origin is this:
I submitted test.gff3 to the GFF3 Validator
(http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online )
and it was reported as "Invalid". So, I'd like to know whether this
is intended or not, and the best way to find out is to look at
the file's development history.

> This data contains empty lines - so I modified
> the GFF3 parser to ignore those.

How to treat empty lines is undefined in the GFF3 spec
(http://www.sequenceontology.org/gff3.shtml).
So, it may be good to ignore empty lines.
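
As a rough illustration of ignoring empty lines, the following is a minimal
sketch in plain Ruby. The helper name gff3_non_blank_lines and the sample
records are hypothetical and are not part of BioRuby or of test.gff3; it only
shows the idea of filtering blank lines before the text reaches a parser.

```ruby
# Hypothetical helper (not a BioRuby method): return the lines of a
# GFF3 text with empty and whitespace-only lines removed.
def gff3_non_blank_lines(text)
  text.each_line.map(&:chomp).reject { |line| line.strip.empty? }
end

# Made-up sample data: a directive, two records, and two blank lines.
sample = [
  "##gff-version 3",
  "",
  %w[test01 . gene 1 100 . + . ID=gene01].join("\t"),
  "   ",
  %w[test01 . mRNA 1 100 . + . ID=mrna01;Parent=gene01].join("\t")
].join("\n")

puts gff3_non_blank_lines(sample)
```

Whitespace-only lines are treated like empty ones here; whether that is
desirable is a design choice the spec does not settle.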

> Before I continue, I also wonder about the wisdom of including a
> Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
> @definition with @entry_id. Not only that, the sequence contains white space,
> which does not match GFF's positioning data:
> 
> #<Bio::Sequence:0xb7c2b354 @entry_id="test01",
> @source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
> @data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
> @definition="test01">>

You can see that the FastaFormat object is stored in @source_data.
It will be parsed only when the sequence is actually needed.
This is a kind of lazy evaluation.

Please execute
  puts gff3.sequences[0][0..100]
and report what sequence is shown.
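
The lazy evaluation described above can be sketched in plain Ruby as follows.
The class LazySequence is hypothetical and is not how Bio::Sequence is
actually implemented; it only illustrates keeping the raw data (newlines
included) and producing a clean sequence on first access.

```ruby
# Illustrative only: a hypothetical class, not BioRuby's implementation.
class LazySequence
  def initialize(raw_data)
    @raw_data = raw_data  # stored untouched, white space included
    @seq = nil            # nothing is parsed yet
  end

  # Parse on first access, then reuse the cached result.
  def seq
    @seq ||= @raw_data.gsub(/\s+/, "")
  end
end

lazy = LazySequence.new("\nACGAAGATTT\nGTATGACTGA\n\n")
puts lazy.seq[0..9]  # the white space is removed once the sequence is requested
```

Under this assumption, white space seen when inspecting the stored raw data
would not by itself mean that the positions of the parsed sequence are off.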

> Now, to print FASTA I now do:
> 
>   gff3.sequences.each do | item |
>     print item.to_fasta(item.entry_id, 70)
>   end

You can use the output method instead:

  gff3.sequences.each do | item |
    print item.output(:fasta)
  end
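
For readers unfamiliar with the format, the following plain-Ruby sketch shows
what a FASTA record wrapped at 70 columns looks like. The helper fasta_record
is hypothetical and is not the BioRuby implementation of output(:fasta) or
to_fasta; the default width of 70 is taken from the quoted example.

```ruby
# Hypothetical helper (not BioRuby): build a FASTA record with the
# sequence wrapped at a fixed line width.
def fasta_record(id, seq, width = 70)
  body = seq.scan(/.{1,#{width}}/).join("\n")
  ">#{id}\n#{body}\n"
end

# Made-up sequence of 160 bases, wrapped at the default width.
print fasta_record("test01", "ACGT" * 40)
```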

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

