[BioRuby] GFF3
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Thu Aug 12 11:12:05 EDT 2010
Hi,
On Thu, 12 Aug 2010 16:30:12 +0200
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> I intend to use GFF3 and document its use.
>
> In my gff3 github branch (see http://github.com/pjotrp/bioruby/tree/gff3) I
> have just added a first example for fetching sequence data from GFF3. First I
> took an example from Lincoln Stein (in his BioPerl repository) and stuck that
> in ./test/data/gff/test.gff3.
Could you please tell me the complete URL of Lincoln's test data?
The reason I'd like to know its origin is this:
I submitted test.gff3 to the GFF3 Validator
(http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online )
and it was reported as "Invalid". So, I'd like to know whether this
is intended or not, and the best way to find that out is to look at
the file's development history.
> This data contains empty lines - so I modified
> the GFF3 parser to ignore those.
The GFF3 spec does not define how empty lines should be treated.
(http://www.sequenceontology.org/gff3.shtml)
So, it may be good to ignore empty lines.
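As a rough illustration of ignoring empty lines, the sketch below drops blank lines from GFF3 text before it reaches a parser. The helper name strip_blank_lines and the sample records are hypothetical and exist only for this example; this is not how the BioRuby GFF3 parser itself is written.

```ruby
# Hypothetical helper for illustration; not part of BioRuby.
# Returns the GFF3 text with empty or whitespace-only lines removed.
def strip_blank_lines(gff3_text)
  gff3_text.each_line.reject { |line| line.strip.empty? }.join
end

# Made-up sample data with blank lines between records.
sample = <<~GFF3
  ##gff-version 3

  test01\t.\tgene\t1\t100\t.\t+\t.\tID=gene01

  test01\t.\tmRNA\t1\t100\t.\t+\t.\tID=mrna01;Parent=gene01
GFF3

cleaned = strip_blank_lines(sample)
puts cleaned
```

Filtering the text up front leaves the parser untouched; skipping such lines inside the parser's own line loop would be an equally reasonable choice.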
> Before I continue, I also wonder about the wisdom of including a
> Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
> @definition with @entry_id. Not only that, the sequence contains white space,
> which does not match GFF's positioning data:
>
> #<Bio::Sequence:0xb7c2b354 @entry_id="test01",
> @source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
> @data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
> @definition="test01">>
As you can see, the FastaFormat object is stored in @source_data.
It is parsed only when the sequence is actually needed.
This is a kind of lazy evaluation.
Please execute
  puts gff3.sequences[0][0..100]
and report which sequence is shown.
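The lazy evaluation mentioned above can be sketched in plain Ruby as follows. The class LazyFastaSequence and its parse_count counter are hypothetical names invented for this illustration, and the code is not BioRuby's actual implementation. It only shows the general idea: the raw text, newlines included, is kept as-is, and whitespace is stripped the first time the sequence is requested.

```ruby
# Hypothetical class for illustration; this is not BioRuby's implementation.
class LazyFastaSequence
  attr_reader :entry_id, :parse_count

  def initialize(entry_id, raw_data)
    @entry_id = entry_id
    @raw_data = raw_data   # stored verbatim, line breaks included
    @seq = nil
    @parse_count = 0       # counts how often the raw text is parsed
  end

  # Parse the raw text on first access only, then reuse the cached result.
  def seq
    @seq ||= begin
      @parse_count += 1
      @raw_data.gsub(/\s+/, "")
    end
  end

  # Slice the parsed sequence, so positions are unaffected by raw line breaks.
  def [](range)
    seq[range]
  end
end

raw = "\nACGAAGATTTGT\nATGACTGATTTA\n\n"
record = LazyFastaSequence.new("test01", raw)
puts record.parse_count   # no parsing has happened yet
puts record[0..15]        # first access triggers the parse
puts record[0..15]        # second access reuses the cached sequence
puts record.parse_count
```

Under this assumption, white space in the stored raw data would not affect positions, because slicing operates on the parsed sequence rather than on the raw text.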
> Now, to print FASTA I now do:
>
> gff3.sequences.each do | item |
> print item.to_fasta(item.entry_id, 70)
> end
gff3.sequences.each do |item|
  print item.output(:fasta)
end
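For readers unfamiliar with what a FASTA formatter does with a line width such as the 70 in the quoted code, here is a minimal plain-Ruby sketch. The helper to_fasta_text is hypothetical and is not BioRuby's formatter; it only shows a definition line followed by the sequence wrapped at a fixed width.

```ruby
# Hypothetical helper for illustration; not BioRuby's own formatter.
# Builds FASTA text: a ">" definition line, then the sequence wrapped
# at `width` characters per line.
def to_fasta_text(definition, sequence, width = 70)
  wrapped = sequence.scan(/.{1,#{width}}/).join("\n")
  ">#{definition}\n#{wrapped}\n"
end

# Made-up 160-base sequence, wrapped at 70 columns.
text = to_fasta_text("test01", "ACGT" * 40, 70)
puts text
```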
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org