[BioRuby] GFF3
Pjotr Prins
pjotr.public14 at thebird.nl
Thu Aug 12 10:30:12 EDT 2010
I intend to use GFF3 and document its use.
In my gff3 github branch (see http://github.com/pjotrp/bioruby/tree/gff3) I
have just added a first example for fetching sequence data from GFF3. First I
took an example from Lincoln Stein (in his BioPerl repository) and stuck that
in ./test/data/gff/test.gff3. This data contains empty lines - so I modified
the GFF3 parser to ignore those.
Before I continue, I also wonder about the wisdom of including a
Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
@definition with @entry_id. Not only that, the sequence contains white space,
which does not match GFF's positioning data:
#<Bio::Sequence:0xb7c2b354 @entry_id="test01",
@source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
@data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
@definition="test01">>
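To illustrate the positioning problem, here is a minimal sketch in plain Ruby (no BioRuby calls). The subseq helper and the shortened, re-wrapped sequence are assumptions made up for this example; they are not part of the parser:

```ruby
# Raw FASTA body as it might sit in @data, newlines included.
# (First 20 bases of the test record, wrapped at 10 for illustration.)
raw = "\nACGAAGATTT\nGTATGACTGA\n\n"

# Strip all whitespace so string offsets line up with sequence positions.
clean = raw.gsub(/\s+/, '')

# Hypothetical helper: GFF3 coordinates are 1-based and inclusive,
# Ruby string indices are 0-based.
def subseq(seq, first, last)
  seq[(first - 1)..(last - 1)]
end

p subseq(raw, 9, 12)    # => "TTT\n"  (shifted, and contains a newline)
p subseq(clean, 9, 12)  # => "TTGT"   (bases 9..12 of the sequence)
```

With the white space left in, every GFF start/end lookup is off by the number of newlines before it.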
To print FASTA, I currently do:
gff3.sequences.each do |item|
  print item.to_fasta(item.entry_id, 70)
end
(to_fasta is being deprecated)
To get a FASTA sequence I would like to do the sane thing:
gff3.sequences.each do |item|
  rec = Bio::FastaFormat.new('> ' + item.definition.strip + "\n" + item.data)
  print rec
end
where item.data is just the clean sequence.
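As a rough sketch of what that would allow: if the record holds only a definition and a clean sequence string, FASTA text can be produced on demand in a few lines of plain Ruby. The to_fasta_text name and its width parameter are hypothetical, not an existing BioRuby method:

```ruby
# Hypothetical helper: build FASTA text from a definition and a
# whitespace-free sequence string, wrapping lines at `width` columns.
def to_fasta_text(definition, seq, width = 70)
  lines = seq.scan(/.{1,#{width}}/)
  ">#{definition.strip}\n" + lines.join("\n") + "\n"
end

print to_fasta_text("test01", "ACGTACGTAC", 4)
# >test01
# ACGT
# ACGT
# AC
```

The wrapping is then purely an output concern and never touches the stored sequence.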
The current implementation is rather unintuitive. I realise GFF3 contains
FASTA, but there is no reason to store it like that. How about removing the
contained Bio::FastaFormat and just use a sequence string? And remove the white
space by default?
It also does away with FASTA formatting - the to_fasta in GFF3.
I can make the changes, if you agree.
Pj.