[BioRuby] GFF3
Pjotr Prins
pjotr.public14 at thebird.nl
Sun Jan 2 12:04:48 UTC 2011
The GFF3 plugin works rather well. Anyone with Ruby 1.9.x installed can
just type, as a regular user:
gem install bio-gff3
and BioRuby itself gets installed too, if needed. Next you can type,
for example,
gff3-fetch mRNA test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3
to assemble all mRNA.
Unfortunately I am finding some problems with the data. For example,
the reading frame is *wrong* for a predicted gene in this WormBase
data file. The contig starts as:
>MhA1_Contig3426
TTAATAAATTTAATTCATTAAAATTTTAAAAAGAAAGGGACATTCGAGGGGAAATGAGAGAGAACGAGAGAAAATGGACG
GGAAATTAAATTAAAAAATAAAAAATTAATTTTTATTTTTTTTTATTTAATTTAAAATTAATTTTCTACATTTATTAAAT
CTTAAATTATTAATTTTAAATTAATTTAAAG GCATCCAACAACAACAATTAGAAGTCTTTCCCAGCTCCTCCTCTGCCCC
TCAGCAACAACAATACCCAGCGCAGCAGCTTCAATTAGTTACTCCTTTTATTGCATGCATAGCAGATGAATTGAGGGAGT
TGATAGATGAAATGCGTATGTTTTAG AATATTTTTTAAAAAAAAATTAAAAAAAATTTTTTTTTGCCAAACAGGCTCTCG
and the full record is:
##gff-version 3
##sequence-region MhA1_Contig3426 1 2029
# Gene gene:MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase gene 192 346 . + . ID=gene:MhA1_Contig3426.frz3.gene1;Name=MhA1_Contig3426.frz3.gene1;Note=PREDICTED protein_coding;public_name=MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase mRNA 192 346 . + . ID=transcript:MhA1_Contig3426.frz3.gene1;Parent=gene:MhA1_Contig3426.frz3.gene1;Name=MhA1_Contig3426.frz3.gene1;public_name=MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase exon 192 346 . + . ID=exon:MhA1_Contig3426.frz3.gene1.1;Parent=transcript:MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase CDS 192 346 . + 0 ID=cds:MhA1_Contig3426.frz3.gene1;Parent=transcript:MhA1_Contig3426.frz3.gene1
So the CDS is on the forward strand, starts at 192, and has phase 0. The actual sequence is
GCATCCAACA ACAACAATTA GAAGTCTTTC CCAGCTCCTC CTCTGCCCCT CAGCAACAAC AATACCCAGC GCAGCAGCTT
CAATTAGTTA CTCCTTTTAT TGCATGCATA GCAGATGAAT TGAGGGAGTT GATAGATGAA ATGCGTATGT TTTAG
which translates through to the final TAG stop codon only in frame 2(!);
frames 0 and 1 run into a premature stop. This is not compliant with
GFF3 in any interpretation. It turns out that in this particular GFF3
file this happens only with the *first* ORF on every contig, so it is
probably a bug in the gene predictor used. None of the other genes are
in the wrong frame.
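The frame problem can be reproduced with a few lines of plain Ruby. This
is only a sketch, not the plugin's code: the helper names
(feature_sequence, codons, reads_through?, open_offsets) are made up for
this example, and it assumes the standard genetic code and treats an
offset as acceptable when no stop codon appears before the last codon.

```ruby
# GFF3 coordinates are 1-based and inclusive; Ruby strings are 0-based.
def feature_sequence(contig, first, last)
  contig[(first - 1)..(last - 1)]
end

# Split seq into codons starting at a 0-based offset (0, 1 or 2);
# an incomplete trailing codon is dropped.
def codons(seq, offset)
  seq[offset..-1].scan(/.{3}/)
end

# True when no stop codon (standard code) occurs before the last codon.
def reads_through?(seq, offset)
  stops = %w[TAA TAG TGA]
  codons(seq, offset)[0...-1].none? { |codon| stops.include?(codon) }
end

# Offsets (0..2) at which seq has no premature stop codon.
def open_offsets(seq)
  (0..2).select { |offset| reads_through?(seq, offset) }
end

# First 400 bases of MhA1_Contig3426, as quoted in this message.
contig = %w[
  TTAATAAATTTAATTCATTAAAATTTTAAAAAGAAAGGGACATTCGAGGGGAAATGAGAGAGAACGAGAGAAAATGGACG
  GGAAATTAAATTAAAAAATAAAAAATTAATTTTTATTTTTTTTTATTTAATTTAAAATTAATTTTCTACATTTATTAAAT
  CTTAAATTATTAATTTTAAATTAATTTAAAGGCATCCAACAACAACAATTAGAAGTCTTTCCCAGCTCCTCCTCTGCCCC
  TCAGCAACAACAATACCCAGCGCAGCAGCTTCAATTAGTTACTCCTTTTATTGCATGCATAGCAGATGAATTGAGGGAGT
  TGATAGATGAAATGCGTATGTTTTAGAATATTTTTTAAAAAAAAATTAAAAAAAATTTTTTTTTGCCAAACAGGCTCTCG
].join

cds = feature_sequence(contig, 192, 346) # CDS coordinates from the record
declared_phase = 0                       # column 8 of the CDS line

puts "CDS length:     #{cds.length}"
puts "declared phase: #{declared_phase}"
puts "open offsets:   #{open_offsets(cds).inspect}"
```

For this record the declared phase is 0, but the only offset that reads
through to the end of the CDS is 2.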
I informed WormBase some time ago, but I don't have the
impression that anyone is interested. You can check the record at
http://www.wormbase.org/db/gb2/gbrowse/m_hapla/?name=id:2258995;dbid=m_hapla:database
I am going to add an option to the GFF3 plugin to test for valid
reading frames, so these files give the expected results. That will be
good for validation anyway.
Pj.