[Bioperl-l] bp_seqfeature_load of latest Flybase GFF annotation fails due to data inconsistency.

Cook, Malcolm MEC at stowers-institute.org
Tue Jan 9 19:38:48 UTC 2007


Drat!

bash> bp_seqfeature_load.PLS --fast \
    --dsn 'dbi:mysql:database=dmel_r5_1;host=mysql-dev' \
    --create --noverbose \
    <( flygenegff ./flybase.net/genomes/Drosophila_melanogaster/dmel_r5.1/gff/*.gff )


(note: `flygenegff`, used above, sorts and filters the GFF input so that
the GFF features are loaded in the order needed: gene before mRNA before
exon)

This worked fine with the last release of Flybase.  But now I get:

------------- EXCEPTION  -------------
MSG: FBtr0110936 doesn't have a primary id
STACK Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:682
STACK Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:663
STACK Bio::DB::SeqFeature::Store::GFF3Loader::finish_load /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:372
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:345
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:242
STACK toplevel /home/mec/cvs/bioperl-live/scripts/Bio-SeqFeature-Store/bp_seqfeature_load.PLS:76

And indeed, sleuthing the data proves that FBtr0110936 is an example of
a Flybase transcript identifier that is annotated as being one of the
multiple parents of exons but that does not itself have an entry in
Flybase!

Proof: 

`grep FBtr0110936 dmel_r5.1/gff/*.gff` returns only exon features (no
gene, CDS, UTR, or mRNA)

... whereas grepping for any of the other three transcripts mentioned
as parents of those exons yields the expected additional features of type
mRNA, protein, CDS, etc.

By the way, this data-bug manifests itself when searching the Flybase
website (FB2006_01, released December 8, 2006) for transcript
FBtr0110936 as:

"ERROR: report for FBtr0110936 not found"

I wonder if anyone can tell me what causes this data problem, and tell
me whether it is ubiquitous (i.e. are there other transcripts mentioned
as exon parents that do not have their own feature)?
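
To get a feel for how widespread this is, something along these lines could list every Parent ID that never appears as a feature's own ID. This is only a sketch in Python, not part of BioPerl: the function name `find_dangling_parents` and the sample lines are made up for illustration, and it assumes tab-separated GFF3 with `ID=` and `Parent=` attributes in column 9.

```python
from collections import defaultdict


def find_dangling_parents(lines):
    """Return {parent_id: [child feature types]} for every Parent ID
    that is referenced but never defined via an ID= attribute.

    `lines` is any iterable of GFF3 lines (hypothetical helper, not a
    BioPerl API). Feed it the lines of ALL the .gff files together, since
    a parent could be defined in a different file than its children.
    """
    defined = set()
    referenced = defaultdict(set)  # parent ID -> types of referring features
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section: no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        for field in cols[8].split(";"):
            key, sep, value = field.strip().partition("=")
            if not sep:
                continue
            if key == "ID":
                defined.add(value)
            elif key == "Parent":
                # Parent may hold several comma-separated IDs
                for parent in value.split(","):
                    referenced[parent].add(cols[2])
    return {p: sorted(t) for p, t in referenced.items() if p not in defined}
```

With real data one might pass `fileinput.input(paths)` as `lines` and print the resulting dictionary; an empty result would mean every referenced parent has its own feature.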

I am trying to load this latest Flybase GFF into Lincoln Stein's
Bio::DB::SeqFeature database (using bp_seqfeature_load) but the load
fails due to this data problem.  Any recommendations/workarounds for
this issue are quite welcome.
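
One workaround I can imagine, pending a fix in the data, is to pre-filter the GFF so that Parent references to undefined IDs are removed before loading. The following is a minimal Python sketch under that assumption; `strip_dangling_parents` is a hypothetical helper, dropping features left with no valid parent is a design choice rather than anything Flybase or BioPerl prescribes, and I have not verified that the filtered output loads cleanly with GFF3Loader.

```python
def strip_dangling_parents(lines):
    """Remove Parent references to IDs that are never defined.

    Hypothetical pre-filter, not a BioPerl API. A feature whose Parent
    attribute would end up empty is dropped entirely (an assumption;
    keeping it as a parentless feature would be the alternative).
    Non-feature lines (comments, directives, FASTA) pass through unchanged.
    """
    lines = list(lines)

    def feature_cols(line):
        cols = line.rstrip("\n").split("\t")
        return cols if len(cols) == 9 and not line.startswith("#") else None

    # Pass 1: collect every ID that is actually defined.
    defined = set()
    for line in lines:
        cols = feature_cols(line)
        if cols:
            for field in cols[8].split(";"):
                if field.startswith("ID="):
                    defined.add(field[len("ID="):])

    # Pass 2: rewrite Parent attributes, keeping only defined parents.
    out = []
    for line in lines:
        cols = feature_cols(line)
        if cols is None:
            out.append(line)
            continue
        fields, orphaned = [], False
        for field in cols[8].split(";"):
            if field.startswith("Parent="):
                kept = [p for p in field[len("Parent="):].split(",")
                        if p in defined]
                if not kept:
                    orphaned = True  # every parent was dangling
                    break
                field = "Parent=" + ",".join(kept)
            fields.append(field)
        if orphaned:
            continue
        cols[8] = ";".join(fields)
        out.append("\t".join(cols) + "\n")
    return out
```

Such a filter could sit in the same place as `flygenegff` in the command above, although it discards information and so only masks the underlying annotation problem.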


Malcolm Cook
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, Missouri
 



