[Bioperl-l] help creating de novo GFF3

Joshua Orvis jorvis at gmail.com
Fri Aug 15 15:45:23 EDT 2008


I don't have a lot of experience with Bioperl and have used it mostly for
simple format conversions or parsing Genbank files.  I need to create a
quick script to create GFF3 and decided to give bioperl a try again instead
of just printing the columns myself but have had a few problems.  My
apologies for the narrative here but I know it can sometimes be informative
to hear 'how' a user arrived at a problem rather than just knowing the
problem itself.

Is there a documented explicit mapping between the GFF3 columns and the
predefined tags (ID, Name, etc.) and their Bioperl object attribute
equivalents?  Is it preferrable to create Bio::SeqFeature::Generic objects
and pass them to Bio::Tools::GFF->write_feature or rather to create
Bio::SeqFeature::Annotated and pass them to Bio::FeatureIO::gff ?   I may be
overlooking it, but a simple tutorial showing how to create and define a new
sequence object, attach annotations to it and dump in GFF format seems to be
missing.  This seems like a basic thing to do - most of the documentation I
find is about converting between formats rather than creating new
annotation.

Here are some of the problems I (a typical naive user?) ran into when
adventuring with bioperl here.   My first attempt resulted in the string
"SEQ" as column 0 in all my GFF output.  I thought that maybe this was
because my features weren't 'attached' to a sequence, so I created a
Bio::Seq::RichSeq object and tried both (separately):

$seq->add_SeqFeature( $feat );

and

$feat->attach_seq( $seq );

Neither changed the first column of output.  Looking at the
docs.bioperl.orgmethods for Bio::SeqFeature::Generic I found the
seq_id attribute, which
came with the warning: "This attribute should *not* be used in GFF dumping"
- but since it's the only thing I did that worked, I used it anyway.

Next I wanted to have ID tags within my last column.  I first tried setting
all the relevant attributes I could see on my features (id, primary_tag,
display_name, display_id, etc.) but none of these caused ID=? to be
created.  Next, I tried something like this:


my $feat = new Bio::SeqFeature::Annotated (
        -start        => $start,
        -end          => $end,
        -strand       => $strand,
        -primary      => 'gene',
        -seq_id       => $asmbl_id,  ## this works but is discouraged
        -tag          => { ID => $transcript->{pub_locus},
                           product_name => $transcript->{com_name},
                           ec_number => $transcript->{'ec#'},
                           gene_symbol => $transcript->{gene_sym}
                         }
);

My hopes that passing it via the -tag option would do the trick failed, as
it created a line like this instead:

10263   .       gene    58512   56983   .       +       .       iD=AN9220.4;

Notice the 'ID' -> 'iD' transformation (without any command-line warnings).

I'm still stuck on this one (Parent would be next) but overall guidance or
pointers to a tutorial/documentation I'm overlooking would be great.

JO



More information about the Bioperl-l mailing list