[Bioperl-l] GFF to GTF converter

Chris Fields cjfields at illinois.edu
Thu Mar 11 13:02:43 EST 2010


On Mar 11, 2010, at 9:56 AM, Alexander Kanapin wrote:

> Hi BioPerl gurus,
> 
> Does anybody know of a reliable GFF to GTF converter that can generate files accepted by Cufflinks?
> 
> We attempted to convert Drosophila and worm genome GFFs (taken from the FlyBase and WormBase FTP sites) to GTF with Bio::FeatureIO:
> 
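> use Bio::FeatureIO;
> 
> # read from a file
> my $in  = Bio::FeatureIO->new(-file => $infile, -format => 'GFF');
> 
> # write out features (version 2.5 requests GTF-style output)
> my $out = Bio::FeatureIO->new(-file    => ">$outfile",
>                               -format  => 'GFF',
>                               -version => 2.5);
> 
> # copy every feature from the input stream to the output stream
> while (my $feature = $in->next_feature) {
>     $out->write_feature($feature);
> }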
> 
> However, we discovered that the resulting file does not comply with the GTF format specification described here: http://mblab.wustl.edu/GTF22.html

Just so this is clear, even though the FeatureIO docs currently state (and I quote):

"[Bio::FeatureIO] is the officially sanctioned way of getting at the format objects, which most people should use."

it is nowhere near complete, so I have removed that quote from main trunk and replaced it with a very explicit caveat about its current state, i.e. highly experimental and not currently recommended for production use.  It's basically half-baked right now.  I am in the midst of refactoring Bio::FeatureIO to get it up to speed and to add flexibility when parsing this data (I'm actually working on it right now), but it's early days on that and it may take a while.

Do realize that, even with a refactored FeatureIO, one of the more significant problems with GTF will remain: there are too many definitions of what constitutes GTF or GFF2, so there is no clear path on how to go about this.  At this point most users end up writing their own parsers, unfortunately.
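
To give an idea of what such a hand-rolled conversion tends to look like, here is a minimal sketch in plain Perl (no BioPerl) that rewrites the attribute column of one GFF3 line into the 'gene_id "..."; transcript_id "...";' form described in the GTF 2.2 document linked above.  The sub names, the %gene_of lookup and the FlyBase-style IDs are all made up for illustration, and it assumes the feature's Parent attribute holds the transcript ID, which is not true of every GFF3 source.

```perl
use strict;
use warnings;

# Parse a GFF3 column-9 string ("ID=x;Parent=y") into a hash ref.
# Simplification: only the first value of a comma-separated list is kept,
# and URL-escaped characters are not decoded.
sub parse_gff3_attributes {
    my ($col9) = @_;
    my %attr;
    for my $pair (split /;/, $col9) {
        next unless $pair =~ /\S/;
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value;
        ($attr{$key}) = split /,/, $value;
    }
    return \%attr;
}

# Format the two attributes GTF 2.2 expects on every line.
sub gtf_attributes {
    my ($gene_id, $transcript_id) = @_;
    return qq{gene_id "$gene_id"; transcript_id "$transcript_id";};
}

# Hypothetical transcript-to-gene lookup (made-up IDs); in practice this
# would be built from the mRNA lines of the same GFF3 file.
my %gene_of = ('FBtr0000001' => 'FBgn0000001');

# One made-up GFF3 CDS line, split into its nine tab-separated columns.
my @f = split /\t/,
    "2L\tFlyBase\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=FBtr0000001";

my $attr = parse_gff3_attributes($f[8]);
my $tx   = $attr->{Parent};            # assumption: Parent is the transcript
$f[8]    = gtf_attributes($gene_of{$tx}, $tx);

my $gtf_line = join("\t", @f);
print "$gtf_line\n";
```

Even this small example bakes in decisions (which attribute names the transcript, how to find the gene) that differ between data sources, which is exactly why a single generic converter is hard to write.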

> Although this chunk of code produces CDS and exon entries in the output file, it does not output start codon/stop codon annotations.
> Also, we think it misinterprets annotations, so that one sees UTR entries annotated as CDSs or exons.

The start/stop codons can normally be inferred from the CDS/UTRs and exons if they are provided, but again this is one of those issues where there isn't a lot of consistency with the data across various data sources (something addressed at the recent GMOD meeting).  What is the source of your GFF?
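
As a rough sketch of that inference in plain Perl: given the CDS segments of one transcript and its strand, take the first three coding bases as the start codon and the last three as the stop codon.  This assumes the input follows the GFF3 convention (stop codon included in the CDS) and that the CDS is complete at both ends, neither of which is guaranteed for every source.  The sub names and coordinates are invented for the example.

```perl
use strict;
use warnings;

# Take $n bases from one end of a list of [start, end] segments
# (1-based, inclusive, sorted by start).  Returns the pieces as
# [start, end] pairs sorted by start, so a codon split across two
# segments comes back as two pieces.
sub take_bases {
    my ($segments, $n, $from_right) = @_;
    my @ordered = $from_right ? reverse(@$segments) : @$segments;
    my @pieces;
    for my $seg (@ordered) {
        last if $n <= 0;
        my ($start, $end) = @$seg;
        my $len  = $end - $start + 1;
        my $take = $len < $n ? $len : $n;
        push @pieces, $from_right ? [$end - $take + 1, $end]
                                  : [$start, $start + $take - 1];
        $n -= $take;
    }
    return [ sort { $a->[0] <=> $b->[0] } @pieces ];
}

# Infer codon features from the CDS segments of one transcript.
# Assumption: GFF3-style CDS (stop codon included), complete CDS.
sub infer_codons {
    my ($cds, $strand) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @$cds;
    my $minus  = $strand eq '-';
    return {
        # On the minus strand the 5' end is the rightmost coordinate.
        start_codon => take_bases(\@sorted, 3, $minus),
        stop_codon  => take_bases(\@sorted, 3, !$minus),
    };
}

# Made-up two-segment CDS on the plus strand.
my $plus = infer_codons([ [100, 200], [300, 401] ], '+');
printf "start_codon %d-%d\n", @{ $plus->{start_codon}[0] };
printf "stop_codon  %d-%d\n", @{ $plus->{stop_codon}[0] };
```

Note that GTF 2.2 excludes the stop codon from the CDS, so when converting from GFF3 the last three coding bases would also have to be trimmed from the terminal CDS segment(s); that step is not shown here.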

> Many thanks for ideas/notes.
> 
> Alex
> 
> --
> Alexander Kanapin, PhD
> Scientific Associate
> 
> Ontario Institute for Cancer Research
> MaRS Centre, South Tower
> 101 College Street, Suite 800
> Toronto, Ontario, Canada M5G 0A3
> Tel: 647-260-7993
> Toll-free: 1-866-678-6427
> www.oicr.on.ca <http://www.oicr.on.ca/>


chris


