[Bioperl-l] Genbank to gff3 conversion problem

Thu Jan 29 21:26:46 UTC 2009

This looks like a problem with the Unflattener code (which I wrote a  
while ago) - as ChrisF states, this is sometimes required to  
reconstruct the hierarchical subfeature relations from the 'flat'  
genbank model

Here is  NT_011512:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&val=37558541

It's human chromosome 21

This genbank record that is split over multiple sub-records, using  
join directives to stitch it all back together. I once new how to  
handle such beasts in bioperl, but I think I've forgotten. In any  
case, the unflattener does not make any attempt to download and join  
together the sequence fragments, it just gets confused by this record  
and throws a wobbly.

A bioperl guru may be able to help with the correct strategy here, but  
roughly speaking I can see two alternate approaches:

1. reconstruct the entire chromosome record and feed that to the  
unflattener
2. iterate though each record running the gb2gff conversion on each,  
then stitch the gff records together

2 will be quite tricky because of the coordinate transformation. On  
the other hand, there may be memory issues.

I'm sure this is doable as I used to regularly load all of genbank  
human into a bioSQL database using a predecessor of the unflattener;  
I'm just not sure what the best strategy is these days. I have a  
feeling the human assemblies are (or were) available on a one file per  
chromosome basis from the NCBI ftp site. This may be your best bet.

On Jan 28, 2009, at 8:23 PM, shafeeq rim wrote:

> Hi,
>
> I was trying my hands on  bp_genbank2gff3.pl script
> in Bioperl but getting lots of errors. I first tried with .gbk file in
> genbank and then .gbs file but still no success. I actually want Human
> genome annotation file in gff3 format for my application. Is there  
> any other
> software or script for gff3 conversion from genbank format apart from
> this bioperl script? Readseq utility is there but it converts only  
> into
> gff2.
> I extracted and removed problematic file but still of no use  
> NT_011512.
>
> I am getting errors like this :-
> I am using this command perl bp_genbank2gff3.pl ref_chr21.gbk
> ##########################################################
> STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq /usr/local/ 
> share/perl/5.8.8/Bio/SeqFeature/Tools/Unflattener.pm:1549
> STACK (eval) bp_genbank2gff3.pl:378
> STACK main::unflatten_seq bp_genbank2gff3.pl:377
> STACK toplevel bp_genbank2gff3.pl:229
>
> --------------------------------------
>
> Possible gene unflattening error withNT_011512: consult STDERR
>
> ------------- EXCEPTION  -------------
> MSG: seq_id must be set
> STACK Bio::SeqFeature::Tools::IDHandler::generate_unique_persistent_id
> /usr/local/share/perl/5.8.8/Bio/SeqFeature/Tools/IDHandler.pm:242
> STACK main::generic_features bp_genbank2gff3.pl:353
> STACK toplevel bp_genbank2gff3.pl:254
> #####################################################
>
> Thanks in advance
> Regards
> Shafeeq
>
>
>      Add more friends to your messenger and enjoy! Go to http://messenger.yahoo.com/invite/
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>