[Bioperl-l] CON(structed) sequence databases?

Wed Jan 31 14:05:46 UTC 2007

On Jan 30, 2007, at 1:45 AM, JK (Jesper Agerbo Krogh) wrote:

> Hi.
>
> What do you do about parsing sequences from the "CON"-divisions of
> EMBL/Genbank? The entries look just like this one:
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=CH445337&doptcmdl=GenBank
>
> The bioperl 1.4 parser dies on the embl-version and the 1.5 parser uses
> the complete .dat file as a single entry.
>
> Thanks.
>
> Jesper

For GenBank CONTIG/WGS line parsing you'll have to update to BioPerl
1.5.2 (I added that after 1.5.1).  When parsing GenBank records, the
CONTIG data is currently just split on newlines and stored as
SimpleValue annotations; I don't believe it is parsed at all for EMBL
records at this time.  Although we could probably do something using
Bio::Location objects, there really hasn't been much demand for it,
since one can retrieve the sequences assembled by NCBI by requesting
the full GenBank record (set up automatically in Bio::DB::GenBank) or
by requesting the return format 'gbwithparts' when using eutils.

To retrieve the parsed data from a GenBank record in a Bio::Seq object:

my @contigs = $seq->annotation->get_Annotations('CONTIG');
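
As a rough sketch of how that call might fit into a script: the helper
name contig_lines and the input file name are made up for illustration,
and it assumes each CONTIG annotation comes back as a
Bio::Annotation::SimpleValue, so that value() returns the stored text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the raw CONTIG line(s) stored on a Bio::Seq object.
# Assumption: each annotation is a Bio::Annotation::SimpleValue,
# whose value() method returns the stored text.
sub contig_lines {
    my ($seq) = @_;
    return map { $_->value } $seq->annotation->get_Annotations('CONTIG');
}

# Run as: perl contig_lines.pl con_records.gb   (hypothetical file name)
if (@ARGV) {
    require Bio::SeqIO;    # loaded only when a file is actually parsed
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');

    while (my $seq = $in->next_seq) {
        my @lines = contig_lines($seq);
        printf "%s: %d CONTIG line(s)\n", $seq->display_id, scalar @lines;
        print "  $_\n" for @lines;
    }
}
```

If the loop reports only one record for a multi-entry file, that would
point to the end-of-record problem discussed below.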

If the complete .dat file is read as a single entry then there's
definitely a bug (the end of the sequence record isn't being detected),
which is possible since I only tested against single CON files.  Could
you point me to the .dat file you checked so I can test it?

chris