[Bioperl-l] GenBank files CONTIG line messes up sequences

Wed May 14 16:37:21 UTC 2014

Ah; yes, this is likely a bug due to a change in NCBI’s output.  Previously they used to have either CONTIG or the sequence, not both.  I’ll add this as an issue, but it should be fairly easy to fix.

chris

On May 14, 2014, at 11:25 AM, Jason Stajich <jason.stajich at gmail.com> wrote:

> Hi - sorry that I can't help more at this time but sending this on to the mailing list in hopes that someone can debug too. 
> 
> 
> 
> 
> 
> On Wed, May 14, 2014 at 09:22 AM, Jochen Blom<jockoblom at gmail.com>, wrote:
> Dear Jason,
> 
> 
> sorry for the broken format in my first mail, here is a reworked version of
> 
> my request:
> 
> 
> Hi!
> 
> 
> I'm using BioPerl to work with GenBAnk files, mostly public files from the
> 
> NCBI. For GenBank files with a CONTIG line like thw following
> 
> "CONTIG      join(BA000030.3:1..9025608)"
> 
> the Genbank parser does not work properly.
> 
> 
> A simple example script can be found here:
> 
> https://gist.github.com/anonymous/e50ef06c30f10de01298
> 
> 
> 
> This script creates a valid Genbank file, but the genome sequence is
> 
> missing.
> 
> This is independent from what I do in the "do stuff" part, even the "empty"
> 
> version where nothing happens does not export a sequence.
> 
> The problem seems to be that the sequence is not accesible, e.g.,  the
> 
> example in the bottom comment will create empty sequences in $feature_seq.
> 
> 
> Even the most simple script, just reading genbank and writing it again as
> 
> in the example
> 
> 
> https://gist.github.com/anonymous/fd9d7a6d485aee6412ca
> 
> 
> does not work. Same script with Genbank to EMBL works just fine. EMBL to
> 
> Genbank fails as
> 
> well if the EMBL file has the line corresponding to the CONTIG line:
> 
> "CO   join(BA000030.3:1..9025608)"
> 
> 
> So the problem seem to be specific for the genbank parser/writer.
> 
> 
> I found this post
> 
> https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ
> 
> telling that the problem is fixed in BioPerl 1.6.922, but I still encounter
> 
> the problems with sequence retieval.
> 
> 
> Using Perl 5.18.2
> 
> Bio::Root::Version::VERSION: 1.006923
> 
> 
> I work on local data mirrored from the VBI page, one example here:
> 
> http://mirrors.vbi.vt.edu/mirrors/ftp.ncbi.nih.gov/genomes/Bacteria/Streptomyces_avermitilis_MA_4680_uid57739/NC_004719.gbk
> 
> 
> I just noticed that data directly from the NCBI web page does not include
> 
> the CONTIG lines, thus most likely not too much people have the same
> 
> problems and the problem can be circumvented by using NCBI data, but I
> 
> think this does not work as intended and a solution would help me.
> 
> 
> So is there anyone with the same problems or any helpful ideas?
> 
> 
> Best regards,
> 
> 
> Jochen
> 
> 
> 
> 
> 
> 
> 
> 
> 2014-05-14 16:45 GMT+02:00 Jason Stajich <jason at bioperl.org>:
> 
> 
>> Jochen -
> 
>> 
> 
>> So my recollection was we had dealt with these feature types in genbank
> 
>> and your link to the mailing list seems to support this, but I think what
> 
>> you are saying is the corresponding code fix in EMBL does work? I can't
> 
>> really read your code well in this email to make out the part that is
> 
>> reporting the problem.
> 
>> 
> 
>> 
> 
>> 
> 
>> Can you
> 
>> a) post the code in something like gist.github.com so it i easier to
> 
>> checkout and use- formatting is all garbled here
> 
>> b) provide an accession number and the code to replicate the problem in a
> 
>> self-contained manner
> 
>> 
> 
>> note - i really won' t be able to do any coding to fix it but hopefully it
> 
>> can be narrowed down and diagnosed for  a bugreport with a reproducible
> 
>> code block.
> 
>> 
> 
>> Cheers,
> 
>> 
> 
>> Jason
> 
>> 
> 
>> 
> 
>> Jason Stajich
> 
>> jason at bioperl.org
> 
>> http://bioperl.org/wiki/User:Jason
> 
>> http://twitter.com/hyphaltip
> 
>> 
> 
>> 
> 
>> On Wed, May 14, 2014 at 6:41 AM, <jockoblom at gmail.com> wrote:
> 
>> 
> 
>>> Hi!
> 
>>> 
> 
>>> I'm using BioPerl to work with GenBAnk files, mostly public files from the
> 
>>> NCBI. For GenBank files with a CONTIG line like thw following
> 
>>> "CONTIG      join(BA000030.3:1..9025608)"
> 
>>> the Genbank parser does not work properly.
> 
>>> 
> 
>>> As a simple example:
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> *#!/usr/bin/env perluse strict;use warnings;use Bio::SeqIO;my $file =
> 
>>> shift;# create reader for the inputmy $seq_in = Bio::SeqIO->new(   -file
> 
>>> =>
> 
>>> $file,                                -format =>
> 
>>> 'genbank'                            );my @entries;while (my $contig =
> 
>>> $seq_in->next_seq()){    ## do stuff    push @entries, $contig;}my
> 
>>> $outfile
> 
>>> = Bio::SeqIO->new(  -file => ">out.gbk",
> 
>>> -format => 'genbank'                             );foreach my $contig
> 
>>> (@entries){        $outfile->write_seq($contig);}*
> 
>>> 
> 
>>> This script creates a Genbank file, but the genome sequence is missing.
> 
>>> This is independent from what I do in the "do stuff" part, even this empty
> 
>>> version does not export a sequence.
> 
>>> The problem seems to be that the sequence is not accesible, e.g.,  the
> 
>>> following lines will create empty sequences in $feature_seq.
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> *foreach my $feat ($contig->get_SeqFeatures()){    if($feat->primary_tag()
> 
>>> eq 'CDS'){        $feature_seq = $feat->seq();...*Even the most simple
> 
>>> (and
> 
>>> most useless) script example does fail to produce a genbank file with
> 
>>> sequence:
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> *my $input = $ARGV[0];my $output = $ARGV[1];my $in = Bio::SeqIO->new(-file
> 
>>> => "$input" ,            -format => 'genbank');my $out  =
> 
>>> Bio::SeqIO->new(-file => ">$output" ,                       -format =>
> 
>>> 'genbank');                       while ( my $seq = $in->next_seq() ) {
> 
>>> $out->write_seq($seq);}*
> 
>>> 
> 
>>> Same script with Genbank to EMBL works just fine. EMBL to Genbank fails as
> 
>>> well if the EMBL file has the line corresponding to the CONTIG line:
> 
>>> "CO   join(BA000030.3:1..9025608)"
> 
>>> 
> 
>>> I found this post
> 
>>> https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ
> 
>>> telling that the problem is fixed in BioPerl 1.6.922, but I still
> 
>>> encounter
> 
>>> the problems with sequence retieval.
> 
>>> 
> 
>>> Using Perl 5.18.2
> 
>>> Bio::Root::Version::VERSION: 1.006923
> 
>>> 
> 
>>> Anyone with the same problems or any helpful ideas?
> 
>>> 
> 
>>> Best regards,
> 
>>> 
> 
>>> Jochen
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> 
> 
>>> _______________________________________________
> 
>>> Bioperl-l mailing list
> 
>>> Bioperl-l at lists.open-bio.org
> 
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
>>> 
> 
>> 
> 
>> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l