[Bioperl-l] GenBank files CONTIG line messes up sequences

Jason Stajich jason.stajich at gmail.com
Wed May 14 12:25:26 EDT 2014


Hi - sorry that I can't help more at this time but sending this on to the mailing list in hopes that someone can debug too. 





On Wed, May 14, 2014 at 09:22 AM, Jochen Blom<jockoblom at gmail.com>, wrote:
Dear Jason,


sorry for the broken format in my first mail, here is a reworked version of

my request:


Hi!


I'm using BioPerl to work with GenBAnk files, mostly public files from the

NCBI. For GenBank files with a CONTIG line like thw following

"CONTIG      join(BA000030.3:1..9025608)"

the Genbank parser does not work properly.


A simple example script can be found here:

https://gist.github.com/anonymous/e50ef06c30f10de01298



This script creates a valid Genbank file, but the genome sequence is

missing.

This is independent from what I do in the "do stuff" part, even the "empty"

version where nothing happens does not export a sequence.

The problem seems to be that the sequence is not accesible, e.g.,  the

example in the bottom comment will create empty sequences in $feature_seq.


Even the most simple script, just reading genbank and writing it again as

in the example


https://gist.github.com/anonymous/fd9d7a6d485aee6412ca


does not work. Same script with Genbank to EMBL works just fine. EMBL to

Genbank fails as

well if the EMBL file has the line corresponding to the CONTIG line:

"CO   join(BA000030.3:1..9025608)"


So the problem seem to be specific for the genbank parser/writer.


I found this post

https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ

telling that the problem is fixed in BioPerl 1.6.922, but I still encounter

the problems with sequence retieval.


Using Perl 5.18.2

Bio::Root::Version::VERSION: 1.006923


I work on local data mirrored from the VBI page, one example here:

http://mirrors.vbi.vt.edu/mirrors/ftp.ncbi.nih.gov/genomes/Bacteria/Streptomyces_avermitilis_MA_4680_uid57739/NC_004719.gbk


I just noticed that data directly from the NCBI web page does not include

the CONTIG lines, thus most likely not too much people have the same

problems and the problem can be circumvented by using NCBI data, but I

think this does not work as intended and a solution would help me.


So is there anyone with the same problems or any helpful ideas?


Best regards,


Jochen








2014-05-14 16:45 GMT+02:00 Jason Stajich <jason at bioperl.org>:


> Jochen -

>

> So my recollection was we had dealt with these feature types in genbank

> and your link to the mailing list seems to support this, but I think what

> you are saying is the corresponding code fix in EMBL does work? I can't

> really read your code well in this email to make out the part that is

> reporting the problem.

>

>

>

> Can you

> a) post the code in something like gist.github.com so it i easier to

> checkout and use- formatting is all garbled here

> b) provide an accession number and the code to replicate the problem in a

> self-contained manner

>

> note - i really won' t be able to do any coding to fix it but hopefully it

> can be narrowed down and diagnosed for  a bugreport with a reproducible

> code block.

>

> Cheers,

>

> Jason

>

>

> Jason Stajich

> jason at bioperl.org

> http://bioperl.org/wiki/User:Jason

> http://twitter.com/hyphaltip

>

>

> On Wed, May 14, 2014 at 6:41 AM, <jockoblom at gmail.com> wrote:

>

>> Hi!

>>

>> I'm using BioPerl to work with GenBAnk files, mostly public files from the

>> NCBI. For GenBank files with a CONTIG line like thw following

>> "CONTIG      join(BA000030.3:1..9025608)"

>> the Genbank parser does not work properly.

>>

>> As a simple example:

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>> *#!/usr/bin/env perluse strict;use warnings;use Bio::SeqIO;my $file =

>> shift;# create reader for the inputmy $seq_in = Bio::SeqIO->new(   -file

>> =>

>> $file,                                -format =>

>> 'genbank'                            );my @entries;while (my $contig =

>> $seq_in->next_seq()){    ## do stuff    push @entries, $contig;}my

>> $outfile

>> = Bio::SeqIO->new(  -file => ">out.gbk",

>> -format => 'genbank'                             );foreach my $contig

>> (@entries){        $outfile->write_seq($contig);}*

>>

>> This script creates a Genbank file, but the genome sequence is missing.

>> This is independent from what I do in the "do stuff" part, even this empty

>> version does not export a sequence.

>> The problem seems to be that the sequence is not accesible, e.g.,  the

>> following lines will create empty sequences in $feature_seq.

>>

>>

>>

>>

>>

>>

>> *foreach my $feat ($contig->get_SeqFeatures()){    if($feat->primary_tag()

>> eq 'CDS'){        $feature_seq = $feat->seq();...*Even the most simple

>> (and

>> most useless) script example does fail to produce a genbank file with

>> sequence:

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>>

>> *my $input = $ARGV[0];my $output = $ARGV[1];my $in = Bio::SeqIO->new(-file

>> => "$input" ,            -format => 'genbank');my $out  =

>> Bio::SeqIO->new(-file => ">$output" ,                       -format =>

>> 'genbank');                       while ( my $seq = $in->next_seq() ) {

>> $out->write_seq($seq);}*

>>

>> Same script with Genbank to EMBL works just fine. EMBL to Genbank fails as

>> well if the EMBL file has the line corresponding to the CONTIG line:

>> "CO   join(BA000030.3:1..9025608)"

>>

>> I found this post

>> https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ

>> telling that the problem is fixed in BioPerl 1.6.922, but I still

>> encounter

>> the problems with sequence retieval.

>>

>> Using Perl 5.18.2

>> Bio::Root::Version::VERSION: 1.006923

>>

>> Anyone with the same problems or any helpful ideas?

>>

>> Best regards,

>>

>> Jochen

>>

>>

>>

>>

>>

>>

>>

>>

>> _______________________________________________

>> Bioperl-l mailing list

>> Bioperl-l at lists.open-bio.org

>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

>>

>

>


More information about the Bioperl-l mailing list