[Bioperl-l] GenBank files CONTIG line messes up sequences
Jason Stajich
jason.stajich at gmail.com
Wed May 14 12:25:26 EDT 2014
Hi - sorry that I can't help more at this time but sending this on to the mailing list in hopes that someone can debug too.
On Wed, May 14, 2014 at 09:22 AM, Jochen Blom<jockoblom at gmail.com>, wrote:
Dear Jason,
sorry for the broken format in my first mail, here is a reworked version of
my request:
Hi!
I'm using BioPerl to work with GenBAnk files, mostly public files from the
NCBI. For GenBank files with a CONTIG line like thw following
"CONTIG join(BA000030.3:1..9025608)"
the Genbank parser does not work properly.
A simple example script can be found here:
https://gist.github.com/anonymous/e50ef06c30f10de01298
This script creates a valid Genbank file, but the genome sequence is
missing.
This is independent from what I do in the "do stuff" part, even the "empty"
version where nothing happens does not export a sequence.
The problem seems to be that the sequence is not accesible, e.g., the
example in the bottom comment will create empty sequences in $feature_seq.
Even the most simple script, just reading genbank and writing it again as
in the example
https://gist.github.com/anonymous/fd9d7a6d485aee6412ca
does not work. Same script with Genbank to EMBL works just fine. EMBL to
Genbank fails as
well if the EMBL file has the line corresponding to the CONTIG line:
"CO join(BA000030.3:1..9025608)"
So the problem seem to be specific for the genbank parser/writer.
I found this post
https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ
telling that the problem is fixed in BioPerl 1.6.922, but I still encounter
the problems with sequence retieval.
Using Perl 5.18.2
Bio::Root::Version::VERSION: 1.006923
I work on local data mirrored from the VBI page, one example here:
http://mirrors.vbi.vt.edu/mirrors/ftp.ncbi.nih.gov/genomes/Bacteria/Streptomyces_avermitilis_MA_4680_uid57739/NC_004719.gbk
I just noticed that data directly from the NCBI web page does not include
the CONTIG lines, thus most likely not too much people have the same
problems and the problem can be circumvented by using NCBI data, but I
think this does not work as intended and a solution would help me.
So is there anyone with the same problems or any helpful ideas?
Best regards,
Jochen
2014-05-14 16:45 GMT+02:00 Jason Stajich <jason at bioperl.org>:
> Jochen -
>
> So my recollection was we had dealt with these feature types in genbank
> and your link to the mailing list seems to support this, but I think what
> you are saying is the corresponding code fix in EMBL does work? I can't
> really read your code well in this email to make out the part that is
> reporting the problem.
>
>
>
> Can you
> a) post the code in something like gist.github.com so it i easier to
> checkout and use- formatting is all garbled here
> b) provide an accession number and the code to replicate the problem in a
> self-contained manner
>
> note - i really won' t be able to do any coding to fix it but hopefully it
> can be narrowed down and diagnosed for a bugreport with a reproducible
> code block.
>
> Cheers,
>
> Jason
>
>
> Jason Stajich
> jason at bioperl.org
> http://bioperl.org/wiki/User:Jason
> http://twitter.com/hyphaltip
>
>
> On Wed, May 14, 2014 at 6:41 AM, <jockoblom at gmail.com> wrote:
>
>> Hi!
>>
>> I'm using BioPerl to work with GenBAnk files, mostly public files from the
>> NCBI. For GenBank files with a CONTIG line like thw following
>> "CONTIG join(BA000030.3:1..9025608)"
>> the Genbank parser does not work properly.
>>
>> As a simple example:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *#!/usr/bin/env perluse strict;use warnings;use Bio::SeqIO;my $file =
>> shift;# create reader for the inputmy $seq_in = Bio::SeqIO->new( -file
>> =>
>> $file, -format =>
>> 'genbank' );my @entries;while (my $contig =
>> $seq_in->next_seq()){ ## do stuff push @entries, $contig;}my
>> $outfile
>> = Bio::SeqIO->new( -file => ">out.gbk",
>> -format => 'genbank' );foreach my $contig
>> (@entries){ $outfile->write_seq($contig);}*
>>
>> This script creates a Genbank file, but the genome sequence is missing.
>> This is independent from what I do in the "do stuff" part, even this empty
>> version does not export a sequence.
>> The problem seems to be that the sequence is not accesible, e.g., the
>> following lines will create empty sequences in $feature_seq.
>>
>>
>>
>>
>>
>>
>> *foreach my $feat ($contig->get_SeqFeatures()){ if($feat->primary_tag()
>> eq 'CDS'){ $feature_seq = $feat->seq();...*Even the most simple
>> (and
>> most useless) script example does fail to produce a genbank file with
>> sequence:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *my $input = $ARGV[0];my $output = $ARGV[1];my $in = Bio::SeqIO->new(-file
>> => "$input" , -format => 'genbank');my $out =
>> Bio::SeqIO->new(-file => ">$output" , -format =>
>> 'genbank'); while ( my $seq = $in->next_seq() ) {
>> $out->write_seq($seq);}*
>>
>> Same script with Genbank to EMBL works just fine. EMBL to Genbank fails as
>> well if the EMBL file has the line corresponding to the CONTIG line:
>> "CO join(BA000030.3:1..9025608)"
>>
>> I found this post
>> https://groups.google.com/forum/#!msg/bioperl-l/ur9ZQIXyoj0/xT8izNWb-8kJ
>> telling that the problem is fixed in BioPerl 1.6.922, but I still
>> encounter
>> the problems with sequence retieval.
>>
>> Using Perl 5.18.2
>> Bio::Root::Version::VERSION: 1.006923
>>
>> Anyone with the same problems or any helpful ideas?
>>
>> Best regards,
>>
>> Jochen
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
More information about the Bioperl-l
mailing list