[Bioperl-l] important change for Bio::SeqIO::GenBank

Wang, Kai Wang.Kai@mayo.edu
Wed, 31 Oct 2001 10:26:21 -0600


It is the first time I visit this mailing list. I do not know whether
somebody has pointed out this, but it is a serious problem.

When I browsed source code of Bio::SeqIO in the Bioperl project, I found a
bug that the code cannot process moleuclar shape of "circular" or "linear",
which are annotated in a few GenBank files.

An even serious problem is that a new GenBank format will be published soon.
The "LOCUS" line will be changed a little. According to the release notes of
GenBank, "the introduction of the new format will occur with Release 127.0
in December 2001". Obviously the current Bio::SeqIO code cannot process the
new format.

Because of the above two reasons, I rewrite a small part of the
Bio::SeqIO::GenBank code and hopefully it will support the new file format
as well as the old one. A new hash key "molshape" is introduced to represent
"circular" or "linear". If you think my code is acceptable, please included
it into new release of BioPerl. GenBank records with new format will be
released by December 2001 this year!

Below is my code:


# CODE FROM Bio::SeqIO::GenBank
sub next_seq{
        my ($self,@args) = @_;
        my ($pseq,$c,$name,@desc,$seqc,$mol,$div,$date);
        my $line;
        my @acc = ();
        my $seq = Bio::Seq::RichSeq->new('-verbose' =>$self->verbose());

        while(defined($line = $self->_readline())) {
                $line =~ /^LOCUS\s+\S+/ && last;
        }
        return undef if( !defined $line ); # end of file
# BEGIN MY CODE
        $line =~
/^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(linear|circular)?\s+(\w{3})\s+(\
S+)/ || do {
                $line =~ /^LOCUS\s+(\S+)/ ||
                        $self->throw("GenBank stream with no LOCUS. Not
GenBank in my book. Got $line");
                $self->warn("GenBank record with LOCUS line in unexpected
format. Attributes from this line other than ID will be missing.")

        }
        $name = $1;
        $seq->display_id($name);
        if ($2 eq 'bp') {
                $seq->moltype ('dna');
                $seq->molecule ($3);                #examples of $3 are
'ss-tRNA', 'ds-DNA', 'DN', 'sno-RNA'
                $seq->molshape ($4);               #molshape means either
'circular' or 'linear': this term is not included in your original program

        } elsif ($2 eq 'aa') {
                $seq->moltype ('protein');
                $seq->molecule ('PRT');              #for protein there are
no $3 and $4 defined.
        }
        $seq->division ($5);
        defined ($6) and $seq->add_date ($6);
# END MY CODE
# BEGIN YOUR CODE