[Bioperl-l] important change for Bio::SeqIO::GenBank

Wang, Kai Wang.Kai@mayo.edu
Wed, 31 Oct 2001 14:30:32 -0600


Thank you Heikki,
I am using BioPerl-0.7.1, but it was downloaded several months ago so I do
not have the new version.
I have checked your new version and I think it should work for the new
GenBank release.
The problem is that the code is based on the assumption that $seq->molecule
and $seq->is_circular either occur simultaneously or do not occur at all in
the LOCUS line. I do not know whether it is true or not. At least in the
future it is very possible that they add 'protein' as $seq->molecule but do
not provide 'circular' or 'linear'. So I think including
'(circular|linear)?' in the code may be a better way. How do you think about
that?
Another problem is that you still cannot judge whether a molecule is mRNA or
DNA just by $seq->molecule. According the the new GenBank Release note, the
$seq->molecule may be something like 'ss-DNA', 'snoRNA', 'ss-mRNA', 'cDNA',
'ms-RNA' or just 'NA'. So maybe we should do something more to split the
$seq->molecule further to accomodate the new GenBank format. What's your
opinion?
--Kai


-----Original Message-----
From: Heikki Lehvaslaiho [mailto:heikki@ebi.ac.uk]
Sent: Wednesday, October 31, 2001 10:50 AM
To: Wang, Kai
Cc: 'bioperl-l@bioperl.org'
Subject: Re: [Bioperl-l] important change for Bio::SeqIO::GenBank


"Wang, Kai" wrote:
> 
> It is the first time I visit this mailing list. I do not know whether
> somebody has pointed out this, but it is a serious problem.

Which version of BioPerl you are using?

I fixed the problem circular  molecules two weeks ago in bioperl-live ("from
CVS"). At the same time I noticed the announcement of the LOCUS line change,
but did not do anything about it. See:
http://bioperl.org/pipermail/bioperl-l/2001-October/006397.html
for the thread.

Do you think you could check out the latest code and check your changes to
it. 
(see http://cvs.bioperl.org/ for instructions of getting a full copy or
check changes at WEbCVS: 'View CVS' on the bioperl home page.)

Thanks,

	-Heikki

> When I browsed source code of Bio::SeqIO in the Bioperl project, I found a
> bug that the code cannot process moleuclar shape of "circular" or
"linear",
> which are annotated in a few GenBank files.
> 
> An even serious problem is that a new GenBank format will be published
soon.
> The "LOCUS" line will be changed a little. According to the release notes
of
> GenBank, "the introduction of the new format will occur with Release 127.0
> in December 2001". Obviously the current Bio::SeqIO code cannot process
the
> new format.
> 
> Because of the above two reasons, I rewrite a small part of the
> Bio::SeqIO::GenBank code and hopefully it will support the new file format
> as well as the old one. A new hash key "molshape" is introduced to
represent
> "circular" or "linear". If you think my code is acceptable, please
included
> it into new release of BioPerl. GenBank records with new format will be
> released by December 2001 this year!
> 
> Below is my code:
> 
> # CODE FROM Bio::SeqIO::GenBank
> sub next_seq{
>         my ($self,@args) = @_;
>         my ($pseq,$c,$name,@desc,$seqc,$mol,$div,$date);
>         my $line;
>         my @acc = ();
>         my $seq = Bio::Seq::RichSeq->new('-verbose' =>$self->verbose());
> 
>         while(defined($line = $self->_readline())) {
>                 $line =~ /^LOCUS\s+\S+/ && last;
>         }
>         return undef if( !defined $line ); # end of file
> # BEGIN MY CODE
>         $line =~
>
/^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(linear|circular)?\s+(\w{3})\s+(\
> S+)/ || do {
>                 $line =~ /^LOCUS\s+(\S+)/ ||
>                         $self->throw("GenBank stream with no LOCUS. Not
> GenBank in my book. Got $line");
>                 $self->warn("GenBank record with LOCUS line in unexpected
> format. Attributes from this line other than ID will be missing.")
> 
>         }
>         $name = $1;
>         $seq->display_id($name);
>         if ($2 eq 'bp') {
>                 $seq->moltype ('dna');
>                 $seq->molecule ($3);                #examples of $3 are
> 'ss-tRNA', 'ds-DNA', 'DN', 'sno-RNA'
>                 $seq->molshape ($4);               #molshape means either
> 'circular' or 'linear': this term is not included in your original program
> 
>         } elsif ($2 eq 'aa') {
>                 $seq->moltype ('protein');
>                 $seq->molecule ('PRT');              #for protein there
are
> no $3 and $4 defined.
>         }
>         $seq->division ($5);
>         defined ($6) and $seq->add_date ($6);
> # END MY CODE
> # BEGIN YOUR CODE
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________