[Bioperl-l] important change for Bio::SeqIO::GenBank

Heikki Lehvaslaiho heikki@ebi.ac.uk
Thu, 01 Nov 2001 17:14:36 +0000


"Wang, Kai" wrote:
> 
> Thank you Heikki,
> I am using BioPerl-0.7.1, but it was downloaded several months ago so I do
> not have the new version.
> I have checked your new version and I think it should work for the new
> GenBank release.
> The problem is that the code is based on the assumption that $seq->molecule
> and $seq->is_circular either occur simultaneously or do not occur at all in
> the LOCUS line. I do not know whether it is true or not. At least in the

The current code allows molecule without 'circular' on the LOCUS line.

> future it is very possible that they add 'protein' as $seq->molecule but do
> not provide 'circular' or 'linear'. So I think including
> '(circular|linear)?' in the code may be a better way. How do you think about
> that?

Both should work, but the more general regexp less likely it is to break.

> Another problem is that you still cannot judge whether a molecule is mRNA or
> DNA just by $seq->molecule. According the the new GenBank Release note, the
> $seq->molecule may be something like 'ss-DNA', 'snoRNA', 'ss-mRNA', 'cDNA',
> 'ms-RNA' or just 'NA'. So maybe we should do something more to split the
> $seq->molecule further to accomodate the new GenBank format. What's your
> opinion?

The thing here is that molecule() is a method of Bio::Seq::RichSeq which
tries to faithfully represent the database entry and the value is dependent
on the database (EMBL, GenBank,...). We really  should not go and try to
outguess what databases are trying to do. It's better to take in whatever
has been put into an entry and and let the application programmer decide.

cDNA is not a valid value on LOCUS line, so the problem is not that hard.
For NA we can do nothing.

In this case I'd write:

	# cDNA is not a valid value on LOCUS line, so this works
	if ($seq->molecule =~ /RNA/ ) { 
		...
	}


Yours,
	-Heikki

> --Kai
> 
> -----Original Message-----
> From: Heikki Lehvaslaiho [mailto:heikki@ebi.ac.uk]
> Sent: Wednesday, October 31, 2001 10:50 AM
> To: Wang, Kai
> Cc: 'bioperl-l@bioperl.org'
> Subject: Re: [Bioperl-l] important change for Bio::SeqIO::GenBank
> 
> "Wang, Kai" wrote:
> >
> > It is the first time I visit this mailing list. I do not know whether
> > somebody has pointed out this, but it is a serious problem.
> 
> Which version of BioPerl you are using?
> 
> I fixed the problem circular  molecules two weeks ago in bioperl-live ("from
> CVS"). At the same time I noticed the announcement of the LOCUS line change,
> but did not do anything about it. See:
> http://bioperl.org/pipermail/bioperl-l/2001-October/006397.html
> for the thread.
> 
> Do you think you could check out the latest code and check your changes to
> it.
> (see http://cvs.bioperl.org/ for instructions of getting a full copy or
> check changes at WEbCVS: 'View CVS' on the bioperl home page.)
> 
> Thanks,
> 
>         -Heikki
> 
> > When I browsed source code of Bio::SeqIO in the Bioperl project, I found a
> > bug that the code cannot process moleuclar shape of "circular" or
> "linear",
> > which are annotated in a few GenBank files.
> >
> > An even serious problem is that a new GenBank format will be published
> soon.
> > The "LOCUS" line will be changed a little. According to the release notes
> of
> > GenBank, "the introduction of the new format will occur with Release 127.0
> > in December 2001". Obviously the current Bio::SeqIO code cannot process
> the
> > new format.
> >
> > Because of the above two reasons, I rewrite a small part of the
> > Bio::SeqIO::GenBank code and hopefully it will support the new file format
> > as well as the old one. A new hash key "molshape" is introduced to
> represent
> > "circular" or "linear". If you think my code is acceptable, please
> included
> > it into new release of BioPerl. GenBank records with new format will be
> > released by December 2001 this year!
> >
> > Below is my code:
> >
> > # CODE FROM Bio::SeqIO::GenBank
> > sub next_seq{
> >         my ($self,@args) = @_;
> >         my ($pseq,$c,$name,@desc,$seqc,$mol,$div,$date);
> >         my $line;
> >         my @acc = ();
> >         my $seq = Bio::Seq::RichSeq->new('-verbose' =>$self->verbose());
> >
> >         while(defined($line = $self->_readline())) {
> >                 $line =~ /^LOCUS\s+\S+/ && last;
> >         }
> >         return undef if( !defined $line ); # end of file
> > # BEGIN MY CODE
> >         $line =~
> >
> /^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(linear|circular)?\s+(\w{3})\s+(\
> > S+)/ || do {
> >                 $line =~ /^LOCUS\s+(\S+)/ ||
> >                         $self->throw("GenBank stream with no LOCUS. Not
> > GenBank in my book. Got $line");
> >                 $self->warn("GenBank record with LOCUS line in unexpected
> > format. Attributes from this line other than ID will be missing.")
> >
> >         }
> >         $name = $1;
> >         $seq->display_id($name);
> >         if ($2 eq 'bp') {
> >                 $seq->moltype ('dna');
> >                 $seq->molecule ($3);                #examples of $3 are
> > 'ss-tRNA', 'ds-DNA', 'DN', 'sno-RNA'
> >                 $seq->molshape ($4);               #molshape means either
> > 'circular' or 'linear': this term is not included in your original program
> >
> >         } elsif ($2 eq 'aa') {
> >                 $seq->moltype ('protein');
> >                 $seq->molecule ('PRT');              #for protein there
> are
> > no $3 and $4 defined.
> >         }
> >         $seq->division ($5);
> >         defined ($6) and $seq->add_date ($6);
> > # END MY CODE
> > # BEGIN YOUR CODE
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> 
> --
> ______ _/      _/_____________________________________________________
>       _/      _/                      http://www.ebi.ac.uk/mutations/
>      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
>     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
>    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
>   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
>      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> ___ _/_/_/_/_/________________________________________________________

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________