[Bioperl-l] SWISS-PROT writing

Hilmar Lapp lapp@gnf.org
Tue, 02 Jan 2001 15:45:28 -0800


Kris Boulez wrote:
> 
> 
> - at line 356 there is
>    $mol = $seq->molecule;
> I think this should be $seq->moltype; as ->molecule only looks for
> {'molecule'} which is not set by ->new. Bio::Seq->new only sets
> {'moltype'}.
> We should change the 'protein' of ->moltype to 'PRT' to conform to the
> standard.

moltype() is internal to BioPerl. Whenever there is an attribute synonymous
to moltype() but defined by a databank, molecule() should be used for that.
So the code is correct I think.

Bio::Seq->new() indeed only sets moltype(), because at this point there is
no databank specificity. molecule() should be set by the parser. If you
want to instantiate a swissprot seq from memory and have it written in
swissprot format, the way we want to go is have dedicated classes under
Bio::Seq::*. If there is need for a swissprot-dedicated class, that one
probably would also set molecule() at instantiation time.

> 
> B.T.W. do we want to allow SWISS-PROT to try to write out DNA/RNA
> sequences ?

In my opinion there's no need for that, but others may think differently.

> 
> - around line 369 the whole else block should be changed. We should make
>   sure we have a division ($div) in the ID part. The previous version of
> the code which is now commented out did a better try at this. Looking at
> next_seq() we why we're not able to read this (entry name must contain
> an underscore section 3.1.1 of the SWISS-PROT manual).
> 
>     $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
>      || $self->throw("swissprot stream with no ID. Not swissprot in my
> book");
>    $name = $1."_".$2;
>    $seq->primary_id($1);
>    $seq->division($2);
> 

If this is the code you're referring to (sorry, don't have at hand right
now), it does ensure that there is a division part. I'm probably missing
something.

> How standard compliant do we want to be with this. If we want to be very
> strict we should e.g. make sure the 'entry name' (first item on the ID
> line) is not more then 10 characters.
> 
> P.S. (very) minor issue: the division we choose 'UNK' for sequences
> which don't have a division set is not in the standard (speclist.txt),
> it only contains UNKP
> 

Sure, can (should) be changed.

> Should I try to adopt swiss.pm to the thoughts I (tried to) put out or
> are there major objections ?
>

See above. I'm not sure what we already have in the Bio::Seq::* hierarchy.
If there's no Swiss.pm yet and GenBank/GenPept doesn't fit well, you could
give Bio::Seq::Swiss.pm a start and adopt the parser to instantiate objects
of that class.

Apart from this, Lorenz may wish to comment. He's been our Swissprot
cruncher for a while, but haven't heard from him for some time. Lorenz,
still out there?

Happy new year to all.

	Hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------