[Bioperl-l] SWISS-PROT writing

Kris Boulez krbou@pgsgent.be
Tue, 2 Jan 2001 22:21:36 +0100


[ I know there are some SWISS-PROT specialists on this list, so I
might make a fool of myself, but here goes ]

While chasing down the reason why swiss.pm was not able to read a
SWISS-PROT formatted file it wrote itself, I found the following things
in write_seq() which look suspicious:

- at line 356 there is 
   $mol = $seq->molecule;
I think this should be $seq->moltype, as ->molecule only looks for
{'molecule'}, which is not set by ->new. Bio::Seq->new only sets
{'moltype'}.
We should also change the 'protein' value returned by ->moltype to 'PRT'
when writing, to conform to the standard.

B.T.W. do we want to allow SWISS-PROT to try to write out DNA/RNA
sequences ?
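
A minimal sketch of the mapping suggested above. The helper name
swiss_mol_token is hypothetical (it is not part of swiss.pm), and the sketch
assumes ->moltype returns the string 'protein' for protein sequences. It
returns nothing for any other type, so the caller can decide whether to throw
or warn for DNA/RNA.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of swiss.pm: translate a Bio::Seq-style
# moltype string into the molecule token used on the SWISS-PROT ID line.
sub swiss_mol_token {
    my ($moltype) = @_;
    return 'PRT' if defined $moltype && lc($moltype) eq 'protein';
    return;    # not a protein: leave the decision to the caller
}

my $token = swiss_mol_token('protein');
print "protein -> $token\n";

my $other = swiss_mol_token('dna');
print "dna -> ", (defined $other ? $other : 'no SWISS-PROT token'), "\n";
```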


- around line 369 the whole else block should be changed. We should make
  sure we have a division ($div) in the ID part. The previous version of
  the code, which is now commented out, made a better attempt at this.
  Looking at next_seq() we can see why we're not able to read our own
  output: the entry name must contain an underscore (section 3.1.1 of the
  SWISS-PROT manual).

    $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
      || $self->throw("swissprot stream with no ID. Not swissprot in my book");
    $name = $1."_".$2;
    $seq->primary_id($1);
    $seq->division($2);
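
To illustrate why an entry name without an underscore cannot be read back,
here is a small standalone sketch that applies the same pattern to two ID
lines. The helper parse_id_line and both sample lines are made up for
illustration; only the regular expression comes from next_seq().

```perl
use strict;
use warnings;

# Same pattern as quoted from next_seq() above.
my $id_re = qr/^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/;

# Hypothetical helper: returns (primary_id, division), or an empty list
# when the line does not match.
sub parse_id_line {
    my ($line) = @_;
    return $line =~ $id_re ? ($1, $2) : ();
}

# Illustrative ID lines, one with and one without an underscore.
my @ok  = parse_id_line("ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.");
my @bad = parse_id_line("ID   MYSEQ          STANDARD;      PRT;   104 AA.");

print "with underscore:    ", (@ok  ? "@ok"  : "no match"), "\n";
print "without underscore: ", (@bad ? "@bad" : "no match"), "\n";
```

The second line is the kind of ID that write_seq() can produce when no
division is set, which is where next_seq() would throw.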

How standard compliant do we want to be with this? If we want to be very
strict we should, e.g., make sure the 'entry name' (first item on the ID
line) is not more than 10 characters.
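
A strict check could look roughly like the sketch below. The function name
entry_name_problems is hypothetical, and it only tests the two constraints
mentioned in this mail (the underscore and the 10 character limit), not
everything the manual requires.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found in an entry name.
# An empty list means the name passed both checks.
sub entry_name_problems {
    my ($name) = @_;
    my @problems;
    push @problems, "longer than 10 characters" if length($name) > 10;
    push @problems, "not of the form NAME_DIVISION"
        unless $name =~ /^[^\s_]+_[^\s_]+$/;
    return @problems;
}

for my $name ('CYC_BOVIN', 'MYSEQ', 'VERYLONGNAME_BOVIN') {
    my @problems = entry_name_problems($name);
    print "$name: ", (@problems ? join('; ', @problems) : 'ok'), "\n";
}
```

Whether write_seq() should throw, warn or silently repair the name in these
cases is the open question.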

P.S. (very) minor issue: the division we choose, 'UNK', for sequences
which don't have a division set is not in the standard (speclist.txt);
it only contains UNKP.

Should I try to adapt swiss.pm along the lines I (tried to) set out
above, or are there major objections ?


Kris,