[Bioperl-l] Bio/SeqIO/swiss.pm parsing error

James D. White jdw at ou.edu
Mon Nov 13 23:50:15 UTC 2006


"Erik" <er at xs4all.nl> wrote:

>Hi all,
>
>I noticed the parsing is borked with the newest Swiss-Prot files:
>  UniProt Knowledgebase Release 9 consists of:
>  UniProtKB/Swiss-Prot Release 51.0 of 31-Oct-2006
>  UniProtKB/TrEMBL Release 34.0 of 31-Oct-2006
>
>
>I edited my local copy of Bio/SeqIO/swiss.pm to parse the ID lines
>in swissprot/trembl according to the new specification (see
>http://expasy.org/sprot/relnotes/sp_news.html).
>
>Basically, the change is as follows:
>  ID   EntryName DataClass; MoleculeType; SequenceLength.
>is changed to:
>  ID   EntryName DataClass; SequenceLength.
>
>
>
>The change I made was only in the regex capturing the entry name:
>method next_seq (Bio/SeqIO/swiss.pm) :
>
>===============
>
>  unless(  m/
>               ^
>                  ID              \s+     #
>                  (\S+)           \s+     #  $1  entryname
>                  ([^\s;]+);      \s+     #  $2  DataClass
>                  [0-9]+[ ]AA     \.      #      Sequencelength (capture?)
>                $
>            /ox )
>  {
>    $self->throw("swissprot stream with no ID. Not swissprot in my book");
>  }
>
>===============
>  
>

How about something like the following to recognize both old and new formats?

===============

  unless(  m/
               ^
                  ID              \s+           #
                  (\S+)           \s+           #  $1  entryname
                  ( (?: [^\s;]+;  \s+ )? )      #  $2  DataClass (including ";\s+")
                  [0-9]+[ ]AA     \.            #      Sequencelength (capture?)
                $
            /ox )
  {
    $self->throw("swissprot stream with no ID. Not swissprot in my book");
  }
  # Because $2 now contains a trailing ";\s+" in the new format, it needs to be fixed
  $DataClass = $2 || 'default DataClass';       # provide default for old file format
  $DataClass =~ s/;\s+$//;                      # remove trailing ";\s+"

===============

The code trailing the unless block should be modified to use the appropriate
variable names.  This is provided only to show what post-match modification is
needed.
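
Going by the two ID layouts quoted above, the old format seems to carry an
extra "MoleculeType;" field after DataClass rather than lacking DataClass, so
another option may be to keep DataClass mandatory and make only the
MoleculeType field optional. The following is a minimal standalone sketch of
that idea, not the actual code in Bio/SeqIO/swiss.pm. The helper name
parse_id_line and the two sample ID lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::SeqIO::swiss): parse one ID line in
# either layout and return (entry name, DataClass, sequence length).
# Returns an empty list when the line does not look like an ID line.
sub parse_id_line {
    my ($line) = @_;
    return unless $line =~ m/
        ^
        ID               \s+        #
        (\S+)            \s+        #  capture 1: entry name
        ([^\s;]+);       \s+        #  capture 2: DataClass
        (?: [^\s;]+;     \s+ )?     #  MoleculeType, old format only
        ([0-9]+) [ ] AA  \.         #  capture 3: sequence length
        \s* $
    /x;
    return ($1, $2, $3);
}

# Made-up sample lines following the old and the new layout.
for my $line (
    'ID   TEST_HUMAN     STANDARD;      PRT;   123 AA.',
    'ID   TEST_HUMAN     STANDARD;      123 AA.',
) {
    my ($name, $class, $len) = parse_id_line($line)
        or die "swissprot stream with no ID: $line";
    print "$name\t$class\t$len\n";
}
```

With this variant DataClass comes out without the trailing ";", so no
post-match cleanup should be needed. Inside next_seq the die would presumably
become the existing $self->throw call.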

>
>I tested this (=entry parsable and SeqIO created) against several
>hundred Swissprot and Trembl entries.
>
>Of course, files with the older format are now broken - it may be better
>to support both the old and new formats, and try both (newest first).
>
>hth,
>
>Erik
>
>
>
>
>  
>





