[Bioperl-l] Suggested patches

Murad Nayal murad@godel.bioc.columbia.edu
Mon, 12 Mar 2001 03:07:59 +0100


This is a multi-part message in MIME format.
--------------41F4832FB5AEC7F69268E3FD
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


Hello

Congratulations to all on the new 0.7 release. What a wonderful job.

My current problem: I have been using the trembl and trembl_new sequence
collections. The format of these files differs somewhat from both the
standard swissprot and embl formats. For example, the id field (in the
ID record) is not split into a primary_id and a division separated by
an underscore (as in swissprot). Similarly, the ID record does not
contain a div attribute (as in embl). The files also do not contain an
FH record, which is used to exit the header parsing loop. Another
problem is that the regular expression used to match the accession
number includes the semicolon in the accession (because the + quantifier
matches greedily). Because of these issues, both SeqIO::swiss and
SeqIO::embl fail to read these files.

I have made some minor (and hopefully not too ugly) modifications to
swiss.pm and embl.pm to accommodate these differences; patches are
attached. I suppose I could have committed the changes to CVS directly,
but I wanted to make sure I know the bug-fixing etiquette in bioperl
first. I am also not sure my password is still valid. In any event, let
me know what you think.
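
For reference, the ID-line fallback and the accession match described
above can be tried outside of bioperl. The following is only a minimal
standalone sketch: parse_id_line and parse_accession are hypothetical
helper names (they are not part of Bio::SeqIO), and the sample records
are made up for illustration rather than copied from a real file.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO. Tries the SWISS-PROT layout
# (NAME_DIVISION) first, then falls back to a plain identifier.
sub parse_id_line {
    my ($line) = @_;
    if ($line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/) {
        return { name => $1 . '_' . $2, primary_id => $1,
                 division => $2, molecule => $4 };
    }
    elsif ($line =~ /^ID\s+(\S+)\s+([^\s;]+);\s+([^\s;]+);/) {
        return { name => $1, primary_id => $1, molecule => $3 };
    }
    return;    # no recognizable ID line
}

# Hypothetical helper: capture the first accession without its semicolon.
sub parse_accession {
    my ($line) = @_;
    return $line =~ /^AC\s+([^\s;]+)/ ? $1 : undef;
}

# Made-up sample lines for illustration only.
for my $line ("ID   CYC_HUMAN      STANDARD;      PRT;   105 AA.",
              "ID   Q12345     PRELIMINARY;      PRT;   100 AA.") {
    my $id = parse_id_line($line) or next;
    printf "%-10s division=%s molecule=%s\n",
        $id->{name}, $id->{division} // '(none)', $id->{molecule};
}

my $ac = "AC   Q12345;";
my ($greedy) = $ac =~ /^AC\s+(\S+);?/;     # "Q12345;" (semicolon included)
my ($lazy)   = $ac =~ /^AC\s+(\S+?);?/;    # "Q" (stops after one character)
print "greedy=$greedy lazy=$lazy class=", parse_accession($ac), "\n";
```

Note that making the quantifier non-greedy is not enough when nothing
mandatory follows the capture: \S+? is then satisfied by a single
character. Excluding the semicolon from the character class avoids both
problems.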


BTW, which parser should be used for such files: swissprot (because they
are amino acid sequences) or embl (based on their origin)?

Best regards


-- 
Murad Nayal M.D. Ph.D.
Department of Biochemistry and Molecular Biophysics
College of Physicians and Surgeons of Columbia University
630 West 168th Street. New York, NY 10032
Tel: 212-305-6884	Fax: 212-305-6926
--------------41F4832FB5AEC7F69268E3FD
Content-Type: text/plain; charset=us-ascii;
 name="swiss.pm.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="swiss.pm.patch"

*** swiss.pm.org	Mon Mar 12 02:22:37 2001
--- swiss.pm	Mon Mar 12 02:21:24 2001
***************
*** 150,161 ****
         return undef; # end of file
     }
  
!    $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/ 
!      || $self->throw("swissprot stream with no ID. Not swissprot in my book");
!    $name = $1."_".$2;
!    $seq->primary_id($1);
!    $seq->division($2);
!    $seq->molecule($4);
      # this is important to have the id for display in e.g. FTHelper, otherwise
      # you won't know which entry caused an error
     $seq->display_id($name);
--- 150,168 ----
         return undef; # end of file
     }
  
!    if     ($line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/) {
!      $name = $1."_".$2;
!      $seq->primary_id($1);
!      $seq->division($2);
!      $seq->molecule($4);
!    } elsif($line =~ /^ID\s+(\S+)\s+([^\s;]+);\s+([^\s;]+);/              ) {
!      $name = $1;
!      $seq->primary_id($1);
!      $seq->molecule($3);
!    } else                                                                  {
!      $self->throw("swissprot stream with no ID. Not swissprot in my book");
!    }
! 
      # this is important to have the id for display in e.g. FTHelper, otherwise
      # you won't know which entry caused an error
     $seq->display_id($name);

--------------41F4832FB5AEC7F69268E3FD
Content-Type: text/plain; charset=us-ascii;
 name="embl.pm.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="embl.pm.patch"

*** embl.pm.org	Mon Mar 12 01:17:13 2001
--- embl.pm	Mon Mar 12 02:26:44 2001
***************
*** 151,160 ****
         return undef; # end of file
     }
     $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
!    $line =~ /^ID\s+(\S+)\s+\S+\;\s+(\S+)\;\s+(\S+)\;/;
!    $name = $1;
!    $mol = $2;
!    $div = $3;
     if(! $name) {
         $name = "unknown id";
     }
--- 151,166 ----
         return undef; # end of file
     }
     $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
! 
!    if   ($line =~ /^ID\s+(\S+)\s+\S+\;\s+(\S+)\;\s+(\S+)\;/) {
!      $name = $1;
!      $mol  = $2;
!      $div  = $3;
!    } elsif($line =~ /^ID\s+(\S+)\s+\S+\;\s+(\S+)\;/        ) {
!      $name = $1;
!      $mol  = $2;
!    }
! 
     if(! $name) {
         $name = "unknown id";
     }
***************
*** 176,181 ****
--- 182,193 ----
     until( !defined $buffer ) {
         $_ = $buffer;
  
+        # Exit if you found FT or SQ before encountering FH
+        if(/^FT   \w/ or /^SQ /) {
+          $self->_pushback($buffer);
+          last;
+        }
+ 
         # Exit at start of Feature table
         last if /^FH/;
  
***************
*** 185,191 ****
         }
  
         #accession number
!        if( /^AC\s+(\S+);?/ ) {
  	   $acc = $1;
  	   $acc =~ s/\;//;
  	   $seq->accession_number($acc);
--- 197,203 ----
         }
  
         #accession number
!        if( /^AC\s+([^\s;]+);?/ ) {
  	   $acc = $1;
  	   $acc =~ s/\;//;
  	   $seq->accession_number($acc);
***************
*** 192,198 ****
         }
         
         #version number
!        if( /^SV\s+(\S+);?/ ) {
  	   my $sv = $1;
  	   $sv =~ s/\;//;
  	   $seq->seq_version($sv);
--- 204,210 ----
         }
         
         #version number
!        if( /^SV\s+([^\s;]+);?/ ) {
  	   my $sv = $1;
  	   $sv =~ s/\;//;
  	   $seq->seq_version($sv);

--------------41F4832FB5AEC7F69268E3FD--