[Bioperl-l] IMGT parsing

Alex Brown abrown at tech.mrc.ac.uk
Fri Feb 18 06:15:32 EST 2005


Hi.

I had a small problem using Bio::SeqIO (in BioPerl 1.4) to parse the IMGT 
flat file database. Although IMGT uses an EMBL-like format, 
Bio::SeqIO was unable to extract display_id(), which is a bit of a 
nuisance when converting between formats. This is due to a difference 
between the ID lines of the EMBL and IMGT formats:

EMBL -
ID   TRBG361    standard; mRNA; PLN; 1859 BP.

IMGT -
ID   MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.
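
To make the difference concrete, here is a minimal standalone sketch (core Perl only, no BioPerl needed) that runs the ID-line pattern quoted from embl.pm further down this message against the two sample lines. The sub name parse_id_original and the %id_lines hash are my own, for illustration only. After the name, the pattern expects a single whitespace-free token followed directly by a semicolon ("standard;"), and the IMGT line has several space-separated words there, so the match fails and the name falls back to "unknown id".

```perl
use strict;
use warnings;

# ID-line pattern as quoted from embl.pm (BioPerl 1.4) in this message.
# In list context returns (name, molecule, division), or an empty list
# when the line does not match. The sub name is hypothetical.
sub parse_id_original {
    my ($line) = @_;
    return $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
}

my %id_lines = (
    EMBL => 'ID   TRBG361    standard; mRNA; PLN; 1859 BP.',
    IMGT => 'ID   MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.',
);

for my $format (sort keys %id_lines) {
    my ($name, $mol, $div) = parse_id_original($id_lines{$format});
    # Same fallback as embl.pm when no name was captured
    $name = 'unknown id' unless $name;
    print "$format: name=$name\n";
}
```

With the two lines above, the EMBL record yields TRBG361 and the IMGT record yields "unknown id".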

The following modification to embl.pm seems to allow correct parsing 
of both formats. Change the lines:

    $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
    $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
    $name = $1;
    $mol = $2;
    $div = $3;
    if(! $name) {
        $name = "unknown id";
    }

to :

    $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
    # new line replacing the old match and the three assignments,
    # allowing IMGT records to be read as well
    ($name, $mol, $div) = ($line =~ /^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/);
    if(! $name) {
        $name = "unknown id";
    }
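
As a quick check outside BioPerl, the sketch below runs the replacement pattern against both sample ID lines. It also tries a third line that I made up for illustration, in which the molecule type contains a space ("genomic DNA"). The old pattern captured that field with [^;]+, while the new one uses \S*, so such a line would no longer match. The second sub, parse_id_variant, is a hypothetical alternative of my own, not something taken from embl.pm. It keeps [^;]+ for the molecule field and skips everything up to the first semicolon. All sub and variable names here are assumptions.

```perl
use strict;
use warnings;

# Replacement pattern proposed in this message.
sub parse_id_relaxed {
    my ($line) = @_;
    return $line =~ /^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/;
}

# Hypothetical variant (an assumption, not BioPerl code): skip to the
# first semicolon, then allow spaces inside the molecule-type field.
sub parse_id_variant {
    my ($line) = @_;
    return $line =~ /^ID\s+(\S+)[^;]*;\s*([^;]+);\s*(\S+);/;
}

my @samples = (
    'ID   TRBG361    standard; mRNA; PLN; 1859 BP.',
    'ID   MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.',
    # Made-up line for illustration: molecule type containing a space
    'ID   EXAMPLE1   standard; genomic DNA; HUM; 100 BP.',
);

for my $line (@samples) {
    my @relaxed = parse_id_relaxed($line);
    my @variant = parse_id_variant($line);
    print "$line\n";
    print "  relaxed: ", (@relaxed ? join(' | ', @relaxed) : 'no match'), "\n";
    print "  variant: ", (@variant ? join(' | ', @variant) : 'no match'), "\n";
}
```

Both patterns return the same name, molecule and division for the EMBL and IMGT lines. Only the variant matches the made-up "genomic DNA" line, so it may be worth testing against your own records before settling on either.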

Hope this is useful.

Alex Brown.

PS. Back up embl.pm before changing it.


