[Bioperl-l] Bio::SeqIO Genbank parsing bug?

CHALFANT_CHRIS_M@Lilly.com CHALFANT_CHRIS_M@Lilly.com
Mon, 15 Jul 2002 09:43:51 -0500


While parsing the GenBank record for GI:1710638, I discovered that
Bio::SeqIO was dropping the VERSION line.  Here is the VERSION line for
this record:


VERSION     P51449  GI:1710638

Here is the regex that parses the VERSION line:

        #Version number
        if( /^VERSION\s+(\S+)\.(\d+)\s*(GI:\d+)?/ ) {
            $seq->seq_version($2);
            $seq->primary_id(substr($3, 3)) if($3);
        }

It appears that this regex requires the accession number in the VERSION
line to carry a "dot-version" extension (such as ".1").  When the
extension is missing, the pattern does not match at all, so the parser
skips the VERSION line entirely: neither seq_version nor primary_id (the
GI number) is set, and $seq->accession is left undefined.

I verified this behavior by changing a local copy of the record for 
1710638 to read:

VERSION     P51449.1  GI:1710638

I then parsed the altered copy with Bio::SeqIO. The VERSION line was 
parsed correctly this time.
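
The same check can be reproduced without Bio::SeqIO by running the quoted
regex against both forms of the VERSION line. This is only a minimal
sketch: parse_version_strict is a hypothetical helper name used for
illustration, not part of BioPerl, and it returns a plain hash instead of
calling methods on $seq.

```perl
use strict;
use warnings;

# The VERSION regex quoted above, wrapped in a helper for illustration.
# parse_version_strict is a hypothetical name, not a Bio::SeqIO method.
sub parse_version_strict {
    my ($line) = @_;
    if ( $line =~ /^VERSION\s+(\S+)\.(\d+)\s*(GI:\d+)?/ ) {
        return {
            accession => $1,
            version   => $2,
            gi        => defined $3 ? substr( $3, 3 ) : undef,
        };
    }
    return;    # no match: the caller never sees this VERSION line
}

# The original line from the record, then the hand-edited copy.
for my $line ( 'VERSION     P51449  GI:1710638',
               'VERSION     P51449.1  GI:1710638' ) {
    my $hit = parse_version_strict($line);
    print "$line => ", ( $hit ? 'matched' : 'not matched' ), "\n";
}
```

The first line fails because "\." needs a literal dot after the accession,
and the unversioned accession has none.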

Should the regex be changed so that it also accepts VERSION lines whose
accession has no "dot-version" extension?

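One possible way to relax the pattern is to make the dot-version part
optional. The following is only a sketch of that idea, not a tested patch
against Bio::SeqIO: parse_version_relaxed is a hypothetical helper name,
and the assumption that an accession never contains a dot is mine.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl.  Assumes the accession itself
# contains no dot, so everything up to the first dot or whitespace is the
# accession and an optional ".N" suffix is the version.
sub parse_version_relaxed {
    my ($line) = @_;
    if ( $line =~ /^VERSION\s+([^\s.]+)(?:\.(\d+))?\s*(GI:\d+)?/ ) {
        return {
            accession => $1,
            version   => $2,    # undef when there is no ".N" suffix
            gi        => defined $3 ? substr( $3, 3 ) : undef,
        };
    }
    return;
}

for my $line ( 'VERSION     P51449  GI:1710638',
               'VERSION     P51449.1  GI:1710638' ) {
    my $v = parse_version_relaxed($line) or next;
    printf "accession=%s version=%s gi=%s\n",
        $v->{accession},
        defined $v->{version} ? $v->{version} : '(none)',
        defined $v->{gi}      ? $v->{gi}      : '(none)';
}
```

In the parser itself, the call to $seq->seq_version would then need a
guard such as "if defined $2", since the version capture can be undefined
with this pattern.
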
Chris

Chris Chalfant
Bioinformatics
Eli Lilly and Company
317-433-3407