[Bioperl-l] bug in genbank.pm

Wang, Kai Wang.Kai@mayo.edu
Sat, 16 Feb 2002 17:30:05 -0600


I pointed out this problem about two months ago, but nobody has changed it. The
new GenBank file format adds a "molecular shape" field (linear or circular) to
the LOCUS line, so the current genbank.pm cannot process it.

In the file:

# $Id: genbank.pm,v 1.46 2002/02/14 16:41:22 jason Exp $
    if (($2 eq 'bp') || defined($5)) {
	if ($4 eq 'circular') {
	    $seq->molecule($3);
	    $seq->is_circular($4);
	    $seq->division($5);
	    ($date) = $line =~ /.*(\d\d-\w\w\w-\d\d\d\d)/;
	} else {
	    $seq->molecule($3);
	    $seq->division($4);
	    $date = $5;
	}
    } else {
	$seq->molecule('PRT') if($2 eq 'aa');
	$seq->division($3);
	$date = $4;
    }


The above code is based on the wrong assumption that NCBI will never add a
'linear' tag to a record.
One example is accession number 'NM_003748'. The first line is:

LOCUS       NM_003748               3134 bp    mRNA    linear   PRI 01-NOV-2000

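The current genbank.pm cannot recognize the date 01-NOV-2000 here, because the
extra 'linear' field shifts the division and the date one position to the right.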


I think the best way is to use:

    $line =~ /^LOCUS\s+(\S+)\s+\S+\s+(bp|aa)\s+(\S+)?\s+(\S+)?\s+(\w\w\w)?\s+(\d\d-\w\w\w-\d\d\d\d)?/
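
To make the idea concrete, here is a minimal standalone sketch of an alternative
that treats the shape field as optional. It is not the genbank.pm code and not a
patch against it: the helper name parse_locus_line and the hash keys are made up
for illustration. Instead of one regex with many optional groups, it splits the
line on whitespace and peels the date, division and shape off the right-hand end,
so the same code handles lines with and without 'linear'/'circular'.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: split a GenBank LOCUS line
# into fields, treating the linear/circular word as optional.
sub parse_locus_line {
    my ($line) = @_;
    my @tok = split ' ', $line;          # split on runs of whitespace
    return unless @tok >= 4 && $tok[0] eq 'LOCUS';
    shift @tok;                          # drop the 'LOCUS' keyword

    my %f;
    $f{name}   = shift @tok;             # locus name, e.g. NM_003748
    $f{length} = shift @tok;             # sequence length
    $f{unit}   = shift @tok;             # 'bp' or 'aa'
    return unless $f{length} =~ /^\d+$/ && $f{unit} =~ /^(?:bp|aa)$/;

    # Peel the optional trailing fields off the right-hand end.
    $f{date}     = pop @tok if @tok && $tok[-1] =~ /^\d\d-[A-Z]{3}-\d{4}$/;
    $f{division} = pop @tok if @tok && $tok[-1] =~ /^[A-Z]{3}$/;
    $f{topology} = pop @tok if @tok && $tok[-1] =~ /^(?:linear|circular)$/;

    # Whatever is left is the molecule type; default to PRT for proteins,
    # mirroring the quoted genbank.pm branch for 'aa'.
    $f{molecule} = @tok ? join(' ', @tok)
                 : $f{unit} eq 'aa' ? 'PRT'
                 : undef;
    return \%f;
}

my $locus = 'LOCUS       NM_003748               3134 bp    mRNA    linear   PRI 01-NOV-2000';
my $f = parse_locus_line($locus);
print join(' ', @{$f}{qw(molecule topology division date)}), "\n";
# prints: mRNA linear PRI 01-NOV-2000
```

One known limitation of this sketch: a line that has no division but ends in a
three-letter molecule type such as DNA would have that word taken as the
division, so real code would need a list of valid division codes.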