[Bioperl-l] Bio::SeqIO::genbank

Thu Apr 8 17:07:17 UTC 2010

I'm not a reader of the bioperl list, but this might be the right forum to
address a question that I'm dealing with:

I have a parser (non-Perl) that is having some trouble with "genbank"
formatted files.
The troublesome files come from another source that uses bioperl
to write their files with Bio::SeqIO::genbank.

The trouble is that the molecule type (in the LOCUS line) they are writing
is free text, as allowed by the Bioperl Bio::SeqIO::genbank module:

	    $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s",
				  'LOCUS', $seq->id(),$len,
				  (lc($alpha) eq 'protein') ? ('aa','', '') :
				  ('bp', '',$mol),$circular,
				  $div,$date);

However, the genbank file definition at
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

section 3.4.4 specifies the format for the LOCUS line.
In the table of column positions they specify a limited, fixed-width
vocabulary for the molecule type:
"48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
          mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
          snoRNA. Left justified."

which to me strongly suggests that the genbank file format requires a
fixed vocabulary for molecule type.

It seems that in Bio::SeqIO::genbank, at
	if( !$seq->can('molecule') || ! defined ($mol = $seq->molecule()) ) {
	    $mol =  $alpha || 'DNA';
	}

if $mol is not in the fixed list of genbank molecule types, it should
be set to the default value of 'DNA'; some other, smarter way of
forcing the molecule type into the fixed vocabulary would also be a help.
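
Something along these lines is roughly what I have in mind. This is only
a sketch: normalize_molecule is a name I made up, not an existing Bioperl
function, and the vocabulary is just the list quoted above from gbrel.txt
section 3.4.4. It assumes that matching case-insensitively and falling
back to 'DNA' is acceptable.

```perl
use strict;
use warnings;

# Molecule types quoted above from gbrel.txt section 3.4.4.
# Keys are lower-cased so the lookup is case-insensitive; values keep
# the spelling the spec uses.
my %GENBANK_MOL = map { lc($_) => $_ }
    qw(NA DNA RNA tRNA rRNA mRNA uRNA snRNA snoRNA);

# Hypothetical helper (not part of Bio::SeqIO::genbank): map free text
# onto the fixed vocabulary, defaulting to 'DNA' when there is no match.
sub normalize_molecule {
    my ($mol) = @_;
    return 'DNA' unless defined $mol;

    # Trim surrounding whitespace and lower-case before the lookup.
    (my $key = lc $mol) =~ s/^\s+|\s+$//g;

    return exists $GENBANK_MOL{$key} ? $GENBANK_MOL{$key} : 'DNA';
}

# Possible use right after the existing fallback shown above:
#   $mol = normalize_molecule($mol);
```

A silent fallback does lose information (for example 'genomic DNA' would
become plain 'DNA'), so a warning at that point might be preferable.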

Thanks for any replies.

-- 
Wayne Davis

Department of Biology
University of Utah
257 South 1400 East
Salt Lake City, UT 84112-0840
(801) 585-3692