[emboss-dev] Regression in GenBank/GenPept parsing?

Mon Jul 20 16:56:45 UTC 2009

Hi all,

One of Biopython's unit tests uses the EMBOSS tools for several
tasks. These include checking that we agree on basic sequence
translations using different tables, and making sure Biopython
can parse the alignments output by needle and water. Another area
is cross-checking that we can read each other's sequence output files.

I've been going over the Biopython unit tests with EMBOSS 6.1.0,
and have found a regression compared to EMBOSS 6.0.1. This is
to do with how EMBOSS parses a minimal GenBank file written
with Biopython.

The file in question is a 10 kB GenBank file (strictly a GenPept file,
as it holds protein sequences) converted from an IntelliGenetics file.
I can email this on request. The file contains 16 records:

$ grep "^LOCUS" VIF_mase-pro.gb | wc -l
      16

Using EMBOSS 6.0.1, there are warning messages about the
LOCUS line, but all 16 records do get converted into FASTA
format fine. I'm not sure why it is complaining, and would be
grateful for feedback:

$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta \
    -auto -filter | grep ">" | wc -l
Warning: bad Genbank LOCUS line 'LOCUS       most-likely
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       U455
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       HXB2R
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       ELI
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       MVP5180
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       AD_MAL
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       CPZGAB
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       CPZANT
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       ROD
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       EHOA
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       MM251
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       STM
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       AGM3
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       AGM677
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       SAB1C
298 aa                     UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS       SYK
298 aa                     UNK 01-JAN-1980
'
      16

In any case, seqret 6.0.1 was able to convert this to a FASTA file of 16
records. However, seqret 6.1.0 fails - only the first record is extracted:

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta \
    -auto -filter | grep ">" | wc -l
       1
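
The same record-count comparison could be scripted, which is roughly
what a unit test would want to do. The following is only a sketch using
the Python standard library; the helper names are made up for
illustration, and the sample strings are tiny stand-ins rather than the
real VIF_mase-pro.gb content or real seqret output:

```python
def count_genbank_records(text):
    """Count records via lines starting with LOCUS (like grep "^LOCUS")."""
    return sum(1 for line in text.splitlines() if line.startswith("LOCUS"))


def count_fasta_records(text):
    """Count records via header lines starting with '>'.

    Slightly stricter than grep ">", which matches '>' anywhere on a line.
    """
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# Stand-in data (hypothetical): two GenBank records, one FASTA record,
# mimicking the "only the first record is extracted" symptom.
genbank_text = (
    "LOCUS       A  3 aa  UNK 01-JAN-1980\nORIGIN\n        1 mkv\n//\n"
    "LOCUS       B  3 aa  UNK 01-JAN-1980\nORIGIN\n        1 mkv\n//\n"
)
fasta_text = ">A\nMKV\n"

expected = count_genbank_records(genbank_text)
found = count_fasta_records(fasta_text)
if found != expected:
    print("Record count mismatch: %i in GenBank, %i in FASTA"
          % (expected, found))
```

In practice the FASTA text would come from capturing the seqret output
shown above rather than from a hard-coded string.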

If there is something wrong with my LOCUS lines, I can fix them. Any
thoughts? The LOCUS lines are reproduced above in the EMBOSS 6.0.1
warning messages. One possible issue is the use of an arbitrary
date (01-JAN-1980, a common default which shouldn't get confused
with a real date) rather than something equally arbitrary (like the
date of the conversion) or omitting the date altogether (which may be
invalid).
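
To make the discussion of the LOCUS lines more concrete, here is a rough
sketch that splits one of them into labelled fields. It assumes the
six-token layout seen in the warnings above (LOCUS, name, length, units,
division, date); the field labels and the function name are an
interpretation for illustration only. The exact column spacing of the
original line is not preserved in this email, and the sketch does not
check column positions, which may be what EMBOSS objects to:

```python
from datetime import datetime


def split_locus_line(line):
    """Split a LOCUS line on whitespace and label the six tokens.

    Assumed layout (not verified against EMBOSS or the GenBank
    specification): LOCUS, name, length, units, division, date.
    """
    parts = line.split()
    if len(parts) != 6 or parts[0] != "LOCUS":
        raise ValueError("Unexpected LOCUS line: %r" % line)
    _keyword, name, length, units, division, date = parts
    return {
        "name": name,
        "length": int(length),
        "units": units,  # "aa" here, since the records are proteins
        "division": division,
        # Month abbreviation parsing depends on the current locale
        "date": datetime.strptime(date, "%d-%b-%Y").date(),
    }


# Spacing here is a guess; only the tokens come from the warning text
fields = split_locus_line("LOCUS       most-likely  298 aa  UNK 01-JAN-1980")
print(fields["name"], fields["length"], fields["units"], fields["date"])
```

A line that splits cleanly like this can still be rejected by a parser
that expects fixed column positions, so this only rules out missing or
malformed tokens.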

Thanks,

Peter C.
(@Biopython)