[emboss-dev] Regression in GenBank/GenPept parsing?
Peter
biopython at maubp.freeserve.co.uk
Mon Jul 20 16:56:45 UTC 2009
Hi all,
One of Biopython's unit tests uses the EMBOSS tools. It covers
several tasks, including checking that we agree on basic sequence
translations using different tables, and making sure Biopython
can parse the alignments output by needle and water. Another area
is cross-checking that we can read each other's sequence output files.
I've been going over the Biopython unit tests with EMBOSS 6.1.0,
and have found a regression compared to EMBOSS 6.0.1. This is
to do with how EMBOSS parses a minimal GenBank file written
with Biopython.
The file in question is a 10kb GenBank file (well, a GenPept file as
it holds protein sequences) converted from an IntelliGenetics file.
I can email this on request. The file contains 16 records:
$ grep "^LOCUS" VIF_mase-pro.gb | wc -l
16
Using EMBOSS 6.0.1, there are warning messages about the
LOCUS line, but all 16 records do get converted into FASTA
format fine. I'm not sure why it is complaining, and would be
grateful for feedback:
$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta \
    -auto -filter | grep ">" | wc -l
Warning: bad Genbank LOCUS line 'LOCUS most-likely
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS U455
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS HXB2R
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS ELI
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS MVP5180
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS AD_MAL
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS CPZGAB
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS CPZANT
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS ROD
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS EHOA
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS MM251
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS STM
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS AGM3
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS AGM677
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS SAB1C
298 aa UNK 01-JAN-1980
'
Warning: bad Genbank LOCUS line 'LOCUS SYK
298 aa UNK 01-JAN-1980
'
16
In any case, seqret 6.0.1 was able to convert this to a FASTA file of 16
records. However, seqret 6.1.0 fails - only the first record is extracted:
$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta \
    -auto -filter | grep ">" | wc -l
1
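The same comparison can be scripted, which may be handy when checking several EMBOSS versions. What follows is only a minimal sketch using the Python standard library: the function names are made up for illustration, the seqret arguments are copied from the command above, and FASTA records are counted by lines starting with ">" (slightly stricter than the grep, which matches ">" anywhere on a line).

```python
import subprocess


def count_genbank_records(text):
    """Count records by LOCUS lines, like: grep "^LOCUS" file | wc -l"""
    return sum(1 for line in text.splitlines() if line.startswith("LOCUS"))


def count_fasta_records(text):
    """Count FASTA records by header lines starting with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def seqret_to_fasta(path):
    """Run seqret with the options used in this post; return its stdout.

    Assumes seqret is on the PATH. Warnings go to stderr and are ignored here.
    """
    cmd = ["seqret", "-sequence", path, "-sformat", "genbank",
           "-osformat", "fasta", "-auto", "-filter"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


def check_round_trip(genbank_text, fasta_text):
    """Return (records in input, records in output, whether they match)."""
    expected = count_genbank_records(genbank_text)
    found = count_fasta_records(fasta_text)
    return expected, found, expected == found
```

With the file from this post, something like check_round_trip(open("VIF_mase-pro.gb").read(), seqret_to_fasta("VIF_mase-pro.gb")) would be expected to report a match under 6.0.1 and a mismatch (16 versus 1) under 6.1.0.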
If there is something wrong with my LOCUS lines, I can fix them. Any
thoughts? The LOCUS lines are reproduced above in the EMBOSS 6.0.1
warning messages. One possible issue is the inclusion of an arbitrary
date (01-JAN-1980, a common default which shouldn't get confused
with a real date), rather than something equally arbitrary (like the date
of the conversion), or simply omitting the date (which may be invalid).
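To make the fields under discussion explicit, here is a rough sketch that splits one of the LOCUS lines above on whitespace and checks whether the last token looks like a DD-MON-YYYY date. It is not how EMBOSS validates the line (that logic is not shown in this post), it ignores the column positions of the GenBank flat file format, and the function names are hypothetical.

```python
from datetime import datetime


def parse_locus_date(token):
    """Return a date for tokens like 01-JAN-1980, or None if it does not parse."""
    try:
        return datetime.strptime(token, "%d-%b-%Y").date()
    except ValueError:
        return None


def split_locus_line(line):
    """Split a LOCUS line on whitespace into name, length, unit and date.

    Whitespace splitting is an assumption made for illustration only;
    it says nothing about whether the columns are laid out correctly.
    """
    fields = line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "unit": fields[3],  # "aa" for protein records, "bp" for nucleotide
        "date": parse_locus_date(fields[-1]),  # None if missing or malformed
    }


info = split_locus_line("LOCUS most-likely 298 aa UNK 01-JAN-1980")
print(info["name"], info["length"], info["unit"], info["date"])
```

A line written without the date would come back with "date" set to None, which makes it easy to test the different options (placeholder date, conversion date, no date) against each EMBOSS version.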
Thanks,
Peter C.
(@Biopython)