[Biojava-l] Re: [Bioperl-l] Writing genbank files

Fri, 18 Jan 2002 22:07:33 -0700

>I have come across a problem with genbank files using the perl module
>Bio::DB::GenBank. When I get the genbank sequence from NCBI and write the
>sequence out to in genbank format the Locus line is missing the date.

Let me take a stab at answering this.

What does it mean for something to be in the GenBank format?  There
are many answers.  All start with the format definition given in the
release notes, at ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

This document is very specific.  For example, here is the current
definition for the LOCUS line

LOCUS       16Char_LocusName 99999999999 bp ss-snoRNA  circular DIV
DD-MMM-YYYY

Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-28      Locus name
31-31      space
30-40      Length of sequence, right-justified
41-41      space
42-43      bp
44-44      space
45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
           snoRNA. Left justified.
54-55      space
56-63      'linear' followed by two spaces, or 'circular'
64-64      space
65-67      The division code (see Section 3.3)
68-68      space
69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

(Note that this is the new LOCUS line format as the format changed
last month.)

The document makes no statement that those fields are optional, so
in that regard the Bioperl 'genbank' output is not in GenBank format.
On the other hand, BioJava is incorrect for saying the problem with
the LOCUS line.  If you read the format specification, section 3.1
says a GenBank file starts with a few header lines

    The first line of the file contains the file name in character
    positions 1 to 9 and the full database name (Genetic Sequence
    Data Bank) starting in column 22. The brief names of the files
    in this release are listed in section 2.2.

    The second line contains the date of the current release in the
    form `day month year', beginning in position 27. The fourth line
    contains the current GenBank release number.

so BioJava should complain that the first few lines of the file should
look like

    GBBCT1.SEQ           Genetic Sequence Data Bank
                              15 December 2001

Bioperl doesn't generate GenBank output but then again BioJava
will try to make some sense from non-GenBank files.

There are two difficulties here.  Most people face the parsing
problem while fewer need to deal with the writing problem.

There are two aspects to the parsing problem: there are a lot of formats,
and each contains many fields, of which only a few are needed.  Few people
enjoy writing parsers so odds are that when you (or your friendly
neighborhood bio* dealers) need to support a new format, you'll only
read the essential fields from a file and ignore the other fields.
If the input file is in the expected format then this works fine.

The other aspect of this problem is the input doesn't need to be in
the right format to be parsed.  For example, you can add new fields
to a SWISS-PROT entry and the BioJava parser will ignore the fields.
This is good and bad.  The good part is it allows a certain amount
of future compatibility.  ExPASy can add new line types to the file
and BioJava will still read it.

But the change might be unexpected.  For example, SWISS-PROT once
only allowed a single AC line.  That changed in release 39, as
mentioned in the release notes at:
  http://www.expasy.ch/txt/old-rel/relnotes.38.txt
That silently broke the Bioperl code, because it assumed only a single
AC line, and it ended up only saving the accession numbers from the
last line.  The patch to fix that error was added last year and is at
http://www.bioperl.org/pipermail/bioperl-guts-l/2000-December/002624.html

For an example that affect the Biopython code, my genbank parser
requires the old-style LOCUS line and the parser is column based,
so the fields must be in exactly the correct location.  It will
break with the new-style LOCUS line mentioned above, but if I used
a field based parser instead of a column based parser it wouldn't
have broken.  Until this release there was no guarantee in how the
LOCUS line would change over time, so I believe that breaking loudly
is better then breaking silently.

People take advantage of the leniency of the various readers.
Structure people have to work with PDB files.  The PDB defines
dozens of different card types but for structure visualization
and manipulation you can start by supporting only the ATOM, HETATM,
and possibly END and ENDMDL fields.  This is well known, and
a lot of software will work fine with only those fields defined.
Indeed, generating a well-formed PDB file is a lot of work.
There are perhaps a half-dozen programs capable of doing so, and
they are all related to PDB submission (XPLOR, O, etc.)

The leniency applies to the GenBank format as well.  I just did
a google search for "genbank file format" and came up with:

http://www.icp.ucl.ac.be/~opperd/private/cox2seq.html
which claims "COX2 sequences in GenBank format."  But the entries
look like this

  LOCUS       cox2_leita       176 bp
  DEFINITION  cox2_leita       421 bp, 176 bases, F21F24E0 checksum.
  ORIGIN
     1  MAFILSFWMI FLLDSVIVLL SFVCFVCVWI CALLFSTVLL VSKLNNIYCT
    51  WDFTASKFID VYWFTIGGMF SLGLLLRLCL LLYFGHLNFV SFDLCKVVGF
   101  QWYWVYFIFG ETTIFSNLIL ESDYMIGDLR LLQCNHVLTL LSLVIYKLWL
   151  SAVDVIHSFA ISSLGVKVEN LVAVMK
  //

This isn't in GenBank format.  Some parsers are lenient enough to
read this.  Others are not.

(There are a few other aspects to the parsing problem I just missing.
Some people extend a format, to add support for new data typs.
For example, http://www.tigr.org/~jeisen/GDE/AppendixA.html
    The Genbank character set has also been extended in order to
    support these other data types.
In other cases the format documentation is incorrect, or at least
lacking, while in still others the database is in an obviously
incorrect format.)

  The writing problem.

Assume you've read the data from one format and want to generate
it in another format.  You don't necessarily have all the information
needed for that.  Either you have to say it can't be done or you
have to lie and generate invalid data.  You can make a direct lie
and generate false information, and make a false date for the LOCUS
line.  Or you can lie through omission and not put anything there
(eg, use spaces or generate a short line).  The first produces
syntactically valid output, but that false data may accidentally
be treated as valid.  The second produces incorrectly formatted
data but is less likely to corrupt your other data.

(I use "lie" somewhat provocatively.  In reality the software should
respond to a request for something in GenBank format by saying it
can't generate a GenBank file but can generate something which is
GenBank-like, and somehow describe the differences.  Documenting
this nuance and those differences is itself hard and tedious,
which is why it is rarely done outside of pointers to source code.)

In then happens that there's a de facto consensus amoung those
in the know about what constitutes a "correct" GenBank file.  This
consensus isn't documented and is learned mostly from experience.
(Eg, a common consensus is to skip everything in the file up to
the LOCUS line and still call it a GenBank file.  This supports both
NCBI distributed GenBank data files and the "sequence of GenBank
entries" format generated by most software.)

I don't like the current situation.  I think it makes it too hard
for people to work with these files, especially since mastering
file formats doesn't advance science one whit.  So I've
been working for the last couple years on ways to improve it, and
think I've a solution, which I've been contributing to biopython.
But that's going to be several long emails in its own right - I'm
planning on presenting it publically for the first time at the
Tuscon conferences.

>Is the date a required element in the Locus line? Is there consensus
>on what constitutes correct format? Has it changed recently?

Yes.  Mostly, but not really.  No, but the rest of the line has.
(Though that's not the problem.)

>I also noticed that the biojava parser is very picky about the number of
>spaces; delete a few spaces between DNA and linear and it dies too.

Under the old-style definition that was an appropriate choice, since the
fields in that line were defined by column positions.  New-style also
has column definitions, but says that a field based interpretation will
always be valid for the future.

                    Andrew
                    dalke@dalkescientific.com