[Biopython-dev] Parsing malformed genbank files (e.g. VectorNTI)

Thu Dec 4 15:02:13 UTC 2008

Peter wrote:
> On Wed, Dec 3, 2008 at 12:19 AM, Timothy Ham <timothyham at gmail.com> wrote:
>   
>> Hi everyone,
>>
>> The current biopython GenBank parser dies while parsing VectorNTI
>> generated files.  For example, until recently, BioPython did not
>> accept an empty SOURCE field. It still does not handle empty
>> VERSION or ACCESSION fields (consumer.data.id never gets filled),
>> which is the default for user generated vector maps via VectorNTI.
>>     
>
> I fixed the SOURCE issue in Bio/GenBank/__init__.py CVS revision 1.97
> after Tim contacted me offlist - there was no bug report.
>
>   
>> Now, it is easy enough to change the GenBank parser to handle
>> malformed genbank files, (I can submit patches) but the real question
>> becomes:
>>     
>>> Should BioPython handle malformed genbank files at all?
>>>       
>> I would like to be practical and say yes, since VectorNTI is a very
>> common, widely used format, but I wanted to ask the community before
>> submitting my patches.
>>
>> Thanks for the great work,
>> Tim
>>     
>
> As I'm the de facto maintainer for Bio.GenBank, I guess unless the list
> as a whole has a consensus this is my call.
>
> Reading the GenBank file format spec, the ACCESSION and VERSION lines
> are clearly intended to be mandatory.  Note that for mandatory fields,
> IIRC, the NCBI will use a single dot/period as a place holder when
> there is no data.  So I would argue that VectorNTI is producing
> invalid files, and you should write to the authors and encourage them
> to follow the spec more closely (even if we do change Biopython to
> cope).
>
> However, I'm willing to bend a little on out of spec GenBank files (in
> cases like this where there is no ambiguity about the parsing), but I
> would want a real example output file from VectorNTI to include for a
> unit test.  This is important as we need to use something sensible for
> the SeqRecord's id property if the ACCESSION and VERSION are missing.
>
> Peter
>   
At http://www.ncbi.nlm.nih.gov/Genbank/index.html there is a link to the
'complete release notes for the current version of GenBank'.
That file, ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt, clearly states that
ACCESSION and VERSION are mandatory. I interpret the '/' in phrases such
as 'Mandatory keyword/exactly one record' to mean 'with'. The relevant
section is:

3.4.2  Entry Organization
"
  The second part of each sequence entry record contains the information
appropriate to its keyword, in positions 13 to 80 for keywords and
positions 11 to 80 for the sequence.

  The following is a brief description of each entry field. Detailed
information about each field may be found in Sections 3.4.4 to 3.4.15.

LOCUS	- A short mnemonic name for the entry, chosen to suggest the
sequence's definition. Mandatory keyword/exactly one record.

DEFINITION	- A concise description of the sequence. Mandatory
keyword/one or more records.

ACCESSION	- The primary accession number is a unique, unchanging
identifier assigned to each GenBank sequence record. (Please use this
identifier when citing information from GenBank.) Mandatory keyword/one
or more records.

VERSION		- A compound identifier consisting of the primary
accession number and a numeric version number associated with the
current version of the sequence data in the record. This is followed
by an integer key (a "GI") assigned to the sequence by NCBI.
Mandatory keyword/exactly one record.
"

If these entries are missing then Biopython must raise an exception 
because the GenBank file is invalid.
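
As a rough illustration of such a check, here is a minimal sketch in plain
Python. It is not Biopython's actual code: the function names are
hypothetical, and it assumes that keywords start in the first column and
that a keyword with nothing after it counts as missing.

```python
# Keywords the release notes quoted above describe as mandatory.
MANDATORY = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")


def missing_mandatory_keywords(text):
    """Return the mandatory keywords that are absent or carry no data."""
    found = {}
    for line in text.splitlines():
        # Top-level keywords start in the first column; continuation
        # lines, sub-keywords, features and sequence lines are indented.
        if not line or line[0].isspace():
            continue
        parts = line.split(None, 1)
        data = parts[1].strip() if len(parts) > 1 else ""
        # Keep only the first occurrence of each keyword.
        found.setdefault(parts[0], data)
    return [key for key in MANDATORY if not found.get(key)]


def check_strict(text):
    """Raise ValueError if any mandatory keyword is missing or empty."""
    missing = missing_mandatory_keywords(text)
    if missing:
        raise ValueError("Invalid GenBank record, missing: " + ", ".join(missing))
```

A header with bare ACCESSION and VERSION lines, which is what Tim describes
for VectorNTI, would be reported as missing both keywords.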

While I have not seen an example, does VectorNTI output contain a
LOCUS field that could be used as an accession number?
I think it is fairly common for the accession number to be part of the 
LOCUS field.
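
If the parser were relaxed instead, the fallback could be sketched roughly
as below. Again this is only an assumption about how it might work, not
what Bio.GenBank does: choose_record_id and its default value are made-up
names, and a lone '.' is treated as an empty placeholder, following
Peter's recollection above.

```python
def choose_record_id(text, default="<unknown id>"):
    """Pick an identifier from VERSION, then ACCESSION, then the LOCUS name."""
    fields = {}
    for line in text.splitlines():
        # Only top-level keyword lines (starting in the first column) matter.
        if not line or line[0].isspace():
            continue
        parts = line.split()
        # Store the tokens following the keyword; keep the first occurrence.
        fields.setdefault(parts[0], parts[1:])
    for keyword in ("VERSION", "ACCESSION", "LOCUS"):
        tokens = fields.get(keyword)
        # Skip keywords that are absent, empty, or hold only a '.' placeholder.
        if tokens and tokens[0] != ".":
            return tokens[0]
    return default
```

With empty ACCESSION and VERSION lines this returns the name from the LOCUS
line, while a normal NCBI record still gets its versioned accession.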

Bruce