[Biopython-dev] Parsing malformed genbank files (e.g. VectorNTI)

Timothy Ham timothyham at gmail.com
Thu Dec 4 21:52:33 UTC 2008


On Thu, Dec 4, 2008 at 2:26 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Dec 3, 2008 at 12:19 AM, Timothy Ham <timothyham at gmail.com> wrote:
>>
>> Hi everyone,
>>
>> The current Biopython GenBank parser dies while parsing files
>> generated by VectorNTI.  For example, until recently, Biopython did
>> not accept an empty SOURCE field. It still does not handle empty
>> VERSION or ACCESSION fields (consumer.data.id never gets filled),
>> which are the default for user-generated vector maps from VectorNTI.
>
> I fixed the SOURCE issue in Bio/GenBank/__init__.py CVS revision 1.97
> after Tim contacted me offlist - there was no bug report.
>
>> Now, it is easy enough to change the GenBank parser to handle
>> malformed GenBank files (I can submit patches), but the real question
>> becomes:
>>> Should Biopython handle malformed GenBank files at all?
>> I would like to be practical and say yes, since VectorNTI is a very
>> common, widely used format, but I wanted to ask the community before
>> submitting my patches.
>>
>> Thanks for the great work,
>> Tim
>
> As I'm the de facto maintainer for Bio.GenBank, I guess that unless the
> list as a whole reaches a consensus, this is my call.
>
> Reading the GenBank file format spec, the ACCESSION and VERSION lines
> are clearly intended to be mandatory.  Note that for mandatory fields,
> IIRC, the NCBI will use a single dot/period as a place holder when
> there is no data.  So I would argue that VectorNTI is producing
> invalid files, and you should write to the authors and encourage them
> to follow the spec more closely (even if we do change Biopython to
> cope).
>
> However, I'm willing to bend a little on out-of-spec GenBank files (in
> cases like this where there is no ambiguity about the parsing), but I
> would want a real example output file from VectorNTI to include for a
> unit test.  This is important as we need to use something sensible for
> the SeqRecord's id property if the ACCESSION and VERSION are missing.
>
> Peter
>

I have attached two representative example GenBank outputs from
VectorNTI. I don't know if the mailing list accepts attachments, but
if it doesn't, is there a place where I can put them (maybe the
Biopython wiki)?
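
On the question of a sensible SeqRecord id when ACCESSION and VERSION
are empty, one possible fallback order is VERSION, then ACCESSION, then
the LOCUS name. The sketch below is standalone Python, not the actual
Bio.GenBank code; the function name choose_record_id and the placeholder
string are hypothetical, and it assumes header keywords start in the
first column of the line.

```python
def choose_record_id(header_lines, placeholder="<unknown id>"):
    """Pick an identifier from GenBank header lines, tolerating empty fields.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    fields = {}
    for line in header_lines:
        # The header ends where the feature table or the sequence begins.
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        parts = line.split(None, 1)
        # Skip blank lines and indented continuation lines.
        if not parts or line[0].isspace():
            continue
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        # Keep the first occurrence of each keyword.
        fields.setdefault(keyword, value)

    for keyword in ("VERSION", "ACCESSION"):
        value = fields.get(keyword, "")
        # Treat a lone "." placeholder the same as an empty field.
        if value and value != ".":
            return value.split()[0]

    # Fall back to the first token on the LOCUS line, i.e. the locus name.
    locus = fields.get("LOCUS", "")
    if locus:
        return locus.split()[0]
    return placeholder
```

With a header whose ACCESSION and VERSION lines carry no value, this
returns the locus name, so the record would still get a usable id.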

Tim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vnti_example.zip
Type: application/zip
Size: 11716 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20081204/15ebddc9/attachment-0001.zip>