[Biopython-dev] [Bug 2294] These patches allow one to write a GenBank file using the SeqIO interface

Peter biopython-dev at maubp.freeserve.co.uk
Wed May 16 22:05:44 UTC 2007


Hi Howard,

I'm replying to the mailing list as you've raised a few more general 
issues on bug 2294.

Peter wrote:
>> In the writer class, for your write_file(self, records) method, you
>> explicitly check for and allow "records" to be a single SeqRecord.  Don't. Any
>> such "helpfulness" should be done in Bio.SeqIO.write() only, and not the
>> individual write_file.  Otherwise we'll end up with a situation where some
>> writers are "helpful" and others are not.

Howard replied:
> Currently, the SeqIO's write function is
> 
> def write(sequences, handle, format):
>  ...
>
> I can add checks to see if "sequences" is (a) a generator, (b) a SeqRecord
> object, or (c) something else. If (a), then call
> writer_class(handle).write_file(sequences). If (b), then call
> writer_class(handle).write_record(sequences). If (c), spit out an error (for
> now).

I've added a check in Bio.SeqIO.write() for the "sequences" argument 
being a SeqRecord (your case b), and if so it now raises a ValueError. 
This is better than whatever cryptic error would have happened.
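
A minimal sketch of that kind of check follows. It is not the actual Bio.SeqIO code: the SeqRecord class is a stand-in so the example runs without Biopython, the writer is a toy FASTA-style loop, and the error message wording is made up.

```python
from io import StringIO


class SeqRecord:
    """Hypothetical stand-in for Bio.SeqRecord.SeqRecord (illustration only)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def write(sequences, handle, format):
    """Toy version of a write() function that rejects a lone SeqRecord."""
    # Fail early with a clear message instead of a cryptic error later on.
    if isinstance(sequences, SeqRecord):
        raise ValueError("Expected a list or iterator of SeqRecord objects, "
                         "not a single SeqRecord")
    # Assumption: a trivial FASTA-style writer; "format" is ignored here.
    count = 0
    for record in sequences:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


handle = StringIO()
write([SeqRecord("example", "ACGT")], handle, "fasta")  # a list is fine
```

Passing a bare record, as in write(SeqRecord("example", "ACGT"), handle, "fasta"), raises the ValueError; wrapping it in a list is the caller's job.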

I agree that it might be "nicer" if Bio.SeqIO.write() also accepted 
a SeqRecord object as input and did the expected thing, but having a 
fixed simple API is more straightforward.

For comparison, see the previous discussions on the mailing list about 
having the file argument accepting either a handle or a filename (it was 
agreed that we would accept handles only).

> Ok, so the standard is very exact in what the LOCUS line should be. However,
> I've found that many programs do not write Genbank files exactly according to
> this standard! So we might want to make the Genbank parser a bit more forgiving
> to small changes in the spacing of Locus line, especially since many programs
> leave out keywords.

Have you got some examples?  I would be keen to add a few more test 
cases for reasonable GenBank variations.

> As it stands now, the patch code can handle missing keywords in the LOCUS line.

If it doesn't already, the existing column based code can easily do 
this too.

> For example, the code defines a pair of dictionaries with lambda functions as
> their keys
> 
> ...
> 
> I know this looks crazy, but it works really well. Where else but Python can I
> have a dictionary / hash / whatever with the key being a function! :) Play
> around with the code and you'll see how it works.

Crazy code is scary ;)  I'll try and have a play with this at the weekend.

Note that this issue (parsing the LOCUS line) is a bit tangential to 
writing GenBank records.

>> It also looks like when you write the LOCUS line you are not following
>> the column based definition
> 
> I'll fix this. The writing of the Genbank file should follow the standard to
> the exactitude.

I agree completely. As a general principle we should be a little bit 
flexible on reading files, but very strict on output.
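
As a rough illustration of that principle, here is a sketch with a lenient reader and a strict writer for a simplified header line. The function names and field widths are made up for the example and should not be taken as the GenBank LOCUS specification or as the existing parser.

```python
def parse_header(line):
    """Lenient reader: tolerate any amount of whitespace between fields."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "LOCUS":
        raise ValueError("Not a recognisable header line: %r" % line)
    return parts[1], int(parts[2])


def format_header(name, length):
    """Strict writer: fixed-width columns, refusing values that do not fit."""
    # Assumption: illustrative widths only (16 for the name, 11 for the length).
    if len(name) > 16:
        raise ValueError("Name too long for its column: %r" % name)
    if len(str(length)) > 11:
        raise ValueError("Length too large for its column: %r" % length)
    return "LOCUS       %-16s %11d bp" % (name, length)


# The reader accepts both the strict output and a sloppier variant.
strict = format_header("AB123", 100)
assert parse_header(strict) == parse_header("LOCUS AB123   100 bp")
```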

Regards,

Peter



