[Biopython-dev] Bio.SeqIO

Peter biopython-dev at maubp.freeserve.co.uk
Mon Jan 15 20:04:34 UTC 2007


Michiel de Hoon wrote:
> In my opinion, the new Bio.SeqIO code is a huge improvement to 
> Biopython, so I'd be happy to make a new release for it.
> 
> ...
> 
> For Bio.SeqIO, we're also in pretty good shape, as far as I can tell. 
> From what I remember, the remaining issues were
> 1) Which functionality to include, in particular
>    a) if functions should accept file names in addition to file handles;

I have decided to follow Michiel's stance on this issue: handles only.

>    b) if functions should infer the file format from the file extension, 
> the file content, or otherwise.

Right now the file format string is optional and if omitted the file 
extension (via handle.name) is used to try and guess.
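
For illustration only, the extension-based guess could look something 
like the sketch below. The name guess_format and the extension table are 
hypothetical, made up for this example; this is not the actual Bio.SeqIO 
code.

```python
import os

# Hypothetical extension-to-format table (an assumption for this sketch).
_EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gb": "genbank",
    ".gbk": "genbank",
}


def guess_format(handle, format=None):
    """Return the explicit format if given, else guess it from handle.name."""
    if format is not None:
        return format
    # Not every handle has a usable name: StringIO has none at all, and
    # some handles carry an integer file descriptor instead of a path.
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        raise ValueError("No format given and the handle has no file name")
    extension = os.path.splitext(name)[1].lower()
    if extension not in _EXTENSION_TO_FORMAT:
        raise ValueError("Cannot guess format from extension %r" % extension)
    return _EXTENSION_TO_FORMAT[extension]
```

Handles without a usable name are one of the awkward cases: the guess 
has nothing to work with, so the caller must supply the format anyway.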

It would be trivial to remove this functionality and make format a 
required argument.

We could at a later date choose to add limited support for format 
guessing based on file contents without altering the function parameters 
(i.e. the API).

Both these features would be nice to have (speaking as a user), but then 
again, am I prepared to support the headaches they may cause later on? 
I'm wavering on this issue (having previously been in favour of 
including the format guessing).

Item 1(c) on Michiel's list could have been whether we need the three 
"helper functions" which turned a file into a SeqRecord list, dictionary 
or alignment.

Again, I have come round to Michiel's view and removed these as they 
were just simple wrappers for list, SequencesToDictionary and 
SequencesToAlignment.

> 2) What are the best names for the functions that the user will see.

The good news is that after that little spring clean there are fewer 
functions to name - just these four really:

SequenceIterator, once known as FileToSequenceIterator and before that 
File2SequenceIterator.  Now takes just an input file handle and an 
optional file format.  Returns a SeqRecord iterator.

SequencesToDictionary - takes SeqRecord iterator or list, plus an 
optional function to define the keys, and returns a dictionary.

SequencesToAlignment - takes SeqRecord iterator or list, and returns an 
alignment object.  Perhaps this functionality should be included in the 
alignment class itself...

WriteSequences, once known as SequencesToFile - takes a SeqRecord 
iterator or list, an output handle, and a format string.  Intended for 
use on a whole file at once (i.e. the general case where there may be 
headers/footers etc).  This does not let you do incremental writes, one 
record at a time (which would be possible for some formats like GenBank 
or fasta).
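
To show how the pieces are meant to fit together, here is a rough, 
FASTA-only sketch of the iterator / dictionary / whole-file-write flow. 
Everything in it is a stand-in invented for illustration: Record replaces 
SeqRecord so that it runs without Biopython, and fasta_iterator, 
records_to_dict and write_fasta are hypothetical names, not the functions 
described above. Rejecting duplicate keys is also just an assumption.

```python
import io
from collections import namedtuple

# Stand-in for SeqRecord, only so the sketch runs without Biopython.
Record = namedtuple("Record", ["id", "seq"])


def fasta_iterator(handle):
    """Yield one Record per FASTA entry read from an open handle."""
    rec_id, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if rec_id is not None:
                yield Record(rec_id, "".join(chunks))
            # The identifier is the first word of the title line.
            words = line[1:].split(None, 1)
            rec_id, chunks = (words[0] if words else ""), []
        elif rec_id is not None:
            chunks.append(line.strip())
    if rec_id is not None:
        yield Record(rec_id, "".join(chunks))


def records_to_dict(records, key_function=None):
    """Build a dictionary from an iterator or list of records."""
    if key_function is None:
        key_function = lambda record: record.id
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            # Assumption: refuse duplicates rather than silently overwrite.
            raise ValueError("Duplicate key %r" % (key,))
        result[key] = record
    return result


def write_fasta(records, handle):
    """Write every record in a single call; return the number written."""
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))
        count += 1
    return count


# Handles only: the caller opens the file (here an in-memory StringIO).
source = io.StringIO(">alpha first record\nACGT\nAC\n>beta\nGGCC\n")
by_id = records_to_dict(fasta_iterator(source))
by_length = records_to_dict(by_id.values(), key_function=lambda r: len(r.seq))

output = io.StringIO()
written = write_fasta(by_id.values(), output)
```

Because the writer sees the whole set of records in one call, a format 
with a header or footer could emit it before and after the loop, which 
is the reason for not offering record-by-record writes.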

Peter



