[Biopython-dev] New Bio.SeqIO code

Sun Oct 29 11:25:35 UTC 2006

Michiel de Hoon wrote:
> Well let's first decide which functions we want in Bio.SeqIO, and then 
> decide how to name them.

Agreed.

One point against names like File2SequenceIterator is that the pun on
"two" versus "to" (i.e. convert) will not be obvious to non-native
English speakers.

>  > That was one thing I wanted to discuss - having a SequenceDict and
>  > SequenceList class would let us add doc strings and perhaps methods
>  > like maxlength, minlength, totallength, ...
>  >
>  > Or, I can just use simple list and dict objects in the functions
>  > File2SequenceList and File2SequenceDict.
>  >
>  > I have no strong preference on this issue - so unless someone else
>  > speaks up, I'll go back to simple lists and dictionaries - keeps
>  > things simple.
> 
> If we go back to simple lists and dictionaries, do we still need the 
> functions File2SequenceList and File2SequenceDict? I'd like to avoid 
> software bloat as much as possible, so if we don't need these two 
> functions, so much the better.

I think there is some benefit to having File2SequenceDict included as 
converting from a SeqRecord iterator to a dictionary of SeqRecords isn't 
completely trivial.

There are at least two important questions: What to use as the 
dictionary key (e.g. record.id) and how to deal with duplicate keys 
(e.g. use first/last record with that id, or simply abort).

Consider this code as an alternative to File2SequenceDict:

iterator = File2SequenceIterator(...)
d = dict((record.id, record) for record in iterator)

I don't think it's very readable or intuitive (and it could scare
beginners).  Part of my aim with Bio.SeqIO was to make the interface simple.

More importantly, if there are records with duplicate ids then with this 
code the resulting dictionary will have only the last record. 
Personally I would want duplicate keys to cause an exception.
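
To make the duplicate-id problem concrete, here is a small sketch.
FakeRecord is a hypothetical stand-in for SeqRecord (it is not part of
Biopython), and the list of records stands in for whatever the
iterator would yield:

```python
# Illustration only: FakeRecord is a hypothetical stand-in for SeqRecord.
from collections import namedtuple

FakeRecord = namedtuple("FakeRecord", ["id", "seq"])

records = [
    FakeRecord("alpha", "ACGT"),
    FakeRecord("beta", "GGCC"),
    FakeRecord("alpha", "TTTT"),  # same id as the first record
]

# The one-liner raises no error; a later record with the same id
# simply replaces the earlier one in the dictionary.
d = dict((record.id, record) for record in records)
```

Three records go in but only two entries come out, and nothing warns
the user that the first "alpha" record was dropped.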

Rewriting File2SequenceDict() to use a simple dict would give something 
like this, where record2key is an optional user supplied function.

def File2SequenceDict(..., record2key=None):
    iterator = File2SequenceIterator(...)
    if record2key is None:
        # Default to using the record's id as the dictionary key
        record2key = lambda record: record.id
    answer = dict()
    for record in iterator:
        key = record2key(record)
        # Abort on duplicates rather than silently overwriting
        assert key not in answer, "Duplicate key"
        answer[key] = record
    return answer

The record2key function is perhaps not needed - I was trying to make the 
function flexible.  The duplicate key behaviour could also be an option.
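
As a rough sketch of what "duplicate key behaviour as an option" might
look like, the version below takes any iterable of records instead of
a file.  The names records_to_dict and on_duplicate, the three policy
strings, and the FakeRecord stand-in are all assumptions made for this
illustration and are not existing Biopython code:

```python
# Sketch only: records_to_dict, on_duplicate and FakeRecord are
# hypothetical names chosen for this example.
from collections import namedtuple

FakeRecord = namedtuple("FakeRecord", ["id", "seq"])


def records_to_dict(records, record2key=None, on_duplicate="error"):
    """Build a dict of records keyed by record2key(record).

    on_duplicate is "error" (raise ValueError), "first" (keep the
    earliest record) or "last" (keep the latest record).
    """
    if on_duplicate not in ("error", "first", "last"):
        raise ValueError("Unknown on_duplicate policy: %r" % (on_duplicate,))
    if record2key is None:
        # Default key is the record's id, as in the function above
        record2key = lambda record: record.id
    answer = {}
    for record in records:
        key = record2key(record)
        if key in answer:
            if on_duplicate == "error":
                raise ValueError("Duplicate key %r" % (key,))
            if on_duplicate == "first":
                continue  # keep the record already stored
        answer[key] = record
    return answer


records = [
    FakeRecord("alpha", "ACGT"),
    FakeRecord("beta", "GGCC"),
    FakeRecord("alpha", "TTTT"),
]
kept_first = records_to_dict(records, on_duplicate="first")
kept_last = records_to_dict(records, on_duplicate="last")
```

This sketch raises ValueError rather than using assert, because
assert statements are skipped when Python runs with the -O switch, so
the duplicate check would silently disappear.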

The other function, File2SequenceList, isn't really needed if we are
using simple lists.  It's basically a wrapper for
list(File2SequenceIterator(...)) or some other one-liner.

The main reason I invented File2SequenceList() was for completeness -
given I already had File2SequenceDict() and File2SequenceIterator().

> About file handles:
> 
>  > The File2SequenceIterator() function (and friends) can take a
>  > filename, handle, or a string containing the contents of a file (in
>  > addition to the format).  However, these are done as three separate
>  > arguments.
>  >
>  > I could have one argument that takes a file name or handle, and works 
>  > it out on its own.  Bio.Nexus tries to do this for example.  Having
>  > the individual iterators also do this trick would be pretty simple
>  > (using a shared utility function).
>  >
>  > The "contents of a file" string argument was handy when testing, but I
>  > imagine this is not going to be a common situation.  If people need
>  > this, they can use python's StringIO module to turn their data string
>  > into a handle easily enough.
> 
> I like the idea of one argument that takes a file name or handle. I 
> believe that that is how other Biopython functions work.

OK then - I'll do that.
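
A minimal sketch of such a shared utility function is below.  The name
as_handle and its return convention are assumptions for illustration
only.  It is written for Python 3, where the string-to-handle trick
uses io.StringIO rather than the separate StringIO module:

```python
# Sketch only: as_handle is a hypothetical helper name.
import io


def as_handle(source, mode="r"):
    """Return (handle, opened_here) for a filename or an open handle.

    A str is treated as a filename and opened.  Anything else is
    assumed to already be a file-like object and is returned unchanged.
    opened_here tells the caller whether it is responsible for closing.
    """
    if isinstance(source, str):
        return open(source, mode), True
    return source, False


# Data held in a string can be wrapped in a handle first, so no
# separate "contents of a file" argument is needed.
handle, opened_here = as_handle(io.StringIO(">seq1\nACGT\n"))
first_line = handle.readline()
```

Returning the opened_here flag lets the caller close only the files it
opened itself and leave handles supplied by the user alone.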

Peter