[Biopython-dev] New Bio.SeqIO code

Peter (BioPython Dev) biopython-dev at maubp.freeserve.co.uk
Wed Nov 15 12:52:58 UTC 2006


Chris Lasher wrote:
> Just pitching in again, I agree with Michiel with regards to the list
> of functions necessary. To restate, these would be:

On Monday I switched from the "2" pun names to "To" giving the following:

(*) FileToSequenceIterator, previously File2SequenceIterator
     File to SeqRecord iterator

(*) SequencesToDict, previously SequenceIter2Dict
     SeqRecord iterator/list to dictionary

(*) SequencesToAlignment, previously Iter2Alignment
     SeqRecord iterator/list to alignment

(*) SequencesToFile, previously Sequences2File
     Write SeqRecord iterator/list to a file

I agree that these are all important "core functions".

> I also think there's wisdom to Michiel's statement it's easier to add
> functionality than it is to remove it.

Very true.  On that note...

We also currently have three "convenience functions", which seem
scheduled for removal based on these discussions.  Unless anyone speaks
up for these three, I'll remove them (and update the Wiki to match):

(*) FileToSequenceList previously called File2SequenceList
(*) FileToSequenceDict previously called File2SequenceDict
(*) FileToAlignment    previously called File2Alignment

These simply wrap FileToSequenceIterator with list(), SequencesToDict
or SequencesToAlignment respectively.
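
To show how thin these wrappers are, here is a rough sketch.  The function
names come from this thread, but the signatures and bodies are my own
stand-ins (a toy FASTA reader yielding plain tuples rather than SeqRecord
objects), not the actual Bio.SeqIO code:

```python
import io


def FileToSequenceIterator(handle, format="fasta"):
    """Stand-in parser: yield (identifier, sequence) pairs from FASTA text."""
    if format != "fasta":
        raise ValueError("This sketch only handles 'fasta'")
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def SequencesToDict(records):
    """Stand-in: index records by identifier, rejecting duplicate keys."""
    result = {}
    for name, seq in records:
        if name in result:
            raise ValueError("Duplicate key %r" % name)
        result[name] = seq
    return result


# The "convenience functions" are one-liners on top of the core functions.
def FileToSequenceList(handle, format="fasta"):
    return list(FileToSequenceIterator(handle, format))


def FileToSequenceDict(handle, format="fasta"):
    return SequencesToDict(FileToSequenceIterator(handle, format))
```

FileToAlignment would follow the same pattern, passing the iterator to
SequencesToAlignment.  Anyone who wants a list or dictionary can write that
one line themselves, which is the argument for dropping the wrappers.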

> I agree with Iddo on his arguments against dealing with filename
> extensions. Upon reflection, however, I feel comfortable with a
> lookahead-based file-format guesser for the sake of convenience and as
> a matter of compromise to those who are not keen on being explicit in
> regards to every detail. It's been stated that bio file formats are
> quite distinct. I tried to think of a counterexample but failed.

I would say telling EMBL and Swiss (aka SwissProt aka UniProt) apart is
tricky.  They both start with an "ID ..." line and finish with "//"; the
feature table format is the big difference.

If we did try guessing file formats by looking at the file contents, I
would not try to guess every file format which Bio.SeqIO could read -
just those which are easily identifiable.  In this case, I would be
inclined not to try to tell EMBL and SwissProt apart, and simply abort
with "Unrecognised format".
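
As a minimal sketch of what I mean (guess_format is a hypothetical name,
the returned format labels are my own assumption, and it needs a seekable
handle so it can rewind after peeking):

```python
import io


def guess_format(handle):
    """Peek at the first non-blank line; return a format name or raise ValueError.

    Hypothetical helper, not part of Bio.SeqIO.  Only easily identifiable
    formats are recognised; anything else is rejected rather than guessed.
    """
    position = handle.tell()
    line = handle.readline()
    while line and not line.strip():
        line = handle.readline()
    handle.seek(position)  # rewind so a parser can read from the start

    if line.startswith(">"):
        return "fasta"
    if line.startswith("LOCUS"):
        return "genbank"
    if line.startswith("CLUSTAL"):
        return "clustal"
    # An "ID ..." first line could be EMBL or SwissProt, so we deliberately
    # fall through here instead of inspecting the feature table.
    raise ValueError("Unrecognised format")
```

For example, guess_format(io.StringIO(">seq1\nACGT\n")) gives "fasta",
while a file starting with an "ID ..." line raises ValueError and the
caller has to name the format explicitly.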

Peter



