[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Fri Jul 28 13:50:39 UTC 2006

This follows on from the discussion last month started by Marc Colosimo,
but I want to focus just on reading in sequence files:

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html

There was also a thread a few years ago in which Michael Hoffman was 
looking at timings for parsing Fasta files:

http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html

Jeffrey Chang wrote:
> That is a nice implementation.  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.
> 
> Jeff

Clearly we could try to consolidate these (while making things as nice 
as possible for existing code, with deprecation warnings etc).

I've had a little read on the BioPerl SeqIO system:
http://www.bioperl.org/wiki/HOWTO:SeqIO

I agree with Marc that what we have in Biopython could (and should) be 
more organised.

Ideally (in my opinion) Biopython should be able to read sequences from 
multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...)
* using a standard interface
* into a standard object
* quickly

The resulting object should be able to hold additional information like 
annotation and (sub)features; the Bio.SeqRecord.SeqRecord object seems 
ideal.
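
To make the idea concrete, here is a rough sketch of what "a standard 
interface returning a standard object" could look like.  None of this is 
existing Biopython code: SimpleRecord, read_sequences and the _PARSERS 
table are hypothetical names made up for illustration, SimpleRecord is 
only a stand-in for something like SeqRecord, and only Fasta is 
handled.

```python
import io
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a SeqRecord-like object."""
    id: str
    description: str
    seq: str
    annotations: dict = field(default_factory=dict)


def _make_record(title, chunks):
    # Assumption: the first word of the title line is the identifier.
    parts = title.split(None, 1)
    rec_id = parts[0] if parts else ""
    return SimpleRecord(id=rec_id, description=title, seq="".join(chunks))


def _iter_fasta(handle):
    """Yield one SimpleRecord per '>' entry found in the handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield _make_record(title, chunks)
            title, chunks = line[1:], []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        yield _make_record(title, chunks)


# Hypothetical registry: one iterator function per supported format.
_PARSERS = {"fasta": _iter_fasta}


def read_sequences(handle, fmt):
    """Return an iterator of records, whatever the file format."""
    try:
        parser = _PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError("Unsupported format: %r" % fmt)
    return parser(handle)


if __name__ == "__main__":
    example = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
    for record in read_sequences(example, "fasta"):
        print(record.id, len(record.seq))
```

Supporting GenBank or EMBL would then mean adding another iterator to 
the table, and calling code would not need to change.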

It looks like the current situation is:

(1) We have a number of format-specific sequence reading modules (in 
particular Fasta and GenBank) which can read their particular file 
format into one or more different object representations.  These seem to 
be the best documented (in my opinion).

(2) We have a fairly generic (but relatively slow) framework in the 
Bio.FormatIO system which uses Martel expressions internally.  I have 
found Martel frustrating to debug, and especially slow with large 
individual records (like genomic GenBank files).  There is some 
documentation on this, e.g.

http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html

(3) We have the start of a generic "pure Python" framework in the 
Bio.SeqIO system, but it needs some work (e.g. some docstrings, fixing 
the LargeFastaFormat class, GenBank support).

QUESTION: What do you all tend to use?  Should I draft a "questionnaire" 
to be posted on the main discussion list (and the announcements list)?

Personally, I have been using Bio.Fasta and Bio.GenBank to read 
sequences.  I tend to only output Fasta files, and usually do this "by 
hand" as they are so simple and I want full control over the description 
lines.
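
For what it's worth, writing Fasta "by hand" amounts to something like 
the sketch below.  The function name write_fasta and the default line 
width of 60 are my own choices for illustration, not anything taken from 
Biopython.

```python
import io


def write_fasta(handle, title, sequence, width=60):
    """Write one Fasta record; the title is used exactly as given."""
    handle.write(">%s\n" % title)
    # Wrap the sequence so that no line is longer than `width` letters.
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


if __name__ == "__main__":
    out = io.StringIO()
    write_fasta(out, "seq1 my own description", "ACGTACGTAC", width=4)
    print(out.getvalue(), end="")
```

The caller supplies the whole title line, which is how I keep full 
control over the description.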

Peter