[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Fri Jul 28 09:50:39 EDT 2006
This follows on from the discussion last month started by Marc Colosimo,
but I want to focus just on reading in sequence files:
http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html
There was also a thread a few years ago where Michael Hoffman was
looking at timings for parsing Fasta files:
http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html
Jeffrey Chang wrote:
> That is a nice implementation. However, Biopython already has at least
> 3 Fasta parsers!
> Bio/Fasta
> Bio/SeqIO/FASTA
> Bio/expressions/fasta
>
> Bio/Fasta, the one you compared against, is easily the slowest one.
> Bio/SeqIO/FASTA is very similar to your implementation and not likely
> to be significantly faster or slower. Bio/expressions/fasta uses
> Martel. I don't know how well that will perform. The parsing part
> should be blazingly fast (since it is mostly in C), but building the
> object will be slow. It might be a wash.
>
> Jeff
Clearly we could try to consolidate these (while making things as nice
as possible for existing code, with deprecation warnings etc).
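To illustrate what I mean by deprecation warnings, here is a minimal
sketch using the standard library's warnings module. The names old_parse
and new_parse are hypothetical placeholders, not existing Biopython
functions, and the "parsing" is deliberately trivial:

```python
import warnings


def new_parse(handle):
    """Hypothetical consolidated parser: returns the non-empty lines, stripped."""
    return [line.strip() for line in handle if line.strip()]


def old_parse(handle):
    """Hypothetical legacy entry point, kept so existing code still runs."""
    # Warn the caller, then delegate to the new implementation.
    # stacklevel=2 makes the warning point at the calling code.
    warnings.warn(
        "old_parse is deprecated; use new_parse instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_parse(handle)
```

Old scripts would keep working unchanged, but anyone running with
warnings enabled would be told which call to switch to.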
I've had a little read on the BioPerl SeqIO system:
http://www.bioperl.org/wiki/HOWTO:SeqIO
I agree with Marc that what we have in BioPython could (and should) be
more organised.
Ideally (in my opinion) BioPython should be able to read sequences from
multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...)
* using a standard interface
* into a standard object
* quickly
The resulting object should be able to hold additional information like
annotation and (sub)features; the Bio.SeqRecord.SeqRecord object seems
ideal.
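As a rough sketch of the kind of interface I have in mind (this is not
an existing Biopython API): one function takes a file handle and a
format name, and yields record objects. Everything here is hypothetical,
including the names SimpleRecord, parse_fasta, read_sequences and
_PARSERS. SimpleRecord is only a stand-in for Bio.SeqRecord.SeqRecord so
that the example runs on its own, and only Fasta is registered:

```python
class SimpleRecord:
    """Hypothetical stand-in for Bio.SeqRecord.SeqRecord."""

    def __init__(self, identifier, description, seq):
        self.id = identifier
        self.description = description
        self.seq = seq
        self.annotations = {}  # room for annotation
        self.features = []     # room for (sub)features


def _make_record(title, lines):
    # Assumption: the first whitespace-separated word of the title is the id.
    parts = title.split(None, 1)
    identifier = parts[0] if parts else ""
    return SimpleRecord(identifier, title, "".join(lines))


def parse_fasta(handle):
    """Yield one SimpleRecord per Fasta entry found in the handle."""
    title = None
    lines = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield _make_record(title, lines)
            title = line[1:]
            lines = []
        elif title is not None:
            lines.append(line.strip())
    if title is not None:
        yield _make_record(title, lines)


# Registry mapping format names to parser functions. Other formats
# (GenBank, EMBL, ...) would be added here.
_PARSERS = {"fasta": parse_fasta}


def read_sequences(handle, fmt):
    """Return an iterator of records from handle, given a format name."""
    try:
        parser = _PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError("Unsupported format: %r" % fmt)
    return parser(handle)
```

Usage would then look the same for every format, for example
"for record in read_sequences(handle, "fasta"): ...", and adding a new
format would only mean registering another parser function.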
As far as I can tell, the current situation is:
(1) We have a number of format specific sequence reading modules (in
particular Fasta and GenBank) which can read their particular file
format into one or more different object representations. These seem to
be the best documented (in my opinion).
(2) We have a fairly generic (but relatively slow) framework in the
Bio.FormatIO system which uses Martel expressions internally. I have
found Martel frustrating to debug, and especially slow with large
individual records (like genomic GenBank files). There is some
documentation on this, e.g.
http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html
(3) We have the start of a generic "pure python" framework in the
Bio.SeqIO system, but it needs some work (e.g. some doc strings, fixing
the LargeFastaFormat class, GenBank support).
QUESTION: What do you all tend to use? Should I draft a "questionnaire"
to be posted on the main discussion list (and the announcements list)?
Personally, I have been using Bio.Fasta and Bio.GenBank to read
sequences. I tend to only output Fasta files, and usually do this "by
hand" as they are so simple and I want full control over the description
lines.
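For what it's worth, writing Fasta "by hand" amounts to something like
the sketch below. The function name write_fasta and the default line
width of 60 are just assumptions for illustration, not anything taken
from Biopython:

```python
def write_fasta(handle, title, sequence, width=60):
    """Write one Fasta record, with the description line exactly as given."""
    handle.write(">%s\n" % title)
    # Wrap the sequence into lines of at most `width` characters.
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")
```

The caller builds the title string, so whatever goes after the ">" is
entirely under their control.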
Peter