[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Peter (BioPython Dev) biopython-dev at maubp.freeserve.co.uk
Mon Jul 31 10:36:00 UTC 2006

Michiel de Hoon wrote:
> Thanks Peter.
> Peter wrote:
>>QUESTION: What do you all tend to use?
> I use the stuff in Bio.Fasta, but actually just because it's in the 
> documentation.

Me too.

> From your timings, and also because I'm not smart enough
> to be able to understand Martel, let alone maintain Martel-based 
> parsers, I'm pretty much in favor of Bio.SeqIO.

That was my gut instinct too.

Starting with Bio.SeqIO as a base, I've been "playing" with the code and 
have a rough "Sequence Iterator" class that supports iteration (providing 
next() and __iter__() methods) as well as a strictly increasing record index.

At the moment I have iterators returning SeqRecords for:
- Fasta Files
- GenBank features (returns the CDS features and their translations)
- GenBank files (with the features as SeqFeature objects)

There is code in Bio/SeqIO/general.py for a few more file formats which 
I haven't used yet.

This new GenBank iterator actually uses the current Bio.GenBank parser 
(with a slight tweak to how it acts once it reaches the end of a record).
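To make the iterator idea concrete, here is a minimal, hypothetical sketch of such a class for Fasta files. This is NOT the actual Bio.SeqIO code; the class name and the (title, sequence) return value are illustrative assumptions only (the real class returns SeqRecords).

```python
import io

# Hypothetical sketch of a "Sequence Iterator" class for Fasta files;
# NOT the actual Bio.SeqIO code. Returns (title, sequence) pairs for
# simplicity, where the real code would return SeqRecord objects.
class FastaIterator:
    """Return one (title, sequence) pair per record, in file order."""

    def __init__(self, handle):
        self.handle = handle
        self._pending = None  # title line read ahead while scanning

    def __iter__(self):
        return self

    def __next__(self):
        title = self._pending
        self._pending = None
        if title is None:
            # Skip ahead to the next ">" title line, if any.
            for line in self.handle:
                if line.startswith(">"):
                    title = line[1:].rstrip()
                    break
            else:
                raise StopIteration
        seq_parts = []
        for line in self.handle:
            if line.startswith(">"):
                # Start of the next record; remember it for the next call.
                self._pending = line[1:].rstrip()
                break
            seq_parts.append(line.strip())
        return title, "".join(seq_parts)

    next = __next__  # the Python 2 spelling of the iterator protocol

# Example usage:
handle = io.StringIO(">alpha first record\nACGT\nACGT\n>beta\nTTTT\n")
records = list(FastaIterator(handle))
```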

Michiel de Hoon wrote:
> Peter wrote:
>> Should I draft a "questionnaire" 
>> to be posted on the main discussion list (and the announcements?).
> By all means, yes. In the questionnaire, be sure to separate the issue 
> of parser internals (Martel vs. pure Python) from the issue of how the 
> results should be formatted (Fasta.Record or SeqRecord).

Draft questionnaire follows; I have included my comments for the record. 
Too long?  Missing any important questions?



There is some discussion on the Developer's Mailing list about 
BioPython's sequence input/output routines.

For example, it's a bit silly that there are three different Fasta 
reading routines in BioPython (even if only one of them, Bio.Fasta, is 
properly documented).

Note that we are not going to "just remove" any of the current 
functionality.  Some existing code may be re-written internally, while 
other code might be marked with a DeprecationWarning.

If you could answer the following questions, that would help guide our 
decisions.

Question One
Is reading sequence files an important function for you, and if so, 
which file formats in particular (e.g. Fasta, GenBank, ...)?

If you have had to write your own code to read a "common" file format 
which BioPython doesn't support, please get in touch.

Peter's answer:
 > I read Fasta and GenBank files mostly.  Also Clustalw alignments,
 > and Stockholm alignments.

Question Two - Reading Fasta Files
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
title, and the sequence as a string)
(b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Peter's answer:
 > In most of my scripts I use Bio.Fasta with either the RecordParser or
 > FeatureParser.  I did look at Bio.FormatIO when I started but found
 > Bio.Fasta was much better documented (and a similar speed).  I have
 > only recently looked at Bio.SeqIO (hence this entire thread).

Question Three - index_file based dictionaries
Do you use any of the following:
(a) Bio.Fasta.Dictionary
(b) Bio.GenBank.Dictionary
(c) Any other "Martel/Mindy" based dictionary which first requires 
creation of an index using the index_file function

If so, do you have any comments?

Peter's answer:
 > I do not use multi-record GenBank files (mine are single chromosomes).
 > I have used Bio.Fasta.Dictionary but found dealing with the indexes
 > created by index_file to be annoying - especially when re-indexing
 > Fasta files which change often.
 > I now use a simple wrapper function to load a Fasta file with an
 > iterator and build the dictionary in memory.  For me this is much
 > less hassle and the memory demands are not too great.
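The wrapper I describe above might look something like this; the function name and the (title, sequence) pair interface are assumptions for illustration, not Biopython code.

```python
# Hypothetical helper (not part of Biopython): build an in-memory
# dictionary from any iterator of (title, sequence) pairs, keyed by
# the first word of the title line.
def fasta_iterator_to_dict(records, key=lambda title: title.split(None, 1)[0]):
    """Return {identifier: sequence}, refusing duplicate identifiers."""
    result = {}
    for title, sequence in records:
        identifier = key(title)
        if identifier in result:
            raise ValueError("Duplicate identifier: %r" % identifier)
        result[identifier] = sequence
    return result

# Example usage, with a plain list standing in for a Fasta iterator:
lookup = fasta_iterator_to_dict([("alpha first record", "ACGT"),
                                 ("beta", "TTTT")])
```

Rebuilding this dictionary is a single pass over the file, so there is no index file to keep in sync when the Fasta file changes.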

Question Four - Record Access...
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the 
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records 
using an iterator but saving them in a list).

Do you have any additional comments on this?  For example, flexibility 
versus memory requirements.

For example, when I need random access to a Fasta file, I build a 
dictionary in memory (using an iterator) rather than messing about with 
the index_file based dictionary.

Peter's answer:
 > I usually deal with each record sequentially using an iterator.
 > However, I often need random access using the record identifier and
 > for this I use a dictionary which I create in memory using an iterator.
 > As stated in the question, I had tried using Bio.Fasta.Dictionary but
 > found dealing with the indexes created by index_file to be annoying,
 > especially having to re-index Fasta files which change often.

Question Five - Fasta files: FastaRecord or SeqRecord
If you use Fasta files, do you want records returned as FastaRecords 
or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

For example,

 >name text text text
 ACGTACACGT

As a FastaRecord this would have:

FastaRecord.title = "name text text text" (string)
FastaRecord.sequence = "ACGTACACGT" (string)

As a SeqRecord (with the default title2ids mapping):

SeqRecord.id = (default string)
SeqRecord.name = (default string)
SeqRecord.description = "name text text text" (string)
SeqRecord.seq = Seq("ACGTACACGT", alphabet)
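For illustration, a title2ids-style mapping could be written as below. The exact hook signature in Bio.Fasta may differ; treat this as a sketch of the idea, not the real API.

```python
# Sketch of a title2ids-style mapping (the real Bio.Fasta hook may
# differ): split a Fasta title line into (id, name, description).
def title2ids(title):
    """Use the first word as both id and name, and keep the full
    title line as the description."""
    first_word = title.split(None, 1)[0]
    return first_word, first_word, title

# Example usage, with the title line from the question above:
ids = title2ids("name text text text")
```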

Peter's answer:
 > For FASTA files I have usually used FastaRecord objects (with the
 > sequence as a string) but I have no strong preference.  Thinking of
 > the big picture it would be better to have every parser return
 > SeqRecords by default.

Question Six - GenBank files: Record or SeqRecord
If you use GenBank files, do you use:
(a) Bio.GenBank.FeatureParser, which returns SeqRecord objects
(b) Bio.GenBank.RecordParser, which returns Bio.GenBank.Record objects

Do you care much either way?  For me the only significant difference is 
that feature locations are held as objects in the SeqRecord, and as the 
raw string in the Record.

Peter's answer:
 > I have no strong preference - unless I wanted to manipulate the
 > feature locations.  I think there might be a performance difference...

Question Seven - Martel, Scanners and Consumers
Some of BioPython's existing parsers (e.g. those using Martel) use an 
event/callback model, where the scanner component generates parsing 
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or 
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please 
provide details).
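As a toy illustration of the scanner/consumer split (not Biopython's actual classes): the scanner turns lines into events, and the consumer decides which events it cares about.

```python
import io

# Toy scanner/consumer pair, NOT Biopython's actual classes.
class TitleConsumer:
    """A consumer that collects only title events."""

    def __init__(self):
        self.titles = []

    def title(self, text):
        self.titles.append(text)

    def sequence(self, text):
        pass  # this particular consumer ignores sequence lines

def scan_fasta(handle, consumer):
    """Scanner: read Fasta lines and fire one event per line."""
    for line in handle:
        if line.startswith(">"):
            consumer.title(line[1:].rstrip())
        else:
            consumer.sequence(line.strip())

# Example usage:
consumer = TitleConsumer()
scan_fasta(io.StringIO(">alpha\nACGT\n>beta\nTTTT\n"), consumer)
```

Swapping in a different consumer changes what gets extracted without touching the scanner, which is the flexibility this model is meant to buy.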

Peter's answer:
 > As a user I don't care about the internals.  I do care about what
 > gets used as the name/id/description for SeqRecords but that level
 > of flexibility is enough.
 > As a BioPython contributor: Martel is scary.  I think I understand
 > the whole scanner/consumer model but don't see the point (unless
 > using an event-based scanner like Martel).  I suspect all the
 > function callbacks are one reason Martel parsers are slow.

