[BioPython] Dealing with sequence files - Questionnaire

Thu Aug 17 16:33:56 EDT 2006

Hello list,

This is a request for a little bit of feedback from you all - it would 
be very helpful if you could answer some or all of the following 
questions...

Thanks

Peter

Introduction
============
There is some discussion on the Developer's Mailing list about
BioPython's sequence input/output routines.

For example, it's a bit silly that there are at least three different
Fasta reading routines in BioPython (even if only one of them,
Bio.Fasta, is properly documented).

Note that we are not going to "just remove" any of the current
functionality.  Some existing code may be re-written internally, while
other code might be marked with a Deprecation Warning.

If you could answer the following questions that would help guide our
choices.

Question One
============
Is reading sequence files an important function to you, and if so, which
file formats in particular (e.g. Fasta, GenBank, ...)?

Question Two
============
Are there any sequence formats you would like to be able to read using
BioPython that are not currently supported (e.g. EMBL, ...)?

Question Three - Reading Fasta Files
====================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
(b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Question Four - Reading GenBank Files
=====================================
Which of the following do you currently use (and why)?:

(a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
(b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
(c) Other (Could you tell us more?)

Question Five - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records
using an iterator but save them in a list).

Do you have any additional comments on this?  For example, flexibility
versus memory requirements.

For example, when I need random access to a Fasta file, I build a
dictionary in memory (using an iterator) rather than messing about with
the index_file based dictionary.
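
To make the in-memory approach concrete, here is a minimal sketch in plain
Python. It deliberately does not use any Bio.Fasta classes: the function
name iter_fasta and the sample data are hypothetical, and it assumes the
record identifier is the first word of the title line.

```python
import io


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs one by one, in file order."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first word after ">".
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)
    # Emit the final record once the file is exhausted.
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical two-record file held in memory for the example.
example = io.StringIO(">alpha first record\nACGT\nACGT\n>beta\nGGCC\n")

# Build the dictionary from the iterator: random access by identifier,
# at the cost of holding every sequence in memory at once.
records = dict(iter_fasta(example))
```

Replacing dict(...) with list(...) would give option (c), access by index
number, with the same memory cost.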

Question Six - Martel, Scanners and Consumers
=============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an
event/callback model, where the scanner component generates parsing
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please
provide details).
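
For anyone unsure what the event/callback model looks like, here is a rough
sketch in plain Python. It is not Martel or any actual BioPython code: the
names Consumer, LengthConsumer and scan_fasta, and the event names, are all
hypothetical and only illustrate the general idea of a scanner calling
methods on a consumer.

```python
class Consumer:
    """Base consumer: ignores every event unless a subclass overrides it."""

    def start_record(self):
        pass

    def title(self, text):
        pass

    def sequence(self, text):
        pass

    def end_record(self):
        pass


class LengthConsumer(Consumer):
    """Example customisation: record only the length of each sequence."""

    def __init__(self):
        self.lengths = {}
        self._name = None

    def title(self, text):
        self._name = text
        self.lengths[text] = 0

    def sequence(self, text):
        self.lengths[self._name] += len(text)


def scan_fasta(lines, consumer):
    """Hypothetical scanner: turn Fasta-style lines into consumer events."""
    in_record = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if in_record:
                consumer.end_record()
            consumer.start_record()
            consumer.title(line[1:])
            in_record = True
        elif in_record:
            consumer.sequence(line)
    if in_record:
        consumer.end_record()


consumer = LengthConsumer()
scan_fasta([">alpha", "ACGT", "AC", ">beta", "GG"], consumer)
```

In this sketch, changing what the parser produces means writing a different
consumer while leaving the scanner alone, which is the kind of modification
the question above asks about.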

And finally...
==============
Do you have any general questions or comments?

Thank you,

Peter (and all the other BioPython developers/maintainers)