[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Michiel de Hoon
mdehoon at c2b2.columbia.edu
Fri Aug 4 03:20:18 UTC 2006
> Question One
> ============
>
> Is reading sequence files an important
> function to you, and if so which file formats in particular (e.g.
> Fasta, GenBank, ...)
>
I use Fasta, GenBank, and occasionally clustalw.
>
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with
>     a title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)
I use Bio.Fasta with the RecordParser, but just because it's easy to
find in the documentation. As a user, I think Bio.Fasta requires too
many steps to be typed in; I would prefer something more
straightforward. For the output format, I don't care so much, but for
the sake of consistency a SeqRecord may be preferable.
>
> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
>
> (a) Bio.Fasta.Dictionary
> (b) Bio.Genbank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires
>     creation of an index using the index_file function
>
No. I never really understood index files.
>
> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
>
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
>
> (c) A list giving random access by index number (e.g. load the
> records using an iterator but saving them in a list).
I use (a). It's easy to create (b) or (c), if needed, if (a) is available.
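
To illustrate that point, here is a minimal sketch in plain Python. The
iter_fasta generator and the EXAMPLE data are hypothetical stand-ins for
an iterator interface, not the Bio.Fasta API; keying the dictionary on
the first word of the title is likewise just an assumption.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (title, sequence) pairs in file order (hypothetical helper)."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Made-up example data, standing in for an open file.
EXAMPLE = ">seq1 first example\nACGTAC\nACGT\n>seq2 second example\nTTGGCC\n"

# (c) A list: random access by index number.
records = list(iter_fasta(StringIO(EXAMPLE)))

# (b) A dictionary: random access by identifier. Assumes every title is
# non-empty and that its first word is a unique identifier.
by_id = {title.split()[0]: seq for title, seq in iter_fasta(StringIO(EXAMPLE))}
```

Both containers are one line each on top of the iterator, whereas going
the other way would force the whole file into memory first.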
>
> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as
> FastaRecords or as SeqRecords? If SeqRecords, do you use your own
> title2ids mapping?
>
> For example,
>
> >name text text text
> ACGTACACGT
>
> As a FastaRecord this would have:
>
> FastaRecord.title = "name text text text" (string)
> FastaRecord.sequence = "ACGTACACGT" (string)
>
> As a SeqRecord (with the default title2ids mapping):
>
> SeqRecord.id = (default string)
> SeqRecord.name = (default string)
> SeqRecord.description = "name text text text" (string)
> SeqRecord.seq = Seq("ACGTACACGT", alphabet)
I use the FastaRecord, but again for no particular reason. I have not
experienced an advantage of Seq objects over simple strings, so for me
the fact that FastaRecord contains a simple string is more convenient.
But it doesn't matter much.
> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
>
> (a) Bio.Genbank.FeatureParser which returns SeqRecord objects
> (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects
>
I don't care so much, but I think that having two record types is
confusing, so it would be better if we could decide on one. A SeqRecord
is more general than a Bio.GenBank.Record, so I have a slight preference
for a SeqRecord.
>
> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
>
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
>
> (a) I don't know, or don't care. I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ...
> (please provide details).
>
(a). Often, I'm just at the Python prompt typing away. What I like about
Python and Numerical Python is that the commands are often obvious and
easy to remember. With the parser framework, on the other hand, I always
need to look up in the documentation how to use it.
--Michiel