[Biopython-dev] Low level string based FASTA parser

Peter Cock p.j.a.cock at googlemail.com
Mon Oct 22 16:43:07 UTC 2012


Hello all,

Something I've wanted/needed recently was a low-level FASTA
iterating parser which just returns tuples of strings (without the
overhead of Bio.SeqIO building SeqRecords).

We don't currently have such a thing, so I have added one to
Bio.SeqIO.FastaIO (mirroring the low-level string-tuple parser
for FASTQ files), with some associated unit tests and refactoring
(separate commits):

https://github.com/biopython/biopython/commit/751fe39765ca6ba60e517b3b4657718fd48f7817

Does anyone have any views on the name of this new
function, currently SimpleFastaParser, used as follows:

    >>> from Bio.SeqIO.FastaIO import SimpleFastaParser
    >>> with open("Fasta/dups.fasta") as handle:
    ...     for values in SimpleFastaParser(handle):
    ...         print(values)
    ('alpha', 'ACGTA')
    ('beta', 'CGTC')
    ('gamma', 'CCGCC')
    ('alpha (again - this is a duplicate entry to test the indexing code)', 'ACGTA')
    ('delta', 'CGCGC')
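
To make the idea concrete, here is a minimal sketch of what such a
string-tuple generator could look like. This is an illustration only,
not the code from the commit above: the name simple_fasta_tuples is
hypothetical, and the choice to skip anything before the first ">"
line is an assumption of this sketch.

```python
import io


def simple_fasta_tuples(handle):
    """Yield (title, sequence) string tuples from a FASTA handle.

    Hypothetical illustration only; not the committed implementation.
    """
    title = None
    chunks = []
    for line in handle:
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].rstrip()
            chunks = []
        elif title is not None:
            # Sequence line: drop the newline and any embedded spaces.
            chunks.append(line.strip().replace(" ", ""))
        # Lines before the first ">" are ignored (assumption of this sketch).
    if title is not None:
        yield title, "".join(chunks)


# Small in-memory example, so no file on disk is needed.
example = io.StringIO(">alpha\nACG\nTA\n>beta\nCGTC\n")
records = list(simple_fasta_tuples(example))
print(records)
```

Because it is a generator, only one record is held in memory at a
time unless the caller collects the results into a list, as the
example does.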

The capitalisation style is consistent with other functions in
SeqIO, but not with PEP8.

Peter

P.S. I've also updated the legacy function quick_FASTA_reader
in Bio.SeqUtils to use this. Since that old function loads the whole
dataset into memory, I would like to deprecate it if no one objects.


