[Biopython] Indexing large sequence files

Peter biopython at maubp.freeserve.co.uk
Thu Jun 25 10:24:54 UTC 2009


On Fri, Jun 19, 2009 at 4:20 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> P.S. Here is a rough version which works on more file formats. This tries to
> use the record.id as the dictionary key, based on how the SeqIO parsers
> work and the default behaviour of the Bio.SeqIO.to_dict() function.
>
> In some cases (e.g. FASTA and FASTQ) this is easy to mimic (getting the
> same string for the record.id). For SwissProt or GenBank files this is harder,
> so the choice is parse the record (slow) or mimic the record header parsing
> in Bio.SeqIO (fragile - we'd need good test coverage). Something based on
> this code might be a worthwhile addition to Bio.SeqIO, obviously this would
> need tests and documentation first.

I've realised there is a subtle bug in that code for some FASTQ files, because
I simply break up the file looking for lines starting with "@". As some may be
aware, the quality string lines in a FASTQ file can sometimes also start with
a "@" (a poor design choice really), which means that there would be a few
extra false entries in the index (and trying to access them would trigger an
error).

Obviously our FASTQ parser can cope with these records, but the indexing
code would also need to look at several lines in context to do this properly.
So, this can be solved for FASTQ, but would inevitably be a bit slower.

I'm currently thinking that if there is sufficient interest in having
this kind of
functionality in Bio.SeqIO, it might be best to allow separate implementations
for each file type, all providing a similar dict like object (rather
trying to handle
many formats in one indexer). This can be done with subclasses to avoid
code duplication. We could then have a Bio.SeqIO.to_index(...) function
which would looks up the appropriate indexer for the specified file format,
and return a dictionary like index giving SeqRecord objects.

Peter



More information about the Biopython mailing list