[Biopython] Indexing large sequence files
Peter
biopython at maubp.freeserve.co.uk
Thu Jun 25 06:24:54 EDT 2009
On Fri, Jun 19, 2009 at 4:20 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> P.S. Here is a rough version which works on more file formats. This tries to
> use the record.id as the dictionary key, based on how the SeqIO parsers
> work and the default behaviour of the Bio.SeqIO.to_dict() function.
>
> In some cases (e.g. FASTA and FASTQ) this is easy to mimic (getting the
> same string for the record.id). For SwissProt or GenBank files this is harder,
> so the choice is to either parse the full record (slow) or mimic the record
> header parsing in Bio.SeqIO (fragile - we'd need good test coverage).
> Something based on this code might be a worthwhile addition to Bio.SeqIO,
> but obviously it would need tests and documentation first.
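The offset-indexing idea above can be sketched for the easy FASTA case like so (a minimal illustration, not the actual Bio.SeqIO code; index_fasta is a hypothetical helper name):

```python
# Sketch of offset indexing for FASTA: scan the file once, recording
# the file offset of each ">" title line, keyed by the same string the
# SeqIO parser would use for record.id (the first word after ">").

def index_fasta(filename):
    """Map record.id -> file offset of its '>' title line."""
    index = {}
    with open(filename) as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(">"):
                # Mimic SeqIO: record.id is the first whitespace-separated word
                key = line[1:].split(None, 1)[0]
                index[key] = offset
            offset = handle.tell()
            line = handle.readline()
    return index
```

Looking up a record then just means seeking to the stored offset and handing that part of the file to the normal parser.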
I've realised there is a subtle bug in that code for some FASTQ files, because
I simply break up the file looking for lines starting with "@". As some may be
aware, the quality string lines in a FASTQ file can sometimes also start with
a "@" (a poor design choice really), which means that there would be a few
extra false entries in the index (and trying to access them would trigger an
error).
Obviously our FASTQ parser can cope with these records, but the indexing
code would also need to look at several lines in context to do this properly.
So, this can be solved for FASTQ, but would inevitably be a bit slower.
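One way to solve it (a sketch only, assuming a hypothetical index_fastq helper, not Biopython's implementation) is to count the sequence characters and then consume exactly that many quality characters, so a quality line beginning with "@" is never mistaken for a new record's title line:

```python
# Context-aware FASTQ offset indexing: because a quality line may
# itself start with "@", we cannot just split on lines starting with
# "@". Instead, after reading the sequence we consume quality lines
# until their combined length matches the sequence length.

def index_fastq(filename):
    """Map record.id -> file offset of its '@' title line."""
    index = {}
    with open(filename) as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if not line.startswith("@"):
                raise ValueError("Expected '@' at offset %i" % offset)
            key = line[1:].rstrip().split(None, 1)[0]
            index[key] = offset
            # Sequence lines run until the "+" separator line
            seq_len = 0
            line = handle.readline()
            while line and not line.startswith("+"):
                seq_len += len(line.strip())
                line = handle.readline()
            # Quality lines run until we have seq_len characters,
            # even if one of them happens to start with "@"
            qual_len = 0
            while qual_len < seq_len:
                line = handle.readline()
                qual_len += len(line.strip())
            offset = handle.tell()
            line = handle.readline()
    return index
```

This is inevitably slower than the naive line split, since every sequence and quality line has to be examined, but it is robust to the "@" ambiguity.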
I'm currently thinking that if there is sufficient interest in having this
kind of functionality in Bio.SeqIO, it might be best to allow separate
implementations for each file type, all providing a similar dict-like object
(rather than trying to handle many formats in one indexer). This can be done
with subclasses to avoid code duplication. We could then have a
Bio.SeqIO.to_index(...) function which would look up the appropriate indexer
for the specified file format, and return a dictionary-like index giving
SeqRecord objects.
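The subclass design could look roughly like this (all names here - SeqFileIndex, FastaIndex, to_index - are hypothetical, and for simplicity the sketch returns plain (id, sequence) tuples rather than SeqRecord objects):

```python
# Sketch of the proposed design: a base class provides the dict-like
# bookkeeping, and each file format supplies its own offset scanner
# and record parser in a small subclass.

class SeqFileIndex:
    """Read-only dict-like access to records in a sequence file."""
    def __init__(self, filename):
        self._handle = open(filename)
        self._index = {}
        self._build_index()  # subclass fills self._index with offsets

    def _build_index(self):
        raise NotImplementedError

    def _parse(self):
        # Subclass parses one record at the handle's current offset
        raise NotImplementedError

    def __getitem__(self, key):
        self._handle.seek(self._index[key])
        return self._parse()

    def __len__(self):
        return len(self._index)

    def __iter__(self):
        return iter(self._index)

class FastaIndex(SeqFileIndex):
    def _build_index(self):
        handle = self._handle
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(">"):
                self._index[line[1:].split(None, 1)[0]] = offset
            offset = handle.tell()
            line = handle.readline()

    def _parse(self):
        title = self._handle.readline()[1:].rstrip()
        seq = []
        line = self._handle.readline()
        while line and not line.startswith(">"):
            seq.append(line.strip())
            line = self._handle.readline()
        return title.split(None, 1)[0], "".join(seq)

_INDEXERS = {"fasta": FastaIndex}

def to_index(filename, format):
    """Look up the indexer registered for this file format."""
    try:
        return _INDEXERS[format](filename)
    except KeyError:
        raise ValueError("Unsupported format: %r" % format)
```

Adding a new format would then just mean writing one subclass and registering it, with all the dictionary behaviour inherited from the base class.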
Peter