[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Fri Sep 4 13:22:27 EDT 2009

On Tue, Sep 1, 2009 at 2:25 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 1, 2009 at 2:06 PM, Brad Chapman<chapmanb at 50mail.com> wrote:
>> Hi Peter;
>>
>> [indexed dict usage]
>>> What file formats where you working on, and how many records?
>>
>> It was a 100Mb fasta file with about 41,000 records. Nothing too
>> heavy but it worked great.
>
> Yeah, with just 41,000 keys and offsets the in memory dict would
> be pretty small too. This is within the range of file sizes I expect
> the Bio.SeqIO.indexed_dict() functionality to be used on. Cool.
>
>> The only change I made was to generalize the record building line:
>>
>> self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>>
>> to allow an arbitrary function to be passed to define the
>> identifier, instead of defaulting to the first part of the line.
>> This is helpful for those fun NCBI ids
>> (gi|83029091|ref|XM_357633.3|) where other parts of the program only
>> have the accession number.
>
> Did your callback function get given the "title string" and return
> the desired key?
>
> I had wondered about this, but the only way for this to be general
> (to work on all file formats) is for the callback function to be given
> a SeqRecord object - which means having to fully parse the file
> during the indexing, which ends up being *much* slower. We can
> do this if you think it adds a lot of utility i.e. mimic the key_function
> argument we already have on Bio.SeqIO.to_dict()

A less flexible option is a callback function which maps the default
record.id to a new key. This would solve your NCBI FASTA issue,
and might be handy in other settings (e.g. removing the version
suffix in GenBank identifiers). However, it would not allow for
example switching to a completely different identifier (e.g. the GI
number) which is present elsewhere in the file.

The point is we can support this kind of limited key_function
without suffering the severe speed penalty which doing a full
parse to give SeqRecord objects would impose.

How does that sound Brad? It should add just a little complexity
to the current code, and allows some neat tricks. Or we can
leave things as they are (KISS).

Peter