[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Tue Sep 8 08:14:05 EDT 2009

Hi Peter;

[... callback function for specifying an ID ...]
> > Did your callback function get given the "title string" and return
> > the desired key?
> >
> > I had wondered about this, but the only way for this to be general
> > (to work on all file formats) is for the callback function to be given
> > a SeqRecord object - which means having to fully parse the file
> > during the indexing, which ends up being *much* slower. We can
> > do this if you think it adds a lot of utility, i.e. mimic the
> > key_function argument we already have on Bio.SeqIO.to_dict().
> 
> A less flexible option is a callback function which maps the default
> record.id to a new key. This would solve your NCBI FASTA issue,
> and might be handy in other settings (e.g. removing the version
> suffix in GenBank identifiers). However, it would not allow, for
> example, switching to a completely different identifier (e.g. the GI
> number) which is present elsewhere in the file.
> 
> The point is we can support this kind of limited key_function
> without suffering the severe speed penalty which doing a full
> parse to give SeqRecord objects would impose.

This is a great compromise. You're right, fully parsing each SeqRecord
is too much, and allowing manipulation of the default identifier would
work fine. If people need to do something much more complicated to get
the ID, they would probably be better off extending the existing classes
and writing a custom indexer that pulls the IDs they need.
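
To make the idea concrete, here is a rough sketch of the limited
key_function approach for FASTA. It is only an illustration: index_fasta,
ncbi_accession and strip_version are made-up names, not existing Bio.SeqIO
functions, and it assumes the default ID is the first word of the title
line. Only the title lines are inspected, so no SeqRecord is ever built.

```python
import io


def index_fasta(handle, key_function=None):
    """Map each record key to the byte offset of its '>' title line.

    Hypothetical helper for illustration, not part of Bio.SeqIO.
    The handle must be opened in binary mode so offsets are exact.
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Assumed default ID: first whitespace-delimited word of the title
            words = line[1:].split(None, 1)
            default_id = words[0].decode("ascii") if words else ""
            # The callback only ever sees the default ID string
            key = key_function(default_id) if key_function else default_id
            if key in offsets:
                raise ValueError("Duplicate key %r" % key)
            offsets[key] = offset
    return offsets


def ncbi_accession(default_id):
    """Turn 'gi|6273291|gb|AF191665.1|AF191665' into 'AF191665.1'."""
    return default_id.split("|")[3]


def strip_version(default_id):
    """Turn 'AF191665.1' into 'AF191665'."""
    return default_id.rsplit(".", 1)[0]


data = (
    b">gi|6273291|gb|AF191665.1|AF191665 Opuntia marenae\n"
    b"TATACATT\nAAAGG\n"
    b">gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata\n"
    b"TATACATT\n"
)
index = index_fasta(io.BytesIO(data), key_function=ncbi_accession)
```

With the callback the keys are the accessions rather than the full
"gi|...|gb|...|" strings, and seeking to a stored offset puts you at the
start of that record. Anything that needs data from outside the title
line would still need a custom indexer.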

Brad