[Biopython] fastq manipulations speed

Sun Mar 17 21:24:33 UTC 2013

On Sun, Mar 17, 2013 at 8:22 PM, Chris Mitchell <chris.mit7 at gmail.com> wrote:
> Hi Natassa,
>
> First, I wouldn't bother indexing.  This seems a one-and-done operation and
> indexing is thus a waste of time.  Have the list of stuff you want to find
> first, then iterate through the fasta file looking for what you want.

You might be able to do a paired iteration between the trimmed
FASTA file and the untrimmed quality file. I'll reply separately
with comments on the current code...
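
As a rough sketch of what I mean by a paired iteration, assuming
both files list their records in the same order and every trimmed
record also appears in the untrimmed file. The function name and
the plain (name, payload) tuples are made up for illustration; in
practice the two inputs would come from parsing the two files:

```python
def paired_iteration(trimmed, untrimmed):
    """Yield (name, trimmed_payload, untrimmed_payload) for shared names.

    Both arguments are iterables of (name, payload) tuples. Assumes the
    names in `trimmed` occur in `untrimmed` in the same relative order.
    """
    untrimmed_iter = iter(untrimmed)
    for name, seq in trimmed:
        # Advance through the untrimmed records until the names match;
        # records that were dropped from the trimmed file are skipped.
        for other_name, qual in untrimmed_iter:
            if other_name == name:
                yield name, seq, qual
                break
        else:
            raise ValueError("%s not found in untrimmed input" % name)


# Toy data standing in for the two files.
trimmed = [("r1", "AC"), ("r3", "GG")]
untrimmed = [("r1", [30, 30, 30]), ("r2", [10, 10]), ("r3", [20, 20, 20])]
pairs = list(paired_iteration(trimmed, untrimmed))
```

Each file is read once from start to finish, so nothing needs
to be indexed or held in memory as a dictionary.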

> One comment on the code that will speed it up:
> don't use if record in fq_dict.keys().  That returns a list which is going
> to have a lookup time proportional to the list size.  Do:
> fq_keys = set(fq_dict.keys()) and then if record in fq_keys, this will be
> O(1) lookup time.
>
> Chris

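That's an excellent point, but both dictionaries and sets use
hash-based lookups, so a membership test on either should take
about the same time. There is no need to build a set from the
keys; just test the dictionary directly, i.e. instead of this: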

if record in fq_dict.keys():
    #do stuff

Use this:

if record in fq_dict:
    #do stuff

That is also considered better style. Another related point,
rather than:

for record in fasta_dict.keys():
    #do stuff

this would typically be written as:

for record in fasta_dict:
    #do stuff

This does the same thing, and is a little faster since there is
no need to call the keys method.
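
If you want to check this yourself, here is a small timing sketch.
The dictionary contents and names are invented stand-ins for
fq_dict, and the explicit list mimics what the keys method returns
under Python 2 (under Python 3 it returns a view whose membership
test is already hash based):

```python
import timeit

# Hypothetical stand-in for fq_dict: 100,000 read names with dummy values.
fq_dict = dict(("read_%i" % i, i) for i in range(100000))
key_list = list(fq_dict.keys())  # a plain list, as keys() gives on Python 2
wanted = "read_99999"  # near the end, so the list scan does the most work

# Time 100 membership tests against the list and against the dictionary.
list_time = timeit.timeit(lambda: wanted in key_list, number=100)
dict_time = timeit.timeit(lambda: wanted in fq_dict, number=100)
print("list scan: %.6f s, dict lookup: %.6f s" % (list_time, dict_time))

# Iterating over a dictionary yields its keys, so looping over
# fq_dict visits the same names as looping over fq_dict.keys().
same_keys = list(fq_dict) == key_list
```

The exact numbers will depend on your machine, but the list scan
should get slower as the dictionary grows while the dictionary
lookup stays roughly constant.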

Peter