[Biopython] matching headers and then writing the seq record

Peter Cock p.j.a.cock at googlemail.com
Sat Jul 28 20:48:32 UTC 2012


On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> ... It seems as if set undoes the elegance of using a generator.
> Any advice is greatly appreciated! ...
>
> headers_read1 = set(...)
> headers_read2 = set(...)
> header_matches = [x for x in headers_read1 if x in headers_read2]

I would expect that using the built in set's intersection operation would
be faster than this list comprehension solution to create header_matches.

Also, you should use a set not a list for header_matches because testing
membership with a set is much faster than a list. i.e. Try:

header_matches = headers_read1.intersection(headers_read2)

This might be a tiny change, but I expect it to be noticeably faster.

Also, here:

> def matched_records(records, pairType, header_matches):
>    for rec in records:
>        id = get_header(rec)
>        result = id in header_matches
>        if (result == True):
>            newrec = replace_header(rec,pairType)
>            yield newrec

If you don't mind my style comments, you don't really need
to create the variables 'id' and  'result', and 'newrec' - I would
just do:

def matched_records(records, pairType, header_matches):
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec,pairType)

And at that point you could write the whole thing as a
generator expression, which you may or may not find
more pleasing (I'm not sure if it makes any significant
difference to the speed). i.e.

records = SeqIO.parse(sys.argv[1], "fastq")
pairType = 1
wanted = (replace_header(rec,pairType) \
                 for rec in records \
                 if get_header(rec) in header_matches)
count = SeqIO.write(wanted, sys.argv[3], "fastq")

I hope that helps,

Peter



More information about the Biopython mailing list