[Biopython] matching headers and then writing the seq record
Peter Cock
p.j.a.cock at googlemail.com
Sat Jul 28 20:48:32 UTC 2012
On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
> ... It seems as if set undoes the elegance of using a generator.
> Any advice is greatly appreciated! ...
>
> headers_read1 = set(...)
> headers_read2 = set(...)
> header_matches = [x for x in headers_read1 if x in headers_read2]
I would expect the built-in set intersection operation to be faster
than this list comprehension for creating header_matches. Also,
header_matches should be a set rather than a list, because testing
membership in a set is much faster than in a list. i.e. Try:
    header_matches = headers_read1.intersection(headers_read2)
This is a tiny change, but I expect it to be noticeably faster.
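As a minimal sketch of the difference, assuming a handful of made-up read IDs in place of the real FASTQ headers (the ID strings below are hypothetical, not from the original script), both approaches find the same matches but only the intersection returns a set:

```python
# Hypothetical read IDs standing in for the headers of the two FASTQ files.
headers_read1 = {"read_001", "read_002", "read_003", "read_005"}
headers_read2 = {"read_002", "read_003", "read_004", "read_005"}

# Original approach: a list comprehension, which produces a list.
matches_list = [x for x in headers_read1 if x in headers_read2]

# Suggested approach: set intersection, which produces a set.
matches_set = headers_read1.intersection(headers_read2)

# The & operator is equivalent when both operands are sets.
assert matches_set == headers_read1 & headers_read2

# Both approaches find the same IDs; only the container type differs.
assert set(matches_list) == matches_set
print(sorted(matches_set))  # ['read_002', 'read_003', 'read_005']
```

Keeping the result as a set matters later on, because the "in" test inside the record loop is then a hash lookup rather than a scan through a list.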
Also, here:
> def matched_records(records, pairType, header_matches):
>     for rec in records:
>         id = get_header(rec)
>         result = id in header_matches
>         if (result == True):
>             newrec = replace_header(rec,pairType)
>             yield newrec
If you don't mind my style comments, you don't really need
to create the variables 'id', 'result' and 'newrec'. I would
just do:
    def matched_records(records, pairType, header_matches):
        for rec in records:
            if get_header(rec) in header_matches:
                yield replace_header(rec, pairType)
At that point you could write the whole thing as a
generator expression, which you may or may not find
more pleasing (I'm not sure if it makes any significant
difference to the speed). i.e.
    records = SeqIO.parse(sys.argv[1], "fastq")
    pairType = 1
    wanted = (replace_header(rec, pairType)
              for rec in records
              if get_header(rec) in header_matches)
    count = SeqIO.write(wanted, sys.argv[3], "fastq")
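To check that the generator function and the generator expression select the same records, here is a small sketch that runs without Biopython. It assumes each record is a plain (identifier, sequence) tuple, and the get_header and replace_header bodies below are hypothetical stand-ins, not the functions from the original script:

```python
# Hypothetical stand-ins: each record is just an (identifier, sequence) tuple,
# and these helpers are NOT the original poster's get_header/replace_header.
def get_header(rec):
    # Strip a trailing "/1" or "/2" pair suffix from the identifier.
    return rec[0].rsplit("/", 1)[0]

def replace_header(rec, pairType):
    # Rebuild the identifier with the requested pair number.
    return ("%s/%d" % (get_header(rec), pairType), rec[1])

def matched_records(records, pairType, header_matches):
    # Generator function form: yields one rewritten record at a time.
    for rec in records:
        if get_header(rec) in header_matches:
            yield replace_header(rec, pairType)

records = [("r1/1", "ACGT"), ("r2/1", "GGCC"), ("r3/1", "TTAA")]
header_matches = {"r1", "r3"}
pairType = 1

# iter(...) mimics a one-pass parser iterator for each form.
from_function = list(matched_records(iter(records), pairType, header_matches))
from_expression = list(replace_header(rec, pairType)
                       for rec in iter(records)
                       if get_header(rec) in header_matches)

assert from_function == from_expression
print(from_function)  # [('r1/1', 'ACGT'), ('r3/1', 'TTAA')]
```

Either form stays lazy until something consumes it, so records are only filtered and rewritten as the writer pulls them one by one.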
I hope that helps,
Peter