[Biopython] matching headers and then writing the seq record
Dilara Ally
dilara.ally at gmail.com
Tue Jul 31 14:53:27 EDT 2012
Thanks Peter it sped it up considerably! I appreciate the fast replies on this listserv.
On Jul 28, 2012, at 1:48 PM, Peter Cock wrote:
> On Thu, Jul 26, 2012 at 6:48 PM, Dilara Ally <dilara.ally at gmail.com> wrote:
>> ... It seems as if set undoes the elegance of using a generator.
>> Any advice is greatly appreciated! ...
>>
>> headers_read1 = set(...)
>> headers_read2 = set(...)
>> header_matches = [x for x in headers_read1 if x in headers_read2]
>
> I would expect that using the built in set's intersection operation would
> be faster than this list comprehension solution to create header_matches.
>
> Also, you should use a set not a list for header_matches because testing
> membership with a set is much faster than a list. i.e. Try:
>
> header_matches = headers_read1.intersection(headers_read2)
>
> This might be a tiny change, but I expect it to be noticeably faster.
>
> Also, here:
>
>> def matched_records(records, pairType, header_matches):
>> for rec in records:
>> id = get_header(rec)
>> result = id in header_matches
>> if (result == True):
>> newrec = replace_header(rec,pairType)
>> yield newrec
>
> If you don't mind my style comments, you don't really need
> to create the variables 'id' and 'result', and 'newrec' - I would
> just do:
>
> def matched_records(records, pairType, header_matches):
> for rec in records:
> if get_header(rec) in header_matches:
> yield replace_header(rec,pairType)
>
> And at that point you could write the whole thing as a
> generator expression, which you may or may not find
> more pleasing (I'm not sure if it makes any significant
> difference to the speed). i.e.
>
> records = SeqIO.parse(sys.argv[1], "fastq")
> pairType = 1
> wanted = (replace_header(rec,pairType) \
> for rec in records \
> if get_header(rec) in header_matches)
> count = SeqIO.write(wanted, sys.argv[3], "fastq")
>
> I hope that helps,
>
> Peter
More information about the Biopython
mailing list