[Biopython] Support for k-mer analyses?

McCulloch, Alan alan.mcculloch at agresearch.co.nz
Wed Feb 15 19:51:45 UTC 2017


Cheers Alexey, thanks for the update.

PS sorry for the delay getting back to you about your question on caching. My use-case is making a
multivariate k-mer distribution across many samples. For each sample (usually a fastq file),
the program checks whether there is already a cached univariate distribution for that sample, and if so uses it.
If not, it builds and caches the univariate distribution. This is useful when restarting
the multivariate build, and also when I want to make a different multivariate build,
e.g. to add some more samples. Each univariate distribution is reasonably costly to build,
as I am usually processing (and sub-sampling) a large fastq file, so I don't want to have to
rebuild those.
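
To make the caching idea concrete, here is a minimal sketch of that pattern in plain Python.
It is not my actual program: the function names, the JSON cache format and the cache file
naming are all assumptions made for illustration, and the sub-sampling step is left out.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def count_kmers(fastq_path, k):
    """Count k-mers over the sequence lines of a FASTQ file (plain or gzipped)."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the 2nd line of each 4-line FASTQ record.
            if line_number % 4 != 1:
                continue
            seq = line.strip().upper()
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def cached_distribution(fastq_path, k, cache_dir):
    """Return the univariate k-mer distribution for one sample, building it only if needed."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one cache file per (sample file name, k).
    cache_file = cache_dir / f"{Path(fastq_path).name}.k{k}.json"
    if cache_file.exists():
        with open(cache_file) as handle:
            return Counter(json.load(handle))
    counts = count_kmers(fastq_path, k)
    with open(cache_file, "w") as handle:
        json.dump(counts, handle)
    return counts


def multivariate_distribution(fastq_paths, k, cache_dir):
    """Combine per-sample distributions into {kmer: [count in sample 1, sample 2, ...]}."""
    per_sample = [cached_distribution(path, k, cache_dir) for path in fastq_paths]
    kmers = sorted(set().union(*per_sample))
    return {kmer: [counts[kmer] for counts in per_sample] for kmer in kmers}
```

With this layout a restarted build, or a new build with extra samples, only pays the counting
cost for samples that have no cache file yet. A real version would also need to put the
sub-sampling settings into the cache key, so that a stale distribution is not reused.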


From: Biopython [mailto:biopython-bounces+alan.mcculloch=agresearch.co.nz at mailman.open-bio.org] On Behalf Of Alexey Morozov
Sent: Wednesday, 15 February 2017 7:44 p.m.
Cc: biopython at biopython.org
Subject: Re: [Biopython] Support for k-mer analyses?

As there is not much interest in this task (and people on biopython-dev were uncertain whether it would be useful to include a kmer module in the Biopython distribution), I've decided not to make it. If anyone needs a (non-Biopython) kmer library, there are already several. I haven't done a proper comparison, but kPAL seems the most developed.

2017-02-08 20:41 GMT+08:00 Alexey Morozov <alexeymorozov1991 at gmail.com<mailto:alexeymorozov1991 at gmail.com>>:
Hi Alan, thanks for your feedback.
You've raised some ideas that hadn't crossed my mind. I was just wondering: why did you find it necessary to cache distributions? What was the scale of your work? In my experience, amino acid distributions of six complete eukaryotic proteomes up to k of 6 or 7 could fit into something like seven or so gigs (with no optimisation, without even numpy), so I thought nucleotide distributions would be prohibitive in terms of RAM only when there are tens of them.



--
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.