[Biopython] Support for k-mer analyses?
McCulloch, Alan
alan.mcculloch at agresearch.co.nz
Tue Feb 7 02:27:00 UTC 2017
Hi Alexey,
I’ve frequently used k-mer analysis of nucleotide sequences over the past couple of years, mainly as
part of Q/C of NGS sequence data. I’ve written Python code to construct the k-mer distributions and
R code to do the plotting:
https://github.com/AgResearch/prbdf/blob/master/kmer_entropy.py
https://github.com/AgResearch/prbdf/blob/master/kmer_plots.r
(Applications like FastQC already do some k-mer-analysis-based Q/C; however, I found
it useful to extend that in various ways.)
Based on my own experience, I think this proposed addition to Biopython would have a good chance of
being useful.
Some ideas for features and implementation below – hope there is something useful in
there.
Cheers
Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ
Feature Suggestions :
=================
* All of the plots I do are based on the self-information of a k-mer (i.e. -log( N(kmer i) / N(all kmers) )).
Support for self-information as well as frequency output would be handy. I’ve also found it
useful to plot log rank of k-mer frequency against self-information (a kind of plot I call
“zipfian”, as it’s closely related to the Zipf plots of log rank against log frequency that have
been used on various datasets), so outputting ranks as well as frequencies could be useful.
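The self-information and rank bookkeeping described above can be sketched in a few lines. This is my own minimal illustration (not the code from the repository linked above); the function names are hypothetical:

```python
import math
from collections import Counter

def kmer_self_information(seq, k):
    """Count k-mers in seq and return a dict mapping each k-mer to its
    self-information, -log2( N(kmer i) / N(all kmers) )."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}

def zipfian_points(self_info):
    """Return (log10 rank, self-information) pairs for a 'zipfian' plot.
    Frequent k-mers have LOW self-information, so ascending sort puts
    the most frequent k-mer at rank 1."""
    ordered = sorted(self_info.values())
    return [(math.log10(rank), si) for rank, si in enumerate(ordered, start=1)]
```

Any log base works for the plots as long as it is used consistently; base 2 gives self-information in bits.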
* Usually I find it’s only necessary to analyse a fairly small random sample of (e.g.) an NGS fastq file in
order to characterise it using a k-mer-distribution approach. This greatly reduces the cost of k-mer
analysis. So support for analysing a random sample of a file would be good.
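One way to take such a random sample in a single pass, without holding the whole file in memory, is reservoir sampling over four-line FASTQ records. A sketch (my own illustration, not the author's code):

```python
import random

def sample_fastq_records(lines, sample_size, seed=0):
    """Reservoir-sample complete FASTQ records (groups of 4 lines) in a
    single pass over an iterable of lines."""
    rng = random.Random(seed)
    reservoir = []
    record = []
    n_seen = 0
    for line in lines:
        record.append(line)
        if len(record) == 4:          # one complete FASTQ record
            n_seen += 1
            if len(reservoir) < sample_size:
                reservoir.append(record)
            else:
                j = rng.randrange(n_seen)
                if j < sample_size:   # replace a random reservoir slot
                    reservoir[j] = record
            record = []
    return reservoir
```

Each record survives with probability sample_size / n_seen, so the sample is uniform regardless of file size.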
* Usually one is comparing the k-mer distributions from multiple files / samples, looking for
things like outlier files, or groups of files that segregate in some way. That means constructing
a "k-mer multi-distribution" rather than a single k-mer distribution. So support for constructing
a k-mer multi-distribution would be very useful – i.e. basically a table in which each of
N rows is a k-mer, and each of M columns is a sample or file. Values in the table are the
frequency, self-information or rank of a k-mer in a file / sample.
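The N x M table can be sketched as a dict of rows keyed by k-mer, with a zero for k-mers absent from a sample. Again a hypothetical illustration, with frequencies as the cell values:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Raw k-mer frequencies for one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def multi_distribution(samples, k):
    """Build the N x M table: rows are the union of k-mers seen in any
    sample, columns are the samples; cells are raw frequencies."""
    per_sample = {name: kmer_counts(seq, k) for name, seq in samples.items()}
    all_kmers = sorted(set().union(*per_sample.values()))
    return {kmer: {name: per_sample[name].get(kmer, 0)
                   for name in per_sample}
            for kmer in all_kmers}
```

Converting cells to self-information or ranks is then a per-column transformation of the same table.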
* I’ve found it useful to be able to easily incrementally add to multi-distributions, so I added support
for that (i.e. preparing an N x (M+1) table without having to recalculate everything from
scratch).
* I wanted a single-step analysis – e.g. I did not want to have to first index a file and then
run the k-mer construction off the index (as some k-mer analysis software requires).
* I’ve found it useful to be able to take a set of over-represented k-mers and go back in
and pull out seqs that exhibit one or more of those k-mers – e.g. this can yield
data features such as overlooked adapter fragments, repeats and low-complexity
AT-rich fragments that can be difficult to pick up by other means. (Over-represented
k-mers can often easily be picked up visually from “zipfian” plots – they show up as
steps in the plot. We did a poster on this at a conference last year (ATGA 2016
in Auckland).) It could be useful to add this kind of k-mer-based querying of sequence
data.
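Pulling out the sequences that exhibit one or more suspect k-mers is a straightforward set-membership filter. A minimal sketch of the idea (function name hypothetical):

```python
def sequences_with_kmers(seqs, suspect_kmers, k):
    """Return the sequences containing at least one of the
    over-represented k-mers (e.g. adapter fragments or low-complexity
    repeats spotted as steps on a 'zipfian' plot)."""
    suspects = set(suspect_kmers)
    hits = []
    for seq in seqs:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        if kmers & suspects:          # any suspect k-mer present?
            hits.append(seq)
    return hits
```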
Implementation Suggestions :
=======================
* I have found a multiprocessing approach useful to speed things up
(each process working on part of the file yielding a partial distribution,
then these are combined).
* To help support incremental additions to k-mer multi-distributions, I cache
distributions in pickle files.
* Constructing an N x M multi-distribution of N k-mer counts across M
files / samples is a special case of a more general multi-distribution
construction that is very common in bioinformatics – e.g. other
examples I’ve encountered recently:
- tabulate counts of blast top-hits to each of N taxa in each of M samples
- tabulate the percentage identity of the top hit of each of
N DNA oligos in each of M bacterial genomes
So I have structured my code as a kernel of methods and data structures
which build N x M multi-distributions of “anything”; clients then
implement code to (for example) parse some file and generate
data values. That data-generator method is passed to the kernel,
which calls it to generate the data streams it needs. I’ve found this
an improvement over my previous approach, which would have included
in the kernel quite a bit of code related to parsing data files.
It has made things more flexible, because odd file formats not previously
encountered can easily be handled (with a bit of client coding). For example,
tassel genotyping-by-sequencing (GBS) software generates “tag count files”,
which are binary summaries of the counts of tags ( = "very long k-mers"),
and I wanted to use these to generate multi-distributions of 6-mers.
I just needed to write a client function to unpack the tag-count files and
pass it to the multi-distribution kernel – no change needed to that code.
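The kernel / client split described above can be sketched like this: the kernel only ever sees an iterable of items per sample, and each file format gets its own small generator. A hypothetical illustration of the pattern, not the actual kernel:

```python
from collections import Counter

def build_multi_distribution(sample_names, value_generator):
    """Kernel: knows nothing about file formats. For each sample it
    calls the client-supplied generator, which yields the items to
    tally, and assembles the N x M table of counts."""
    per_sample = {name: Counter(value_generator(name))
                  for name in sample_names}
    rows = sorted(set().union(*per_sample.values()))
    return {row: {name: per_sample[name].get(row, 0)
                  for name in sample_names}
            for row in rows}

# Client side: one generator per data source. Here, 2-mers from
# in-memory sequences; a real client might instead unpack tassel
# tag-count files, or parse blast output, and yield taxa names.
SEQS = {"s1": "AAAT", "s2": "ATAT"}

def two_mers(name):
    seq = SEQS[name]
    for i in range(len(seq) - 1):
        yield seq[i:i + 2]
```

Supporting a new format then means writing one new generator; the kernel is untouched.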
From: Biopython [mailto:biopython-bounces+alan.mcculloch=agresearch.co.nz at mailman.open-bio.org] On Behalf Of Alexey Morozov
Sent: Tuesday, 7 February 2017 12:14 a.m.
To: biopython at biopython.org
Subject: [Biopython] Support for k-mer analyses?
Dear Biopython users,
I wanted to add support for k-mer analysis to Biopython, if it would be useful. I was thinking about some basic analyses, i.e. constructing k-mer distributions from amino acid and nucleotide sequences (or sequence sets), a few distance metrics between them, classifying query sequences using these distributions, and such.
Does anyone actually need it? If you think you'll benefit from having such methods available from Biopython (and not a separate lib), please drop a few lines: what kind of analysis do you need, what sorts of file formats would you like Biopython to support and so on.
--
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.