[Biopython] Support for k-mer analyses?
McCulloch, Alan
alan.mcculloch at agresearch.co.nz
Tue Feb 7 02:27:00 UTC 2017
Hi Alexey,
I’ve frequently used k-mer analysis of nucleotide sequences over the past couple of years, mainly as
part of Q/C of NGS sequence data. I’ve written Python code to construct the k-mer distributions and
R code to do the plotting:
https://github.com/AgResearch/prbdf/blob/master/kmer_entropy.py
https://github.com/AgResearch/prbdf/blob/master/kmer_plots.r
(Applications like FastQC already do some k-mer-analysis-based Q/C; however, I found
it useful to extend that in various ways.)
Based on my own experience, I think this proposed addition to Biopython would have a good chance of
being useful.
Some ideas for features and implementation below – hope there is something useful in
there.
Cheers
Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ
Feature Suggestions :
=================
* All of the plots I do are based on the self-information of a k-mer (i.e. -log( N(kmer i) / N(all kmers) )).
Support for self-information as well as frequency output would be handy. I’ve also found it
useful to plot log rank of k-mer frequency against self-information (a kind of plot I call
“zipfian”, as it’s closely related to the Zipf plots of log rank against log frequency that have
been used on various datasets), so outputting ranks as well as frequencies could be useful.
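The self-information and rank bookkeeping described above can be sketched in a few lines. This is my own minimal illustration (not the code from the repository linked above); the function names are hypothetical:

```python
import math
from collections import Counter

def kmer_self_information(seq, k):
    """Count k-mers in seq and return a dict mapping each k-mer to its
    self-information, -log2( N(kmer i) / N(all kmers) )."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}

def zipfian_points(self_info):
    """Return (log10 rank, self-information) pairs for a 'zipfian' plot.
    Frequent k-mers have LOW self-information, so ascending sort puts
    the most frequent k-mer at rank 1."""
    ordered = sorted(self_info.values())
    return [(math.log10(rank), si) for rank, si in enumerate(ordered, start=1)]
```

Any log base works for the plots as long as it is used consistently; base 2 gives self-information in bits.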
* Usually I find it’s only necessary to analyse a fairly small random sample of (e.g.) an NGS fastq file in
order to characterise it using a k-mer-distribution approach. This greatly reduces the cost of k-mer
analysis. So support for analysing a random sample of a file would be good.
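One way to take such a random sample in a single pass, without holding the whole file in memory, is reservoir sampling over four-line FASTQ records. A sketch (my own illustration, not the author's code):

```python
import random

def sample_fastq_records(lines, sample_size, seed=0):
    """Reservoir-sample complete FASTQ records (groups of 4 lines) in a
    single pass over an iterable of lines."""
    rng = random.Random(seed)
    reservoir = []
    record = []
    n_seen = 0
    for line in lines:
        record.append(line)
        if len(record) == 4:          # one complete FASTQ record
            n_seen += 1
            if len(reservoir) < sample_size:
                reservoir.append(record)
            else:
                j = rng.randrange(n_seen)
                if j < sample_size:   # replace a random reservoir slot
                    reservoir[j] = record
            record = []
    return reservoir
```

Each record survives with probability sample_size / n_seen, so the sample is uniform regardless of file size.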
* Usually one is comparing the k-mer distributions from multiple files / samples, looking for
things like outlier files, or groups of files that segregate in some way. That means constructing
a "k-mer multi-distribution" rather than a single k-mer distribution. So support for constructing
a k-mer multi-distribution would be very useful – i.e. basically a table in which each of
N rows is a k-mer, and each of M columns is a sample or file. Values in the table are the
frequency, self-information or rank of a k-mer in a file / sample.
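The N x M table can be sketched as a dict of rows keyed by k-mer, with a zero for k-mers absent from a sample. Again a hypothetical illustration, with frequencies as the cell values:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Raw k-mer frequencies for one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def multi_distribution(samples, k):
    """Build the N x M table: rows are the union of k-mers seen in any
    sample, columns are the samples; cells are raw frequencies."""
    per_sample = {name: kmer_counts(seq, k) for name, seq in samples.items()}
    all_kmers = sorted(set().union(*per_sample.values()))
    return {kmer: {name: per_sample[name].get(kmer, 0)
                   for name in per_sample}
            for kmer in all_kmers}
```

Converting cells to self-information or ranks is then a per-column transformation of the same table.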
* I’ve found it useful to be able to easily incrementally add to multi-distributions, so I added support
for that (i.e. preparing an N x (M+1) table without having to recalculate everything from
scratch).
* I wanted a single-step analysis – e.g. I did not want to have to first index a file and then
run the k-mer construction off the index (as some k-mer analysis software requires).
* I’ve found it useful to be able to take a set of over-represented k-mers and go back in
and pull out seqs that exhibit one or more of those k-mers – e.g. this can yield
data features such as overlooked adapter fragments, repeats and low-complexity
AT-rich fragments that can be difficult to pick up by other means. (Over-represented
k-mers can often easily be picked up visually from “zipfian” plots – they show up as
steps in the plot. We did a poster on this at a conference last year (ATGA 2016
in Auckland).) It could be useful to add this kind of k-mer-based querying of sequence
data.
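Pulling out the sequences that exhibit one or more suspect k-mers is a straightforward set-membership filter. A minimal sketch of the idea (function name hypothetical):

```python
def sequences_with_kmers(seqs, suspect_kmers, k):
    """Return the sequences containing at least one of the
    over-represented k-mers (e.g. adapter fragments or low-complexity
    repeats spotted as steps on a 'zipfian' plot)."""
    suspects = set(suspect_kmers)
    hits = []
    for seq in seqs:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        if kmers & suspects:          # any suspect k-mer present?
            hits.append(seq)
    return hits
```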
Implementation Suggestions :
=======================
* I have found a multiprocessing approach useful to speed things up
(each process working on part of the file yielding a partial distribution,
then these are combined).
* To help support incremental additions to k-mer multi-distributions, I cache
distributions in pickle files.
* Constructing an N x M multi-distribution of N k-mer counts across M
files / samples is a special case of a more general multi-distribution
construction that is very common in bioinformatics – e.g. other
examples I’ve encountered recently:
- tabulate counts of blast top-hits to each of N taxa in each of M samples
- tabulate the percentage identity of the top hit of each of
N DNA oligos in each of M bacterial genomes
So I have structured my code as a kernel of methods and data structures
which build N x M multi-distributions of “anything”; clients then
implement code to (for example) parse some file and generate
data values. That data-generator method is passed to the kernel,
which calls it to generate the data streams it needs. I’ve found this
an improvement over my previous approach, which would have included
in the kernel quite a bit of code related to parsing data files.
It has made things more flexible, because odd file formats not previously
encountered can easily be handled (with a bit of client coding). For example,
tassel genotyping-by-sequencing (GBS) software generates “tag count files”,
which are binary summaries of the counts of tags ( = "very long k-mers"),
and I wanted to use these to generate multi-distributions of 6-mers.
I just needed to write a client function to unpack the tag-count files and
pass it to the multi-distribution kernel – no change needed to that code.
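The kernel / client split described above can be sketched like this: the kernel only ever sees an iterable of items per sample, and each file format gets its own small generator. A hypothetical illustration of the pattern, not the actual kernel:

```python
from collections import Counter

def build_multi_distribution(sample_names, value_generator):
    """Kernel: knows nothing about file formats. For each sample it
    calls the client-supplied generator, which yields the items to
    tally, and assembles the N x M table of counts."""
    per_sample = {name: Counter(value_generator(name))
                  for name in sample_names}
    rows = sorted(set().union(*per_sample.values()))
    return {row: {name: per_sample[name].get(row, 0)
                  for name in sample_names}
            for row in rows}

# Client side: one generator per data source. Here, 2-mers from
# in-memory sequences; a real client might instead unpack tassel
# tag-count files, or parse blast output, and yield taxa names.
SEQS = {"s1": "AAAT", "s2": "ATAT"}

def two_mers(name):
    seq = SEQS[name]
    for i in range(len(seq) - 1):
        yield seq[i:i + 2]
```

Supporting a new format then means writing one new generator; the kernel is untouched.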
From: Biopython [mailto:biopython-bounces+alan.mcculloch=agresearch.co.nz at mailman.open-bio.org] On Behalf Of Alexey Morozov
Sent: Tuesday, 7 February 2017 12:14 a.m.
To: biopython at biopython.org
Subject: [Biopython] Support for k-mer analyses?
Dear Biopython users,
I wanted to add support for k-mer analysis to Biopython, if it would be useful. I was thinking about some basic analyses, i.e. constructing k-mer distributions from amino acid and nucleotide sequences (or sequence sets), a few distance metrics between them, classifying query sequences using these distributions, and such.
Does anyone actually need it? If you think you'll benefit from having such methods available from Biopython (and not a separate lib), please drop a few lines: what kind of analysis do you need, what sorts of file formats would you like Biopython to support and so on.
--
Alexey Morozov,
LIN SB RAS, bioinformatics group.
Irkutsk, Russia.