[Biopython-dev] interest module for sequence based clustering tools?

Krishna Roskin krishnaroskin at gmail.com
Tue Apr 1 06:41:04 UTC 2014


Hey all,

Long time fan, taking my first crack at contributing.

I've built a basic module to run and parse the results of sequence-based
clustering tools such as DNACLUST and CD-HIT. I've written subclasses of
AbstractCommandline to run dnaclust and cd-hit. I've also written classes
to store the clusters and their members and loaders for the output formats
used by those programs.
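
For anyone who hasn't looked at a CD-HIT .clstr file, here is a rough, standalone sketch of the kind of parsing involved. It is not the loader from my branch: the parse_clstr function and the sample data are made up for illustration, and it assumes the usual layout of ">Cluster N" header lines followed by member lines, with "*" marking the representative.

```python
import io
import re

# Matches a member line such as "0\t150nt, >seq1... *"
# or "1\t148nt, >seq2... at +/98.65%".
MEMBER_RE = re.compile(r"^\d+\s+\d+(?:aa|nt),\s+>(.+?)\.\.\.\s+(.*)$")


def parse_clstr(handle):
    """Yield (cluster_name, representative_id, member_ids) tuples.

    Hypothetical helper for illustration only.
    """
    name, rep, members = None, None, []
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous cluster, if any.
            if name is not None:
                yield name, rep, members
            name, rep, members = line[1:], None, []
            continue
        match = MEMBER_RE.match(line)
        if match is None:
            raise ValueError("Unrecognized .clstr line: %r" % line)
        seq_id, status = match.groups()
        members.append(seq_id)
        if status == "*":
            rep = seq_id
    # Emit the final cluster once the input is exhausted.
    if name is not None:
        yield name, rep, members


# Made-up sample data, not output from a real run.
example = io.StringIO(
    u">Cluster 0\n"
    u"0\t150nt, >seq1... *\n"
    u"1\t148nt, >seq2... at +/98.65%\n"
    u">Cluster 1\n"
    u"0\t120nt, >seq3... *\n"
)
clusters = list(parse_clstr(example))
for name, rep, members in clusters:
    print(name, rep, members)
```

The classes in the branch wrap this sort of information in cluster and member objects rather than plain tuples.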

I'm posting here to gauge interest, get some feedback, and maybe find some
beta testers.

My code is available at:

https://github.com/krishnaroskin/biopython.git

under the seqcluster branch. I've started writing some test code at:

Tests/seqcluster/test_seqcluster.py

that also serves as example code. I've pasted that at the end of this
message so people can get an idea of how it works without having to
check out the code.

If there is interest, I'm planning on adding a seqclust.cluster function
that takes a list of SeqRecords and returns a clustering using one of the
supported tools. I envision that function being the main interface to this
module. I also want to write something that will map the cluster
membership (given by ids) back to collections of SeqRecords.
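
To make the id-to-record mapping idea concrete, here is a minimal sketch under some assumptions: clusters are given as a dict of cluster name to member ids, record ids are unique, and a namedtuple stands in for SeqRecord so the snippet runs without Biopython. The map_clusters_to_records name is hypothetical and nothing like it exists in the branch yet.

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord; only the .id attribute is used here.
Record = namedtuple("Record", ["id", "seq"])


def map_clusters_to_records(clusters, records):
    """Return {cluster_name: [record, ...]} from {cluster_name: [id, ...]}.

    Hypothetical helper for illustration only.
    """
    # Index the records by id, refusing ambiguous input.
    by_id = {}
    for record in records:
        if record.id in by_id:
            raise ValueError("Duplicate record id: %s" % record.id)
        by_id[record.id] = record

    mapped = {}
    for name, member_ids in clusters.items():
        missing = [i for i in member_ids if i not in by_id]
        if missing:
            raise KeyError("No record for id(s): %s" % ", ".join(missing))
        # Preserve the member order reported by the clustering tool.
        mapped[name] = [by_id[i] for i in member_ids]
    return mapped


# Made-up example data.
records = [
    Record("seq1", "ACGTACGT"),
    Record("seq2", "ACGTACGA"),
    Record("seq3", "TTTTGGGG"),
]
clusters = {"Cluster 0": ["seq1", "seq2"], "Cluster 1": ["seq3"]}
mapped = map_clusters_to_records(clusters, records)
for name in sorted(mapped):
    print(name, [r.id for r in mapped[name]])
```

Failing loudly on duplicate or missing ids seems safer than silently dropping members, but that is a design choice I'd like feedback on.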

Other to-dos:

Test that all the flavors of CD-HIT work (there are many)
Add support for other sequence-based clustering tools (suggestions?)
Documentation
Tutorial
Test code

-krish

#!/usr/bin/env python

from __future__ import print_function

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

from Bio.seqcluster import CDHITClustIterator, DNAClustIterator
from Bio.seqcluster.applications import CDHITCommandline, DNAClustCommandline

# Run dnaclust; it writes its clusters to stdout.
cmd = DNAClustCommandline(similarity=0.8, header=True, threads=2,
                          inputfile="test_sequences.fasta")
stdout, stderr = cmd()

# Parse the captured output and mark each cluster's representative with "*".
clusters = DNAClustIterator(StringIO(stdout))
for cluster in clusters:
    print(cluster.name)
    for member in cluster:
        if member == cluster.representative:
            print("\t" + member.name + "*")
        else:
            print("\t" + member.name)

print()     # blank line

# Run cd-hit; it writes its clusters to <outputfile>.clstr.
cmd = CDHITCommandline(cutoff=0.8, threads=2,
                       inputfile="test_sequences.fasta", outputfile="tmp")
stdout, stderr = cmd()

# Parse the .clstr file, closing the handle when done.
with open("tmp.clstr", "r") as handle:
    clusters = CDHITClustIterator(handle)
    for cluster in clusters:
        print(cluster.name)
        for member in cluster:
            if member == cluster.representative:
                print("\t" + member.name + "*")
            else:
                print("\t" + member.name)


