[Biopython] Cluster Blast's most frequent Alignments

Fri Sep 18 19:50:55 UTC 2015

Hello all,

I am new to Bioinformatics, so excuse me if I have got this all wrong.

I am aligning multiple sequences (ESTs) to a genome (a FASTA file of scaffolds) using Biopython's NcbiblastnCommandline wrapper, and for my project I need to cluster the overlapping alignments in order to locate highly expressed genes. I was surprised not to find any articles online describing a standard (formalised) methodology for this step.
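
To make "cluster the overlapping alignments" concrete, here is a minimal sketch of what I have in mind. It assumes each alignment has already been reduced to a (scaffold id, start, end) tuple; the function name and the example coordinates are made up for illustration and are not part of Biopython.

```python
from collections import defaultdict


def cluster_overlapping(hits):
    """Merge overlapping hits per scaffold.

    hits: iterable of (scaffold_id, start, end) tuples (hypothetical input shape).
    Returns {scaffold_id: [(cluster_start, cluster_end, n_hits), ...]}.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in hits:
        # Minus-strand hits are reported with start > end, so normalise them.
        lo, hi = sorted((start, end))
        by_scaffold[scaffold].append((lo, hi))

    clusters = {}
    for scaffold, intervals in by_scaffold.items():
        intervals.sort()
        merged = []
        for lo, hi in intervals:
            if merged and lo <= merged[-1][1]:
                # Overlaps the current cluster: extend it and count the hit.
                prev_lo, prev_hi, count = merged[-1]
                merged[-1] = (prev_lo, max(prev_hi, hi), count + 1)
            else:
                # No overlap: start a new cluster.
                merged.append((lo, hi, 1))
        clusters[scaffold] = merged
    return clusters


# Invented coordinates, only to show the behaviour.
example_hits = [
    ("scaffold_1", 100, 400),
    ("scaffold_1", 350, 700),
    ("scaffold_1", 900, 650),   # minus-strand hit
    ("scaffold_1", 5000, 5300),
    ("scaffold_2", 10, 200),
]
clusters = cluster_overlapping(example_hits)
```

Clusters with a high hit count would then be the candidate highly expressed regions. Whether simple coordinate overlap is the right criterion (as opposed to, say, a minimum overlap length) is exactly the kind of thing I am unsure about.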

Well, one can easily locate the scaffolds that appear in multiple alignments using Biopython's parsers and simply carry on processing the data.
What I am wondering is whether it would be worthwhile to add this process to Biopython, for example as a method in the BlastIO package.
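
As a rough illustration of the "locate the scaffolds" step, the sketch below counts how many distinct ESTs hit each scaffold. To keep it self-contained it reads BLAST tabular output (-outfmt 6, where the first two columns are the query and subject ids) with the standard library only; the same loop could presumably be written over the results returned by Bio.SearchIO.parse(handle, "blast-tab"). The function name and the example rows are invented.

```python
import csv
import io
from collections import defaultdict


def queries_per_scaffold(handle):
    """Count distinct query ids (ESTs) per subject id (scaffold).

    handle: file-like object with BLAST tabular output (-outfmt 6).
    """
    seen = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, subject_id = row[0], row[1]
        # A set ensures several HSPs from one EST are counted once.
        seen[subject_id].add(query_id)
    return {scaffold: len(queries) for scaffold, queries in seen.items()}


# Invented rows in the default 12-column layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
example = io.StringIO(
    "EST_1\tscaffold_7\t98.5\t300\t4\t0\t1\t300\t1200\t1499\t1e-150\t540\n"
    "EST_2\tscaffold_7\t97.0\t250\t7\t0\t1\t250\t1300\t1549\t1e-120\t430\n"
    "EST_2\tscaffold_7\t95.0\t100\t5\t0\t260\t359\t1600\t1699\t1e-40\t160\n"
    "EST_3\tscaffold_2\t99.0\t200\t2\t0\t1\t200\t50\t249\t1e-100\t360\n"
)
counts = queries_per_scaffold(example)
# Most frequently hit scaffolds first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

Counting distinct ESTs rather than raw alignments is a design choice on my part, so that one EST producing several HSPs on the same scaffold does not inflate the count.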

If so, then we should decide on the output format of the new file/info produced by this process. For example, one idea (?) would be to gather all the alignments in one place, discarding the source sequences (queries), and to highlight the most expressed scaffolds in some way, e.g. by introducing a new score index.
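
Just to give the discussion something to react to, one possible shape for such an output is a small tab-separated summary with one line per scaffold. The "support" column below is only a placeholder for the new score index (here simply the fraction of all ESTs that hit the scaffold); the column names, the function and the numbers are all hypothetical.

```python
def summarize(est_counts, total_ests):
    """Build a tab-separated summary, most frequently hit scaffolds first.

    est_counts: {scaffold_id: number of distinct ESTs aligned to it}
    total_ests: total number of ESTs that were searched
    """
    lines = ["scaffold\test_count\tsupport"]
    # Sort by descending count, then by name so the order is stable.
    for scaffold, n in sorted(est_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{scaffold}\t{n}\t{n / total_ests:.3f}")
    return "\n".join(lines)


# Invented counts, only to show the layout.
report = summarize({"scaffold_7": 2, "scaffold_2": 1}, total_ests=4)
print(report)
```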

Any thoughts on this?

Thank you for your time,
Stelios