[Biopython] next-gen sequencing software

Fri Jul 24 08:53:15 UTC 2009

Hi:

We have been writing some code that we think could be of interest to the 
Biopython community. Right now we're mainly interested in the new sequencing 
technologies, especially in:
	- cleaning of the raw reads provided by the sequencers.
	- parsing of the assembler results (ace, caf and bowtie map files).
	- SNP detection and mining.
	- sequence annotation.
We're writing some software to deal with those problems. The software is not 
finished yet, but it is starting to be useful. Everything is written in Python. 
We have used Biopython for some things, but for others we have taken a 
slightly different approach. If the Biopython developers think that some of 
our ideas could be of use, we would be willing to incorporate them into 
Biopython.
If you want to take a look just go to:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/

We have recently finished the cleaning infrastructure. We don't yet have 
pipelines defined for all the new sequencing technologies, but we have created 
a pipeline system that is very easy to modify. With about a dozen lines of 
code, a new pipeline suited to a new sequencing technology can be created. 
There's also a script that runs those pipelines (run_cleannig_pipeline.py).
We have also created a set of scripts that generate statistics to help 
evaluate the quality of the cleaning process.
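
To give a rough idea of what "a pipeline in a dozen lines" could look like, 
here is a minimal sketch. It is not the actual biolib code: the Read type, the 
three step functions, the PIPELINES dictionary and run_pipeline are all 
hypothetical names chosen for illustration, and the thresholds are arbitrary. 
The only idea taken from the description above is that a pipeline is an 
ordered list of cleaning steps that is easy to rearrange.

```python
from collections import namedtuple

# Hypothetical minimal read type; the real code presumably uses richer objects.
Read = namedtuple("Read", ["name", "seq", "qual"])

def strip_adaptor(read, adaptor="ACGT"):
    """Remove a known adaptor prefix, if present (illustrative step)."""
    if read.seq.startswith(adaptor):
        n = len(adaptor)
        return Read(read.name, read.seq[n:], read.qual[n:])
    return read

def trim_low_quality(read, min_qual=20):
    """Trim bases from the 3' end until one reaches min_qual."""
    end = len(read.seq)
    while end > 0 and read.qual[end - 1] < min_qual:
        end -= 1
    return Read(read.name, read.seq[:end], read.qual[:end])

def drop_short(read, min_len=4):
    """Return None for reads shorter than min_len, which discards them."""
    return read if len(read.seq) >= min_len else None

# A pipeline is just an ordered list of steps; supporting a new sequencing
# technology would mean adding one more entry here.
PIPELINES = {
    "example_tech": [strip_adaptor, trim_low_quality, drop_short],
}

def run_pipeline(reads, steps):
    """Lazily apply each step to each read, skipping discarded reads."""
    for read in reads:
        for step in steps:
            read = step(read)
            if read is None:
                break
        else:
            yield read

reads = [
    Read("r1", "ACGTTTGGCCAA", [30] * 10 + [5, 5]),
    Read("r2", "ACGTTT", [30, 30, 30, 30, 10, 10]),
]
cleaned = list(run_pipeline(reads, PIPELINES["example_tech"]))
```

In this toy run, r1 loses its adaptor and its two low-quality trailing bases, 
while r2 ends up empty after trimming and is dropped. Because run_pipeline is 
a generator, reads are processed one at a time rather than loaded all at once.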

Regarding the SNPs, we can get them from ace and caf files, and we're finishing 
the parsing of the bowtie map files. All these files are transformed into an 
iterator of contig objects. There is also functionality to get SNPs and 
statistics from these contig objects.
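
The "iterator of contig objects" design can be sketched as follows. This is 
only an illustration under stated assumptions: the Contig class, 
contigs_from_records (a stand-in for a real ace/caf/bowtie parser) and 
find_snps are hypothetical, and the SNP criterion used here (at least two 
alleles, each supported by a minimum number of reads) is a simple placeholder, 
not necessarily the one used in biolib.

```python
from collections import Counter

class Contig:
    """Hypothetical contig: aligned reads stored as (start_offset, sequence)."""
    def __init__(self, name, reads):
        self.name = name
        self.reads = reads

    def columns(self):
        """Yield (position, Counter of alleles) for each alignment column."""
        length = max(start + len(seq) for start, seq in self.reads)
        for pos in range(length):
            alleles = Counter(
                seq[pos - start]
                for start, seq in self.reads
                if start <= pos < start + len(seq)
            )
            yield pos, alleles

def contigs_from_records(records):
    """Stand-in for a file parser: lazily turn (name, reads) into Contigs."""
    for name, reads in records:
        yield Contig(name, reads)

def find_snps(contig, min_reads_per_allele=2):
    """Report columns where at least two alleles are each seen often enough."""
    for pos, alleles in contig.columns():
        supported = {a: n for a, n in alleles.items()
                     if n >= min_reads_per_allele}
        if len(supported) >= 2:
            yield pos, supported

records = [
    ("contig1", [(0, "ACGTAC"), (0, "ACGTAC"), (1, "CTTAC"), (1, "CTTAC")]),
]
snps = [(contig.name, pos, alleles)
        for contig in contigs_from_records(records)
        for pos, alleles in find_snps(contig)]
```

The point of the design is that find_snps only depends on the contig 
interface, so the same SNP code works regardless of which file format the 
contigs came from.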

We would welcome any comments, suggestions or criticism.
Best regards,

-- 
Jose M. Blanca
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

P.S. We're using this functionality on a computer cluster, so everything is 
parallelized.