[Bioperl-l] alignment and assembly

Robson Francisco de Souza rfsouza@citri.iq.usp.br
Tue, 19 Feb 2002 17:13:16 -0200 (BRST)


	Hi everyone,

	Concerning Jason's message on extending Bio::SimpleAlign to handle
assemblies...
	I have not yet started coding bioperl modules, but I'm very
interested in designing and writing an assembly object/interface. The
problem is that I don't know exactly where to start. I started reading
bioperl's tutorial and biodesign documentation yesterday, but I'm afraid I
do not know enough about Perl's OOP, although I've written a few modules of
my own (without inheritance, which I do not fully understand yet :/).
	Anyway, as I wrote a few months ago to the list and to Chad
Matsalla, I implemented a module that loads phrap output (both ACE and
phrap.out files) into a hierarchical hash structure inside a separate
module and namespace (which I called Assembly). Every time a user wants to
access phrap data, they create an Assembly object, load the file and
access the data through the interface I defined (which, by the way, is
awful).
	Now, where do you guys think I should start? One concern I have
is how to store assembly data (which is often quite large) in the
module's memory. I'm not sure a hierarchy like this one, which I used in
my module, is adequate:

	assembly
	  -> assembly data (# of contigs, etc)
	  -> clone data (inferred from read locations and name or loaded
	     from phrap.out)
	  -> contig
		-> contig data (sequence, quality, # of reads, etc)
		-> read or sequence
			-> read data (sequence, quality, etc)

I doubt this is the most appropriate data structure, because a user will
often ask which contig was assembled with read XXX or, conversely, which
reads or sequences fall between positions S and E in contig M. There is
also the problem of where to store the large number of features an assembly
may have (lists of poor quality regions, or a scaffold describing how
different contigs must be positioned relative to one another to build a
larger contig). Maybe I should split such data among several classes,
like Bio::Assembly::Assembly (to hold assembly data),
Bio::Assembly::Contig and Bio::Assembly::Analysis (mainly methods used to
get information out of a loaded assembly, like Consed's list of low
quality regions or high quality discrepancies).
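
	To make the two lookups above concrete, here is a minimal sketch of
the hierarchy with a reverse index added. It is written in Python only for
brevity; the same shape maps onto nested Perl hashes or objects. All names
here (Read, Contig, Assembly, add_read, contig_of, reads_between) are
hypothetical and are not part of bioperl or of my existing module, and
read coordinates are assumed to be 1-based, inclusive contig positions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Read:
    name: str
    start: int          # first contig position covered (assumed 1-based, inclusive)
    end: int            # last contig position covered (inclusive)
    sequence: str = ""


@dataclass
class Contig:
    name: str
    sequence: str = ""
    reads: Dict[str, Read] = field(default_factory=dict)

    def reads_between(self, start: int, end: int) -> List[str]:
        """Names of reads overlapping [start, end], found by a linear scan."""
        return sorted(r.name for r in self.reads.values()
                      if r.start <= end and r.end >= start)


@dataclass
class Assembly:
    contigs: Dict[str, Contig] = field(default_factory=dict)
    # Reverse index: read name -> contig name, kept in sync by add_read().
    _read_to_contig: Dict[str, str] = field(default_factory=dict)

    def add_read(self, contig_name: str, read: Read) -> None:
        """Store the read under its contig and record it in the reverse index."""
        contig = self.contigs.setdefault(contig_name, Contig(contig_name))
        contig.reads[read.name] = read
        self._read_to_contig[read.name] = contig_name

    def contig_of(self, read_name: str) -> Optional[str]:
        """Name of the contig a read was assembled into, or None if unknown."""
        return self._read_to_contig.get(read_name)
```

	With this layout, "which contig holds read XXX" is a single hash
lookup, while "which reads fall between S and E in contig M" scans only
the reads of that contig. For very deep contigs a sorted list or an
interval tree might be a better choice than the linear scan shown here.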
	Another concern I have, since I only have experience with the
phred/phrap/consed package, is how to keep an Assembly object
(whether built on top of Bio::SimpleAlign or not) free from the
particularities of the phrap program. Does anyone know of a general
implementation of sequence assembly objects, independent of the
assembly program? In biojava maybe?
	Anyway, I'm starting to design my implementation and would
appreciate any help, comments or suggestions from you.
	Best regards,
				Robson

> See the Bio::AlignIO for how to read in alignments from files.  We don't
> interface with phrap or the tigr assembler at this point.  Happy to see
> someone design the appropriate objects that extend the Bio::SimpleAlign
> object (via the Bio::Align::AlignI interface) to handle assemblies.  We're
> happy to help with the object design if you lay out your plan.
> 
> -- 
> Jason Stajich
> Duke University
> jason@cgt.mc.duke.edu