[Bioperl-l] Assembly package and phredPhrap tools

Robson Francisco de Souza rfsouza@citri.iq.usp.br
Mon, 5 Nov 2001 13:06:13 -0200 (BRST)


	Hi!

	Hello everyone, I have just subscribed to this mailing list and I
would like to ask a few questions and share some thoughts.
	I have been working on a perl module that loads information from
phrap .ace files, phd files and other data produced by the phredPhrap
tools. Although I haven't coded it following bioperl's programming model,
nor used bioperl's objects in its implementation, I would like to move my
module to bioperl's approach. I was actually thinking of merging my code
with Chad's code, but I believe that is going to be hard, so I would like
to hear from you (all of you, and especially Chad) first.
	In my implementation, the .ace file information describing an
assembly is represented as a tree-like data structure:

 (PACKAGE Assembly):
        assembly (HASH reference):
                files (HASH reference):
                        ace_file (SCALAR = ALL.fasta.screen.ace.1)
                number_of_contigs (SCALAR = 521)
                total_number_of_reads (SCALAR = 188362)
        contigs (ARRAY reference):
                0 (SCALAR = )
                1 (HASH reference):
                        consensus (SCALAR = aggggcnnnctattatcgatccctctgtaaacacxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                        length (SCALAR = 804)
                        number_of_reads (SCALAR = 1)
                        number_of_segments (SCALAR = 1)
                        orientation (SCALAR = U)
                        quality (SCALAR =  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                        reads (HASH reference):
                                A0QR5701B11.b (HASH reference):
                                        align_clipping_end (SCALAR = 804)
                                        align_clipping_start (SCALAR = 741)
                                        end (SCALAR = 804)
                                        length (SCALAR = 804)
                                        number_of_read_info_items (SCALAR = 0)
                                        number_of_tags (SCALAR = 1)
                                        orientation (SCALAR = U)
                                        padded_end (SCALAR = 804)
                                        padded_start (SCALAR = 1)
                                        qual_clipping_end (SCALAR = -1)
                                        qual_clipping_start (SCALAR = -1)
                                        sequence (SCALAR = aggggcnnnctattatcgatccctctgtaaacacxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                                        start (SCALAR = 1)
                2 (HASH reference):
                        consensus (SCALAR = gcggggtattatgatxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxttgtgggttcttggtcagctcgct
                        length (SCALAR = 932)
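
 To make the layout above concrete, here is a minimal Perl sketch of how
such a tree might be held in memory and walked. It is not the actual module
code: the variable $data, the helper read_summary, the shortened sequence
and quality strings, and the use of undef for the empty slot 0 are all
assumptions made for illustration. Only the key names come from the dump.

```perl
use strict;
use warnings;

# Hypothetical in-memory layout mirroring the dump above (values shortened).
my $data = {
    assembly => {
        files                 => { ace_file => 'ALL.fasta.screen.ace.1' },
        number_of_contigs     => 521,
        total_number_of_reads => 188362,
    },
    contigs => [
        undef,    # slot 0 is empty in the dump; assumed here to be undef
        {
            consensus       => 'aggggcnnnctattatcgatccctctgtaaacac',
            length          => 804,
            number_of_reads => 1,
            orientation     => 'U',
            quality         => ' 0 0 0 0 0 0 0 0',    # one string, not a list
            reads           => {
                'A0QR5701B11.b' => {
                    start        => 1,
                    end          => 804,
                    padded_start => 1,
                    padded_end   => 804,
                    orientation  => 'U',
                },
            },
        },
    ],
};

# Walk the tree and return one "contig read start-end" line per read.
sub read_summary {
    my ($tree) = @_;
    my @lines;
    my $contigs = $tree->{contigs};
    for my $i (1 .. $#{$contigs}) {
        my $contig = $contigs->[$i] or next;    # skip empty slots
        for my $name (sort keys %{ $contig->{reads} }) {
            my $read = $contig->{reads}{$name};
            push @lines, join(' ', $i, $name, "$read->{start}-$read->{end}");
        }
    }
    return @lines;
}

# Quality values have to be split before they can be used as numbers.
my @quality = split ' ', $data->{contigs}[1]{quality};

print "$_\n" for read_summary($data);
```

 Because sequences and qualities are plain strings, every consumer has to
split or substr them itself; that is the part an object-based
representation would be expected to take over.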

 As you can see, all sequences are stored as strings (the same is true for
quality values). Now, I was thinking, if I change this representation to
bioperl objects, what would it look like? Or, more generally, what is the
best way to represent DNA sequence assembly data in the bioperl
framework? I thought that storing contigs as UnivAln objects and contig
data in tables might be a good idea...
	Anyway, I would like to know what you guys are doing on this
subject. I can send you my code anytime so that you can see what I have
done and how it could help. Most importantly, some of the methods I
implemented in my module reproduce consed functions, like finding LCQs or
single-stranded regions, and they could be used by an assembly module.
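
	As an illustration of the kind of method meant here, the sketch
below scans a space-separated quality string and reports runs of
low-quality positions. It is only a guess at the general idea, not the
module's code and not consed's algorithm: the function name
low_quality_regions, the threshold of 20 and the minimum run length of 2
are all hypothetical, and it assumes that "LCQ" refers to stretches of low
consensus quality.

```perl
use strict;
use warnings;

# Return [start, end] pairs (1-based, inclusive) for runs of positions whose
# quality is below $threshold and that are at least $min_length bases long.
sub low_quality_regions {
    my ($quality_string, $threshold, $min_length) = @_;
    my @quality = split ' ', $quality_string;
    my @regions;
    my $run_start;    # 0-based index where the current low-quality run began

    for my $i (0 .. $#quality) {
        if ($quality[$i] < $threshold) {
            $run_start = $i unless defined($run_start);
            next;
        }
        # A good base ends any run that is in progress.
        if (defined($run_start)) {
            push @regions, [ $run_start + 1, $i ]
                if $i - $run_start >= $min_length;
            undef $run_start;
        }
    }

    # Close a run that reaches the end of the sequence.
    if (defined($run_start) && @quality - $run_start >= $min_length) {
        push @regions, [ $run_start + 1, scalar @quality ];
    }
    return @regions;
}

my @regions = low_quality_regions('30 30 5 4 3 40 40 10 12', 20, 2);
print join(', ', map { join('-', @$_) } @regions), "\n";    # prints "3-5, 8-9"
```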
	Well, hope this may start a discussion :).
	Best regards,
			Robson