[Bioperl-l] not all sequence is created equal (base quality data)

Malcolm Cook mcook@dna.com
Wed, 27 Jun 2001 14:34:00 -0700


Jason,

Perhaps another way to go with this:
- an additional abstract method on Seq which took a location and returned a
'quality' 

By the way, what would a 'quality' be: a new object? A single number? A
z-score, a p-value? I've seen: real [0,1], int [1,10].
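
To make the suggestion concrete, here is a minimal sketch in Python (not
BioPerl code) of a sequence type with an abstract quality method that takes
a location. The names SeqWithQuality, PerBaseQualitySeq and quality, the
1-based inclusive coordinates, and the choice of one real value in [0,1]
per position are all assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from typing import List


class SeqWithQuality(ABC):
    """Hypothetical interface: a sequence that reports quality for a location."""

    @abstractmethod
    def quality(self, start: int, end: int) -> List[float]:
        """Return one quality value per position in start..end (1-based, inclusive)."""


class PerBaseQualitySeq(SeqWithQuality):
    """Hypothetical implementation storing one score in [0, 1] per residue."""

    def __init__(self, seq: str, scores: List[float]) -> None:
        if len(seq) != len(scores):
            raise ValueError("need exactly one score per residue")
        if any(not 0.0 <= s <= 1.0 for s in scores):
            raise ValueError("scores must lie in [0, 1]")
        self.seq = seq
        self.scores = list(scores)

    def quality(self, start: int, end: int) -> List[float]:
        # Reject locations that fall outside the sequence.
        if not 1 <= start <= end <= len(self.seq):
            raise IndexError("location outside the sequence")
        return self.scores[start - 1:end]


s = PerBaseQualitySeq("ACGT", [1.0, 0.9, 0.5, 0.2])
print(s.quality(2, 3))
```

Whether the method should return one value per position, as here, or a
single summary for the whole location is exactly the open question above.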

Regards

>-----Original Message-----
>From: Jason Stajich [mailto:jason@chg.mc.duke.edu]
>Sent: Wednesday, June 27, 2001 1:05 PM
>To: Bioperl
>Subject: [Bioperl-l] not all sequence is created equal (base quality
>data)
>
>
>It would obviously be of interest to our friends doing sequencing, as
>well as our friends doing prediction and other analysis who want to
>weight low-quality sequence less, if we could incorporate base quality
>information into the idea of Sequence somehow.
>
>Could we architect a design to handle this and have quality values paired
>with bases?
>
>I can imagine a couple of ways to do this:
> - an additional data field in PrimarySeq object, 
> - a parallel Seq::Quality object paired with a PrimarySeq object
> - a SeqFeature which spanned the entire sequence and had the
>   primary tag 'quality' and a value of the sequence quality.
>
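
As a rough illustration of the second option in the list above (a parallel
quality object paired with a sequence object), here is a Python sketch. The
classes PrimarySeq, SeqQuality and QualifiedSeq are hypothetical stand-ins
written for this example, not the BioPerl classes, and the integer scores
are made-up sample values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrimarySeq:
    """Stand-in for a plain sequence object (hypothetical)."""
    seq: str


@dataclass
class SeqQuality:
    """Per-position quality values kept separately from the sequence."""
    values: List[int]


@dataclass
class QualifiedSeq:
    """Pairs a sequence with its quality object and keeps the two aligned."""
    primary: PrimarySeq
    quality: SeqQuality

    def __post_init__(self) -> None:
        # The pairing only makes sense with one value per residue.
        if len(self.primary.seq) != len(self.quality.values):
            raise ValueError("sequence and quality lengths differ")

    def trunc(self, start: int, end: int) -> "QualifiedSeq":
        """Return positions start..end (1-based, inclusive) of both halves."""
        return QualifiedSeq(
            PrimarySeq(self.primary.seq[start - 1:end]),
            SeqQuality(self.quality.values[start - 1:end]),
        )


pair = QualifiedSeq(PrimarySeq("ACGTA"), SeqQuality([40, 38, 12, 9, 30]))
sub = pair.trunc(2, 4)
print(sub.primary.seq, sub.quality.values)
```

The cost of this design is that every operation that changes the sequence
has to be mirrored on the quality object, as trunc does here.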
>None of these seem particularly elegant, but the BaseWithQualityScores
>object (biojava) is not going to work very well in perl either.
>
>Anyone have ideas on this? This is something I think would be worth
>considering as a project for 1.0 if anyone else agrees.
>
>This came up because I started playing with pir data, and we can easily
>make it work except for the fact that some PIR files have quality
>information about their bases embedded in the sequence (probably not the
>best way to do this...)
>
>>P1;CCDG
>cytochrome c - dog (tentative sequence)
>GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKN
>KGITWGEETLMEYLENP
>KKYIPGTKMIFAGIKKTGERADLIAYLKKATKE*
>
>Looking at their coding table (+), this is oh so much fun to try and code
>for... I can at least strip out this quality data for now to allow us to
>read in pir files, but it would be very interesting if we COULD integrate
>quality data into the sequence object, for example if we wanted to be able
>to read in the sequence read quality values.
>
>
>(+) 
>Table II: Punctuation Description in Protein Sequences
>
>XX   Two adjacent amino acids, with no punctuation between, indicates
>       that they are connected, as determined experimentally.
>()   Encloses a region, the composition but not the complete sequence
>       of which has been determined experimentally, or encloses a
>       single residue that has been tentatively identified.
> =   Indicates ")(", the juxtaposition of two regions of indeterminate
>       sequence, while preserving proper spacing between amino acids.
> /   Indicates that the adjacent amino acids are from different
>       peptides, not necessarily connected. When the amino end of a
>       protein has not been determined, "/" precedes the first residue.
>       When the carboxyl end has not been determined, "/" follows the
>       last residue. When ")/", "/(", or ")/(" are needed, only "/" is
>       used.
> .  Outside of parentheses, indicates the ends of sequence fragments.
>       The relative order of these fragments was not determined
>       experimentally but is clear from homology or other indirect
>       evidence.
> .  Within parentheses, indicates that the amino acid to the left
>       has been placed with at least 90% confidence by homology with
>       known sequences.
> ,  Indicates that the amino acid to its left could not be
>       positioned with confidence by homology.
>
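
The stripping step mentioned above, applied to the punctuation in Table II,
could be sketched in Python roughly as follows. This is an illustration
only, not the BioPerl PIR parser: the function name parse_pir_residues and
the tag names ("experimental", "region", "homology", "unpositioned") are
invented here, and "=", "/", "*" and "." outside parentheses are simply
dropped rather than interpreted.

```python
from typing import List, Tuple


def parse_pir_residues(raw: str) -> Tuple[str, List[str]]:
    """Strip PIR punctuation; return the plain sequence and one tag per residue."""
    residues: List[str] = []
    tags: List[str] = []
    in_parens = False
    for ch in raw:
        if ch.isalpha():
            residues.append(ch.upper())
            # Residues inside () belong to a region of uncertain order.
            tags.append("region" if in_parens else "experimental")
        elif ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False
        elif ch == "." and in_parens and tags:
            # Residue to the left was placed by homology (at least 90% confidence).
            tags[-1] = "homology"
        elif ch == "," and tags:
            # Residue to the left could not be positioned with confidence.
            tags[-1] = "unpositioned"
        # Everything else ('*', '=', '/', whitespace, '.' outside parentheses)
        # is dropped in this sketch.
    return "".join(residues), tags


seq, tags = parse_pir_residues("GDVEKGKKIFVQK(C.A.Q.C.H.T.V.E)KGG*")
print(seq)
print(tags[12:22])
```

The per-residue tags returned here are one possible shape for the quality
data that a sequence object would then need to carry.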
>
>
>Jason Stajich
>jason@chg.mc.duke.edu
>Center for Human Genetics
>Duke University Medical Center 
>http://www.chg.duke.edu/ 
>