[DAS2] Example alignments

Lincoln Stein lstein at cshl.edu
Mon Jun 5 10:31:50 EDT 2006


Hi Andrew,

I'm truly sorry about how long it has taken me to get these examples to you. I 
hope that the example alignments in the enclosure make sense to you.

Unfortunately I found that I had to add a new "target" attribute to <LOC> in 
order to make the cigar string semantics unambiguous. Otherwise you wouldn't 
be able to tell how to interpret the gaps.

Lincoln

-- 
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
-------------- next part --------------
CASE #1. A SIMPLE PAIRWISE ALIGNMENT.

A simple alignment is one in which the alignment is represented as a
single feature with no subfeatures. This is the preferred
representation to be used when the entire alignment shares the same
set of properties.

This is an alignment between Chr4 (the reference) and EST23 (the
target). Both aligned sequences are in the forward (+) direction. We
represent this as a single alignment:

Chr4       100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147
               |||||||X||| ||||| |||||||       ||||X||| ||||||||
EST23        1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA  41

This has a CIGAR gap string of M11 I1 M5 D1 M7 D7 M8 I1 M8:

     M11  match 11 bp
     I1   insert 1 gap into the reference sequence
     M5   match 5 bp
     D1   insert 1 gap into the target sequence
     M7   match 7 bp
     D7   insert 7 gaps into the target
     M8   match 8 bp
     I1   insert 1 gap into the reference
     M8   match 8 bp
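
The gap codes listed above can be derived mechanically from the two
gapped rows of the alignment. The Python sketch below is only an
illustration: the function name cigar_from_rows is made up for this
example, and it assumes, as the listing implies, that M covers every
aligned column (including the mismatches marked X), that I marks a gap
in the reference row, and that D marks a gap in the target row.

```python
from itertools import groupby


def cigar_from_rows(ref_row: str, target_row: str) -> str:
    """Build a gap string from two equal-length gapped alignment rows."""
    if len(ref_row) != len(target_row):
        raise ValueError("alignment rows must have equal length")

    def op(ref_char: str, target_char: str) -> str:
        if ref_char == "-" and target_char == "-":
            raise ValueError("column with a gap in both rows")
        if ref_char == "-":
            return "I"  # gap inserted into the reference
        if target_char == "-":
            return "D"  # gap inserted into the target
        return "M"      # aligned column, match or mismatch

    ops = (op(r, t) for r, t in zip(ref_row, target_row))
    # Collapse runs of the same code into "<code><run length>" tokens.
    return " ".join(f"{code}{len(list(run))}" for code, run in groupby(ops))


ref = "CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA"
est = "CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA"
print(cigar_from_rows(ref, est))  # M11 I1 M5 D1 M7 D7 M8 I1 M8
```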

Content-Type: application/x-das-features+xml

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">

<FEATURE uri="./Alignment1" type="./expressed_sequence_match" >
  <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
       range="100:147:1">
   </LOC>
   <LOC
       segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
       target="1"
       range="1:41:1"
       gap="M11 I1 M5 D1 M7 D7 M8 I1 M8">
    </LOC>
    <PROP key="est2genomescore" value="180" />
</FEATURE>
    
</FEATURES>

NOTE: I've had to introduce a new <LOC> attribute named "target" in
order to distinguish the reference sequence from the target
sequence. This is necessary for the CIGAR string concepts to work.

Perhaps it would be better to have a "role" attribute whose value is
either "ref" or "target"?
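
To make the role of the proposed attribute concrete, here is a rough
Python sketch of how a client might tell the reference <LOC> from the
target <LOC>. It is an illustration under assumptions, not a DAS2
client: target="1" is the attribute proposed in this message, the
helper name split_locs is invented, and the embedded document is a
trimmed, well-formed copy of the Case #1 example without its XML
declaration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = {"das": "http://biodas.org/documents/das2"}
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

DOC = """<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">
  <FEATURE uri="./Alignment1" type="./expressed_sequence_match">
    <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
         range="100:147:1" />
    <LOC segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
         target="1"
         range="1:41:1"
         gap="M11 I1 M5 D1 M7 D7 M8 I1 M8" />
    <PROP key="est2genomescore" value="180" />
  </FEATURE>
</FEATURES>"""


def split_locs(feature):
    """Return (reference LOC, target LOC) for a pairwise alignment feature."""
    ref = target = None
    for loc in feature.findall("das:LOC", NS):
        if loc.get("target") == "1":  # assumed marker for the target sequence
            target = loc
        else:
            ref = loc
    return ref, target


root = ET.fromstring(DOC)
base = root.get(XML_BASE)
for feature in root.findall("das:FEATURE", NS):
    ref, target = split_locs(feature)
    # Relative feature URIs are resolved against xml:base.
    print(urljoin(base, feature.get("uri")))
    print("reference:", ref.get("range"))
    print("target:   ", target.get("range"), target.get("gap"))
```

Without such a marker the two <LOC> elements are interchangeable, and a
reader of the gap string cannot tell which sequence an I or a D refers
to.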

<!----------------------------------------------------------------------->

CASE #2. A COMPLEX PAIRWISE ALIGNMENT.

The complex pairwise alignment is used when the alignment is a
composite of two or more alignments, each of which has its own set
of properties. An example of this is BLAST, in which each "BLAST hit"
is composed of multiple aligned segments called "HSPs".

We extend the previous example by adding another aligned segment to
the alignment.

BLAST hit: align Chr4:100:300 with EST23:1:58

HSP 1:

Chr4       100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147
               |||||||X||| ||||| |||||||       ||||X||| ||||||||
EST23        1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA  41

BLAST score = 80

CIGAR gap string M11 I1 M5 D1 M7 D7 M8 I1 M8


HSP 2:

Chr4       211 TCAAACTGATAATGGGGT 228
               ||||||||||| ||||||
EST23       42 TCAAACTGATA-TGGGGT  58

BLAST score = 85

CIGAR gap string M11 D1 M6
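
A gap string can be checked against an HSP by counting how many bases
it consumes on each sequence. The following Python sketch assumes the
semantics described under Case #1 (M advances both sequences, D
advances only the reference, I advances only the target); the function
names parse_gap and consumed_lengths are invented for this example.

```python
import re


def parse_gap(gap: str) -> list[tuple[str, int]]:
    """Split a gap string such as 'M11 D1 M6' into (code, count) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognised gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops


def consumed_lengths(gap: str) -> tuple[int, int]:
    """Return (reference bases, target bases) covered by the gap string."""
    ref_len = target_len = 0
    for code, count in parse_gap(gap):
        if code in "MD":   # reference has a base in these columns
            ref_len += count
        if code in "MI":   # target has a base in these columns
            target_len += count
    return ref_len, target_len


print(consumed_lengths("M11 D1 M6"))  # (18, 17)
```

The result agrees with the 18 Chr4 bases and 17 EST23 bases shown in
the HSP 2 alignment.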

We represent this as an "expressed_sequence_match" feature relating
Chr4 100:300 to EST23 1:58. The feature contains two subparts, one
corresponding to HSP 1 and the other to HSP 2.

<?xml version="1.0" encoding="UTF-8"?>
<FEATURES
     xmlns="http://biodas.org/documents/das2"
     xml:base="http://www.biodas.org/das2/sequence/fly/Jun2006/">

  <!-- A feature for the entire BLAST hit -->

   <FEATURE uri="./Alignment2" type="./expressed_sequence_match" >
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="100:300:1">
      </LOC>
      <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="1:58:1">
       </LOC>
       <PART uri="./Alignment2.1" />
       <PART uri="./Alignment2.2" />
   </FEATURE>

  <!-- HSP 1 -->
   <FEATURE uri="./Alignment2.1" type="./match_part">
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="100:147:1">
      </LOC>
      <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="1:41:1"
          gap="M11 I1 M5 D1 M7 D7 M8 I1 M8">
       </LOC>
       <PARENT uri="./Alignment2" />
       <PROP key="blastscore" value="80" />
   </FEATURE>
    
  <!-- HSP 2 -->
   <FEATURE uri="./Alignment2.2" type="./match_part">
     <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/4"
          range="211:228:1">
      </LOC>
      <LOC
          segment="http://www.flybase.org/genome/D_melanogaster/R4.3/dna/EST23"
          target="1"
          range="42:58:1"
          gap="M11 D1 M6">
       </LOC>
       <PARENT uri="./Alignment2" />
       <PROP key="blastscore" value="85" />
   </FEATURE>

</FEATURES>
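
As a rough illustration of how a client could reassemble the BLAST hit
from this document, the Python sketch below follows each <PART> link to
its HSP, checks that the HSP names the hit in a <PARENT> element, and
collects the per-HSP scores. The function name hsps_by_hit is invented,
and the embedded document is a cut-down copy of the example above with
the <LOC> elements and xml:base left out for brevity.

```python
import xml.etree.ElementTree as ET

NS = {"das": "http://biodas.org/documents/das2"}

DOC = """<FEATURES xmlns="http://biodas.org/documents/das2">
  <FEATURE uri="./Alignment2" type="./expressed_sequence_match">
    <PART uri="./Alignment2.1" />
    <PART uri="./Alignment2.2" />
  </FEATURE>
  <FEATURE uri="./Alignment2.1" type="./match_part">
    <PARENT uri="./Alignment2" />
    <PROP key="blastscore" value="80" />
  </FEATURE>
  <FEATURE uri="./Alignment2.2" type="./match_part">
    <PARENT uri="./Alignment2" />
    <PROP key="blastscore" value="85" />
  </FEATURE>
</FEATURES>"""


def hsps_by_hit(xml_text):
    """Map each hit URI to a list of (HSP URI, blastscore) pairs."""
    root = ET.fromstring(xml_text)
    features = {f.get("uri"): f for f in root.findall("das:FEATURE", NS)}
    hits = {}
    for uri, feature in features.items():
        part_uris = [p.get("uri") for p in feature.findall("das:PART", NS)]
        if not part_uris:
            continue  # a feature without PART children is not a hit
        scores = []
        for part_uri in part_uris:
            part = features[part_uri]
            parents = [p.get("uri") for p in part.findall("das:PARENT", NS)]
            if uri not in parents:
                raise ValueError(f"{part_uri} does not name {uri} as its parent")
            score = part.find("das:PROP[@key='blastscore']", NS)
            scores.append((part_uri, int(score.get("value"))))
        hits[uri] = scores
    return hits


print(hsps_by_hit(DOC))
# {'./Alignment2': [('./Alignment2.1', 80), ('./Alignment2.2', 85)]}
```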