[Bioperl-l] Mapping 'mate pairs' in a Sanger sequencing assembly
    Dan Bolser 
    dan.bolser at gmail.com
       
    Fri Nov 14 16:26:00 UTC 2008
    
    
  
Hi,
I have written some code using BioPerl to do what I want, however,
because I am new to BioPerl, I have the 'am I doing it right' fear. I
wonder if perhaps there is a much better way to do what I want.
I am working with some Sanger sequencing data from a 'double barreled
shotgun' sequencing experiment. What this means is that the genomic
inserts are sequenced separately from both ends. For this reason, each
read has a natural 'pair' or 'mate pair' that has a typical sequence
separation, depending on the average insert size that results from a
given sequencing protocol.
The idea of a 'paired end' read is very important in the NextGen
sequencing data, as the fact that two short reads are separated by a
known distance is very informative when it comes to assembly. (For
example see: http://www.sciencemag.org/cgi/content/abstract/1149504 )
I searched the mailing list archives and the wiki, but I didn't find
any modules specifically dealing with this data type.
I wrote the following script to simply measure the average insert size
between mate pairs in the assembly. Any comments on the code are very
welcome (i.e. "BioPerl - ur doing it wrong!"). If anyone is working on
code to deal with paired end data directly (or if I just missed those
modules), please let me know.
#!/usr/bin/perl -w
## This script is based loosely on "contig_draw.PLS" from somewhere in
## BioPerl. (e.g. scripts/graphics/contig_draw.PLS or
## http://code.open-bio.org/svnweb/index.cgi/bioperl/view/bioperl-live/
## trunk/scripts/graphics/contig_draw.PLS)
## The purpose of this script is to read a .ace file (produced by
## phrap) and identify the 'mate pairs' or 'read pairs' using the St
## Louis naming convention. Once the 'paired end' reads have been
## identified, the mean insert length is calculated.
use strict;
use Bio::Assembly::IO;
use Statistics::Descriptive;
## Some things just don't seem like they should be objects...
my $stat = Statistics::Descriptive::Full->new();
## OPTIONS
die "\nusage: parse.plx <assembly ace file>\n\n"
  unless @ARGV == 1;
my $aceFile = $ARGV[0];
unless(-s $aceFile){
  die "\nempty or missing file : $aceFile\n\n";
}
warn "parsing $aceFile\n";
my $parser =
  new Bio::Assembly::IO( -file   => $aceFile,
			 -format => 'ace' );
## Typically there is only one assembly parsed by 'the parser' at a
## time. I'm not sure when this is not the case (there is only one
## 'assembly' per ace file).
my $ass =
  $parser->next_assembly();
## The assembly is a 'Bio::Assembly::Scaffold' object
## Store the forward and reverse reads.
my %readP; # Hash of linked forward and reverse read pairs
my $for = 0;
my $rev = 0;
my @reads =
  $ass->get_all_seq_ids;
## The seq IDs are strings
for my $read (@reads){
  ## Check that we can parse the read name according to the St Louis
  ## naming convention (basically '.b' identifies one half of a 'read
  ## pair' and '.g' the other half).
  unless ($read =~ /^(\d{3}[A-P]\d{2}X\d{5}).([bg]).abi$/){
    ## See also Bug 2648
    ## http://bugzilla.open-bio.org/show_bug.cgi?id=2648
    warn "is this a valid read : $read \n";
    next;
  }
  #warn "$read\n";
  ## From the definition of an assembly (specifically
  ## Bio::Assembly::Contig), the following test is impossible to
  ## fail...
  die "duplicate read : $read\n" if
    $readP{$1}{$2}++;
  $2 eq 'b' ? $for++ : $rev ++;
}
## Store the "read to contig" mapping
my %readC;
my @contigs =
  $ass->all_contigs;
## The contigs are 'Bio::Assembly::Contig' objects
foreach my $contig (@contigs){
  my $cid = $contig->id;
  #warn $cid, "\n";
  ## Loop through the sequences in this contig
  for my $read ($contig->get_seq_ids){
    #warn "\t$read\n";
    $readC{$read} = $cid;
  }
}
## Now we can do something useful
## Check the 'read pairs' in detail...
my @insertLengths;
my ($paired, $unPaired) = (0,0);
for my $read (keys %readP){
  #warn "$read\n";
  my @readP = keys %{$readP{$read}};
  if    (@readP==1){
    #warn "no mate pair $read ($readP[0])\n";
    $unPaired++;
  }
  elsif (@readP==2){
    #warn "mate pair $read\n";
    $paired++;
    ## Do the read pairs come from the same contig?
    if ($readC{"$read.b.abi"} eq
	$readC{"$read.g.abi"}){
      #print "read $read hits to contig ", $readC{"$read.b.abi"}, "\n";
      ## Get the sequence objects
      my $bId = $ass->get_seq_by_id( "$read.b.abi" );
      my $gId = $ass->get_seq_by_id( "$read.g.abi" );
      ## Get the contig object
      my $contig =
	$ass->get_contig_by_id( $readC{"$read.b.abi"} );
      ## Get the sequence to contig alignment information
      my $bSeq = $contig->get_seq_coord( $bId );
      my $gSeq = $contig->get_seq_coord( $gId );
      ## Check the sequence pair mapping sanity (?) and measure the
      ## insert length.
      my $insertLength;
#       printf(("%7d\t%7d\t%+3d\t%5d\t - \t" x 2),
# 	     ##
# 	     $bSeq->start,
# 	     $bSeq->end,
# 	     $bSeq->strand,
# 	     $bSeq->length,
# 	     ##
# 	     $gSeq->start,
# 	     $gSeq->end,
# 	     $gSeq->strand,
# 	     $gSeq->length,
# 	     ##
# 	    );
      if(+$bSeq->strand == -$gSeq->strand){
	if($bSeq->strand == +1){
	  if($bSeq->start > $gSeq->end){
#	    print "mate pair mis-match\n";
	    next;
	  }
	  $insertLength = $gSeq->end - $bSeq->start;
	}
	if($bSeq->strand == -1){
	  if($gSeq->start > $bSeq->end){
#	    print "mate pair mis-match\n";
	    next;
	  }
	  $insertLength = $bSeq->end - $gSeq->start;
	}
      }
      else{
#	print "mate pair mis-match\n";
	next;
      }
#      printf("%6d\n", $insertLength);
      push @insertLengths, $insertLength;
    }
    else{
      ## Reads map to different contigs, we cannot get an insert
      ## length.
      print
	"read $read spans ",
	  $readC{"$read.b.abi"}, " <-> ",
	    $readC{"$read.g.abi"}, "\n";
    }
  }
  else{
    ## Reads should be paired or not, nothing else.
    die "WTF?\n";
  }
}
print "$paired paired reads and $unPaired unpaired\n";
$stat->add_data(@insertLengths);
print "mean: ", $stat->mean, "\n";
print "stdv: ", $stat->standard_deviation, "\n";
Thanks very much for spending the time to consider the above script.
All the best,
Dan.
-- 
http://network.nature.com/profile/dan
    
    
More information about the Bioperl-l
mailing list