[Bioperl-l] Mapping 'mate pairs' in a Sanger sequencing assembly

Fri Nov 14 11:26:00 EST 2008

Hi,

I have written some code using BioPerl to do what I want, however,
because I am new to BioPerl, I have the 'am I doing it right' fear. I
wonder if perhaps there is a much better way to do what I want.

I am working with some Sanger sequencing data from a 'double barreled
shotgun' sequencing experiment. What this means is that the genomic
inserts are sequenced separately from both ends. For this reason, each
read has a natural 'pair' or 'mate pair' that has a typical sequence
separation, depending on the average insert size that results from a
given sequencing protocol.

The idea of a 'paired end' read is very important in the NextGen
sequencing data, as the fact that two short reads are separated by a
known distance is very informative when it comes to assembly. (For
example see: http://www.sciencemag.org/cgi/content/abstract/1149504 )

I searched the mailing list archives and the wiki, but I didn't find
any modules specifically dealing with this data type.

I wrote the following script to simply measure the average insert size
between mate pairs in the assembly. Any comments on the code are very
welcome (i.e. "BioPerl - ur doing it wrong!"). If anyone is working on
code to deal with paired end data directly (or if I just missed those
modules), please let me know.

#!/usr/bin/perl -w

## This script is based loosely on "contig_draw.PLS" from somewhere in
## BioPerl. (e.g. scripts/graphics/contig_draw.PLS or
## http://code.open-bio.org/svnweb/index.cgi/bioperl/view/bioperl-live/
## trunk/scripts/graphics/contig_draw.PLS)

## The purpose of this script is to read a .ace file (produced by
## phrap) and identify the 'mate pairs' or 'read pairs' using the St
## Louis naming convention. Once the 'paired end' reads have been
## identified, the mean insert length is calculated.

use strict;

use Bio::Assembly::IO;

use Statistics::Descriptive;

## Some things just don't seem like they should be objects...
my $stat = Statistics::Descriptive::Full->new();

## OPTIONS

die "\nusage: parse.plx <assembly ace file>\n\n"
  unless @ARGV == 1;

my $aceFile = $ARGV[0];

unless(-s $aceFile){
  die "\nempty or missing file : $aceFile\n\n";
}

warn "parsing $aceFile\n";

my $parser =
  new Bio::Assembly::IO( -file   => $aceFile,
			 -format => 'ace' );

## Typically there is only one assembly parsed by 'the parser' at a
## time. I'm not sure when this is not the case (there is only one
## 'assembly' per ace file).

my $ass =
  $parser->next_assembly();

## The assembly is a 'Bio::Assembly::Scaffold' object

## Store the forward and reverse reads.

my %readP; # Hash of linked forward and reverse read pairs

my $for = 0;
my $rev = 0;

my @reads =
  $ass->get_all_seq_ids;

## The seq IDs are strings

for my $read (@reads){

  ## Check that we can parse the read name according to the St Louis
  ## naming convention (basically '.b' identifies one half of a 'read
  ## pair' and '.g' the other half).

  unless ($read =~ /^(\d{3}[A-P]\d{2}X\d{5}).([bg]).abi$/){
    ## See also Bug 2648
    ## http://bugzilla.open-bio.org/show_bug.cgi?id=2648
    warn "is this a valid read : $read \n";
    next;
  }
  #warn "$read\n";

  ## From the definition of an assembly (specifically
  ## Bio::Assembly::Contig), the following test is impossible to
  ## fail...

  die "duplicate read : $read\n" if
    $readP{$1}{$2}++;

  $2 eq 'b' ? $for++ : $rev ++;
}

## Store the "read to contig" mapping

my %readC;

my @contigs =
  $ass->all_contigs;

## The contigs are 'Bio::Assembly::Contig' objects

foreach my $contig (@contigs){

  my $cid = $contig->id;

  #warn $cid, "\n";

  ## Loop through the sequences in this contig

  for my $read ($contig->get_seq_ids){
    #warn "\t$read\n";
    $readC{$read} = $cid;
  }
}

## Now we can do something useful

## Check the 'read pairs' in detail...

my @insertLengths;

my ($paired, $unPaired) = (0,0);

for my $read (keys %readP){
  #warn "$read\n";

  my @readP = keys %{$readP{$read}};

  if    (@readP==1){
    #warn "no mate pair $read ($readP[0])\n";
    $unPaired++;
  }
  elsif (@readP==2){
    #warn "mate pair $read\n";
    $paired++;

    ## Do the read pairs come from the same contig?

    if ($readC{"$read.b.abi"} eq
	$readC{"$read.g.abi"}){
      #print "read $read hits to contig ", $readC{"$read.b.abi"}, "\n";

      ## Get the sequence objects
      my $bId = $ass->get_seq_by_id( "$read.b.abi" );
      my $gId = $ass->get_seq_by_id( "$read.g.abi" );

      ## Get the contig object
      my $contig =
	$ass->get_contig_by_id( $readC{"$read.b.abi"} );

      ## Get the sequence to contig alignment information
      my $bSeq = $contig->get_seq_coord( $bId );
      my $gSeq = $contig->get_seq_coord( $gId );

      ## Check the sequence pair mapping sanity (?) and measure the
      ## insert length.

      my $insertLength;

#       printf(("%7d\t%7d\t%+3d\t%5d\t - \t" x 2),
# 	     ##
# 	     $bSeq->start,
# 	     $bSeq->end,
# 	     $bSeq->strand,
# 	     $bSeq->length,
# 	     ##
# 	     $gSeq->start,
# 	     $gSeq->end,
# 	     $gSeq->strand,
# 	     $gSeq->length,
# 	     ##
# 	    );

      if(+$bSeq->strand == -$gSeq->strand){
	if($bSeq->strand == +1){
	  if($bSeq->start > $gSeq->end){
#	    print "mate pair mis-match\n";
	    next;
	  }
	  $insertLength = $gSeq->end - $bSeq->start;
	}
	if($bSeq->strand == -1){
	  if($gSeq->start > $bSeq->end){
#	    print "mate pair mis-match\n";
	    next;
	  }
	  $insertLength = $bSeq->end - $gSeq->start;
	}
      }
      else{
#	print "mate pair mis-match\n";
	next;
      }

#      printf("%6d\n", $insertLength);

      push @insertLengths, $insertLength;
    }
    else{
      ## Reads map to different contigs, we cannot get an insert
      ## length.
      print
	"read $read spans ",
	  $readC{"$read.b.abi"}, " <-> ",
	    $readC{"$read.g.abi"}, "\n";
    }
  }
  else{
    ## Reads should be paired or not, nothing else.
    die "WTF?\n";
  }
}

print "$paired paired reads and $unPaired unpaired\n";

$stat->add_data(@insertLengths);

print "mean: ", $stat->mean, "\n";
print "stdv: ", $stat->standard_deviation, "\n";

Thanks very much for spending the time to consider the above script.

All the best,

Dan.

-- 
http://network.nature.com/profile/dan