[Bioperl-l] Mapping 'mate pairs' in a Sanger sequencing assembly
Dan Bolser
dan.bolser at gmail.com
Fri Nov 14 11:26:00 EST 2008
Hi,
I have written some code using BioPerl to do what I want, however,
because I am new to BioPerl, I have the 'am I doing it right' fear. I
wonder if perhaps there is a much better way to do what I want.
I am working with some Sanger sequencing data from a 'double barreled
shotgun' sequencing experiment. What this means is that the genomic
inserts are sequenced separately from both ends. For this reason, each
read has a natural 'pair' or 'mate pair' that has a typical sequence
separation, depending on the average insert size that results from a
given sequencing protocol.
The idea of a 'paired end' read is very important in the NextGen
sequencing data, as the fact that two short reads are separated by a
known distance is very informative when it comes to assembly. (For
example see: http://www.sciencemag.org/cgi/content/abstract/1149504 )
I searched the mailing list archives and the wiki, but I didn't find
any modules specifically dealing with this data type.
I wrote the following script to simply measure the average insert size
between mate pairs in the assembly. Any comments on the code are very
welcome (i.e. "BioPerl - ur doing it wrong!"). If anyone is working on
code to deal with paired end data directly (or if I just missed those
modules), please let me know.
#!/usr/bin/perl -w
## This script is based loosely on "contig_draw.PLS" from somewhere in
## BioPerl. (e.g. scripts/graphics/contig_draw.PLS or
## http://code.open-bio.org/svnweb/index.cgi/bioperl/view/bioperl-live/
## trunk/scripts/graphics/contig_draw.PLS)
## The purpose of this script is to read a .ace file (produced by
## phrap) and identify the 'mate pairs' or 'read pairs' using the St
## Louis naming convention. Once the 'paired end' reads have been
## identified, the mean insert length is calculated.
use strict;
use Bio::Assembly::IO;
use Statistics::Descriptive;
## Some things just don't seem like they should be objects...
my $stat = Statistics::Descriptive::Full->new();
## OPTIONS
die "\nusage: parse.plx <assembly ace file>\n\n"
unless @ARGV == 1;
my $aceFile = $ARGV[0];
unless(-s $aceFile){
die "\nempty or missing file : $aceFile\n\n";
}
warn "parsing $aceFile\n";
my $parser =
new Bio::Assembly::IO( -file => $aceFile,
-format => 'ace' );
## Typically there is only one assembly parsed by 'the parser' at a
## time. I'm not sure when this is not the case (there is only one
## 'assembly' per ace file).
my $ass =
$parser->next_assembly();
## The assembly is a 'Bio::Assembly::Scaffold' object
## Store the forward and reverse reads.
my %readP; # Hash of linked forward and reverse read pairs
my $for = 0;
my $rev = 0;
my @reads =
$ass->get_all_seq_ids;
## The seq IDs are strings
for my $read (@reads){
## Check that we can parse the read name according to the St Louis
## naming convention (basically '.b' identifies one half of a 'read
## pair' and '.g' the other half).
unless ($read =~ /^(\d{3}[A-P]\d{2}X\d{5}).([bg]).abi$/){
## See also Bug 2648
## http://bugzilla.open-bio.org/show_bug.cgi?id=2648
warn "is this a valid read : $read \n";
next;
}
#warn "$read\n";
## From the definition of an assembly (specifically
## Bio::Assembly::Contig), the following test is impossible to
## fail...
die "duplicate read : $read\n" if
$readP{$1}{$2}++;
$2 eq 'b' ? $for++ : $rev ++;
}
## Store the "read to contig" mapping
my %readC;
my @contigs =
$ass->all_contigs;
## The contigs are 'Bio::Assembly::Contig' objects
foreach my $contig (@contigs){
my $cid = $contig->id;
#warn $cid, "\n";
## Loop through the sequences in this contig
for my $read ($contig->get_seq_ids){
#warn "\t$read\n";
$readC{$read} = $cid;
}
}
## Now we can do something useful
## Check the 'read pairs' in detail...
my @insertLengths;
my ($paired, $unPaired) = (0,0);
for my $read (keys %readP){
#warn "$read\n";
my @readP = keys %{$readP{$read}};
if (@readP==1){
#warn "no mate pair $read ($readP[0])\n";
$unPaired++;
}
elsif (@readP==2){
#warn "mate pair $read\n";
$paired++;
## Do the read pairs come from the same contig?
if ($readC{"$read.b.abi"} eq
$readC{"$read.g.abi"}){
#print "read $read hits to contig ", $readC{"$read.b.abi"}, "\n";
## Get the sequence objects
my $bId = $ass->get_seq_by_id( "$read.b.abi" );
my $gId = $ass->get_seq_by_id( "$read.g.abi" );
## Get the contig object
my $contig =
$ass->get_contig_by_id( $readC{"$read.b.abi"} );
## Get the sequence to contig alignment information
my $bSeq = $contig->get_seq_coord( $bId );
my $gSeq = $contig->get_seq_coord( $gId );
## Check the sequence pair mapping sanity (?) and measure the
## insert length.
my $insertLength;
# printf(("%7d\t%7d\t%+3d\t%5d\t - \t" x 2),
# ##
# $bSeq->start,
# $bSeq->end,
# $bSeq->strand,
# $bSeq->length,
# ##
# $gSeq->start,
# $gSeq->end,
# $gSeq->strand,
# $gSeq->length,
# ##
# );
if(+$bSeq->strand == -$gSeq->strand){
if($bSeq->strand == +1){
if($bSeq->start > $gSeq->end){
# print "mate pair mis-match\n";
next;
}
$insertLength = $gSeq->end - $bSeq->start;
}
if($bSeq->strand == -1){
if($gSeq->start > $bSeq->end){
# print "mate pair mis-match\n";
next;
}
$insertLength = $bSeq->end - $gSeq->start;
}
}
else{
# print "mate pair mis-match\n";
next;
}
# printf("%6d\n", $insertLength);
push @insertLengths, $insertLength;
}
else{
## Reads map to different contigs, we cannot get an insert
## length.
print
"read $read spans ",
$readC{"$read.b.abi"}, " <-> ",
$readC{"$read.g.abi"}, "\n";
}
}
else{
## Reads should be paired or not, nothing else.
die "WTF?\n";
}
}
print "$paired paired reads and $unPaired unpaired\n";
$stat->add_data(@insertLengths);
print "mean: ", $stat->mean, "\n";
print "stdv: ", $stat->standard_deviation, "\n";
Thanks very much for spending the time to consider the above script.
All the best,
Dan.
--
http://network.nature.com/profile/dan
More information about the Bioperl-l
mailing list