[Bioperl-l] Bio::DB::Sam strange behaviour for read pairs

Frank Schwach fs5 at sanger.ac.uk
Wed Jan 6 17:16:13 EST 2010


I'm trying to extract paired reads from a BAM file that span a given 
region. I would then like to get the two read ends of the sequenced 
clone that spans the region.
I use Bio::DB::Sam->get_features_by_location for this and it does give 
me the correct read pairs as a region match but it doesn't give me both 
read pairs in all cases.

Here is the script:

#!/usr/bin/perl
use Bio::DB::Sam;

my $usage = "usage: $0 BAMFILE CHROMOSOME STARTPOS ENDPOS\n" ;
my ($bam_file,$chrom,$start,$end) = @ARGV ;
die $usage unless $bam_file && $chrom && $start && $end;

my $bam = Bio::DB::Sam->new(-bam => $bam_file);

my @pairs = $bam->get_features_by_location(
    -type   => 'read_pair',
    -seq_id => $chrom,
    -start  => $start,
    -end    => $end);

print "region: $chrom:$start..$end\n" ;
foreach my $pair (@pairs) {
  print "  pair: id: ".$pair->id.", start".$pair->start.', 
end:'.$pair->end."\n";
  my ($first_mate,$second_mate) = $pair->get_SeqFeatures;
  print "    first_mate: start:".$first_mate->start.', 
end:'.$first_mate->end."\n";
  if ($second_mate){
    print "    second_mate: start:".$second_mate->start.', 
end:'.$second_mate->end."\n";
  } else {
    print "    no second mate\n";
  }
}

And here are the matching pairs that it produces with one of my files 
for the region tal12:22479..29232:
region: 
tal12:22479..29232                                                                                                                          

  pair: id: tal-2446c08, start17496, 
end:29423                                                                                                      

    first_mate: start:28540, 
end:29423                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2463d10, start23534, 
end:31363                                                                                                      

    first_mate: start:23534, 
end:24448                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2371c09, start20860, 
end:28230                                                                                                      

    first_mate: start:27604, 
end:28230                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2440b06, start19232, 
end:27099                                                                                                      

    first_mate: start:26025, 
end:27099                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2327g09, start18909, 
end:26129                                                                                                      

    first_mate: start:25354, 
end:26129                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2381b05, start25658, 
end:35054                                                                                                      

    first_mate: start:25658, 
end:26295                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2377c11, start20898, 
end:28230                                                                                                      

    first_mate: start:27473, 
end:28230                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2426e12, start21975, 
end:27562                                                                                                      

    first_mate: start:21975, 
end:23008                                                                                                               

    second_mate: start:26396, 
end:27562                                                                                                              

  pair: id: tal-2365h10, start22843, 
end:31944                                                                                                      

    first_mate: start:22843, 
end:23184                                                                                                               

    no second 
mate                                                                                                                                   

  pair: id: tal-2388h09, start19016, 
end:28238                                                                                                      

    first_mate: start:27475, 
end:28238                                                                                                               

    no second mate                   


So it finds a lot of pairs that span the region and the start/end from 
the pair is also correct but it only gives me both individual mates in 
one case:
  pair: id: tal-2426e12, start21975, 
end:27562                                                                                                      

    first_mate: start:21975, 
end:23008                                                                                                               

    second_mate: start:26396, end:27562

In this case, both pairs are actually inside the query region (at least 
partially) whereas in the other cases, one of the mates is not inside, 
e.g. this one:

  pair: id: tal-2388h09, start19016, 
end:28238                                                                                                      

    first_mate: start:27475, 
end:28238                                                                                                               

    no second mate
  
 > get this read pair from the BAM file:
$ samtools view clones.bam | grep tal-2388h09

tal-2388h09    99      tal12  19016   205     
36H9M1D14M1D664M1D16M1D21M1D28M1D15M1D10M1D12M1D7M1D8M1D5M      =       
27475   9223    
CTTTGGATGAAATAGTTTTTAAATAATACTTATTAAATATTAAATATATAACACATAAATAAGTATTGATGCAAATTTTAAAGTATTATAGAAAACTAGGTTTGATTATATTGTTATACTGTACTTTAAGAGGAGAGAGATAAGATATCTTTGCTCTTTTAATATATAAATTTAGATAAATATTCGTTAAATTTTCTACATAGTTATTTTTTATCTTATATATTATACTGCTATAGTTATCAATGTATATACATTCAAATAATTTATTAAAAATTCTATATTATATTAATTCTATGATAAAATAATCCTGTTTGTGATTTAAAAAATGATGATTCAATAAAAACTAATAATATAATACGAGTTAATATGGAATAATAAAATGGCATTTAACATGAATTTAGTCTTTAACCTTTTCTTTGTTTGTCAAGTTTTTTAAAACATAAAACCACACATTTCAAAATGGATTTTTAGCAAATATATAAAAATTATACATTTATAATGTATTGTTATGCGTCTTTTCGATAGAATCAATATTTAATTATATGAAGTTTCCACAATAAAATAATATTTAATATTATTTATTAGTAGAGTATTTGATTATATATATAGGCATATAATAATAACTCTAGTTCTATCTACCATATTATTTATAATTATTATAACAAAATGTGATATGAAATTTTATTATATACTTATATTATTTTTTTAACTATTTTAAAATATATTTATTTATACCTCAAAACTATAAAATTGAAATTATTAATAATAATCTAATATATACCTTTATAAAAATAAACGTATAAACTAAT   
 ><:4/+1+*)+4>BEH=9-,,66IIIIIIIIEDA>>>>A at DDFFIHHHHHITIIIIIHIIHHHHHHIYYYYYTTTYDDDHDDDDDDDIIINNTNHHHHHIYYYYIIIIIINNNNTTIIIIIIIIIIITTNTTTTTYYYTTTTTTYYNNNNNNLLLLLLLLLLLNNNNNNTTTTTTTTTTTTTNNTNNNTTTYYTLLLLLLTTTTTTTTYTTTNNNNNNTTTTTTTNNTTNNTTTTTTTTTTYYTTTTTTTNNNNNNTTTTTTTYYTTTTTTYYYTTTTTTTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTTTTYYYYYYYYYYYYYYYYYYYYYYYTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTNNINTTYYYYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYYTTTTTOOOIFFFIFIIOICC>>II@>>>>>>C>>>>>>CIBECCCHIIOOOOOOOOTTTIIFDDEIQQA:55839AA>99>@IIIIII>>::;;I;>>CC>>>>>@III<::=>AAA<>>>>I>:>>99:>842225006824855;5>68844//.//00:>::338:99<:/-+*-./0)((((+00+..,++(((+-()(*((((()*)***))3)''')*..+*++((*1++--+*''''((+/)*42.((***)+,+('*'''*((''''((,'%%''''''''(     
AS:i:614        MS:i:50
tal-2388h09    147     tal12  27475   205     1H764M40H       =       
19016   -9223   
ATTAAATCGGTATCGCCAACACAATGAGTATAATCATTGTCAAATATGCGTTTGTAAGTATATTCATTGTCACATTCACGTTTGTAAGTATATTCATTGTCACATTCACGTTTGTAAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTAATTTTTGTGAGTATAATCATTGGAACG  
(((0))*,-1-../2((())03---03266300271+*.-0-*''''+*,+/+))*-05330+)..4>7=77273911**((+20+03688633:93036<8;::5:<99379>>::>>>:57:<:7--)))1435::333228>::>II>::>A>>3/.958677AA=AA:>:==IIII8338<>A>>>>IIIIIIIIYYYYYKKYYYMIFFFFEIIIMI::4..8AIIC>9>=EIQQQMCAAAAAACIIIIAICIIIOOYTIIIMOQQMIIIIC>>AAABCCCCCEAI>C>>IQQIIIIIIIIIIKKYYYYYYYYYYYYYYYYYYYYYTIIIIIIYYYYTNINNNTYYYYYYYYYYYYYTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYYYYSSYYYYYYYTTTTTTYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTTTTTTTTTTTTTYYYYYTTTTTTYYYTTNNNNTTYYYYYYYYYTTTTLLTTNNTTTTYTTTTTTYYYYYYYYYYYTTOOKKKLKOOTYYYYYYYYYYYYYYYYTNNNNNNNNNTTTNYNNNNTNNNNTTYYYYYYYYTTNNNNTTYNNNNNITTTTTYYYYYYYYYYTTNNIIIIIDIIIIHTNNNNTTYYYYTNNNIIIIIITTTINIIIINNNNTTTYYYYIHHHDDHHDDIHDDGDFFFTIIINTTYYYYTTTTHHHHCCIIIHIHHHHCAI9:++**1168>ACCIIDDDDDDI>>>>>?NNN  
AS:i:688        MS:i:50

So the read in the first line starts before the start of the query 
region and is not accessible via $pair->get_SeqFeatures although this is 
a valid pair.
Am I doing something wrong, is this the desired behaviour or is it a bug?

Thanks for your help!



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 



More information about the Bioperl-l mailing list