[Bioperl-l] Bio::DB::Sam strange behaviour for read pairs
Frank Schwach
fs5 at sanger.ac.uk
Wed Jan 6 17:16:13 EST 2010
I'm trying to extract paired reads from a BAM file that span a given
region. I would then like to get the two read ends of the sequenced
clone that spans the region.
I use Bio::DB::Sam->get_features_by_location for this and it does give
me the correct read pairs as a region match but it doesn't give me both
read pairs in all cases.
Here is the script:
#!/usr/bin/perl
use Bio::DB::Sam;
my $usage = "usage: $0 BAMFILE CHROMOSOME STARTPOS ENDPOS\n" ;
my ($bam_file,$chrom,$start,$end) = @ARGV ;
die $usage unless $bam_file && $chrom && $start && $end;
my $bam = Bio::DB::Sam->new(-bam => $bam_file);
my @pairs = $bam->get_features_by_location(
-type => 'read_pair',
-seq_id => $chrom,
-start => $start,
-end => $end);
print "region: $chrom:$start..$end\n" ;
foreach my $pair (@pairs) {
print " pair: id: ".$pair->id.", start".$pair->start.',
end:'.$pair->end."\n";
my ($first_mate,$second_mate) = $pair->get_SeqFeatures;
print " first_mate: start:".$first_mate->start.',
end:'.$first_mate->end."\n";
if ($second_mate){
print " second_mate: start:".$second_mate->start.',
end:'.$second_mate->end."\n";
} else {
print " no second mate\n";
}
}
And here are the matching pairs that it produces with one of my files
for the region tal12:22479..29232:
region:
tal12:22479..29232
pair: id: tal-2446c08, start17496,
end:29423
first_mate: start:28540,
end:29423
no second
mate
pair: id: tal-2463d10, start23534,
end:31363
first_mate: start:23534,
end:24448
no second
mate
pair: id: tal-2371c09, start20860,
end:28230
first_mate: start:27604,
end:28230
no second
mate
pair: id: tal-2440b06, start19232,
end:27099
first_mate: start:26025,
end:27099
no second
mate
pair: id: tal-2327g09, start18909,
end:26129
first_mate: start:25354,
end:26129
no second
mate
pair: id: tal-2381b05, start25658,
end:35054
first_mate: start:25658,
end:26295
no second
mate
pair: id: tal-2377c11, start20898,
end:28230
first_mate: start:27473,
end:28230
no second
mate
pair: id: tal-2426e12, start21975,
end:27562
first_mate: start:21975,
end:23008
second_mate: start:26396,
end:27562
pair: id: tal-2365h10, start22843,
end:31944
first_mate: start:22843,
end:23184
no second
mate
pair: id: tal-2388h09, start19016,
end:28238
first_mate: start:27475,
end:28238
no second mate
So it finds a lot of pairs that span the region and the start/end from
the pair is also correct but it only gives me both individual mates in
one case:
pair: id: tal-2426e12, start21975,
end:27562
first_mate: start:21975,
end:23008
second_mate: start:26396, end:27562
In this case, both pairs are actually inside the query region (at least
partially) whereas in the other cases, one of the mates is not inside,
e.g. this one:
pair: id: tal-2388h09, start19016,
end:28238
first_mate: start:27475,
end:28238
no second mate
> get this read pair from the BAM file:
$ samtools view clones.bam | grep tal-2388h09
tal-2388h09 99 tal12 19016 205
36H9M1D14M1D664M1D16M1D21M1D28M1D15M1D10M1D12M1D7M1D8M1D5M =
27475 9223
CTTTGGATGAAATAGTTTTTAAATAATACTTATTAAATATTAAATATATAACACATAAATAAGTATTGATGCAAATTTTAAAGTATTATAGAAAACTAGGTTTGATTATATTGTTATACTGTACTTTAAGAGGAGAGAGATAAGATATCTTTGCTCTTTTAATATATAAATTTAGATAAATATTCGTTAAATTTTCTACATAGTTATTTTTTATCTTATATATTATACTGCTATAGTTATCAATGTATATACATTCAAATAATTTATTAAAAATTCTATATTATATTAATTCTATGATAAAATAATCCTGTTTGTGATTTAAAAAATGATGATTCAATAAAAACTAATAATATAATACGAGTTAATATGGAATAATAAAATGGCATTTAACATGAATTTAGTCTTTAACCTTTTCTTTGTTTGTCAAGTTTTTTAAAACATAAAACCACACATTTCAAAATGGATTTTTAGCAAATATATAAAAATTATACATTTATAATGTATTGTTATGCGTCTTTTCGATAGAATCAATATTTAATTATATGAAGTTTCCACAATAAAATAATATTTAATATTATTTATTAGTAGAGTATTTGATTATATATATAGGCATATAATAATAACTCTAGTTCTATCTACCATATTATTTATAATTATTATAACAAAATGTGATATGAAATTTTATTATATACTTATATTATTTTTTTAACTATTTTAAAATATATTTATTTATACCTCAAAACTATAAAATTGAAATTATTAATAATAATCTAATATATACCTTTATAAAAATAAACGTATAAACTAAT
><:4/+1+*)+4>BEH=9-,,66IIIIIIIIEDA>>>>A at DDFFIHHHHHITIIIIIHIIHHHHHHIYYYYYTTTYDDDHDDDDDDDIIINNTNHHHHHIYYYYIIIIIINNNNTTIIIIIIIIIIITTNTTTTTYYYTTTTTTYYNNNNNNLLLLLLLLLLLNNNNNNTTTTTTTTTTTTTNNTNNNTTTYYTLLLLLLTTTTTTTTYTTTNNNNNNTTTTTTTNNTTNNTTTTTTTTTTYYTTTTTTTNNNNNNTTTTTTTYYTTTTTTYYYTTTTTTTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTTTTYYYYYYYYYYYYYYYYYYYYYYYTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTTTYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTTNNINTTYYYYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYYTTTTTOOOIFFFIFIIOICC>>II@>>>>>>C>>>>>>CIBECCCHIIOOOOOOOOTTTIIFDDEIQQA:55839AA>99>@IIIIII>>::;;I;>>CC>>>>>@III<::=>AAA<>>>>I>:>>99:>842225006824855;5>68844//.//00:>::338:99<:/-+*-./0)((((+00+..,++(((+-()(*((((()*)***))3)''')*..+*++((*1++--+*''''((+/)*42.((***)+,+('*'''*((''''((,'%%''''''''(
AS:i:614 MS:i:50
tal-2388h09 147 tal12 27475 205 1H764M40H =
19016 -9223
ATTAAATCGGTATCGCCAACACAATGAGTATAATCATTGTCAAATATGCGTTTGTAAGTATATTCATTGTCACATTCACGTTTGTAAGTATATTCATTGTCACATTCACGTTTGTAAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTGTAAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTCATTTTTATGAGTATAATCATTGGAACGTTCATTTTTATGAGTATAATCATTGGTACGTTCATTTTTGTGAGTATAATCATTGGTACGTTAATTTTTGTGAGTATAATCATTGGAACG
(((0))*,-1-../2((())03---03266300271+*.-0-*''''+*,+/+))*-05330+)..4>7=77273911**((+20+03688633:93036<8;::5:<99379>>::>>>:57:<:7--)))1435::333228>::>II>::>A>>3/.958677AA=AA:>:==IIII8338<>A>>>>IIIIIIIIYYYYYKKYYYMIFFFFEIIIMI::4..8AIIC>9>=EIQQQMCAAAAAACIIIIAICIIIOOYTIIIMOQQMIIIIC>>AAABCCCCCEAI>C>>IQQIIIIIIIIIIKKYYYYYYYYYYYYYYYYYYYYYTIIIIIIYYYYTNINNNTYYYYYYYYYYYYYTTTTTYYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTYYYYYYYYYYYYYYYSSYYYYYYYTTTTTTYYYYYYYYYYYYYYYYYYYYYYYYTTTTTTTTTTTTTTTTTTYYYYYTTTTTTYYYTTNNNNTTYYYYYYYYYTTTTLLTTNNTTTTYTTTTTTYYYYYYYYYYYTTOOKKKLKOOTYYYYYYYYYYYYYYYYTNNNNNNNNNTTTNYNNNNTNNNNTTYYYYYYYYTTNNNNTTYNNNNNITTTTTYYYYYYYYYYTTNNIIIIIDIIIIHTNNNNTTYYYYTNNNIIIIIITTTINIIIINNNNTTTYYYYIHHHDDHHDDIHDDGDFFFTIIINTTYYYYTTTTHHHHCCIIIHIHHHHCAI9:++**1168>ACCIIDDDDDDI>>>>>?NNN
AS:i:688 MS:i:50
So the read in the first line starts before the start of the query
region and is not accessible via $pair->get_SeqFeatures although this is
a valid pair.
Am I doing something wrong, is this the desired behaviour or is it a bug?
Thanks for your help!
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
More information about the Bioperl-l
mailing list