[Bioperl-l] Bio::SeqIO::scf traces scrambled?

Thu Jun 18 20:02:53 UTC 2009

Cook, Malcolm wrote:
> Charles,
>
> Another possible stopgap that might work for you, if you're working with AB1 chromatograms and have ABIs kb-basecaller turned on, is to use Bio::Trace::ABIF
>
> 	http://search.cpan.org/dist/Bio-Trace-ABIF/lib/Bio/Trace/ABIF.pm
>
> It works great and includes implementation of ABIs algorithm allowing to (re)compute trace clear ranges using kc-basecallers quality scores and any windowing/quality parameters.
>
> Its not in the bioperl project but it is an easy install from CPAN.
>   
Thanks - we installed that a few weeks ago, and it was on my list of 
things to try, but I had not gotten to it yet since I was getting data 
out of the SCF SeqIO module. Even though the SeqIO::scf data looks ok, 
the fact that I need to unscramble it makes me nervous... Thanks, too, 
for the example code. I'll try out the Bio::Trace::ABIF module and see 
if it works with our files.

Thanks,
CAT
> I am familiar with staden::read installation woes.  
>
> Below is a quick script I wrote that employs it... it could be better parameterized, but it might be useful to you "out of the box"....
>
> Malcolm Cook
> Stowers Institute for Medical Research - Kansas City, Missouri
>   
>
> #!/usr/bin/env perl
>
> # PURPOSE: extract from AB1 files into fasta format the sequence in
> # the 'clear range' defined by 3 parameters.  If there is no clear
> # range, emit warning and skip the sequence.  The fasta 'defline'
> # identifier is taken as the sample name.  Other useful attributes are
> # also embedded into the defline using attribute=value syntax.
>
> # USAGE: ABIFqtrim $window_width $bad_bases_threshold $quality_threshold f1.ab1 ... fn.ab1
>
> # NOTE: 20 4 20 is ABI default settings
>
> # EXAMPLE:
> # ABIFqtrim 20 4 20 /n/facility/Bioinformatics/Software/ABI/ab1_test_files/*.ab1 > ab1_test_files_trimmed.fasta
>
> # AUTHOR: malcolm_cook at stowers-institute.org
>
> use strict;
> use warnings;
> use Bio::Trace::ABIF;
> use Text::Wrap qw(wrap);
> $Text::Wrap::columns = 72;	# wrap the sequence
>
> use File::Basename;
> my ($window_width,
>     $bad_bases_threshold,
>     $quality_threshold,
>     @ARGV) = @ARGV;
>
> my $abif = Bio::Trace::ABIF->new();
>
> sub main {} {
>   foreach (@ARGV) {
>     $abif->open_abif($_) or die "error opening $_ as ABIF";
>     my ($clear_range_start,$clear_range_stop) = $abif->clear_range($window_width,
> 								   $bad_bases_threshold,
> 								   $quality_threshold
> 								  );
>     my $sample_score = $abif->sample_score(
> 					   $window_width,
> 					   $bad_bases_threshold,
> 					   $quality_threshold
> 					  );
>     #    my $contiguous_read_length = $abif->contiguous_read_length($window_width,
>     #							       $quality_threshold,
>     #							       0, # ==> trim_ends
>     #							      );
>     #    my $length_of_read = $abif->length_of_read(
>     #				    $window_width,
>     #				    $quality_threshold,
>     #				    # $method
>     #				   );
>     my $defline = 
>       join "\t", 
> 	$abif->sample_name,
> 	  #basename($_,qw(.ab1 .abi)), # use this to use the filename's basename in the defline
> 	  #$abif->container_identifier . ':' . $abif->well_id,  # or this, for container:well_id formatted defline identifiers
> 	  (map {my $method = $_;
> 		"$method=". ($abif->$method() || '')}
> 	   qw(sample_name comment run_name well_id container_identifier sequence_length )), #comment
> 	     # sample_tracking_id - don't use this - it is internal to ABI software
> 	     "clear_range_start=$clear_range_start",
> 	       "clear_range_stop=$clear_range_stop",
> 		 "sample_score=$sample_score",
> 		   #"contiguous_read_length=$contiguous_read_length",
> 		   #"length_of_read=$length_of_read",
> 		   ;
>     if ($clear_range_start == -1) {
>       warn "NO CLEAR RANGE! SKIPPING $_:\n\t$defline";
>       next;
>     }
>     my $seq = wrap('','',substr($abif->sequence, $clear_range_start + 1, ($clear_range_stop + 1) - ($clear_range_start + 1) + 1));
>     print ">$defline\n$seq\n";
>     $abif->close_abif();
>
>   }
> }
>
> main ();
>
>
>
>
>
>
>   
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org 
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
>> Charles Tilford
>> Sent: Thursday, June 18, 2009 8:39 AM
>> To: BioPerl List
>> Subject: [Bioperl-l] Bio::SeqIO::scf traces scrambled?
>>
>> Nutshell: Bio::SeqIO::scf seems to be mixing up A/C/G/T trace 
>> channels. 
>> Can anyone confirm?
>>
>> Hi all,
>>
>> I'm using the SCF Bio::SeqIO module to parse trace data out 
>> of chromatograms. The SCF files are being produced by phred 
>> using the "-cd" 
>> parameter. The traces come out great, and the corresponding 
>> base calls from the .phd files align with the peaks 
>> wonderfully when I visualize them on a rendered trace. 
>> However, only the A bases align to the appropriate trace 
>> channel, the rest are mixed up. I find that if I do the 
>> following re-mapping, the phred base calls match the
>>
>> SeqIO : Remapped
>> A : A
>> C : G
>> G : T
>> T : C
>>
>> The relevant part of Bio::SeqIO::scf is here:
>>
>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
>> io/SeqIO/scf.html#CODE9
>>
>> ... which indicates that it expects the pack()ed trace data 
>> to be in order ATGC. The base call parsing code is here:
>>
>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
>> io/SeqIO/scf.html#CODE8
>>
>> ... which is unpacking in order ACGT. As far as I can tell, 
>> the relevant official SCF documentation is here:
>>
>> http://staden.sourceforge.net/manual/formats_unix_4.html
>>
>> ... which indicates that both trace and base order should be 
>> ACGT (matching the SeqIO unpack() for bases, but not traces). 
>> My empirical channel unscrambling mapping implies order ACTG, 
>> which is different from either of the two orders above. The 
>> sequence from the SCF file (should be that from original AB1 
>> file, I think) is not perfectly identical to that called by 
>> phred, but is very similar (to be expected); that is, I don't 
>> need to remap C, G and T to get it to align with the phred data.
>>
>> So it looks like the SeqIO module is not mapping the sections 
>> of the packed trace data to the appropriate bases. The unpack 
>> order is different than the staden documentation ... but so 
>> is the order I impose to correct the problem. I am still 
>> unclear as to the differences between
>> V2 and V3 of the format. The major difference appears to be 
>> coding the trace absolutely (V2) or relatively to prior 
>> values (V3); I'd expect if I was using one format and SeqIO 
>> was trying to parse the other that I would get garbage out. 
>> Running in verbose reports "scf.pm is working with a version 2 scf."
>>
>> Thoughts on this would be appreciated - can anyone confirm a 
>> problem with trace extraction from SCF?
>>
>> I'm hoping that once I convince our admin to (properly) 
>> install staden::read that I can work directly with the ab1 
>> files, but I need to stopgap on SCF for the time being....
>>
>> -CAT
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>