[Bioperl-l] Bio::SeqIO::scf traces scrambled?
Charles Tilford
charles.tilford at bms.com
Thu Jun 18 20:02:53 UTC 2009
Cook, Malcolm wrote:
> Charles,
>
> Another possible stopgap that might work for you, if you're working with AB1 chromatograms and have ABIs kb-basecaller turned on, is to use Bio::Trace::ABIF
>
> http://search.cpan.org/dist/Bio-Trace-ABIF/lib/Bio/Trace/ABIF.pm
>
> It works great and includes implementation of ABIs algorithm allowing to (re)compute trace clear ranges using kc-basecallers quality scores and any windowing/quality parameters.
>
> Its not in the bioperl project but it is an easy install from CPAN.
>
Thanks - we installed that a few weeks ago, and it was on my list of
things to try, but I had not gotten to it yet since I was getting data
out of the SCF SeqIO module. Even though the SeqIO::scf data looks ok,
the fact that I need to unscramble it makes me nervous... Thanks, too,
for the example code. I'll try out the Bio::Trace::ABIF module and see
if it works with our files.
Thanks,
CAT
> I am familiar with staden::read installation woes.
>
> Below is a quick script I wrote that employs it... it could be better parameterized, but it might be useful to you "out of the box"....
>
> Malcolm Cook
> Stowers Institute for Medical Research - Kansas City, Missouri
>
>
> #!/usr/bin/env perl
>
> # PURPOSE: extract from AB1 files into fasta format the sequence in
> # the 'clear range' defined by 3 parameters. If there is no clear
> # range, emit warning and skip the sequence. The fasta 'defline'
> # identifier is taken as the sample name. Other useful attributes are
> # also embedded into the defline using attribute=value syntax.
>
> # USAGE: ABIFqtrim $window_width $bad_bases_threshold $quality_threshold f1.ab1 ... fn.ab1
>
> # NOTE: 20 4 20 is ABI default settings
>
> # EXAMPLE:
> # ABIFqtrim 20 4 20 /n/facility/Bioinformatics/Software/ABI/ab1_test_files/*.ab1 > ab1_test_files_trimmed.fasta
>
> # AUTHOR: malcolm_cook at stowers-institute.org
>
> use strict;
> use warnings;
> use Bio::Trace::ABIF;
> use Text::Wrap qw(wrap);
> $Text::Wrap::columns = 72; # wrap the sequence
>
> use File::Basename;
> my ($window_width,
> $bad_bases_threshold,
> $quality_threshold,
> @ARGV) = @ARGV;
>
> my $abif = Bio::Trace::ABIF->new();
>
> sub main {} {
> foreach (@ARGV) {
> $abif->open_abif($_) or die "error opening $_ as ABIF";
> my ($clear_range_start,$clear_range_stop) = $abif->clear_range($window_width,
> $bad_bases_threshold,
> $quality_threshold
> );
> my $sample_score = $abif->sample_score(
> $window_width,
> $bad_bases_threshold,
> $quality_threshold
> );
> # my $contiguous_read_length = $abif->contiguous_read_length($window_width,
> # $quality_threshold,
> # 0, # ==> trim_ends
> # );
> # my $length_of_read = $abif->length_of_read(
> # $window_width,
> # $quality_threshold,
> # # $method
> # );
> my $defline =
> join "\t",
> $abif->sample_name,
> #basename($_,qw(.ab1 .abi)), # use this to use the filename's basename in the defline
> #$abif->container_identifier . ':' . $abif->well_id, # or this, for container:well_id formatted defline identifiers
> (map {my $method = $_;
> "$method=". ($abif->$method() || '')}
> qw(sample_name comment run_name well_id container_identifier sequence_length )), #comment
> # sample_tracking_id - don't use this - it is internal to ABI software
> "clear_range_start=$clear_range_start",
> "clear_range_stop=$clear_range_stop",
> "sample_score=$sample_score",
> #"contiguous_read_length=$contiguous_read_length",
> #"length_of_read=$length_of_read",
> ;
> if ($clear_range_start == -1) {
> warn "NO CLEAR RANGE! SKIPPING $_:\n\t$defline";
> next;
> }
> my $seq = wrap('','',substr($abif->sequence, $clear_range_start + 1, ($clear_range_stop + 1) - ($clear_range_start + 1) + 1));
> print ">$defline\n$seq\n";
> $abif->close_abif();
>
> }
> }
>
> main ();
>
>
>
>
>
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>> Charles Tilford
>> Sent: Thursday, June 18, 2009 8:39 AM
>> To: BioPerl List
>> Subject: [Bioperl-l] Bio::SeqIO::scf traces scrambled?
>>
>> Nutshell: Bio::SeqIO::scf seems to be mixing up A/C/G/T trace
>> channels.
>> Can anyone confirm?
>>
>> Hi all,
>>
>> I'm using the SCF Bio::SeqIO module to parse trace data out
>> of chromatograms. The SCF files are being produced by phred
>> using the "-cd"
>> parameter. The traces come out great, and the corresponding
>> base calls from the .phd files align with the peaks
>> wonderfully when I visualize them on a rendered trace.
>> However, only the A bases align to the appropriate trace
>> channel, the rest are mixed up. I find that if I do the
>> following re-mapping, the phred base calls match the
>>
>> SeqIO : Remapped
>> A : A
>> C : G
>> G : T
>> T : C
>>
>> The relevant part of Bio::SeqIO::scf is here:
>>
>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
>> io/SeqIO/scf.html#CODE9
>>
>> ... which indicates that it expects the pack()ed trace data
>> to be in order ATGC. The base call parsing code is here:
>>
>> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
>> io/SeqIO/scf.html#CODE8
>>
>> ... which is unpacking in order ACGT. As far as I can tell,
>> the relevant official SCF documentation is here:
>>
>> http://staden.sourceforge.net/manual/formats_unix_4.html
>>
>> ... which indicates that both trace and base order should be
>> ACGT (matching the SeqIO unpack() for bases, but not traces).
>> My empirical channel unscrambling mapping implies order ACTG,
>> which is different from either of the two orders above. The
>> sequence from the SCF file (should be that from original AB1
>> file, I think) is not perfectly identical to that called by
>> phred, but is very similar (to be expected); that is, I don't
>> need to remap C, G and T to get it to align with the phred data.
>>
>> So it looks like the SeqIO module is not mapping the sections
>> of the packed trace data to the appropriate bases. The unpack
>> order is different than the staden documentation ... but so
>> is the order I impose to correct the problem. I am still
>> unclear as to the differences between
>> V2 and V3 of the format. The major difference appears to be
>> coding the trace absolutely (V2) or relatively to prior
>> values (V3); I'd expect if I was using one format and SeqIO
>> was trying to parse the other that I would get garbage out.
>> Running in verbose reports "scf.pm is working with a version 2 scf."
>>
>> Thoughts on this would be appreciated - can anyone confirm a
>> problem with trace extraction from SCF?
>>
>> I'm hoping that once I convince our admin to (properly)
>> install staden::read that I can work directly with the ab1
>> files, but I need to stopgap on SCF for the time being....
>>
>> -CAT
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
More information about the Bioperl-l
mailing list