[Bioperl-l] Bio::SeqIO::scf traces scrambled?
Charles Tilford
charles.tilford at bms.com
Thu Jun 18 13:38:34 UTC 2009
Nutshell: Bio::SeqIO::scf seems to be mixing up A/C/G/T trace channels.
Can anyone confirm?
Hi all,
I'm using the SCF Bio::SeqIO module to parse trace data out of
chromatograms. The SCF files are being produced by phred using the "-cd"
parameter. The traces come out great, and the corresponding base calls
from the .phd files align with the peaks wonderfully when I visualize
them on a rendered trace. However, only the A bases align to the
appropriate trace channel, the rest are mixed up. I find that if I do
the following re-mapping, the phred base calls match the
SeqIO : Remapped
A : A
C : G
G : T
T : C
The relevant part of Bio::SeqIO::scf is here:
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html#CODE9
... which indicates that it expects the pack()ed trace data to be in
order ATGC. The base call parsing code is here:
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html#CODE8
... which is unpacking in order ACGT. As far as I can tell, the relevant
official SCF documentation is here:
http://staden.sourceforge.net/manual/formats_unix_4.html
... which indicates that both trace and base order should be ACGT
(matching the SeqIO unpack() for bases, but not traces). My empirical
channel unscrambling mapping implies order ACTG, which is different from
either of the two orders above. The sequence from the SCF file (should
be that from original AB1 file, I think) is not perfectly identical to
that called by phred, but is very similar (to be expected); that is, I
don't need to remap C, G and T to get it to align with the phred data.
So it looks like the SeqIO module is not mapping the sections of the
packed trace data to the appropriate bases. The unpack order is
different than the staden documentation ... but so is the order I impose
to correct the problem. I am still unclear as to the differences between
V2 and V3 of the format. The major difference appears to be coding the
trace absolutely (V2) or relatively to prior values (V3); I'd expect if
I was using one format and SeqIO was trying to parse the other that I
would get garbage out. Running in verbose reports "scf.pm is working
with a version 2 scf."
Thoughts on this would be appreciated - can anyone confirm a problem
with trace extraction from SCF?
I'm hoping that once I convince our admin to (properly) install
staden::read that I can work directly with the ab1 files, but I need to
stopgap on SCF for the time being....
-CAT
More information about the Bioperl-l
mailing list