[Bioperl-l] Bio::SeqIO::scf traces scrambled?

Charles Tilford charles.tilford at bms.com
Thu Jun 18 13:38:34 UTC 2009


Nutshell: Bio::SeqIO::scf seems to be mixing up A/C/G/T trace channels. 
Can anyone confirm?

Hi all,

I'm using the SCF Bio::SeqIO module to parse trace data out of 
chromatograms. The SCF files are being produced by phred using the "-cd" 
parameter. The traces come out great, and the corresponding base calls 
from the .phd files align with the peaks wonderfully when I visualize 
them on a rendered trace. However, only the A bases align to the 
appropriate trace channel, the rest are mixed up. I find that if I do 
the following re-mapping, the phred base calls match the

SeqIO : Remapped
A : A
C : G
G : T
T : C

The relevant part of Bio::SeqIO::scf is here:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html#CODE9

... which indicates that it expects the pack()ed trace data to be in 
order ATGC. The base call parsing code is here:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SeqIO/scf.html#CODE8

... which is unpacking in order ACGT. As far as I can tell, the relevant 
official SCF documentation is here:

http://staden.sourceforge.net/manual/formats_unix_4.html

... which indicates that both trace and base order should be ACGT 
(matching the SeqIO unpack() for bases, but not traces). My empirical 
channel unscrambling mapping implies order ACTG, which is different from 
either of the two orders above. The sequence from the SCF file (should 
be that from original AB1 file, I think) is not perfectly identical to 
that called by phred, but is very similar (to be expected); that is, I 
don't need to remap C, G and T to get it to align with the phred data.

So it looks like the SeqIO module is not mapping the sections of the 
packed trace data to the appropriate bases. The unpack order is 
different than the staden documentation ... but so is the order I impose 
to correct the problem. I am still unclear as to the differences between 
V2 and V3 of the format. The major difference appears to be coding the 
trace absolutely (V2) or relatively to prior values (V3); I'd expect if 
I was using one format and SeqIO was trying to parse the other that I 
would get garbage out. Running in verbose reports "scf.pm is working 
with a version 2 scf."

Thoughts on this would be appreciated - can anyone confirm a problem 
with trace extraction from SCF?

I'm hoping that once I convince our admin to (properly) install 
staden::read that I can work directly with the ab1 files, but I need to 
stopgap on SCF for the time being....

-CAT



More information about the Bioperl-l mailing list