[Biopython-dev] Bio.Sequencing

Mon Jun 29 11:47:36 EDT 2009

Hi Peter,

2009/6/29 Peter Cock <p.j.a.cock at googlemail.com>

> Hi Cymon,
>
> I've checked in some of your patch on Bug 2865 already,
> recording the per-letter-annotation which I was planning to
> do but hadn't got round to yet - thank you:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2865
>
> This means with the latest code you can now use Biopython
> to convert a PHD output file into a FASTQ file (or a QUAL
> file) which could be handy for doing meta assemblies.

Yeah, that's nice. Conversely, the reason I wrote the Phd writer is that I
want to 'fake' some Phd files from FASTA and QUAL files - should/might be
possible by using the default headers and equally spaced peak locations. The
use-case is to fool Consed into displaying the trace (which it 'fakes') from
a 454 Mira assembly ACE file output, but which it will only do if the Phd
files are available. So I'm hoping to write the Phd files from the original
FASTA/QUAL input files. Not sure if this is going to work, or if its a
sensible thing to be trying...

> I did relatively recently update SeqIO for the Ace format to
> record the qualities - but there is an issue here. Only the
> nucleotides get given quality scores, but not the insertions
> (gaps, shown as "*" in the Ace file consensus sequence).
> Currently the Bio.SeqIO parser gives the gapped sequence.
> This means to record the quality scores, we need to give
> some null value to the gap characters (and I used None).
>
> What I am wondering about is making the Bio.SeqIO Ace
> parser just return the ungapped sequence (and the
> associated PHRED quality scores). This means we could
> then convert Ace files into FASTQ or QUAL files, and also
> a simple Ace to FASTA conversion would give something
> useful for downstream analysis (the ungapped consensus).
>
> The gaps *are* important if you want to see how the
> consensus was built up - in which case it makes sense to
> think about each Ace contig as a kind of multiple sequence
> alignment. See this earlier discussion with David Winter:
> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html
> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html
>
> Any thoughts?

I think it's probably unwise to return an ungapped sequence/qual by default
if the contig in the ACE assembly is gapped. It would be nice if the parser
had a switch ungapped=True, but thats not going to work with the SeqIO
interface. Second best option would be to have an easy way of getting the
ungapped SeqRecord from the gapped SeqRecord - a function somewhere in
Bio.Sequencing?

Anyway, I assume (havent checked) that currently if all the contigs are free
of gaps then the SeqIO.AceIO will parse them into an Ungapped alphabet which
can then be written to FASTA/QUAL etc. I think this is the right way to go,
if the contigs have gaps the user needs to decide how to deal with them
explicitly.

Cheers, C.
--