[Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO

Tue Jun 17 08:46:22 UTC 2008

On Tue, Jun 17, 2008 at 8:35 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> My main use of the Alignment class is to parse Ace files. I've been thinking
> about that problem recently. My proposal to modify SeqRecord was due to this
> problem. I think that the best solution would be to treat the Alignment as a
> sequence. The consensus would be the actual sequences and the aligned read
> would be features with per-base-annotations.

So integrating the "ace" format into Bio.SeqIO, representing the
consensus sequence of each contig as a SeqRecord, would be useful.
Initially I would try and represent the aligned reads as SeqFeature
objects (much like when reading a genome from a GenBank file you get
CDS features with their amino acid translation).

Note that for memory reasons, I would be inclined to scan over the Ace
file in one pass (using the existing Iterator in the
Bio.Sequencing.Ace parser) returning SeqRecords as we go.  As Frank
points out in the code comments, this means we can't easily include
the WA, CT, RT and WR tags found in the Ace file footer.  Do you use
this information, Jose?
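
As a rough illustration of the one-pass idea, here is a simplified
stand-alone sketch. It is not the Bio.Sequencing.Ace parser: the
function name iter_consensus is made up for this example, and it only
assumes that each contig starts with a "CO" line followed by padded
consensus lines up to the next blank line. Everything else, including
any WA/CT/RT/WR tags at the end of the file, is skipped.

```python
import io


def iter_consensus(handle):
    """Yield (contig_name, padded_consensus) pairs in a single pass.

    Hypothetical helper for illustration only; reads, qualities and
    footer tags are ignored.
    """
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if name is None:
            # Outside a consensus block: wait for the next contig header.
            if line.startswith("CO "):
                name = line.split()[1]
                chunks = []
        elif line:
            # Inside a consensus block: collect sequence lines.
            chunks.append(line)
        else:
            # A blank line ends the consensus block.
            yield name, "".join(chunks)
            name = None
    if name is not None:
        # File ended without a trailing blank line.
        yield name, "".join(chunks)


# Tiny made-up example input, just enough to exercise the generator.
sample = io.StringIO(
    "AS 2 3\n\n"
    "CO Contig1 8 2 1 U\nACGT\nAC*G\n\n"
    "BQ\n 20 20\n\n"
    "CO Contig2 4 1 1 U\nTTGA\n\n"
    "WA{\nexample tag\n}\n"
)
contigs = list(iter_consensus(sample))
```

Because it is a generator, only one contig is held in memory at a
time, which is the point of scanning the file in a single pass.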

> I've implemented such a class
> and it works fine for me. In fact the Alignment class is just a wrapper
> around a standard SeqRecord (I name it RichSeq in my implementation).
> To do that you just need a SeqRecord with a __getitem__ method. You have
> already proposing that so that's not a problem.

Your enthusiasm, Jose, is one of the things motivating me to try and do
more with the Seq and SeqRecord.  Without a third party to offer
feedback, making big changes is risky.

> Padding with spaces is not an option when you're dealing with genomic wide
> alignments, that's one of the problems of the actual Alignment class.

It might make sense to talk about a "Contig Alignment" object/class,
compared to the existing "multiple sequence alignment" object/class
where all the sequences are the same length.  Ideally these should
provide as similar an API as possible - even if the internals are
different.  One idea is a sub-class of the current alignment class
which stores an offset (>=0) for each supporting read, used when
accessing columns.  Maybe we should check out BioPerl etc for
inspiration?
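
To make the offset idea concrete, here is a minimal sketch. It is a
stand-alone class rather than a real sub-class of the current
alignment class, so that it runs without Biopython. The class name
ContigAlignment and the add_read method are invented for this example;
get_alignment_length and get_column are named to resemble the existing
alignment API, but nothing here is existing Biopython code.

```python
class ContigAlignment:
    """Hypothetical sketch: each read is stored with a start offset
    instead of being padded out to the full alignment length."""

    def __init__(self):
        self._reads = []  # list of (name, offset, sequence) tuples

    def add_read(self, name, sequence, offset=0):
        """Add a read starting at the given column (offset >= 0)."""
        if offset < 0:
            raise ValueError("offset must be >= 0")
        self._reads.append((name, offset, sequence))

    def get_alignment_length(self):
        """Return the column just past the right-most read end."""
        return max((off + len(seq) for _, off, seq in self._reads), default=0)

    def get_column(self, col, gap=" "):
        """Return one column as a string, using gap where a read
        does not cover that column."""
        if not 0 <= col < self.get_alignment_length():
            raise IndexError("column index out of range")
        letters = []
        for _, off, seq in self._reads:
            idx = col - off  # position within this read
            letters.append(seq[idx] if 0 <= idx < len(seq) else gap)
        return "".join(letters)


# Made-up reads for illustration.
aln = ContigAlignment()
aln.add_read("read1", "ACGTAC", offset=0)
aln.add_read("read2", "GTACGG", offset=2)
aln.add_read("read3", "CGG", offset=5)
```

The padding is only generated when a column is requested, so memory
use scales with the total read length rather than with the number of
reads times the contig length.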

> If you want I can send my implementation to the list, although it could take a
> while because I've got my home computer dead.

Good luck with the broken computer - I hope you have an easier time
fixing it / rebuilding it than I did last time this happened to me.

Regards,

Peter