[BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO

Peter biopython at maubp.freeserve.co.uk
Mon Jun 16 14:01:31 UTC 2008


I've recently had to deal with some contig files in the Ace format
(output by CAP3, but many assembly programs will produce this output).

We have a module for parsing Ace files in Biopython,
Bio.Sequencing.Ace, but I was wondering about integrating this into the
Bio.SeqIO or Bio.AlignIO framework:
http://www.biopython.org/wiki/SeqIO
http://www.biopython.org/wiki/AlignIO

I'd like to hear from anyone currently using Ace files, on how they
tend to treat the data - and if they think a SeqRecord or Alignment
based representation would be useful.

Each contig in an Ace file could be treated as a SeqRecord using the
consensus sequence.  The identifiers of each sub-sequence used to
build the consensus could be stored as database cross-references, or
perhaps we could store these as SeqFeatures describing which part of
the consensus they support.  This would then fit into Bio.SeqIO quite
well.
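
As a rough illustration of the SeqFeature idea, here is a minimal sketch
in plain Python that works out which part of the consensus each read
supports. The names ReadSpan and read_spans are hypothetical, not part of
Biopython. It assumes each read is described by its 1-based padded start
position on the consensus (as given on the AF lines, and possibly zero or
negative when the read overhangs the start), its padded length, and
whether it was complemented. It also assumes the coordinates are
continuous integers and ignores any quality clipping.

```python
from typing import NamedTuple


class ReadSpan(NamedTuple):
    """Hypothetical container: the consensus region one read supports."""
    name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: int  # +1 for uncomplemented, -1 for complemented


def read_spans(consensus_length, reads):
    """Convert (name, padded_start, padded_length, complemented) tuples
    into spans clipped to the consensus.

    padded_start is assumed to be 1-based, as in the Ace AF lines.
    """
    spans = []
    for name, padded_start, padded_length, complemented in reads:
        # Shift to 0-based and clip any overhang beyond the consensus.
        start = max(0, padded_start - 1)
        end = min(consensus_length, padded_start - 1 + padded_length)
        if start < end:  # skip reads that miss the consensus entirely
            strand = -1 if complemented else 1
            spans.append(ReadSpan(name, start, end, strand))
    return spans


# Made-up example: a 10 base consensus supported by three reads.
spans = read_spans(
    10,
    [
        ("read1", 1, 6, False),
        ("read2", -2, 8, True),   # overhangs the start of the consensus
        ("read3", 7, 6, False),   # overhangs the end of the consensus
    ],
)
for span in spans:
    print(span)
```

Each span could then become one SeqFeature on the consensus SeqRecord,
with the read name stored as the feature's identifier or as a qualifier.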

Alternatively, each contig could be treated as an alignment (with a
consensus) and integrated into Bio.AlignIO.  One drawback is that the
current generic alignment class expects every sequence to be the same
length, so we would have to pad the start and/or end of each sequence
with gaps.  However, if we did this (or created a more specialised
alignment class), the Ace file format would then fit into Bio.AlignIO
too.
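
To make the padding concrete, here is a minimal sketch in plain Python.
The function name pad_to_alignment is hypothetical, and it assumes each
row is given as a name, a 0-based offset relative to the start of the
consensus (negative if the read overhangs the start), and a sequence
which already contains any internal pads.

```python
def pad_to_alignment(rows, gap="-"):
    """Pad (name, offset, sequence) rows with leading and trailing gaps
    so that every row ends up the same length.

    offset is assumed to be 0-based relative to the consensus start.
    """
    # Leftmost and rightmost columns covered by any row.
    left = min(offset for _, offset, _ in rows)
    right = max(offset + len(seq) for _, offset, seq in rows)
    padded = []
    for name, offset, seq in rows:
        lead = gap * (offset - left)
        trail = gap * (right - offset - len(seq))
        padded.append((name, lead + seq + trail))
    return padded


# Made-up example: a consensus plus three reads, one overhanging the start.
rows = [
    ("consensus", 0, "ACGTACGTAC"),
    ("read1", 0, "ACGTAC"),
    ("read2", -3, "TTTACGTA"),
    ("read3", 6, "GTAC"),
]
padded = pad_to_alignment(rows)
for name, seq in padded:
    print(f"{name:10s} {seq}")
```

Note that the consensus itself gets padded when a read overhangs it,
which is part of why a more specialised alignment class might be a
better fit than gap padding.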

So, Ace users - would either (or both) of the above approaches make
sense for how you use the Ace contig files?

Thanks

Peter


