[emboss-dev] EMBOSS and its FASTA like alignment output

Tue Jul 21 12:06:43 UTC 2009

Peter wrote:
> Hi,
> 
> One of the many things I talked to Peter Rice about in Sweden
> was the Pearson FASTA like output from needle and water (e.g.
> what EMBOSS calls the markx10 output format), and why it
> includes the EMBOSS header and footer lines (which start with
> a # character), which are not present in real FASTA output.
> 
> Biopython can parse the pairwise -m 10 output from Bill
> Pearson's FASTA tools, so in theory we (Biopython) should
> be able to parse the markx10 output from EMBOSS needle
> and water. We could probably cope with the extra header
> and footer, but I think it would be best if EMBOSS could
> produce something more closely matching the real FASTA
> output. Unfortunately, it appears to be more than just the
> headers which upset our parser - even ignoring them,
> EMBOSS markx10 output still looks rather different to
> (current) FASTA -m 10 output. Was the markx10 output
> mimicking a particular (old) version of the FASTA tools?

The source code documentation refers to FASTA 3.4, which is probably
the last version whose alignment output I looked at in detail.

Can you send us some example files so we can check for the significant
differences?
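
As a starting point for that comparison, here is a minimal sketch in
Python. It drops the lines that start with a # character and diffs what
is left against FASTA -m 10 output. The function names, file labels and
toy input lines are made up for illustration, and the sketch assumes
that no line in the alignment body itself starts with #.

```python
import difflib


def strip_hash_lines(lines):
    """Return the lines that do not start with '#'."""
    return [line for line in lines if not line.startswith("#")]


def body_diff(emboss_lines, fasta_lines):
    """Unified diff of the EMBOSS body (minus '#' lines) against FASTA output."""
    return list(
        difflib.unified_diff(
            strip_hash_lines(emboss_lines),
            list(fasta_lines),
            fromfile="emboss_markx10",  # label only, not a real file name
            tofile="fasta_m10",         # label only, not a real file name
        )
    )


# Toy placeholder lines, NOT real output from either program.
toy_emboss = ["# header\n", "line one\n", "line two\n", "# footer\n"]
toy_fasta = ["line one\n", "line 2\n"]

print("".join(body_diff(toy_emboss, toy_fasta)))
```

With real files, the two lists would come from readlines() on the
needle or water output and on the FASTA output for the same pair of
sequences.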

We plan to install all the bio* projects, so it would be helpful to have
a set of Biopython parser scripts we can use to test locally. We can add
them to our routine QA tests and flag up changes as soon as they appear.

> Peter R. did say it would be simple to turn off this header and
> footer output, so I thought I would try this myself. It looks like
> this is handled in file ajax/ajalign.c by function alignWriteMark,
> but I don't see a switch to disable the headers and footers.

You correctly found how to turn off the header. The footer is reported
for anything except pure sequence output.

For the next release I will add attributes to the list of alignment
formats to say whether the header and footer are needed. That will give
us better control and reporting.

Meanwhile, we are very happy to standardise the markx* outputs to make
them easier to parse. Biopython is the first project to report problems
with this. There are alternatives - specifying -aformat and using some
other alignment format for all applications - but we like to conform and
will do our best to fit what parsers expect.
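
For anyone scripting the -aformat alternative, a rough sketch in Python
of building the needle command line follows. The helper name, the file
names and the gap penalty values are assumptions chosen for
illustration; check the option names against the needle documentation
for your EMBOSS release before relying on them.

```python
def needle_command(asequence, bsequence, outfile, aformat="markx10"):
    """Build an argument list for needle with an explicit alignment format.

    Hypothetical helper: the gap penalties below are illustrative values,
    supplied so that needle does not have to prompt for them.
    """
    return [
        "needle",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-gapopen", "10.0",
        "-gapextend", "0.5",
        "-outfile", outfile,
        "-aformat", aformat,
    ]


# Placeholder file names; pass the list to subprocess.run() to execute it.
cmd = needle_command("a.fasta", "b.fasta", "out.txt", aformat="markx10")
print(" ".join(cmd))
```

Passing a different format name as aformat is the only change needed to
switch every run to another alignment format.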

Also, of course, once we know we are being parsed we will do our best
not to let the output change.

regards,

Peter Rice