[BioPython] AlignIO: Sequences of different length

Peter biopython at maubp.freeserve.co.uk
Tue Dec 9 11:42:34 UTC 2008


On Tue, Dec 9, 2008 at 11:25 AM, João Rodrigues <anaryin at gmail.com> wrote:
>> > Using the web versions, there may be some workarounds. If you convert
>> > the format to one of the others, you may get a usable one for Biopython.
>>
>> If you just want the alignment itself, using FASTA as the output
>> format from needle is very simple.
>>
>> e.g.
>>
>> $ needle one.fasta two.fasta --auto --filter -aformat fasta
>> ...
>
> Yep, but the web version doesn't offer that format. I don't know why.

A strange omission on their part.

>> > I tried markx1 I believe, and it was "almost" parsable, it just didn't
>> > get the correct sequences (if you deleted everything BUT the
>> > sequences, it would work).
>>
>> How were you trying to parse the markx1 output?
>>
>> Note that the EMBOSS markx10 output is similar to, but differs from,
>> the FASTA -m 10 output (which Biopython can parse as the "fasta-m10"
>> format in Bio.AlignIO).
>
> I tried with FASTA as the argument for the parser, because the description
> said:
> "This is the standard default output format used by Bill Pearson's suite of
> FASTA programs."
>
> And btw, it was markx0, not markx1. A typo last night.

The various EMBOSS output formats are described here,
http://emboss.sourceforge.net/docs/themes/AlignFormats.html

The outputs markx0, markx1, markx2, markx3 and markx10 are EMBOSS
*imitations* of the FASTA tool's output formats (but with the addition
of EMBOSS-style headers and footers).  Right now, Biopython doesn't
parse any of these.

In Biopython's Bio.AlignIO, "fasta" refers to the FASTA input file
format (the simple file format using greater than signs for each new
sequence).  The only FASTA output format we support is "fasta-m10"
which is how we refer to the output from FASTA's -m 10 command line
argument.
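
To make the distinction concrete, here is a minimal sketch of what the
"fasta" input format looks like, using a small hand-written reader.
The function name read_fasta_records and the example sequences are made
up for illustration, and this is not how Biopython implements it; in
Biopython itself you would call Bio.AlignIO.read(handle, "fasta").

```python
from io import StringIO


def read_fasta_records(handle):
    """Return a list of (identifier, sequence) tuples from FASTA text.

    Hypothetical helper for illustration only, not Biopython's parser.
    """
    records = []
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A greater than sign starts a new record; save the previous one.
            if name is not None:
                records.append((name, "".join(chunks)))
            # The identifier is the first word after the ">".
            name = line[1:].split(None, 1)[0] if line[1:].strip() else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Made-up two-sequence alignment, gaps shown as "-".
example = """>one
ACGT-ACGT
>two
ACGTTACG-
"""

records = read_fasta_records(StringIO(example))
for name, seq in records:
    print(name, len(seq), seq)
```

For an alignment, every sequence in such a file should come out the
same length once the gap characters are counted.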

Right now, the Biopython FASTA m10 parser can't cope with the EMBOSS
markx10 format.  It might be nice if it did, but given we can parse
EMBOSS's default output (as the "emboss" format in Bio.AlignIO) this
doesn't seem like a big issue.

>> > So, I think there should at least be a warning somewhere for the
>> > users so that they don't go nuts or start reporting bugs :)
>>
>> Do you mean a warning about trying to use Bio.AlignIO with the
>> "emboss" format to read output from old versions of EMBOSS needle
>> tool?
>
> Well, it may be frustrating for someone who's using that webservice to try
> and parse the output and get that error. It might be useful, for example,
> to mention when such an error occurs that it might be due to using the
> web version. Just a small appendix to the error message, for example.

So instead of "Error parsing alignment - sequences of different
length?" it could say "Error parsing alignment - sequences of
different length?  Possibly you are using an old version of EMBOSS."
That should help.
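
As a rough sketch of that change (this is not the actual Bio.AlignIO
code, and the function name check_alignment_lengths is invented here),
the check could look something like this:

```python
def check_alignment_lengths(sequences):
    """Raise ValueError if the aligned sequences differ in length.

    Hypothetical helper; Biopython's real parser is structured differently.
    """
    lengths = set(len(seq) for seq in sequences)
    if len(lengths) > 1:
        raise ValueError(
            "Error parsing alignment - sequences of different length?  "
            "Possibly you are using an old version of EMBOSS."
        )


# Same length (gaps included), so no error is raised.
check_alignment_lengths(["ACGT-A", "ACGTTA"])

# Different lengths, so the extended message is shown.
try:
    check_alignment_lengths(["ACGT-A", "ACGT"])
except ValueError as err:
    message = str(err)
    print(message)
```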

As an aside, do you mind me asking why you are using needle via a
webservice?  If you expect to do lots of alignments, surely running it
locally would be faster and more reliable (no network issues to worry
about)?

Peter



