[Biopython-dev] User-defined annotations in Stockholm alignment file

Tue Apr 5 07:47:10 UTC 2016

On Tue, Apr 5, 2016 at 5:21 AM, João Rodrigues
<j.p.g.l.m.rodrigues at gmail.com> wrote:
> Thanks Peter, but I'm not sure these issues relate to what I am looking for.
>
> I went a bit through the parser and the part that I actually need is to
> read/write custom keys in GS records. Specifically, I am parsing Stockholm
> files produced by HMMER (and looking to add some extra info to the resulting
> Alignment obj), which I just realized are not properly formatted because
> they contain multiple GS annotations in one line (see below).
>
>> # STOCKHOLM 1.0
>> #=GF ID sp|P00929|TRPA_SALTY-i5
>> #=GS sp|P00929|TRPA_SALTY          DE Tryptophan synthase alpha chain OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1 SV=1

That does look wrong; as you say, it should be multiple GS lines:

#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>

i.e.

#=GS sp|P00929|TRPA_SALTY DE Tryptophan synthase alpha chain
#=GS sp|P00929|TRPA_SALTY OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720)
#=GS sp|P00929|TRPA_SALTY GN=trpA PE=1
#=GS sp|P00929|TRPA_SALTY SV=1
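
Under that layout a GS line has three whitespace-separated fields, and
everything after the feature code is free text. A rough sketch of that
split, to show how the line from your HMMER file is read (parse_gs_line is
a hypothetical helper written for this email, not Biopython's parser):

```python
def parse_gs_line(line):
    """Split one '#=GS' line into (seqname, feature, text)."""
    if not line.startswith("#=GS "):
        raise ValueError("not a #=GS line: %r" % line)
    # Split on runs of whitespace at most twice, so the third field
    # (the free text) keeps its internal spaces.
    seqname, feature, text = line[5:].strip().split(None, 2)
    return seqname, feature, text


# The GS line from the HMMER output quoted above, as one physical line.
hmmer_line = (
    "#=GS sp|P00929|TRPA_SALTY          DE Tryptophan synthase alpha chain "
    "OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) "
    "GN=trpA PE=1 SV=1"
)

seqname, feature, text = parse_gs_line(hmmer_line)
print(seqname)
print(feature)
print(text)
```

Read this way, the whole tail (including the OS=, GN=, PE= and SV= parts)
ends up as the text of a single DE feature.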

Could you file a bug with the HMMER team?

As written, it fits the specification as a single DE (description) annotation
with some (odd) free text, and our parser is doing exactly what I expect:

>>
>> SeqRecord(seq=Seq('MERYENLFAQLNDR-REG-AFVPFVTLG-D--PGIEQSLKIIDTLIDAGADALE...SRA',
>> SingleLetterAlphabet()), id='sp|P00929|TRPA_SALTY',
>> name='sp|P00929|TRPA_SALTY', description='Tryptophan synthase alpha chain
>> OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1
>> SV=1', dbxrefs=[])
>
>
> I am genuinely surprised that HMMER outputs this weird format.

Me too.

> Would it be
> an option to verify if such formatting exists (regex?) in a GS line and if
> so break it accordingly, or is this a remote edge case and the added
> overhead is just too much?

Talk to the HMMER team first. If they have good reason and a citable
authority for this file format change, we could support it directly?
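
In the meantime, the kind of regex split you describe could be applied to
the parsed description on your side rather than inside the parser. A rough
sketch, assuming the extra fields always look like two-letter KEY=value
tokens as in your example (split_description is a hypothetical helper, not
part of Biopython):

```python
import re

# A two-letter uppercase key followed by "=", e.g. "OS=", "GN=", "PE=", "SV=",
# at the start of the string or after whitespace.
_FIELD = re.compile(r"(?:^|\s+)([A-Z]{2})=")


def split_description(description):
    """Split 'free text KEY=value KEY=value ...' into (text, {KEY: value})."""
    # With one capture group, re.split returns
    # [leading_text, key1, value1, key2, value2, ...].
    parts = _FIELD.split(description)
    text = parts[0].strip()
    fields = {
        key: value.strip()
        for key, value in zip(parts[1::2], parts[2::2])
    }
    return text, fields


description = (
    "Tryptophan synthase alpha chain OS=Salmonella typhimurium "
    "(strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1 SV=1"
)
text, fields = split_description(description)
print(text)
for key, value in fields.items():
    print(key, value)
```

This leaves record.description untouched unless you choose to overwrite it,
and a description without any KEY=value tokens comes back unchanged with an
empty dict.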

> My original question was why does AlignIO ignore "custom" annotations it
> doesn't know, while writing (StockholmIO, line 254)?

https://github.com/biopython/biopython/blob/master/Bio/AlignIO/StockholmIO.py#L254

Because as far as I know there is only a short list of accepted feature
types for the GS lines (from PFAM/RFAM). The associated comment about
this could have been prefixed with TODO. Do you have a strong use
case for custom annotations?
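
To make the question concrete, here is a rough sketch of what a writer
could do with unknown keys. The feature codes are the ones I know of from
the PFAM/RFAM documentation; format_gs_lines and its allow_custom flag are
hypothetical, made up for this email, and are not how StockholmIO is
written:

```python
# GS feature codes documented for PFAM/RFAM Stockholm files.
KNOWN_GS_FEATURES = {"AC", "DE", "DR", "OS", "OC", "LO"}


def format_gs_lines(seq_id, annotations, allow_custom=False):
    """Return '#=GS <seqname> <feature> <text>' lines for one sequence.

    annotations maps a feature code to its free text. Unknown codes are
    skipped unless allow_custom is True.
    """
    lines = []
    for feature, text in annotations.items():
        if feature not in KNOWN_GS_FEATURES and not allow_custom:
            continue  # unknown feature code: drop it silently
        lines.append("#=GS %s %s %s" % (seq_id, feature, text))
    return lines


annotations = {
    "DE": "Tryptophan synthase alpha chain",
    "GN": "trpA",  # not in the PFAM/RFAM list
}
strict = format_gs_lines("sp|P00929|TRPA_SALTY", annotations)
relaxed = format_gs_lines("sp|P00929|TRPA_SALTY", annotations, allow_custom=True)
print("\n".join(strict))
print("\n".join(relaxed))
```

With the default, only the DE line is written; with allow_custom=True the
GN line is written too, which other Stockholm readers may or may not
accept.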

Peter