[Biopython-dev] User-defined annotations in Stockholm alignment file
Peter Cock
p.j.a.cock at googlemail.com
Tue Apr 5 07:47:10 UTC 2016
On Tue, Apr 5, 2016 at 5:21 AM, João Rodrigues
<j.p.g.l.m.rodrigues at gmail.com> wrote:
> Thanks Peter, but I'm not sure these issues relate to what I am looking for.
>
> I went a bit through the parser and the part that I actually need is to
> read/write custom keys in GS records. Specifically, I am parsing Stockholm
> files produced by HMMER (and looking to add some extra info to the resulting
> Alignment obj), which I just realized are not properly formatted because
> they contain multiple GS annotations in one line (see below).
>
>> # STOCKHOLM 1.0
>> #=GF ID sp|P00929|TRPA_SALTY-i5
>> #=GS sp|P00929|TRPA_SALTY DE Tryptophan synthase alpha chain
>> OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1
>> SV=1
That does look wrong. As you say, it should be multiple GS lines:
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
i.e.
#=GS sp|P00929|TRPA_SALTY DE Tryptophan synthase alpha chain
#=GS sp|P00929|TRPA_SALTY OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720)
#=GS sp|P00929|TRPA_SALTY GN=trpA PE=1
#=GS sp|P00929|TRPA_SALTY SV=1
Could you file a bug with the HMMER team?
As written, it fits the specification as a single DE (description) annotation
with some (odd) free text, and our parser is doing exactly what I expect:
>>
>> SeqRecord(seq=Seq('MERYENLFAQLNDR-REG-AFVPFVTLG-D--PGIEQSLKIIDTLIDAGADALE...SRA',
>> SingleLetterAlphabet()), id='sp|P00929|TRPA_SALTY',
>> name='sp|P00929|TRPA_SALTY', description='Tryptophan synthase alpha chain
>> OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1
>> SV=1', dbxrefs=[])
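To illustrate why that line reads as a single DE annotation, here is a minimal sketch of splitting a GS line into the three fields of the template above. It is not Biopython's actual parser; the helper name parse_gs_line is hypothetical, and the sample line assumes the HMMER output was one physical line before the email wrapped it.

```python
def parse_gs_line(line):
    """Split '#=GS <seqname> <feature> <free text>' into its three fields."""
    if not line.startswith("#=GS "):
        raise ValueError("not a GS line: %r" % line)
    # Split on whitespace at most twice, so the free text keeps its spaces.
    parts = line[5:].strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError("expected seqname, feature and text: %r" % line)
    seqname, feature, text = parts
    return seqname, feature, text


# The HMMER line from the report (assumed to be one line in the real file).
line = (
    "#=GS sp|P00929|TRPA_SALTY DE Tryptophan synthase alpha chain "
    "OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) "
    "GN=trpA PE=1 SV=1"
)
seqname, feature, text = parse_gs_line(line)
print(feature)  # the only feature code on the line
print(text)     # everything after it, including the OS=/GN=/PE=/SV= parts
```

Everything after the feature code is free text, so the OS=, GN=, PE= and SV= parts simply end up inside the description.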
>
>
> I am genuinely surprised that HMMER outputs this weird format.
Me too.
> Would it be
> an option to verify if such formatting exists (regex?) in a GS line and if
> so break it accordingly, or is this a remote edge case and the added
> overhead is just too much?
Talk to the HMMER team first. If they have a good reason and a citable
authority for this file format change, we could support it directly.
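In the meantime, the regex idea can be tried in user code after parsing. The following is only a rough sketch: the helper split_uniprot_description is hypothetical, and it assumes the description carries exactly the UniProt-style keys seen in the example (OS, GN, PE, SV), each preceded by whitespace.

```python
import re

# Keys seen in the example description; this list is an assumption.
_KEY_RE = re.compile(r"(?:^|\s+)(OS|GN|PE|SV)=")


def split_uniprot_description(text):
    """Return (plain description, dict of KEY -> value) for a DE string."""
    # With a capturing group, re.split returns
    # [before, key1, value1, key2, value2, ...].
    parts = _KEY_RE.split(text)
    description = parts[0].strip()
    fields = {}
    for key, value in zip(parts[1::2], parts[2::2]):
        fields[key] = value.strip()
    return description, fields


description, fields = split_uniprot_description(
    "Tryptophan synthase alpha chain OS=Salmonella typhimurium "
    "(strain LT2 / SGSC1412 / ATCC 700720) GN=trpA PE=1 SV=1"
)
print(description)
print(fields)
```

Doing this on the record's description after parsing leaves the parser untouched, so it adds no overhead for files that follow the specification.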
> My original question was why does AlignIO ignore "custom" annotations it
> doesn't know, while writing (StockholmIO, line 254)?
https://github.com/biopython/biopython/blob/master/Bio/AlignIO/StockholmIO.py#L254
Because, as far as I know, only a short list of accepted feature types
exists for the GS lines (from PFAM/RFAM). The associated comment about
this could have been prefixed with TODO - do you have a strong use
case for custom annotations?
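For discussion, here is a sketch of what an opt-in pass-through for custom features might look like when writing. This is not the code in StockholmIO.py; the function format_gs_lines and its allow_custom flag are hypothetical, and the list of feature codes is the Pfam/Rfam set as I recall it, so it should be checked against their documentation.

```python
# GS feature codes from the Pfam/Rfam conventions (assumed list, please verify).
KNOWN_GS_FEATURES = ("AC", "DE", "DR", "OS", "OC", "LO")


def format_gs_lines(seqname, annotations, allow_custom=False):
    """Build '#=GS <seqname> <feature> <text>' lines from a dict."""
    lines = []
    for feature, text in annotations.items():
        if feature not in KNOWN_GS_FEATURES and not allow_custom:
            # Unknown features are skipped unless the caller opts in.
            continue
        lines.append("#=GS %s %s %s" % (seqname, feature, text))
    return lines


annotations = {"DE": "Tryptophan synthase alpha chain", "XX": "my custom note"}
default_lines = format_gs_lines("sp|P00929|TRPA_SALTY", annotations)
custom_lines = format_gs_lines("sp|P00929|TRPA_SALTY", annotations, allow_custom=True)
print("\n".join(custom_lines))
```

Keeping the default strict would preserve the current output, while the flag lets a caller with a real use case write extra features explicitly.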
Peter