[Biopython] Struggling with MSA...
marek.schwarz
marek.schwarz at biomed.cas.cz
Sat Feb 17 02:24:38 EST 2024
Hi,
there are python bindings for the HMMER3 c library
(https://github.com/althonos/pyhmmer), this could be of interest to you
if you want to inspect the actual model itself.
Best
Marek
Ps.: Not affiliated with the library nor the HMMER.
Dne 2024-02-16 11:59, Dan Bolser napsal:
> I don't suppose there is a package for reading in the HMM? 😅
>
> On Thu, 15 Feb 2024 at 16:18, Dan Bolser <dan.bolser at outsee.co.uk>
> wrote:
>
>> Recreating the profile was just a sanity check. Really I want to
>> look at conservation across the protein of interest.
>>
>> On Thu, 15 Feb 2024 at 16:13, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>>
>>> You may have more joy starting from the HMM for the model?
>>>
>>>
>>
> https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/PF08241?annotation=hmm
>>>
>>> Peter
>>>
>>> On Thu, Feb 15, 2024 at 4:06 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>
>>>> Searching HMMER gives me (cryptic?) details of the model match:
>>>>
>>>> Family-Id Family-Accession Clan Start End Ali-Start Ali-End
>>> Model-Start Model-End Bit-Score Ind.-E-value Cond.-E-value
>>> Description
>>>>
>>>
>>
> ==================================================================================================================================
>>>>
>>>> Methyltransf_11 PF08241.15 CL0063 63 173 63 172 1 95 51.30
>>> 1.5e-13 3.1e-17 Methyltransferase domain
>>>>
>>>>
>>>
>>
> ---------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> MODEL
>>>
>>
> LdvGcGtGrlaealakrg.arvvgvDlskemlklakekaseeglkvefvvadaeklpfednsfDlvvssevlhhv...e............dpekalkeiaRvLkpgGllv
>>>> MATCH L +GcG+ +l+ +l g +v+ vD+s ++++ +++ + ++++ +d++kl
>>> f+++sfD+v+ + +l+ + e ++++l+e+ RvL pgG+++
>>>> PPL
>>>
>>
> 569************7779*************77766666666.69*****************************8662667778888888******************97
>>>> SEQ
>>>
>>
> LVLGCGNSALSYELFLGGfPNVTSVDYSSVVVAAMQARHAHVP-QLRWETMDVRKLDFPSASFDVVLEKGTLDALlagErdpwtvssegvhTVDQVLSEVSRVLVPGGRFI
>>>>
>>>>
>>>> So I guess it's possible to work backwards from there...
>>>>
>>>>
>>>>
>>>> On Thu, 15 Feb 2024 at 16:02, Peter Cock
>>> <p.j.a.cock at googlemail.com> wrote:
>>>>>
>>>>> Sorry yes, I used this as a threshold for if a column was
>>> gappy, and
>>>>> likely to be in the model or not:
>>>>>
>>>>> amino_acids.get("-", 0) < len(align)*0.5
>>>>>
>>>>> Might need to use <= to match, or perhaps they use a more
>>>>> sophisticated criteria.
>>>>>
>>>>> Peter
>>>>>
>>>>> On Thu, Feb 15, 2024 at 3:56 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>>
>>>>>> Sorry, by 'half gaps' I thought you meant a specific thing,
>>> like, "look at that half gap!". I guess you mean, columns where
>>> half of the values in the column are gaps.
>>>>>>
>>>>>> Looking directly at the alignment, I guess there is no way to
>>> work out precisely which columns are gaps or not in the model?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 15 Feb 2024 at 15:47, Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On Thu, 15 Feb 2024 at 15:28, Peter Cock
>>> <p.j.a.cock at googlemail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Dan,
>>>>>>>>
>>>>>>>> It that is a probability at the end, you have the wrong
>>> denominator -
>>>>>>>> should be len(align) == sum(amino_acids.values()
>>>>>>>
>>>>>>>
>>>>>>> Makes no difference, they are the same (including -s). The
>>> profile reports -s as 'Insert Probability'.
>>>>>>>
>>>>>>>
>>>>>>>> Note this seed alignment has 164 columns, yet the model
>>> page says the
>>>>>>>> model has 95 columns. If I count the number of columns
>>> which are half
>>>>>>>> gaps that's pretty close...
>>>>>>>
>>>>>>>
>>>>>>> I don't understand what that means or implies... Is
>>> something wrong somewhere?
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dan.
>>>>>>>
>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On Thu, Feb 15, 2024 at 3:09 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Sorry, I can't follow the docs (or find the right docs).
>>>>>>>>>
>>>>>>>>> I've got the 'seed' stockholm alignment for this domain:
>>>>>>>>>
>>>
>>
> https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/entry_alignments/?type=seed
>>>>>>>>>
>>>>>>>>> and I'm trying to reproduce the signature it shows here:
>>>>>>>>> https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/logo/
>>>>>>>>>
>>>>>>>>> I'm not sure a) why the probabilities differ in the
>>> profile relative to the seed alignment, or b) how to filter
>>> columns in the alignment by those that have a match in the model
>>> (see columns 4-6 in the alignment, which are gaps in the model).
>>>>>>>>>
>>>>>>>>> I think if I can answer b) then the answer to a) will be,
>>> "look at the full alignment".
>>>>>>>>>
>>>>>>>>> Here is my crude 'best guess' code:
>>>>>>>>>
>>>>>>>>> import gzip
>>>>>>>>> import Bio.AlignIO
>>>>>>>>>
>>>>>>>>> # msa = "PF08241.alignment.full.gz"
>>>>>>>>> msa = "PF08241.alignment.seed.gz"
>>>>>>>>>
>>>>>>>>> with gzip.open(msa, "rt") as handle:
>>>>>>>>> align = Bio.AlignIO.read(handle, "stockholm")
>>>>>>>>> ncols = align.get_alignment_length()
>>>>>>>>>
>>>>>>>>> for col in range(ncols):
>>>>>>>>> amino_acids = dict()
>>>>>>>>> for s in align[:, col]:
>>>>>>>>> amino_acids[s] = amino_acids.get(s, 0) + 1
>>>>>>>>> print(amino_acids)
>>>>>>>>> for s in amino_acids:
>>>>>>>>> print(f"{s}: {amino_acids[s]:3d} {amino_acids[s] /
>>> len(align):.3f}")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have the feeling I'm doin it rong...
>>>>>>>>>
>>>>>>>>> The above is just a 'warm up', really I want to see the
>>> conservation score, base by base on a given protein in the
>>> alignment (where it matches the model).
>>>>>>>>>
>>>>>>>>> Many thanks for any suggestions, and sorry for not being
>>> able to find the right document to answer these questions.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> kthxbi,
>>>>>>>>> Dan.
>>>>>>>>> _______________________________________________
>>>>>>>>> Biopython mailing list - Biopython at biopython.org
>>>>>>>>> https://mailman.open-bio.org/mailman/listinfo/biopython
> _______________________________________________
> Biopython mailing list - Biopython at biopython.org
> https://mailman.open-bio.org/mailman/listinfo/biopython
Upozorneni:
Neni-li v teto zprave vyslovne uvedeno jinak, ma tato E-mailova zprava nebo jeji prilohy pouze informativni charakter. Tato zprava ani jeji prilohy v zadnem ohledu ustavy AV CR, v.v.i. k nicemu nezavazuji. Text teto zpravy nebo jejich priloh neni navrhem na uzavreni smlouvy, ani prijetim pripadneho navrhu na uzavreni smlouvy, ani jinym pravnim jednanim smerujicim k uzavreni jakekoliv smlouvy a nezaklada predsmluvni odpovednost ustavu AV CR, v.v.i.
Disclaimer:
If not expressly stated otherwise, this e-mail message (including any attached files) is intended purely for informational purposes and does not represent a binding agreement on the part of Institutes of the Czech Academy of Sciences. The text of this message and its attachments cannot be considered as a proposal to conclude a contract, neither the acceptance of a proposal to conclude a contract, nor any other legal act leading to concluding any contract; nor does it create any pre-contractual liability on the part of Institutes of the Czech Academy of Sciences.
More information about the Biopython
mailing list