[Biopython] Struggling with MSA...

Sat Feb 17 02:24:38 EST 2024

Hi,
there are python bindings for the HMMER3 c library 
(https://github.com/althonos/pyhmmer), this could be of interest to you 
if you want to inspect the actual model itself.

Best
Marek

Ps.: Not affiliated with the library nor the HMMER.

Dne 2024-02-16 11:59, Dan Bolser napsal:
> I don't suppose there is a package for reading in the HMM? 😅
> 
> On Thu, 15 Feb 2024 at 16:18, Dan Bolser <dan.bolser at outsee.co.uk>
> wrote:
> 
>> Recreating the profile was just a sanity check. Really I want to
>> look at conservation across the protein of interest.
>> 
>> On Thu, 15 Feb 2024 at 16:13, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>> 
>>> You may have more joy starting from the HMM for the model?
>>> 
>>> 
>> 
> https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/PF08241?annotation=hmm
>>> 
>>> Peter
>>> 
>>> On Thu, Feb 15, 2024 at 4:06 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>> 
>>>> Searching HMMER gives me (cryptic?) details of the model match:
>>>> 
>>>> Family-Id Family-Accession Clan Start End Ali-Start Ali-End
>>> Model-Start Model-End Bit-Score Ind.-E-value Cond.-E-value
>>> Description
>>>> 
>>> 
>> 
> ==================================================================================================================================
>>>> 
>>>> Methyltransf_11 PF08241.15 CL0063 63 173 63 172 1 95 51.30
>>> 1.5e-13 3.1e-17 Methyltransferase domain
>>>> 
>>>> 
>>> 
>> 
> ---------------------------------------------------------------------------------------------------------------------------------
>>>> 
>>>> MODEL
>>> 
>> 
> LdvGcGtGrlaealakrg.arvvgvDlskemlklakekaseeglkvefvvadaeklpfednsfDlvvssevlhhv...e............dpekalkeiaRvLkpgGllv
>>>> MATCH  L +GcG+ +l+ +l   g  +v+ vD+s  ++++ +++    + ++++  +d++kl
>>> f+++sfD+v+ + +l+ +   e             ++++l+e+ RvL pgG+++
>>>> PPL
>>> 
>> 
> 569************7779*************77766666666.69*****************************8662667778888888******************97
>>>> SEQ
>>> 
>> 
> LVLGCGNSALSYELFLGGfPNVTSVDYSSVVVAAMQARHAHVP-QLRWETMDVRKLDFPSASFDVVLEKGTLDALlagErdpwtvssegvhTVDQVLSEVSRVLVPGGRFI
>>>> 
>>>> 
>>>> So I guess it's possible to work backwards from there...
>>>> 
>>>> 
>>>> 
>>>> On Thu, 15 Feb 2024 at 16:02, Peter Cock
>>> <p.j.a.cock at googlemail.com> wrote:
>>>>> 
>>>>> Sorry yes, I used this as a threshold for if a column was
>>> gappy, and
>>>>> likely to be in the model or not:
>>>>> 
>>>>> amino_acids.get("-", 0) < len(align)*0.5
>>>>> 
>>>>> Might need to use <= to match, or perhaps they use a more
>>>>> sophisticated criteria.
>>>>> 
>>>>> Peter
>>>>> 
>>>>> On Thu, Feb 15, 2024 at 3:56 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>> 
>>>>>> Sorry, by 'half gaps' I thought you meant a specific thing,
>>> like, "look at that half gap!". I guess you mean, columns where
>>> half of the values in the column are gaps.
>>>>>> 
>>>>>> Looking directly at the alignment, I guess there is no way to
>>> work out precisely which columns are gaps or not in the model?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, 15 Feb 2024 at 15:47, Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>>> 
>>>>>>> Hi Peter,
>>>>>>> 
>>>>>>> On Thu, 15 Feb 2024 at 15:28, Peter Cock
>>> <p.j.a.cock at googlemail.com> wrote:
>>>>>>>> 
>>>>>>>> Hello Dan,
>>>>>>>> 
>>>>>>>> It that is a probability at the end, you have the wrong
>>> denominator -
>>>>>>>> should be len(align) == sum(amino_acids.values()
>>>>>>> 
>>>>>>> 
>>>>>>> Makes no difference, they are the same (including -s). The
>>> profile reports -s as 'Insert Probability'.
>>>>>>> 
>>>>>>> 
>>>>>>>> Note this seed alignment has 164 columns, yet the model
>>> page says the
>>>>>>>> model has 95 columns. If I count the number of columns
>>> which are half
>>>>>>>> gaps that's pretty close...
>>>>>>> 
>>>>>>> 
>>>>>>> I don't understand what that means or implies... Is
>>> something wrong somewhere?
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Dan.
>>>>>>> 
>>>>>>> 
>>>>>>>> Peter
>>>>>>>> 
>>>>>>>> On Thu, Feb 15, 2024 at 3:09 PM Dan Bolser
>>> <dan.bolser at outsee.co.uk> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Sorry, I can't follow the docs (or find the right docs).
>>>>>>>>> 
>>>>>>>>> I've got the 'seed' stockholm alignment for this domain:
>>>>>>>>> 
>>> 
>> 
> https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/entry_alignments/?type=seed
>>>>>>>>> 
>>>>>>>>> and I'm trying to reproduce the signature it shows here:
>>>>>>>>> https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/logo/
>>>>>>>>> 
>>>>>>>>> I'm not sure a) why the probabilities differ in the
>>> profile relative to the seed alignment, or b) how to filter
>>> columns in the alignment by those that have a match in the model
>>> (see columns 4-6 in the alignment, which are gaps in the model).
>>>>>>>>> 
>>>>>>>>> I think if I can answer b) then the answer to a) will be,
>>> "look at the full alignment".
>>>>>>>>> 
>>>>>>>>> Here is my crude 'best guess' code:
>>>>>>>>> 
>>>>>>>>> import gzip
>>>>>>>>> import Bio.AlignIO
>>>>>>>>> 
>>>>>>>>> # msa = "PF08241.alignment.full.gz"
>>>>>>>>> msa = "PF08241.alignment.seed.gz"
>>>>>>>>> 
>>>>>>>>> with gzip.open(msa, "rt") as handle:
>>>>>>>>> align = Bio.AlignIO.read(handle, "stockholm")
>>>>>>>>> ncols = align.get_alignment_length()
>>>>>>>>> 
>>>>>>>>> for col in range(ncols):
>>>>>>>>> amino_acids = dict()
>>>>>>>>> for s in align[:, col]:
>>>>>>>>> amino_acids[s] = amino_acids.get(s, 0) + 1
>>>>>>>>> print(amino_acids)
>>>>>>>>> for s in amino_acids:
>>>>>>>>> print(f"{s}: {amino_acids[s]:3d} {amino_acids[s] /
>>> len(align):.3f}")
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I have the feeling I'm doin it rong...
>>>>>>>>> 
>>>>>>>>> The above is just a 'warm up', really I want to see the
>>> conservation score, base by base on a given protein in the
>>> alignment (where it matches the model).
>>>>>>>>> 
>>>>>>>>> Many thanks for any suggestions, and sorry for not being
>>> able to find the right document to answer these questions.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> kthxbi,
>>>>>>>>> Dan.
>>>>>>>>> _______________________________________________
>>>>>>>>> Biopython mailing list  -  Biopython at biopython.org
>>>>>>>>> https://mailman.open-bio.org/mailman/listinfo/biopython
> _______________________________________________
> Biopython mailing list  -  Biopython at biopython.org
> https://mailman.open-bio.org/mailman/listinfo/biopython

Upozorneni:
Neni-li v teto zprave vyslovne uvedeno jinak, ma tato E-mailova zprava nebo jeji prilohy pouze informativni charakter. Tato zprava ani jeji prilohy v zadnem ohledu ustavy AV CR, v.v.i. k nicemu nezavazuji. Text teto zpravy nebo jejich priloh neni navrhem na uzavreni smlouvy, ani prijetim pripadneho navrhu na uzavreni smlouvy, ani jinym pravnim jednanim smerujicim k uzavreni jakekoliv smlouvy a nezaklada predsmluvni odpovednost ustavu AV CR, v.v.i.

Disclaimer:
If not expressly stated otherwise, this e-mail message (including any attached files) is intended purely for informational purposes and does not represent a binding agreement on the part of Institutes of the Czech Academy of Sciences. The text of this message and its attachments cannot be considered as a proposal to conclude a contract, neither the acceptance of a proposal to conclude a contract, nor any other legal act leading to concluding any contract; nor does it create any pre-contractual liability on the part of Institutes of the Czech Academy of Sciences.