[Biopython] Struggling with MSA...

Dan Bolser dan.bolser at outsee.co.uk
Fri Feb 16 05:59:32 EST 2024


I don't suppose there is a package for reading in the HMM? 😅


On Thu, 15 Feb 2024 at 16:18, Dan Bolser <dan.bolser at outsee.co.uk> wrote:

> Recreating the profile was just a sanity check. Really I want to look at
> conservation across the protein of interest.
>
> On Thu, 15 Feb 2024 at 16:13, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>
>> You may have more joy starting from the HMM for the model?
>>
>> https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/PF08241?annotation=hmm
>>
>> Peter
>>
>> On Thu, Feb 15, 2024 at 4:06 PM Dan Bolser <dan.bolser at outsee.co.uk>
>> wrote:
>> >
>> > Searching HMMER gives me (cryptic?) details of the model match:
>> >
>> > Family-Id Family-Accession Clan Start End Ali-Start Ali-End Model-Start
>> Model-End Bit-Score Ind.-E-value Cond.-E-value Description
>> >
>> ==================================================================================================================================
>> >
>> > Methyltransf_11 PF08241.15 CL0063 63 173 63 172 1 95 51.30 1.5e-13
>> 3.1e-17 Methyltransferase domain
>> >
>> >
>> ---------------------------------------------------------------------------------------------------------------------------------
>> >
>> > MODEL
>> LdvGcGtGrlaealakrg.arvvgvDlskemlklakekaseeglkvefvvadaeklpfednsfDlvvssevlhhv...e............dpekalkeiaRvLkpgGllv
>> > MATCH  L +GcG+ +l+ +l   g  +v+ vD+s  ++++ +++    + ++++  +d++kl
>> f+++sfD+v+ + +l+ +   e             ++++l+e+ RvL pgG+++
>> > PPL
>> 569************7779*************77766666666.69*****************************8662667778888888******************97
>> > SEQ
>> LVLGCGNSALSYELFLGGfPNVTSVDYSSVVVAAMQARHAHVP-QLRWETMDVRKLDFPSASFDVVLEKGTLDALlagErdpwtvssegvhTVDQVLSEVSRVLVPGGRFI
>> >
>> >
>> > So I guess it's possible to work backwards from there...
>> >
>> >
>> >
>> > On Thu, 15 Feb 2024 at 16:02, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>> >>
>> >> Sorry yes, I used this as a threshold for if a column was gappy, and
>> >> likely to be in the model or not:
>> >>
>> >> amino_acids.get("-", 0) < len(align)*0.5
>> >>
>> >> Might need to use <= to match, or perhaps they use a more
>> >> sophisticated criteria.
>> >>
>> >> Peter
>> >>
>> >> On Thu, Feb 15, 2024 at 3:56 PM Dan Bolser <dan.bolser at outsee.co.uk>
>> wrote:
>> >> >
>> >> > Sorry, by 'half gaps' I thought you meant a specific thing, like,
>> "look at that half gap!". I guess you mean, columns where half of the
>> values in the column are gaps.
>> >> >
>> >> > Looking directly at the alignment, I guess there is no way to work
>> out precisely which columns are gaps or not in the model?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thu, 15 Feb 2024 at 15:47, Dan Bolser <dan.bolser at outsee.co.uk>
>> wrote:
>> >> >>
>> >> >> Hi Peter,
>> >> >>
>> >> >> On Thu, 15 Feb 2024 at 15:28, Peter Cock <p.j.a.cock at googlemail.com>
>> wrote:
>> >> >>>
>> >> >>> Hello Dan,
>> >> >>>
>> >> >>> It that is a probability at the end, you have the wrong
>> denominator -
>> >> >>> should be len(align) == sum(amino_acids.values()
>> >> >>
>> >> >>
>> >> >> Makes no difference, they are the same (including -s). The profile
>> reports -s as 'Insert Probability'.
>> >> >>
>> >> >>
>> >> >>> Note this seed alignment has 164 columns, yet the model page says
>> the
>> >> >>> model has 95 columns. If I count the number of columns which are
>> half
>> >> >>> gaps that's pretty close...
>> >> >>
>> >> >>
>> >> >> I don't understand what that means or implies... Is something wrong
>> somewhere?
>> >> >>
>> >> >>
>> >> >> Cheers,
>> >> >> Dan.
>> >> >>
>> >> >>
>> >> >>> Peter
>> >> >>>
>> >> >>> On Thu, Feb 15, 2024 at 3:09 PM Dan Bolser <
>> dan.bolser at outsee.co.uk> wrote:
>> >> >>> >
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > Sorry, I can't follow the docs (or find the right docs).
>> >> >>> >
>> >> >>> > I've got the 'seed' stockholm alignment for this domain:
>> >> >>> >
>> https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/entry_alignments/?type=seed
>> >> >>> >
>> >> >>> > and I'm trying to reproduce the signature it shows here:
>> >> >>> > https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/logo/
>> >> >>> >
>> >> >>> > I'm not sure a) why the probabilities differ in the profile
>> relative to the seed alignment, or b) how to filter columns in the
>> alignment by those that have a match in the model (see columns 4-6 in the
>> alignment, which are gaps in the model).
>> >> >>> >
>> >> >>> > I think if I can answer b) then the answer to a) will be, "look
>> at the full alignment".
>> >> >>> >
>> >> >>> > Here is my crude 'best guess' code:
>> >> >>> >
>> >> >>> > import gzip
>> >> >>> > import Bio.AlignIO
>> >> >>> >
>> >> >>> > # msa = "PF08241.alignment.full.gz"
>> >> >>> > msa = "PF08241.alignment.seed.gz"
>> >> >>> >
>> >> >>> > with gzip.open(msa, "rt") as handle:
>> >> >>> > align = Bio.AlignIO.read(handle, "stockholm")
>> >> >>> > ncols = align.get_alignment_length()
>> >> >>> >
>> >> >>> > for col in range(ncols):
>> >> >>> > amino_acids = dict()
>> >> >>> > for s in align[:, col]:
>> >> >>> > amino_acids[s] = amino_acids.get(s, 0) + 1
>> >> >>> > print(amino_acids)
>> >> >>> > for s in amino_acids:
>> >> >>> > print(f"{s}: {amino_acids[s]:3d} {amino_acids[s] /
>> len(align):.3f}")
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > I have the feeling I'm doin it rong...
>> >> >>> >
>> >> >>> > The above is just a 'warm up', really I want to see the
>> conservation score, base by base on a given protein in the alignment (where
>> it matches the model).
>> >> >>> >
>> >> >>> > Many thanks for any suggestions, and sorry for not being able to
>> find the right document to answer these questions.
>> >> >>> >
>> >> >>> >
>> >> >>> > kthxbi,
>> >> >>> > Dan.
>> >> >>> > _______________________________________________
>> >> >>> > Biopython mailing list  -  Biopython at biopython.org
>> >> >>> > https://mailman.open-bio.org/mailman/listinfo/biopython
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20240216/c7a9ebec/attachment.htm>


More information about the Biopython mailing list