[Biopython] Struggling with MSA...

Thu Feb 15 11:02:06 EST 2024

Sorry yes, I used this as a threshold for if a column was gappy, and
likely to be in the model or not:

amino_acids.get("-", 0) < len(align)*0.5

Might need to use <= to match, or perhaps they use a more
sophisticated criteria.

Peter

On Thu, Feb 15, 2024 at 3:56 PM Dan Bolser <dan.bolser at outsee.co.uk> wrote:
>
> Sorry, by 'half gaps' I thought you meant a specific thing, like, "look at that half gap!". I guess you mean, columns where half of the values in the column are gaps.
>
> Looking directly at the alignment, I guess there is no way to work out precisely which columns are gaps or not in the model?
>
>
>
>
>
>
> On Thu, 15 Feb 2024 at 15:47, Dan Bolser <dan.bolser at outsee.co.uk> wrote:
>>
>> Hi Peter,
>>
>> On Thu, 15 Feb 2024 at 15:28, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>> Hello Dan,
>>>
>>> It that is a probability at the end, you have the wrong denominator -
>>> should be len(align) == sum(amino_acids.values()
>>
>>
>> Makes no difference, they are the same (including -s). The profile reports -s as 'Insert Probability'.
>>
>>
>>> Note this seed alignment has 164 columns, yet the model page says the
>>> model has 95 columns. If I count the number of columns which are half
>>> gaps that's pretty close...
>>
>>
>> I don't understand what that means or implies... Is something wrong somewhere?
>>
>>
>> Cheers,
>> Dan.
>>
>>
>>> Peter
>>>
>>> On Thu, Feb 15, 2024 at 3:09 PM Dan Bolser <dan.bolser at outsee.co.uk> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Sorry, I can't follow the docs (or find the right docs).
>>> >
>>> > I've got the 'seed' stockholm alignment for this domain:
>>> > https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/entry_alignments/?type=seed
>>> >
>>> > and I'm trying to reproduce the signature it shows here:
>>> > https://www.ebi.ac.uk/interpro/entry/pfam/PF08241/logo/
>>> >
>>> > I'm not sure a) why the probabilities differ in the profile relative to the seed alignment, or b) how to filter columns in the alignment by those that have a match in the model (see columns 4-6 in the alignment, which are gaps in the model).
>>> >
>>> > I think if I can answer b) then the answer to a) will be, "look at the full alignment".
>>> >
>>> > Here is my crude 'best guess' code:
>>> >
>>> > import gzip
>>> > import Bio.AlignIO
>>> >
>>> > # msa = "PF08241.alignment.full.gz"
>>> > msa = "PF08241.alignment.seed.gz"
>>> >
>>> > with gzip.open(msa, "rt") as handle:
>>> > align = Bio.AlignIO.read(handle, "stockholm")
>>> > ncols = align.get_alignment_length()
>>> >
>>> > for col in range(ncols):
>>> > amino_acids = dict()
>>> > for s in align[:, col]:
>>> > amino_acids[s] = amino_acids.get(s, 0) + 1
>>> > print(amino_acids)
>>> > for s in amino_acids:
>>> > print(f"{s}: {amino_acids[s]:3d} {amino_acids[s] / len(align):.3f}")
>>> >
>>> >
>>> >
>>> > I have the feeling I'm doin it rong...
>>> >
>>> > The above is just a 'warm up', really I want to see the conservation score, base by base on a given protein in the alignment (where it matches the model).
>>> >
>>> > Many thanks for any suggestions, and sorry for not being able to find the right document to answer these questions.
>>> >
>>> >
>>> > kthxbi,
>>> > Dan.
>>> > _______________________________________________
>>> > Biopython mailing list  -  Biopython at biopython.org
>>> > https://mailman.open-bio.org/mailman/listinfo/biopython