[Biopython] biopython module for variant descriptions?

David Merberg merbergd at gmail.com
Thu Nov 2 08:47:43 EDT 2023


I guess there is some relationship because they both deal with alterations
compared to a reference sequence.

To my knowledge, the Variant Call Format, or VCF, file is generally used in
the context of an NGS experiment. Generating a VCF file is the next step
after a BAM file. The BAM file contains the alignment of each sequencing
read to the reference, and the VCF file then summarizes the
differences.
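
To make the contrast with HGVS concrete, here is a minimal sketch of what one VCF data line carries. The example record is made up for illustration (it is not taken from any real file), and parse_vcf_line is a hypothetical helper, not a Biopython function. It reads only the first five of the fixed VCF columns (CHROM, POS, ID, REF, ALT).

```python
def parse_vcf_line(line):
    """Split one VCF data line and keep the first five fixed fields."""
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position on the reference
        "id": var_id,
        "ref": ref,             # reference allele
        "alt": alt.split(","),  # one or more alternate alleles
    }


# Made-up record: CHROM POS ID REF ALT QUAL FILTER INFO
EXAMPLE_LINE = "chrX\t1000\t.\tC\tT\t50\tPASS\t."
record = parse_vcf_line(EXAMPLE_LINE)
print(record["chrom"], record["pos"], record["ref"], record["alt"])
```

Note that the position here is a genomic coordinate, whereas the HGVS c. and p. descriptions count within a transcript or protein.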

The HGVS mutation description is usually used in a lower-throughput
context. For example, if you’re studying a disease known to be associated
with mutations in a specific gene, then you might describe the mutations
using the HGVS specification.

So, for example, cystic fibrosis can be caused by the G542X mutation (i.e.
Glycine 542 is changed to Termination) in the cystic fibrosis transmembrane
conductance regulator (CFTR). As another example, if you go to the gnomAD
database and search for the IDS gene (the gene mutated in Hunter Syndrome),
you get a table with many variants of this gene, e.g.:
c.1650T>C               p.Pro550Pro
c.1648C>T               p.Pro550Ser
c.1645A>G               p.Met549Val
c.1644G>T               p.Leu548Phe
c.1642T>C               p.Leu548Leu
c.1637A>G               p.Gln546Arg
c.1636C>T               p.Gln546Ter
c.1181-32_1181-16dup
c.1181-83_1181-73del

There are 1608 rows in this table for the IDS gene.
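
As a rough illustration of reading descriptions like the ones listed above, here is a minimal sketch that parses only the two simplest forms: a single-base coding-DNA substitution and a three-letter protein substitution. The regular expressions and the names parse_simple_hgvs and codon_number are assumptions made for this example; they cover a tiny fraction of the HGVS nomenclature, and anything else (such as the intronic dup/del entries) is returned as None.

```python
import re

# Single-base coding-DNA substitution, e.g. "c.1648C>T"
CDNA_SUB = re.compile(r"^c\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$")
# Three-letter protein substitution, e.g. "p.Gln546Ter"
PROT_SUB = re.compile(r"^p\.(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$")


def parse_simple_hgvs(description):
    """Return a dict for the two simple forms above, or None otherwise."""
    for kind, pattern in (("cdna_substitution", CDNA_SUB),
                          ("protein_substitution", PROT_SUB)):
        match = pattern.match(description)
        if match:
            return {
                "kind": kind,
                "pos": int(match["pos"]),
                "ref": match["ref"],
                "alt": match["alt"],
            }
    return None  # not handled by this sketch


def codon_number(cds_pos):
    """Map a 1-based coding-DNA position to its 1-based codon number."""
    return (cds_pos - 1) // 3 + 1


for text in ("c.1648C>T", "p.Pro550Ser", "c.1181-32_1181-16dup"):
    print(text, "->", parse_simple_hgvs(text))
```

The codon arithmetic agrees with the listed pairs: position 1648 falls in codon 550 and position 1636 in codon 546.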

If a new mutation is described in the literature, it will be (or at least
should be) specified in HGVS format. In many older papers that is not the case.

Some of the things you might want to do with these HGVS variant
descriptions are:
1. Given the standard (i.e. reference) sequence for a gene and a variant,
what is the sequence of the mutated gene?
2. Given the gene sequence and the HGVS description of the DNA change, what
is the protein change?
3. Given just the protein change, what are the possible DNA changes that
could cause it?
4. Given just the DNA change and reference sequence, is it a missense or
nonsense mutation?
5. Given a variant description, is it consistent with the reference
sequence? For example, in the CFTR case mentioned above, G542X is a mutation
found in the literature. If I am collecting data and I see a mutation
described as T542X, it is wrong: there is no T (threonine) at position 542
of CFTR. I would determine that by checking the CFTR sequence.
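
The operations in the list above can be sketched for the simplest case, a single-base substitution in a coding sequence. This is a minimal pure-Python sketch, not an existing Biopython API: the function names are hypothetical, TOY_CDS is an invented five-codon sequence rather than a real gene, the standard genetic code is assumed, and "*" stands for a stop (the "X" or "Ter" in the notation above).

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG-ordered table.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

TOY_CDS = "ATGGGACAGTTGTAA"  # invented example: M G Q L stop


def translate(cds):
    """Translate codon by codon; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def apply_substitution(cds, pos, ref, alt):
    """Items 1 and 5: check the reference base (1-based pos), then mutate."""
    if cds[pos - 1] != ref:
        raise ValueError(f"c.{pos}{ref}>{alt}: reference has {cds[pos - 1]}, not {ref}")
    return cds[:pos - 1] + alt + cds[pos:]


def classify_substitution(cds, pos, ref, alt):
    """Items 2 and 4: return (protein_change, consequence)."""
    mutated = apply_substitution(cds, pos, ref, alt)
    start = (pos - 1) // 3 * 3  # 0-based start of the affected codon
    old_aa = CODON_TABLE[cds[start:start + 3]]
    new_aa = CODON_TABLE[mutated[start:start + 3]]
    if new_aa == old_aa:
        consequence = "synonymous"
    elif new_aa == "*":
        consequence = "nonsense"
    elif old_aa == "*":
        consequence = "stop_lost"
    else:
        consequence = "missense"
    return f"{old_aa}{start // 3 + 1}{new_aa}", consequence


def single_base_routes(codon, target_aa):
    """Item 3: mutant codons one base away from `codon` encoding `target_aa`."""
    routes = []
    for offset in range(3):
        for base in "ACGT":
            candidate = codon[:offset] + base + codon[offset + 1:]
            if base != codon[offset] and CODON_TABLE[candidate] == target_aa:
                routes.append(candidate)
    return routes


print(translate(TOY_CDS))
print(classify_substitution(TOY_CDS, 7, "C", "T"))
print(single_base_routes("CAG", "*"))
```

Insertions, deletions, and frameshifts need more care than this, since they change every downstream codon; the sketch deliberately leaves them out.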

In general, I think of VCF as part of an NGS workflow, while HGVS is used
further downstream in structure-function and genotype-phenotype discussions.

I hope that helps clarify.

It would have helped me to find a Biopython module that would instantiate
classes and subclasses of mutations/variants and provide some basic
methods. I know that there are other scientists asking the same sorts of
questions, but I don’t know whether any are attempting to answer them by
writing Python programs.

Dave



On November 1, 2023 at 3:36:16 PM, Peter Cock (p.j.a.cock at googlemail.com)
wrote:

I don't think we have anything like this (yet). Are efforts like VCF
(variant call format) related but separate in your mind?

Peter

On Tue, Oct 31, 2023 at 7:31 PM David Merberg <merbergd at gmail.com> wrote:

> Hello biopython world,
>
> For my last job, I wrote some Python code to categorize and describe
> sequence changes of many types. I used Biopython to handle sequences and
> some basic functions like I/O and translation, but I did not find a module
> for reading variants/mutants and applying them to sequences.
>
> Some cases are trivial, but some are not. For example, a small deletion in
> the nucleotide sequence may have no effect on the amino acid corresponding
> to the position of the affected codon, but will affect downstream amino
> acids. Protein changes caused by deletions or insertions of 3, 6, 9 . . .
> nucleotides can also be tricky to calculate.
>
> My question is whether there is a biopython module to read variants in a
> standard format (see for example http://varnomen.hgvs.org/)? Along with
> the variant objects there could be a set of methods to operate on mutated
> sequences. Does the community think that this would be useful if it does
> not already exist?
>
> I implemented many functions for these sorts of operations, but I realized
> soon afterwards that there are probably better ways to do much of it. I
> always wanted to redo the work, but never had time. Now I have time, but am
> not at that job. If it would be useful to the community, I may be able to
> take it on as a contribution to biopython.
>
> A caveat is that I don’t have experience contributing to multi-developer
> projects. I try to write clean, well documented code and I’m familiar with
> the basics of git. So, it’s OK if you’d prefer that I start with something
> smaller (like unit tests or documentation). Just let me know.
>
> Dave Merberg