[Biopython] searching for a human chromosome position

Mon Jun 1 17:35:29 UTC 2009

On Mon, Jun 1, 2009 at 5:52 PM, dr goettel <biopythonlist at gmail.com> wrote:
> Wow,
> Thankyou very much!! Of course it's very usefull. You almost gave me the
> code.

Not really - my code was only the first step, which is the easy part,
working out which annotated gene might be affected by a possible SNP.
You next question shows where things start to get complicated...

> There's one thing I still don't get. I have access to everything I
> need but the coding frame to look at, I mean, with the code you sent I know
> the feature location (from 1129251 to 1175010 position). Since I just want
> to know if changing the nucleotide in the snp_sequence position would lead
> to a change of aminoacid, it would be enough to translate this portion of
> nucleotides and see if changing that position it also changes the aminoacid,
> but how should I proceed to translate that portion of adn? I mean what frame
> should I use?
> Does my question have meaning? maybe I'm loosing something.

This particular example, the CDS spans 1129251 to 1175010 - but you
need to remove the introns before translating it. Looking at the
GenBank entry for this feature:

     CDS             join(1129252..1129438,1148532..1148632,1149622..1149769,
                     1151957..1151988,1153184..1153291,1154387..1154519,
                     1157195..1157258,1158824..1158872,1159344..1159456,
                     1161056..1161173,1164662..1164761,1166976..1167172,
                     1173801..1173938,1174924..1175010)
                     /gene="PAK2"
                     ...
                     /protein_id="NP_002568.2"
                     /db_xref="GI:32483399"
                     /db_xref="CCDS:CCDS3321.1"
                     /db_xref="GeneID:5062"
                     /db_xref="HGNC:8591"
                     /db_xref="MIM:605022"

Doing this by email will probably mess up the formatting but I hope it
will still be clear. What I want you to focus on is the location
string, the bit that goes join(1129252..1129438,1148532..1148632,...)
and basically describes the exons. In some GenBank files, the features
also include the amino acid translation (but not in this case).

In this gene, the first exon is 1129252..1129438 (one based counting),
the second exon is 1148532..1148632, etc. This information is captured
in Biopython using child SeqFeature objects for each exon within the
parent feature for the CDS.  As here everything is on the forward
strand, we don't need to worry about taking the reverse complement.

You could look at the exon lengths, and where your SNP is, in order to
know which codon it is part of. This is complicated - your SNP could
be by a splice point so that part of the codon is in exon 2 and part
is in exon 3 (for example). Once you have the codon (and which of the
three positions is the SNP at), you can then tell if the SNP would be
a synonymous or non-synonymous change (would the amino acid change).
This whole approach seems tricky.

Alternatively, to get the coding sequence in python, you would extract
record.seq[1129251:1129438] for the first exon, then
record.seq[1148531:1148632] for the second exon, etc, and add them
together, and then do the translation. You could repeat this for a
"mutated" parent sequence, where the SNP position has been edited
(e.g. to an N), and compare the translations. This is not as elegant,
but might be the simplest approach.

Creating the mutated sequence from the original sequence is quite easy
using a MutableSeq object:

mut_seq = record.seq.tomutable() #makes an editable copy
mut_seq[snp_position] = "N" #make the SNP position into an N
mut_seq = mut_seq.toseq() #optional, make it read only

The other step, extracting a SeqFeature's sequence from the parent
sequence (or the mutated version of the parent sequence), isn't yet
built into Biopython. Have a look at the (development) mailing list
archives for some discussion on this (in the last month or two).

Finally, I've mentioned features on the reverse stand are a bit more
complicated, but things get even worse if there are any fuzzy
locations involved. e.g. NP_775742.3 also on Chromosome 3, where the
start of the gene is unclear.

Peter