[Biopython] searching for a human chromosome position

Mon Jun 1 10:54:35 UTC 2009

On Fri, May 29, 2009 at 10:36 AM, dr goettel <biopythonlist at gmail.com> wrote:
> Hello,
> I am new to Biopython and, after reading the documentation, I'd like some
> guidance on solving one "simple" thing.
> Given the number of a human chromosome, the position of a nucleotide, and
> the nucleotide that should be at this position, I want to look up that
> position and determine whether there has been a mutation and whether that
> mutation produces an amino acid change. I suppose that first of all I
> have to query a genome database(?) using the Entrez module and retrieve the
> sequence containing this base. Then I suppose I have to look at the
> translations of this sequence, work out the most probable reading frame,
> and then see whether there is an amino acid change or not.
>
> Please could anybody send some clues for querying the database and finding
> the most probable reading frame for translation to protein (in case this
> is a good workflow for solving this particular problem)?
>
> Thank you very much.
> d

I don't think your task is "simple".

Given a human chromosome (e.g. as a FASTA or GenBank file from the
NCBI) and a location on it, you can easily use Biopython to extract
that position (or region).
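
As a rough sketch of that extraction step: the tiny FASTA record below is
made up so the example runs without downloading anything, and base_at is
a hypothetical helper name, not part of Biopython. The main thing to watch
is that chromosome positions are normally quoted counting from one, while
Python indexing counts from zero.

```python
from io import StringIO

from Bio import SeqIO

# Made-up two-line FASTA record standing in for a real chromosome file.
fasta_text = ">chr_demo hypothetical example\nACGTACGTAC\nGGTTAACCGG\n"


def base_at(record, position):
    """Return the base at a 1-based position on the record (hypothetical helper)."""
    if not 1 <= position <= len(record.seq):
        raise ValueError("position outside the sequence")
    # Convert from 1-based biological counting to 0-based Python indexing.
    return str(record.seq[position - 1])


# With a real download you would pass the filename instead of a StringIO handle.
record = SeqIO.read(StringIO(fasta_text), "fasta")

print(base_at(record, 11))    # single base at 1-based position 11
print(str(record.seq[8:12]))  # region covering 1-based positions 9 to 12
```

Comparing the returned base with the nucleotide you expected only tells you
whether your expectation matches this particular reference sequence.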

You could also look at the provided annotation in the GenBank file to
see if the location falls within a gene CDS, and thus if a mutation at
that position would cause an amino acid change. Note that because in
humans you have introns/exons to worry about, this is actually quite
complicated! (If you don't want to use the existing annotation, you
would have to do your own gene finding, which is even more
complicated.)
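
To give an idea of what using the annotation involves, here is a minimal
sketch assuming a CDS described as a join of two exons. The toy sequence,
the exon coordinates and the function name codon_effect are all invented
for illustration; a real CDS feature would come from the parsed GenBank
file. The idea is to map the genomic position onto the spliced coding
sequence, then translate the affected codon before and after the change.
Only the forward strand case is exercised here.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Toy "chromosome": 2 bp flank, exon 1 (ATGGC), a 6 bp intron,
# exon 2 (TTGGTAA), 2 bp flank. The spliced CDS is ATG GCT TGG TAA.
genome = Seq("CCATGGCGTAAGTTTGGTAACC")
cds = SeqFeature(
    CompoundLocation(
        [FeatureLocation(2, 7, strand=+1), FeatureLocation(13, 20, strand=+1)]
    ),
    type="CDS",
)


def codon_effect(genome, cds, position, new_base):
    """Return (reference amino acid, mutated amino acid), or None outside the CDS.

    position is 1-based on the chromosome; new_base is given on the
    forward (reference) strand. Hypothetical helper, not a Biopython function.
    """
    index = position - 1
    if index not in cds.location:
        return None  # intron or intergenic: no codon to compare
    # Iterating over the location yields genomic indices in coding order.
    offset = list(cds.location).index(index)
    coding = str(cds.extract(genome))
    if cds.location.strand == -1:
        new_base = str(Seq(new_base).complement())
    mutated = coding[:offset] + new_base + coding[offset + 1:]
    start = offset - offset % 3  # first base of the affected codon
    before = str(Seq(coding[start:start + 3]).translate())
    after = str(Seq(mutated[start:start + 3]).translate())
    return before, after


print(codon_effect(genome, cds, 14, "C"))  # third base of a codon split by the intron
print(codon_effect(genome, cds, 6, "A"))   # first base of the same codon
print(codon_effect(genome, cds, 10, "A"))  # position inside the intron
```

This ignores splice sites, overlapping genes and alternative transcripts,
which is part of why the real problem is harder than it looks.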

You could manually download the complete chromosomes from the NCBI FTP
site below. I would get the GenBank files (which will need uncompressing):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/

If you have a location, you will need to check which version of the
chromosome it refers to, because coordinates differ between versions.
Note that there are three versions of the human chromosomes available
on the above FTP site, and there will be lots more soon from the 1000
Genomes Project. You could search Entrez for the human chromosome, but
make sure you get the right version for your location! I would probably
do this manually (not in a script).

If you parse the GenBank file using Bio.SeqIO, the gene annotations
will be stored as SeqFeature objects. Have a look in the tutorial, and
also this page for some tips on dealing with these:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/
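
For example, a sketch along these lines lists the CDS features covering a
given position. The record is built by hand here so the example is self
contained; the gene names, coordinates and the helper cds_hits are made
up. With a real file the record would come from Bio.SeqIO instead, as
noted in the comment.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hand-built stand-in for a parsed GenBank record. With a real file:
#   from Bio import SeqIO
#   record = SeqIO.read("chromosome.gbk", "genbank")  # hypothetical filename
record = SeqRecord(Seq("N" * 100), id="demo")
record.features = [
    SeqFeature(
        CompoundLocation(
            [FeatureLocation(9, 20, strand=+1), FeatureLocation(39, 50, strand=+1)]
        ),
        type="CDS",
        qualifiers={"gene": ["demoA"]},
    ),
    SeqFeature(FeatureLocation(60, 90, strand=-1), type="gene",
               qualifiers={"gene": ["demoB"]}),
    SeqFeature(FeatureLocation(60, 90, strand=-1), type="CDS",
               qualifiers={"gene": ["demoB"]}),
]


def cds_hits(record, position):
    """List (gene, start, end) for CDS features whose exons cover a 1-based position."""
    index = position - 1
    hits = []
    for feature in record.features:
        if feature.type != "CDS":
            continue  # skip gene, mRNA, etc.
        if index in feature.location:  # for a join, only the exons count
            gene = feature.qualifiers.get("gene", ["?"])[0]
            # Report 1-based inclusive coordinates of the whole feature.
            start = int(feature.location.start) + 1
            end = int(feature.location.end)
            hits.append((gene, start, end))
    return hits


print(cds_hits(record, 15))  # inside the first exon of demoA
print(cds_hits(record, 30))  # inside the intron of demoA
print(cds_hits(record, 70))  # inside demoB on the reverse strand
```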

On a general point, you are talking about mutations - are you going to
be re-sequencing this region in different patients to actually check
for a mutation? Working from a single reference genome you won't be
able to say if there is a mutation (e.g. a SNP) at a given position -
although data from the 1000 Genomes Project could be useful.

I hope that helps.

Peter