[Biopython] searching for a human chromosome position

Peter biopython at maubp.freeserve.co.uk
Mon Jun 1 13:28:07 UTC 2009


On Mon, Jun 1, 2009 at 2:15 PM, dr goettel <biopythonlist at gmail.com> wrote:
>>
>> I don't think your task is "simple".
>>
> I should have added a :-) right after "simple".

:)

>> Given a human chromosome (e.g. as a FASTA or GenBank file from the
>> NCBI) and a location on it, you can easily use Biopython to extract
>> that position (or region).
>
>> You could also look at the provided annotation in the GenBank file to
>> see if the location falls within a gene CDS, and thus if a mutation at
>> that position would cause an amino acid change. Note that because in
>> humans you have introns/exons to worry about, this is actually quite
>> complicated! (If you don't want to use the existing annotation, you
>> would have to do your own gene finding, which is even more
>> complicated.)
>
> This is exactly what I need to do. Could someone redirect me to the
> documentation part or some code needed to, given the chromosome, use
> Biopython to extract that position??

There are two steps here - getting the sequence data (e.g. a GenBank
file), and then extracting the data.

> Looking at the documentation
>
> handle=Entrez.efetch(db="genome", id="9606", rettype="gb") but cannot find
> where to set the chromosome (e.g chr="3"??)

Where did the ID "9606" come from? Using the term '"Homo
sapiens"[orgn] chromosome 3' on the Entrez website pulls up three
matches on Entrez, corresponding to the three available on the NCBI
FTP site:

AC_000135	
    Homo sapiens chromosome 3, alternate assembly (based on HuRef),
whole genome shotgun sequence
    dsDNA; linear; Length: 195,175,600 nt

AC_000046	
    Homo sapiens chromosome 3, alternate assembly (based on Celera
assembly), whole genome shotgun sequence
    dsDNA; linear; Length: 196,588,766 nt

NC_000003	
    Homo sapiens chromosome 3, reference assembly, complete sequence
    dsDNA; linear; Length: 199,501,827 nt

Note that their lengths differ - demonstrating why it is essential to
know which reference your possible SNP locations refer to.

If you really want to use Entrez, try and manually compile a list of
accession numbers first (e.g. NC_000003). Personally, as I said
before, I would just download the chromosomes by FTP.

> Fortunately, all the positions that I need to search are allways in exons
> and withing a gene CDS.

Can you give an explicit example of a particular chromosome accession
and the location you care about?

Peter



More information about the Biopython mailing list