[Biopython] searching for a human chromosome position

Mon Jun 1 15:57:04 UTC 2009

On Mon, Jun 1, 2009 at 2:28 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Mon, Jun 1, 2009 at 2:15 PM, dr goettel <biopythonlist at gmail.com> wrote:
>> This is exactly what I need to do. Could someone redirect me to the
>> documentation part or some code needed to, given the chromosome, use
>> Biopython to extract that position??
>
> There are two steps here - getting the sequence data (e.g. a GenBank
> file), and then extracting the data.
>

This file includes the annotations and the nucleotide sequence (241 MB),
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz

This file includes the annotations but just has a contig line at the end (5 MB)
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz

These should match up to the files you'd get with Entrez using the a
return type of "gbwithparts" and "gb". As you actually will want the
nucleotides, the larger files (*.gbk) are more useful and actually
don't take that much longer to parser with Biopython. The same code
can be used to parse either file in Biopython and look for a gene/CDS
feature spanning a given position.

For example, using a random position I picked in a gene in the first
contig for chromosome three:

from Bio import SeqIO
gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
#gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
snp_sequence = "NT_029928" # Which LOCUS
snp_position = 1151990 #Python counting!
for record in SeqIO.parse(open(gb_filename), "genbank") :
    if record.name != snp_sequence :
        print "Ignoring %s" % record.id
        continue
    print "Searching %s" % record.id
    for feature in record.features :
        if feature.type != "CDS" : continue
        if snp_position < feature.location.nofuzzy_start : continue
        if feature.location.nofuzzy_end < snp_position : continue
        #TODO - use the sub_features to check if the SNP
        #is in an intron or exon
        print feature.location, feature.qualifiers["protein_id"]
print "Done"

This gives:

Searching NT_029928.12
[1129251:1175010] ['NP_002568.2']
Ignoring NT_005535.16
Ignoring NT_113881.1
Ignoring NT_113882.1
Ignoring NT_113883.1
Ignoring NT_113884.1
Ignoring NT_022459.14
Ignoring NT_005612.15
Ignoring NT_022517.17
Done

i.e. The possible SNP at location 1151990 on NT_029928.12 falls within
the region spanned by the CDS feature encoding NP_002568.2 - however
in actual fact, this is not a coding SNP as it is in a intron. You can
check this with a slight extension of the code to look at the
sub_features which record the exons.

As discussed earlier, this is a simple brute force loop to locate any
matching feature. A hashing algorithm might faster. You might also
take advantage of the fact that the features in a GenBank file should
be sorted - but dealing with overlapping CDS features would require
care.

Anyway, I hope this proves useful.

Peter