[Biopython] searching for a human chromosome position

dr goettel biopythonlist at gmail.com
Mon Jun 1 12:52:09 EDT 2009


Wow,
Thankyou very much!! Of course it's very usefull. You almost gave me the
code. There's one thing I still don't get. I have access to everything I
need but the coding frame to look at, I mean, with the code you sent I know
the feature location (from 1129251 to 1175010 position). Since I just want
to know if changing the nucleotide in the snp_sequence position would lead
to a change of aminoacid, it would be enough to translate this portion of
nucleotides and see if changing that position it also changes the aminoacid,
but how should I proceed to translate that portion of adn? I mean what frame
should I use?
Does my question have meaning? maybe I'm loosing something.

Thankyou again!
d

On Mon, Jun 1, 2009 at 5:57 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Mon, Jun 1, 2009 at 2:28 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Mon, Jun 1, 2009 at 2:15 PM, dr goettel <biopythonlist at gmail.com>
> wrote:
> >> This is exactly what I need to do. Could someone redirect me to the
> >> documentation part or some code needed to, given the chromosome, use
> >> Biopython to extract that position??
> >
> > There are two steps here - getting the sequence data (e.g. a GenBank
> > file), and then extracting the data.
> >
>
> This file includes the annotations and the nucleotide sequence (241 MB),
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz
>
> This file includes the annotations but just has a contig line at the end (5
> MB)
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz
>
> These should match up to the files you'd get with Entrez using the a
> return type of "gbwithparts" and "gb". As you actually will want the
> nucleotides, the larger files (*.gbk) are more useful and actually
> don't take that much longer to parser with Biopython. The same code
> can be used to parse either file in Biopython and look for a gene/CDS
> feature spanning a given position.
>
> For example, using a random position I picked in a gene in the first
> contig for chromosome three:
>
> from Bio import SeqIO
> gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
> #gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
> snp_sequence = "NT_029928" # Which LOCUS
> snp_position = 1151990 #Python counting!
> for record in SeqIO.parse(open(gb_filename), "genbank") :
>    if record.name != snp_sequence :
>        print "Ignoring %s" % record.id
>        continue
>    print "Searching %s" % record.id
>    for feature in record.features :
>        if feature.type != "CDS" : continue
>        if snp_position < feature.location.nofuzzy_start : continue
>        if feature.location.nofuzzy_end < snp_position : continue
>        #TODO - use the sub_features to check if the SNP
>        #is in an intron or exon
>        print feature.location, feature.qualifiers["protein_id"]
> print "Done"
>
> This gives:
>
> Searching NT_029928.12
> [1129251:1175010] ['NP_002568.2']
> Ignoring NT_005535.16
> Ignoring NT_113881.1
> Ignoring NT_113882.1
> Ignoring NT_113883.1
> Ignoring NT_113884.1
> Ignoring NT_022459.14
> Ignoring NT_005612.15
> Ignoring NT_022517.17
> Done
>
> i.e. The possible SNP at location 1151990 on NT_029928.12 falls within
> the region spanned by the CDS feature encoding NP_002568.2 - however
> in actual fact, this is not a coding SNP as it is in a intron. You can
> check this with a slight extension of the code to look at the
> sub_features which record the exons.
>
> As discussed earlier, this is a simple brute force loop to locate any
> matching feature. A hashing algorithm might faster. You might also
> take advantage of the fact that the features in a GenBank file should
> be sorted - but dealing with overlapping CDS features would require
> care.
>
> Anyway, I hope this proves useful.
>
> Peter
>


More information about the Biopython mailing list