[Biopython] searching for a human chromosome position
dr goettel
biopythonlist at gmail.com
Mon Jun 1 12:52:09 EDT 2009
Wow,
Thankyou very much!! Of course it's very usefull. You almost gave me the
code. There's one thing I still don't get. I have access to everything I
need but the coding frame to look at, I mean, with the code you sent I know
the feature location (from 1129251 to 1175010 position). Since I just want
to know if changing the nucleotide in the snp_sequence position would lead
to a change of aminoacid, it would be enough to translate this portion of
nucleotides and see if changing that position it also changes the aminoacid,
but how should I proceed to translate that portion of adn? I mean what frame
should I use?
Does my question have meaning? maybe I'm loosing something.
Thankyou again!
d
On Mon, Jun 1, 2009 at 5:57 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:
> On Mon, Jun 1, 2009 at 2:28 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > On Mon, Jun 1, 2009 at 2:15 PM, dr goettel <biopythonlist at gmail.com>
> wrote:
> >> This is exactly what I need to do. Could someone redirect me to the
> >> documentation part or some code needed to, given the chromosome, use
> >> Biopython to extract that position??
> >
> > There are two steps here - getting the sequence data (e.g. a GenBank
> > file), and then extracting the data.
> >
>
> This file includes the annotations and the nucleotide sequence (241 MB),
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz
>
> This file includes the annotations but just has a contig line at the end (5
> MB)
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz
>
> These should match up to the files you'd get with Entrez using the a
> return type of "gbwithparts" and "gb". As you actually will want the
> nucleotides, the larger files (*.gbk) are more useful and actually
> don't take that much longer to parser with Biopython. The same code
> can be used to parse either file in Biopython and look for a gene/CDS
> feature spanning a given position.
>
> For example, using a random position I picked in a gene in the first
> contig for chromosome three:
>
> from Bio import SeqIO
> gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
> #gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
> snp_sequence = "NT_029928" # Which LOCUS
> snp_position = 1151990 #Python counting!
> for record in SeqIO.parse(open(gb_filename), "genbank") :
> if record.name != snp_sequence :
> print "Ignoring %s" % record.id
> continue
> print "Searching %s" % record.id
> for feature in record.features :
> if feature.type != "CDS" : continue
> if snp_position < feature.location.nofuzzy_start : continue
> if feature.location.nofuzzy_end < snp_position : continue
> #TODO - use the sub_features to check if the SNP
> #is in an intron or exon
> print feature.location, feature.qualifiers["protein_id"]
> print "Done"
>
> This gives:
>
> Searching NT_029928.12
> [1129251:1175010] ['NP_002568.2']
> Ignoring NT_005535.16
> Ignoring NT_113881.1
> Ignoring NT_113882.1
> Ignoring NT_113883.1
> Ignoring NT_113884.1
> Ignoring NT_022459.14
> Ignoring NT_005612.15
> Ignoring NT_022517.17
> Done
>
> i.e. The possible SNP at location 1151990 on NT_029928.12 falls within
> the region spanned by the CDS feature encoding NP_002568.2 - however
> in actual fact, this is not a coding SNP as it is in a intron. You can
> check this with a slight extension of the code to look at the
> sub_features which record the exons.
>
> As discussed earlier, this is a simple brute force loop to locate any
> matching feature. A hashing algorithm might faster. You might also
> take advantage of the fact that the features in a GenBank file should
> be sorted - but dealing with overlapping CDS features would require
> care.
>
> Anyway, I hope this proves useful.
>
> Peter
>
More information about the Biopython
mailing list