[Biopython] searching for a human chromosome position

Mon Jun 1 12:54:51 UTC 2009

On Mon, Jun 1, 2009 at 1:20 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> - Define a reference genome to use, along with feature mappings of
>  gene models.
>
> - Parse the gene models (normally as GenBank format or GFF) and
>  extract locations of coding regions.

Yes, if you can get the annotation in GFF format that would also be an
option - it might be simpler than dealing with the intron/exon
representation used in the SeqRecord and SeqFeature objects from
parsing a GenBank file. However, I had a quick look on the NCBI FTP
site for GFF but only saw GenBank files. I don't work on human
genetics, so I don't know where else to look.
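
To illustrate why GFF may be simpler to work with, here is a rough sketch of pulling CDS intervals out of GFF lines using only the standard library. It assumes the usual nine tab-separated columns with 1-based inclusive coordinates. The function name parse_gff_cds and the example records are made up for illustration and are not taken from any real annotation file.

```python
def parse_gff_cds(lines):
    """Return (seqid, start, end, attributes) tuples for CDS rows.

    Coordinates are converted from GFF's 1-based inclusive convention
    to 0-based half-open (Python slice style).
    """
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: well-formed rows have exactly nine columns.
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, attributes = cols[0], int(cols[3]), int(cols[4]), cols[8]
        intervals.append((seqid, start - 1, end, attributes))
    return intervals


# Hypothetical example records, not real annotation.
example = [
    "##gff-version 3",
    "chr1\texample\tgene\t101\t500\t.\t+\t.\tID=geneA",
    "chr1\texample\tCDS\t101\t200\t.\t+\t0\tParent=geneA",
    "chr1\texample\tCDS\t301\t400\t.\t+\t2\tParent=geneA",
]

cds_intervals = parse_gff_cds(example)
```

Each exon's coding segment comes out as its own row, so there is no join/compound location to unpick.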

> - Use the coding region locations to build a hash table of locations
>  to coding identifiers. For these type of hashes, Berkeley DB is
>  useful and in the standard library. There are also many other
>  key/value document stores out there that handle the task well.
>
> - Use your lookup hash to determine if potential SNP bases fall into
>  coding regions.

If there are only a few possible SNPs to look at (say 10), then it
might be simpler just to loop over the gene/CDS feature objects and
check their coordinates against the SNP location. You could do this
with the GenBank file and the SeqFeature locations. (i.e. relatively
quick to write the code, but slow to run.)
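
A minimal sketch of that simple loop, written against plain tuples so it runs without any annotation file. The Exon type, the gene names and the coordinates are all hypothetical. With a parsed GenBank record you would take the start and end from each CDS feature's location instead (one interval per exon for spliced features).

```python
from typing import Iterable, List, NamedTuple


class Exon(NamedTuple):
    gene: str   # hypothetical gene/CDS identifier
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def genes_hit_by_snp(exons: Iterable[Exon], snp_pos: int) -> List[str]:
    """Linear scan: check every coding interval against one SNP position."""
    hits: List[str] = []
    for exon in exons:
        if exon.start <= snp_pos < exon.end and exon.gene not in hits:
            hits.append(exon.gene)
    return hits


# Made-up intervals: geneA has two exons, geneB overlaps the second one.
exons = [
    Exon("geneA", 100, 200),
    Exon("geneA", 300, 400),
    Exon("geneB", 350, 500),
]

for snp in (150, 250, 360):
    print(snp, genes_hit_by_snp(exons, snp))
```

Position 250 falls in the gap between geneA's exons, so it is not reported as coding. Every SNP costs one pass over all the intervals, which is fine for ten SNPs but not for millions.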

Brad's suggestion of a hash-based lookup is probably going to be faster,
but is also more complex. If you have a lot of SNPs then this is
probably worthwhile. (i.e. relatively slow to write the code, but
quick to run).
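
A rough sketch of the hash idea using an in-memory dict keyed on (chromosome, position). This is only one possible reading of Brad's suggestion: the function names and the example intervals are invented, and for a whole genome you would probably want the on-disk key/value store he mentions rather than a Python dict.

```python
from collections import defaultdict


def build_position_index(exons):
    """Map every coding position to the set of identifiers covering it.

    exons: iterable of (chrom, start, end, gene) with 0-based,
    half-open coordinates (an assumption of this sketch).
    """
    index = defaultdict(set)
    for chrom, start, end, gene in exons:
        for pos in range(start, end):
            index[(chrom, pos)].add(gene)
    return index


def lookup(index, chrom, pos):
    """Constant-time check of one SNP position; returns sorted identifiers."""
    return sorted(index.get((chrom, pos), ()))


# Made-up intervals for illustration only.
coding = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 300, 400, "geneA"),
    ("chr1", 350, 500, "geneB"),
]

index = build_position_index(coding)
print(lookup(index, "chr1", 360))
print(lookup(index, "chr1", 250))
```

The index is expensive to build (one entry per coding base), but after that each SNP lookup is a single dictionary access no matter how many genes there are.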

Peter