[Biopython-dev] Fwd: contributing comparative genomics tools
Chris Dagdigian
dag at sonsorol.org
Fri Aug 4 10:38:52 UTC 2006
Begin forwarded message:
> From: Christopher Lee <leec at chem.ucla.edu>
> Date: August 3, 2006 9:11:42 PM EDT
> To: biopython-dev-owner at lists.open-bio.org
> Subject: Fwd: contributing comparative genomics tools
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
> there appears to be an error in your code submission instructions
> on the biopython.org/wiki, or in the configuration of the biopython-
> dev list server. The code submission instructions tell me to
> submit my proposal by email to biopython-dev at biopython.org, but the
> list server responds by saying that all mail will automatically be
> rejected! Please forward this proposal to the appropriate people
> (presumably biopython-dev?), and let me know that you have done
> so. Otherwise I won't have any way of knowing whether anyone even
> reads this email address...
>
> Yours with thanks,
>
> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>
> Begin forwarded message:
>
>> You are not allowed to post to this mailing list, and your message
>> has
>> been automatically rejected. If you think that your messages are
>> being rejected in error, contact the mailing list owner at
>> biopython-dev-owner at lists.open-bio.org.
>>
>>
>> From: Christopher Lee <leec at chem.ucla.edu>
>> Date: August 3, 2006 3:55:52 PM PDT
>> To: biopython-dev at biopython.org
>> Cc: Namshin Kim <deepreds at gmail.com>
>> Subject: contributing comparative genomics tools
>>
>>
>> Hi Biopython developers,
>> I'd like to contribute some Python tools that my lab has been
>> developing for large-scale comparative genomics database query.
>> These tools make it easy to work with huge multigenome alignment
>> databases (e.g. the UCSC Genome Browser multigenome alignments)
>> using a new disk-based interval indexing algorithm that gives very
>> high performance with minimal memory usage. e.g. whereas queries
>> of the UCSC 17genome alignment typically take about 30 sec. per
>> query using MySQL, the same query takes about 200 microsec. per
>> query, making it possible to run huge numbers of queries for
>> genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code
>> below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-
>> align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and
>> it has had four open source releases so far. At this point the
>> code is in production use. All the code is compatible back to
>> Python version 2.2, but not 2.1 or before (we use generators).
>> There is C code (accessed as Python classes) for the high-
>> performance interval database index. For details of history see
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE
>> ALIGNMENT DB
>>
>> def printResults
>> (prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>> 'get alignment of each genome to site, print %identity and %
>> aligned'
>> for src,dest,edge in msa[site].edges(mergeMost=True): #
>> ALIGNMENT QUERY!
>> print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>> %(altID,cluster_id,prefix,seqNames[dest],
>> 100.*edge.pIdentity(),100.*edge.pAligned(),src[:
>> 2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>> 'gene must be a slice of a sequence in our genome alignment msa'
>> ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>> ss2=gene[start2-2:start2]
>> ss3=gene[stop:stop+2]
>> e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>> e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND
>> stop
>> zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd
>> REQUESTS
>> cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS
>> TO MINIMIZE TRAFFIC
>> for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),
>> ('e1',e1),('e2',e2)]:
>> printResults(prefix,msa,site,seqNames=~
>> (msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple
>> genomes. These alignments are extremely useful and interesting,
>> but so large that it is cumbersome to work with the dataset using
>> conventional methods. For example, for the 8-genome alignment you
>> have to work simultaneously with the individual genome datasets
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as
>> well as the huge alignment itself. Pygr makes this quite easy.
>> Here we illustrate an example of mapping an alternative 3' exon,
>> which has two alternative splice sites (start1 and start2) and a
>> single terminal splice site (stop). We use the alignment database
>> to map each of these splice sites onto all the aligned genomes,
>> and to print the percent-identity and percent-aligned for each
>> genome, as well as the two nucleotides consituting the splice site
>> itself. To examine the conservation of the two exonic regions
>> (between start1 and start2, and the adjacent region terminated by
>> stop, we print the same information for each genome's alignment to
>> these two regions as well. The code first opens the alignment
>> database. The function (getAlt3Conservation) obtains sequence
>> slice objects representing the various ``sites'' to be queried.
>> The actual alignment database query is performed in printResults:
>>
>> * The alignment database query is in the first line of
>> printResults(). msa is the database; site is the interval query;
>> and the edges methods iterates over the results, returning a tuple
>> for each, consisting of a source sequence interval (i.e. an
>> interval of site), a destination sequence interval (i.e. an
>> interval in an aligned genome), and an edge object describing that
>> alignment. We are taking advantage of Pygr's group-by operator
>> mergeMost, which will cause multiple intervals in a given sequence
>> to be merged into a single interval that constitutes their
>> ``union''. Thus, for each aligned genome, the edges iterator will
>> return a single aligned interval. The alignment edge object
>> provides some useful conveniences, such as calculating the percent-
>> identity between src and dest automatically for you. pIdentity()
>> computes the fraction of identical residues; pAligned computes the
>> fraction of aligned residues (allowing you to see if there are big
>> gaps or insertions in the alignment of this interval). If we had
>> wanted to inspect the detailed alignment letter by letter, we
>> would just iterate over the letters attribute instead of the edges
>> method. (See the NLMSASlice documentation for further information).
>>
>> * src[:2] and dest[:2] print the first two nucleotides of the
>> site in gene and in the aligned genome.
>>
>> * it's worth noting that the actual sequence string
>> comparisons are being done using a completely different database
>> mechanism (formerly NCBI's fastacmd, now our own (much faster)
>> pureseq text format), not the cnestedlist database. Basically,
>> each genome is being queried as a separate BLAST formatted
>> database, represented in Pygr by the BlastDB class. Pygr makes
>> this complex set of multi-database operations more or less
>> transparent to the user. For further information, see the BlastDB
>> documentation.
>>
>> * The other operations here are entirely vanilla: mainly
>> slicing a gene sequence to obtain the specific sites that we want
>> to query. Note: gene must itself be a slice of a sequence in our
>> alignment, or the alignment query msa[site] will raise an
>> IndexError informing the user that the sequence site is not in the
>> alignment.
>>
>> * The only slightly interesting operation here is the use of
>> interval addition to obtain the ``union'' of two intervals, e.g.
>> e1=ss1+ss2. This obtains a single interval that contains both of
>> the input intervals.
>>
>> * When the print statement requests str() representations of
>> these sequence objects, Pygr uses fastacmd -L to extract just the
>> right piece of the corresponding chromosomes from the eight BLAST
>> databases.
>>
>> (Actually, because of Pygr's caching / optimizations, considerably
>> more is going on than indicated in this simplified sketch. But you
>> get the idea: Pygr makes it relatively effortless to work with a
>> variety of disparate (and large) resources in an integrated way.)
>>
>> Here is some example output:
>>
>> 1 Mm.99996 ss1 hg17 50.0 100.0 AG GG
>> 1 Mm.99996 ss1 canFam1 50.0 100.0 AG GG
>> 1 Mm.99996 ss1 panTro1 50.0 100.0 AG GG
>> 1 Mm.99996 ss1 rn3 100.0 100.0 AG AG
>> 1 Mm.99996 ss2 hg17 100.0 100.0 AG AG
>> 1 Mm.99996 ss2 canFam1 100.0 100.0 AG AG
>> 1 Mm.99996 ss2 panTro1 100.0 100.0 AG AG
>> 1 Mm.99996 ss2 rn3 100.0 100.0 AG AG
>> 1 Mm.99996 ss3 hg17 100.0 100.0 GT GT
>> 1 Mm.99996 ss3 canFam1 100.0 100.0 GT GT
>> 1 Mm.99996 ss3 panTro1 100.0 100.0 GT GT
>> 1 Mm.99996 ss3 rn3 100.0 100.0 GT GT
>> 1 Mm.99996 e1 hg17 78.9 100.0 AG GG
>> 1 Mm.99996 e1 canFam1 84.2 100.0 AG GG
>> 1 Mm.99996 e1 panTro1 77.6 100.0 AG GG
>> 1 Mm.99996 e1 rn3 97.4 98.7 AG AG
>> 1 Mm.99996 e2 hg17 91.6 99.1 CC CC
>> 1 Mm.99996 e2 canFam1 88.8 99.1 CC CC
>> 1 Mm.99996 e2 panTro1 91.6 99.1 CC CC
>> 1 Mm.99996 e2 rn3 97.2 100.0 CC CC
>>
>>
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (Darwin)
>>
>> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i
>> xhxE2zrffLaiCjSiEp4Eo6k=
>> =BeOe
>> -----END PGP SIGNATURE-----
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
>
> iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm
> Wx/4AIstFjb+rbqY2QBppLg=
> =fghY
> -----END PGP SIGNATURE-----
More information about the Biopython-dev
mailing list