[Bioperl-l] UCSC database backend

Thu Aug 10 11:00:17 UTC 2006

On 8/10/06 3:14 AM, "Sendu Bala" <bix at sendu.me.uk> wrote:

> Sean Davis wrote:
>> 
>> Before we get too far down this line of thought, keep in mind that this will
>> be dozens of Gb of sequence and database tables.  See here for details:
>> 
>> http://genome.ucsc.edu/admin/mirror.html
>> 
>> The sequences include all of genbank, essentially.  The mysql tables ALONE
>> (no sequence) for only ONE human assembly are on the order of 10Gb--not the
>> kind of thing you can download in a few minutes (or even hours).  Just to
>> keep in mind....
> 
> I think if someone needs heavy-duty access to genomic data, they'll find
> the disk space. That wouldn't be the problem. The problem would be
> finding an easy way of getting the data, which is where I hoped
> something like a UCSC frontend would come in.

If you look into the code that underlies the UCSC browser, they use a piece
of software called the "autojoiner".  It describes the relationships between
databases and their tables and how they relate to each other.  They don't
have the strict concept of a foreign key, but rather "join" rules that can
include things like the key in one table being used in a join to a key in
another table, but perhaps with "fuzzy" matching or with an arbitrary prefix
or suffix.  In order to reproduce what UCSC does, we need to recreate the
autojoiner.  I've looked at it a bit; it is not trivially easy and is
probably not a task that I can complete in the space of a few weeks, though
one never knows.  Here are links to the table/database autojoiner description
file and accompanying documentation, just to give you a sense of what it
looks like (extracted from the UCSC source tree):

http://watson.nci.nih.gov/~sdavis/all.joiner
http://watson.nci.nih.gov/~sdavis/joiner.doc

A laudable goal would be to parse and use this file, and this is quite
doable, I suppose.  If one wanted to make a table-browser-featured
interface, it would include parsing this file and then having the
appropriate introspection methods and a means to use them to design a query
of interest.  Getting the information into a bioperl format, after doing
this, is another matter, of course.
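
To give a sense of what a "join" rule with a prefix or suffix involves, here
is a minimal Python sketch.  It does not parse all.joiner and does not
reflect the actual UCSC syntax; JoinField, join_rows, and the sample rows are
all hypothetical names and made-up data, used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinField:
    """One side of a hypothetical join rule: a table column plus key clean-up."""
    table: str
    column: str
    strip_prefix: str = ""  # literal prefix to drop before comparing keys
    strip_suffix: str = ""  # literal suffix to drop before comparing keys

    def normalize(self, value: str) -> str:
        """Return the key with the configured prefix/suffix removed."""
        if self.strip_prefix and value.startswith(self.strip_prefix):
            value = value[len(self.strip_prefix):]
        if self.strip_suffix and value.endswith(self.strip_suffix):
            value = value[:-len(self.strip_suffix)]
        return value


def join_rows(left_rows, left, right_rows, right):
    """Join two lists of dict rows on their normalized key columns."""
    # Index the right-hand rows by normalized key.
    index = {}
    for row in right_rows:
        index.setdefault(right.normalize(row[right.column]), []).append(row)
    # Pair each left-hand row with every right-hand row sharing its key.
    joined = []
    for row in left_rows:
        for match in index.get(left.normalize(row[left.column]), []):
            joined.append((row, match))
    return joined


# Made-up rows: one table stores the accession with a version suffix,
# the other stores it bare.
genes = [{"name": "NM_000546.2", "chrom": "chr17"}]
xrefs = [{"acc": "NM_000546", "symbol": "TP53"}]

left = JoinField("genes", "name", strip_suffix=".2")
right = JoinField("xrefs", "acc")
pairs = join_rows(genes, left, xrefs, right)
```

A real implementation would read such rules from the joiner file rather than
hard-coding them, and would run the joins in the database instead of in
memory.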

> 
>> On another point, the strength of UCSC is not in obtaining sequence, but in
>> mapping to the genome.  I think getting actual sequence should be secondary
>> here, if for no other reason than there are trivially easy ways of getting
>> sequence information from elsewhere given an accession or ID.  There is
>> simply too much information to be stored locally for most people and getting
>> the data remotely from UCSC doesn't seem possible currently.
> 
> The work would certainly be highly valuable even if it didn't allow for
> sequence retrieval, but from my own point of view my main interest was
> exactly the retrieval of arbitrary bits of genomic sequence - for which
> there is no accession or ID that can be used to query some other database.

For this purpose, their DAS server works just fine.  Try this:

http://genome.ucsc.edu/cgi-bin/das/hg16/dna?segment=7:50001,51000&segment=8:1,100000
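
A request like the one above can also be assembled programmatically.  Here
is a minimal Python sketch, assuming the DAS dna endpoint keeps the URL
layout shown above.  das_dna_url is a hypothetical helper, not part of any
library, and the sketch only builds the URL without fetching it.

```python
DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"  # base taken from the URL above


def das_dna_url(assembly, segments):
    """Build a DAS 'dna' request URL.

    segments is an iterable of (chromosome, start, end) tuples; coordinates
    are passed through unchanged, in the chrom:start,end form used above.
    """
    query = "&".join(
        f"segment={chrom}:{start},{end}" for chrom, start, end in segments
    )
    return f"{DAS_BASE}/{assembly}/dna?{query}"


url = das_dna_url("hg16", [("7", 50001, 51000), ("8", 1, 100000)])
print(url)
```

The resulting string can be handed to any HTTP client; the response is XML
and would still need to be parsed to pull out the sequence.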

There are a number of other alternatives, including working with .nib or
.2bit files, as Malcolm mentioned.

Sean