[Bioperl-l] UCSC database backend

Sean Davis sdavis2 at mail.nih.gov
Wed Aug 9 19:02:42 UTC 2006




On 8/9/06 2:11 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:

>> Chris,
>> 
>> Once I get CVS access, I will commit what I have done (as long as it
>> "works").
>> 
>> Now for the details.  Keep in mind that for many of the "sequences"
>> available from UCSC, there is no actual "sequence" stored in the database;
>> rather they are stored in flat files not accessible directly via SQL.
>> Therefore, a sequence would be "abstract" in the sense of being a "join
>> location" on the chromosome, and even that isn't quite right, as the mRNA
>> sequence != genomic alignment sequence.  Also, there are many different
>> tables that maintain "sequence" information.  So, implementing RandomAccessI
>> is not going to be straightforward and will require some assumptions about
>> what will be searched.  In fact, since the same "sequence" can be in many
>> different tables, there may need to be a way of specifying where the search
>> is done (what table(s)).
>> 
>> Sean
> 
> Sean,
> 
> Okay, makes sense.  So, the MySQL database holds the sequence information
> (location, etc) and the actual sequences (mRNA, EST, genomic) are in various
> flat files.  Seems like this calls for a helper set-up script to index the
> appropriate sequence flat files and possibly load the MySQL database table
> information.  Bio::DB::Fasta could be used for indexing the sequence files
> as it's pretty fast.

Before we get too far down this line of thought, keep in mind that this will
be dozens of GB of sequence and database tables.  See here for details:

http://genome.ucsc.edu/admin/mirror.html

The sequences include essentially all of GenBank.  The MySQL tables ALONE
(no sequence) for only ONE human assembly are on the order of 10 GB, which is
not the kind of thing you can download in a few minutes (or even hours).
Just something to keep in mind.
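
To put a rough number on that, here is a back-of-the-envelope sketch.  The
link speeds are assumptions picked for illustration, not measurements, and
the function name is made up for this example:

```python
def transfer_hours(size_gb, mbit_per_s):
    """Hours needed to move size_gb gigabytes over a link of mbit_per_s."""
    bits = size_gb * 8 * 1000**3             # decimal gigabytes -> bits
    seconds = bits / (mbit_per_s * 1_000_000)
    return seconds / 3600


if __name__ == "__main__":
    # 10 GB is the order-of-magnitude figure for one assembly's tables;
    # the link speeds below are hypothetical.
    for speed in (1, 10, 100):
        print(f"{speed:>4} Mbit/s: {transfer_hours(10, speed):.1f} h")
```

This ignores protocol overhead and the time to load the dumps into MySQL, so
real numbers would be worse, and the sequence files come on top of that.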

On another point, the strength of UCSC is not in obtaining sequence, but in
mapping to the genome.  I think getting actual sequence should be secondary
here, if for no other reason than there are trivially easy ways of getting
sequence information from elsewhere given an accession or ID.  There is
simply too much information to be stored locally for most people and getting
the data remotely from UCSC doesn't seem possible currently.
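
To illustrate what the mapping information looks like, here is a minimal
sketch that turns the exon columns of one gene-table row into a
GenBank-style "join" location.  It assumes the row has comma-separated
exonStarts/exonEnds columns with 0-based, half-open coordinates (as in the
UCSC gene prediction tables); the function name and the example values are
made up for illustration:

```python
def join_location(exon_starts, exon_ends, strand):
    """Build a GenBank-style location string from exon start/end columns."""
    # The columns are comma-separated and usually carry a trailing comma.
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    if not starts or len(starts) != len(ends):
        raise ValueError("need equal, non-zero numbers of exon starts and ends")
    # Assumed input: 0-based half-open.  Output: 1-based inclusive.
    parts = [f"{s + 1}..{e}" for s, e in zip(starts, ends)]
    loc = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return f"complement({loc})" if strand == "-" else loc


if __name__ == "__main__":
    # Hypothetical two-exon row on the minus strand.
    print(join_location("100,300,", "200,400,", "-"))
```

Note that this only describes where the alignment sits on the chromosome;
as mentioned earlier, the genomic sequence at those coordinates is not
necessarily identical to the mRNA sequence itself.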

Sean



