[Bioperl-l] UCSC database backend

Wed Aug 9 18:38:15 UTC 2006

Just a thought (chiming in)....

Both BLAST and BLAT database indices offer ways of retrieving sequence
by identifier and coordinates.

If you're building these indices for local copies of these files anyway,
they can do double duty.

It is pretty easy to write tied-hash interfaces to BLAST/BLAT-formatted
databases, which could then be wrapped to look like Bio::DB::Fasta.
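
A minimal sketch of the tied-hash idea, in plain Perl with no BioPerl
dependency.  Everything named here is hypothetical: the SeqIndexHash
package, the "id:start-end" key syntax (1-based, inclusive), and the
fetch callback, which reads from an in-memory toy hash.  In a real
set-up the callback would presumably query the BLAST/BLAT-formatted
database instead (for example by calling out to a tool such as fastacmd
or twoBitToFa).

```perl
use strict;
use warnings;

# Hypothetical read-only tied hash. Keys are "id" or "id:start-end"
# (1-based, inclusive); lookups are delegated to a fetch callback.
package SeqIndexHash;

sub TIEHASH {
    my ($class, $fetch) = @_;
    return bless { fetch => $fetch }, $class;
}

sub FETCH {
    my ($self, $key) = @_;
    # Split the key into an ID and an optional start-end range.
    my ($id, $start, $end) = $key =~ /^(.+?)(?::(\d+)-(\d+))?$/
        or return;
    return $self->{fetch}->($id, $start, $end);
}

sub EXISTS { my ($self, $key) = @_; return defined $self->FETCH($key) }
sub STORE  { die "SeqIndexHash is read-only\n" }

package main;

# Toy backend standing in for an indexed database (assumption).
my %toy = (chr1 => 'ACGTACGTAA');
my $fetch = sub {
    my ($id, $start, $end) = @_;
    my $seq = $toy{$id};
    return unless defined $seq;
    return $seq unless defined $start;          # whole sequence
    return if $start < 1 || $end > length($seq) || $start > $end;
    return substr($seq, $start - 1, $end - $start + 1);
};

tie my %seq, 'SeqIndexHash', $fetch;
print $seq{'chr1:2-5'}, "\n";    # prints CGTA
```

Keeping the backend behind a callback means the same hash interface
could sit on top of either index type.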

Might save some time....

--Malcolm

>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org 
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
>Sent: Wednesday, August 09, 2006 1:12 PM
>To: 'Sean Davis'; bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] UCSC database backend
>
>> Chris,
>> 
>> Once I get CVS access, I will commit what I have done (as long as it
>> "works").
>> 
>> Now for the details.  Keep in mind that for many of the "sequences"
>> available from UCSC, there is no actual "sequence" stored in the
>> database; rather they are stored in flat files not accessible directly
>> via SQL.  Therefore, a sequence would be "abstract" in the sense of
>> being a "join location" on the chromosome, and even that isn't quite
>> right, as the mRNA sequence != genomic alignment sequence.  Also, there
>> are many different tables that maintain "sequence" information.  So,
>> implementing RandomAccessI is not going to be straightforward and will
>> require some assumptions about what will be searched.  In fact, since
>> the same "sequence" can be in many different tables, there may need to
>> be a way of specifying where the search is done (what table(s)).
>> 
>> Sean
>
>Sean,
>
>Okay, makes sense.  So, the MySQL database holds the sequence information
>(location, etc) and the actual sequences (mRNA, EST, genomic) are in
>various flat files.  Seems like this calls for a helper set-up script to
>index the appropriate sequence flat files and possibly load the MySQL
>database table information.  Bio::DB::Fasta could be used for indexing
>the sequence files as it's pretty fast.
>
>So, if I were to retrieve a particular sequence (region of scaffold of
>genomic DNA for instance), I would need:
>
>1)  unique ID or name for the sequence
>2)  start-end coordinates (in UCSC terms, I suppose; UCSC starts with 0,
>    if I remember correctly?)
>3)  table to retrieve data from
>4)  either the location of indexed sequence files or a flat-file db
>    handler
>
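On the coordinate question in item 2: UCSC database tables store
0-based starts with an exclusive end ("half-open"), whereas BioPerl
locations are 1-based and inclusive.  A small sketch of the conversion
follows; the function names are made up for illustration and are not
part of any existing module.

```perl
use strict;
use warnings;

# UCSC table coordinates: 0-based start, exclusive end.
# BioPerl coordinates:    1-based start, inclusive end.
# Only the start shifts; the end value is numerically unchanged.
sub ucsc_to_bioperl {
    my ($start0, $end0) = @_;
    return ($start0 + 1, $end0);
}

sub bioperl_to_ucsc {
    my ($start1, $end1) = @_;
    return ($start1 - 1, $end1);
}

# The first 100 bases of a sequence: 0..100 in UCSC terms.
my ($s, $e) = ucsc_to_bioperl(0, 100);
print "$s-$e\n";    # prints 1-100
```

Either way the length is preserved: end minus start in UCSC terms,
end minus start plus one in BioPerl terms.
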
>These could all be set upon instantiation for sequence retrieval:
>
>$factory = Bio::DB::UCSC::Sequence->new(-table     => $table,
>                                        -seq_start => $start,
>                                        -seq_end   => $end,
>                                        -db        => $handler,);
>
># returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB Handler
>
>$seq = $factory->get_Seq_by_id($id);  
>
>If you just want the sequence associated with an ID, the location info
>(whether it is Simple, Split, Fuzzy, etc) could be used to retrieve the
>subsequence from the appropriate flatfile dependent on the table used.
>
>$factory = Bio::DB::UCSC::Sequence->new(-table     => $table,
>                                        -db        => $handler,);
>
># returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB handler
>
>$seq = $factory->get_Seq_by_id($id);  
>
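A rough sketch of the "use the location info to pull the subsequence"
step for a split location, in plain Perl.  It assumes the exon blocks
arrive the way UCSC gene tables store them, as comma-separated
exonStarts/exonEnds strings with a trailing comma and 0-based,
end-exclusive coordinates.  The spliced_seq function and the toy
chromosome are hypothetical.  Note Sean's caveat above: what comes back
is genomic-derived sequence, which need not equal the mRNA.

```perl
use strict;
use warnings;

# Join exon blocks cut from a genomic sequence (hypothetical helper).
# $starts/$ends: comma-separated, 0-based, end-exclusive block bounds.
sub spliced_seq {
    my ($genomic, $starts, $ends, $strand) = @_;
    my @s = split /,/, $starts;    # trailing empty field is dropped
    my @e = split /,/, $ends;
    die "exon start/end counts differ\n" unless @s == @e;

    my $seq = join '',
        map { substr($genomic, $s[$_], $e[$_] - $s[$_]) } 0 .. $#s;

    # Minus-strand features are reported reverse-complemented.
    if ($strand eq '-') {
        $seq = reverse $seq;
        $seq =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $seq;
}

my $chr = 'AAACCCGGGTTTAAACCC';    # toy chromosome
print spliced_seq($chr, '3,9,', '6,12,', '+'), "\n";    # prints CCCTTT
```

In practice $genomic would not be a whole chromosome held in memory;
each block would be fetched through the indexed flat-file handler.
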
>Would something like that be appropriate?  Not sure if I'm missing
>something.  Sendu may have other suggestions/additions; I'm letting the
>coffee talk now.
>
>Chris
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>