[Bioperl-l] storing and retrieving partial sequences

Aaron J. Mackey <amackey@virginia.edu>
Thu, 6 Dec 2001 18:00:25 -0500 (EST)


Hi Lincoln,

Interesting, I'll definitely study this further.  Could this be extended
to a problem like mine (you have a set of ranges, you'd like to enumerate
the overlaps, the "noncovered" betweens, etc) without [much of] a procedural
language?  What led you to this algorithm?

-Aaron

On Thu, 6 Dec 2001, Lincoln Stein wrote:

> Hi All,
>
> The Bio::DB::GFF::Adaptor::mysqlopt module in bioperl-live uses a simple
> optimization that is very effective at speeding up the following three
> types of query:
>
> 	Given a search range:
>
> 	1) find ranges that overlap with it
> 	2) find ranges that are contained within it
> 	3) find ranges that contain it
>
> It uses a binning scheme in which there are several bin tiers of different
> orders of magnitude (the smallest tier consists of 1K bins and the largest is
> large enough to hold the largest range in the database; computed dynamically
> during database load).  The bins are then turned into floating point indexes.
>  Range queries do a series of range checks on the bins in order to reduce the
> search space.  Here's an overlap query for all features between positions
> 500,000 and 520,000 on chromosome "I":
>
>  SELECT fref,fstart,fstop
>  FROM fdata
>  WHERE fdata.fref='I'
>         AND (fbin between 100000000.000000 and 100000000.000000
>          OR fbin between 10000000.000000 and 10000000.000000
>          OR fbin between 1000000.000000 and 1000000.000000
>          OR fbin between 100000.000004 and 100000.000005
>          OR fbin between 10000.000049 and 10000.000052
>          OR fbin between 1000.000499 and 1000.000520)
>         AND fdata.fstop>=500000 AND fdata.fstart<=520000
>
> And here's one that finds features that are completely contained within the
> same range:
>
>  SELECT fref,fstart,fstop
>  FROM fdata
>  WHERE fdata.fref='I'
>         AND (fbin between '100000000.000000' and '100000000.000000'
>          OR fbin between '10000000.000000' and '10000000.000000'
>          OR fbin between '1000000.000000' and '1000000.000000'
>          OR fbin between '100000.000004' and '100000.000005')
>         AND fdata.fstart<='500000' AND fdata.fstop>='520000'
>
> (pardon the extraneous quotes in the latter example; they are introduced by
> my perl debugging code).
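>
> Here's a rough sketch (untested, and not the exact code from the module --
> the authoritative version lives in Bio::DB::GFF itself) of how the fbin
> values and the BETWEEN clauses can be generated; it reproduces the bin
> ranges shown in the overlap query above:
>
>  # each bin is encoded as a float: the integer part is the tier size,
>  # the fractional part is the bin number within that tier
>  my @tiers = (1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000);
>
>  sub bin {
>      my ($tier, $pos) = @_;
>      return sprintf("%d.%06d", $tier, int($pos / $tier));
>  }
>
>  # one BETWEEN clause per tier; start-1 keeps the 1-based coordinates
>  # consistent with the ranges in the query above
>  sub bin_clauses {
>      my ($start, $stop) = @_;
>      my @or = map { sprintf("fbin between %s and %s",
>                             bin($_, $start - 1), bin($_, $stop)) } @tiers;
>      return join("\n          OR ", @or);
>  }
>
>  print bin_clauses(500_000, 520_000), "\n";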
>
> On a MySQL database with 6 million ranges and reference segments of 15 megs
> average size, it takes 10.22s for the unoptimized query to run (one without
> the bin conditions), and 0.03 seconds for the optimized query to run.
> Obviously you need a bit of a Perl script to generate the SQL, but that's
> what Bio::DB::GFF is for.
>
> I'm writing the algorithm up, but feel free to use it if you think it's
> useful (you can steal the details from the code).
>
> PostgreSQL has an R-Tree index type that is specialized for multi-dimensional
> range queries.  I experimented with it before arriving at this solution, but
> found that it doesn't do well with sequence feature data.  The problem is
> that R-Trees create a set of dynamically-sized bins, but the great
> differences in magnitude of the features in a typical data set (some features
> are megabases, others are a single base pair) cause many features to be
> greedily allocated to the largest bin.
>
> Lincoln
>
>
> On Thursday 06 December 2001 15:05, Aaron J Mackey wrote:
> >
> > This is a bit offtopic, but the invocation of 'postgres' made my ears perk
> > up ...
> >
> > On Wed, 5 Dec 2001, Chris Mungall wrote:
> > > If you're using postgres there are a number of optimisations to
> > > consider -
> > >
> > > you can use the native range type to store the coordinates, this is
> > > presumably optimised for this sort of thing.
> >
> > "presumably" is the key word here.  I've recently migrated to using
> > Postgres, mainly because of all the wonderful things it seemed to provide
> > along these lines (more interesting datatypes and inheritance, namely).
> > I've found that, with a little inspection, many of these things aren't
> > actually optimized at all.  Provided, yes.  Optimized, not necessarily.
> >
> > > other than the speed issue, which is still open, i think postgres
> > > could be better than mysql as an open source bioinformatics dbms.
> >
> > I want to agree.  I want to believe.  I want to read some documentation
> > about inheritance that tells me more than just the trite cities/capitals
> > example.  I've read the developers list where they talk about how
> > inheritance isn't actually "finished".  How you cannot access child
> > attributes without explicitly joining in the table (which is how you have
> > to handle subtyping in MySQL).  How foreign keys amongst children aren't
> > shared.  etc. etc. etc.
> >
> > Sorry, didn't mean to rant.  To get back on topic, to answer the range
> > overlap/exclusion problem, here's how we do it (with MySQL, begin/end
> > columns):
> >
> > There are two ways, and no I haven't explicitly compared them to see which
> > is better (but my hunch is the first on Postgres, the second with MySQL).
> >
> > Method #1. Use set logic.  The key point here is that you need to create a
> > temporary table (or VIEW in other RDBMSs) that "explodes" each begin/end
> > range pair into one row for each position.  You do this by joining to an
> > auxiliary table of integers (that has at least as many rows as your
> > longest sequence).  So if you have a table of "ranges" like this:
> >
> > CREATE TABLE ranges (
> > range_id int unsigned not null auto_increment primary key,
> > seq_id int unsigned not null references seq(id),
> > begin int unsigned not null default 1,
> > end int unsigned not null default 1,
> > index(seq_id),
> > index(begin),
> > index(end)
> > )
> > -- CONSTRAINT good_positions
> > --   CHECK (begin < end AND
> > --          end <= (select length from seq where id = seq_id)),
> > ;
> >
> > You need an auxiliary table "allnumbers" like this (adjust last select as
> > necessary for your largest sequence length):
> >
> > CREATE TABLE allnumbers (
> > number int unsigned not null primary key
> > );
> > INSERT INTO allnumbers (number)
> > VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
> > INSERT INTO allnumbers (number)
> > SELECT DISTINCT N1.number +
> >                (N2.number * 10) +
> >                (N3.number * 100) +
> >                (N4.number * 1000) +
> >                (N5.number * 10000) +
> >                (N6.number * 100000)
> > FROM allnumbers as N1,
> >      allnumbers as N2,
> >      allnumbers as N3,
> >      allnumbers as N4,
> >      allnumbers as N5,
> >      allnumbers as N6
> > WHERE N1.number +
> >      (N2.number * 10) +
> >      (N3.number * 100) +
> >      (N4.number * 1000) +
> >      (N5.number * 10000) +
> >      (N6.number * 100000) > 9
> > ;
> >
> > (Whew!)
> >
> > Now make your temporary table (in MySQL, an in-memory heap; YMMV)
> >
> > CREATE TEMPORARY TABLE coverage (
> > range_id int unsigned not null,
> > seq_id int unsigned not null,
> > position int unsigned not null,
> > primary key (range_id, seq_id, position)
> > ) TYPE = Heap;
> >
> > INSERT INTO coverage (range_id, seq_id, position)
> > SELECT range_id, seq_id, number
> > FROM ranges
> > 	INNER JOIN allnumbers
> > 	ON (number >= begin AND number <= end)
> > ;
> >
> > Now you can work with "coverage" using the usual set operations
> > (intersection, union, exception, etc.); see Celko's "SQL For Smarties".
> > Note that this requires an RDBMS with sub-selects, i.e. not MySQL.
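> >
> > As a simple illustration (untested), the overlapping positions -- those
> > covered by two or more ranges -- fall out of a plain GROUP BY on coverage:
> >
> > SELECT seq_id, position
> > FROM coverage
> > GROUP BY seq_id, position
> > HAVING count(*) > 1
> > ;
> >
> > The fancier operations (set difference against allnumbers to get the
> > "noncovered" positions, intersections between two feature sets, etc.) are
> > where the sub-selects come in.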
> >
> >
> > Method #2 - MySQL compatible!
> >
> > OK, given all that theoretical crap above, here's what we actually do;
> > it's quite fast.  What follows is a loosely commented SQL "script" that we
> > feed to fasta when we want to "search using all the intervening sequence
> > between a set of orf calls/hits already in the database".  fasta passes
> > off any command that starts with "do" to the database server; the first
> > statement without a "do" is a "select" from which fasta gets its data.
> >
> > -- extract the coordinates of the hits we already have:
> > do
> > create table myhits type = heap
> > select lib_id as libid,
> >        if(strand = 'f', lbeg, lend) as begin,
> >        if(strand = 'f', lend, lbeg) as end  -- reversed coordinates if strand <> 'f'
> > from hit, search
> > where hit.search_id = search.id
> > and hit.exp <= 1e-6
> > and search.tag = 'my-favorite-search'
> > ;
> >
> > -- grab the first bunch of intergenic regions
> > do
> > create temporary table ranges type = heap
> > select h1.libid as libid,
> >        max(h1.end + 1) as begin,
> >        min(h2.begin - 1) as end
> > from myhits as h1, myhits as h2
> > where h1.libid = h2.libid
> > and h2.begin > h1.end
> > group by h1.libid, h1.begin
> > ;
> >
> > -- add all the beginning ranges we missed
> > do
> > insert into ranges (libid, begin, end)
> > select h1.libid as libid,
> >        1 as begin,
> >        min(h1.begin - 1) as end
> > from myhits as h1
> > group by h1.libid
> > having end >= begin
> > ;
> >
> > -- add all the ending ranges we missed
> > do
> > insert into ranges (libid, begin, end)
> > select h1.libid as libid,
> >        max(h1.end + 1) as begin,
> >        libseq.len as end
> > from myhits as h1 left join libseq on h1.libid = libseq.id
> > group by h1.libid
> > having end >= begin
> > ;
> >
> > -- ok, now add ranges for those contigs that had no hits at all
> > do
> > create temporary table missing type = heap
> > select libseq.id as libid,
> >        1 as begin,
> >        libseq.len as end
> > from libseq
> > 	left join ranges on libseq.id = ranges.libid
> > 	left join search on libseq.search_id = search.id
> > where search.tag = 'my-favorite-search'
> >   and ranges.libid is null
> > group by libseq.id
> > order by libseq.id
> > ;
> >
> > -- had to do this in two steps because mysql won't let you
> > -- insert into a table you're currently selecting from.
> > do
> > insert into ranges (libid, begin, end)
> > select libid, begin, end
> > from missing
> > ;
> >
> > -- lastly, we may have a few duplicate overlapping ranges,
> > -- because our earlier strategy didn't remove them; we'll
> > -- now keep the shortest of them:
> > do
> > create temporary table newranges (
> > id int unsigned not null auto_increment primary key,
> > libid bigint unsigned,
> > begin bigint unsigned,
> > end bigint unsigned
> > );
> >
> > -- again, two steps for MySQL
> > do
> > insert into newranges (libid, begin, end)
> > select libid, max(begin) as begin, end
> > from ranges
> > group by libid, end
> > order by libid
> > ;
> >
> > -- oh, we'll only search ranges that are at least 150 bp long.
> > do
> > delete
> > from newranges
> > where (end - begin + 1) < 150
> > ;
> >
> > -- OK, give fasta some data!
> > -- the @C is a coordinate flag for fasta
> > select newranges.id,
> >        mid(contig.seq,
> >            newranges.begin,
> > 	   (newranges.end - newranges.begin + 1)
> >        ),
> >        concat("CONTIG_ID: ", contig.id, " @C:", newranges.begin, " ",
> > contig.name) from newranges, libseq, contig
> > where newranges.libid = libseq.id
> > and libseq.seq_id = contig.id
> > ;
> >
> >
> > That's all incredibly simple, right?
> >
> > -Aaron
> >
> >
> >
> >
>
>

-- 
 o ~   ~   ~   ~   ~   ~  o
/ Aaron J Mackey           \
\  Dr. Pearson Laboratory  /
 \ University of Virginia  \
 /  (434) 924-2821          \
 \  amackey@virginia.edu    /
  o ~   ~   ~   ~   ~   ~  o
