[Bioperl-l] storing and retrieving partial sequences

Lincoln Stein lstein@cshl.org
Thu, 6 Dec 2001 18:37:19 -0500


Hi Aaron,

I reread your message and realized that I misunderstood the problem you're 
solving.  What you're doing makes a lot of sense, and is similar to what I 
do, except that I do everything with a big Unix pipe:

	dump_ranges |
	    sort ranges with UNIX sort |
	        merge overlapping regions |
	            invert the regions |
	                extract the regions from FASTA files
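The merge-and-invert steps in the middle of that pipe can be sketched in a few lines of Python. This is just an illustration of the idea, not the actual pipeline code:

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or abuts the previous merged range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def invert_ranges(merged, seq_len):
    """Return the uncovered gaps within 1..seq_len, given merged ranges."""
    gaps, prev_end = [], 0
    for start, end in merged:
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        prev_end = end
    if prev_end < seq_len:
        gaps.append((prev_end + 1, seq_len))
    return gaps
```

The output of invert_ranges is what the last pipe stage would extract from the FASTA files.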

What led me to the algorithm?  A lot of beating my head against the wall.

Lincoln
			

On Thursday 06 December 2001 18:00, Aaron J Mackey wrote:
> Hi Lincoln,
>
> Interesting, I'll definitely study this further.  Could this be extended
> to a problem like mine (you have a set of ranges, you'd like to enumerate
> the overlaps, the "noncovered" betweens, etc) without [much of] a
> procedural language?  What led you to this algorithm?
>
> -Aaron
>
> On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > Hi All,
> >
> > The Bio::DB::GFF::Adaptor::mysqlopt module in bioperl-live uses a very
> > simple optimization that is very effective at speeding up the following
> > three types of query:
> >
> > 	Given a search range:
> >
> > 	1) find ranges that overlap with it
> > 	2) find ranges that are contained within it
> > 	3) find ranges that contain it
> >
> > It uses a binning scheme in which there are several bin tiers of
> > different orders of magnitude (the smallest tier consists of 1K bins and
> > the largest is large enough to hold the largest range in the database;
> > computed dynamically during database load).  The bins are then turned
> > into floating point indexes. Range queries do a series of range checks on
> > the bins in order to reduce the search space.  Here's an overlap query
> > for all features between positions 500,000 and 520,000 on chromosome "I":
> >
> >  SELECT fref,fstart,fstop
> >  FROM fdata
> >  WHERE fdata.fref='I'
> >         AND (fbin between 100000000.000000 and 100000000.000000
> >          OR fbin between 10000000.000000 and 10000000.000000
> >          OR fbin between 1000000.000000 and 1000000.000000
> >          OR fbin between 100000.000004 and 100000.000005
> >          OR fbin between 10000.000049 and 10000.000052
> >          OR fbin between 1000.000499 and 1000.000520)
> >         AND fdata.fstop>=500000 AND fdata.fstart<=520000
> >
> > And here's one that finds features that completely contain the
> > same range:
> >
> >  SELECT fref,fstart,fstop
> >  FROM fdata
> >  WHERE fdata.fref='I'
> >         AND (fbin between '100000000.000000' and '100000000.000000'
> >          OR fbin between '10000000.000000' and '10000000.000000'
> >          OR fbin between '1000000.000000' and '1000000.000000'
> >          OR fbin between '100000.000004' and '100000.000005')
> >         AND fdata.fstart<='500000' AND fdata.fstop>='520000'
> >
> > (pardon the extraneous quotes in the latter example; they are introduced
> > by my perl debugging code).
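The bin arithmetic behind those conditions can be sketched as follows. This is a reconstruction inferred from the two queries above, not the actual Bio::DB::GFF code; the tier list and the one-bin slop on the lower bound are assumptions:

```python
def fbin(tier, bin_index):
    # Encode a (tier, bin index) pair as a sortable float,
    # e.g. tier 1000, bin 520 -> 1000.000520
    return tier + bin_index / 1e6

def bin_conditions(start, stop, tiers=(1_000, 10_000, 100_000,
                                       1_000_000, 10_000_000, 100_000_000)):
    """One (low, high) fbin range per tier, covering every bin that
    could hold a feature overlapping [start, stop]."""
    conds = []
    for tier in tiers:
        lo = max(0, start // tier - 1)  # one bin of slop on the left
        hi = stop // tier
        conds.append((fbin(tier, lo), fbin(tier, hi)))
    return conds
```

For start=500,000 and stop=520,000 this reproduces the ranges in the overlap query, e.g. (1000.000499, 1000.000520) for the 1 kb tier.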
> >
> > On a MySQL database with 6 million ranges and reference segments
> > averaging 15 megabases, the unoptimized query (the one without the bin
> > conditions) takes 10.22 seconds to run, and the optimized query takes
> > 0.03 seconds. Obviously you need a bit of a Perl script to generate the
> > SQL, but that's what Bio::DB::GFF is for.
> >
> > I'm writing the algorithm up, but feel free to use it if you think it's
> > useful (you can steal the details from the code).
> >
> > PostgreSQL has an R-Tree datatype that is specialized for
> > multi-dimensional range queries.  I experimented with it before arriving
> > at this solution, but found that it doesn't do well with sequence feature
> > data.  The problem is that R-Trees create a set of dynamically-sized
> > bins, but the great differences in magnitude of the features in a typical
> > data set (some features are megabases, others are a single base pair)
> > causes many features to be greedily allocated to the largest bin.
> >
> > Lincoln
> >
> > On Thursday 06 December 2001 15:05, Aaron J Mackey wrote:
> > > This is a bit off-topic, but the invocation of 'postgres' made my ears
> > > perk up ...
> > >
> > > On Wed, 5 Dec 2001, Chris Mungall wrote:
> > > > If you're using postgres there are a number of optimisations to
> > > > consider -
> > > >
> > > > you can use the native range type to store the coordinates, this is
> > > > presumably optimised for this sort of thing.
> > >
> > > "presumably" is the key word here.  I've recently migrated to using
> > > Postgres, mainly because of all the wonderful things it seemed to
> > > provide along these lines (more interesting datatypes and inheritance,
> > > namely). I've found that, with a little inspection, many of these
> > > things aren't actually optimized at all.  Provided, yes.  Optimized,
> > > not necessarily.
> > >
> > > > other than the speed issue, which is still open, i think postgres
> > > > could be better than mysql as an open source bioinformatics dbms.
> > >
> > > I want to agree.  I want to believe.  I want to read some documentation
> > > about inheritance that tells me more than just the trite
> > > cities/capitals example.  I've read the developers list where they talk
> > > about how inheritance isn't actually "finished".  How you cannot access
> > > child attributes without explicitly joining in the table (which is
> > > how you have to handle subtyping in MySQL).  How foreign keys amongst
> > > children aren't shared.  etc. etc. etc.
> > >
> > > Sorry, didn't mean to rant.  To get back on topic, to answer the range
> > > overlap/exclusion problem, here's how we do it (with MySQL, begin/end
> > > columns):
> > >
> > > There are two ways, and no I haven't explicitly compared them to see
> > > which is better (but my hunch is the first on Postgres, the second with
> > > MySQL).
> > >
> > > Method #1. Use set logic.  The key point here is that you need to
> > > create a temporary table (or VIEW in other RDBMs) that "explodes" each
> > > begin/end range pair into one row for each position.  You do this by
> > > joining to an auxiliary table of integers (that has at least as many
> > > rows as your longest sequence).  So if you have a table of "ranges"
> > > like this:
> > >
> > > CREATE TABLE ranges (
> > > range_id int unsigned not null auto_increment primary key,
> > > seq_id int unsigned not null references seq(id),
> > > begin int unsigned not null default 1,
> > > end int unsigned not null default 1,
> > > index(seq_id),
> > > index(begin),
> > > index(end)
> > > -- CONSTRAINT good_positions
> > > --   CHECK (begin < end AND
> > > --          end <= (SELECT length FROM seq WHERE id = seq_id))
> > > );
> > >
> > > You need an auxiliary table "allnumbers" like this (adjust last select
> > > as necessary for your largest sequence length):
> > >
> > > CREATE TABLE allnumbers (
> > > number int unsigned not null primary key
> > > );
> > > INSERT INTO allnumbers (number)
> > > VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
> > > INSERT INTO allnumbers (number)
> > > SELECT DISTINCT N1.number +
> > >                (N2.number * 10) +
> > >                (N3.number * 100) +
> > >                (N4.number * 1000) +
> > >                (N5.number * 10000) +
> > >                (N6.number * 100000)
> > > FROM allnumbers as N1,
> > >      allnumbers as N2,
> > >      allnumbers as N3,
> > >      allnumbers as N4,
> > >      allnumbers as N5,
> > >      allnumbers as N6
> > > WHERE N1.number +
> > >      (N2.number * 10) +
> > >      (N3.number * 100) +
> > >      (N4.number * 1000) +
> > >      (N5.number * 10000) +
> > >      (N6.number * 100000) > 9
> > > ;
> > >
> > > (Whew!)
> > >
> > > Now make your temporary tables (in MySQL, an in-memory heap; YMMV)
> > >
> > > CREATE TEMPORARY TABLE coverage (
> > > range_id int unsigned not null,
> > > seq_id int unsigned not null,
> > > position int unsigned not null,
> > > primary key (range_id, seq_id, position)
> > > ) TYPE = Heap;
> > >
> > > INSERT INTO coverage (range_id, seq_id, position)
> > > SELECT range_id, seq_id, number
> > > FROM ranges
> > > 	INNER JOIN allnumbers
> > > 	ON (number >= begin AND number <= end)
> > > ;
> > >
> > > Now you can work with "coverage" using the usual set operations
> > > (intersection, union, difference, etc.); see Celko's "SQL For Smarties".
> > > Note that this requires an RDBMS with sub-selects, i.e. not MySQL.
> > >
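The explode-into-coverage idea can be exercised in miniature with SQLite, which does support sub-selects and set operators. The tiny schema below is illustrative only (begin/end renamed rbegin/rend, since both are reserved words in SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# begin/end renamed rbegin/rend: both are reserved words in SQLite
cur.execute("CREATE TABLE ranges (range_id INTEGER PRIMARY KEY, "
            "seq_id INT, rbegin INT, rend INT)")
cur.executemany("INSERT INTO ranges (seq_id, rbegin, rend) VALUES (?, ?, ?)",
                [(1, 2, 5), (1, 4, 8)])
# the auxiliary table of all positions (the "allnumbers" trick)
cur.execute("CREATE TABLE allnumbers (number INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO allnumbers VALUES (?)",
                [(n,) for n in range(1, 11)])
# "explode" each range into one row per covered position
cur.execute("""CREATE TEMPORARY TABLE coverage AS
               SELECT range_id, seq_id, number AS position
               FROM ranges JOIN allnumbers ON number BETWEEN rbegin AND rend""")
# set operations now apply directly, e.g. positions covered by both ranges:
overlap = [row[0] for row in cur.execute("""
    SELECT position FROM coverage WHERE range_id = 1
    INTERSECT
    SELECT position FROM coverage WHERE range_id = 2
    ORDER BY position""")]
```

Here overlap comes back as [4, 5], the intersection of ranges (2, 5) and (4, 8).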
> > >
> > > Method #2 - MySQL compatible!
> > >
> > > OK, given all that theoretical crap above, here's what we actually do;
> > > it's quite fast.  What follows is a loosely commented SQL "script" that
> > > we feed to fasta when we want to "search using all the intervening
> > > sequence between a set of orf calls/hits already in the database".
> > > fasta passes off any command that starts with "do" to the database
> > > server; the first statement without a "do" is a "select" from which
> > > fasta gets its data.
> > >
> > > -- extract the coordinates of the hits we already have:
> > > do
> > > create table myhits type = heap
> > > select lib_id as libid,
> > >        if(strand = 'f', lbeg, lend) as begin,
> > >        if(strand = 'f', lend, lbeg) as end -- reversed coordinates if strand <> 'f'
> > > from hit, search
> > > where hit.search_id = search.id
> > > and hit.exp <= 1e-6
> > > and search.tag = 'my-favorite-search'
> > > ;
> > >
> > > -- grab the first bunch of intergenic regions
> > > do
> > > create temporary table ranges type = heap
> > > select h1.libid as libid,
> > >        max(h1.end + 1) as begin,
> > >        min(h2.begin - 1) as end
> > > from myhits as h1, myhits as h2
> > > where h1.libid = h2.libid
> > > and h2.begin > h1.end
> > > group by h1.libid, h1.begin
> > > ;
> > >
> > > -- add all the beginning ranges we missed
> > > do
> > > insert into ranges (libid, begin, end)
> > > select h1.libid as libid,
> > >        1 as begin,
> > >        min(h1.begin - 1) as end
> > > from myhits as h1
> > > group by h1.libid
> > > having end >= begin
> > > ;
> > >
> > > -- add all the ending ranges we missed
> > > do
> > > insert into ranges (libid, begin, end)
> > > select h1.libid as libid,
> > >        max(h1.end + 1) as begin,
> > >        libseq.len as end
> > > from myhits as h1 left join libseq on h1.libid = libseq.id
> > > group by h1.libid
> > > having end >= begin
> > > ;
> > >
> > > -- ok, now add ranges for those contigs that had no hits at all
> > > do
> > > create temporary table missing type = heap
> > > select libseq.id as libid,
> > >        1 as begin,
> > >        libseq.len as end
> > > from libseq
> > > 	left join ranges on libseq.id = ranges.libid
> > > 	left join search on libseq.search_id = search.id
> > > where search.tag = 'my-favorite-search'
> > >   and ranges.libid is null
> > > group by libseq.id
> > > order by libseq.id
> > > ;
> > >
> > > -- had to do this in two steps because mysql won't let you
> > > -- insert into a table you're currently selecting from.
> > > do
> > > insert into ranges (libid, begin, end)
> > > select libid, begin, end
> > > from missing
> > > ;
> > >
> > > -- lastly, we may have a few duplicate overlapping ranges,
> > > -- because our earlier strategy didn't remove them; we'll
> > > -- now keep the shortest of them:
> > > do
> > > create temporary table newranges (
> > > id int unsigned not null auto_increment primary key,
> > > libid bigint unsigned,
> > > begin bigint unsigned,
> > > end bigint unsigned
> > > );
> > >
> > > -- again, two steps for MySQL
> > > do
> > > insert into newranges (libid, begin, end)
> > > select libid, max(begin) as begin, end
> > > from ranges
> > > group by libid, end
> > > order by libid
> > > ;
> > >
> > > -- oh, we'll only search if they're longer than 150 bp.
> > > do
> > > delete
> > > from newranges
> > > where (end - begin + 1) < 150
> > > ;
> > >
> > > -- OK, give fasta some data!
> > > -- the @C is a coordinate flag for fasta
> > > select newranges.id,
> > >        mid(contig.seq,
> > >            newranges.begin,
> > >            (newranges.end - newranges.begin + 1)
> > >        ),
> > >        concat("CONTIG_ID: ", contig.id, " @C:", newranges.begin, " ",
> > >               contig.name)
> > > from newranges, libseq, contig
> > > where newranges.libid = libseq.id
> > > and libseq.seq_id = contig.id
> > > ;
> > >
> > >
> > > That's all incredibly simple, right?
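The heart of the script above is the self-join that pairs each hit with every later hit on the same contig. It can be checked in miniature with SQLite; table and column names here are illustrative (begin/end renamed to dodge SQL keywords), not the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE myhits (libid INT, rbegin INT, rend INT)")
cur.executemany("INSERT INTO myhits VALUES (?, ?, ?)",
                [(1, 1, 100), (1, 301, 400), (1, 701, 900)])
# for each hit h1, the nearest following hit h2 bounds the intergenic
# gap between them (the MIN/GROUP BY picks out the "nearest")
gaps = list(cur.execute("""
    SELECT h1.libid, MAX(h1.rend + 1) AS gbegin, MIN(h2.rbegin - 1) AS gend
    FROM myhits AS h1 JOIN myhits AS h2
      ON h1.libid = h2.libid AND h2.rbegin > h1.rend
    GROUP BY h1.libid, h1.rbegin
    ORDER BY gbegin"""))
# gaps is now [(1, 101, 300), (1, 401, 700)]; the leading and trailing
# gaps are added by the later inserts, just as in the script
```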
> > >
> > > -Aaron
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
>
> --
>  o ~   ~   ~   ~   ~   ~  o
> / Aaron J Mackey           \
> \  Dr. Pearson Laboratory  /
>  \ University of Virginia  \
>  /  (434) 924-2821          \
>  \  amackey@virginia.edu    /
>   o ~   ~   ~   ~   ~   ~  o
>

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================