[Bioperl-l] storing and retrieving partial sequences

Aaron J Mackey Aaron J. Mackey" <amackey@virginia.edu
Thu, 6 Dec 2001 15:05:05 -0500 (EST)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


This is a bit offtopic, but the invocation of 'postgres' made my ears perk
up ...

On Wed, 5 Dec 2001, Chris Mungall wrote:

> If you're using postgres there are a number of optimisations to
> consider -
>
> you can use the native range type to store the coordinates, this is
> presumably optimised for this sort of thing.

"presumably" is the key word here.  I've recently migrated to using
Postgres, mainly because of all the wonderful things it seemed to provide
along these lines (more interesting datatypes and inheritance, namely).
I've found that, with a little inspection, many of these things aren't
actually optimized at all.  Provided, yes.  Optimized, not necessarily.

> other than the speed issue, which is still open, i think postgres
> could be better than mysql as an open source bioinformatics dbms.

I want to agree.  I want to believe.  I want to read some documentation
about inheritance that tells me more than just the trite cities/capitals
example.  I've read the developers list where they talk about how
inheritance isn't actually "finished".  How you cannot access children
attributes without explictly joining in the table (which is how you have
to handle subtyping in MySQL).  How foreign keys amongst children aren't
shared.  etc. etc. etc.

Sorry, didn't mean to rant.  To get bac on topic, to answer the range
overlap/exclusion problem, here's how we do it (with MySQL, begin/end
columns):

There are two ways, and no I haven't explictly compared them to see which
is better (but my hunch is the first on Postgres, the second with MySQL).

Method #1. Use set logic.  The key point here is that you need to create a
temporary table (or VIEW in other RDBMs) that "explodes" each begin/end
range pair into one row for each position.  You do this by joining to an
auxiliary table of integers (that has at least as many rows as your
longest sequence).  So if you have a table of "ranges" like this:

CREATE TABLE ranges (
range_id int unsigned not null auto_increment primary key,
seq_id int unsigned not null foreign key references seq(id),
begin int unsigned not null default 1,
end int unsigned not null default 1,
index(seq_id),
index(begin),
index(end)
)
- -- CONSTRAINT good_positions
- -- CHECK (begin < end AND
          end <= SELECT length from seq where id = seq_id),
;

You need an auxiliary table "allnumbers" like this (adjust last select as
necessary for your largest sequence length):

CREATE TABLE allnumbers (
number int unsigned not null primary key
);
INSERT INTO allnumbers (number)
VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
INSERT INTO allnumbers (number)
SELECT DISTINCT N1.number +
               (N2.number * 10) +
               (N3.number * 100) +
               (N4.number * 1000) +
               (N5.number * 10000) +
               (N6.number * 10000)
FROM allnumbers as N1,
     allnumbers as N2,
     allnumbers as N3,
     allnumbers as N4,
     allnumbers as N5,
     allnumbers as N6
WHERE N1.number +
     (N2.number * 10) +
     (N3.number * 100) +
     (N4.number * 1000) +
     (N5.number * 10000) +
     (N6.number * 10000) > 9

(Whew!)

Now make your temporary tables (in MySQL, an in-memory heap; YMMV)

CREATE TEMPORARY TABLE coverage (
range_id int unsigned not null,
seq_id int unsigned not null,
position int unsigned not null,
primary key (range_id, seq_id, position)
) TYPE = Heap;

INSERT INTO coverage (range_id, seq_id, position)
SELECT range_id, seq_id, number
FROM ranges
	INNER JOIN allnumbers
	ON (number >= begin AND number <= end)
;

Now you can work with "coverage" as with usual set operations
(intersection, union, exception, etc); see Celko's "SQL For Smarties"
Note that this requires a RDBM with sub-selects, i.e. not MySQL


Method #2 - MySQL compatible!

OK, given all that theoretical crap above, here's what we actually do;
it's quite fast.  What follows is a loosely commented SQL "script" that we
feed to fasta when we want to "search using all the intervening sequence
between a set of orf calls/hits already in the database".  fasta passes
off any command that starts with "do" to the database server; the first
statement without a "do" is a "select" from which fasta gets it's data.

- -- extract the coordinates of the hits we already have:
do
create table myhits type = heap
select lib_id as libid,
       if(strand = 'f', lbeg, lend) as begin,
       if(strand = 'f', lend, lbeg) as end -- reversed coordinates if strand <> 'f'
from hit, search
where hit.search_id = search.id
and hit.exp <= 1e-6
and search.tag = 'my-favorite-search'
;

- -- grab the first bunch of intergenic regions
do
create temporary table ranges type = heap
select h1.libid as libid,
       max(h1.end + 1) as begin,
       min(h2.begin - 1) as end
from myhits as h1, myhits as h2
where h1.libid = h2.libid
and h2.begin > h1.end
group by h1.libid, h1.begin
;

- -- add all the beginning ranges we missed
do
insert into ranges (libid, begin, end)
select h1.libid as libid,
       1 as begin,
       min(h1.begin - 1) as end
from myhits as h1
group by h1.libid
having end >= begin
;

- -- add all the ending ranges we missed
do
insert into ranges (libid, begin, end)
select h1.libid as libid,
       max(h1.end + 1) as begin,
       libseq.len as end
from myhits as h1 left join libseq on h1.libid = libseq.id
group by h1.libid
having end >= begin
;

- -- ok, now add ranges for those contigs that had no hits at all
do
create temporary table missing type = heap
select libseq.id as libid,
       1 as begin,
       libseq.len as end
from libseq
	left join ranges on libseq.id = ranges.libid
	left join search on libseq.search_id = search.id
where search.tag = 'my-favorite-search'
  and ranges.libid is null
group by libseq.id
order by libseq.id
;

- -- had to do this in two steps because mysql won't let you
- -- insert into a table you're currently selecting from.
do
insert into ranges (libid, begin, end)
select libid, begin, end
from missing
;

- -- lastly, we may have a few duplicate overlapping ranges,
- -- because our earlier strategy didn't remove them; we'll
- -- now keep the shortest of them:
do
create temporary table newranges (
id int unsigned not null auto_increment primary key,
libid bigint unsigned,
begin bigint unsigned,
end bigint unsigned
);

- -- again, two steps for MySQL
do
insert into newranges (libid, begin, end)
select libid, max(begin) as begin, end
from ranges
group by libid, end
order by libid
;

- -- oh, we'll only search if they're longer than 150 bp.
do
delete
from newranges
where (end - begin + 1) < 150
;

- -- OK, give fasta some data!
- -- the @C is a coordinate flag for fasta
select newranges.id,
       mid(contig.seq,
           newranges.begin,
	   (newranges.end - newranges.begin + 1)
       ),
       concat("CONTIG_ID: ", contig.id, " @C:", newranges.begin, " ", contig.name)
from newranges, libseq, contig
where newranges.libid = libseq.id
and libseq.seq_id = contig.id
;


That's all incredibly simple, right?

- -Aaron



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (OSF1)
Comment: For info see http://www.gnupg.org

iD8DBQE8D89yt6Sp9Om+GYMRAhXwAJ4irnhvTGDyv8ZBabXx4YUr72OoYgCfSKcc
pWdEXSUZ2chUIFMgmYtvXBk=
=NP3O
-----END PGP SIGNATURE-----