[BioSQL-l] Indexing of (seqfeature) locations?
Hilmar Lapp
hlapp at gmx.net
Wed Aug 26 15:56:25 UTC 2009
On Aug 26, 2009, at 8:29 AM, Peter wrote:
> On Wed, Aug 26, 2009 at 1:07 PM, Hilmar Lapp<hlapp at gmx.net> wrote:
>>
>>
>> On Aug 26, 2009, at 6:53 AM, Peter wrote:
>>
>>> The BioSQL schema includes a few indexes on the location table
>>> (e.g. quoting the MySQL schema, but it looks the same on pg too):
>>>
>>> CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
>>> [...]
>>> Will these facilitate searches like this?:
>>>
>>> "SELECT ... WHERE 2000 <= location.start_pos
>>> AND location.end_pos <= 5000 AND ..."
>>>
>>> Or, for this would it help to include:
>>>
>>> CREATE INDEX seqfeatureloc_start ON location(start_pos);
>>> CREATE INDEX seqfeatureloc_start ON location(end_pos);
>>
>> With a decent RDBMS, having two indexes instead of a compound one
>> will slow
>> this query down. What the compound one won't help you with is if
>> your query
>> doesn't constrain the leading columns. For example, a compound
>> index on
>> (start_pos,end_pos) won't be used if you only constrain end_pos. If
>> you want
>> to do that, you need on index on (end_pos) too.
>
> Thanks for your reply Hilmar. Just to make sure I understood, the
> current
> BioSQL indexes are fine for this:
>
> "SELECT ... WHERE 2000 <= location.start_pos
> AND location.end_pos <= 5000 AND ..."
>
> but not so great for:
>
> "SELECT ... WHERE 2000 <= location.start_pos AND ..."
No, this one will work fine. (provided that start_pos comes first in
the index)
>
> or,
>
> "SELECT ... WHERE location.end_pos <= 5000 AND ..."
Yes.
> [...]
> Having just two separated indexes on start_pos and end_pos would
> speed up queries on just start or end, but would slow down queries
> using both.
Yes (though not necessarily much), and occupy more space.
>
> Presumably having three indexes as follows would cover all these
> examples efficiently, but at the cost of two more indexes?:
>
> CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
> CREATE INDEX seqfeatureloc_start ON location(start_pos);
> CREATE INDEX seqfeatureloc_start ON location(end_pos);
With this set, the waste of space for the compound index probably far
outweighs the performance gain you might see from it. If I need to be
able to constrain by both independently, I create a compound index,
and separate indexes for each column after the first in the index.
I.e., for the purposes of querying by start_pos,
CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
CREATE INDEX seqfeatureloc_start ON location(start_pos);
are redundant.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
More information about the BioSQL-l
mailing list