[BioSQL-l] Indexing of (seqfeature) locations?

Wed Aug 26 15:56:25 UTC 2009

On Aug 26, 2009, at 8:29 AM, Peter wrote:

> On Wed, Aug 26, 2009 at 1:07 PM, Hilmar Lapp<hlapp at gmx.net> wrote:
>>
>>
>> On Aug 26, 2009, at 6:53 AM, Peter wrote:
>>
>>> The BioSQL schema includes a few indexes on the location table
>>> (e.g. quoting the MySQL schema, but it looks the same on pg too):
>>>
>>> CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
>>> [...]
>>> Will these facilitate searches like this?:
>>>
>>> "SELECT ... WHERE 2000 <= location.start_pos
>>> AND location.end_pos <= 5000 AND ..."
>>>
>>> Or, for this would it help to include:
>>>
>>> CREATE INDEX seqfeatureloc_start ON location(start_pos);
>>> CREATE INDEX seqfeatureloc_start ON location(end_pos);
>>
>> With a decent RDBMS, having two indexes instead of a compound one  
>> will slow
>> this query down. What the compound one won't help you with is if  
>> your query
>> doesn't constrain the leading columns. For example, a compound  
>> index on
>> (start_pos,end_pos) won't be used if you only constrain end_pos. If  
>> you want
>> to do that, you need on index on (end_pos) too.
>
> Thanks for your reply Hilmar. Just to make sure I understood, the  
> current
> BioSQL indexes are fine for this:
>
> "SELECT ... WHERE 2000 <= location.start_pos
> AND location.end_pos <= 5000 AND ..."
>
> but not so great for:
>
> "SELECT ... WHERE 2000 <= location.start_pos AND ..."

No, this one will work fine. (provided that start_pos comes first in  
the index)

>
> or,
>
> "SELECT ... WHERE location.end_pos <= 5000 AND ..."

Yes.

> [...]
> Having just two separated indexes on start_pos and end_pos would
> speed up queries on just start or end, but would slow down queries
> using both.

Yes (though not necessarily much), and occupy more space.

>
> Presumably having three indexes as follows would cover all these
> examples efficiently, but at the cost of two more indexes?:
>
> CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
> CREATE INDEX seqfeatureloc_start ON location(start_pos);
> CREATE INDEX seqfeatureloc_start ON location(end_pos);

With this set, the waste of space for the compound index probably far  
outweighs the performance gain you might see from it. If I need to be  
able to constrain by both independently, I create a compound index,  
and separate indexes for each column after the first in the index.  
I.e., for the purposes of querying by start_pos,

CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos);
CREATE INDEX seqfeatureloc_start ON location(start_pos);

are redundant.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================