[Bioperl-l] Bio::DB::SeqFeature::Store::Pg index on "name": why varchar_pattern_ops?
Robert Buels
rmb32 at cornell.edu
Thu Apr 8 15:05:05 EDT 2010
Hmm, well, the vanilla index seems necessary for my pg 8.3.9/ubuntu
hardy/utf8 locales/UTF8 encoding installation. If both a
varchar_pattern_ops and a vanilla index are present, will installations
that need the varchar_pattern_ops one use it? Randall, do you have an
installation you can test that on?
Rob
Scott Cain wrote:
> Hi Rob,
>
> I was a little concerned that this might happen. This change was
> suggested by Randall on the GBrowse mailing list a few weeks ago, and
> after some testing on my machine, it seemed to work well. The problem
> was that wild card searches on the name column were seqscanning, and
> the suggested index fixed the problem. I wonder if the query planner
> in postgres would be smart enough to use the appropriate index if we
> created both one with and one without varchar_pattern_ops.
>
> The reason for using varchar_pattern_ops specifically was to allow for
> searching in a non-C locale. Another option would be to make the
> creation of the index conditional when the database is created. It
> could use the standard C locale setting and if a flag is set, create
> the index to allow non-C locale searches (and presumably, the default
> C locale would be the right thing for most users).
>
> Scott
>
>
> On Wed, Apr 7, 2010 at 8:23 PM, Robert Buels <rmb32 at cornell.edu> wrote:
>> What's the thinking behind the BDBSFS::Pg (ha) index on the "name" table at
>> Pg.pm line 282:
>>
>> CREATE INDEX name_name_varchar_patt_ops_idx ON name USING BTREE
>> (lower(name) varchar_pattern_ops);
>>
>> Why the varchar_pattern_ops operator class?
>>
>> I'm testing an installation (backing gbrowse, of course) on Pg 8.3, and
>> performance was really stinking on the SFS query:
>>
>> SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
>> FROM feature as f, name as n
>> WHERE (n.id=f.id AND lower(n.name) = lower($1) AND n.display_name>0)
>>
>> so I explained it, giving
>>
>> QUERY PLAN
>> --------------------------------------------------------------------------------------
>> Nested Loop (cost=0.00..130159.60 rows=11059 width=556)
>> -> Seq Scan on name n (cost=0.00..53489.41 rows=11059 width=4)
>> Filter: ((display_name > 0) AND (lower((name)::text) = 'foo'::text))
>> -> Index Scan using feature_pkey on feature f (cost=0.00..6.92 rows=1
>> width=556)
>> Index Cond: (f.id = n.id)
>>
>>
>> Seq scan, not good. It's not using the name_name_varchar_patt_ops_idx. So
>> I added an index on just lower(name) without the varchar_pattern_ops
>> business (create index lowername on name ( lower(name) )) and performance
>> started getting fine again, with explain:
>>
>>
>> QUERY PLAN
>> ------------------------------------------------------------------------------------
>> Nested Loop (cost=640.81..175775.44 rows=20521 width=501)
>> -> Bitmap Heap Scan on name n (cost=640.81..31557.42 rows=20521 width=4)
>> Recheck Cond: (lower((name)::text) = 'foo'::text)
>> Filter: (display_name > 0)
>> -> Bitmap Index Scan on lowername (cost=0.00..635.68 rows=20597
>> width=0)
>> Index Cond: (lower((name)::text) = 'foo'::text)
>> -> Index Scan using feature_pkey on feature f (cost=0.00..7.02 rows=1
>> width=501
>> Index Cond: (f.id = n.id)
>> (8 rows)
>>
>>
>> This has much better performance in practice, since it's not sequentially
>> scanning the 4.1M row "name" table.
>>
>> This suggests to me that the varchar_pattern_ops class should be removed,
>> but could one of the original BDBSFS::Pg authors (like Lincoln or Scott)
>> weigh in on this?
>>
>> Rob
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
>
More information about the Bioperl-l
mailing list