[Bioperl-l] Bio::DB::SeqFeature::Store::Pg index on "name": why varchar_pattern_ops?

Thu Apr 8 19:05:05 UTC 2010

Hmm, well, the vanilla index seems necessary for my pg 8.3.9/ubuntu 
hardy/utf8 locales/UTF8 encoding installation.  If both a 
varchar_pattern_ops and a vanilla index are present, will installations 
that need the varchar_pattern_ops one use it?  Randall, do you have an 
installation you can test that on?

Rob

Scott Cain wrote:
> Hi Rob,
> 
> I was a little concerned that this might happen.  This change was
> suggested by Randall on the GBrowse mailing list a few weeks ago, and
> after some testing on my machine, it seemed to work well.  The problem
> was that wild card searches on the name column were seqscanning, and
> the suggested index fixed the problem.  I wonder if the query planner
> in postgres would be smart enough to use the appropriate index if we
> created both one with and one without varchar_pattern_ops.
> 
> The reason for using varchar_pattern_ops specifically was to allow for
> searching in a non-C locale.  Another option would be to make the
> creation of the index conditional when the database is created.  It
> could use the standard C locale setting and if a flag is set, create
> the index to allow non-C locale searches (and presumably, the default
> C locale would be the right thing for most users).
> 
> Scott
> 
> 
> On Wed, Apr 7, 2010 at 8:23 PM, Robert Buels <rmb32 at cornell.edu> wrote:
>> What's the thinking behind the BDBSFS::Pg (ha) index on the "name" table at
>> Pg.pm line 282:
>>
>>    CREATE INDEX name_name_varchar_patt_ops_idx ON name USING BTREE
>>    (lower(name) varchar_pattern_ops);
>>
>> Why the varchar_pattern_ops operator class?
>>
>> I'm testing an installation (backing gbrowse, of course) on Pg 8.3, and
>> performance was really stinking on the SFS query:
>>
>>    SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
>>    FROM feature as f, name as n
>>    WHERE (n.id=f.id AND lower(n.name) = lower($1) AND n.display_name>0)
>>
>> so I explained it, giving
>>
>>                                      QUERY PLAN
>> --------------------------------------------------------------------------------------
>>  Nested Loop  (cost=0.00..130159.60 rows=11059 width=556)
>>   ->  Seq Scan on name n  (cost=0.00..53489.41 rows=11059 width=4)
>>         Filter: ((display_name > 0) AND (lower((name)::text) = 'foo'::text))
>>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..6.92 rows=1
>> width=556)
>>         Index Cond: (f.id = n.id)
>>
>>
>> Seq scan, not good.  It's not using the name_name_varchar_patt_ops_idx.  So
>> I added an index on just lower(name) without the varchar_pattern_ops
>> business (create index lowername on name ( lower(name) )) and performance
>> started getting fine again, with explain:
>>
>>
>>                                      QUERY PLAN
>> ------------------------------------------------------------------------------------
>>  Nested Loop  (cost=640.81..175775.44 rows=20521 width=501)
>>   ->  Bitmap Heap Scan on name n  (cost=640.81..31557.42 rows=20521 width=4)
>>         Recheck Cond: (lower((name)::text) = 'foo'::text)
>>         Filter: (display_name > 0)
>>         ->  Bitmap Index Scan on lowername  (cost=0.00..635.68 rows=20597
>> width=0)
>>               Index Cond: (lower((name)::text) = 'foo'::text)
>>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..7.02 rows=1
>> width=501
>>         Index Cond: (f.id = n.id)
>> (8 rows)
>>
>>
>> This has much better performance in practice, since it's not sequentially
>> scanning the 4.1M row "name" table.
>>
>> This suggests to me that the varchar_pattern_ops class should be removed,
>> but could one of the original BDBSFS::Pg authors (like Lincoln or Scott)
>> weigh in on this?
>>
>> Rob
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> 
> 
>