[Bioperl-l] Bio::DB::SeqFeature::Store::Pg index on "name": why varchar_pattern_ops?

Scott Cain scott at scottcain.net
Thu Apr 8 02:44:48 UTC 2010


Hi Rob,

I was a little concerned that this might happen.  This change was
suggested by Randall on the GBrowse mailing list a few weeks ago, and
after some testing on my machine, it seemed to work well.  The problem
was that wild card searches on the name column were seqscanning, and
the suggested index fixed the problem.  I wonder if the query planner
in postgres would be smart enough to use the appropriate index if we
created both one with and one without varchar_pattern_ops.

The reason for using varchar_pattern_ops specifically was to allow for
searching in a non-C locale.  Another option would be to make the
creation of the index conditional when the database is created.  It
could use the standard C locale setting and if a flag is set, create
the index to allow non-C locale searches (and presumably, the default
C locale would be the right thing for most users).

Scott


On Wed, Apr 7, 2010 at 8:23 PM, Robert Buels <rmb32 at cornell.edu> wrote:
> What's the thinking behind the BDBSFS::Pg (ha) index on the "name" table at
> Pg.pm line 282:
>
>    CREATE INDEX name_name_varchar_patt_ops_idx ON name USING BTREE
>    (lower(name) varchar_pattern_ops);
>
> Why the varchar_pattern_ops operator class?
>
> I'm testing an installation (backing gbrowse, of course) on Pg 8.3, and
> performance was really stinking on the SFS query:
>
>    SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
>    FROM feature as f, name as n
>    WHERE (n.id=f.id AND lower(n.name) = lower($1) AND n.display_name>0)
>
> so I explained it, giving
>
>                                      QUERY PLAN
> --------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..130159.60 rows=11059 width=556)
>   ->  Seq Scan on name n  (cost=0.00..53489.41 rows=11059 width=4)
>         Filter: ((display_name > 0) AND (lower((name)::text) = 'foo'::text))
>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..6.92 rows=1
> width=556)
>         Index Cond: (f.id = n.id)
>
>
> Seq scan, not good.  It's not using the name_name_varchar_patt_ops_idx.  So
> I added an index on just lower(name) without the varchar_pattern_ops
> business (create index lowername on name ( lower(name) )) and performance
> started getting fine again, with explain:
>
>
>                                      QUERY PLAN
> ------------------------------------------------------------------------------------
>  Nested Loop  (cost=640.81..175775.44 rows=20521 width=501)
>   ->  Bitmap Heap Scan on name n  (cost=640.81..31557.42 rows=20521 width=4)
>         Recheck Cond: (lower((name)::text) = 'foo'::text)
>         Filter: (display_name > 0)
>         ->  Bitmap Index Scan on lowername  (cost=0.00..635.68 rows=20597
> width=0)
>               Index Cond: (lower((name)::text) = 'foo'::text)
>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..7.02 rows=1
> width=501
>         Index Cond: (f.id = n.id)
> (8 rows)
>
>
> This has much better performance in practice, since it's not sequentially
> scanning the 4.1M row "name" table.
>
> This suggests to me that the varchar_pattern_ops class should be removed,
> but could one of the original BDBSFS::Pg authors (like Lincoln or Scott)
> weigh in on this?
>
> Rob
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research




More information about the Bioperl-l mailing list