[Bioperl-l] Bio::DB::SeqFeature::Store::Pg index on "name": why varchar_pattern_ops?

randalls at bioinfo.wsu.edu randalls at bioinfo.wsu.edu
Wed Apr 7 23:30:37 EDT 2010


Wow, I really stirred up the bees nest........

The varchar_pattern_ops is needed only if your locale is not C or POSIX. Mine is en_US.UTF-8 as demonstrated by SHOW LC_COLLATE;

This is as I understand set when initdb is ran and is set for the life of the datbase cluster (i.e. all those files and folders in $PGDATA).   

I am wondering if the Pg.pm module could detect the locale and set the appropriate indexes when the database is created.  Perhaps that would be the smarter way to do it?  

Rob, What is your Locale set to?  I am guessing it is C or Posix or something related.  Do you specify a locale when you run initdb or do you let initdb select it for you?

Thanks,

Randall Svancara
Systems Administrator/DBA/Developer
Main Bioinformatics Laboratory



----- Original Message -----
From: "Scott Cain" <scott at scottcain.net>
To: "Robert Buels" <rmb32 at cornell.edu>
Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>, randalls at bioinfo.wsu.edu
Sent: Wednesday, April 7, 2010 7:44:48 PM
Subject: Re: [Bioperl-l] Bio::DB::SeqFeature::Store::Pg index on "name": why 	varchar_pattern_ops?

Hi Rob,

I was a little concerned that this might happen.  This change was
suggested by Randall on the GBrowse mailing list a few weeks ago, and
after some testing on my machine, it seemed to work well.  The problem
was that wild card searches on the name column were seqscanning, and
the suggested index fixed the problem.  I wonder if the query planner
in postgres would be smart enough to use the appropriate index if we
created both one with and one without varchar_pattern_ops.

The reason for using varchar_pattern_ops specifically was to allow for
searching in a non-C locale.  Another option would be to make the
creation of the index conditional when the database is created.  It
could use the standard C locale setting and if a flag is set, create
the index to allow non-C locale searches (and presumably, the default
C locale would be the right thing for most users).

Scott


On Wed, Apr 7, 2010 at 8:23 PM, Robert Buels <rmb32 at cornell.edu> wrote:
> What's the thinking behind the BDBSFS::Pg (ha) index on the "name" table at
> Pg.pm line 282:
>
>    CREATE INDEX name_name_varchar_patt_ops_idx ON name USING BTREE
>    (lower(name) varchar_pattern_ops);
>
> Why the varchar_pattern_ops operator class?
>
> I'm testing an installation (backing gbrowse, of course) on Pg 8.3, and
> performance was really stinking on the SFS query:
>
>    SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
>    FROM feature as f, name as n
>    WHERE (n.id=f.id AND lower(n.name) = lower($1) AND n.display_name>0)
>
> so I explained it, giving
>
>                                      QUERY PLAN
> --------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..130159.60 rows=11059 width=556)
>   ->  Seq Scan on name n  (cost=0.00..53489.41 rows=11059 width=4)
>         Filter: ((display_name > 0) AND (lower((name)::text) = 'foo'::text))
>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..6.92 rows=1
> width=556)
>         Index Cond: (f.id = n.id)
>
>
> Seq scan, not good.  It's not using the name_name_varchar_patt_ops_idx.  So
> I added an index on just lower(name) without the varchar_pattern_ops
> business (create index lowername on name ( lower(name) )) and performance
> started getting fine again, with explain:
>
>
>                                      QUERY PLAN
> ------------------------------------------------------------------------------------
>  Nested Loop  (cost=640.81..175775.44 rows=20521 width=501)
>   ->  Bitmap Heap Scan on name n  (cost=640.81..31557.42 rows=20521 width=4)
>         Recheck Cond: (lower((name)::text) = 'foo'::text)
>         Filter: (display_name > 0)
>         ->  Bitmap Index Scan on lowername  (cost=0.00..635.68 rows=20597
> width=0)
>               Index Cond: (lower((name)::text) = 'foo'::text)
>   ->  Index Scan using feature_pkey on feature f  (cost=0.00..7.02 rows=1
> width=501
>         Index Cond: (f.id = n.id)
> (8 rows)
>
>
> This has much better performance in practice, since it's not sequentially
> scanning the 4.1M row "name" table.
>
> This suggests to me that the varchar_pattern_ops class should be removed,
> but could one of the original BDBSFS::Pg authors (like Lincoln or Scott)
> weigh in on this?
>
> Rob
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research



More information about the Bioperl-l mailing list