PostgreSQL Performance Optimization Research

Jonathan Chao
Oct 2, 2018

This post serves mostly as a note to myself as I dig into Postgres performance optimization.

The problem I was facing: my boss wanted me to improve the performance of one of the pages that lists out a query result. The query consists of a nested SELECT with several COUNT(*) OVER (PARTITION ...) window functions. The page took over 3 minutes to load, and my boss wanted the wait to be under 2 seconds.

Great……

The initial idea I had was to index all joined/filtered columns. That should shorten the time needed, right? Right, but not enough. There are simply too many rows to go through: there are 34 GB of data in total, and we're pretty much doing a full scan here. Not good.
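In practice that just means plain B-tree indexes on the join/filter columns. A minimal sketch, with made-up table and column names (the real schema is different):

-- hypothetical names: index the columns used in JOINs and WHERE filters
CREATE INDEX idx_items_category ON items (category);
CREATE INDEX idx_items_created_at ON items (created_at);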

I started searching for other ways to decrease the time. Thanks to great utility commands like EXPLAIN and ANALYZE, I could see where the bottleneck was. EXPLAIN shows the plan Postgres wants to run, along with the estimated cost, row count, and row width at each step. Adding ANALYZE (as in EXPLAIN ANALYZE), on the other hand, actually executes the query and shows the actual time spent and rows processed at each step.

You can simply run

EXPLAIN ANALYZE SELECT * FROM your_table;

and it will show both results, planned and actual.

One thing I noticed is that Postgres runs a different plan on my local database than on the prod-like one. The prod-like database has production volume, which is the 34 GB I was referring to. My local, of course, doesn't have that much data; it has only about 300 MB. The plan was a bitmap index scan on my local, while the prod-like database runs a sequential scan. For those who are not familiar with the terms, you can check this site out.

But long story short, when you run a query that would return a lot of rows, Postgres picks a sequential scan, because a sequential scan just reads the table pages in order, while an index scan has to read the index first and then fetch the matching rows from the table. With too many rows to fetch, that becomes more expensive than a sequential scan. Sure, if the amount of matching data is small enough, an index scan can be REALLY efficient, because most of the work happens inside the index and only a few table pages get touched. This explains why the result comes back so fast on my local db.

I don’t like where this is going……

Since Postgres tries to optimize the plan itself, I don't want to interfere. There is indeed a way to force an index scan, but I don't think it will help. I have to tackle this from somewhere else.
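(The "force" here isn't a hard override; the planner setting just makes sequential scans look prohibitively expensive. As a session-level experiment only, it looks like this:)

-- discourage sequential scans for this session, then re-check the plan
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM your_table;
RESET enable_seqscan;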

In my query, there is a nested SELECT, a COUNT(*) OVER (PARTITION ...), and a RANK() OVER (...). Additionally, there is an ILIKE '%<string>%' where the string comes from the user's search input.
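To make that concrete, the shape of the query was roughly like this (the table name, column names, and the :search placeholder are all made up; only the structure matters):

SELECT sub.*
FROM (
    SELECT items.*,
           COUNT(*) OVER (PARTITION BY items.category) AS category_count,
           RANK() OVER (PARTITION BY items.category
                        ORDER BY items.created_at DESC) AS category_rank
    FROM items
    WHERE items.name ILIKE '%' || :search || '%'
) AS sub
ORDER BY category_count DESC, category_rank;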

It seems like I can change the ILIKE to a LIKE by lowercasing both sides first. It doesn't save much time, though.
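The rewrite itself is tiny (same made-up column as above). Note that a pattern with a leading % still can't use a plain B-tree index, which is part of why the next step is trigrams:

-- before
WHERE name ILIKE '%' || :search || '%'
-- after: lowercase both sides so a plain LIKE comparison is enough
WHERE lower(name) LIKE '%' || lower(:search) || '%'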

The second approach is to use a trigram index on the column being partitioned. If you are not familiar with how trigrams work, check this out.

Simply put, a string like 'cat' gets broken into a set of three-character chunks (in pg_trgm, roughly ['  c', ' ca', 'cat', 'at '], with padding spaces), and those chunks are what get indexed. I'm using the GIN index type because, apparently, Postgres has focused a lot more on GIN, and the algorithm after 9.6 (I'm using 10.4) is a lot more sophisticated than GiST's. I cannot guarantee this, as I did not find any doc supporting it, but this is what Stack Overflow says, so take it or leave it.

Still, the doc does say that GIN is the preferred option here.
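Setting this up is short; a sketch with the same hypothetical table and column:

-- pg_trgm ships with Postgres but has to be enabled per database
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- GIN trigram index so LIKE/ILIKE '%...%' searches can use an index
CREATE INDEX idx_items_name_trgm ON items USING gin (name gin_trgm_ops);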

The indexing takes a long time and a lot of memory. That's expected, since it has to index 34 GB of data with a lot of variation in the words.

However, the result is not satisfying. I guess the trigram index cannot overcome the sequential scan vs. index scan problem. I thought the GIN index would do better than a B-tree, but Postgres still opts for a sequential scan. Interesting……

My next attempt was to flatten the query. The nested SELECT is there to get the count and rank over the partition, and those values are then used to order the results. Do I need them? What's the intention of ordering this way? I asked my boss, and he told me to try the query without that ordering.
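Concretely, "without such ordering" means the nested SELECT and the window functions go away entirely, so the sketch above collapses to roughly:

SELECT items.*
FROM items
WHERE lower(items.name) LIKE '%' || lower(:search) || '%';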

Well, guess what: the query duration went from 3 minutes and 40 seconds to 17 seconds. I talked to my boss again about this, and he asked me to try to make it even shorter.

I think I had exhausted my options on the query and indexing side, so I went for configuration. Thanks to Postgres, there is again nice documentation about it.

It seems like shared_buffers is the primary target. The default is 128 MB, and the docs recommend starting with about 25% of the machine's RAM. Whatever, gonna try increasing it anyway. I first raised it to 256 MB. I thought this would cut the time in half (plus some), but it only dropped the time to around 10 seconds, so I raised it again, to 512 MB. This time the query dropped to around 5 seconds. Still not as fast as I want it to be, but for sure a lot better compared to the 220 seconds of the original query. However, this is all experimental, and I need more information to justify raising the buffer to 512 MB. I'll need to check the memory allocated to the machine before making the final decision on the buffer size.
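For reference, this is the kind of change involved. shared_buffers can be set in postgresql.conf or via ALTER SYSTEM, and either way Postgres has to be restarted for it to take effect:

-- writes the setting to postgresql.auto.conf; takes effect only after a restart
ALTER SYSTEM SET shared_buffers = '512MB';
-- after restarting, verify the active value
SHOW shared_buffers;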

Okay, so this is not really “How I optimized my database performance” but rather “How I changed my query and the expected result to fit the criteria”, so take the outcome with a grain of salt. Still, I did go through the possible routes for making the query more efficient. As I said in the beginning, this acts more like a note to myself, so I think it's still worthwhile.

If you have other ways to further optimize the query, please leave a comment below.

