Troubleshooting common issues with composite indexes in SQL databases
Sem Postma, September 10, 2023
Composite indexes are very powerful tools at your disposal when improving performance on large databases.
It’s important to state beforehand that you should always be careful not to fall into the trap of premature optimization. If you are running a relatively small database and you are getting performance issues, it might be more cost effective to pay a little extra for more powerful hardware than to spend time optimizing and increasing complexity.
In my opinion optimizations should follow the following order:
- Eliminating unused queries
- Adding indexes
- Optimizing queries
- Add covering indexes
- Caching
- Hardware optimizations
- Partitioning
- Sharding
Next we will be going over a few common issues/problems with composite indexes. If you are in the design phase, it’s good to read up on the these issues because you may encounter them later in the implementation phase. Anticipating these issues and designing your database so that these issues won’t occur will save you a lot of time.
Key/Column order
The key order is extremely important. The order of the keys determines how to index is structure. An easy way to think about it, is that every entry in a normal index points to a row. Like the following example where the author_id is mapped to the pages table using it’s primary key.
# normal index (pages table):
# author_id: 1 2 2 3 1 1
# | | | | | |
# page_id: 1 2 3 4 5 6
CREATE INDEX normal_index ON pages (author_id);
SELECT * FROM pages WHERE author_id = 1;
In a composite index, the first key of the index points to the next key and the last key points to the table itself like this:
# composite index (pages table):
# author_id: 1 2 3
# / | \ / \ |
# view_count: 323 97 123 432 14 15421
# | | | | | |
# page_id: 1 5 6 2 3 4
CREATE INDEX normal_index ON pages (author_id, view_count);
SELECT * FROM pages WHERE author_id = 1 AND view_count > 100;
The most basic rule to composite indexes is to put columns in the `WHERE` statement, in the first part of the composite index, and the `ORDER BY` column as the last key in the index:
SELECT * WHERE category_id = 30 AND author_id = 40 ORDER BY created_at;
CREATE INDEX my_composite_index ON pages (category_id, author_id, created_at);
If you leave out any columns in the WHERE query, the database cant use the last column for ORDER BY. This is a called a “covering index”, which is when the index covers all of the columns in a query (in this case a “covered query”). If you also `SELECT` only columns from the index, the database won’t even look up the table and will just use the INDEX itself.
The following composite indexes will all work for the previous query:
CREATE INDEX my_composite_index_2 ON pages (category_id, author_id); # no ORDER BY column
CREATE INDEX my_composite_index_3 ON pages (author_id, category_id, created_at) # WHERE columns in a different order
In some edge cases, the database will still use your composite index, even if column in the `WHERE` statement is missing from the index. It will skip that column, lookup the index and then filter the records in the index based on the missing column. This only works of the query optimizer determines that the missing column is not very important to the query. If you included the `ORDER BY` column in the composite index, the database won’t be able to use it in this case.
Most likely your query will include other operations than just checking for equality. Like range conditions. If you are using a range condition, it should be the last of the `WHERE` columns, but before the `ORDER BY` columns. Like this:
Putting view_count first will not work because the database will do a range scan of “view_count”, after that it will return ALL records with more than a 1000 views regardless of the category_id (remember that the composite index works from left to right).
If you are using multiple range conditions or you have a range condition with an `ORDER BY` column, the database will only use the first condition. In short, a range condition or `ORDER BY` column should be the last column in your composite index. Any subsequent columns are ignored. A software engineer at Mollie wrote a blog post about this here. There are ways to work around this issue if you really need to. For example, you could rewrite this:
CREATE INDEX my_slower_index ON pages (author_id, created_at);
# pages which are less than 3 days old, ordered by view_count
SELECT * FROM pages
WHERE author_id = 1
AND DATEDIFF(NOW(), created_at) < 3
ORDER BY view_count
LIMIT 10 OFFSET 10;
Into this (using a functional index):
CREATE INDEX my_fast_index ON pages (author_id, (TO_DAYS(created_at)), view_count);
# instead of using a range condition, we fetch rows for each day separately and use an equality operation for each day
(
SELECT * FROM pages WHERE author_id = 1 AND TO_DAYS(NOW) - TO_DAYS(created_at) = 0 ORDER BY view_count DESC LIMIT 10 OFFSET 10
) UNION ALL (
SELECT * FROM pages WHERE author_id = 1 AND TO_DAYS(NOW) - TO_DAYS(created_at) = 1 ORDER BY view_count DESC LIMIT 10 OFFSET 10
) UNION ALL (
SELECT * FROM pages WHERE author_id = 1 AND TO_DAYS(NOW) - TO_DAYS(created_at) = 2 ORDER BY view_count DESC LIMIT 10 OFFSET 10
) UNION ALL (
SELECT * FROM pages WHERE author_id = 1 AND TO_DAYS(NOW) - TO_DAYS(created_at) = 3 ORDER BY view_count DESC LIMIT 10 OFFSET 10
) WHERE DATEDIFF(NOW(), created_at) < 3 ORDER BY view_count DESC LIMIT 10 OFFSET 10;
# because the `UNION ALL` queries will fetch records from 4 full days,
# and we strictly want the last 72 hours, we add the original `WHERE` query at the end.
You’ll notice that it will retrieve 40 records instead of 10 and filter it back down to 10 in the end. You should first test the performance differences and set an upper limit to the maximum amount of `UNION ALL` statements that you are allowing to prevent denial of service attacks.
A good practise overall, is to order your columns from least specific to most specific. Assuming that’s possible while keeping all the rules above in place. This will not improve performance at all but it does improve index key compression. You can read more about index key compression in this blog post from Oracle.
Index types
By default, most databases use B-Tree indexes. B-Tree’s or Binary trees are a type of data structure that stores data in a sorted format. They facilitate fast searching, insertion and deletion operations. Because B-Tree’s are inherently sorted, they work well for `ORDER BY index` statements for example as well as equality or range conditions. I have created at an article where I code a binary tree from scratch which you can find here.
Another index type is the Hash Index. Hash indexes do not work for `ORDER BY` operations but they are exceptionally fast for equality operations.
There are also R-Tree indexes, Full text indexes but those are outside the scope of this guide.
If your index is not being used it could be due to the type of index you have chosen so check the type of index you are using and if it’s compatible with the query you are trying to optimize.
The Query Optimizer
Even the smallest change (in big databases) can make a 20ms query turn into 60s just because the query optimizer decides that due to this seemingly unrelated change, your beautiful composite index is no longer useful. The query optimizer very often may not use your indexes for reasons which sometimes seem incomprehensible. Databases work in mysterious ways but remember that more often than not, these seemingly “slow” execution plans are actually faster: When a table scan is actually faster than an index scan.
That being said, query optimizers sometimes do make bad decisions. The query optimizer will also always choose the least “risky” execution plan. You may have a very good index which is actually faster for 90% of your queries, but very slow for the other 10%. In this case the query optimizer will always go for the slower but more predictable execution plan. Another example where the query optimizer may come up with slower execution plans is when you are using a functional index in your composite index and you have a database with news articles for example. Only 1% of the news articles are requested most often and the other 99% are almost never accessed. Even if your composite index is actually faster for the 1% of news articles, which are accessed most often, the query optimizer will still want to go for consistency and use the slower execution plan. There are some exceptions and databases do use some statistics about your data and take those into account, but generally the decisions are made based on the database structure.
When you suspect that the query optimizer is making a bad decision, you have the option to suggest or force the query to use a specific index. For MySQL you can use “USE INDEX” or “FORCE INDEX”. For Postgres you can use “enable_indexscan” and “enable seqscan” (https://www.postgresql.org/docs/current/runtime-config-query.html). I suggest that you do use those statements in production, only for testing. If you actually find out that your queries run faster you should find out why the query optimizer is not using them. This means testing and tweaking with the database config variables, which is not something for everyone and you should probably consider hiring a consultant if you are unable to find the issue.