How to leverage indexes for ILIKE & LIKE queries

You cannot always avoid LIKE or ILIKE queries when it comes to solving specific business requirements. However, it does not mean that those queries have to be slow. Lets look at how the keywords affect query usage, and what can be done to optimise the query plans and the impact each option has on the overall performance of the database.

`LIKE`, `ILIKE` and why you might need to use them

LIKE and ILIKE operators allow you to use wildcards in your queries to provide a level of fuzzy-matching should you require it. The wildcard in SQL syntax is represented by the percent symbol: %. A very common use-case is to find all names or emails matching a prefix, suffix or both. Generally this kind of query can be optimised such that we avoid the LIKE or ILIKE syntax which can have pretty impactful performance considerations.

The difference between LIKE and ILIKE is that the ILIKE operator will ignore the case of the strings which it compares, ie: it will match AbC with a record containing abc or aBc - this is particularly useful anytime you are dealing with user input - as case consistency is not a given.

Alternatives to `LIKE` and `ILIKE`

Firstly, if you need to check for only a prefix - you can actually trim both strings to the desired length and perform an equality check, which can use an existing index and yield really good performance. The same can be applied if you also need to ignore casing, you can create an index which transforms the column - resulting in slightly more work at INSERT time but better performance for reads.

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_lower_name ON
"users" (lower(name));

Allowing us to write a query like this:

SELECT *
FROM users
WHERE lower(name) = ?

The above query will be able to use the index defined above, and as long as we ensure that we pass a string which has been lower-cased from the calling point, we are ensuring great performance by using the index.

Another alternative is using the citext extension - which stands for case insensitive text, which can be used as a drop-in replacement for text columns. However, if you are expecing to have letters that have accents, those will not be normalized to the non-accented letter. There is also a small storage overhead to be paid when using citext - but it allows you to avoid the LOWER(...) calls on your queries.

Optimising `LIKE` and `ILIKE` queries

LIKE queries which check only for prefixes, eg: LIKE 'alex%' - can make use of existing BTree indexes to speed up their performance and do an IndexScan which can plug some gaps. ILIKE cannot use BTree indexes and as such cannot be optimised from. How the query plan will look like for hitting an IndexScan for a LIKE query:

EXPLAIN ANALYZE 
SELECT * 
FROM emails 
WHERE LOWER(recipient_email) LIKE 'alex%' 
LIMIT 5;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..129.79 rows=5 width=1370) (actual time=61.852..1837.272 rows=5 loops=1)
   ->  Index Scan using idx_lower_rec_email on emails  (cost=0.43..1472852.26 rows=56930 width=1370) (actual time=61.851..1837.266 rows=5 loops=1)
         Filter: (lower((recipient_email)::text) ~~ 'alex%'::text)
         Rows Removed by Filter: 2079
 Planning Time: 0.165 ms
 Execution Time: 1842.558 ms
(6 rows)

However, sometimes you cannot get away with just a single non-leading wildcard, and you require more power.

Trigrams & `pg_trgm` for multiple wildcards

Enter trigrams - via the pg_trgrm extension, which splits up the words into 3 letter segments, allowing much more granularity when it comes to querying. You will also need to use a GIN or GiST index to be able to leverage the trigrams.

[Trigrams for "Charlie"]
+-------+
| cha   |
| har   |
| arl   |
| rli   |
| lie   |
+-------+

[Trigram GIN Index]
+---------+------------+
| Trigram | Row IDs    |
+---------+------------+
| cha     | {3}        |
| har     | {3}        |
| arl     | {3}        |
| rli     | {3}        |
| lie     | {3}        |
+----------------------+

This will allow queries like SELECT * FROM users WHERE name LIKE "%arl%" to have very very fast results. However, it is obviously going to come at a pretty large cost in terms of storage, as now instead of storing a single reference to the name, you have split it up in threes:

This is a classic exponential situation which:

Exponential effect of trigrams

Storage considerations aside, you have to consider that the trigrams have to be computed at INSERT time and saved in the index. This will increase the latency of the data ingestion, however trading that off for the faster data lookups.

Recommendations

Try to use LOWER(...) = functions if possible to do prefix matching, rather than a LIKE "alex%". Create an index that computes the LOWER(...) and can be used to speed up these queries.
Use citext if you require case-insensitive, prefix-lookups and want to avoid the LOWER(...) changes.
Use pg_trgm for short strings which require multiple wildcards. Be mindful of the insert time overhead.

LIKE, ILIKE and why you might need to use them

Alternatives to LIKE and ILIKE

Optimising LIKE and ILIKE queries

Trigrams & pg_trgm for multiple wildcards

Recommendations

`LIKE`, `ILIKE` and why you might need to use them

Alternatives to `LIKE` and `ILIKE`

Optimising `LIKE` and `ILIKE` queries

Trigrams & `pg_trgm` for multiple wildcards