Distributed SQL, Primary Keys & Indexes
When you move to a distributed SQL database, you quickly learn that schema design requires a new way of thinking. This post is based on my experience with YugabyteDB specifically, but the broad concepts apply to other distributed SQL databases as well.
Introduction
Once you move away from a single-server database, schema design becomes more important because your data is now split across many machines. It's no longer enough to define your columns and data types; you have to consider how your data will be physically spread across the cluster, both for query performance and to keep the load evenly balanced. The two choices with the biggest impact on both are your primary keys and your indexes.
How data is stored and partitioned
In YugabyteDB a table's primary key is also the partitioning key. This means the primary key dictates how data is divided into tablets and which nodes those tablets live on. The database constantly monitors the size of these tablets and automatically splits them when they grow too large, keeping data balanced across the cluster.
Every table begins with a set number of tablets, typically equal to the number of nodes in the cluster.
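As a minimal sketch (assuming the YSQL API; the table and column names here are hypothetical), this is what a table whose primary key doubles as the partitioning key looks like, with a SPLIT INTO clause overriding the default initial tablet count:
-- The primary key is also the partitioning key: rows are placed into
-- tablets by hash(id). SPLIT INTO fixes the initial tablet count.
CREATE TABLE accounts (
    id   uuid NOT NULL,
    name text,
    PRIMARY KEY (id HASH)
) SPLIT INTO 3 TABLETS;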
Partitioning Strategies: Hash vs. Range
There are two strategies for partitioning your data:
Hash Partitioning
With hash partitioning, the database applies a hashing function to the primary
key and distributes rows based on the resulting hash value. This is the default
in YugabyteDB and is excellent for ensuring an even spread of data across all
nodes, which helps avoid “hotspots” where one node becomes overloaded with
writes. For example, a table of events with PRIMARY KEY (identifier HASH)
will have its data spread evenly across the tablets and thus the cluster.
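As a sketch of that events table (only the identifier column and its HASH key come from the example above; the other columns are assumed):
-- Hash-partitioned: rows are spread across tablets by hash(identifier)
CREATE TABLE events (
    identifier uuid NOT NULL,
    payload    jsonb,
    created_at timestamptz,
    PRIMARY KEY (identifier HASH)
);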
Here's an example of how a hash primary key would be distributed on a 3-node cluster:
Range Partitioning
Range partitioning groups data based on a continuous range of primary key values. For instance, a logs table with PRIMARY KEY (created_at ASC) would place records from January on one node, February on another, and so on. This is efficient for queries that scan a specific range, like fetching all logs from last week. However, it carries a significant risk: if your primary key is monotonically increasing (like a timestamp or a traditional sequence), all incoming writes will target a single node, creating a hotspot. There are a few tricks we can use to mitigate this and distribute the writes more evenly, which I will dig into in a follow-up post!
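As a sketch of that logs table and a typical range scan (only created_at and its ASC key come from the example; the message column is assumed):
-- Range-partitioned: rows are stored in created_at order,
-- so consecutive timestamps live in the same tablet
CREATE TABLE logs (
    created_at timestamptz NOT NULL,
    message    text,
    PRIMARY KEY (created_at ASC)
);

-- A range scan only touches the tablets covering the requested interval
SELECT * FROM logs
WHERE created_at BETWEEN now() - interval '7 days' AND now();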
As a range-partitioned tablet grows, it will eventually split. For example, a tablet holding data for the entire year of 2023 might split into two, one for the first six months and another for the second.
Example of how a range primary key would be distributed on a 3-node cluster:
Example of how a range primary key would be split:
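If you already know sensible boundaries up front, you can also pre-split a range-partitioned table yourself instead of waiting for automatic splits. A sketch using the SPLIT AT VALUES clause (the boundary values are just an illustration):
-- Pre-split into three tablets at explicit created_at boundaries
CREATE TABLE logs_presplit (
    created_at timestamptz NOT NULL,
    message    text,
    PRIMARY KEY (created_at ASC)
) SPLIT AT VALUES (('2023-07-01'), ('2024-01-01'));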
Indexes Are Also Distributed Tables
In Yugabyte, an index is stored as its own table with its own primary key and
partitioning strategy. When you run CREATE INDEX, you’re creating another
data structure that consumes storage and requires its own maintenance.
Index Creation Syntax in YugabyteDB
It’s important to understand how YugabyteDB interprets the CREATE INDEX
statement. Often, what you write is shorthand for a more explicit command that
defines the partitioning strategy.
For example, a simple index creation:
CREATE INDEX idx_tbl ON tbl (id);
is translated to use hash partitioning on the index’s primary key:
-- This is what YugabyteDB actually runs
CREATE INDEX idx_tbl ON tbl (id HASH);
Similarly, for a composite index, the first column is hashed by default, while subsequent columns are stored in ascending order:
CREATE INDEX idx_tbl2 ON tbl (id, created_at);
This becomes:
-- This is what YugabyteDB actually runs
CREATE INDEX idx_tbl2 ON tbl (id HASH, created_at ASC);
You can also create more complex indexes, for example hashing two columns together to help with data distribution:
CREATE INDEX idx_tbl_distrb_status ON tbl ((id, status) HASH);
Understanding this translation is key to predicting how the resulting index table will perform, both for writes and for reads.
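Because YSQL is PostgreSQL-compatible, you can verify how a definition was interpreted by inspecting the stored index definition, for example through the standard pg_indexes view (using the example names above):
-- Show the full definitions YugabyteDB stored for the indexes on tbl
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'tbl';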
Common Indexing Pitfalls
- Indexing Low-Cardinality Columns: Creating an index on a column with few unique values, like an enum for status, is a classic mistake. It leads to poor data distribution, as all rows with the same status will be clumped together in a few tablets, creating hotspots. This means both slower inserts (all writes are directed to the same tablet) and slower reads (queries filtering on that status hit only one tablet, negating the benefits of distribution).
- Using Hash Indexes for Range Queries: A HASH index on a timestamp column is often not useful. While it distributes the data well, it is inefficient for range queries (e.g., WHERE created_at BETWEEN ? AND ?) because the data is not stored in chronological order. See the sketch after this list.
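To make the second pitfall concrete, here is a sketch contrasting a hash index and a range index on a timestamp column (the logs table and column names follow the earlier examples; in practice you would create only one of the two):
-- Distributes writes well, but cannot serve range queries efficiently
CREATE INDEX idx_logs_created_hash ON logs (created_at HASH);

-- Stored in chronological order, so WHERE created_at BETWEEN ? AND ?
-- scans only the matching range
CREATE INDEX idx_logs_created_range ON logs (created_at ASC);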
The Need for Housekeeping
Because data is constantly being written, updated and removed, there is a need for continuous housekeeping. Processes like compaction (merging small data files into larger ones) are essential for maintaining read performance. While the database handles this automatically, a well-designed schema makes these operations far more efficient.
This maintenance applies not only to the table itself, but also to the tables backing its indexes. It's important to remember that every index you add to your table carries an additional write cost, as the index has to be kept up to date with every insert, update and delete operation.
Final Thoughts
Designing schemas for distributed SQL is about balancing trade-offs.
- Your primary key is your partitioning strategy. Choose it wisely.
- In Yugabyte, your indexes are tables. They add overhead, so use them carefully and design them for good distribution for both write and read operations.
- The goal is to leverage the distributed nature of the system. A schema that spreads the load evenly across all available nodes will lead to a healthier, more performant, and more scalable cluster.