When you move to a distributed SQL database, you quickly learn that schema design requires a new way of thinking. This post refers specifically to my experience with YugabyteDB, but the broad concepts will apply to other distributed SQL databases as well.

Introduction

Once you move away from a single-server database, schema design becomes more important because your data is now split across many machines. It’s no longer enough to define your columns and data types; you have to consider how your data will be physically distributed, both for query performance and to keep load balanced across the cluster. The two choices with the biggest impact are your primary keys and your indexes.

How data is stored and partitioned

In YugabyteDB, a table’s primary key is also the partitioning key. This means the primary key dictates how data is divided into tablets and which nodes those tablets live on. The database constantly monitors the size of these tablets and automatically splits them when they grow too large, so the data stays balanced across the cluster.

Every table begins with a set number of tablets, typically equal to the number of nodes in the cluster.
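
The initial tablet count can also be set explicitly at creation time if the default doesn’t suit your workload. A minimal sketch, assuming the YSQL SPLIT INTO clause (the table and column names are just placeholders):

-- Start this table with 6 tablets instead of the default
CREATE TABLE my_table (
    id      UUID PRIMARY KEY,
    payload JSONB
) SPLIT INTO 6 TABLETS;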

Partitioning Strategies: Hash vs. Range

There are two strategies for partitioning your data:

Hash Partitioning

With hash partitioning, the database applies a hashing function to the primary key and distributes rows based on the resulting hash value. This is the default in YugabyteDB and is excellent for ensuring an even spread of data across all nodes, which helps avoid “hotspots” where one node becomes overloaded with writes. For example, a table of events with PRIMARY KEY (identifier HASH) will have its data spread evenly across the tablets and thus the cluster.
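
As a concrete sketch (the table and column names are made up), such a table could look like this:

-- Rows are distributed by the hash of "identifier", so writes land on
-- tablets spread across the whole cluster
CREATE TABLE events (
    identifier UUID,
    status     TEXT,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (identifier HASH)
);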

Here’s an example of how a hash primary key would be distributed on a 3-node cluster:

Hash Partitioning

Range Partitioning

Range partitioning groups data based on a continuous range of primary key values. For instance, a logs table with PRIMARY KEY (created_at ASC) would place records from January on one node, February on another, and so on. This is efficient for queries that scan a specific range, like fetching all logs from last week. However, it carries a significant risk: if your primary key is monotonically increasing (like a timestamp or a traditional sequence), all incoming writes will target a single node, creating a hotspot. There are a few tricks we can use to mitigate this and distribute the writes more evenly, which I will dig into in a follow-up post!
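
A minimal sketch of such a table and the kind of query it serves well (the names are illustrative):

-- Rows are stored in created_at order, so a time-range scan only touches
-- the tablets covering that range
CREATE TABLE logs (
    created_at TIMESTAMPTZ,
    message    TEXT,
    PRIMARY KEY (created_at ASC)
);

-- Efficient: reads one contiguous slice of the key space
SELECT *
FROM logs
WHERE created_at >= now() - interval '7 days';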

As a range-partitioned tablet grows, it will eventually split. For example, a tablet holding data for the entire year of 2023 might split into two, one for the first six months and another for the second.

Example of how a range primary key would be distributed on a 3-node cluster: Range Partitioning

Example of how a range primary key would be split: Range Partitioning Split
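
If you already know sensible boundaries, you can also pre-split a range-partitioned table at creation time instead of waiting for automatic splits. A sketch assuming the YSQL SPLIT AT VALUES clause (the boundary and names are illustrative):

-- Pre-split the 2023 data into two tablets at the mid-year boundary
CREATE TABLE logs_2023 (
    created_at TIMESTAMPTZ,
    message    TEXT,
    PRIMARY KEY (created_at ASC)
) SPLIT AT VALUES (('2023-07-01'));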

Indexes Are Also Distributed Tables

In Yugabyte, an index is stored as its own table with its own primary key and partitioning strategy. When you run CREATE INDEX, you’re creating another data structure that consumes storage and requires its own maintenance.

Index Creation Syntax in YugabyteDB

It’s important to understand how YugabyteDB interprets the CREATE INDEX statement. Often, what you write is shorthand for a more explicit command that defines the partitioning strategy.

For example, a simple index creation:

CREATE INDEX idx_tbl ON tbl (id);

is translated to use hash partitioning on the index’s primary key:

-- This is what YugabyteDB actually runs
CREATE INDEX idx_tbl ON tbl (id HASH);

Similarly, for a composite index, the first column is hashed by default, while subsequent columns are stored in ascending order:

CREATE INDEX idx_tbl2 ON tbl (id, created_at);

This becomes:

-- This is what YugabyteDB actually runs
CREATE INDEX idx_tbl2 ON tbl (id HASH, created_at ASC);
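
As a rough illustration of why that ordering matters (the values here are made up), this index serves queries that pin the hash column and then scan a time range within it:

-- The hash on "id" routes the lookup to a single tablet; within that id,
-- rows are kept in created_at order, so the time filter is a tight range scan
SELECT *
FROM tbl
WHERE id = 42
  AND created_at >= now() - interval '7 days';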

You can also create more complex indexes, for example, hashing two columns together to help with data distribution:

CREATE INDEX idx_tbl_distrb_status ON tbl ((id, status) HASH);

Understanding this translation is key to predicting how the index table that gets created will perform, both for writes and for reads.
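
If you want to confirm what the database actually stored, you can query the catalog. A minimal sketch using the standard PostgreSQL pg_indexes view, which YSQL exposes (whether the HASH/ASC annotations show up in indexdef is worth verifying on your own cluster):

-- List the stored definition of every index on "tbl"
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'tbl';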

Common Indexing Pitfalls

  1. Indexing Low-Cardinality Columns: Creating an index on a column with few unique values, like an enum for status, is a classic mistake. It leads to poor data distribution, as all rows with the same status are clumped together in a few tablets, creating hotspots. The result is both slower inserts, since all writes for a given status are directed to the same tablet, and slower reads, since queries filtering on that status hit only one tablet, negating the benefits of distribution.

  2. Using Hash Indexes for Range Queries: A HASH index on a timestamp column is often not useful. While it distributes the data well, it is inefficient for range queries (e.g., WHERE created_at BETWEEN ? AND ?) because the data is not stored in chronological order. Both pitfalls are illustrated in the sketch after this list.
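
A minimal sketch of both pitfalls and one way to address them, reusing the two-column hash trick from the previous section (all table, column, and index names here are illustrative):

-- Pitfall 1: "status" has few distinct values, so this (implicitly HASH)
-- index clumps each status into a handful of tablets
CREATE INDEX idx_events_status ON events (status);

-- Better distribution: hash a high-cardinality column together with status
-- (queries must then filter on both columns to use this index)
CREATE INDEX idx_events_id_status ON events ((identifier, status) HASH);

-- Pitfall 2: a HASH index on a timestamp cannot serve range scans;
-- a range (ASC) index keeps rows in chronological order instead
CREATE INDEX idx_events_created_at ON events (created_at ASC);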

The Need for Housekeeping

Because data is constantly being written, updated, and removed, there is a need for continuous housekeeping. Processes like compaction (merging small data files into larger ones) are essential for maintaining read performance. While the database handles this automatically, a well-designed schema makes these operations far more efficient.

Not only do we have to maintain our table, but also the tables backing our indexes. It’s important to remember that every index you add carries an additional write cost, as the index has to be kept up to date with every insert, update, and delete operation.

Final Thoughts

Designing schemas for distributed SQL is about balancing trade-offs.

  • Your primary key is your partitioning strategy. Choose it wisely.
  • In Yugabyte, your indexes are tables. They add overhead, so use them carefully and design them for good distribution - both for write and read operations.
  • The goal is to leverage the distributed nature of the system. A schema that spreads the load evenly across all available nodes will lead to a healthier, more performant, and more scalable cluster.