UUIDs sit at the heart of most modern distributed systems, thanks to their very low risk of collisions and their reduced attack surface compared to traditional integer IDs. They are used to identify customers, events, and much more. They are one of the core details of a system, and one that is very often overlooked until you look closely because you are storing hundreds of millions of them, or maybe billions.

This post explores something deceptively simple: how UUIDs are stored across various systems. Simple assumptions about common storage can end up surprisingly wasteful, and being more deliberate about how we serialize UUIDs can save a lot, both in bytes and in real money. A seemingly trivial choice can translate to terabytes of wasted storage and significant financial cost.

UUIDs in Modern Systems

UUIDs are easy to generate, globally unique, and require no coordination between services. They are the obvious choice for any distributed architecture.

Many systems use UUID v7, a time-ordered UUID version. This makes it an easy replacement for sequential numeric IDs: new values sort after older ones, so they fit nicely into traditional B-tree database indexes with little to no performance loss.

A UUID v7 is composed of:

  • a 48-bit timestamp
  • 4 version bits
  • 2 variant bits
  • 74 bits of randomness (the remainder of the 128 bits)

[Figure: UUID v7 layout, showing the 48-bit timestamp (6 bytes), 4 version bits, 2 variant bits, and the random bits]

This structure makes collisions effectively impossible and ensures that identifiers cluster by time.
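To make the layout concrete, here is a minimal sketch that assembles a UUID v7 by hand, following the field widths in RFC 9562 (48-bit timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits). The helper name make_uuid7 is made up for this post and is not a library API; real code should prefer a maintained UUID v7 implementation.

```python
import os
import time
import uuid
from typing import Optional


def make_uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Assemble a UUID v7 from its fields (illustrative helper, not a library API)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000  # current Unix time in milliseconds

    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 of them used
    rand_a = rand & 0xFFF                         # 12 random bits
    rand_b = (rand >> 12) & ((1 << 62) - 1)       # 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80     # 48-bit timestamp in the top bits
    value |= 0x7 << 76                            # 4 version bits (version 7)
    value |= rand_a << 64
    value |= 0b10 << 62                           # 2 variant bits
    value |= rand_b
    return uuid.UUID(int=value)


# Fixed timestamp so the example is reproducible in its time-ordered part.
example = make_uuid7(1_700_000_000_000)
print(example, example.version)
```

Because the timestamp occupies the most significant bits, two IDs created in different milliseconds always sort in creation order, regardless of the random part.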

Underneath all of that, a UUID is simply 16 raw bytes, usually written as 32 hexadecimal characters. When displayed as a string, the four separator dashes bring it to 36 characters. Readable for humans, yes. Efficient for machines, no.
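The three sizes are easy to check with Python's standard uuid module. The UUID value below is an arbitrary example, not taken from any real system.

```python
import uuid

u = uuid.UUID("018f4d2e-7c3a-7b1e-9a4f-3c2d1e0f9a8b")  # arbitrary example value

text = str(u)   # canonical text form, with dashes
raw = u.bytes   # the 16 underlying bytes

text_len = len(text)   # 36 characters
hex_len = len(u.hex)   # 32 hexadecimal characters
raw_len = len(raw)     # 16 bytes

# The binary form loses nothing: it converts back to the same UUID.
round_trip = uuid.UUID(bytes=raw)
print(text_len, hex_len, raw_len, round_trip == u)
```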

Protobuf’s Role

Protobuf remains a common choice for modern service ecosystems: it is compact, strongly typed, and designed for high-throughput communication.

Protobuf serializes data using:

  • field numbers
  • wire types
  • encoded values
  • optional length prefixes

This results in efficient, structured binary messages. The downside is that Protobuf has no fixed-length byte type. Strings use the “length-delimited” wire type (as do bytes fields), which means a UUID stored as a string incurs both:

  • a text encoding cost: the 36-character form is stored as 36 bytes of UTF-8
  • a length-prefix header

When dealing with a field that contains a UUID, this overhead starts to matter. Unfortunately, UUIDs do not have a native Protobuf type, and as a result they most often end up being stored as strings rather than binary byte sequences.

How UUIDs Should Be Stored

Let’s break this down.

A UUID can be stored as:

  • 16 bytes - raw binary; no length prefix - requires a custom fixed-length type
  • 18 bytes - bytes type: 16 bytes + 2 bytes of overhead (field tag and length prefix)
  • 38 bytes - string: 36 chars (incl. 4 dashes) + 2 bytes of overhead (field tag and length prefix)

The most common but least efficient approach is to store UUIDs as strings. That takes more than double the space of the bytes form (38 vs 18 bytes, a 111% increase), with no benefit to the system. Using a byte array or a dedicated fixed-length UUID type eliminates the text encoding overhead entirely.
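The 18 and 38 byte figures can be reproduced by encoding a length-delimited Protobuf field by hand. The sketch below does not use the protobuf library; it only follows the documented wire format (a varint tag, a varint length, then the payload). The field number 1 is an assumption: field numbers 1 to 15 fit in a one-byte tag, while larger ones need more.

```python
import uuid


def encode_varint(n: int) -> bytes:
    """Base-128 varint, as used by the Protobuf wire format."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited field: tag, length, payload."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


u = uuid.UUID("018f4d2e-7c3a-7b1e-9a4f-3c2d1e0f9a8b")  # arbitrary example value

as_bytes_field = encode_len_field(1, u.bytes)                  # bytes field
as_string_field = encode_len_field(1, str(u).encode("utf-8"))  # string field

print(len(as_bytes_field))   # 18
print(len(as_string_field))  # 38
```

Both encodings pay the same 2 bytes of framing; the whole difference comes from the 36-byte text payload versus the 16-byte binary one.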

This can look like a micro-optimisation, but at scale these bytes add up, not just in storage but also in transfer, in backups, and in the additional memory required to process large numbers of these messages concurrently.
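As a rough back-of-the-envelope sketch, the waste can be estimated from the 20-byte difference between the string and bytes encodings. The volumes below (5 billion events, 3 UUID fields per event, 3 stored copies) are hypothetical numbers chosen for illustration, not measurements from any system.

```python
BYTES_FIELD = 18   # UUID as a Protobuf bytes field
STRING_FIELD = 38  # UUID as a Protobuf string field


def wasted_bytes(messages: int, uuids_per_message: int, copies: int = 1) -> int:
    """Extra bytes spent by string-encoding every UUID (illustrative helper)."""
    per_uuid = STRING_FIELD - BYTES_FIELD  # 20 bytes per UUID
    return per_uuid * messages * uuids_per_message * copies


# Hypothetical workload: 5 billion events, 3 UUIDs each, kept in 3 copies.
total = wasted_bytes(5_000_000_000, 3, copies=3)
print(total / 1e12)  # 0.9 (terabytes)
```

The figure scales linearly, so plugging in your own message counts and replication factor gives a quick first estimate before measuring real data.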

Why This Matters in Modern Architectures

In high-volume event systems, data growth rarely happens slowly. It accelerates. A single table can grow into terabytes quickly, and storing data inefficiently compounds the problem.

String-encoded UUIDs more than double the footprint of one of the most frequently used fields. They also increase:

  • network bandwidth
  • Kafka topic sizes
  • I/O on storage engines
  • VACUUM costs
  • CPU usage during encoding/decoding

All for no benefit.

Optimizing UUID storage won’t fix everything, but it is one of the cleanest, lowest-risk improvements available.

Final Thoughts

Performance gains in distributed systems often come from tightening foundational pieces: the parts executed millions or billions of times.
Storing UUIDs as strings is familiar and convenient, but it is also unnecessarily expensive.

Small decisions made at scale end up having an incredibly large impact, so in the world of distributed systems it's paramount to pay attention to the details!