Database Replication in System Design: How It Works and Why Every Engineer Should Care

system design database replication

Database Replication: Does it mean that the contents of a database are copied to different databases? The answer is YES. Okay, so why do we need to do this at all? Why waste thousands of terabytes of storage for storing the same data? How does it impact the performance of a system? The answer to these questions is what this article covers. Keep reading.

What is Database Replication?

Database Replication is the process of storing copies of the same data on multiple database servers that are connected together on a network. Each copy is also called as a replica. It is an important aspect of System Architecture that creates significant impact on application processes and users who are using them.

Does my system need database replication?

In short, if your system is not handling more than 5,000 concurrent users (or ~500 database requests per second) and you’re fine with up to 5-10 minutes of downtime, you likely won’t need database replication yet, even in a worst-case scenario. However, this can vary significantly based on factors like whether your system is read-heavy or write-heavy, whether you’re serving multi-region users with frequent DB queries, or whether you need to run heavy analytics on live data. We’ll explore these scenarios later in the article.

Before we dive into when database replication becomes crucial, let’s understand what are Read-heavy and Write-heavy systems.

Read-heavy Systems

As the name suggests, systems which are constantly fetching/retrieving data from the database more than they are writing/inserting to the database are called Read-heavy systems. There is no defined formula for determining this. However, as a rule of thumb we consider a system with 80% read requests, as a decent Read-heavy system.

Read queries like,

SELECT id, name, price, stock_count
FROM products
WHERE category = 'electronics'
ORDER BY price ASC
LIMIT 20;

are frequently called by a user of an e-commerce application (The Buyer). Whereas, Write queries like

INSERT INTO products (id, name, price, stock_count, category)
VALUES (1043, 'Wireless Headphones', 2999.99, 150, 'electronics');

are comparatively much lesser. These queries are also called by a user of an e-commerce application (The Seller).

Hence, we can say that in general, an e-commerce application is fairly Read-heavy.

Write-heavy Systems

Write-heavy systems are just the opposite of Read-heavy systems- The systems which constantly write or update the database. Once again, there is no defined formula.

Here is an example of a Write-heavy system,

Consider a payment application, Every time you make a transaction, multiple writes happen, one of them could be-

INSERT INTO transactions (account_id, type, amount, balance_after, created_at)
VALUES (3021, 'DEBIT', 500.00, 4500.00, NOW());

Now, there are millions of transactions happening every few minutes. So, millions of writes happening too. But, how often do you check your transaction history? To see when did you spend that 450 bucks? That is far less frequent than the number of transactions on average. Therefore, queries like:

SELECT id, type, amount, balance_after, created_at
FROM transactions
WHERE account_id = 3021
ORDER BY created_at DESC
LIMIT 10;

are less frequent. We can say this is a Write-heavy system in general.

Can a System’s Read/Write Nature Change Over Time?

Not necessarily. A system’s read/write nature can shift significantly as the product evolves and user behavior changes. Take a social media platform as an example- in its early days, when content is being created and users are on-boarding rapidly, it may lean write-heavy. But as the user base matures and growth stabilizes, the same platform becomes overwhelmingly read-heavy as users shift from creating to consuming content. This is why it’s important to continuously monitor your query patterns rather than making a one-time classification and designing your infrastructure around it permanently. What holds true at 10,000 users may not hold true at 10 million.

When is Replication essential?

Replication is one of the most effective solutions for read-heavy systems at scale. When a system needs to serve millions of read operations across hundreds of thousands of concurrent users, distributing that load across multiple read replicas becomes the most practical and scalable approach.

Does that mean replication is not meant for write-heavy systems? The answer is not so straightforward. Write-heavy systems can still benefit from replication in terms of availability and throughput, but data consistency becomes a significant challenge. This is a consequence of something known as Replication Lag.

Some common use cases of database replication:

  1. Low-Latency: The database is replicated across different regions and the users are served from the database nearest to them.
  2. High Availability: Even if one replica fails, the system will continue serving requests from another replica (with minimal or no downtime).
  3. Increased Read-Throughput: When there are multiple databases available for serving a request, a far greater number of concurrent requests can be handled easily by the system.

Data that rarely changes vs. Data that changes frequently

This distinction is crucial because it lies at the root of most challenges we face in database replication.

Let’s understand both categories with a real-world example.

Consider an X (formerly Twitter) timeline. We can broadly divide the data into two sets: User Profile and User Tweets.

  • User Profile: When a user registers on X, they provide basic information such as name, email, phone number, profile picture, and other preferences.
  • User Tweets: This includes everything the user posts on their timeline (daily, weekly, or monthly), tweets from people they follow, retweets, comments and reactions on their posts, and various analytical data.

Can you guess which of these represents rarely changing data and which represents frequently changing data?

You’re correct if you identified User Profile as a case of rarely changing data. Once created, a user profile is rarely updated except perhaps for the profile picture, which might change at most a few times a month in the worst case.

User Tweets, on the other hand, represent highly dynamic data. Updates come not only from the user themselves but also from their followers and viewers. New posts, retweets, comments, likes, and other interactions can occur multiple times per minute. In the worst case, a user’s timeline may need to be updated every few seconds.

Do you see the clear difference between the two?

When data rarely changes, replication is relatively straightforward. Simple strategies like Single Leader replication (which we’ll discuss next) work very well.

The real challenges arise with frequently changing data. Data that is constantly being written to the database must be propagated to every replica in the system with the lowest possible latency. When updates happen every few seconds or minutes, any significant replication lag makes it increasingly difficult to maintain data consistency, preserve the correct sequence of events, and avoid corruption or stale reads.

Data replication strategies are primarily about minimizing this replication gap (lag) while achieving a highly performant, available, and consistent system.

Leaders and Followers

leader and followers example for database replication in system design

As you may already know by now, an important aspect of database replications is to maintain the same data on all the replicas with the lowest latency possible. This means, every write happening to the database has to be processed by each replica. We use something called as leader-based replication to achieve this.

One of the replicas in the cluster is designated as the leader or master. Whenever a client wants to write something, it sends a request to the leader. The leader writes it to its local storage and then propagates the changes to other replicas on the network, also known as followers, slaves, or read replicas. The transfer happens via a replication log or change stream. Each follower reads the replication log sent by the leader and applies the changes to its own database in the exact same order they were processed on the leader. When a client wants to read from the database, it can send a request to either the leader or a follower.

Writes are only allowed on the leader, while followers are read-only from the client’s perspective. This is the simplest form of database replication, commonly known as master-slave replication, where the slave (follower) follows the master (leader). It is a widely used feature in relational databases like MySQL and PostgreSQL.

Consider an e-commerce platform during a major sale event. Millions of users are simultaneously browsing products, checking prices, and reading reviews, while thousands of orders are being placed every second. A single database handling all of this would quickly become a bottleneck.

With master-slave replication in place, the master database handles all write operations such as new orders, inventory updates, and payment records, while multiple follower replicas serve the millions of read requests. A user browsing products in Mumbai is served by a nearby replica, while a user in Delhi is served by one closer to them, keeping response times fast.

The master continuously propagates every write to all followers in the background, keeping them in sync. From the user’s perspective, everything feels seamless and fast.

Master-slave replication is just the beginning. While it solves a lot of problems, it comes with its own set of trade-offs and failure scenarios- what happens when the master goes down? What happens when a follower falls too far behind? And what if you need writes to scale too, not just reads? There is a lot more to database replication than meets the eye, and we are only getting started.

Leave a Reply

Discover more from GS | Tech Journal

Subscribe now to keep reading and get access to the full archive.

Continue reading