Table of contents
TogglePreface
As the amount of data increases and the read and write traffic of the database increases, the common bottlenecks of traditional databases will begin to appear. Data Replication is an effective method to solve bottlenecks and improve the performance, scalability and availability of the database by replicating data on multiple nodes. This article will introduce the three main models of data replication: master-slave replication (Single-leader replication), multi-leader replication (Multi-leader replication) and point-to-point replication (Leaderless replication), and analyze their advantages and disadvantages.
Readers who are not familiar with basic database knowledge can refer to this article Introduction to database basics – system design 08. Readers who are not familiar with scalability and usability can also refer to this article Non-functional features of software design – System Design 03.
Data copy (Data Replication)
Data replication is a technology that keeps copies of data on multiple nodes, mainly to achieve the following purposes:
- Improved performance: By spreading data across multiple nodes, you can improve database read and write performance.
- Improved scalability: As the amount of data grows, the database can be expanded by adding more nodes.
- Improved availability: If one node fails, other nodes can still provide data services.
Data replication is used in many applications, such as:
- Online Transaction Processing (OLTP): In OLTP systems, data replication can be used to improve the performance and availability of the database to meet the needs of high concurrency (High Concurrency) reading and writing.
- Disaster recovery: In a disaster without warning, data replication can be used to copy data to an off-site backup to prevent data loss.
clone(Replication)
Replication refers to retaining multiple data backups on each node, and preferably geographically different nodes, such as backups of databases in different countries. These data are replicated to achieve availability, scalability and performance.
Readers can think about a scenario. If today, in the case of a single database, when the database is damaged, the entire system will be affected. On the contrary, if there are multiple copies of data copied in different databases, even if one database is damaged, it will not affect the entire system.
However, the above are all advantages of data replication, but data replication also brings complexity. The following is a simple example of complexity that I gave:
- How to keep multiple data consistent with each other?
- Should you use synchronous or asynchronous replication?
- How to handle concurrent writes?
This article will go into the details of data replication, starting with an explanation of synchronous and asynchronous replication.
Synchronous versus asynchronous replication
If replication is to be implemented, it will be divided into the following two types: synchronous replication and asynchronous replication.
What are the main differences between the two?
In synchronous replication, the primary node (Leader Node) waits for a confirmation request from the secondary node (Follower Node) to update data. When the master node receives confirmation from all secondary nodes, it will report a successful confirmation to the client (Client Side). In asynchronous replication, the primary node does not wait for confirmation from the secondary node, but directly updates the data and reports success to the client.
Synchronous Pros & Cons Synchronous Pros & Cons
The advantage of synchronous replication is that all secondary nodes (Follower Node) are fully synchronized with the primary node (Leader Node). However, advantages often come with disadvantages. If the secondary node (Follower Node) fails to confirm due to failure or other reasons, the master node will not be able to reply to the user until a successful confirmation is received. Therefore, such shortcomings will lead to high latency (High Latency) in the response from the master node to the client.
Asynchronous Pros & Cons
The advantage of asynchronous replication is that even if all secondary nodes (Follower Node) are down, the primary node (Leader Node) can continue to operate. The converse disadvantage is that if the primary node fails, writes that are not replicated to the secondary node will be lost.
Here is an illustration from the database bible Designing Data-Intensive Applications, in which Leader is the primary node and Follower is the auxiliary node.
Data Replication Models
Next, I will compare the advantages and disadvantages of the models copied from the following databases:
Single Leader / Primary-Secondary Replication
In master-slave replication, one node is designated as the master node. The primary node handles writes and sends all writes to the secondary nodes and keeps them in sync.
Master-slave replication is very suitable when the read load is large, for example: Youtube has thousands of people who need to watch videos at any time, but compared to watching videos, the amount of uploaded videos is not very large, so master-slave replication The model is well suited for situations like this. We can also use this model to expand the database as the number of users increases and improve the scalability of the system. Of course, there may also be bottlenecks if data is replicated to many secondary nodes. Finally, if our write load is heavy, master-slave replication is not appropriate.
Another advantage of master-slave replication is that it is read-resilient, for example: in the event of a failure of the primary node, the secondary node can still handle read requests, so it is a good choice for systems with heavy read volumes. A good solution to the database model.
In addition, if we use asynchronous replication, it will bring about the problem of inconsistent data replication. You can imagine a situation where if the primary node fails and cannot deliver updated data to the secondary node, data read from different databases may see inconsistent data. Therefore, if the primary node fails, any data replication updates may be lost if the secondary node has not yet received and responded to them.
Primary-secondary replication methods
Master-slave replication can be implemented in many different ways:
Statement-based replication
Statement-based replication will be referred to as (SBR) hereafter, because I really can't find a translation for Traditional Chinese. SBR is the method used in the MySQL database. In this implementation, the master node can execute INSERT, UPDATE, DELETE, etc. SQL syntax, and then write these syntaxes to the log file. Next, the log files are sent to the secondary node for execution.
Although SBR can solve the problem of copying data from the primary node to the secondary node, it actually has shortcomings. For example, using a nondeterministic function may cause writes on the primary and secondary nodes to be different.
What is a nondeterministic function? For example: NOW() gets the current time, RAND() gets a random number, they are all uncertain functions, and different results will be obtained every time the function is executed.
write-ahead log (Write-Ahead Log)
Write-Ahead Log will be referred to as (WAL) below. WAL is the data replication technology used in PostgreSQL and Oracle. In WAL, when a transaction occurs, it is initially recorded in a transactional log file, which is then written to the database. Next, the logging operation is performed on the primary database and then transferred to the secondary node for execution.
Unlike SBR, WAL processes transaction logs instead of SQL syntax. This has the advantage of ensuring consistency when encountering nondeterministic functions. In addition, WAL writes data directly to disk to facilitate recovery in the event of a failure.
For example: When performing operations such as UPDATE in PostgreSQL, it will first write the transaction log file and disk before processing the database. The transaction log contains the transaction ID, data type, and affected tabla, and then the changes are copied to the auxiliary node.
Of course, there are advantages and disadvantages. The disadvantage of WAL is that it is tightly coupled with the internal structure of the database. If the primary node and the secondary node need to upgrade the software, it will become complicated.
logical copy(Logical Replication)
logical copy(Logical Replication) Some people also call it row-based (Row-Based) replication. This method is used in various relational databases, including PostgreSQL and MySQL. In this method, operations and modifications made to the database are recorded and then copied to the secondary node.
For example, when an operation such as INSERT or UPDATE is performed, the entire affected Row is retrieved on the primary node, which contains all column values for the specified Row. The fetched changes are then executed on the secondary node to ensure that the data is consistent with the data on the primary node.
Multi-leader replication
After introducing master-slave replication earlier, you should also find that master-slave replication using asynchronous replication has a fatal shortcoming. When there is only one master node, all write operations must go through it, which greatly hinders its performance. If the primary node fails, the secondary node may not update the database.
Multi-leader replication is another model that can be used to solve the problem of such high-concurrency systems. Multi-leader replication as the name suggests, there are multiple primary nodes handling writes and sending them to other primary and secondary nodes for replication.
This replication model is very advantageous for systems that can continue to operate offline. We can also use this model when maintaining and operating multiple data centers, so that each data center has a master node.
conflict
Multi-leader replication provides better performance and scalability than single-leader replication, but this model also has a drawback. Because all master nodes may process write requests at the same time, they may modify the same data. This situation is called a conflict.
For example: Suppose two users edit the same data field at the same time. In this case, we have no way of knowing the correctness of the final data without additional processing of the situation. Here are some common ways to handle conflicts:
avoid confict
A very simple idea, if we are going to deal with conflict, we should prevent it from happening in the first place. If the system can ensure that all writes to a specific record go through the same master node, there will be no conflict issues. We can direct client requests to the same data center, so that two users in different data centers will not modify the same document.
However, if the user moves to a different location and is now near a different data center, a conflict will still occur. If this happens, we need to reprocess the requested traffic. In this case, the conflict avoidance method fails and results in simultaneous writes.
Last write wins
All nodes use the node's own local time to assign a timestamp (Timestamp) to each update. When a conflict occurs, the latest timestamp is selected for update. However, this method also has blind spots, because in a decentralized system, the time between nodes may be different, and it will be difficult to achieve time synchronization. Also because the times may be different, time deviation can lead to data loss.
Multi-leader replication topology (Topology)
The following are topologies for multi-leader replication, including: ring topology, star topology, all-to-all topology, etc. The most common is the all-to-all topology. In star and circle topologies, similar disadvantages exist, and if one of the nodes fails, it may affect the entire system. This is why all-to-all is the most commonly used topology.
Peer-to-peer/leaderless replication
In master-slave replication, if a problem occurs on the master node, it will cause a database failure. In addition to multi-leader replication, there is another model, Peer-to-peer Replication, which can also help the system and database achieve read scalability, but it cannot provide write scalability. In other words, scalability for high traffic reads can be provided.
The point-to-point replication model solves these problems through non-primary nodes. All nodes have equal weight and can accept read and write requests. Cassandra uses this database replication model.
Like master-slave replication, this type of replication can also create inconsistencies because when multiple nodes accept write requests, concurrent writes can result.
Conclusion
Database replication is an effective method to solve database bottlenecks and can improve the performance, scalability and availability of the database. However, different replication models have different advantages and disadvantages, so when choosing a replication model, you need to weigh the trade-offs based on your specific needs. Readers who like system design content can refer to other related articles.
related articles
Introduction to database basics – system design 08
Load Balancer Explained – System Design 07
What is DNS? Introduction to Domain Name System – System Design 06
Introduction to System Design Components Building Block – System Design 05
Back-of-the-envelope Back-of-the-envelope Calculation – System Design 04
Non-functional features of software design – System Design 03
Application of abstraction in system design – System Design 02
Introduction to Modern System Design – System Design 01
Quote
Designing Data-Intensive Applications