Data Partitioning What is data partitioning? – System Design 10

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech

Preface

This article How does Data Replication optimize the database? - System Design 09 The importance of database replication is explained. We can optimize the database through database replication techniques. Data Partitioning is another very important concept for optimizing the database. This article will also explain database partitioning in depth and explain different sharding methods (Database Sharding). The Chinese name of this article, including data partition and database sharding, currently refers to the translation of Azure official documents, but it is recommended that readers use the original text to refer to these terms.

What is Data Partitioning?

Data Partitioning is the process of dividing data in a database into smaller, more manageable sub-blocks. These sub-blocks are called partitions. Data segmentation can be based on a variety of factors, such as time, customer ID, or product category.

Why do we need to partition data?

For any system that does not expand, the amount of data will continue to grow, and the read and write traffic to the database will become larger and larger, putting scalability pressure on traditional databases. Therefore, we can use data segmentation to make We can use multiple nodes, each managing a portion of the entire profile.

Advantages of data partitioning

  • Improve performance: Make it easier to query data by breaking it into smaller parts. This is especially helpful for systems that handle large amounts of data.
  • Improve scalability: As the amount of data increases, more partitions can be easily added to store the new data, allowing the database to continue to expand without major changes.
  • Improve usability: If a partition fails, only the data in that partition will be affected. Other partitions can continue to be used, which helps reduce overall system downtime.
  • Simplify management: Data partitioning can simplify database management. For example: each partition can be backed up and restored individually, and queries can be optimized for each partition.

Disadvantages of data partitioning

  • Add complexity: Data partitioning increases the complexity of the database. Segmentation strategies need to be developed and maintained, and some changes may need to be made to the system to accommodate the segmented data.
  • increase cost: Data partitioning may require additional hardware and software, and the costs of managing split databases may be higher.

Sharding

In order to distribute traffic and load among multiple nodes, we need to partition the data through Partitioning or Sharding. We will divide large data collections into smaller data blocks and store them on different nodes.

However, since it is to be divided, it is necessary to balance the partitions to ensure that each partition stores approximately the same amount of data. If the partitions are unbalanced and most database queries fall into a few partitions, the overloaded partitions will not be able to bear the load, which will cause a system bottleneck. In addition, we will also call these partition nodes with excessive load as hotspots.

Generally speaking, we will use the following two methods for sharding: vertical sharding and horizontal sharding

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab00
Digital Ocean - What is Sharding?

Vertical Sharding

Vertical Sharding is to place different tables in different databases. This method may run on different servers. Here I first provide illustrations from other official documents so that users can better understand the principles.

Usually, vertical sharding (Vertical Sharding) is used to improve retrieval speed, for example: a table composed of Blobs (Blob-related content will be introduced in detail in the future). In this case, columns with large text data or blobs will be split into different tables.

Vertical sharding is suitable for manual partitioning, mainly because this sharding method is relatively complex, and the database administrator needs to determine how to partition the data through these data correlations. In contrast, horizontal sharding (Horizontal Sharding), which will be introduced next, is suitable for automation even under dynamic conditions.

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab01

Horizontal Sharding

In addition to the vertical sharding introduced in the former, there is another sharding method called horizontal sharding (Horizontal Sharding). If some tables in the database are too large and affect read/write performance, horizontal sharding is a good sharding method to solve this problem.

Horizontal sharding or partitioning is to divide a table into multiple tables by splitting data, and it is also easy to understand. If there are 500 rows of data, it can be directly divided into two databases, and each database has its own Two hundred and fifty lines.

I have also attached a picture to let readers better understand the concept of horizontal sharding. Usually we have the following horizontal sharding methods, namely: Key Based Sharding and Range Based Sharding. , Hash Based Sharding, these Chinese translations are based on my reference to Tencent’s technical documents. Of course, under normal circumstances we will use English as their names.

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab02

Key Based Sharding

Key Based Sharding is a method of dividing data into multiple partitions based on key fields in the data table. For example: Customer data can be split into multiple partitions based on customer ID, so that each partition contains data for a specific customer.

advantage

The advantage of using the Key Based Sharding method is that the query is very efficient and easy to implement, and you can know exactly where to look for a specific range of keys (which node, which shard).

shortcoming

If the key selection is not correct, some nodes may have to store more data due to uneven traffic distribution. In short, the probability of uneven sharding is higher.

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab03

Range Based Sharding

Range Based Sharding divides data into multiple partitions based on a certain range field in the data table. For example: product data can be divided into multiple partitions based on price, so that each partition contains data about a specific product.

advantage

The advantage of using Range Based Sharding is that the balance of sharding is better, and we can also decide which database to search based on the range alone.

shortcoming

Usually when querying data with a smaller range, this data sharding method is less efficient.

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab04

Hash based sharding

Hash based sharding is to hash the data table based on the value of a certain field (Hash Function), and then allocate the data to different partitions based on the hash result.

advantage

The advantage of Hash based sharding is that it has very high balance and relatively high query efficiency.

shortcoming

There may be hot spots. Also review what was mentioned earlier. If the partitions are unbalanced and most database queries fall into a few partitions, the overloaded partitions will not be willing to bear the load, which will cause system bottlenecks. In addition, we will also call these partition nodes with excessive load as hotspots.

system-design-system design 10-data copy-data-partitioning-introductoin-to-database-fundamentals-hogantech-hoganblab05

rebalance split

When the database query load itself is unbalanced, it will cause poor database performance. At this time, we need to delve into the reasons to have a chance to correct the problem. The reasons for the imbalance may include:

  1. The data is not evenly distributed.
  2. Too much load on a single partition.
  3. As query traffic increases, we need to add more nodes to bear the system traffic.

Here are some ways you can rebalance your splits:

Fixed number of partitions

In this method, when we set up the database at the beginning, we create a fixed number of partitions. Normally, more partitions than the number of nodes are created and assigned to nodes. Therefore, when a new node joins the system, it can take some partitions from existing nodes until the partitions are evenly divided.

Of course, each method has advantages and disadvantages. The disadvantage of this approach is that when the size of each partition increases with the total amount of data in the cluster, it will also grow because all partitions only contain a small portion of the total data. In addition, if a partition is very small, it will cause the cost of partitioning to be too large, because each partition costs some money. If the partitions are very large, the cost of rebalancing nodes and recovering from node failures will be high. Choosing the correct number of partitions is very important.

Dynamic partitioning

In dynamic partitioning (Dynamic partitioning), when the size of the partition reaches the set value, the original partition will be evenly divided into two partitions. Different partitions are assigned a node, and eventually, the load will be evenly distributed. The number of partitions will be dynamically balanced with the total data volume. This is the advantage of dynamic partitioning.

However, there is a drawback to dynamic partitioning. It is difficult to dynamically rebalance when the database is being read and written simultaneously. Dynamic rebalancing during reading and writing is very complicated, because the data read and written move on different nodes. If dynamic balancing is performed at this time, it will cause conflicts or even delays. In order to ensure data consistency and availability, complexity will be introduced, which will affect system performance and reliability. MongoDB is one of the well-known databases that uses this kind of dynamic partitioning.

If you don’t know about consistency, availability, and reliability, you can refer to this article. Non-functional features of software design – System Design 03

Request routing

Various data partitions were introduced earlier, but before the data is transferred into the database, we need to ask a question: when the client makes a request, how does the system know which node to connect to? After rebalancing, the distribution of partitions to nodes changes. If we want to read a specific database, how do we know which IP address we need to connect to in order to read it? This problem is also called Service Discovery. Here is the solution to this problem:

  1. Allows clients to request any node in the network. If the node does not contain the requested data, the request is forwarded to a node that does.
  2. Establish a routing layer to transmit all requests to the routing layer, and then the routing layer decides which node to connect to satisfy the request.
  3. The client already has information about the partitions and which partition is connected to which node. Therefore, they can directly contact the node containing the information they need.

ZooKeeper

If we want to track the modifications of certain clusters in a distributed system, we can use some tools. One of the well-known tools, ZooKeeper, can do the above. ZooKeeper is a decentralized open source coordination service provided by Apache for decentralized systems. This tool can also track all mappings in the network, and each node is connected to ZooKeeper to obtain information. Whenever a partition changes, or a node is added or deleted, ZooKeeper will update and notify the routing layer of the change. ZooKeeper is used by companies such as Yelp, RackSpace, Yahoo!, Reddit, Facebook, and Twitter.

Conclusion

Data partitioning is a technique that divides the data in a database into smaller, more manageable subsets, which can improve the performance, scalability, availability, and manageability of the database.

When the database query load is unbalanced, the partitions need to be rebalanced to improve database performance. The process of rebalancing a split typically involves the steps of identifying unbalanced splits, developing a rebalancing plan, and executing the rebalancing plan.

Partitioning has become standard for all distributed systems today and is something most software engineers need to know. Also because the amount of data in the system continues to increase, it makes sense to partition the data. It can speed up writing and reading, further improving the availability, scalability and performance of the system.

related articles

How does Data Replication optimize the database? - System Design 09

Introduction to database basics – system design 08

Load Balancer Explained – System Design 07

What is DNS? Introduction to Domain Name System – System Design 06

Introduction to System Design Components Building Block – System Design 05

Back-of-the-envelope Back-of-the-envelope Calculation – System Design 04

Non-functional features of software design – System Design 03

Application of abstraction in system design – System Design 02

Introduction to Modern System Design – System Design 01

Quote

MongoDB – Hashed Sharding

DigitalOcean – Understanding Database Sharding

Tencent Cloud – Detailed explanation of Database Sharding

Leave a Comment

Your email address will not be published. Required fields are marked *

en_USEnglish