System Design Fundamentals: Database Sharding

Rabi Siddique
Mar 3, 2023


Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. It involves breaking up one’s data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, each of which can hold multiple logical shards.
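To make the two levels concrete, here is a minimal sketch of logical shards mapped onto physical shards. The shard count, the node names, and the use of a user ID as the routing value are all assumptions made for illustration.

```python
# Hypothetical layout: 8 logical shards spread over 3 physical nodes.
NUM_LOGICAL_SHARDS = 8

# Each logical shard lives on exactly one physical node;
# one physical node holds several logical shards.
LOGICAL_TO_PHYSICAL = {
    0: "db-node-a", 1: "db-node-a", 2: "db-node-a",
    3: "db-node-b", 4: "db-node-b", 5: "db-node-b",
    6: "db-node-c", 7: "db-node-c",
}


def logical_shard_for(user_id: int) -> int:
    """Pick the logical shard for a row (assumed rule: user ID modulo shard count)."""
    return user_id % NUM_LOGICAL_SHARDS


def physical_node_for(user_id: int) -> str:
    """Resolve the physical node that currently hosts the row's logical shard."""
    return LOGICAL_TO_PHYSICAL[logical_shard_for(user_id)]


print(physical_node_for(11))  # db-node-b (11 % 8 == 3)
```

One reason to keep the two levels separate is that a logical shard can be moved to another node by editing the mapping, without changing which logical shard a row belongs to.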

This approach helps to avoid performance bottlenecks caused by a single, overburdened machine and allows for more cost-effective and feasible scaling. As the volume of data and traffic increases, it becomes increasingly necessary to scale horizontally by adding more machines rather than vertically by upgrading a single, large server.

Why Sharding?

Let’s take the example of Facebook, which shows how a small, niche platform can rapidly evolve into a global phenomenon. In its early days, the site was primarily used by Harvard students as an online yearbook, and the storage requirements and query load on the database could be handled by a single server. However, as the platform grew in popularity, the volume of data and traffic increased dramatically. By 2008, Facebook received 14 billion monthly page views, each requiring multiple backend queries to fulfil.

It’s not just the query load that became a concern, but also the storage capacity. As Facebook continued to grow and evolve, the company inevitably reached a point where the physical capacity of its database server was surpassed. When a service has a high volume of rapidly changing data (lots of writes) or is frequently queried by many users in a way that causes the working set to exceed the available memory (lots of reads), the primary bottleneck is often I/O.

At this juncture, the company faced a critical decision: whether to scale vertically by investing in a more powerful, costlier server with increased RAM, CPU horsepower, disk I/O, and storage capacity, or to scale horizontally by spreading its data across multiple, more affordable servers. Both options have their pros and cons, and the decision ultimately comes down to which approach is the most cost-effective and efficient way to scale the platform to meet the growing demands of its users. In such a situation, sharding the database can be a highly cost-effective and scalable solution, with the trade-off being the increased complexity of the system.

Sharding Architectures

Once you have made the decision to shard your database, the next step is to figure out how to do it. This includes understanding the different types of sharding and how they are used to distribute data. In this section, we will discuss some of the most commonly used sharding types and their methods for data distribution.

1-Vertical Partitioning

One easy way to segment your application’s database is to move tables related to specific features to their own servers. For instance, you could place user profile information on one server, friend lists on another, and a third for user-generated content like photos.

This approach is relatively simple to implement and has minimal impact on the overall application. However, if the site continues to grow, it may become necessary to further shard the feature-specific databases across multiple servers.
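In code, this kind of feature-based split often boils down to a small routing table. The sketch below assumes three hypothetical hostnames, one per feature; the names are invented for illustration.

```python
# Hypothetical hostnames; each feature's tables live on their own server.
FEATURE_TO_SERVER = {
    "profiles": "profiles-db.internal",
    "friends": "friends-db.internal",
    "photos": "photos-db.internal",
}


def server_for_feature(feature: str) -> str:
    """Return the database server that owns a feature's tables."""
    try:
        return FEATURE_TO_SERVER[feature]
    except KeyError:
        raise ValueError(f"no server configured for feature {feature!r}") from None


print(server_for_feature("photos"))  # photos-db.internal
```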

2-Range Based Partitioning

Range-based sharding is a technique that involves dividing data based on ranges of a specific value. For example, consider a retailer’s catalogue database that stores information about all products. One way to shard this data would be to create multiple shards and divide the products’ information based on their price range.

This approach is simple to reason about and keeps neighbouring values on the same shard, so a query for a range of prices usually has to touch only one or a few shards. However, a key challenge with this approach is that if the value used for partitioning is not chosen carefully, it can result in uneven distribution of data across the servers. For example, if a specific price range contains a higher number of products, it could lead to one shard becoming overburdened while others are underutilized.
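As a rough sketch of the price-range example, the routing can be a binary search over the range boundaries. The boundary values and the sample prices below are made up; they are chosen to show how a crowded range skews the distribution.

```python
import bisect
from collections import Counter

# Hypothetical range boundaries in dollars:
#   shard 0: price < 25
#   shard 1: 25 <= price < 100
#   shard 2: 100 <= price < 500
#   shard 3: price >= 500
PRICE_BOUNDARIES = [25, 100, 500]


def shard_for_price(price: float) -> int:
    """Return the index of the shard whose price range contains `price`."""
    return bisect.bisect_right(PRICE_BOUNDARIES, price)


# Invented catalogue in which most products are cheap.
prices = [5, 9, 12, 19, 20, 24, 30, 150, 900]
rows_per_shard = Counter(shard_for_price(p) for p in prices)
print(sorted(rows_per_shard.items()))  # [(0, 6), (1, 1), (2, 1), (3, 1)]
```

Six of the nine products land on shard 0, which is exactly the uneven distribution described above.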

3-Key Based Partitioning

Key-based sharding, also known as hash-based sharding, works by using a unique value, such as a user ID, as input to a hash function. The hash function then calculates an output, which is used to determine which server the data should be stored on. One example of this is taking the user ID modulo the number of available servers to determine which server stores the user’s information.

This approach aims to ensure a uniform distribution of data across servers. However, one of the main challenges with this method is that it makes adding servers difficult: changing the number of servers changes the output of the modulo operation, so most existing keys map to a different server and their data has to be moved, which is hard to do without downtime.
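The modulo scheme, and the cost of changing the server count, can be sketched in a few lines. This example assumes integer user IDs and applies the modulo directly to them; a real system would more likely hash the key first. The helper names are hypothetical.

```python
from typing import Sequence


def server_for_user(user_id: int, num_servers: int) -> int:
    """Assumed rule: the user ID modulo the number of servers picks the server."""
    return user_id % num_servers


def fraction_moved(user_ids: Sequence[int], old_count: int, new_count: int) -> float:
    """Fraction of users whose server changes when the server count changes."""
    moved = sum(
        1
        for uid in user_ids
        if server_for_user(uid, old_count) != server_for_user(uid, new_count)
    )
    return moved / len(user_ids)


print(server_for_user(1234, 10))  # 4
# Growing from 10 to 15 servers relocates about two thirds of these users.
print(round(fraction_moved(range(30_000), 10, 15), 3))  # 0.667
```

Only a third of the users stay where they were, which is why resizing a modulo-based cluster is so disruptive.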

4-Directory Based Partitioning

Directory-based partitioning involves having a lookup table that uses a shard key to keep track of which shard holds which data. This shard key is typically a unique identifier for each piece of data, such as a user ID or a product ID. The lookup table is used to determine which shard a specific piece of data belongs to, allowing the system to efficiently retrieve and update the data as needed.

This approach allows for more flexibility in terms of adding or removing servers, or changing the partitioning scheme, without affecting the rest of the application. For example, if there are currently ten servers and the hash function is a modulo operation, five more servers can be added to the pool without incurring downtime. A script copies data from the original ten servers to the five new ones according to a new hash function that uses the new server count of fifteen. Once the data has been copied over, the lookup table is switched to the new hash function.
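Here is a minimal in-memory sketch of the lookup-table idea, including a copy-then-switch move of a single user. The `ShardDirectory` class, the `migrate_user` helper, and the dictionaries standing in for database servers are all hypothetical; a production directory would itself need to be a durable, highly available store.

```python
from typing import Dict


class ShardDirectory:
    """In-memory stand-in for the lookup table: shard key -> shard name."""

    def __init__(self) -> None:
        self._shard_of: Dict[int, str] = {}

    def assign(self, user_id: int, shard: str) -> None:
        self._shard_of[user_id] = shard

    def lookup(self, user_id: int) -> str:
        return self._shard_of[user_id]


def migrate_user(directory: ShardDirectory, stores: Dict[str, dict],
                 user_id: int, target_shard: str) -> None:
    """Move one user's row: copy first, then switch the directory entry."""
    source_shard = directory.lookup(user_id)
    if source_shard == target_shard:
        return
    # 1. Copy the row while reads are still served from the source shard.
    stores[target_shard][user_id] = stores[source_shard][user_id]
    # 2. Switch the directory entry so new requests go to the target shard.
    directory.assign(user_id, target_shard)
    # 3. Remove the stale copy from the source shard.
    del stores[source_shard][user_id]


directory = ShardDirectory()
stores = {"shard-1": {42: {"name": "Ada"}}, "shard-2": {}}
directory.assign(42, "shard-1")
migrate_user(directory, stores, 42, "shard-2")
print(directory.lookup(42))  # shard-2
```

Because callers always ask the directory where a key lives, the move is invisible to the rest of the application. The cost is an extra lookup on every request, and the directory becomes a component that must not fail.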

Advantages of Sharding

1-High Availability

If one box goes down, the others still operate. In a sharded database, an outage in one shard makes only a certain part of the application unavailable to users, while the other shards continue operating without interruption. If the database is not sharded, on the other hand, an outage has the potential to make the entire application unavailable.

2-Faster Queries

Sharding a database improves query response times by dividing the data into smaller groups. When a query is submitted on a non-sharded database, it may have to search through every row in the table before finding the desired result set. In contrast, a query routed to a single shard only has to scan that shard’s rows, which are a fraction of the full table, resulting in faster query response times.

3-Concurrent Writes

By distributing the write operations across multiple shards, the system is able to handle more concurrent writes, resulting in a more efficient and reliable database.

4-More Work

Sharding enables horizontal scaling, which is also known as scaling out. With a parallel backend, the system can handle more work simultaneously, allowing it to handle high user loads. The parallel paths through the system also allow for faster write operations, as the data is distributed across multiple shards. Load-balancing web servers can also be used to access the shards over different network paths, which are processed by separate CPUs and use separate caches of RAM and disk I/O paths. This reduces bottlenecks and improves the overall performance and reliability of the system.

Disadvantages

1-Joins

Sharding a database can present challenges when it comes to performing joins on multiple tables. Prior to sharding, these queries would execute on a single server, but once the database is distributed across multiple servers, it becomes more difficult and complex to perform joins that span across database shards. This is because the data has to be compiled from multiple servers, which can add significant overhead and negatively impact performance.

However, there are ways to mitigate this issue. Utilizing caching and fast networks can help to speed up the process and ensure that page load times remain fast. Additionally, denormalizing the database by merging related data into a single table can also be a viable solution, as it allows for the execution of previously complex join queries on a single table.
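To see where the overhead comes from, here is a rough sketch of a join performed in application code when the two tables live on different shards. The `users` and `photos` rows, their field names, and the lists standing in for shards are assumptions for illustration.

```python
from typing import Dict, List


def join_photos_with_owners(user_shards: List[List[dict]],
                            photo_shards: List[List[dict]]) -> List[dict]:
    """Application-side inner join of photos to their owners across shards."""
    # Gather: pull users from every user shard into one in-memory index.
    users_by_id: Dict[int, dict] = {}
    for shard in user_shards:
        for user in shard:
            users_by_id[user["id"]] = user

    # Probe: visit every photo shard and match each photo to its owner.
    joined = []
    for shard in photo_shards:
        for photo in shard:
            owner = users_by_id.get(photo["user_id"])
            if owner is not None:
                joined.append({"photo_id": photo["id"], "owner": owner["name"]})
    return joined


user_shards = [[{"id": 1, "name": "Ada"}], [{"id": 2, "name": "Linus"}]]
photo_shards = [[{"id": 10, "user_id": 2}], [{"id": 11, "user_id": 1}]]
print(join_photos_with_owners(user_shards, photo_shards))
# [{'photo_id': 10, 'owner': 'Linus'}, {'photo_id': 11, 'owner': 'Ada'}]
```

Every shard has to be contacted and the rows shipped to the application, which is the extra work a single-server join avoids.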

2-Referential integrity

Enforcing data integrity constraints, such as foreign keys, can be a challenge when using a sharded database. Most relational database management systems don’t support foreign keys across different servers, making it difficult to enforce referential integrity. This means that applications that rely on this feature may need to implement it in their own code and run regular SQL jobs to keep the data consistent. This adds extra complexity and maintenance to the application.
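A periodic consistency job of that kind might look something like the sketch below, which reports photos whose owner no longer exists on any user shard. The row shapes and the lists standing in for shards are hypothetical, and a real job would query the shards rather than hold them in memory.

```python
from typing import List


def find_orphaned_photos(user_shards: List[List[dict]],
                         photo_shards: List[List[dict]]) -> List[int]:
    """Return IDs of photos whose user_id matches no user on any shard."""
    # Collect every known user ID across all user shards.
    known_user_ids = {user["id"] for shard in user_shards for user in shard}
    # A photo is orphaned if its foreign key points at a missing user.
    return sorted(
        photo["id"]
        for shard in photo_shards
        for photo in shard
        if photo["user_id"] not in known_user_ids
    )


user_shards = [[{"id": 1}], [{"id": 2}]]
photo_shards = [[{"id": 10, "user_id": 2}], [{"id": 12, "user_id": 3}]]
print(find_orphaned_photos(user_shards, photo_shards))  # [12]
```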

3-Keeping the Data Balanced

When using a sharded database, it’s important to keep an eye on the distribution of data across the shards. One shard may grow larger or receive more traffic than the others, becoming what is known as a hotspot. This can undo the benefits of sharding and cause performance issues. To prevent this, the database may need to be re-sharded to ensure that data is evenly distributed across all shards. It’s best to plan for this possibility from the beginning, as re-sharding can cause downtime if not done properly.
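A simple way to keep an eye on balance is to compare each shard’s size against the average. The sketch below assumes row counts are already collected per shard; the 1.5x threshold and the shard names are arbitrary choices for illustration.

```python
from typing import Dict, List


def find_hotspots(row_counts: Dict[str, int], threshold: float = 1.5) -> List[str]:
    """Return shards holding more than `threshold` times the average row count."""
    average = sum(row_counts.values()) / len(row_counts)
    return sorted(
        shard for shard, rows in row_counts.items() if rows > threshold * average
    )


# Invented numbers: the average is about 203 rows, so the cutoff is about 305.
print(find_hotspots({"shard-a": 100, "shard-b": 110, "shard-c": 400}))  # ['shard-c']
```

The same check could be run on query counts instead of row counts to catch shards that are small but heavily read.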

4-Limited Database Support

Not all database engines have built-in support for sharding, which means that it may require custom implementation. This can make finding documentation and troubleshooting solutions more difficult.

Thank you for reading. I hope this post is helpful to you. If you have any further questions, don’t hesitate to reach out. I’m always happy to help.

