6.2 Internals of Distributed Database Systems

Developing distributed database systems is not a simple task, and it requires concepts from both the database community and the distributed systems community. In our brief overview, we examine additional building blocks that are necessary when designing databases with distribution support. We also list some replication and partitioning strategies for distributed database systems.

Building Blocks

Hellerstein and Stonebraker provide a very comprehensible introduction to traditional database internals [Hel07]. Such internals include indexes and data structures, I/O handling components, transaction handling, concurrency control, query processing and client communication interfaces. As these concepts are common for database systems, we focus on some of the necessary building blocks for distributed database systems instead, including transaction management, concurrency control, data versioning, interfaces, and scalable data partitioning and parallel data processing.

Distributed Transaction Management

In a non-distributed scenario, handling concurrent transactions is generally easier, because everything happens locally on a single machine. Distributed transactions handle operations with transactional behavior between multiple nodes. Thus, a transaction in a distributed system must either be applied to all participating nodes, or to none at all. Distributed transactions are more difficult to implement due to the risk of network errors, (partial) node failures and non-locality. A basic component for distributed transactions is a coordinating service that manages and coordinates transactions between all participants, based on a transaction protocol.

Popular protocols are the two-phase commit protocol (2PC) [Lam79a] and the three-phase commit protocol (3PC) [Ske83]. 2PC separates a voting phase from a completion phase, but it is blocking and not fault-tolerant. 3PC addresses these drawbacks of 2PC with additional coordination steps. However, 3PC cannot cope with network partitions.
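The two phases of 2PC can be illustrated with a minimal sketch. The `Participant` class and its `vote()`/`commit()`/`abort()` methods are illustrative assumptions, not a real database API; a real coordinator would also have to handle timeouts and node failures, which is exactly where 2PC blocks.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "initial"

    def vote(self):
        # Phase 1 (voting): each participant votes to commit or abort.
        return "yes" if self.will_commit else "no"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes from all participants.
    votes = [p.vote() for p in participants]
    # Phase 2 (completion): commit only if *every* vote was "yes".
    if all(v == "yes" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A single dissenting (or silently failed) participant forces a global abort, which is the all-or-nothing property described above.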

Alternatively, quorum-based voting protocols can be used for committing transactions in distributed setups [Ske82]. The underlying idea is to mark a transaction as executed once a majority of nodes have executed it, so either the abort quorum or the commit quorum must be obtained for termination. The Paxos [Lam98] protocol family provides consensus algorithms that can be used for quorum-based voting.
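The termination rule can be sketched as follows. The quorum sizes are one illustrative choice: the commit quorum is a strict majority and the abort quorum is sized so that both quorums overlap, which guarantees that a transaction can never reach both outcomes.

```python
def quorum_decision(votes, total_nodes):
    # Commit quorum: a strict majority of all nodes.
    commit_quorum = total_nodes // 2 + 1
    # Abort quorum: chosen so that commit_quorum + abort_quorum > total_nodes,
    # i.e. the two quorums always intersect and cannot both be reached.
    abort_quorum = total_nodes - commit_quorum + 1

    commits = sum(1 for v in votes if v == "commit")
    aborts = sum(1 for v in votes if v == "abort")
    if commits >= commit_quorum:
        return "commit"
    if aborts >= abort_quorum:
        return "abort"
    return "undecided"   # not enough votes yet; keep waiting
```

Unlike 2PC, the decision does not require every node to respond, which is what makes quorum-based termination more tolerant of individual node failures.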

Concurrency Control and Data Versioning

The inherent parallelism poses a problem for distributed database systems, especially when concurrent write operations are allowed on different nodes. In particular, relaxed consistency guarantees and the acceptance of network partitions require concepts for data versioning and for controlling concurrent operations on data.

Distributed concurrency control mechanisms can be generally divided into pessimistic and optimistic algorithms and, orthogonally, into locking-based, timestamp-based or hybrid algorithms. Pessimistic algorithms prevent conflicts through strict coordination of concurrent transactions, while optimistic algorithms do not expect regular conflicts and delay conflict checking to the end of the transaction life cycle. Locking-based algorithms use explicit locks for operations in order to prevent conflicts. Popular locking-based algorithms include traditional two-phase locking [Esw76] and its variants. Quorum-based voting can also be applied, using read and write quorums for the corresponding operations [Gif79].
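The core invariant of (strict) two-phase locking can be sketched with a minimal lock manager: a transaction acquires locks during its growing phase and, in the strict variant, releases all of them at once on commit. The `LockManager` class below is an illustrative single-node reduction of the idea, not a distributed implementation.

```python
class LockManager:
    def __init__(self):
        self.locks = {}  # data item -> id of the transaction holding the lock

    def acquire(self, txn, item):
        # Growing phase: a lock request fails if another transaction
        # already holds the lock on this item (a conflict).
        holder = self.locks.get(item)
        if holder is not None and holder != txn:
            return False
        self.locks[item] = txn
        return True

    def release_all(self, txn):
        # Shrinking phase: strict 2PL releases all locks together at
        # commit/abort time, which rules out cascading aborts.
        self.locks = {i: t for i, t in self.locks.items() if t != txn}
```

In a distributed setting the same protocol applies, but lock tables live on multiple nodes and deadlock detection becomes a distributed problem of its own.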

Distributed database systems also take advantage of timestamp-based concurrency control mechanisms such as MVCC [Ber81,Ree78]. We have already encountered MVCC as an underlying implementation for STM systems in chapter 5. Timestamp-based mechanisms use logical clocks to identify either data changes or transactions over time. The logical ordering makes it possible to reason about the sequence of operations and to protect against conflicts. Several distributed database systems use vector clocks [Lam78] for versioning data entries in the face of network partitions [DeC07]. The version history then allows conflicts to be detected and merges to be facilitated.
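Vector clock versioning can be sketched in a few lines. Each replica increments its own entry when it writes; comparing two clocks then reveals whether one version causally precedes the other or whether the versions are concurrent and need a merge. The dictionary representation and node names are illustrative choices.

```python
def increment(clock, node):
    # Return a new clock with this node's entry incremented (a write event).
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def compare(a, b):
    # Component-wise comparison of two vector clocks.
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a causally precedes b; b supersedes a
    if b_le_a:
        return "after"
    return "concurrent"       # conflicting versions; a merge is required
```

The "concurrent" case is exactly the situation that arises when two replicas accept writes on the same entry during a network partition.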

Data Interfaces and APIs

Traditional database systems are sometimes entirely embedded into an application, or they use custom protocols for access. Software components such as JDBC or ODBC abstract from concrete database protocols and supply generic APIs. For distributed database systems, interfaces for remote access are obligatory. Hence, established distributed technologies for communication and data serialization are often integrated from the start. These technologies facilitate application integration and testing.

Database calls and queries are often dispatched using RPC invocations or HTTP requests. The RPC approach uses framework bindings such as Thrift [Aga07]. Data is interchanged using serialization technologies such as Thrift's own serialization or Google's Protocol Buffers. HTTP-based APIs often emulate some of the REST principles. For serialization, formats like JSON, BSON (a binary JSON representation) or XML are then used. While low-level RPC calls generally provide slightly better performance due to less overhead, HTTP-based APIs bring HTTP concepts such as caching for free.
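As a small illustration of the JSON-based style, the snippet below serializes a document as it might be sent in the body of an HTTP request to a document store. The document structure and field names are made-up examples; only the round-trip through the serialization format is shown.

```python
import json

# A hypothetical document, as an application might store it.
doc = {"_id": "user:42", "name": "Alice", "karma": 17}

# Serialize for the body of an HTTP PUT/POST request ...
payload = json.dumps(doc)

# ... and deserialize a JSON response body on the way back.
restored = json.loads(payload)
```

The same round-trip applies to binary formats such as BSON or Protocol Buffers, trading human readability for a more compact wire representation.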

Scalable Data Partitioning

Figure 6.2: Consistent hashing maps nodes and data items into the same ring for partitioning. The left illustration shows a set of data items (gray) mapped to three nodes. On the right side, the additional Node D has joined the system and is also mapped into the ring. As a consequence, only a small sector of the ring is affected from repartitioning. Node D takes over two data items that have formerly been assigned to Node A.

Allocating large amounts of data to a number of nodes becomes more complex when data scalability is required and the number of available nodes changes. Scaling out means supplying additional nodes, often at runtime. Scaling back to fewer nodes is also of interest when the amount of data decreases. Appropriate strategies are therefore required for partitioning and allocating data when scaling in and out.

Traditional setups with a fixed number of hosts often allocate data by applying a hash function to a data item (e.g. its key), then using the result modulo the number of nodes in order to calculate the node responsible for the item. The strategy is straightforward, but it fails when the number of nodes changes: recalculating and redistributing all items due to changed partitioning keys is then necessary, which is not reasonable in practice. One way to approach this problem is consistent hashing [Kar97]. The fundamental idea of consistent hashing is to hash data items and nodes into a common ring using the same hash function. The algorithm determines that each data item has to be stored by the next clockwise adjacent node in the ring, as shown in figure 6.2. When new nodes are added to the ring, or nodes leave the ring, only a small sector of the ring is affected and only the data items in this sector must be reallocated. In essence, consistent hashing is a partitioning strategy that works with a varying number of nodes and provides a consistent mapping that prevents unnecessary reallocation of data when the number of nodes changes.
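The ring from figure 6.2 can be sketched as follows. The node names and the hash function (MD5 interpreted as an integer ring position) are illustrative choices; production systems typically also place several virtual nodes per physical node to even out the sector sizes, which is omitted here.

```python
import bisect
import hashlib

def ring_position(value):
    # Map a node name or item key to a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=()):
        # Sorted list of (position, node) pairs representing the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self.ring, (ring_position(node), node))

    def node_for(self, key):
        # The item is stored by the next node clockwise from its position;
        # the modulo wraps around past the end of the ring.
        pos = ring_position(key)
        index = bisect.bisect(self.ring, (pos, "")) % len(self.ring)
        return self.ring[index][1]
```

Adding a node only remaps the keys in the sector directly counter-clockwise of it; every key that moves at all moves to the new node, mirroring Node D in the figure.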

Parallel Data Processing

Processing data entries in a distributed database system is necessary for several operations. For instance, generating indexes requires the execution of the same operations on all data entries and machines. In several non-relational database systems, it is the developer's task to implement index generation, hence appropriate programming models are required for such embarrassingly parallel tasks.

Figure 6.3: A schematic illustration of the phases of a MapReduce computation. The map() function is applied to all items and produces intermediate results. Grouped by their key, these intermediate results get merged using the reduce() function.

A popular approach is the MapReduce model [Dea08], which is inspired by functional programming languages. It separates parallel processing of possibly large data sets into two steps, as shown in figure 6.3. The map function takes data entries and emits intermediate key-value pairs. Next, all intermediate pairs are grouped by key. The reduce function is then applied to all intermediate pairs with the same key, yielding simple values as a result. Distribution, coordination and execution are managed by the framework or the database system, so the developer only has to provide the map and reduce functions. This principle easily supports tasks such as counting or sorting on large data sets. MapReduce is also used for building indexes, either using the sorted intermediate key-value pairs, or using the sorted reduced results.
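The data flow of figure 6.3 can be sketched with the canonical word-count example. A real framework would execute map and reduce tasks in parallel across nodes; here the three phases run sequentially in one process to show the model itself.

```python
from collections import defaultdict

def map_fn(document):
    # map(): emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce(): merge all intermediate values that share the same key.
    return word, sum(counts)

def map_reduce(documents):
    # Map phase followed by grouping by key (the "shuffle").
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase over the sorted intermediate keys.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
```

The grouping step between the two user-supplied functions is exactly the part the framework distributes: intermediate pairs are routed so that all values for one key end up at the same reducer.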

Replication and Partitioning Strategies

Replication and partitioning are necessary concepts for distributed database systems. Replication is responsible for data distribution between nodes. On the basis of replication, availability can be increased and fail-over mechanisms can be deployed for fault-tolerance. Replication can also help to scale read operations. Partitioning deals with the challenge of scaling out large amounts of data.


There are various forms of replication for distributed systems, but not all are applicable for database systems that target availability and highly concurrent access. Replication mechanisms can be either synchronous or asynchronous, active or passive, and they have different propagation characteristics [Dol05,Moi11].

Synchronous replication provides atomic semantics for operations that are backed by all running replicas. This requires distributed transaction protocols for coordination. Asynchronous replication allows a single node to acknowledge an operation independently; other nodes will eventually receive the update. As opposed to synchronous replication, however, immediate updates of all nodes are not guaranteed. Asynchronous replication works either periodically or aperiodically.

In active replication, all nodes receive and process a request (e.g. write operation), and coordinate their response. Passive replication favors a designated primary that processes the request and updates the other nodes afterwards. In case of a fail-over, a secondary takes over the service.

The propagation aspects determine how read and write operations from clients are handled, and how updates disseminate to the replicas. In a master-slave setup, writes are processed by a single master. As web applications tend to issue more read requests than write requests, many setups take advantage of this inherent property and provide a single master server and multiple slaves [Sch08]. The master server solely handles write requests, and all read requests are load-balanced to one of the slaves. Obviously, this setup only helps to scale read operations, not write operations. A multi-master setup allows multiple nodes to accept writes, which indeed increases write scalability. However, it requires conflict management strategies, as simultaneous writes on the same data may lead to inconsistencies. Quorum-based systems [Gif79] allow fine-tuning of how many replicas must be accessed for read operations, how many replicas store the data, and how many replicas must acknowledge update operations. These three parameters directly affect the strength of consistency and fault tolerance. In figure 6.4, we can see two exemplary replication strategies in use for web applications.
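The interplay of the three quorum parameters, commonly written as N (replicas storing the data), R (read quorum) and W (write quorum), can be sketched as a small helper. The property names are illustrative; the key rule is that R + W > N forces every read quorum to overlap every write quorum, so a read always sees the latest acknowledged write.

```python
def quorum_properties(n, r, w):
    # n: number of replicas, r: read quorum size, w: write quorum size.
    return {
        # Overlapping quorums (R + W > N) yield read-your-writes behavior.
        "strongly_consistent": r + w > n,
        # Small R makes reads cheap, small W makes writes cheap.
        "read_optimized": r < w,
        "write_optimized": w < r,
    }
```

For example, N=3, R=1, W=3 gives fast, strongly consistent reads but writes that block on every replica, while N=3, R=1, W=1 is fast in both directions but only eventually consistent.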

Figure 6.4: Two replication setups for the backend of a web application. On the left side, we can see a master-slave setup that is often used with MySQL. The master handles writes and asynchronously updates the slaves. Read requests are load-balanced to the slaves. On the right side, a common replication setup for CouchDB is depicted. Multiple masters handle all requests and perform asynchronous merge replication that might require conflict resolution.

Common replication strategies include snapshot replication, transactional replication, merge replication and statement-based replication [Moi11]. Snapshot replication is based on periodic copying of all data entries. These snapshots are then forwarded and applied on replicas. Transactional replication employs a transactional behavior for changes using distributed transactions between servers. Merge replication allows for partition tolerance and offline nodes, since it synchronizes data when nodes eventually become available. Conflict resolution strategies are necessary to handle conflicting changes. Statement-based replication forwards database queries to replicas. Read queries can be forwarded to a single instance, while queries including write operations are forwarded to all instances.


There are different partitioning approaches for scaling out large amounts of data: functional, vertical and horizontal [All08,Sch08]. Functional partitioning separates distinct parts of the application data that are not dependent on each other. For instance, customer data, product data and inventory data are not just stored in different tables, but can also be stored on different instances of database nodes. Vertical partitioning targets the efficient allocation of a complex data row to tables. Normalization and denormalization are typical mechanisms for vertical partitioning. For instance, "row splitting" separates a single table into multiple tables, thereby separating columns that change often from columns that are rather static. Such a split can also improve performance. Horizontal partitioning, also known as sharding, addresses the problem of large numbers of rows in a table. Instead of splitting existing rows across columns, an existing table is split into several structurally equivalent tables and the rows are partitioned among them. While partitioning improves performance in many cases and makes large amounts of data more manageable, it also has some drawbacks. Providing a consistent logical view on partitioned data sets often requires multiple join operations or even merge operations on the application level. As a result, finding the right partitions, both vertical and horizontal, is often not trivial and requires specific knowledge of how the data is accessed and used by the application.

The design of shards heavily influences the performance of finding and retrieving data. Thus, the partitioning strategy in use affects the whole system. Partitioning is usually realized using a partitioning key that allocates rows to shards. When hash partitioning is used, the result of a hash function applied to the key determines which shard should be used. List partitioning provides a fixed mapping of keys to shards. Similarly, range partitioning assigns numerical ranges to shards. Combining different criteria is called composite partitioning. For instance, the aforementioned mechanism of consistent hashing can be considered a combination of hash and list partitioning.
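The three basic strategies can be sketched as key-to-shard functions. The shard counts, range boundaries and list mapping below are illustrative assumptions.

```python
import hashlib

def hash_partition(key, num_shards):
    # Hash partitioning: hash of the key, modulo the shard count.
    digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return digest % num_shards

def range_partition(key, boundaries):
    # Range partitioning: boundaries [1000, 2000] define the shards
    # [0, 1000), [1000, 2000) and [2000, ...).
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)

def list_partition(key, mapping, default_shard=0):
    # List partitioning: a fixed, explicit mapping of keys to shards.
    return mapping.get(key, default_shard)
```

Range and list partitioning keep related keys together (useful for range scans and locality), while hash partitioning spreads load evenly but destroys key order.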

Although we explained partitioning using tables, rows and columns, most of the concepts are valid for non-relational database systems as well. A storage organization solely based on keys makes this concept even more apparent.

Replication, data partitioning and sharding represent orthogonal concepts, and they are partially contradictory. However, in large-scale database systems, all of these concepts are inevitable as fundamental mechanisms. Otherwise, systems would not be able to accept huge amounts of data and simultaneous read/write requests, deal with faults, and provide low-latency responses at the same time. Hence, deliberate trade-offs are required in practice.