This article introduces one of many solutions for ensuring high availability (HA): DRBD (Distributed Replicated Block Device). The main goal of the article is not to present advanced configuration cases but to introduce DRBD, describe its most important features and summarize the basic information. In the conclusion I pick out the most interesting features from DRBD's development roadmap that are planned for upcoming versions, and present some interesting statistics about DRBD usage. Throughout the article I refer to DRBD version 8.3.10 unless stated otherwise.
DRBD (Distributed Replicated Block Device) is a software-based, open source replicated storage solution for mirroring the content of block devices between servers over a dedicated network, so no specialized hardware is needed. It is essentially a building block for forming high availability (HA) clusters and can be thought of as a network-based RAID-1. DRBD can mirror data on hard disks, partitions and logical volumes, and it works transparently in real time in synchronous or asynchronous mode over different types of replication transports. More information about the synchronous and asynchronous replication modes can be found in section 4.3.
The term DRBD refers both to a logical block device and to the software, which consists of a kernel module and userspace management applications. Since the kernel module provides a driver for a virtual block device, DRBD sits as near the bottom of the system's I/O stack as possible (see figure 1). Such a virtual block device can actually be used as a PV (physical volume) for LVM.
At the beginning the focus was on single-primary mode, but DRBD now also supports off-site nodes and dual-primary mode (i.e. running two active nodes at the same time). At the end of 2009 DRBD was accepted into the mainline Linux kernel and released with version 2.6.33.
Figure 1: DRBD's position within the Linux I/O stack
The core of DRBD was developed by Philipp Reisner, starting on 8 December 1999, as part of his Master's thesis at the Vienna University of Technology in Austria. This early version of DRBD was intended for storing email messages redundantly.
In November 2001 Reisner co-founded the company LINBIT, which focuses on advancing the development of DRBD and Linux HA for the enterprise sector.
The next important event was Lars Ellenberg joining LINBIT in May 2005. Lars impressed Phil so much that the two have led the R&D team together ever since.
The next major step was the release of DRBD version 8 in January 2007. This version broke previous performance barriers and introduced the dual-primary clustering capability, which allows simultaneous write access from two cluster nodes.
LINBIT USA, LLC was founded in 2008, focused primarily on integration services in North, Central and South America, as well as development, consultancy and 24/7 support. In the same year LINBIT made the formerly commercially licensed add-on DRBD+ open source. The add-on was merged into DRBD 8.3 and offered stacked device support, support for huge devices larger than 4 TB, checksum-based resync and three- and four-node clustering support (see [2, p. 46-47, 60-63]).
As already mentioned, DRBD was merged into the mainline Linux kernel (version 2.6.33) at the end of 2009.
In this section I'll refer to DRBD version 8.3.10 and outline its primary features, capabilities and objectives.
DRBD supports large single block devices reaching up to one petabyte on 64-bit architectures. It offers two main operation modes: single-primary and dual-primary. In single-primary mode it's possible to use any conventional filesystem (e.g. ext3, ext4, ReiserFS or XFS), but in dual-primary mode we have to choose a distributed (cluster-aware) filesystem such as GFS2 or OCFS2.
To add redundancy, DRBD facilitates incorporation into existing infrastructure through many integration scripts (e.g. for Heartbeat, Red Hat Cluster excluding the GUI, Xen and LVM). Speaking of LVM, DRBD works with both physical volumes (PV) and logical volumes (LV) and also supports LV snapshots. Another supported project is the Enterprise Volume Management System (EVMS), a single unified system for handling all storage management tasks, which allows the use of the following filesystems: ext2/3, JFS, ReiserFS, XFS, swap, OCFS2, NTFS and FAT.
If we need to secure DRBD's transport channels and stored data, we can transparently use IPsec or a VPN together with standard OS block device encryption. Peer authentication during the initial connection can be achieved with a shared secret. When necessary, it's possible to perform an online data integrity verification (e.g. as soon as a secondary node is promoted after a failure of the primary node).
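As a rough sketch, the shared secret and optional in-transit integrity checking are configured in the resource's net section; the resource name, secret and algorithm below are placeholders and example values only:

resource resource_name {
  net {
    cram-hmac-alg      sha1;            # HMAC algorithm used for peer authentication
    shared-secret      "some-secret";   # must be identical on both nodes
    data-integrity-alg sha1;            # optional: checksum replicated data in transit
  }
  ...
}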
A DRBD deployment is resistant to (and allows recovery from) several types of failure (e.g. of a node, the storage device or the network) with fast and effective resynchronization (resync). After a complete failure, DRBD automatically detects the most up-to-date data, and the following resync transfers only those blocks which were modified during the outage; the resync can also be placed under bandwidth control. Moreover, the resync can be based on checksums, which makes synchronization even faster and independent of the device size. Another feature is online device verification, which lets us perform a block-by-block data integrity check of the nodes very efficiently in terms of network bandwidth. All of this can, however, noticeably impact CPU utilization and load.
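These knobs live in the resource's syncer section; the following is only an illustrative fragment with example values, and the verification itself is then triggered with drbdadm:

resource resource_name {
  syncer {
    rate       40M;    # limit resync bandwidth (example value)
    csums-alg  sha1;   # enable checksum-based resync
    verify-alg sha1;   # algorithm used by online verification
  }
  ...
}

# run an online verification of the connected resource
drbdadm verify resource_name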
Administrators may also appreciate the configurable masking of local I/O errors, for which there are three basic strategies. The detach strategy, which is the recommended and most commonly used one, makes the node drop its backing device and continue in diskless mode whenever a lower-level I/O error occurs. The other strategies are not described here and can be found in the documentation [2, p. 39-40]. Events like an outage of the primary node (pri-lost) and many others can be managed through event handlers [2, p. 126].
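A minimal sketch of selecting the detach strategy in the resource's disk section (the resource name is a placeholder):

resource resource_name {
  disk {
    on-io-error detach;   # on a lower-level I/O error, drop the backing device and go diskless
  }
  ...
}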
Fencing at the resource level is provided by scripts. The scripts use Pacemaker's constraints and should prevent Pacemaker from promoting a DRBD Master/Slave resource when its DRBD replication link has been interrupted. This keeps Pacemaker from starting a service with outdated data and causing an unwanted "time warp" in the process [2, p. 57]. It's configurable with the fencing directive in the resource context of DRBD's configuration file.
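A sketch of a typical resource-level fencing setup for Pacemaker, assuming the constraint scripts shipped with DRBD are installed under /usr/lib/drbd/ (the exact path may differ between distributions):

resource resource_name {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  ...
}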
As a replication transport it's possible to use TCP/IP over IPv4/IPv6, SuperSockets over Dolphin NICs or SDP over InfiniBand. Mirroring over long distances and high-throughput internet links often requires active bandwidth management, which is also supported by DRBD. This bandwidth management optionally includes completely suspending replication when the bandwidth is not sufficient.
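Selecting a transport other than plain TCP/IP over IPv4 is done by prefixing the address with an address family keyword; the fragment below is only a sketch showing SDP (the node name, device and addresses are placeholders):

resource resource_name {
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   sdp 10.0.0.1:7788;   # SDP over InfiniBand instead of the default ipv4
    meta-disk internal;
  }
  ...
}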
DRBD Proxy is a commercial facility for buffering ongoing replication data from the primary node; the commercial licence and technical support are provided directly by LINBIT. It can be used for long-distance replication, helps to mitigate write bursts and the time spent blocking while a socket output buffer is full, and can optionally compress and decompress the data it forwards. The DRBD Proxy buffer is configurable and limited only by the address space size and the available physical memory.
After a brief overview of the features, it's time to introduce DRBD's components and management tools. The configuration file drbd.conf is by default located in the /etc/ directory. The configuration of each resource must match on all nodes.
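For orientation, a minimal drbd.conf skeleton as used by DRBD 8.3 might look like the following sketch; the values are examples only:

global {
  usage-count yes;          # participate in LINBIT's online usage counter (optional)
}
common {
  protocol C;               # defaults inherited by all resources
  syncer { rate 40M; }
}
resource r0 {
  # per-resource settings, identical on every node (see the fragments later in the article)
  ...
}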
Since DRBD ships with a number of management tools and scripts, let's highlight the most important ones [2, p. 123-161]: drbdadm, the high-level administration tool; drbdsetup, which configures the DRBD kernel module; and drbdmeta, which creates and manipulates DRBD metadata.
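As an illustration, a typical first-time bring-up of a resource with drbdadm could look like this sketch (r0 is a placeholder; the --overwrite-data-of-peer step is run on one node only, to choose the source of the initial synchronization):

drbdadm create-md r0                             # initialize the resource's metadata
drbdadm up r0                                    # attach the backing device and connect to the peer
drbdadm -- --overwrite-data-of-peer primary r0   # promote this node and start the initial sync
cat /proc/drbd                                   # inspect connection and disk state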
LINBIT has developed a graphical user interface called DRBD Management Console (DRBD MC). DRBD MC is a Java application that eases the burden of managing DRBD and Pacemaker/Corosync or Heartbeat based cluster systems (cf. Figure 2). The application is designed for administrators and developers and does not require any agent or client daemon on the cluster nodes; it uses SSH access, just as you do when working with your servers [3].
DRBD MC provides wizard-driven installation and updates of DRBD and Pacemaker, wizard-driven creation of the Heartbeat configuration file ha.cf, a complete status view of DRBD, NICs and block devices, creation of new DRBD resources, a text console showing all commands that DRBD MC issues, and more. More information can be found in [3].
Figure 2: DRBD Management Console (DRBD MC)
In dual-primary mode, any resource is, at any given time, in the primary role on both cluster nodes. That's why this mode requires the use of a shared cluster filesystem such as GFS or OCFS2, which uses distributed lock management.
Dual-primary mode is the preferred approach for load-balancing clusters which require concurrent data access from both nodes at the same time. By default this mode is disabled and must be explicitly enabled in the configuration file (see the fragment below). However, this solution is more sensitive to replication network failures.
resource resource_name {
  net {
    allow-two-primaries;
  }
  ...
  startup {
    become-primary-on both;
  }
  ...
}
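Assuming drbd.conf has been updated identically on both nodes, the change is typically activated roughly like this (resource_name is a placeholder):

# on both nodes: re-read the changed configuration
drbdadm adjust resource_name
# on both nodes: promote the resource, after which the cluster filesystem can be mounted
drbdadm primary resource_name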
Since version 8.3.0, DRBD supports a three-node setup. In this configuration DRBD adds a third node to an existing 2-node cluster and replicates data to that node, where it can be used for backup and disaster recovery purposes. To enable three-way replication we need to add another, stacked DRBD resource on top of the existing resource holding our production data (see figure 3).
The stacked resource is usually replicated using asynchronous replication (Protocol A, cf. section 4.3), whereas the production data would usually use synchronous replication (Protocol C). Selecting Protocol A is not required for stacked resources; the choice depends on your application, and basically you can select any of DRBD's replication protocols. Three-way replication can be used permanently, where the third node is kept updated with data from the production cluster, or it may be employed on demand, where the production cluster is normally disconnected from the backup site and site-to-site synchronization is performed on a regular basis, for example by a nightly cron job.
When a stacked device is used, it becomes the active device that we mount and use. The device metadata is stored twice: on the underlying DRBD device and on the stacked DRBD device. The stacked device must always use internal metadata.
resource r0 {
  protocol C;
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

resource r0-U {
  protocol A;
  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   147.251.53.40:7788;
  }
  on node3 {
    device    /dev/drbd10;
    disk      /dev/hda6;
    address   147.251.53.41:7788;   # Public IP of the backup node
    meta-disk internal;
  }
}
You can see that the keyword stacked-on-top-of replaces one of the on sections normally found in a 2-node cluster configuration. The keyword informs DRBD that the resource in which it appears is a stacked resource. Do not use stacked-on-top-of in a lower-level resource. More information about how to maintain or enable stacked resources can be found in [2, p. 46-47].
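A rough sketch of bringing up the stacked resource r0-U defined above, using drbdadm's --stacked option (the exact option placement and steps may differ slightly between DRBD versions):

# on the node that is currently primary for the lower-level resource r0
drbdadm --stacked create-md r0-U   # initialize metadata of the stacked device
drbdadm --stacked up r0-U          # attach and connect the stacked device
drbdadm --stacked primary r0-U     # promote it; /dev/drbd10 is now the device to mount and use

# on the backup node node3
drbdadm create-md r0-U
drbdadm up r0-U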
Figure 3: Three-node setup with a stacked DRBD resource
Suppose we have three DRBD resources in use (two unstacked and one stacked). Achieving 4-way storage redundancy requires two two-node Pacemaker clusters. In this configuration up to three nodes can fail while service availability is still provided. Since the configuration gets rather complex, please see the configuration, illustration and more information in [2, p. 60-63].
Split brain is a situation where both nodes were switched to the primary role while disconnected, due to a temporary failure of all network links between the cluster nodes, an intervention by cluster management software, or human error. It's a potentially harmful state, because modifications might have been made to the data on either node without being replicated to the peer. You can imagine the situation as two diverging sets of data that have been created and cannot be trivially merged.
DRBD allows automatic operator notification (by email or some other means) when it detects split brain. The split brain victim (the node whose modifications will be discarded) is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor propagate to the victim.
Automatic split brain recovery has several configurable policies. DRBD applies its split brain recovery procedures based on the number of nodes in the primary role at the time the split brain is detected. For this it examines the after-sb-Npri keywords in the resource's net configuration section, where N is zero, one or two (e.g. after-sb-1pri). Each keyword covers the situation where split brain has just been detected and the resource is, at that moment, in the primary role on no node, one node or two nodes, respectively. The keywords take specific actions that determine how the split brain should be resolved in that situation (cf. the configuration fragment in File 2). An overview of the actions can be found in [2, p. 45]. Automatic split brain recovery is disabled by default in DRBD 8 and newer.
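Since the fragment referenced as File 2 is not reproduced here, the following is only an illustrative sketch of such a net section; the chosen actions are examples, not recommendations:

resource resource_name {
  net {
    after-sb-0pri discard-zero-changes;   # no primaries: sync from the node that made changes, if only one did
    after-sb-1pri discard-secondary;      # one primary: discard the secondary's changes
    after-sb-2pri disconnect;             # two primaries: give up and wait for manual recovery
  }
  ...
}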
The counterpart to automatic recovery after split brain is manual data recovery. If automatic split brain recovery is not configured and DRBD detects that both nodes are (or were at some point while disconnected) in the primary role, it immediately tears down the replication connection.
At this point you have to intervene manually by selecting the node whose modifications will be discarded. This intervention is made with the following commands:
# at the split brain victim
drbdadm secondary resource
drbdadm -- --discard-my-data connect resource

# at the split brain survivor
drbdadm connect resource
After reconnection, your split brain victim immediately changes its connection state to SyncTarget, and has its modifications overwritten by the remaining primary node. Once resynchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again[2, p. 53-54].
At the beginning of this section, it's important to make clear the difference between inconsistent and outdated data.
Inconsistent data is data that cannot be expected to be accessible or useful in any manner. The prime example is the data on a node that is currently the target of an ongoing synchronization: such data is part obsolete, part up to date, and impossible to identify as either.
Outdated data, on the other hand, is data on a secondary node that is consistent but no longer in sync with the primary node. This occurs on any interruption (temporary or permanent) of the replication link. Data on an outdated, disconnected secondary node is expected to be clean, but it reflects a state of the peer node which might be obsolete. To avoid services using outdated data, DRBD disallows promoting a resource that is in the outdated state. DRBD also allows an external application to outdate a secondary node as soon as a network interruption occurs; DRBD will then refuse to switch that node to the primary role, which prevents applications from using the outdated data. Whenever an outdated resource has its replication link re-established, its outdated flag is automatically cleared and background synchronization follows [2, p. 10].
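The external application is typically an outdate-peer or fence-peer handler (e.g. DRBD's dopd in Heartbeat setups), but the same effect can be sketched by hand; resource_name is a placeholder:

# on the disconnected secondary node
drbdadm outdate resource_name
# a later attempt to promote this node will now be refused
drbdadm primary resource_name   # fails as long as the data is marked outdated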
Troubleshooting and recovery after an outage can be separated into the following outage categories: failure of the storage device (hard drive), failure of a node, and failure of the network (replication link).
DRBD can be used in various types of deployments and configurations. One typical situation in which we might choose DRBD is a load-balancing or high-availability configuration. To fully implement such a configuration in practice, more than just network mirroring is required. For that reason DRBD facilitates integration with Pacemaker clusters, Heartbeat clusters, Red Hat Cluster Suite and/or Xen. For more information on how to integrate DRBD into these systems see [2, p. 56-101].
Services that can benefit from DRBD include file services (e.g. SMB, NFS), databases (e.g. MySQL, PostgreSQL), authentication services (e.g. LDAP), network services (e.g. DHCP) and many others.
On the other hand, DRBD is a Linux-only solution and might not be suitable for every application, e.g. when cross-site replication is constrained by the available bandwidth and write rate, or when we need to establish mirroring between more than four nodes.
If you have read the article up to this point, you should have a basic understanding of what DRBD is, what its underlying principles are, what it offers and where to find more information about the parts that interest you.
In my opinion, the DRBD project is a very mature piece of software with many features that would otherwise have to be bought as expensive hardware or a commercial software solution. Since DRBD is open source with a very large community and active development, it can be a good option for those who want shared storage without special shared hardware, a scalable and stable deployment and/or a cheap HA solution for many applications. Sure, DRBD is better, but not faster or cheaper, than a single server; on the other hand, it's cheaper but might not be better than a replicated SAN or NAS. As the examples in section 4.9 tried to show, DRBD is not suitable for every kind of application related to network mirroring.
Let's talk numbers. In the last twelve months (June 2010 - May 2011) over 76,500 installations or updates of DRBD were recorded, and almost 118,000 installations in total, based on the sum of current usage across all installed DRBD deployments. I'll finish this paragraph with something I read in the user guide and many tutorials, and which you should keep in mind: "Replication is not a replacement for backups!"
The following list gives a basic overview of selected features which should be implemented in upcoming DRBD releases and might be interesting for you.
Well, that too, but mainly: which idiot is writing in English here?
Well, I think he's definitely smarter than the one who calls him an idiot.
temp — I really wouldn't bet on that being permanent.
An interesting article, though I was a bit thrown by the paragraph where Pacemaker suddenly appears without any explanation of what it is. The text under the screenshot of the web interface doesn't typographically fit the rest of the text.
What I'd be interested in, and what I didn't find here, is how integration with filesystems works. It's correctly stated that a primary-primary configuration requires a distributed filesystem, but I don't quite understand how a non-cluster filesystem can cope with a block device whose data changes underneath its kernel structures. Are the few named filesystems patched, or does the block device appear frozen (and where do the updates received from the primary node get stored then)?