This article introduces one of many solutions for ensuring high availability (HA): DRBD (Distributed Replicated Block Device). The main goal of the article is not to present advanced configuration cases but to introduce DRBD, describe its most important features and summarize the basic information. In the conclusion I pick out the most interesting features from DRBD's development roadmap that are planned for upcoming versions, and present some interesting statistics about DRBD usage. Throughout the article I refer to DRBD version 8.3.10 unless stated otherwise.
DRBD (Distributed Replicated Block Device) is a software-based, open source replicated storage solution for mirroring the content of block devices between servers over a dedicated network, so no specialized hardware is needed. It is essentially a building block for forming high availability (HA) clusters and can be thought of as a network-based RAID-1. DRBD can mirror data on hard disks, partitions and logical volumes, and it works transparently in real time in synchronous or asynchronous mode over different types of replication transports. More information about the synchronous and asynchronous replication modes can be found in section 4.3.
The term DRBD refers both to a logical block device and to the software, which consists of a kernel module and userspace management applications. Since the kernel module provides a driver for a virtual block device, DRBD sits as near the bottom of the system's I/O stack as possible (see figure 1). Such a virtual block device can actually be used as a PV (physical volume) for LVM.
At the beginning the focus was on single-primary mode, but DRBD now also supports off-site nodes and dual-primary mode (i.e. running two active nodes at the same time). At the end of 2009 DRBD was accepted into the mainline Linux kernel and released with version 2.6.33.
Figure 1: DRBD's position within the Linux I/O stack
The core of DRBD was developed by Philipp Reisner, starting on 8 December 1999, as part of his Master's thesis at the Vienna University of Technology in Austria. This early version of DRBD was intended for storing email messages redundantly.
In November 2001 Reisner co-founded the company LINBIT, which focuses on advancing the development of DRBD and Linux HA for the enterprise sector.
The next important event was Lars Ellenberg joining LINBIT in May 2005. Lars impressed Phil so much that the two have led the R&D team together ever since.
The next major step was the release of DRBD version 8 in January 2007. This version broke previous performance barriers and introduced the dual-primary clustering capability, which allows simultaneous write access from two cluster nodes.
LINBIT USA, LLC was founded in 2008, focused primarily on integration services in North, Central and South America, as well as development, consultancy and 24/7 support. In the same year LINBIT made the formerly commercially licensed add-on DRBD+ open source. The add-on was merged into DRBD 8.3 and offered stacked device support, support for huge devices larger than 4 TB, checksum-based resync and three- and four-node clustering support (see [2, p. 46-47, 60-63]).
As already mentioned, DRBD was merged into the mainline Linux kernel (version 2.6.33) at the end of 2009.
In this section I'll refer to DRBD version 8.3.10 and outline its primary features, capabilities and objectives.
DRBD supports large single block devices reaching up to one petabyte on 64-bit architectures. It offers two main operation modes: single-primary and dual-primary. In single-primary mode it's possible to use any conventional filesystem (e.g. ext3, ext4, ReiserFS or XFS), but in dual-primary mode we have to choose a distributed (cluster-aware) filesystem such as GFS2 or OCFS2.
To add redundancy, DRBD facilitates incorporation into existing infrastructure through many integration scripts (e.g. for Heartbeat, Red Hat Cluster excluding the GUI, Xen and LVM). Speaking of LVM, DRBD works with both physical volumes (PV) and logical volumes (LV) and also supports LV snapshots. Another supported project is the Enterprise Volume Management System (EVMS), a single unified system for handling all storage management tasks, which allows the use of the following filesystems: ext2/3, JFS, ReiserFS, XFS, swap, OCFS2, NTFS and FAT.
If we need to secure DRBD's transport channels and stored data, we can transparently use IPsec or a VPN together with standard OS block device encryption. Peer authentication during the initial connection can be achieved with a shared secret. When necessary, it's possible to perform an online data integrity verification (e.g. as soon as a secondary node is promoted after a failure of the primary node).
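As a rough sketch, the shared secret and optional in-transit integrity checking are configured in the resource's net section; the resource name, secret and algorithm below are placeholders and example values only:

resource resource_name {
  net {
    cram-hmac-alg      sha1;            # HMAC algorithm used for peer authentication
    shared-secret      "some-secret";   # must be identical on both nodes
    data-integrity-alg sha1;            # optional: checksum replicated data in transit
  }
  ...
}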
A DRBD deployment is resistant to (and allows recovery from) several types of failure (e.g. of a node, the storage device or the network) with fast and effective resynchronization (resync). After a complete failure, DRBD automatically detects the most up-to-date data, and the following resync transfers only those blocks which were modified during the outage; the resync can also be placed under bandwidth control. Moreover, the resync can be based on checksums, which makes synchronization even faster and independent of the device size. Another feature is online device verification, which lets us perform a block-by-block data integrity check of the nodes very efficiently in terms of network bandwidth. All of this can, however, noticeably impact CPU utilization and load.
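These knobs live in the resource's syncer section; the following is only an illustrative fragment with example values, and the verification itself is then triggered with drbdadm:

resource resource_name {
  syncer {
    rate       40M;    # limit resync bandwidth (example value)
    csums-alg  sha1;   # enable checksum-based resync
    verify-alg sha1;   # algorithm used by online verification
  }
  ...
}

# run an online verification of the connected resource
drbdadm verify resource_name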
Administrators may also appreciate the configurable masking of local I/O errors, for which there are three basic strategies. The detach strategy, which is the recommended and most commonly used one, makes the node drop its backing device and continue in diskless mode whenever a lower-level I/O error occurs. The other strategies are not described here and can be found in the documentation [2, p. 39-40]. Events like an outage of the primary node (pri-lost) and many others can be managed through event handlers [2, p. 126].
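A minimal sketch of selecting the detach strategy in the resource's disk section (the resource name is a placeholder):

resource resource_name {
  disk {
    on-io-error detach;   # on a lower-level I/O error, drop the backing device and go diskless
  }
  ...
}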
Fencing at the resource level is provided by scripts. The scripts use Pacemaker's constraints and should prevent Pacemaker from promoting a DRBD Master/Slave resource when its DRBD replication link has been interrupted. This keeps Pacemaker from starting a service with outdated data and causing an unwanted "time warp" in the process [2, p. 57]. It's configurable with the fencing directive in the resource context of DRBD's configuration file.
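A sketch of a typical resource-level fencing setup for Pacemaker, assuming the constraint scripts shipped with DRBD are installed under /usr/lib/drbd/ (the exact path may differ between distributions):

resource resource_name {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  ...
}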
As a replication transport it's possible to use TCP/IP over IPv4/IPv6, SuperSockets over Dolphin NICs or SDP over InfiniBand. Mirroring over long distances and high-throughput internet links often requires active bandwidth management, which is also supported by DRBD. This bandwidth management optionally includes completely suspending replication when the bandwidth is not sufficient.
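Selecting a transport other than plain TCP/IP over IPv4 is done by prefixing the address with an address family keyword; the fragment below is only a sketch showing SDP (the node name, device and addresses are placeholders):

resource resource_name {
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   sdp 10.0.0.1:7788;   # SDP over InfiniBand instead of the default ipv4
    meta-disk internal;
  }
  ...
}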
DRBD Proxy is a commercial facility for buffering ongoing replication data from the primary node; the commercial licence and technical support are provided directly by LINBIT. It can be used for long-distance replication, helps to mitigate write bursts and the time spent blocking while a socket output buffer is full, and can optionally compress and decompress the data it forwards. The DRBD Proxy buffer is configurable and limited only by the address space size and the available physical memory.
After a brief overview of the features, it's time to introduce DRBD's components and management tools. The configuration file drbd.conf is by default located in the /etc/ directory. The configuration of each resource must match on all nodes.
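For orientation, a minimal drbd.conf skeleton as used by DRBD 8.3 might look like the following sketch; the values are examples only:

global {
  usage-count yes;          # participate in LINBIT's online usage counter (optional)
}
common {
  protocol C;               # defaults inherited by all resources
  syncer { rate 40M; }
}
resource r0 {
  # per-resource settings, identical on every node (see the fragments later in the article)
  ...
}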
Since DRBD ships with a number of management tools and scripts, let's highlight the most important ones [2, p. 123-161]: drbdadm, the high-level administration tool; drbdsetup, which configures the DRBD kernel module; and drbdmeta, which creates and manipulates DRBD metadata.
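As an illustration, a typical first-time bring-up of a resource with drbdadm could look like this sketch (r0 is a placeholder; the --overwrite-data-of-peer step is run on one node only, to choose the source of the initial synchronization):

drbdadm create-md r0                             # initialize the resource's metadata
drbdadm up r0                                    # attach the backing device and connect to the peer
drbdadm -- --overwrite-data-of-peer primary r0   # promote this node and start the initial sync
cat /proc/drbd                                   # inspect connection and disk state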
LINBIT has developed a graphical user interface called DRBD Management Console (DRBD MC). DRBD MC is a Java application that eases the burden of managing DRBD and Pacemaker/Corosync or Heartbeat based cluster systems (cf. Figure 2). The application is designed for administrators and developers and does not require any agent or client daemon on the cluster nodes; it uses SSH access, just as you do when working with your servers [3].
DRBD MC provides wizard-driven installation and updates of DRBD and Pacemaker, wizard-driven creation of the Heartbeat configuration file ha.cf, a complete status view of DRBD, NICs and block devices, creation of new DRBD resources, a text console showing all commands that DRBD MC issues, and more. More information can be found in [3].
Figure 2: DRBD Management Console (DRBD MC)
In dual-primary mode, any resource is, at any given time, in the primary role on both cluster nodes. That's why this mode requires the use of a shared cluster filesystem such as GFS or OCFS2, which uses distributed lock management.
Dual-primary mode is the preferred approach for load-balancing clusters which require concurrent data access from both nodes at the same time. By default this mode is disabled and must be explicitly enabled in the configuration file (see the fragment below). However, this solution is more sensitive to replication network failures.
resource resource_name {
  net {
    allow-two-primaries;
  }
  ...
  startup {
    become-primary-on both;
  }
  ...
}
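Assuming drbd.conf has been updated identically on both nodes, the change is typically activated roughly like this (resource_name is a placeholder):

# on both nodes: re-read the changed configuration
drbdadm adjust resource_name
# on both nodes: promote the resource, after which the cluster filesystem can be mounted
drbdadm primary resource_name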
Since version 8.3.0, DRBD supports a three-node setup. In this configuration DRBD adds a third node to an existing 2-node cluster and replicates data to that node, where it can be used for backup and disaster recovery purposes. To enable three-way replication we need to add another, stacked DRBD resource on top of the existing resource holding our production data (see figure 3).
The stacked resource is usually replicated using asynchronous replication (Protocol A, cf. section 4.3), whereas the production data would usually use synchronous replication (Protocol C). Selecting Protocol A is not required for stacked resources; the choice depends on your application, and basically you can select any of DRBD's replication protocols. Three-way replication can be used permanently, where the third node is kept updated with data from the production cluster, or it may be employed on demand, where the production cluster is normally disconnected from the backup site and site-to-site synchronization is performed on a regular basis, for example by a nightly cron job.
When a stacked device is used, it becomes the active device that we mount and use. The device metadata is stored twice: on the underlying DRBD device and on the stacked DRBD device. The stacked device must always use internal metadata.
resource r0 {
  protocol C;
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

resource r0-U {
  protocol A;
  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   147.251.53.40:7788;
  }
  on node3 {
    device    /dev/drbd10;
    disk      /dev/hda6;
    address   147.251.53.41:7788;   # Public IP of the backup node
    meta-disk internal;
  }
}
You can see that the keyword stacked-on-top-of replaces one of the on sections normally found in a 2-node cluster configuration. The keyword informs DRBD that the resource in which it appears is a stacked resource. Do not use stacked-on-top-of in a lower-level resource. More information about how to maintain or enable stacked resources can be found in [2, p. 46-47].
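A rough sketch of bringing up the stacked resource r0-U defined above, using drbdadm's --stacked option (the exact option placement and steps may differ slightly between DRBD versions):

# on the node that is currently primary for the lower-level resource r0
drbdadm --stacked create-md r0-U   # initialize metadata of the stacked device
drbdadm --stacked up r0-U          # attach and connect the stacked device
drbdadm --stacked primary r0-U     # promote it; /dev/drbd10 is now the device to mount and use

# on the backup node node3
drbdadm create-md r0-U
drbdadm up r0-U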
Figure 3: Three-node setup with a stacked DRBD resource
Suppose we have three DRBD resources in use (two unstacked and one stacked). Achieving 4-way storage redundancy requires two two-node Pacemaker clusters. In this configuration up to three nodes can fail while service availability is still provided. Since the configuration gets rather complex, please see the configuration, illustration and more information in [2, p. 60-63].
Split brain is a situation where both nodes were switched to the primary role while disconnected, due to a temporary failure of all network links between the cluster nodes, an intervention by cluster management software, or human error. It's a potentially harmful state, because modifications might have been made to the data on either node without being replicated to the peer. You can imagine the situation as two diverging sets of data that have been created and cannot be trivially merged.
DRBD allows automatic operator notification (by email or some other means) when it detects split brain. The split brain victim (the node whose modifications will be discarded) is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor propagate to the victim.
Automatic split brain recovery has several configurable policies. DRBD applies its split brain recovery procedures based on the number of nodes in the primary role at the time the split brain is detected. For this it examines the after-sb-Npri keywords in the resource's net configuration section, where N is zero, one or two (e.g. after-sb-1pri). Each keyword covers the situation where split brain has just been detected and the resource is, at that moment, in the primary role on no node, one node or two nodes, respectively. The keywords take specific actions that determine how the split brain should be resolved in that situation (cf. the configuration fragment in File 2). An overview of the actions can be found in [2, p. 45]. Automatic split brain recovery is disabled by default in DRBD 8 and newer.
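Since the fragment referenced as File 2 is not reproduced here, the following is only an illustrative sketch of such a net section; the chosen actions are examples, not recommendations:

resource resource_name {
  net {
    after-sb-0pri discard-zero-changes;   # no primaries: sync from the node that made changes, if only one did
    after-sb-1pri discard-secondary;      # one primary: discard the secondary's changes
    after-sb-2pri disconnect;             # two primaries: give up and wait for manual recovery
  }
  ...
}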
The counterpart to automatic recovery after split brain is manual data recovery. If automatic split brain recovery is not configured and DRBD detects that both nodes are (or were at some point while disconnected) in the primary role, it immediately tears down the replication connection.
At this point you have to intervene manually by selecting the node whose modifications will be discarded. This intervention is made with the following commands:
# at the split brain victim
drbdadm secondary resource
drbdadm -- --discard-my-data connect resource

# at the split brain survivor
drbdadm connect resource
After reconnection, your split brain victim immediately changes its connection state to SyncTarget, and has its modifications overwritten by the remaining primary node. Once resynchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again[2, p. 53-54].
At the beginning of this section, it's important to make clear the difference between inconsistent and outdated data.
Inconsistent data is data that cannot be expected to be accessible or useful in any manner. The prime example is the data on a node that is currently the target of an ongoing synchronization: such data is part obsolete, part up to date, and impossible to identify as either.
Outdated data, on the other hand, is data on a secondary node that is consistent but no longer in sync with the primary node. This occurs on any interruption (temporary or permanent) of the replication link. Data on an outdated, disconnected secondary node is expected to be clean, but it reflects a state of the peer node which might be obsolete. To avoid services using outdated data, DRBD disallows promoting a resource that is in the outdated state. DRBD also allows an external application to outdate a secondary node as soon as a network interruption occurs; DRBD will then refuse to switch that node to the primary role, which prevents applications from using the outdated data. Whenever an outdated resource has its replication link re-established, its outdated flag is automatically cleared and background synchronization follows [2, p. 10].
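The external application is typically an outdate-peer or fence-peer handler (e.g. DRBD's dopd in Heartbeat setups), but the same effect can be sketched by hand; resource_name is a placeholder:

# on the disconnected secondary node
drbdadm outdate resource_name
# a later attempt to promote this node will now be refused
drbdadm primary resource_name   # fails as long as the data is marked outdated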
Troubleshooting and recovery after an outage can be separated into the following outage categories: failure of the storage device (hard drive), failure of a node, and failure of the network (replication link).
DRBD can be used in various types of deployments and configurations. One typical situation in which we might choose DRBD is a load-balancing or high-availability configuration. To fully implement such a configuration in practice, more than just network mirroring is required. For that reason DRBD facilitates integration with Pacemaker clusters, Heartbeat clusters, Red Hat Cluster Suite and/or Xen. For more information on how to integrate DRBD into these systems see [2, p. 56-101].
Services that can benefit from DRBD include file services (e.g. SMB, NFS), databases (e.g. MySQL, PostgreSQL), authentication services (e.g. LDAP), network services (e.g. DHCP) and many others.
On the other hand, DRBD is a Linux-only solution and might not be suitable for every application, e.g. when cross-site replication is constrained by the available bandwidth and write rate, or when we need to establish mirroring between more than four nodes.
If you have read the article up to this point, you should have a basic understanding of what DRBD is, what its underlying principles are, what it offers and where to find more information about the parts that interest you.
In my opinion, the DRBD project is a very mature piece of software with many features that would otherwise have to be bought as expensive hardware or a commercial software solution. Since DRBD is open source with a very large community and active development, it can be a good option for those who want shared storage without special shared hardware, a scalable and stable deployment and/or a cheap HA solution for many applications. Sure, DRBD is better, but not faster or cheaper, than a single server; on the other hand, it's cheaper but might not be better than a replicated SAN or NAS. As the examples in section 4.9 tried to show, DRBD is not suitable for every kind of application related to network mirroring.
Let's talk numbers. In the last twelve months (June 2010 - May 2011) over 76,500 installations or updates of DRBD were recorded, and almost 118,000 installations in total, based on the sum of current usage across all installed DRBD deployments. I'll finish this paragraph with something I read in the user guide and many tutorials, and which you should keep in mind: "Replication is not a replacement for backups!"
The following list gives a basic overview of selected features which should be implemented in upcoming DRBD releases and might be interesting for you.
Well, that too, but mainly: which idiot is writing in English here?
Well, I think he's definitely smarter than the one who calls him an idiot.
temp — I really wouldn't bet on that being permanent.
An interesting article, though I was a bit thrown by the paragraph where Pacemaker suddenly appears without any explanation of what it is. The text under the screenshot of the web interface doesn't typographically fit the rest of the text.
What I'd be interested in, and what I didn't find here, is how integration with filesystems works. It's correctly stated that a primary-primary configuration requires a distributed filesystem, but I don't quite understand how a non-cluster filesystem can cope with a block device whose data changes underneath its kernel structures. Are the few named filesystems patched, or does the block device appear frozen (and where do the updates received from the primary node get stored then)?