KVM Cluster with DRBD/GFS

Posted on Jul 6, 2010 in Article

 

Background

Recently, I started a project at my company (www.eyemg.com) to migrate from VMware to KVM. Our standard server deployment is based on RHEL5 running on HP DL380 hardware. Given that hardware/software deployment, it made sense to align ourselves with Red Hat’s offering of KVM. We are able to achieve feature parity with VMware Server while adding live migration. At the time of this writing, live migration was not available on VMware Server and required an upgrade to ESX Server with vMotion. KVM on RHEL5 provides these features at a price point that is much lower than VMware ESX Server.

Our hardware/software stack is well tested and generally very stable. This makes it an excellent base upon which to build a shared-nothing HA cluster system. The goal of this system is to have complete hardware and software redundancy for the underlying virtual machine host servers.

To use live migration, both nodes of an HA cluster must have access to a live copy of the same virtual machine data files. This is traditionally done with a SAN, but we use GFS2 over DRBD to gain the advantages of having no shared hardware/software at a lower cost. We implement RAID1 over Ethernet using a Primary/Primary configuration with DRBD protocol C.

I wrote this article because there is a vacuum of documentation on the Internet with solid implementation details around KVM in a production environment. The downside of this project is that it requires extensive knowledge and expertise in the use of several pieces of software. This software also requires some manual installation and complex configuration. Depending on your level of experience, this may be a daunting task.

My company (www.eyemg.com) offers KVM/DRBD/GFS2 as a hosted solution. VMware’s offering has the advantage of assisting with installation of many of these features, with the disadvantage of lacking a shared-nothing architecture.


Architecture

The hardware/software architecture is described in the drawing below


Hardware Inventory

The hardware is very standard and very robust. HP servers allow monitoring of power supply and CPU temperature, power supply failures, and hard drive failures. They also come equipped with a remote access card called Integrated Lights-Out (iLO). The iLO allows power control of the server, which is useful for fencing during clustering.

  • (2) DL380 G6
  • (16) 146GB SAS Drives 6Gbps
  • (2) Integrated Lights Out Cards (Built In to DL380)
  • 12GB RAM, Upgradable
  • 900GB Usable Space, Not easily upgradable because all of the drive slots are filled

Software Inventory

Using GFS2 on DRBD, we are able to provision pairs of servers which are capable of housing 30 to 40 virtual guests while providing complete physical redundancy. This solution is by no means simple and requires extensive knowledge of Linux, Red Hat Cluster, and DRBD.

  • DRBD: Distributed Replicated Block Device[1]
  • GFS2: Red Hat Global File System 2. Allows 1 to 16 nodes to access the same file system.[2]
  • CMAN: Red Hat cluster manager

Network Inventory

Two special interfaces are used to achieve functional parity with VMware Server/ESX. This allows the clustered pairs of KVM host servers access to different VLANs while communicating over bonded crossover cables for DRBD synchronization.

  • Bridge: Interface used to connect to multiple VLANs. There is a virtual bridge for each VLAN.
  • Bond: Three 1Gb Ethernet cards and three crossover cables work as one to provide a fast, reliable backend network for DRBD synchronization

Installation

Operating System Storage

The first step is to configure each machine with one large RAID5 volume. This can be done from the SmartStart CD[3] or from within the BIOS at boot time by pressing F9. Once the operating system is installed, the configuration can be checked with the following command.

Operating System

Currently, the operating system must be x86_64 and RHEL 5.4 or above. I will not detail RHEL5 installation, as there is extensive documentation. We use a kickstart installation which is burned to CD using a tool written in house, which will eventually be open sourced. Remember to leave enough space for your operating system; we use 36GB on our installations. The rest of the volume will be used for virtual machine disks.

Data Storage

Now that the operating system is installed, the next step is to create a partition for the virtual machine disk files. To get better performance from our virtual machines, this tutorial shows how to do boundary alignment per the recommendations of EMC[4] and VMware[5]. To get the most performance from this stack, the boundary alignment should also be done for guest OSes. Currently, this cannot be achieved from kickstart.

First partition, then reboot, and align the new partition.

Create a partition that looks like the following

Switch to advanced mode:

Command (m for help): x

Then hit ‘p’; you should see output that looks like the following. Use the start column in this advanced output to calculate your boundary alignment.

Realign the beginning

Use the following formula to get the correct starting block:

For example, if your partition currently begins on block 19682411:

Round up

Then multiply it by 128
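The arithmetic in these steps can be sketched in a few lines of Python. This is a minimal illustration, not the original formula: it assumes the step before "round up" is a division by 128, which is what the multiply-by-128 step implies. With 512-byte sectors (also an assumption), 128 blocks corresponds to a 64 KiB boundary.

```python
# Assumed alignment unit, taken from the "multiply it by 128" step above.
ALIGNMENT_BLOCKS = 128


def aligned_start(current_start: int, alignment: int = ALIGNMENT_BLOCKS) -> int:
    """Round a starting block up to the next multiple of `alignment`."""
    # Integer ceiling division avoids floating point rounding on large block numbers.
    units = (current_start + alignment - 1) // alignment
    return units * alignment


# The example starting block from the text.
print(aligned_start(19682411))  # 19682432
```

A start that is already on a boundary is returned unchanged, so the calculation is safe to repeat.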

You should end up with partitioning that looks like the following

Bonded Interfaces

This is used to bond three 1 Gbps interfaces together to be used with DRBD

Then configure each of eth1, eth2, eth3 to use the bond0 interface

Bridge Interfaces (VLAN)

Install bridge utilities

Create a bridge interface. Take the IP address from eth0 and move it to the new bridge interface

Remove the server’s main IP address from eth0 and change the eth0 interface to bind to a bridge

Configure iptables to allow traffic across the bridge

Restart services

DRBD

DRBD installation is straightforward and well documented here. Once installation is complete, use the following configuration file as a template. The most important parts to notice are the 256MB rate limit on sync and the interface that it uses. The rate limit prevents DRBD from using all of the bandwidth on the bonded interface.
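To put the 256MB figure in context, the rough calculation below compares it with the raw aggregate capacity of three bonded 1 Gbps links. It is only a back-of-the-envelope sketch: it assumes decimal megabytes, ignores protocol overhead, and assumes the bond can actually spread traffic across all three links.

```python
def bond_capacity_mb_per_s(links: int, link_gbps: float = 1.0) -> float:
    """Raw aggregate capacity of a bond in MB/s (decimal units, no overhead)."""
    return links * link_gbps * 1000 / 8


def sync_share(sync_rate_mb: float, links: int = 3) -> float:
    """Fraction of the bond's raw capacity that a resync at the rate limit could use."""
    return sync_rate_mb / bond_capacity_mb_per_s(links)


print(bond_capacity_mb_per_s(3))   # 375.0
print(round(sync_share(256), 2))   # 0.68
```

Under these assumptions a full-speed resync could use roughly two thirds of the bonded link, leaving the rest for normal replication traffic.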

Start the DRBD service. The default installation comes with an init script (see the DRBD documentation for details)

Once DRBD is started, it will begin its initial sync. You should see something like the following.

Then run a status to make sure both nodes are working correctly.
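As an aid to reading that status, here is a small sketch that parses one device line in the style DRBD 8.3 writes to /proc/drbd and checks the fields that matter for this setup. The sample lines are illustrative, not captured from the cluster described here, and older DRBD releases label the role field differently.

```python
import re


def parse_drbd_status(line: str) -> dict:
    """Extract two-letter key:value fields such as cs, ro and ds from a status line."""
    return dict(re.findall(r"\b([a-z]{2}):(\S+)", line))


def is_healthy(fields: dict, require_dual_primary: bool = False) -> bool:
    """True when the peers are connected and both copies of the data are up to date."""
    ok = fields.get("cs") == "Connected" and fields.get("ds") == "UpToDate/UpToDate"
    if require_dual_primary:
        ok = ok and fields.get("ro") == "Primary/Primary"
    return ok


# Illustrative status lines (assumed format, DRBD 8.3 style).
synced = " 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----"
syncing = " 0: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r----"

print(is_healthy(parse_drbd_status(synced), require_dual_primary=True))  # True
print(is_healthy(parse_drbd_status(syncing)))                            # False
```

Passing require_dual_primary=True is only meaningful after both nodes have been promoted.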

Make both nodes primary by running the following command on each

If you have trouble getting DRBD working, use the following to troubleshoot. Be patient, sometimes it takes a while to get the hang of it.

Troubleshooting Guide

GFS2

To use GFS2, a Red Hat Cluster must be in place and active. The manual installation of Red Hat Cluster is beyond the scope of this tutorial given the time and length I have allotted, but the Red Hat documentation should get you started if you are patient and persistent.

Eventually, I will replace this link with a guide for an open source tool that I will be releasing called fortitude. Fortitude installs Red Hat Cluster and provides net-snmp configuration, Nagios check scripts, and easy to use init scripts which really streamline the Red Hat Cluster build/deployment process. Fortitude is written and works in our environment, but needs some love and documentation before it can be released as open source. Until then, I leave you with the Red Hat documentation.

First install GFS tools

Then create GFS filesystem

There are a few tuning parameters that are critical for GFS2 to perform well. First, the file system must be configured to disable writing the access time on files and directories. Modify /etc/fstab to look similar to the following.
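One way to check an fstab entry for the access-time options is sketched below. It assumes the options in question are the standard noatime and nodiratime mount options; the device and mount point in the sample entry are hypothetical.

```python
ATIME_OPTIONS = ("noatime", "nodiratime")


def missing_atime_options(fstab_line: str) -> list:
    """Return the access-time options that an fstab entry does not yet set."""
    fields = fstab_line.split()
    # fstab fields: device, mount point, fs type, options, dump, pass
    options = fields[3].split(",")
    return [opt for opt in ATIME_OPTIONS if opt not in options]


# Hypothetical entries for a GFS2 file system on a DRBD device.
tuned = "/dev/drbd0 /srv/vmdata gfs2 defaults,noatime,nodiratime 0 0"
untuned = "/dev/drbd0 /srv/vmdata gfs2 defaults 0 0"

print(missing_atime_options(tuned))    # []
print(missing_atime_options(untuned))  # ['noatime', 'nodiratime']
```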

Finally, add the following to the cluster.conf XML file anywhere in the section. By default, GFS2 limits the number of plocks per second to 100. This configuration removes the limits for dlm and gfs_controld.

Update the cluster.conf version number and push the changes to the cluster. After pushing the file, each node can be restarted individually without bringing down the entire cluster. The changes will be picked up by each node individually (tested with Ping Pong).
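The edit and version bump can also be scripted. The sketch below is one possible approach, not the procedure used here: it assumes the limits are lifted by setting plock_rate_limit="0" on dlm and gfs_controld elements directly under the root cluster element, and that the version lives in the root's config_version attribute. The cluster name in the sample is hypothetical.

```python
import xml.etree.ElementTree as ET


def remove_plock_limits(cluster_xml: str) -> str:
    """Set plock_rate_limit="0" for dlm and gfs_controld and bump config_version."""
    root = ET.fromstring(cluster_xml)
    for tag in ("dlm", "gfs_controld"):
        element = root.find(tag)
        if element is None:
            # Assumption: the element belongs directly under the root element.
            element = ET.SubElement(root, tag)
        element.set("plock_rate_limit", "0")
    # Every change to cluster.conf needs a higher version number.
    root.set("config_version", str(int(root.get("config_version")) + 1))
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal cluster.conf content.
sample = '<cluster name="kvmcluster" config_version="7"><clusternodes/></cluster>'
updated = remove_plock_limits(sample)
print(updated)
```

The script only rewrites the XML; pushing the new file to the other node still has to be done with the cluster tools.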

KSM

A newer feature that can be used with KVM is KSM (Kernel Samepage Merging, also described as Kernel Shared Memory)[6]. KSM allows KVM hosts to consolidate memory pages which are identical between guest virtual machines. On our HA pair running 15 RHEL5 virtual machines, we recouped 2GB of physical memory.

Below is a graph demonstrating memory usage when we implemented KSM. Notice that the memory usage dropped off by about 2 GB, which the Linux Kernel then began to slowly use for caching and buffers.
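For a rough numeric view of the same saving, mainline kernels report a pages_sharing counter under /sys/kernel/mm/ksm/. Whether the RHEL5 backport exposes the same file is not confirmed here, so the sketch below only does the arithmetic and assumes 4 KiB pages.

```python
def ksm_saved_bytes(pages_sharing: int, page_size: int = 4096) -> int:
    """Approximate memory saved by KSM: each sharing site avoids one page copy."""
    return pages_sharing * page_size


# 524288 pages of 4 KiB corresponds to the roughly 2 GB saving described above.
saved = ksm_saved_bytes(524288)
print(saved / 2**30)  # 2.0
```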

Tuning KSM is beyond the scope of this tutorial, but the following script, written by Red Hat, can be used to assist in controlling how aggressively the kernel will try to merge memory pages.[7]

Add to default start up

Final Notes

In our environment, we have decided to manually start cluster services and mount the DRBD partition, but in a different environment it may be useful to have a startup script that will automatically mount/unmount the DRBD volume. The unmount is especially useful if an administrator forgets to unmount before doing a reboot. The following script can be used as an example. Remember that it should unmount very early in the reboot process, before cluster services stop, or there will be problems.

Testing & Performance

  • Ping Pong: Small C program used to check plocks/sec performance in GFS2
  • Bonnie++: All disk I/O was generated with bonnie++. Bonnie++ also provides results on benchmarking
  • iostat: Disk I/O was also measured with iostat during the bonnie++ testing to confirm results
  • sar: Network traffic was monitored with sar to verify the network I/O was coherent with the disk I/O data

Analysis

When generating these performance numbers there is a desire to run the tests for months and months to get everything just perfect, but at some point, you must say that enough is enough. I ran these tests over about 3 days and eventually came to the conclusion that the performance was good enough given the price point. Since the main I/O performed by a virtual machine host is on large virtual disk files, I paid special attention to sequential read/write times in bonnie++ and these numbers seemed acceptable to me.

Disk I/O

GFS2

To verify that all of the GFS2 settings are correct, use the following small test program called Ping Pong. Before the above changes, the maximum number of locks displayed by Ping Pong was about 95/sec. After the above changes, performance should be similar to the following Ping Pong output.

Bonnie++

The following command was used to benchmark disk I/O.

Results from Bonnie++ showed that we could achieve sequential reads of almost 300MB/s and writes of almost 90MB/s. This is with GFS2 over DRBD connected through 3 bonded 1Gb Ethernet interfaces.

iostat

The following command was used to measure disk I/O. A 10 second interval was used to smooth out peaks/valleys.

iostat confirms bursts of over 200MB/s for 10 second periods.

ifstat

The following command was used to measure network I/O. A 10 second interval was used to smooth out peaks/valleys. sar provides a wealth of information, but we are using it for its network I/O capabilities.

ifstat verifies bursts in network I/O, confirming that all three Ethernet devices are, indeed, being used.

  1. http://www.drbd.org/home/what-is-drbd/
  2. http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Logical_Volume_Manager_Administration/index.html
  3. http://h18013.www1.hp.com/products/servers/management/smartstart/index.html
  4. http://media.netapp.com/documents/tr-3747.pdf
  5. http://www.vmware.com/pdf/esx3_partition_align.pdf
  6. http://www.ibm.com/developerworks/linux/library/l-kernel-shared-memory/index.html
  7. http://www.redhat.com/archives/fedora-virt/2009-September/msg00023.html

    30 Comments

  1. Could you please share how you actually manage such a cluster and how to behave in case of a disaster? Also interesting would be how KVM compares to vMotion in those disaster scenarios.

  2. How do you manage it at all?

  3. I must be reading this wrong. You keep mentioning VMware in use in the proposed solution… are you suggesting you use RHEL vm’s on VMware????

    • I don’t understand? This solution is RHEL on RHEL. I make comparisons to VMware since that is what so many people are comfortable with.

  4. may i ask what the bridge interface is actually required for? is it for the VMs on top of that filesystem, or do you actually need it for the DRBD+GFS stuff? Shouldn’t this work the same if the DRBD+GFS traffic just goes through the bonded interface over the regular address?

    thanks robert

    • The bridge interfaces are what allow the servers to connect to multiple VLANs. This gives the ability to put virtual machines into different VLANs based on need. It would work the same and wouldn’t matter if you weren’t crossing VLANs, I added it to give it feature parity with VMware server.

  5. Ciao, could you please share your cluster.conf? I’m working on a KVM cluster of VMs between 2 physical hosts, and I have some trouble 🙂 in particular with this: how can I monitor server hardware in order to perform VM migration? For example, what happens if nodeA (master) loses (broken) the network interface that provides VM service (in your case, the bridge interface)?

    Thank you very much !!

    • I am working on getting a cluster.conf file out there. When I do, I will ping you.

  6. Thanks for this guide.

    In the setup you describe, is there anything that can be done to prevent you from starting the same vm on more than one node? I inadvertently did this, and the vm now has a corrupted disk image.

    Also wondering if you’ve tried opennode for management?

    Thanks.

    • I had the same problem with our setup. I still need to research a way to prevent that from happening. I will post something if/when I figure something out. Haven’t had much time to work on it lately.

      I have not tried open manager yet

      • I’ve found that taking a snapshot of the vm is a good idea. (Convert image to qcow2 format and use virsh snapshot-create vmname) This allows you to fall back to a good system.

        • This is good on a low traffic server that doesn’t write much data, but can cause huge problems in production. Copy on write can be very expensive and will definitely limit the number of servers that you can consolidate on a platform.

  7. Hi, thanks for the information. It seems that GFS is not required in dual-node cluster, I didn’t test/confirm it, but I found the similar information at http://pve.proxmox.com/wiki/DRBD , no cluster filesystem was mentioned.

    What is your opinion? Thank you very much!

    • I am not familiar with what Proxmox is, but it appears that you pass it raw lvm groups and it somehow manages the access to the data. I suspect it uses some kind of clustered file system behind the scenes during this step:

      http://pve.proxmox.com/wiki/DRBD#Add_the_LVM_group_to_the_Proxmox_VE_storage_list_via_web_interface

      There are two main ways to handle live migration of a virtual machine, or really any real time data.
      1. Pass the application a clustered filesystem which does inode level lock management
      2. Pass the application a clustered volume which does block level lock management

      Since Proxmox is being handed a non-clustered physical volume, it is doing one of the two behind the scenes. This is completely reasonable, because GFS2 does the same thing in my set up. Also, I don’t see why clvm couldn’t do it at the block level, but I have never tried it. GFS2, CLVM, and almost any other clustered storage needs a lock manager, proxmox is probably handling that for you.

      • CLVM can simply use corosync started via aisexec to manage CLVM locking. There is no need to have a clustered filesystem behind it. The daemon handles locking and communicates with its peers via corosync. You need to have lvm.conf setup for clustering in order for this to work properly.

  8. Hi, I tested KVM live migration with dual primary DRBD today.
    I didn’t use any clustered filesystem, CLVM etc. I just used /dev/drbd0 as VM storage and it works.

  9. To manage VMs (not sure about clusters) running from KVM i just found this free web gui built in Cappuccino: http://archipelproject.org

  10. Hi,

    Are you still employing this configuration? Have you filled the gap of failover management tools? I was planning on implementing something very similar to what you’ve proposed. I did find this tool that might help with the failover management:
    http://code.google.com/p/ganeti/

    Thanks for the great article!
    Justin

    • There has been some work in this area at Red Hat. In RHEL6 you can use rgmanager to manage the start/stop of virtual machines as a resource, just like you would apache, or mysql. Also, in Ovirt/RHEV, failover is built into the system. At this point in time, I would very much consider using oVirt/RHEV or RHEL6 with the virtual machine resource. http://www.ovirt.org/

  11. Hi,

    I have a 2-node setup based on RHEL6 which is very similar to the one you describe but I’ve been having real problems with stalling during disk writes which can completely kill some VMs, as the stalls can last 30secs or more!

    I stripped everything right down to basics (no cluster, no DRBD) and got to the point where I just have Disk–>GFS2 and I still see the same write stalls. If I reformat with EXT4 there are no problems.

    Do you have any ideas what might be causing this or where to look? It’s driving me mad! I really want to use GFS2 but with its current performance it’s just unusable!

    Regards,

    Graham.

    • Graham, my apologies, I have not tested this in RHEL6, but I am thinking about building another proof of concept with CLVM/DRBD/KVM/RHEL6. This will remove the need for GFS2. When I have a chance to build one, or if you build one first, please let me know how it goes (what works/doesn’t work), I would love to publish.

      • Hi “admin”, thanks for your reply.

        I’ve become so frustrated with GFS2 that I’ve ditched it, at least for now. DRBD seems to work pretty well but without GFS2 it’s of limited use to me and still seems to impose some performance hits.

        I’m taking a different tack now and seeing if I can use ZFS2 across the two nodes to give better performance whilst keeping the node redundancy. I’ll let you know how I get on.

      • Well it all seems to be going ok at the moment. I’ve used iSCSI to export two drives from one machine of the pair and used ZFS to stripe and mirror across all 4 drives of the two nodes.

        Performance is substantially higher than it was using DRBD+GFS2. For example, when using DRBD+GFS2 I was getting around 20MB/s of write performance with regular 30sec+ stalls – these were the real killer.

        With ZFS I’m getting up to 180MB/s with no stalls at all, not to mention all the other great features of ZFS.

        I had considered using AoE rather than iSCSI as it’s much lighter weight and so potentially quicker but I just couldn’t find any AoE target daemons that have seen any development in the last few years – shame really.

  12. Hi,

    Your article is nice and superb to understand it. I have a query regarding fencing device.

    As per me, fencing device does remove server from cluster when it has h/w or any problem related to run cluster services in proper manner.

    Let me correct , if i am wrong.

    Can we use any customized script to do fencing instead of any specialized h/w.

    Thanks,Ben

    • You can easily take the fence manual code and create a script to, for example, stop a virtual machine to fence.
