gluster


There's little info on the web about how to monitor a glusterfs brick with nagios (or any other tool). There is a hint of a gluster utility script – http://www.mail-archive.com/gluster-devel@nongnu.org/msg06928.html – but it's not available in the source package.

It also needed some updates to make it suitable for nagios. I hope the gluster devs take my version of the script, along with these instructions, and add them to the main gluster source…

These instructions are for a default nagios3 installation on ubuntu karmic with gluster 3.0.3 compiled from source so you may need to edit this for your site.

Download this script (glfs-health.sh) and store it somewhere useful:

wget http://www.sirgroane.net/downloads/glfs-health.sh --output-document=/usr/local/bin/glfs-health.sh
chmod u+x /usr/local/bin/glfs-health.sh

Assuming a simple TCP gluster install we can set up a nagios command like this:

echo '
define command{
         command_name    check_gluster
         command_line    sudo  /usr/local/bin/glfs-health.sh  $HOSTADDRESS$ 6996 tcp $ARG1$
         }
' >> /etc/nagios-plugins/config/gluster.cfg

Notice the "sudo" in the command? This is because glfs-health.sh has to run as root. To enable this we have to add a line to /etc/sudoers:

echo "nagios  ALL=(ALL)  NOPASSWD: /usr/local/bin/glfs-health.sh" >> /etc/sudoers

Now you can construct a nagios service to monitor the bricks. For example: you've created a nagios hostgroup "gluster-bricks" with all the bricks in and they all export a volume "export_data":

define service {
        hostgroup_name                  gluster-bricks
        service_description             Glusterfsd
        check_command                   check_gluster!export_data
        use                             generic-service
        notification_interval           0
}

Restart nagios and you're done.

This work was supported by:

The gluster installation described in a previous post is being used for a webserver cluster on Amazon EC2 using two storage bricks serving a whole bunch of "client" webservers. I tuned the system with "end-to-end" performance testing using a website load tester rather than worry about contrived disk-access tests. That, and helpful comments from various devs on the user list, lead to the following conclusions.

There's a large collection of "performance translators" in gluster used for improving speed. Let's have a look at the ones I didn't use and why:

  • performance/read-ahead – Probably useful if your server has physical disks as it will minimise disk seeks. But amazon EBS storage is no doubt a layered storage system with its own caching. So this translator doesn't offer any speed increase and just gets in the way.
  • performance/write-behind – Same issues as read-ahead. Plus this translator seems to have problems if you try to read a file quickly after writing it.
  • performance/stat-prefetch – Pre-fetches and caches file stat information when a directory is read. Speeds up operations like ls -l but apache never needs that so it just gets in the way.
  • performance/quick-read – Uses a feature of the gluster protocol so the whole of a (small) file can be fetched during the lookup phase so opens and reads are not needed. Also caches the file data. Unfortunately it has a memory-leak bug that may be fixed in v3.0.5. Until then it can't really be used.

These are the performance filters I did use

  • performance/io-cache – Caches read file data in 128K pages for 1-60 seconds. The page size and maximum cache timeout can be changed in the source. Should only be used in volumes where files are read much more often than they are written because the translator just invalidates a whole 128K page when any part of it is written. This is perfect for website pages though.
  • performance/io-threads – Doesn't fork extra processes, but does configure a thread pool that allows faster operations to leap-frog blocked ones.

The translator stack I came up with has this layout:

    APACHE
       |
    performance/io-cache
       |
    performance/io-threads
       |
    cluster/replicate
       |
    protocol/client
      | |
    AMAZON NETWORK
      | |
    protocol/server
       |
    performance/io-threads
       |
    features/locks
       |
    storage/posix
       |
    ext3/xfs/whatever
      | |
    AMAZON EBS STORAGE

The philosophy is

  1. Only use the translators that you can prove actually provide a benefit. Translators are cheap but still get in the way. The gluster volgen command provides a good start for a general server but the volume config can be tweaked more for webservers.
  2. Caching first. It's quick and should be serving most of the files.
  3. Lots of threads on the client side. Apache is multi-threaded and Amazon EC2 servers are multi-core. Anything we can do to help concurrency to the bricks is a good thing.
  4. Threads on the server side too. I've read some articles that say this is a waste. But, in my experience, a large rsync on one client for example can really hold up accesses made from other clients unless io-threads is configured on the server side too. Also, EBSs never "fail" but occasionally they do exhibit huge iowait spikes of 100s of ms. In these circumstances io-threads on the server side mean that a minimum of the clients are kept waiting.
  5. Don't bother caching on the server side. The kernel will already be caching the filesystem underneath gluster.

 

The best tip though is to understand the whole architecture of your system and concentrate your optimisation efforts where they will have the most benefit. Seems obvious once it's said, but it takes some out-of-the-box / holistic / whatever thinking to actually do it.

In the case of an Apache web service, moving from single-server nfs to replicated gluster initially caused pages to take an extra 500ms or much more to load! This was almost a disaster – glusterfs is tuned for big files rather than small… In this case the solution was simple: migrate all .htaccess files into <Directory> directives in the Apache config, and specify AllowOverride None. This prevented Apache checking directories for .htaccess files and the overhead of gluster was greatly reduced: enough so the sites feel just as responsive as before. When the gluster devs fix the quick-read bug in v3.0.5 then the sites will be even quicker.

This work was supported by:

A lot of people are using Amazon EC2 to build web site clusters. The EBS storage provided is quite reliable, but you still really need a clustered file-server to reliably present files to the servers.

Unfortunately AWS doesn't support floating virtual IPs so the normal solutions of using nfs servers on a virtual IP managed by heartbeat or something is just not available. There is a cookbook for a Heath Robinson approach using vtunnel etc., but it has several problems not least its complexity.

Fortunately there's glusterfs. Gluster is mainly built for very large scale, peta-byte, storage problems – but it has features that make glusterfs perfect as a distributed file system on amazon EC2:

  • No extra meta-data server that would also need clustering
  • Highly configurable, with a "stacked filter" architecture
  • Not tied to any OS or kernel modules (except fuse)
  • Open Source

I use ubuntu on EC2 so the rest of this article will focus on that, but gluster can be used with any OS that has a reliable fuse module.

I'll show how to create a system with 2 file severs (known as "bricks") in a mirrored cluster with lots of clients. All gluster config will all be kept centrally on the bricks.

At the time of writing the ubuntu packages are still in the 2.* branch (though v3.0.2 of gluster will be packaged into Ubuntu 10.4 "Lucid Lynx") so I'll show how to compile from source (other installation docs can be found on the gluster wiki but it tends to be a bit out of date).

To compile version 3.0.3 from the source at http://ftp.gluster.com/pub/gluster/glusterfs

apt-get update
apt-get -y install gcc flex bison
mkdir /mnt/kits
cd /mnt/kits 

wget http://ftp.gluster.com/pub/gluster/glusterfs/3.0/3.0.3/glusterfs-3.0.3.tar.gz
tar fxz glusterfs-3.0.3.tar.gz
cd glusterfs-3.0.3
./configure && make && make install
ldconfig

Clean up the compilers:

apt-get -y remove gcc flex bison
apt-get autoremove

This is done on both the servers and clients as the codebase is the same for both, but on the client we should prevent the server from starting by removing the init scripts:

# only on the clients
rm /etc/init.d/glusterfsd
rm /etc/rc?.d/*glusterfsd

It's also useful to put the logs in the "right" place by default on all boxes:

[ -d /usr/local/var/log/glusterfs ] && mv /usr/local/var/log/glusterfs /var/log || mkdir /var/log/glusterfs
ln -s /var/log/glusterfs /usr/local/var/log/glusterfs

And clear all config:

rm /etc/glusterfs/* 

Ok, that's all the software installed, now to make it work.

As I said above, gluster is configured by creating a set of "volumes" out of a stack of "translators".

For the server side (the bricks) we'll use the translators:

  • storage/posix
  • features/locks
  • performance/io-threads
  • protocol/server

and for the clients:

  • protocol/client
  • cluster/replicate
  • performance/io-threads
  • performance/io-cache

(in gluster trees the root is at the bottom).

I'll assume you've configured an EBS partition of the same size on both bricks and mounted them as /gfs/web/sites/export.

To export the storage directory, create a file /etc/glusterfs/glusterfsd.vol on both bricks containing:

volume dir_web_sites
  type storage/posix
  option directory /gfs/web/sites/export
end-volume

volume lock_web_sites
    type features/locks
    subvolumes dir_web_sites
end-volume

volume export_web_sites
  type performance/io-threads
  option thread-count 64  # default is 1
  subvolumes lock_web_sites
end-volume

volume server-tcp
    type protocol/server
    option transport-type tcp
    option transport.socket.nodelay on 

    option auth.addr.export_web_sites.allow *
    option volume-filename.web_sites /etc/glusterfs/web_sites.vol

    subvolumes export_web_sites
end-volume

NB. the IP authentication line  option auth.addr.export_web_sites.allow *  is safe on EC2 as you'll be using the EC2 security zones to prevent others from accessing your bricks.

Create another file /etc/glusterfs/web_sites.vol on both bricks containing the following (replace brick1.my.domain and brick2.my.domain with the hostnames of your bricks):

volume brick1_com_web_sites
    type protocol/client
    option transport-type tcp
    option transport.socket.nodelay on
    option remote-host brick1.my.domain
    option remote-subvolume export_web_sites
end-volume

volume brick2_com_web_sites
    type protocol/client
    option transport-type tcp
    option transport.socket.nodelay on
    option remote-host brick2.my.domain
    option remote-subvolume export_web_sites
end-volume

volume mirror_web_sites
    type cluster/replicate
    subvolumes brick1_web_sites brick2_com_web_sites
end-volume

volume iothreads_web_sites
  type performance/io-threads
  option thread-count 64  # default is 1
  subvolumes mirror_web_sites
end-volume

volume iocache_web_sites
  type performance/io-cache
  option cache-size 512MB               # default is 32MB
  option cache-timeout 60                # default is 1 second
  subvolumes iothreads_web_sites
end-volume

and restart glusterfs on both bricks:

/etc/init.d/glusterfsd restart

Check /var/log/glusterfs/etc-glusterfs-glusterfsd.vol.log for errors.

On the clients edit /etc/fstab to mount the gluster volume:

echo "brick1.my.domain:web_sites /web/sites glusterfs backupvolfile-server=brick2.my.domain,direct-io-mode=disable,noatime 0 0" >> /etc/fstab 

Then create the mount point and mount the partition:

mkdir -p /web/sites
mount /web/sites 

Check /var/log/glusterfs/web-sites.log for errors.

And you're done!

The output of "df -h" should be something like this (though your sizes will be different).

bash# df -h
Filesystem Size Used Avail Use% Mounted on
...
brick1.my.domain 40G 39G 20M 0% /web/sites

In another post I'll pontificate on tuning gluster performance, why I chose this particular set of filters and what the options mean.

This work was supported by: