High Availability STorage in FreeBSD

Speaker: Pawel Jakub Dawidek

1. High Availability

Definition: services available at all times.

How hard can it be?

1.1. Challenge

What can possibly go wrong?

Most HA cluster failures come from the HA implementations themselves…

2. HAST

2.1. Features

HAST replicates data over the network. It works at the block level, which makes it FS-agnostic: from the point of view of the rest of the system, it is just another GEOM provider. HAST uses TCP, and is intended to be used over local, low-latency networks.

HAST:

  • recovers quickly, by maintaining a bitmap of extents already synchronised;
  • detects split-brain conditions;
  • does not decide its own role: the sysadmin, not HAST, decides which node is the master.

Currently, only two-node operation is supported: master + slave. Applications communicate with the primary node.

2.2. Split brain

A split brain happens when communication between the nodes breaks. Both nodes think they are the primary, and both accept write operations. The file system running on a HAST device cannot handle that.

During a split brain, both nodes use their local storage. When communication is restored, it is up to the sysadmin to decide which data she is interested in.
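
If the data on, say, hostb is to be discarded, one recovery path (sketched here with the resource name used later in these notes, data) is to re-initialise that node's local metadata and demote it, which forces a full resynchronisation from the surviving master:

hastctl role init data
hastctl create data
hastctl role secondary data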

HAST includes hooks that allow the sysadmin to handle certain situations. Typically, after a split brain (that is, once communication between the nodes is restored and the split brain itself is detected), the nodes can send an alert to the sysadmin so she can recover and decide which data is correct.
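
The hook is declared with an exec line in hast.conf and, if I read hast.conf(5) correctly, is invoked with the event name (split-brain among them) and the resource name as arguments. The script below is a made-up illustration, as are its path and the mail recipient:

#!/bin/sh
# Hypothetical hook, declared in hast.conf as: exec /usr/local/sbin/hast-events.sh
# hastd calls it with the event name first, then the resource name.
event="$1"
resource="$2"

case "${event}" in
split-brain)
    # Alert the sysadmin; she will decide which copy of the data survives.
    echo "split brain detected on HAST resource ${resource}" |
        mail -s "HAST split brain on $(hostname)" root
    ;;
esac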

3. Write operations

The data comes from the kernel and is passed to the userland hastd daemon.

3.1. Naïve mode

Write data locally, send data to the secondary node. What could possibly go wrong?

3.2. Super slow mode

When we are actually interested in our data, we want to be sure it is written to the slave's disks…

  • mark extent dirty;
  • write data locally;
  • send a copy of the data to the slave;
  • wait for the slave's ack;
  • mark extent clean;
  • report success to the application.

Each write operation from the application results in three writes: marking extents as clean or dirty actually implies writing some metadata to the disk (we wouldn't want HAST to forget which blocks need synchronisation after a node crash, would we?).

3.3. fullsync (available now)

This is the default mode.

  • mark extent dirty;
  • write data locally;
  • send data to slave;
  • secondary acks;
  • do not mark extent as clean;
  • report success.

Not marking extents as clean speeds things up somewhat, removing one write operation (mark dirty and write metadata, then write data; as opposed to mark dirty and write metadata, write data, then mark clean and write metadata). In addition, thanks to the data locality of most applications, the initial mark-as-dirty is quite often a no-op. The cost is a slightly more expensive synchronisation after a node failure, because some extents are marked dirty while they are, in fact, clean (from the point of view of HAST).

Actually, only a pool of extents is kept dirty, with the oldest extents marked as clean when the (fixed-size) pool becomes full.

3.4. memsync (not merged yet)

  • mark extent dirty;
  • write locally;
  • send to secondary;
  • secondary acks reception of the data;
  • report success;
  • secondary acks the write;
  • do not mark extent as clean.

There is a small time window during which, if the slave failed at just the wrong time, the master would not know whether the slave actually wrote the data (it only acked its reception). But, per the remark above, the extents marked dirty would have been synchronised anyway.

This mode needs some more love; it is almost ready, and should be available in a FreeBSD tree near you soon.

3.5. async (almost completed)

  • mark extent as dirty;
  • write locally;
  • send to secondary;
  • report success;
  • secondary acks write;
  • do not mark as clean.

To be used when the latency between the nodes is too high for efficient synchronisation. The time window during which we don't know whether the secondary received our data is larger than with memsync.

4. Configuration

resource data {
  on hosta {
    local /dev/ada0
    remote 10.2
  }
  on hostb {
    local /dev/ada0
    remote 10.1
  }
}

The same configuration file can be used on both nodes: each node picks out its own on section, and remote names the address of the other node (assuming here that hosta is 10.1 and hostb is 10.2).
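
In practice this just means keeping /etc/hast.conf (hastd's default configuration file) identical on both hosts, for instance:

scp /etc/hast.conf hostb:/etc/hast.conf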

5. Startup

On hostb:

hastctl create data
hastd
hastctl role secondary data

On hosta: idem, with s/secondary/primary/.
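
Spelled out, the primary side runs:

hastctl create data
hastd
hastctl role primary data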

A new device node is created in /dev/hast/ (here, /dev/hast/data), which can be used with newfs(8), nfsd(8), whatever.
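
For instance, to put a UFS file system on it (the mount point is illustrative):

newfs -U /dev/hast/data
mkdir -p /export
mount /dev/hast/data /export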

In order to do the actual switchover from master to slave, something like net/ucarp with hastctl(8) should be used. Example configuration scripts are included somewhere in /usr/share/examples/.
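
The glue could look roughly like this; every address, the vhid, the password, the mount point and the script paths are made up for illustration, and the shipped examples are more thorough:

# Run on both nodes (-s is the node's own address); ucarp elects the
# holder of the virtual IP 10.0.0.254 and runs a script on role changes.
ucarp -i em0 -s 10.0.0.1 -v 1 -p secret -a 10.0.0.254 \
    -u /usr/local/sbin/hast-up.sh -d /usr/local/sbin/hast-down.sh &

# hast-up.sh, run when this node becomes master:
#     hastctl role primary data
#     fsck -p -y -t ufs /dev/hast/data
#     mount /dev/hast/data /export && service nfsd onestart
# hast-down.sh, run when it loses the election:
#     umount -f /export
#     hastctl role secondary data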

6. Performance

Benchmark: using small blocks, no parallelism (the worst case for HAST). I could not note down the exact test; it was a bonnie-type benchmark.

6.1. With the secondary disconnected

  • raw disk: 17k
  • async: 11k
  • memsync, fullsync: 10.6k

6.2. With the secondary over lo0

  • fullsync: 6k

I did not have time to note more data points; I seem to remember that memsync did better.

7. Demonstration

Pawel created a test setup on two disks, using NFS to export the HAST device. net/ucarp is used to control the switchover from master to slave. Pawel started a video. He then killed the master. We saw the video stopping, traces from ucarp detecting the failure of the master, switching to the slave, and the video resuming. The video stopped for about a second.

I didn't have time to note some of the essential parameters of the demo. Oh well.

8. Impressions

I (Fred) was slightly disappointed to see that HAST is only about replication over the network, and that it doesn't even handle exporting the /dev/hast/ nodes to clients or switching from slave to master. The sysadmin needs to configure NFS or iSCSI to make the data available to the clients, and needs something like ucarp or heartbeat for the actual HA. I get the feeling that this will result in a brittle setup with too many moving parts for what is logically one piece of functionality, namely a network-accessible RAID1 setup. But that's only that, a feeling; I haven't had the opportunity to actually try this (though the ResEl would need something like that…).

I was surprised to see that HAST uses TCP. It means that a lost or delayed segment will delay all the following segments until the loss is recovered. I would have expected UDP with hand-rolled acknowledgements (or raw Ethernet? but that would have made HAST non-routable). I guess not having to re-invent acknowledgements was more important to Pawel than performance.

HAST seems simpler than DRBD, and doesn't pretend to do as much (even though a number of DRBD's bullet points seem to be "we export a block device, and you can use your usual tools to build encryption or LVM on top of it", which is also true of HAST). I guess this is just another side of the first point.

I find that "only" a 3× slow-down is quite good, especially considering that Pawel placed HAST outside of its comfort zone. The honesty about those results was quite refreshing, too.

Author: Frédéric Perrin

Date: Sunday 16 October 2011, modified Thursday 27 October 2011

Unless otherwise noted, the texts on this site are licensed under Creative Commons BY-SA.
