The InterMezzo High Availability File System

Peter J. Braam braam@cs.cmu.edu and Rob Simmonds simmonds@stelias.com

v1.2, Jan 5, 2000


This document explains the configuration and operation of the InterMezzo file system on Linux.

1. Disclaimer and License

InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.

InterMezzo is covered by the GPL. The GPL describes what warranties, if any, are made to you.

Copyright on InterMezzo is held by Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory and Red Hat Inc.

2. Introduction

2.1 What is InterMezzo?

InterMezzo is a file system that keeps replicas of folder collections, a.k.a. volumes, residing on multiple computers in sync. The computers that express an interest in a replica are called the replicators of the volume. Each volume has a single InterMezzo server, which plays an organizing role in exchanging updates with the replicators.

InterMezzo supports disconnected operation, i.e. it maintains a journal to remember all updates that need to be forwarded when a failed communication channel comes back. This is best-effort synchronization, since conflicting updates are possible during disconnected operation.

InterMezzo uses an existing disk file system, in practice Ext2, as the storage location for all data. When an Ext2 file system is mounted as file system type InterMezzo instead of Ext2, the InterMezzo software starts monitoring all access to the file system. It manages the journals of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.

3. Using InterMezzo

Here we describe how to set up a server and clients.

3.1 Installing the software

  1. Build and install the presto kernel code:
    1. load the module: modprobe presto
    2. add a module alias so the module loads on demand:
       echo "alias char-major-185 presto" >> /etc/conf.modules
  2. Install the Perl modules POE, Time::HiRes, Storable and
     Term::ReadLine::Gnu (see the CPAN sketch below)
  3. Install Lento
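
If these Perl modules are not packaged for your distribution, they can usually be fetched from CPAN. A minimal sketch for step 2, assuming the CPAN shell has been configured on your system:

perl -MCPAN -e 'install("POE")'
perl -MCPAN -e 'install("Time::HiRes")'
perl -MCPAN -e 'install("Storable")'
perl -MCPAN -e 'install("Term::ReadLine::Gnu")'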

3.2 Config files

Your default config directory is /etc/intermezzo.

Here you should place three files:

/etc/intermezzo/sysid

Holds the name of your system. At present this name should resolve to an IP address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. In its simplest form the sysid file on each host would just contain the host's name, i.e., on muskox the file would contain:

muskox

You may add a second field to the sysid file, which applies to clients only. It is a bind address for the client, indicating that connections this client initiates should originate from the given IP address. You normally don't need this, but it is useful when you run a client and a server on a single machine, since it allows you to clearly distinguish the server endpoint of a connection from the client endpoint. For clientA with IP address 192.168.0.20 the sysid file would contain:

clientA 192.168.0.20

/etc/intermezzo/serverdb

Holds a database of servers. The server structure is a Perl hash, as follows:

{ 
  muskox => { 
    ipaddr => "192.168.0.3", 
    port => 2222 , 
    bindaddr => "192.168.0.3"
  }
};

The above contains a single server description for the server muskox with IP address "192.168.0.3". The port and bindaddr fields are optional; the default port is 2222. Without a bindaddr the server listens for requests on all interfaces; with it, the server listens only on the bindaddr address.

/etc/intermezzo/voldb

Holds a database of volumes. The volume structure is a Perl hash, as follows:

{
  volname => {
    servername => "muskox", 
    replicators => ['clientA', 'clientB' ] 
  }
};

The above contains a single volume description for a volume called volname on server muskox. The volume is replicated on hosts clientA and clientB.
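
A voldb can describe several volumes at once, one hash entry per volume. A purely illustrative sketch (the volume names home and www are made up):

{
  home => {
    servername => "muskox",
    replicators => ['clientA']
  },
  www => {
    servername => "muskox",
    replicators => ['clientA', 'clientB']
  }
};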

/etc/fstab

To ease the mounting of InterMezzo volumes add the following to the /etc/fstab file:

/tmp/cache  /izo0      InterMezzo loop,volume=volname,prestodev=/dev/intermezzo0,\
                       mtpt=/izo0,noauto 0 0

where /tmp/cache is a file associated with a loop device, /izo0 is the mount point (a directory), volname is the name of the volume and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section. The kernel must be configured with loopback device support enabled to do this.
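
You can check for loopback support before setting up the cache file. A quick sketch (loop is the standard module name, but your kernel may have the driver built in):

# loop appears among the block devices when the driver is present
grep loop /proc/devices
# if the driver is built as a module, load it first
modprobe loop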

Let's consider three common cases, for each we will give the config files and the correct invocations to start the server/cache manager.

One client and one server (typical use: replicate a WWW server):

In this case we assume that the host muskox is serving the volume shared and the host clientA is replicating the volume. The following files are placed on both muskox and clientA.

/etc/intermezzo/serverdb

{
  muskox => { ipaddr => "192.168.0.3" }
};

/etc/intermezzo/voldb

{
  shared => {  
    servername => "muskox",
    replicators => ['clientA']
  }
};

/etc/intermezzo/sysid

On muskox this contains:

muskox

On clientA this contains:

clientA

/etc/fstab

The following line is added on both muskox and clientA:

/tmp/fs0  /izo0      InterMezzo loop,volume=shared,prestodev=/dev/intermezzo0,\
                     mtpt=/izo0,noauto 0 0

/tmp/fs0

This file is constructed using the following commands:

# create a 10 MB file of zeros to hold the cache file system
dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
# attach it to a loop device and put an Ext2 file system on it
losetup /dev/loop0 /tmp/fs0
mke2fs /dev/loop0
# detach; mount will reattach it via the loop option in /etc/fstab
losetup -d /dev/loop0

/dev/intermezzo0

This is created using the following commands:

# character device, major 185 (presto), minor 0
mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0

/etc/conf.modules

Add the line:

"alias char-major-185 presto" >> /etc/conf.modules

Before starting lento, mount the cache:

mkdir /izo0; mount /izo0

Now lento can be started on both muskox and clientA by typing

./lento.pl

in the directory containing the lento.pl file.
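
To verify that the setup works, make an update on one host and look for it on the other. A minimal sketch, assuming both lentos are running and connected (the file name testfile is made up):

# on muskox
echo hello > /izo0/testfile

# on clientA, a moment later
cat /izo0/testfile
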
Two clients and one server (typical use: laptop - desktop syncing):

/etc/intermezzo/serverdb

This can be the same as in the one client and one server case above.

/etc/intermezzo/voldb

{
  shared => {  
    servername => "muskox",
    replicators => ['clientA', 'clientB']
  } 
}; 

This is the same as in the first example, but clientB is added to the replicators list.

/etc/intermezzo/sysid

This is the same as in the first example for muskox and clientA, and on clientB contains the following:

clientB

/etc/fstab

This is the same as in the one client and one server case above.

One client and one server on one host (typical use: testing InterMezzo):

Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use IP aliasing, which allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100 are available. In:

/etc/hosts

Add the line:

192.168.0.100   muskoxA1        

Then add the IP alias by typing:

    ifconfig eth0:1 muskoxA1 up

Then create two files containing the following:

/etc/intermezzo/sysid

muskox  192.168.0.3

/etc/intermezzo/muskoxA1

muskoxA1 192.168.0.100

The latter file will act as a sysid file for the lento running on the aliased IP address.

To run the second lento, a second loopback cache and presto device are required. These are constructed as follows:

# create and format a second 10 MB loopback cache file
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
losetup /dev/loop1 /tmp/fs1
mke2fs /dev/loop1
losetup -d /dev/loop1

# second presto device: character device, major 185, minor 1
mknod /dev/intermezzo1 c 185 1
chmod 700 /dev/intermezzo1

/etc/fstab

Now two entries are needed:

/tmp/fs0  /izo0      InterMezzo loop,volume=shared,prestodev=/dev/intermezzo0,\
                     mtpt=/izo0,noauto 0 0
/tmp/fs1  /izo1      InterMezzo loop,volume=shared,prestodev=/dev/intermezzo1,\
                     mtpt=/izo1,noauto 0 0

Then mount the two InterMezzo directories:

mkdir /izo0; mount /izo0
mkdir /izo1; mount /izo1

The lento acting as the server can be started as before:

./lento.pl

The lento acting as the replicator has to be told which sysid file and which presto device to use. It is started as follows:

./lento.pl --sysid=muskoxA1 --prestodev=/dev/intermezzo1
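
With both lentos running, an update made under one mount point should propagate to the other. A minimal sketch (the file name testfile is made up):

# write through the server's mount point
echo hello > /izo0/testfile

# read it back through the replicator's mount point
cat /izo1/testfile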

4. How does InterMezzo work?

InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.

InterMezzo's mechanisms are very different from those of Coda. We employ very different kernel code which maintains the cache in another file system (typically Ext2/Ext3/Reiser). The kernel code also uses the journaling support in the kernel to make transactional updates (with lazy commits) to the file space and update journals.

4.1 InterMezzo's protocol

The primary reason for keeping it simple is that we wanted to use it as soon as possible. It is also hoped that it will not be too confusing to the end user, as is frequently the case with advanced network file systems.

InterMezzo divides the file space up into volumes. Typically a volume is much larger than a directory and smaller than a full disk partition. Good examples of volumes might be /usr or someone's home directory.

The typical event sequence for a volume in InterMezzo is as follows:

Creation

the volume is created on the server, possibly populated, possibly empty. The file server and the kernel on the server are now aware of the volume.

Client needs the volume

A client which needs the volume is told about the volume and its server. The client is added to the server's list of replicators of the volume, and the server is made a replicator of the cache on the client. The volume is mounted on the client, and the client cache manager and kernel know about it.

A replicator has the following state:

peer

replicator describes replication between this system and peer.

volname

replicator describes replication of volume volname

next_to_expect

the next update record to expect from the peer in the peer's numbering sequence

next_to_send

the next update record to send to the peer in this system's numbering sequence
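
In the spirit of the Perl hashes shown earlier, the state of one replicator can be pictured as follows (a purely illustrative sketch; the field values are made up):

{
  peer => "clientA",
  volname => "shared",
  next_to_expect => 42,
  next_to_send => 17
};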

Syncing up

Before the volume is usable on the client, it needs to sync up with the server. Syncing up is complicated by the fact that the server copy of the volume may change during the synchronization. Syncing up is done as follows:

  1. rsync the server volume to the client
  2. atomically suspend update reintegration on the server and add the new client to the list for which updates on the volume need forwarding.
  3. rsync once more (this should be quicker; see the sketch after this list)
  4. sync client disk
  5. mark the replicator as synced on the client and server. Inform the client of the replicator's next_to_send number, which becomes the next_to_expect number on the client
  6. activate update forwarding and reintegration
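
Conceptually, the rsync steps copy the server's volume tree into the client's cache. A hypothetical illustration of what one pass amounts to (lento drives this internally; the paths shown are made up):

rsync -a muskox:/izo0/ /izo0/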

Normal Operation

The volume on the replicator is now in a state of normal operation, which we describe next.

Under normal operation there is a collection of replicators that is connected to the server, and some replicators are in disconnected operation (the latter can also happen when the server fails).

The kernel and server/cache manager keep the log of updates to the volume in sync with the contents of the file system through journaling, i.e. under all circumstances any update applied to the file system is also entered in the update database. The systems also transactionally update the next_to_expect counters as updates are entered in the cache. (All transactions have lazy commits.)

The following rules govern the operation:

Permits

Before an update can be made to the file system, a permit is acquired. A permit acquisition consists of:

  1. Notifying the server of the request
  2. The server revokes the permit from the current holder. The current permit holder will reintegrate its changes to the server before giving up its permit.
  3. The server propagates the changes to the other synced replicators, and then grants the permit to the requesting client.

Read access

Read access on a synced volume is unrestricted.

Disconnections

When a client or server notices that a peer is no longer available, it does the following:

  1. The client grants itself a permit for the volume
  2. The server notices that a client has gone away and if that client held the permit on a volume it grants itself the permit.

Reconnection

The reconnection protocol is the most complicated:

  1. When a client rediscovers a server, it binds a connection to the server.
  2. The client discards its permits for all volumes served by the peer.
  3. The server forwards its updates on the volumes to the client. The client tries to apply these updates, but verifies that the versions to which the updates apply are correct. (If not, the client declares a conflict, the handling of which we postpone.)
  4. The client adjusts its update journal so that it does not conflict with the current state propagated by the server.
  5. When no records are left to be reintegrated on the client, the client sends its update journal to the server.

Normal operation can now resume.