Peter J. Braam braam@cs.cmu.edu and Rob Simmonds simmonds@stelias.com
InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.
InterMezzo is covered by the GPL. The GPL describes the warranties made to you.
Copyright on InterMezzo is held by Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory and Red Hat Inc.
InterMezzo is a file system that keeps replicas of collections of folders, a.k.a. volumes, residing on multiple computers in sync. The computers that express an interest in a replica are called the replicators of the volume. InterMezzo has one server per volume, which plays an organizing role in exchanging updates with the replicators.
InterMezzo supports disconnected operation, i.e. it maintains a journal to remember all updates that need to be forwarded when a failed communication channel comes back. This is best-effort synchronization, since conflicting updates are possible during disconnected operation.
InterMezzo uses an existing disk file system, in practice Ext2, as the storage location for all data. When an Ext2 file system is mounted as file system type InterMezzo instead of Ext2, the InterMezzo software starts monitoring all access to the file system. It manages the journals of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.
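For illustration only (the file, device, and mount point names here are placeholders, and the options simply mirror the fstab entry shown later in this document), mounting an Ext2 image as an InterMezzo file system might look like:
mount -t InterMezzo -o loop,volume=volname,prestodev=/dev/intermezzo0,mtpt=/izo0 /tmp/cache /izo0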
Here we describe how to set up a server and clients.
Your default config directory is /etc/intermezzo. Here you should place three files:
The first is the sysid file, which holds the name of your system. At present this name should resolve to an IP address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. In its simplest form the sysid file on each host just contains the host's name, i.e., on muskox the file would contain:
muskox
You may add a second field to the sysid file, which applies to clients only. It is a bind address for the client, indicating that connections initiated by this client originate from the IP address mentioned. You normally don't need this, but it is useful when you run a client and a server on a single machine, since it allows you to clearly distinguish the server endpoint of the connection from that of the client. For clientA with IP address 192.168.0.20 the sysid file would contain:
clientA 192.168.0.20
The second file holds a database of servers. The server database is a Perl hash, as follows:
{
muskox => {
ipaddr => "192.168.0.3",
port => 2222,
bindaddr => "192.168.0.3"
}
};
The above contains a single server description for the server muskox with IP address 192.168.0.3. The port and bindaddr entries are optional; the default port is 2222. Without a bindaddr the server listens on all interfaces for requests; with it, the server only listens on the bindaddr address.
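A database with more than one server is simply a hash with more keys. The following sketch is only an illustration of the layout; the second server, caribou, and its address are invented for this example:
{
    muskox => { ipaddr => "192.168.0.3" },
    caribou => { ipaddr => "192.168.0.4", port => 2223 }
};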
The third file holds a database of volumes. The volume database is a Perl hash, as follows:
{
volname => {
servername => "muskox",
replicators => ['clientA', 'clientB' ]
}
};
The above contains a single volume description for a volume called volname on the server muskox. The volume is replicated on the hosts clientA and clientB.
To ease the mounting of InterMezzo volumes, add the following to the /etc/fstab file:
/tmp/cache /izo0 InterMezzo loop,volume=volname,prestodev=/dev/intermezzo0,\
mtpt=/izo0,noauto 0 0
where /tmp/cache is a file associated with a loop device, /izo0 is a mount point (a directory), volname is the name of the volume, and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section.
The kernel must be configured with loopback device support enabled to do this.
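As a quick check (assuming a modular kernel; the exact output may vary), you can load the loop driver and verify that it has registered before proceeding:
modprobe loop
grep loop /proc/devices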
Let's consider three common cases; for each we will give the config files and the correct invocations to start the server/cache manager.
In this case we assume that the host muskox is serving the volume shared and the host clientA is replicating the volume. The following files are placed on both muskox and clientA. The server database contains:
{
muskox => { ipaddr => "192.168.0.3" }
};
The volume database contains:
{
shared => {
servername => "muskox",
replicators => ['clientA']
}
};
The sysid file on muskox contains:
muskox
The sysid file on clientA contains:
clientA
The following line is added to /etc/fstab on both muskox and clientA:
/tmp/fs0 /izo0 InterMezzo loop,volume=shared,prestodev=/dev/intermezzo0,\
mtpt=/izo0,noauto 0 0
The cache file /tmp/fs0 is constructed using the following commands:
dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
losetup /dev/loop0 /tmp/fs0
mke2fs /dev/loop0
losetup -d /dev/loop0
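Optionally, you can sanity-check the freshly created file system image by printing its superblock (dumpe2fs accepts an image file as well as a device):
dumpe2fs -h /tmp/fs0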
The presto device is created using the following commands:
mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0
Add the following line to /etc/conf.modules:
alias char-major-185 presto
Before starting lento, mount the cache:
mkdir /izo0; mount /izo0
Now lento can be started on both muskox and clientA by typing ./lento.pl in the directory containing the lento.pl file.
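To verify that everything came up, you can for example check that lento is running and that the InterMezzo file system appears in the mount table:
ps ax | grep lento.pl
grep -i intermezzo /proc/mounts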
The server database can be the same as in the one client and one server case above. The volume database contains:
{
shared => {
servername => "muskox",
replicators => ['clientA', 'clientB']
}
};
This is the same as in the first example, but clientB is added to the replicators list.
The sysid file is the same as in the first example for muskox and clientA; on clientB it contains the following:
clientB
The /etc/fstab entry is the same as the one used in the one client and one server case above.
Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use IP aliasing. This allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100 are available. In /etc/hosts add the line:
192.168.0.100 muskoxA1
Then add the IP alias by typing:
ifconfig eth0:1 muskoxA1 up
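You can confirm that the alias is active with, for example:
ifconfig eth0:1
ping -c 1 muskoxA1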
Then create two files containing the following:
muskox 192.168.0.3
muskoxA1 192.168.0.100
The latter file will act as a sysid file for the lento running on the aliased IP address.
To run the second lento, a second loopback cache and presto device are required. These are constructed as follows:
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
losetup /dev/loop1 /tmp/fs1
mke2fs /dev/loop1
losetup -d /dev/loop1
mknod /dev/intermezzo1 c 185 1
chmod 700 /dev/intermezzo1
Now two /etc/fstab entries are needed:
/tmp/fs0 /izo0 InterMezzo loop,volume=shared,prestodev=/dev/intermezzo0,\
mtpt=/izo0,noauto 0 0
/tmp/fs1 /izo1 InterMezzo loop,volume=shared,prestodev=/dev/intermezzo1,\
mtpt=/izo1,noauto 0 0
Then mount the two InterMezzo directories:
mkdir /izo0; mount /izo0
mkdir /izo1; mount /izo1
The lento acting as the server can be started as before:
./lento.pl
The lento acting as the replicator has to be told which sysid file and which presto device to use. It is started as follows:
./lento.pl --sysid=muskoxA1 --prestodev=/dev/intermezzo1
InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.
InterMezzo's mechanisms are very different from those of Coda. We employ very different kernel code which maintains the cache in another file system (typically Ext2/Ext3/Reiser). The kernel code also uses the journaling support in the kernel to make transactional updates (with lazy commits) to the file space and update journals.
The primary reason for keeping it simple is that we wanted to use it as soon as possible. It is also hoped that it will not be too confusing to the end user, as is frequently the case with advanced network file systems.
InterMezzo divides the file space up into volumes. Typically a volume is much larger than a directory and smaller than a full disk partition. Good examples of volumes might be /usr or someone's home directory.
The typical event sequence for a volume in InterMezzo is as follows:
the volume is created on the server, possibly populated, possibly empty. The file server and the kernel on the server are now aware of the volume.
A client which needs the volume is told about the volume and its server. The client is added to the server's list of replicators of the volume, and the server is made a replicator of the cache on the client. The volume is mounted on the client, and the client cache manager and kernel know about it.
A replicator has the following state:
the peer: the replicator describes replication between this system and that peer
the volume: the replicator describes replication of the volume volname
next_to_expect: the next update record to expect from the peer, in the peer's numbering sequence
next_to_send: the next update record to send to the peer, in this system's numbering sequence
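As a sketch only (the field names peer and volume and the sample values are assumptions for illustration; only next_to_expect and next_to_send are named in this text), the per-replicator state could be pictured as a Perl hash in the same style as the configuration databases:
{
    peer           => "clientA",  # replication is between this system and this peer
    volume         => "shared",   # the volume being replicated
    next_to_expect => 42,         # next record expected from the peer (peer's numbering)
    next_to_send   => 17          # next record to send to the peer (this system's numbering)
};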
Before the volume is usable on the client, it needs to sync up with the server. Syncing up is complicated by the fact that the server copy of the volume may change during the synchronization. Syncing up is done as follows:
During the sync the client is given the server's next_to_send number, which becomes the next_to_expect number on the client.
The volume on the replicator is now in a state of normal operation, which we describe next.
Under normal operation there is a collection of replicators that is connected to the server, and some replicators are in disconnected operation (the latter can also happen when the server fails).
The kernel and the server/cache manager keep the journal of updates to the volume in sync with the contents of the file system, i.e. under all circumstances any update applied to the file system is also entered in the update database. The systems also transactionally update the next_to_expect counters as updates are entered in the cache. (All transactions have lazy commits.)
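As a rough sketch of this bookkeeping (not the actual lento code; the function and field names are invented for illustration), the two counters behave roughly as follows:
# Record a local update: journal it under this system's own
# numbering and advance next_to_send.
sub journal_local_update {
    my ($rep, $record) = @_;
    $record->{recno} = $rep->{next_to_send}++;
    push @{ $rep->{journal} }, $record;   # kept in sync with the file system
}

# Apply an update received from the peer: it must carry the record
# number we expect, and applying it advances next_to_expect.
sub apply_peer_update {
    my ($rep, $record) = @_;
    return unless $record->{recno} == $rep->{next_to_expect};
    # ... apply the update to the local cache here ...
    $rep->{next_to_expect}++;             # updated transactionally with the cache
}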
The following rules govern these operations:
Before an update can be made to the file system, a permit is acquired. A permit acquisition consists of:
Read access on a synced volume is unrestricted.
When a client or server notices that a peer is no longer available it does the following.
The reconnection protocol is the most complicated:
Normal operation can now resume.