{adilger,braam}@stelias.com
This software forms an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with Object Based Storage software.
This software may be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The file COPYING contains version 2 of the GPL.
Copyright on the Object Based Storage software is held by a large number of developers because the code derives from other parts of the Linux kernel. Specific copyright holders are listed in the source files.
This project is further documented at http://www.lustre.org.
An Object Based Disk (OBD) or Object Based Storage Device (OBSD) is one that works at the level of files ("storage objects"), rather than at the level of individual blocks as conventional storage devices do. The OBD keeps track of allocated objects, which blocks belong to each object, free space, etc. internally, rather than exposing these details to the operating system.
An OBD could be a real hardware device, should disk vendors decide that such devices are worth producing. In the meantime we have written a simulated object based device, based on the "lower half" of the Ext2 file system. (Other file systems could easily be used as well.)
A utility called obdcontrol allows for direct manipulation of objects. More interesting is the object based file system (OBDFS), which uses an object based device as its storage device.
OBDFS communicates to the underlying ext2obd device by means of object id's and logical blocks inside objects - NOT physical blocks on the storage device. It lets the ext2obd device handle block and inode allocation. Roughly speaking the combination of OBDFS and ext2obd equals Ext2, and indeed OBDFS is another file system that can access Ext2 formatted drives.
However, instead of gluing them straight on top of each other, one can insert logical object drivers in the middle. These receive object commands from "above", e.g. from OBDFS and speak to other object driver(s). Examples of such configurations are RAID and snapshots.
Because of the object abstraction used in OBDs, it is possible to layer OBD drivers on top of each other. A logical object driver is a client of a lower-level driver (or of the direct device driver), but is itself the target of a higher-layer driver or of an application such as OBDFS, which issues object methods to the driver it uses.
This makes it possible to stack OBD drivers so that each layer implements a different piece of functionality. The snapshot and network layers are simply OBD drivers stacked on top of a base OBD driver. Also under discussion and/or development for OBDFS are drivers for RAID0, RAID1, Object Volume Management, and others.
As an example we have implemented a snapshot driver that can be used in conjunction with ext2obd and OBDFS. The current OBDFS implementation can create multiple timed snapshots of a filesystem, allowing historical views of a filesystem, or consistent backups of a mounted filesystem. The way that snapshots are currently implemented, however, means that the underlying ext2 filesystem is not a valid filesystem for the normal ext2 driver while any snapshots exist (it becomes a valid ext2 filesystem again once all of the snapshots have been removed).
It will also be possible to use OBDFS in a network mode, like NFS, to access files on a remote system; code for a SUN RPC driver for the storage object protocol is forthcoming. (Of course, faster interconnects such as FC or InfiniBand are attractive too.) This will form the basis for the Lustre file system - a Linux Cluster file system based on object based storage.
In addition to the highly modular architecture for storage management, another possible benefit of an OBD over a conventional block-based storage device can be likened to using an accelerated graphics adapter to handle drawing circles, filled rectangles, etc., instead of having the CPU draw each pixel individually. OBDFS can pass a few high-level commands to an OBD when creating, copying, or deleting a file, instead of being concerned with keeping track of hundreds or thousands of individual blocks for each file on a device. In clusters, such devices avoid sharing the allocation metadata among all cluster nodes, which is a cause of complexity. Precisely how beneficial all this is remains to be evaluated.
In this HOWTO, I will go over how to use the Object Based Device filesystem under Linux. Since OBDFS is still under development, it should not be used on production systems, or to store important data. It is expected that anyone using OBDFS knows how to patch and compile a kernel.
We will cover the basic installation and configuration of OBD devices and filesystems under Linux, as well as go through some examples of how OBDFS might be used. Since OBDFS is still under development, there are likely bugs to be found if you venture off the beaten path. Please report these bugs to the obd-devel mailing list (see Contacting the Authors for more information).
In order to use OBDFS under Linux you need kernel 2.3.31, in addition to compiling several modules. You will also need to have loop devices enabled in the kernel in order to do safe filesystem testing. The user-space tool (obdcontrol) is written in Perl, which is normally installed, but it also requires the Term-ReadLine-Gnu Perl module, distributed as Term-ReadLine-Gnu-1.04.tar.gz (see below).
This code only works with linux-2.3.31; in particular, it won't work with 2.2 versions of Linux. It compiles as a set of modules, and you should NOT need to patch your kernel. The modules install a character device on major 186 (allocated to us for this purpose) and a file system named OBDFS. So create some character devices:
# mknod /dev/obd0 c 186 0
# mknod /dev/obd1 c 186 1
# mknod /dev/obd2 c 186 2
# mknod /dev/obd3 c 186 3
# mknod /dev/obd4 c 186 4
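The five mknod calls above follow a single pattern, so they can also be generated in a loop. The following is a minimal sketch that only prints the commands (a dry run); the function name obd_mknod_cmds is our own invention and not part of the OBD distribution. The major number 186 is the one given in the text.

```shell
#!/bin/sh
# Print the mknod commands for OBD character devices 0..(count-1).
# Major number 186 comes from the text above; the function name is hypothetical.
obd_mknod_cmds() {
    count=${1:-5}
    minor=0
    while [ "$minor" -lt "$count" ]; do
        echo "mknod /dev/obd$minor c 186 $minor"
        minor=$((minor + 1))
    done
}

# Dry run: show what would be executed for the five devices used in this HOWTO.
obd_mknod_cmds 5
```

Once the printed commands look right, they can be run as root, for example by piping the output to sh.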
In order to use OBDFS, you also need to compile the kernel modules obdext2, obdclass, obdfs, and obdsnap (if you are using the snapshot facility). The OBD drivers are closely tied to the kernel version, as there was a major change to the VFS layer around kernel version 2.3.25.
For an initial configuration and compile, run:
# cd /path/of/obdcode
# make config all
The configuration will ask some basic questions about your system configuration. For symbols, I suggest you tell the script to find them in your Linux kernel tree. It should then proceed to compile the various OBD modules.
One other piece of software is needed: a Perl readline package that makes the commandline tool obdcontrol ever so much nicer to use. Get the package from:
ftp://ftp.lustre.org/pub/lustre/Term-ReadLine-Gnu-1.04.tar.gz
Installation is easy:
# tar xzf Term-ReadLine-Gnu-1.04.tar.gz
# cd Term-ReadLine-Gnu-1.04
# perl Makefile.PL
# make install
Ready to play!
The quick way to get OBDFS mounted is:
# mkdir /mnt/obd
# cd top-of-the-source/demos
# ./obdfssetup.sh
If you type mount you will see your new file system. Copy a couple of files in there to test it out. To clean up again, use:
# ./obdfsclean.sh
Interestingly the underlying file system is still a good old Ext2 file system. Let's run e2fsck on it, to see that we didn't corrupt it:
# losetup /dev/loop0 /tmp/obdfs.tmpfile
# e2fsck /dev/loop0
You could go on and mount /dev/loop0 as an ext2 file system to verify. Instead, let's go on and play with snapshots.
Again, we have provided a quick way to get you off the ground:
# cd top-of-the-source/demos
# sh obdfsclean.sh
# rm /tmp/obdfs.tmpfile
# ./snapsetup.sh
When you type mount you will see two file systems. One uses /dev/obd1 and the other /dev/obd2. Both of these OBD devices are talking to an obdext2 on /dev/obd0, which is configured, as before, to talk to /dev/loop0 on /tmp/obdfs.tmpfile.
It is instructive to investigate the inodes of the file system with debugfs:
# debugfs /dev/loop0
debugfs: stat <2> # look at the contents of the root inode
debugfs: ls <2> # see the hello file we created? it's inode 12
debugfs: stat <12> # let's look at the block assigned to hello
debugfs: q
This shows that objects (inodes) 2 and 12 have a block attached to them, holding the directory and file data, respectively.
Now we can make a few changes to the /mnt/obd filesystem and see what effect this has on the two filesystems (which both share one device).
# echo "today" >> /mnt/obd/hello
Now run debugfs again.
# debugfs /dev/loop0
debugfs: stat <12> # file 12 (hello) looks like it has 3 blocks
debugfs: stat <18> # file 18 (old hello) has old data block
debugfs: stat <19> # file 19 (new hello) has a new data block
debugfs: q
For the /mnt/obd/hello file, the first "block" listed is actually a magic number which indicates to the snapshot driver that this inode has multiple versions. The second "block" is (in this case) the object id of the current snapshot of this inode. (That snapshot is mounted on /mnt/obd.) The last "block" is the object id for this file corresponding to a snapshot that was timed to preserve state just after we created the file /mnt/obd/hello in the file system (look in snapsetup.sh for details).
What has happened is that the inode was made into an indirect object that refers the caller to either the old data (in object 18) or the new data in object 19. Of course, the numbers 18 and 19 can change if you do extra file system operations.
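The redirection just described can be pictured with a toy model. The sketch below is not the snapshot driver's code: it assumes a snapshot inode is simply a list of object ids, one per snapshot slot, with slot 1 being the current view and slot 2 the timed snapshot. The function name resolve_object and this slot layout are assumptions made for illustration (the real inode also stores the magic number).

```shell
#!/bin/sh
# Toy model of an indirect (snapshot) inode: the caller names a snapshot slot
# and gets back the object id that holds the data for that view.
# Arguments: <slot> <object id for slot 1> <object id for slot 2> ...
resolve_object() {
    want=$1
    shift
    slot=1
    for obj in "$@"; do
        if [ "$slot" -eq "$want" ]; then
            echo "$obj"
            return 0
        fi
        slot=$((slot + 1))
    done
    return 1   # no object recorded for that slot
}

# With the numbers from the text: new data in object 19, old data in object 18.
resolve_object 1 19 18   # current view (mounted on /mnt/obd)
resolve_object 2 19 18   # timed snapshot (mounted on /mnt/snap)
```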
In the file system we can see this too:
# cat /mnt/obd/hello
# cat /mnt/snap/hello
The final test is of course:
# rm /mnt/obd/hello
# ls /mnt/obd
# cat /mnt/snap/hello
Finally we will restore the old world with our snaprest.sh shell script:
# ./snaprest.sh
# ls /mnt/obd
# cat /mnt/obd/hello
To clean up after this, run snaprestclean.sh.
More fun with snapshots can be found below, along with more explanation of how they operate. Ok, let's explain in some more detail how this works.
The first step in starting to use the OBD software is to load the modules into the kernel. For basic usage, you need to install the obdclass, obdext2, and obdfs modules. The following examples assume you are in the main OBD directory.
# insmod class/obdclass.o
# insmod ext2obd/obdext2.o
# insmod obdfs/obdfs.o
The obdclass module provides a dispatching service for the object-type-dependent methods used by the various OBD drivers. The obdext2 module is the low-level block device driver which simulates an object based device using an ext2 filesystem on disk. It works only with blocks, inodes, and bitmaps, and has no understanding of directories or filenames. The obdfs module is the filesystem which manipulates files and directories, and presents a view of the underlying device to the user.
In order to test OBD stuff, you need to create a small test filesystem for the obdext2 driver to work with.
This can be done using the normal ext2 tools found in the e2fsprogs package (installed as part of the base install of virtually every Linux system). The easiest way to do this is with a loopback device:
# dd if=/dev/zero of=/tmp/obdfs.tmpfile bs=1k count=10k
# insmod loop
# losetup /dev/loop0 /tmp/obdfs.tmpfile
# mke2fs -b 4096 /dev/loop0
Note that the ext2 filesystem currently needs to be created with a 4k block size, because the obdext2 driver assumes the block size matches the page size. This needs to be fixed in a later release of the obdext2 driver to allow ext2 filesystems with 1k and 2k block sizes.
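Since the driver assumes that the block size equals the page size, it can be worth checking the page size of the test machine before running mke2fs. The sketch below uses the standard getconf utility; the helper name blocksize_matches_page is hypothetical, and its optional second argument exists only so that the check can be tried with a made-up page size.

```shell
#!/bin/sh
# Succeed only if the requested ext2 block size equals the page size.
# The second argument optionally overrides the detected page size.
blocksize_matches_page() {
    block=$1
    page=${2:-$(getconf PAGESIZE)}
    [ "$block" -eq "$page" ]
}

if blocksize_matches_page 4096; then
    echo "4096-byte blocks match the page size"
else
    echo "page size is $(getconf PAGESIZE), not 4096"
fi
```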
The majority of configuration of OBDFS is through the control program obdcontrol. This is a relatively complete command-line interface, with basic help, command completion, and command history. It allows you to (un)configure basic OBD and snapshot devices, as well as do debugging and testing of OBD devices and objects.
The most common commands in obdcontrol are the matching pairs attach and detach, setup and cleanup, and connect and disconnect, plus help and quit.
To get a complete listing of available commands, type help at the obdcontrol prompt. To get basic help on the meaning and syntax of a command, type help command. Command completion is activated with the TAB key, and command history is available via the up- and down-arrow keys.
Attach will attach the specified OBD driver (ext2_obd or snap_obd) to the current OBD device (by default /dev/obd0; you can change devices with the device command). This serves two purposes. First, the device /dev/obdX now has the methods of the driver type named in the attach command. Second, in some cases we also pass some data in to the system, for example to indicate what snapshot view /dev/obdX should give.
We need the ext2_obd driver so we can attach to the test filesystem we created.
# class/obdcontrol
Device now /dev/obd0
obdcontrol > attach ext2_obd /dev/loop0
Setup will complete the configuration of the current OBD device. For ext2_obd a setup command initializes an inode and buffer cache which the obd driver exploits.
obdcontrol > setup
obdcontrol > quit
At this point, you should be able to mount the OBDFS filesystem:
# mkdir /mnt/obd
# mount -t obdfs -o device=/dev/obd0 none /mnt/obd
# df -k /mnt/obd
Filesystem 1k-blocks Used Available Use% Mounted on
none 9668 20 9148 0% /mnt/obd
These steps are included in the script demos/obdfssetup.sh.
Light usage of the file system (such as rebuilding the obd code) is
usually possible.
NOTE: Reams of debugging output are produced by the various OBD components. This can be quelled with:
# echo 0 > /proc/sys/obd/debug
# echo 0 > /proc/sys/obd/trace
For example create a few files there for testing:
# echo "yesterday" > /mnt/obd/hello
# echo "test" > /mnt/obd/bye
# touch /mnt/obd/a /mnt/obd/b
# ln -s hello /mnt/obd/link
# cat /mnt/obd/link
yesterday
# ls -li /mnt/obd
total 23
15 -rw-r--r-- 1 root root 0 Dec 16 16:43 a
16 -rw-r--r-- 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 10 Dec 16 16:43 hello
14 lrwxrwxrwx 1 root root 5 Dec 16 16:43 link -> hello
11 drwxr-xr-x 1 root root 16384 Dec 16 16:43 lost+found
Connect will establish a unique connection to the OBD device. This allows the device to keep track of parameters and resources on a per-client basis. Most operations require such a connection to have been made. For example, to get the attributes of inode 12 (the file hello in the previous listing) we need to first connect:
obdcontrol > connect
Client ID : 2
Finished (success)
obdcontrol > getattr 12
Inode: 12 Mode: 100644
User: 0 Group: 0 Size: 10
ctime: 3859792b -- Thu Dec 16 23:43:39 1999
atime: 00000000 -- Thu Jan 1 00:00:00 1970
mtime: 3859792b -- Thu Dec 16 23:43:39 1999
flags: 3859792b
Finished (success)
obdcontrol > disconnect
Finished (success)
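The hexadecimal values that getattr prints in front of each date are the same timestamps expressed as seconds since the Unix epoch. As a small sketch, they can be converted back to decimal in the shell; the helper name hex_to_epoch is ours, for illustration only.

```shell
#!/bin/sh
# Convert a hexadecimal timestamp, as printed by getattr, into decimal
# seconds since the Unix epoch.
hex_to_epoch() {
    printf '%d\n' "0x$1"
}

hex_to_epoch 3859792b   # prints 945387819
```

With GNU date, running date -u -d @945387819 should render this value as the 16 Dec 1999 23:43:39 time shown in the listing, which suggests that getattr reports times in UTC.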
The OBDFS file system makes a connection when it is mounted. An important purpose of connections is to release pre-allocation data from obdext2 when the connection is closed.
The power of the object storage paradigm can be seen in storage management modules which reside between our file system (OBDFS) and storage driver (obdext2). Snapshots are read-only clones of file systems, which are present in addition to the current copy. Snapshots are associated with a certain point in time, enabling consistent views of older versions of the file system as well as unassisted retrieval of old files after accidental deletion.
An uninteresting way to produce a type of snapshot is to simply copy the entire filesystem to a certain location. With the OBDFS snapshots we maintain the read-only clones through a "copy on write" mechanism. With this mechanism snapshots require much less space.
The stack of drivers is now different. The file system uses the snapshot OBD driver as its device and NOT ext2_obd. The snapshot driver type is associated with several devices. There is a current snapshot which is read/write, and then one snapshot device can be instantiated for each timed snapshot.
Leave some files around in the file system you mounted above. These will be the "old" copies, preserved in the snapshots, while the new ones are maintained in the current snapshot. First unmount the file system, then install the snapshot OBD driver and finally configure (attach and set-up) the snapshot drivers:
# cd top-of-the-source
# insmod snap/obdsnap.o
# umount /mnt/obd
# class/obdcontrol
obdcontrol > snaptable
enter file name: /tmp/obdfs.snaptable
Add, Delete or Quit [adq]: a
enter index where you want this snapshot: 1
enter time or 'now' or 'current': current
Time: current -- Index 1
Add, Delete or Quit [adq]: a
enter index where you want this snapshot: 2
enter time or 'now' or 'current': now
Time: current -- Index 1
Time: Thu Dec 16 16:32:37 1999 -- Index 2
Add, Delete or Quit [adq]: q
OK with new table? [Yn]: y
All that we've done so far is create a table, which in real use would be a configuration file somewhere in /etc, but for now was placed in /tmp/obdfs.snaptable. This has information about snapshot times and which OBD slots are associated with each snapshot. Every inode can remember up to 12 snapshots and we allocate each snapshot to a slot. We always need to have a "current" snapshot, which we placed at index 1 in this case; this is the read-write snapshot, where updates to the filesystem go.
We also created a "historical" snapshot (at index 2), which means that the state of all files stored in the OBD filesystem before 16:32:37 (the time when I created the "now" snapshot) will be preserved in that snapshot. Deletions will leave those old files around, and writing to a file created before that timestamp and not modified after will cause a copy (the COW - Copy on Write) to be left behind in the historical snapshot. Updates to the atime are so frequent that we have eliminated them from the causes of COW.
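The copy-on-write rule in the paragraph above can be reduced to a single comparison. The sketch below is a simplification and not the driver's actual logic: it assumes that an object needs to be copied before a write only if it was last changed before the snapshot time, with both times given as seconds since the epoch. The name needs_cow is hypothetical.

```shell
#!/bin/sh
# Decide whether a write must first preserve the old object (copy on write).
# Simplified rule: copy only if the object was last changed before the
# snapshot was taken. Arguments: <last change time> <snapshot time>.
needs_cow() {
    last_change=$1
    snap_time=$2
    [ "$last_change" -lt "$snap_time" ]
}

# An object last changed at t=100, snapshot taken at t=200: preserve it first.
if needs_cow 100 200; then
    echo "copy old data, then write"
fi

# An object already changed at t=300, after the snapshot: no further copy.
if ! needs_cow 300 200; then
    echo "write in place"
fi
```

Under this rule a file is copied at most once per snapshot, because the first write moves its last change time past the snapshot time.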
Now we load the newly created snapshot table into the snapshot driver. We will load this into snapshot table 0, with the snapset command. We also want to attach the snapshot OBD driver to OBD devices, one device for each snapshot. We will attach /dev/obd1 to snapshot index 1 (current), and /dev/obd2 to snapshot index 2 (historical). In both cases we use /dev/obd0 as the underlying data storage area.
obdcontrol > snapset 0 /tmp/obdfs.snaptable
Time: current -- Index 1
Time: Thu Dec 16 16:47:16 1999 -- Index 2
Snapcount 2
type snap_obd (len 8), datalen 24 (24)
Finished (success)
obdcontrol > device /dev/obd1
Device now /dev/obd1
obdcontrol > attach snap_obd 0 1 0
type snap_obd (len 8), datalen 12 (12)
Finished (success)
obdcontrol > setup snap_obd
Finished (success)
obdcontrol > device /dev/obd2
Device now /dev/obd2
obdcontrol > attach snap_obd 0 2 0
type snap_obd (len 8), datalen 12 (12)
Finished (success)
obdcontrol > setup snap_obd
Finished (success)
For the first attach command, we attach the current OBD device (/dev/obd1) to type snap_obd and give attachment data in the form of 3 parameters. The first names the underlying object device to use (0, for /dev/obd0), the second gives the snap index to use (snap index 1 in this case), and the third names the table giving the times (table 0). The second attach command is similar. Now we are ready to try out our snapshots as devices. The need to mount /dev/obd2 read-only by hand is a deficiency in our software; read-only access will be enforced automatically in a future release.
# mount -t obdfs -o device=/dev/obd1 none /mnt/obd
# mkdir /mnt/snap
# mount -t obdfs -o ro,device=/dev/obd2 none /mnt/snap
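The device, attach, and setup commands typed at the obdcontrol prompt above repeat for every snapshot, so they are easy to generate. The following sketch only prints the commands for review. It assumes, as in this example, that snapshot index N is served by /dev/obdN; the function name snap_attach_cmds is hypothetical and is not part of obdcontrol.

```shell
#!/bin/sh
# Print the obdcontrol commands that attach one snapshot view per index.
# Assumption (from the example above): snapshot index N uses /dev/obdN.
# Arguments: <underlying device number> <snapshot table number> <index>...
snap_attach_cmds() {
    base=$1
    table=$2
    shift 2
    for index in "$@"; do
        echo "device /dev/obd$index"
        echo "attach snap_obd $base $index $table"
        echo "setup snap_obd"
    done
}

# Underlying device /dev/obd0, snapshot table 0, snapshot indexes 1 and 2.
snap_attach_cmds 0 0 1 2
```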
The previous steps for configuring the snapshot device are included in the demos/snapsetup.sh script. Finally we will see the snapshot in operation. First we take a look at the files in the two directories, and note that they have the same inode numbers for both the read-write and read-only devices:
# ls -li /mnt/snap /mnt/obd
/mnt/snap:
total 19
15 -rw-r--r-- 1 root root 0 Dec 16 16:43 a
16 -rw-r--r-- 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 10 Dec 16 16:43 hello
14 lrwxrwxrwx 1 root root 5 Dec 16 16:43 link -> hello
11 drwxr-xr-x 1 root root 16384 Dec 16 16:43 lost+found
/mnt/obd:
total 20
15 -rw-r--r-- 1 root root 0 Dec 16 16:43 a
16 -rw-r--r-- 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 10 Dec 16 16:43 hello
14 lrwxrwxrwx 1 root root 5 Dec 16 16:43 link -> hello
11 drwxr-xr-x 1 root root 16384 Dec 16 16:43 lost+found
It is instructive to investigate the inodes of the file system with debugfs:
# debugfs /dev/loop0
debugfs: stat <2> # look at the blocks assigned to the root inode
debugfs: ls <2> # list the root directory
debugfs: stat <12> # stat the file "hello", inode 12 above
This shows that objects (inodes) 2 and 12 have a block attached to them, holding the directory or file data.
Now we can make a few changes to the /mnt/obd filesystem and see what effect this has on the two filesystems (which both share one device):
# chmod 777 /mnt/obd
# echo "today" >> /mnt/obd/hello
# cp /etc/hosts /mnt/obd
# rm /mnt/obd/a
# chmod 777 /mnt/obd/b
# ls -li /mnt/snap /mnt/obd
/mnt/snap:
total 19
15 -rw-r--r-- 1 root root 0 Dec 16 16:43 a
16 -rw-r--r-- 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 10 Dec 16 16:43 hello
14 lrwxrwxrwx 1 root root 5 Dec 16 16:43 link -> hello
11 drwxr-xr-x 1 root root 16384 Dec 16 16:43 lost+found
/mnt/obd:
total 24
16 -rwxrwxrwx 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 16 Dec 16 18:28 hello
19 -rw-r--r-- 1 root root 394 Dec 16 19:32 hosts
24 lrwxrwxrwx 1 root root 7 Dec 16 19:34 link -> bye
11 drwxr-xr-x 1 root root 16384 Dec 16 16:43 lost+found
# cat /mnt/snap/hello
yesterday
# cat /mnt/obd/hello
yesterday
today
# cat /mnt/obd/link
test
We can see that /mnt/snap has stayed constant (inode numbers, file size, mtime), while /mnt/obd shows the changes we have made to the various files, yet they also have the same inode numbers (very important for directory lookups, NFS, etc). This is all handled in the snap OBD driver, which does copy-on-write for modified objects and handles redirection to the proper underlying object, depending on the context of the object request.
Debugfs shows the details:
# debugfs /dev/loop0
debugfs: stat <2>
debugfs: ls <17>
debugfs: ls <18>
What is seen here is that the inode is no longer pointing to blocks: it has a magic constant (stating "I'm a snapshot inode") and it contains referrals to two other inodes, numbers 17 and 18 in my case (if you do more things with the file system, the allocated inode numbers can of course be different). Doing the two ls's reveals the two copies of the directory introduced by the snapshot driver.
Because of the redirection in the snapshot layer, the underlying ext2 filesystem is not in a valid ext2 state (this may be fixed in a later release of OBDFS). However, we can delete a read-only (old) snapshot and leave the "current" state as a clean ext2 filesystem. We can also restore the filesystem to its former state.
Note: In the current release of OBDFS, it is possible to add a snapshot while a filesystem is mounted, but it is not possible to remove the snapshot while the filesystem is mounted. While it may appear to work in many cases, it will likely corrupt the filesystem.
obdcontrol > device /dev/obd2
Device now /dev/obd2
obdcontrol > connect
Client ID : 2
Finished (success)
obdcontrol > snapdelete
type snap_obd (len 8), datalen 4 (4)
Finished (success)
obdcontrol > cleanup
Finished (success)
obdcontrol > detach
Finished (success)
obdcontrol > device /dev/obd1
Disconnecting active session (2)...Finished (success)
Device now /dev/obd1
obdcontrol > cleanup
Finished (success)
obdcontrol > detach
Finished (success)
obdcontrol > quit
# rmmod obdsnap
# rmmod obdfs
# rmmod obdext2
# rmmod obdclass
# mount /dev/loop0 /mnt/obd
# ls -li /mnt/obd
total 32
16 -rwxrwxrwx 1 root root 0 Dec 16 16:43 b
13 -rw-r--r-- 1 root root 5 Dec 16 16:43 bye
12 -rw-r--r-- 1 root root 16 Dec 16 18:28 hello
19 -rw-r--r-- 1 root root 394 Dec 16 19:32 hosts
24 lrwxrwxrwx 1 root root 7 Dec 16 19:34 link -> bye
11 drwxr-xr-x 2 root root 16384 Dec 16 16:43 lost+found
After removing the snapshot, we did a bit of cleanup on the devices we had previously configured, so that we can safely remove the loaded modules. Finally, we remounted the underlying filesystem as ext2 again without any problems, and can see that it is in the state of the current snapshot.
Finally we mention the snap restore operation (see the shell script demos/snaprest.sh for how it is used). This allows you to revert a file system to the state in a previous snapshot.
Over the next months we will be working on other aspects of these systems. We hope to release the following:
If you want to help, let us know!
While every effort is made to have a functioning system, because the software is under development, there may be dated releases which do not work. In general, the versioned releases should be working releases.
Some known issues/bugs with obdfs at the time this document was created:
There are two mailing lists for OBD, one for questions and development of the OBDFS software, and a second low-volume list for the announcement of new OBDFS releases. We read email sent to both of these lists regularly.
In order to send email to these lists, you must be subscribed. To subscribe to the obd-devel list, send email to obd-devel-request@lustre.org with the body:
subscribe your@email.addr obd-devel
The process for subscribing to obd-announce is the same.
To contact the authors directly, send email to braam@stelias.com or adilger@stelias.com.