Linux-HA Project Task List

This page describes the Linux-HA Phase I activities that need to be done.  There is also a documentation team, whose activities I haven't listed here because I don't know enough about what needs to be done, or how, to break it up into distinct tasks.

The purpose of this list is to track all the things we think need doing, and hopefully match them up to people who will do them.  Perhaps you?  If you see a task you are interested in, or want to add tasks to the list or comment on it, contact Alan Robertson.
 

Organization of Activities

Linux-HA Phase I is divided into three basic areas:
  1. Heartbeat and cluster management services
  2. HA resource implementations
  3. Diagnostics

High-Priority Jobs

These are the jobs I feel strongly need to be done soon.  Volunteers willing to tackle them head-on would be very much appreciated.

HA Resource Implementations

Anything that can move from one machine to another (processes, IP addresses, MAC addresses, filesystems) is implemented as a resource.  Lots of interesting activities fall under this category: IP address takeover is one example, and filesystem replication and takeover is another.  This is where we get into properly handling the great diversity of things that people want to put into HA systems.  Some of these resource types will require specialized hardware to test.  If you want to see a particular configuration supported, you may have to create the resource for it.

Resources are basically objects with four member functions:

Activity | Description | DependsOn | WhoDo?
SharedFS | Shared filesystem takeover.  Useful for multi-interface RAID boxes and shared SCSI implementations. | - | -
NFS | NFS server takeover.  The tricky parts are (a) lock takeover and (b) integration with ReiserFS.  You'll need to get the latest nfs-utils package for this one.  The folks at MC Linux and SGI have both implemented this and may be of some help. | - | -
Samba | Samba server takeover.  This is made tricky by the stateful protocol.  It may be necessary to fix some client software to get this to work for every application.  Good to consult with Jeremy Allison or one of the other Samba principals. | - | -
Netatalk | AppleTalk (netatalk) server takeover. | - | -
Oracle | Oracle database server takeover. | - | -
PostgreSQL | PostgreSQL database server takeover. | - | -
MACaddr | MAC address resource implementation. | - | -
Intermezzo | Intermezzo file sharing strategy.  This may not have to be a resource (?) | - | -
IPaddr-bcast | Fix IPaddr so that it handles netmasks and broadcast addresses correctly.  Probably involves changing findip.c. | - | alanr; DONE 0.4.3

Related Activities

Things that don't fall into one of the other categories show up here...
 
Activity | Description | DependsOn | WhoDo?
AutoMake/AutoBuild | Convert heartbeat to be based on Automake/autobuild. | - | David Lee <T.D.Lee@durham.ac.uk> and Michael Moerz <e9625136@stud3.tuwien.ac.at>
SolarisPort | Re-port heartbeat to Solaris. | AutoMake/AutoBuild | David Lee <T.D.Lee@durham.ac.uk>
OpenBSDPort | Complete OpenBSD port of heartbeat. | AutoMake/AutoBuild | Frank DENIS aka Jedi/Sector One
FreeBSDPort | Complete FreeBSD port of heartbeat. | AutoMake/AutoBuild | Matt Soffen
FileSync | Transactional file synchronization between nodes. | - | -
configlib | Separate out configuration parsing code into a library or .o, preferably a dynamically loaded module. | - | alanr
ResourceMove | Add an interface to allow users to move resources from one node to another. | - | -
GGUI | GNOME-based configuration and status GUI [can KGUI and GGUI share code?] | - | (?)
KGUI | KDE-based configuration and status GUI [can KGUI and GGUI share code?] | - | -
GFS | Test GFS with Linux-HA...? | - | -
LVS | Figure out how to best integrate Linux-HA with the LVS project.  Note that Jacob Rief has already done a little work in this area. | - | DONE (see UltraMonkey)

Diagnostics Activities

Linux-HA Phase I needs a diagnostics subsystem to notice and handle things like hardware and software failures that aren't complete node failures.  This is where that will be carried out.
 
Activity | Description | DependsOn | WhoDo?
DiagFrame | Implement a Diagnostics API framework, or just adopt Mon and/or its API. | - | -
EtherDiag | Implement a dead ethernet check for serial ports using new code for ethernet diagnostics stuff. | DiagFrame | -
HBDiags | Implement a disconnect check (RTS, DCD) for serial ports.  This would be triggered on demand from Mon or called directly from heartbeat as needed.  It should probably exist in a library version and an a.out version. | DiagFrame | done/replaced
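
The HBDiags row mentions a disconnect check based on the RTS and DCD lines of a serial port.  As a rough illustration only (not the heartbeat implementation), the sketch below reads the modem-control bits on Linux with the TIOCMGET ioctl and decodes them.  The function names and the "no carrier means disconnected" policy are assumptions made for this example; whether DCD is meaningful at all depends on how the serial cable is wired.

```python
import fcntl
import struct
import termios


def read_modem_bits(fd):
    """Return the raw modem-control bits for an open serial port descriptor."""
    buf = struct.pack("I", 0)
    raw = fcntl.ioctl(fd, termios.TIOCMGET, buf)
    return struct.unpack("I", raw)[0]


def link_state(bits):
    """Decode the two lines named in the task description."""
    return {
        "rts": bool(bits & termios.TIOCM_RTS),
        "dcd": bool(bits & termios.TIOCM_CAR),
    }


def looks_disconnected(bits):
    """Assumed policy: a missing carrier-detect signal means the peer is gone."""
    return not link_state(bits)["dcd"]
```

A caller would open the serial device, pass its file descriptor to read_modem_bits(), and hand the result to looks_disconnected().  Keeping the decoding separate from the ioctl call makes the policy easy to exercise without hardware.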

Testing Activities

Lots of things need testing.  Linux-HA needs special attention in the testing department.  This is the beginning of such a list.

Activity | Description | DependsOn | WhoDo?
TestUtil | Utility for banging on Linux-HA and testing it.  Should be capable of causing failovers and seeing if they worked.  The current implementation, called CTS, is well underway and is now in CVS under the "cts" directory.  It is written in Python. | - | InProgress
ConfigRegress | Configuration regression testing database containing valid and invalid configurations for testing the input validation (see InputCheck below). | - | -
TestPlan | Write a test plan for Phase I of Linux-HA delineating the specific test configurations and test cases that we really mean to have work. | - | -
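
To make the ConfigRegress idea a little more concrete, here is a minimal sketch of a regression table that pairs configuration snippets with the outcome a validator is expected to produce.  Everything in it is hypothetical: the validate() function is a toy stand-in, and the two directives it accepts and the rules it applies are placeholders for illustration, not heartbeat's actual configuration grammar.

```python
def validate(config_text):
    """Toy validator: return a list of error strings (empty means valid)."""
    errors = []
    nodes = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        key, args = parts[0], parts[1:]
        if key == "node":
            if not args:
                errors.append(f"line {lineno}: node needs a name")
            nodes.extend(args)
        elif key == "keepalive":
            if len(args) != 1 or not args[0].isdigit() or int(args[0]) == 0:
                errors.append(f"line {lineno}: keepalive needs a positive integer")
        else:
            errors.append(f"line {lineno}: unknown directive {key!r}")
    if not nodes:
        errors.append("no node directives found")
    return errors


# Each case: (name, configuration text, whether it should validate cleanly).
CASES = [
    ("two nodes", "keepalive 2\nnode alpha\nnode beta\n", True),
    ("no nodes", "keepalive 2\n", False),
    ("bad interval", "keepalive soon\nnode alpha\n", False),
    ("unknown directive", "node alpha\nfrobnicate yes\n", False),
]


def run_regression(cases=CASES):
    """Return the names of cases whose outcome differs from the expectation."""
    failures = []
    for name, text, should_pass in cases:
        if (not validate(text)) != should_pass:
            failures.append(name)
    return failures
```

An empty list from run_regression() means every valid configuration was accepted and every invalid one was rejected.  New cases are added by appending a tuple to CASES.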

Heartbeat and Cluster Manager Activities

These things are at the heart of Linux-HA.  You'll notice that many of them are marked as critical.  Lots of fun stuff to be done here.

Activity | Description | DependsOn | WhoDo?
CMsplit | Split out the Cluster Manager function from the current heartbeat and make the heartbeat core "pure": devoid of policies and decision-making. | hbAPI | -
ResourceMon | Add resource monitoring to the CMsplit cluster manager. | CMsplit | -
PartCluster | Detect and perform basic recovery from a partitioned cluster condition.  Of course, this won't unscramble shared SCSI filesystems that might have been scrambled as a result of a partitioned cluster :-) | - | partly done - needs testing
ModuleLoading | Implement a general module loading strategy for heartbeat. | - | marcelo; DONE - for now
PingMedia | Ping mediatype for pseudo-membership.  This would allow a router, or switch, or other device to become a pseudo-member for purposes of quorum calculations, etc. | - | DONE
CMFrame | Create framework for "real" cluster manager.  This constitutes the APIs and supporting code allowing a cluster manager to be written. | NPhase, hbAPI | -
NPhase | Create an n-phase commit protocol similar to IBM's Phoenix cluster services.  Pages 424-430 in "In Search of Clusters".  See especially pages 428 and 429. | - | -
OrderedMessaging | Create an ordered messaging API. | - | -
CM1 | Create the first "real" cluster manager.  A translation of the current methodology into a cluster manager structure.  May ultimately be a throwaway, or it may be wonderful for a 2-node cluster. | CMFrame, HBProtocol (before release) | -
CM2 | Create the first real cluster manager.  Must support an arbitrary number of nodes.  Probably a voting/quorum-based cluster manager. | CM1 | -
InputCheck | Verify and validate system configuration rigorously before starting up.  Provide a standalone configuration validation tool or input checking mode for heartbeat. | - | (much of this is done now)
Restart heartbeat processes | Heartbeat should be able to restart any of its processes that die.  This is intended to allow for the possibility that one day a bug might be found in the code which would cause it to die.  Heavens!  Perish the thought! :-)  A little infrastructure work to support this effort is in 0.4.3. | - | -
syslog-rsc | Make the cluster-wide syslog a cluster resource.  This may require a little thought to make it reliable, and keep messages from getting lost during transitions.  Maybe have each message logged to two hosts? | SysLog | -
buffers | Should inspect code and modify it to eliminate the possibility of buffer overrun attacks.  This is especially true of the messaging code. | - | -
patchdoc | Should document my expectations for patch submission.  This should include a little bit about coding style. | - | -
manpage | Write wonderful man pages for heartbeat, heartbeat.cf and haresources. | - | Shawn McKenzie (?)
IpResourceSyntax | Allow some form of continuation lines in haresources file. | - | done
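
The PingMedia row mentions letting a router or switch act as a pseudo-member for quorum calculations.  One way to picture why that helps is the sketch below, which assumes a simple strict-majority rule.  The rule, the function name, and the node names are assumptions made for illustration; this is not a description of how heartbeat computes quorum.

```python
def has_quorum(reachable, members):
    """True if a strict majority of configured members (real or pseudo) is reachable."""
    members = set(members)
    votes = len(members & set(reachable))
    return 2 * votes > len(members)


# Two real nodes alone cannot break a tie when the link between them fails.
two_node = {"node1", "node2"}
tie = has_quorum({"node1"}, two_node)

# Adding a pingable router as a pseudo-member gives three votes in total.
with_router = {"node1", "node2", "router"}
node1_view = has_quorum({"node1", "router"}, with_router)  # still sees the router
node2_view = has_quorum({"node2"}, with_router)            # isolated node
```

With the router counted, the node that can still reach it holds two of three votes and keeps quorum, while the isolated node does not.  In the plain two-node case neither side can reach a majority on its own.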