As an example of how the various components of the system work together, we will develop a data server and a client application. The data server simply provides a named pipe that carries a ``known'' stream of data, and the client continually tries to read it. If the server fails, the overlord will restart it. If the client notices a problem, it will roll back and try again.
The data server is a simple program that runs on all the supported platforms (i.e. no platform-specific tricks). It creates a named pipe and loops, writing the alphabet to it once a second.
#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>      // printf
#include <stdlib.h>
#include <string.h>     // strerror
#include <time.h>       // time
#include <unistd.h>     // write, sleep, close
#include <fcntl.h>
#include <errno.h>

#define PIPENAME "/tmp/simple"

int main(int argc, char *argv[] )
{
    int pd; // fd for named pipe

    srand(time(NULL));

    // an existing pipe is not an error, so a restarted server can reuse it
    if( mkfifo( PIPENAME, S_IRUSR | S_IWUSR ) )
    {
        if( errno != EEXIST )
        {
            printf( "Unable to make fifo \"%s\" (%s)\n",
                    PIPENAME, strerror(errno));
            exit( EXIT_FAILURE );
        }
    }

    pd = open( PIPENAME, O_WRONLY );
    if( pd == -1 )
    {
        printf( "Unable to open fifo \"%s\" (%s)\n",
                PIPENAME, strerror(errno));
        exit( EXIT_FAILURE );
    }

    while(1)
    {
        // roughly one write in 25 is deliberately bad
        if( (rand()%25) == 0 )
            write( pd,
                   "This is bad data and has to be at least 25 chars!!!!",
                   52 );
        else
            write(pd, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 26 );
        sleep(1);
    }

    // not reached: the loop above never exits
    close( pd );
    exit( EXIT_SUCCESS );
}
You can copy this source from the distribution:
$ cd ~/hafta
$ cp /opt/hafta/examples/getstart/server.c .
$ cc -o server server.c
There is nothing special about this server to make it highly available. A real-world server would use the checkpoint library to handle unexpected errors. Here, we rely on the overlord to restart it if it fails.
Note: This server will not fail if the pipe already exists. This is to allow the server to restart without failing.
The server also randomly inserts bad data into the outgoing stream (roughly one write in 25) so that the client can demonstrate how it handles bad data.
For the client, we are going to add a few smarts. Although it is as simple as the server, it uses the HAFTA checkpoint library to roll back and try again when something goes wrong.
The application reads data in 26-byte chunks and ``knows'' that the first byte of each chunk will always be an A. We use this to confirm that everything is working; if there is a problem, we roll back and try to open the pipe again.
The checkpoint library requires the application to consist of one or more sequences. Each sequence has one or more nodes, and each node contains a normal function that does whatever the node is supposed to do, a policy function that decides what to do when the node succeeds or fails, and a rollback function that complements the normal function in case you have to undo whatever it did.
The client consists of one sequence, and that sequence consists of two nodes. The first node initializes the system (opens the pipe) and the second loops, reading the data. It uses checkpoints to record its progress, so that the rollback function knows what it needs to fix.
In the first node, we try to open the pipe. If the open fails, we call HC_NormalFail with the reason it failed. If it succeeds, we call HC_NormalSuccess with the node we want to proceed to.
Note: It is worth mentioning that the policy_data argument to HC_NormalFail, HC_NormalSuccess, HC_RollBackFail and HC_RollBackSuccess is completely defined by the particular policy. The default policy expects the Success calls to pass the new node number to proceed to, and the Fail calls to pass the error.
In the second node, we loop up to 25 times: read the pipe, validate the data, repeat. The important part here is the HC_Checkpoint calls. The rollback functions use them to work out where in the normal function the problem occurred.
If we succeed without problems through the 25 loops, we return success and finish the sequence.
In the first node, the rollback function does three things, two obvious and one not so obvious. Obviously, it prints a failure message and then sleeps for a second to give the overlord a chance to restart the server. The non-obvious thing it does is always succeed. On success we go to the node that is passed as an argument. We always want to succeed because the default policy rolls back to the previous node after 5 consecutive failures, and we really don't have anywhere to roll back to. Failure would try to roll back right away, which makes even less sense. Success has the effect of retrying forever, which may not be what you want in your application, but as a tutorial we are showing how to do it.
In the second node, the rollback function looks at the checkpoint to decide what to do. In this case it does nothing more than optionally print a message. It also returns success but, as you will see in the policies, success is handled differently in each node.
In both policy functions, we replace only the behaviour we want to override and leave the rest to the default policy. You can certainly replace the whole function if you need to. You can take a look at the default function in /opt/hafta/src/checkpoint/lib/HC_Default.c.
In the first node, all we want to override is the default behaviour on failure, so that we never roll back to the previous node. This allows the normal function to fail as many times as necessary.
In the second node, we work out why we failed from the error code passed on the HC_NormalFail call. If the read failed, we assume the server is dead, short-circuit the default of 5 retries, and immediately roll back to the previous node. In any other case, which here can only mean bad data, we print the bad data and fall through to the default, which tries to read the data again.
Calling a sequence is fairly simple. You need a nodelist: an array of entries, each holding a node number, the type of function, and the function pointer. You use it to create a sequence pointer, which is passed to all the functions. When you want to ``call'' your sequence, you call HC_CallSequence and pass it the sequence you just created, some user-specific data, and the starting node number.
Once you are done with the sequence, release it with a call to HC_DeleteSequence.
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>     // memset
#include <unistd.h>     // read, close, sleep
#include <fcntl.h>
#include <errno.h>
#include "checkpoint.h"

#define PIPENAME "/tmp/simple"

#define CP_CLEAR 0
#define CP_RECV_DATA 1
#define CP_GOOD_DATA 2

#define ERR_OPEN_FAILED 100
#define ERR_READ_FAILED 101
#define ERR_BAD_DATA 102

#define NODE_EXIT 0
#define NODE_INIT 1
#define NODE_GETWORK 2

int
normalInit (HC_Sequence_t *sequence, void *userdata)
{
    int *fd = (int *) userdata;

    *fd = open (PIPENAME, O_RDONLY);
    if (*fd < 0)
    {
        // return here, otherwise we would fall through and report success
        return (HC_NormalFail (sequence, ERR_OPEN_FAILED));
    }
    return (HC_NormalSuccess (sequence, NODE_GETWORK));
}

int
normalGetWork (HC_Sequence_t *sequence, void *userdata)
{
    int *fd = (int *) userdata;
    int rc, i = 0;
    char buf[27];

    while( i < 25 )
    {
        HC_Checkpoint (sequence, CP_CLEAR); // clear checkpoints
        memset (buf, 0, sizeof (buf));
        rc = read (*fd, buf, 26);
        if (rc != 26)
        {
            return (HC_NormalFail (sequence, ERR_READ_FAILED));
        }
        HC_Checkpoint (sequence, CP_RECV_DATA);
        if (buf[0] != 'A')
        {
            // hand the offending first byte to the policy, which prints it
            return (HC_NormalFail (sequence, buf[0]));
        }
        printf("%d successful read(s)\n", ++i );
    }
    return (HC_NormalSuccess (sequence, NODE_EXIT));
}

int
rollbackInit (HC_Sequence_t *sequence, void *userdata,
              unsigned long checkpoint)
{
    printf("Open failed! Perhaps the server died?\n");
    sleep (1); // open failed - wait and try again
    return (HC_RollBackSuccess (sequence, NODE_INIT));
}

int
rollbackGetWork (HC_Sequence_t *sequence, void *userdata,
                 unsigned long checkpoint)
{
    int *fd = (int *) userdata;

    switch (checkpoint)
    {
    case CP_RECV_DATA:
        printf("Some bad data\n" );
        // fall through
    case CP_CLEAR:
        // we don't do anything different whether the read failed or the
        // data is bad.
        break;
    default:
        HC_Panic (sequence, "Invalid checkpoint %lu!\n", checkpoint);
        break;
    }
    // We say that we succeed so we try again. The policy will try 5 times,
    // then go to the previous node and reopen the file.
    return (HC_RollBackSuccess (sequence, NODE_GETWORK));
}

int
policyInit (HC_Sequence_t *sequence, void *userdata,
            HC_PolicyEvent_t event, long policy_data)
{
    switch (event)
    {
    case HC_NORMALFAIL:
        // Rollback forever
        return (HC_RollBackCurrent (sequence));
    default:
        return (HC_DefaultPolicy (sequence, userdata, event, policy_data));
    }
}

int
policyGetWork (HC_Sequence_t *sequence, void *userdata,
               HC_PolicyEvent_t event, long policy_data)
{
    int *fd = (int *)userdata;

    switch (event)
    {
    case HC_NORMALFAIL:
        // override default in case of failed read - there is no point in
        // trying this 5 times.
        if (policy_data == ERR_READ_FAILED)
        {
            close( *fd );
            return (HC_RollBackPrev (sequence));
        }
        // bad data: policy_data holds the first byte of the chunk
        printf("policy_data = %c\n", (int) policy_data );
        // fall through to the default policy
    default:
        return (HC_DefaultPolicy (sequence, userdata, event, policy_data));
    }
}

int
main (int argc, char *argv[])
{
    HC_NodelistNode_t nodes[] = {
        {NODE_INIT, HC_NORMALFUNC, normalInit},
        {NODE_GETWORK, HC_NORMALFUNC, normalGetWork},
        {NODE_INIT, HC_ROLLBACKFUNC, rollbackInit},
        {NODE_GETWORK, HC_ROLLBACKFUNC, rollbackGetWork},
        {NODE_INIT, HC_POLICYFUNC, policyInit},
        {NODE_GETWORK, HC_POLICYFUNC, policyGetWork},
        {0, HC_LISTEND, NULL}
    };
    HC_Sequence_t *sequence;
    int fd;
    int rc;

    sequence = HC_NewSequence (nodes);
    rc = HC_CallSequence (sequence, &fd, NODE_INIT);
    fprintf (stderr, "Sequence was: %s\n", HC_Strerror (rc));
    HC_DeleteSequence( sequence );
    return (EXIT_SUCCESS);
}
You can copy this source from the distribution:
$ cd ~/hafta
$ cp /opt/hafta/examples/getstart/client.c .
$ cc client.c -lcheckpoint -o client