Graceful Server Shutdown

A common feature for servers is the ability to shut down without killing the work that is still in progress. Instead, the server finishes that work first and only then shuts down. While this sounds like a simple feature that is easy to achieve, there are a couple of concerns that need to be addressed:

  • New work should be rejected but existing work should be finished.
  • What about existing work that triggers a new job?
  • What if one of the jobs is hanging?
  • Should there exist some intermediate state where the server does not accept new work but keeps running for inspection?
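The first and third concerns can be illustrated with a minimal Python sketch. It is not our implementation; the class name `WorkerComponent` and its methods are hypothetical. It assumes a component that counts its in-flight jobs, rejects anything submitted after shutdown has begun, and lets the caller wait for the remaining jobs with a timeout, so a hanging job cannot block forever.

```python
import threading


class WorkerComponent:
    """Hypothetical component that tracks its in-flight jobs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._accepting = True
        self._in_flight = 0

    def submit(self, job):
        """Run job in the calling thread; return False if it was rejected."""
        with self._lock:
            if not self._accepting:
                return False  # new work is rejected once shutdown has begun
            self._in_flight += 1
        try:
            job()
        finally:
            with self._idle:
                self._in_flight -= 1
                if self._in_flight == 0:
                    self._idle.notify_all()
        return True

    def begin_shutdown(self):
        """Stop accepting new work; jobs already running are left alone."""
        with self._lock:
            self._accepting = False

    def wait_until_idle(self, timeout):
        """Return True if all jobs finished within timeout seconds."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

In this sketch a job that tries to submit a follow-up job after `begin_shutdown()` is rejected like any other new work. Whether that is the right answer to the second concern is a policy decision the sketch does not settle.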

The software we build consists of four different types of servers, none of which offer this feature. This is a common complaint from our customers, who currently have to kill active jobs and restart them later. We are now busy tackling this problem and incorporating the feature into the software.

Our servers consist of many different components. While we are technically providing a way to gracefully shut down a single component, this will not be available to users yet. Instead, the entire server will be shut down, and the server will forward the request to every component. The actions taken for a graceful shutdown differ per component: for some it is as easy as refusing to take new calls, while others require much more logic to determine whether they have to accept something or not.
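A rough Python sketch of this fan-out, under assumed names: `Component`, `CallHandler`, `JobRunner` and `Server` are all hypothetical and only illustrate the idea that one server-level request reaches every component, while each component decides for itself what "graceful" means.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    @abstractmethod
    def request_shutdown(self):
        """Switch the component into its own graceful-shutdown behaviour."""


class CallHandler(Component):
    """Simple case: after shutdown is requested, every call is refused."""

    def __init__(self):
        self.accepting = True

    def request_shutdown(self):
        self.accepting = False

    def accepts(self, call_id):
        return self.accepting


class JobRunner(Component):
    """More involved case: work for jobs that already started is still
    accepted while draining; work for unknown jobs is refused."""

    def __init__(self):
        self.draining = False
        self.active_jobs = set()

    def start(self, job_id):
        self.active_jobs.add(job_id)

    def request_shutdown(self):
        self.draining = True

    def accepts(self, job_id):
        return not self.draining or job_id in self.active_jobs


class Server:
    def __init__(self, components):
        self.components = list(components)

    def shutdown(self):
        # The server does not know the details; it only forwards the request.
        for component in self.components:
            component.request_shutdown()
```

Keeping the per-component logic behind a single `request_shutdown()` method is one way to let the server stay ignorant of how each component winds down.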

We have opted to keep the server running even after the shutdown so that users can examine it. This should help with troubleshooting and allows the server to be restarted from our user interface instead of starting the process again. Besides the ‘stopped’ state we also provide a couple of others, such as ‘starting’, ‘running’, ‘maintenance’ and ‘terminating’. These states will not be discussed here, but I believe the idea behind each of them is pretty straightforward.

We have decided to use a grace period that allows the server and each individual component to finish their jobs. Once this grace period is over, the components will be forcefully killed by the manager, which takes care of all the state changes. To avoid slow calls when retrieving the current server state, the state is cached locally and updated based on events from the components. The grace period should prevent hanging jobs from blocking a server shutdown, but the danger exists that the manager will not be able to shut down a component, for instance if the component ignores system interrupts.
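The grace period and the cached state can be sketched in Python as follows. This is an illustration under assumptions, not our actual manager: the `Manager` class is hypothetical, and it assumes each component exposes a `name`, a `request_stop(notify)` method that reports state changes through the `notify` callback, and a `kill()` method.

```python
import time


class Manager:
    """Hypothetical manager that owns the state changes of its components."""

    def __init__(self, components, grace_period):
        self.components = list(components)
        self.grace_period = grace_period  # seconds
        # Cached locally, so a state query never has to call a component.
        self._state = {c.name: "running" for c in self.components}

    def on_event(self, name, new_state):
        """Called by a component whenever its state changes."""
        self._state[name] = new_state

    def state_of(self, name):
        return self._state[name]

    def shutdown(self, poll_interval=0.01):
        """Ask every component to stop; kill the ones that miss the deadline.

        Returns the names of the components that had to be killed.
        """
        for c in self.components:
            self._state[c.name] = "terminating"
            c.request_stop(self.on_event)

        deadline = time.monotonic() + self.grace_period
        while time.monotonic() < deadline:
            if all(s == "stopped" for s in self._state.values()):
                return []
            time.sleep(poll_interval)

        killed = []
        for c in self.components:
            if self._state[c.name] != "stopped":
                # Assumption: kill() succeeds. As noted above, a component
                # that ignores interrupts may survive this in practice.
                c.kill()
                self._state[c.name] = "stopped"
                killed.append(c.name)
        return killed
```

Polling the cache is the simplest option for a sketch; an event or condition variable would avoid the polling delay.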

If it were up to me, I would have chosen to let the component itself be responsible for its own state instead of an external manager. The main reason is that these two will always be closely coupled. On top of that, the manager needs the authority to shut down the component, which from the component's point of view is very intrusive. Instead, the architect has decided to separate them, with ‘separation of concerns’ as the main reason. In my opinion that is a fallacy, as both the component and the manager share the same concern: ‘doing or not doing work’.

Development of the feature has only just begun, and many things will be discovered over the next couple of months.
