We all use protocols continuously, and most of the time this works fine and we don’t really care about them. The protocol, however, determines a lot more than just which data you have to send, and in this blog post I will go into detail on the requirements for lossless communication at the application level. There is a big difference between lossless communication at a lower layer, such as TCP, and at the application layer.
Lossless communication cannot be achieved with just any protocol. The protocol must be designed for it, in such a way that it can handle all types of failure:
- Network failure
- Client crash
- Server crash
What follows is an explanation of how a couple of common protocol patterns can be used to handle the different types of failure, along with their limitations. Some general ideas used a lot in protocols are:
- Message numbers: these allow easy identification of duplicated messages, so that a message is not processed again if doing so multiple times would cause unwanted side effects. A ‘special’ message number, usually 0, can indicate that the application was restarted. This allows for some more advanced recovery strategies.
- Two-way communication: this happens if the server has events it wants to send to the clients, instead of the clients polling the server. In this case the client/server distinction fades a bit and they become more like peers, the only difference being that the client initiates the actual connection.
- Acknowledgements: to make sure the other side received and processed the message, an acknowledgement can be sent.
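To make the message-number idea a bit more concrete, here is a minimal sketch in Python. It assumes that messages from one peer are numbered 1, 2, 3, ... and arrive in order, and that 0 is the special restart number. The class and method names are made up for illustration.

```python
class Receiver:
    """Tracks the highest message number processed from one peer."""

    def __init__(self):
        self.last_seen = 0   # no message processed yet
        self.processed = []  # payloads whose side effects were applied

    def receive(self, number, payload):
        """Classify an incoming message as 'restart', 'duplicate' or 'processed'."""
        if number == 0:
            # Special number: the peer restarted, so its numbering starts over.
            self.last_seen = 0
            return "restart"
        if number <= self.last_seen:
            # Already handled: do not apply the side effects a second time.
            return "duplicate"
        self.last_seen = number
        self.processed.append(payload)
        return "processed"
```

A receiver would normally still acknowledge a duplicate, since the sender is apparently unaware that the first copy arrived.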
A common protocol type is the simple request/response pattern, such as HTTP. In this case the client initiates the request and the server only responds.
- Network failure: If a certain timeout passes before a response arrives, resend the request.
- Client crash: If the client has no knowledge of the response after restarting, resend the request.
- Server crash: If a certain timeout passes before a response arrives, resend the request.
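All three cases come down to the same client-side loop. The sketch below assumes a hypothetical `send` function that either returns a response or raises `TimeoutError`; `make_flaky_send` is only a stand-in transport to try it out, not a real network API.

```python
def request_with_retry(send, request, max_attempts=5):
    """Resend `request` until `send` returns a response or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Return the response together with the number of attempts it took.
            return send(request), attempt
        except TimeoutError:
            continue  # no response in time: resend the same request
    raise TimeoutError(f"no response after {max_attempts} attempts")


def make_flaky_send(failures):
    """Hypothetical transport that times out `failures` times, then answers."""
    state = {"left": failures}

    def send(request):
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("simulated timeout")
        return f"response to {request}"

    return send
```

Note that the client cannot tell whether the request or the response was lost, so the server may well see the same request more than once.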
There isn’t much more you can do than resend the request. If processing the same request multiple times is unacceptable, we can use message numbers to identify duplicates. This, however, means that the result of the request needs to be stored so that it can be sent back again. The biggest flaw here is that we don’t know how long we have to store the response: if the client goes offline before processing the response, the server may already have forgotten it by the time the client comes back online. One option is to store the response only until the next request comes in. This, however, means that the client can never make a next call before it has completely processed the response to the previous one. As a result, in some cases this pattern is not usable.
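The "store only the last response" option could look roughly like the sketch below. The names (`LastResponseServer`, `handle`, the client id) are assumptions made for the example, and the actual processing is replaced by a counter and a dummy result.

```python
class LastResponseServer:
    """Keeps only the most recent response per client."""

    def __init__(self):
        self.last = {}       # client id -> (message number, stored response)
        self.executions = 0  # how often a request was really processed

    def handle(self, client_id, number, request):
        stored = self.last.get(client_id)
        if stored is not None and stored[0] == number:
            # Duplicate of the latest request: replay the stored response.
            return stored[1]
        self.executions += 1
        response = f"done:{request}"
        # Storing the new response overwrites (forgets) the previous one.
        self.last[client_id] = (number, response)
        return response
```

Resending request 1 right away is harmless, but once request 2 has been handled the response to request 1 is gone, and a late resend of request 1 would be processed a second time. That is exactly the limitation described above.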
A safe protocol pattern is the three-way handshake. By sending an acknowledgement that we have received and processed the message, we let the server know that it doesn’t have to send the message again.
- Network failure: In case of a network failure we have to resend the message. To avoid duplicated messages being processed, we should use message numbers so that both sides can identify duplicates.
- Client failure: The client can fail before it has been able to store the fact that it sent the message. Either the client resends the request, or the server automatically resends the response.
- Server failure: If the server fails, the client keeps resending the same message until it gets the right response.
Some failure scenarios are shown in the images below, where messages are resent if there is no response.
It is important to always make sure the application takes all necessary actions to record that it received the message. Depending on the type of response, it may be necessary for the server to cache the request and only persist it after it has received the final acknowledgement. It is, however, also possible that a simple response stating that the request was already received is good enough.
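As a rough illustration of the server side of this pattern, the sketch below keeps a response around until the final acknowledgement arrives and answers later duplicates with a simple "already handled" reply. All names are hypothetical, and the network is reduced to an iterator `lose` that yields `True` for every transmission that gets dropped.

```python
class HandshakeServer:
    def __init__(self):
        self.pending = {}    # message number -> response awaiting the final ack
        self.done = set()    # message numbers that were acknowledged
        self.executions = 0  # how often a request was really processed

    def on_request(self, number, request):
        if number in self.done:
            return "already-handled"
        if number not in self.pending:
            self.executions += 1
            self.pending[number] = f"result:{request}"
        # Duplicates of an unacknowledged request get the cached response.
        return self.pending[number]

    def on_ack(self, number):
        if number in self.pending:
            del self.pending[number]
            self.done.add(number)


def run_exchange(server, number, request, lose):
    """Drive one exchange to completion over a lossy link."""
    while True:
        if next(lose, False):
            continue  # request lost: the client times out and resends
        response = server.on_request(number, request)
        if next(lose, False):
            continue  # response lost: the client times out and resends
        break
    while next(lose, False):
        pass  # ack lost: the server resends the response, the client acks again
    server.on_ack(number)
    return response
```

For example, `run_exchange(HandshakeServer(), 1, "order", iter([True, False, True, False, False, True, False]))` loses the first request, the first response and the first acknowledgement, yet the request is processed only once.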
From the client’s point of view the message is completely handled after it has finished processing the response. From the server’s point of view the message is completely handled after it has received the final acknowledgement. It is possible that the client restarts before sending the final acknowledgement. In that case the server will send the response again, and the client will recognize that it already handled this message and only send the acknowledgement. If we were to delay marking the message as handled until after sending the acknowledgement, the client might fail in between, making it think it did not yet process the message while the server is already finished, which leaves the client in an inconsistent state. Adding separate steps for ‘handled message’ and ‘sent ack’ has little benefit either, as there is still a chance that the client fails between sending the acknowledgement and changing the state, meaning you should always be able to recover from the ‘handled message’ state.
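The ordering on the client side can be sketched as follows. The `disk` dictionary stands in for durable storage that survives a restart, and `send_ack` is a hypothetical callback that transmits the acknowledgement; both are assumptions for the sake of the example.

```python
class Client:
    def __init__(self, disk):
        self.disk = disk  # dict standing in for storage that survives a restart

    def on_response(self, number, response, send_ack):
        handled = self.disk.setdefault("handled", set())
        if number not in handled:
            # Process the response and mark it as handled BEFORE acknowledging.
            self.disk.setdefault("effects", []).append(response)
            handled.add(number)
        # Always acknowledge, also for a response that was handled earlier.
        send_ack(number)
```

If the client crashes inside `send_ack`, a new `Client` created with the same `disk` will see the resent response, notice that the number is already marked as handled, and only send the acknowledgement.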
The three-way handshake is a very easy way to guarantee that no information gets lost without the possibility of recovering it. This, however, does not mean that the applications on top of the lower-level clients can safely assume nothing will go wrong. It remains the responsibility of the application to resend all messages that might have been lost in case of a system restart. Network failures or failures of the other party can, however, be dealt with by the lower-level layer.