One of the things I started in 2017 and wanted to learn more about in 2018 was Hadoop. While it is possible and very easy to run Hadoop on a single machine, that is not what it is designed for; using it that way would even make your program run slower because of the overhead involved. Since Hadoop is designed for big data and distributes that data over a cluster of machines, it makes sense to run it that way too. While it is possible to have many nodes running on the same machine, you would need quite a lot of configuration to prevent port collisions and the like. On the other hand, I don’t have a cluster available to experiment with. I know it is possible to get some free-tier machines at cloud providers such as Amazon, but that is not something I want to do for a minor experiment like this. Instead I have chosen to create a virtual cluster using Docker.
Not only does Docker allow me to create a virtual cluster of nodes with identical configuration, it also creates a more real-life setup. Each node has its own file system, completely separate from the rest, and you can selectively inspect or bring down a node to see how the cluster reacts. By choosing Docker I also kill two birds with one stone, as it allows me to:
- Freshen up my docker knowledge
- Get a taste of Hadoop
I am in many respects a minimalist. I always wonder how heavy my application is, both in terms of memory and CPU usage, and I do the same with my Docker containers: I want to keep the image size as small as possible. With Java, however, this is not very easy.
So I started my journey with Alpine, on which I installed bash and Java (a hard requirement for Hadoop). Java, however, would not run due to some weird errors. From what I have read, these are caused by Alpine using the musl C library instead of the glibc that standard JDK builds expect. But even using the suggested alternative build did not fix the problem.
It didn’t take long before I gave up on this minimalistic endeavour, as it is not the core of this experiment. I may look into it again at a later point, once I have my Docker cluster running. For now I decided to use the official Java 8 container.
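For reference, the image I ended up with starts from the official Java image and unpacks a Hadoop release on top of it. A rough sketch (the Hadoop version and download URL are illustrative placeholders, not my exact setup):

```dockerfile
# Sketch: official Java 8 base image plus a Hadoop 2.x release.
FROM openjdk:8-jdk
ENV HADOOP_VERSION=2.8.3
# Download and unpack Hadoop (mirror URL is illustrative)
RUN curl -fsSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt \
 && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop \
    PATH="/opt/hadoop/bin:${PATH}"
```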
My first experiment consisted of installing Hadoop in a container and letting it run a job. As a first test I used the included WordCount example, which did not run immediately because the class could not be found. Contrary to what the example documentation says, I had to specify the fully qualified name of the class. It is a quick fix, but an important one to keep in the back of our minds.
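Concretely, where the documentation runs the example by its short class name, I had to pass the fully qualified name (jar name and paths are illustrative):

```
# As documented — failed for me because the class was not found:
# hadoop jar hadoop-mapreduce-examples-2.8.3.jar WordCount input output

# What worked: the fully qualified class name
hadoop jar hadoop-mapreduce-examples-2.8.3.jar org.apache.hadoop.examples.WordCount input output
```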
Before setting up an actual cluster I wanted to work out a small use case of my own. I decided on a very simple one: going through a log file and counting how often we received a message from each client. Although in general it was easy to get this working, I did encounter my first problem: my output directory remained empty. It was not easy to figure out, because the log file didn’t immediately show the problem, but I eventually discovered it was caused by a NullPointerException thrown for a line that did not match the expected format. Once that was handled, it was pretty easy to get my use case working.
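The fix boils down to treating a malformed line as absent instead of blindly dereferencing the parse result. A minimal, Hadoop-free sketch of that parsing logic (the log format and class name are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical log format: "<timestamp> <client-id> <message>"
public class ClientCounter {
    // Returns the client id, or null when the line does not match the expected format.
    static String extractClient(String line) {
        String[] parts = line.split("\\s+", 3);
        return parts.length == 3 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2018-01-02T10:00:00 clientA hello",
            "malformed",                          // would have caused the NPE
            "2018-01-02T10:00:05 clientB hi",
            "2018-01-02T10:00:09 clientA again"
        };
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String client = extractClient(line);
            if (client == null) continue;         // skip bad lines instead of crashing
            counts.merge(client, 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

In the real mapper the same guard applies: skip (and ideally count) unparseable lines rather than letting the job die on them.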
For the actual cluster I based myself on the work of others, which made it very easy. To do this I had to start using Docker Compose, as I now have to start a bunch of linked containers instead of single ones. I also glanced at Docker Swarm, but that would only be required if I really had multiple devices. I created separate Docker images for the datanode and namenode to allow for different configuration; they are of course based on the same general Hadoop image. It was pretty easy to get my namenode up and running and have a bunch of datanodes connect to it.
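The Compose file comes down to something like this (the image names and port are my assumptions for illustration, not necessarily the exact files on GitHub):

```yaml
version: "3"
services:
  namenode:
    image: my-hadoop-namenode    # built from the namenode Dockerfile
    ports:
      - "50070:50070"            # HDFS web UI in Hadoop 2.x
  datanode:
    image: my-hadoop-datanode    # built from the datanode Dockerfile
    depends_on:
      - namenode
```

A nice bonus of Compose is that you can scale out the datanodes without touching the file, e.g. `docker-compose up --scale datanode=3`.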
But as I was going through the web interfaces I noticed that the log files were not available. Going back to the Hadoop setup page, I saw that it uses a different script to run the node as a daemon. That is not usable with Docker, as the script spawns a child process while the original process terminates, which means the Docker container will exit. Instead I had to keep using the direct ‘hdfs namenode’ approach. I did notice, however, that the daemon script did produce log files, where mine did not. After examining what the script does and how the configuration is used, I discovered that all I had to do was set an environment variable to write the logs to a file instead of the console.
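For the curious, the mechanism here is Hadoop’s root-logger setting, which the daemon scripts export before launching the process. Assuming the standard `HADOOP_ROOT_LOGGER` variable (RFA being Hadoop’s rolling-file appender), the foreground approach can do the same:

```
# Send logs to the rolling-file appender (RFA) instead of the console,
# mirroring what the daemon scripts do before starting the process.
export HADOOP_ROOT_LOGGER=INFO,RFA
hdfs namenode
```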
So I now have a Hadoop cluster running, consisting of a namenode and some datanodes. I still haven’t done anything with them; that will be the topic of my next blog post. If you are interested in my Docker cluster, you can find all of the files on my GitHub account.