With the HDFS cluster set up, we still need a way to actually use the data. Although using the cluster purely for duplication and backup of data is a viable option, that is not really what it is meant for. So we should start setting up everything that is required to run MapReduce jobs on the cluster.
The first thing we need to set up is the ResourceManager. The ResourceManager is similar to the NameNode in the sense that it performs a critical function and should run in a separate container.
The setup of the ResourceManager is relatively easy, but compared to the HDFS nodes we cannot start from an empty config, as there are too many essential options in ‘capacity-scheduler.xml’. If this configuration file is missing, the ResourceManager will not start at all; it does not even fall back to default configuration values. Beyond that, the only change to the configuration files is overriding the default user.
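The user override is presumably done the same way as for the HDFS daemons; a sketch, assuming the container runs everything as root:

```shell
# Sketch: Hadoop 3 refuses to run its daemons as root unless the
# corresponding *_USER variables are set, e.g. in hadoop-env.sh.
# Assumes the container runs everything as root.
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```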
By exposing the Web UI port (8088 by default), you can already see the Web UI, which looks completely different from the one of the DataNode or NameNode.
The next step is to set up a NodeManager, and because of the way Hadoop works, it should run in the same container as the DataNode. This poses a small problem, as a Docker container can only have a single start command. There are ways around this, like starting both applications from the same script, but it would also be nice if the container quit as soon as either the DataNode or the NodeManager exits.
Hadoop does have mechanisms to start processes as daemons, but daemonizing both does not work: the container would consider its start command finished and simply stop. Running one as a daemon and the other in the foreground would give up the possibility of quitting the container when the daemon process exits.
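One possible sketch of such a start script (untested; `wait -n` needs bash 4.3 or newer):

```shell
#!/bin/bash
# Sketch: run both processes in the background and exit as soon as
# either of them stops, taking the container down with it.
hdfs datanode &
yarn nodemanager &
# wait -n returns as soon as the first background job exits.
wait -n
exit $?
```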
As a first step, let's just create a NodeManager in a separate container and let it connect to the ResourceManager. This should allow us to flush out problems such as unexposed ports. And indeed, the NodeManager cannot connect to the ResourceManager until the following ports are reachable: 8030, 8031, 8032, and 8033.
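For reference, a sketch of the ports the ResourceManager container has to expose for NodeManagers, in addition to the Web UI; the image name and exact flags are assumptions and depend on how the containers are networked:

```shell
# Sketch: the scheduler (8030), resource-tracker (8031), client (8032)
# and admin (8033) RPC ports, plus the Web UI (8088).
# 'my-hadoop-image' is a placeholder, not the real image name.
docker run -d --name resourcemanager \
  -p 8088:8088 -p 8030:8030 -p 8031:8031 \
  -p 8032:8032 -p 8033:8033 \
  my-hadoop-image yarn resourcemanager
```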
One of the things you can see in the Web UI is that there are no log files on the ResourceManager. This looks familiar, as we had a similar problem with the NameNode, but in contrast to the NameNode, the problem here is simply that the folder does not exist and no files are written to it. This of course makes you wonder why YARN does not automatically log to file. Is it because some loggers have the wrong settings?
With the ResourceManager set up and a NodeManager connected to it, we can start launching jobs. But how do you do that? From a quick search, it looks like there are a couple of options:
- Command line: by using the ‘mapred job’ command.
- The REST API
- From code
To launch a job from the command line, you need some job file. I could not find out what format this file needs to be; just using a jar file fails with a SAX parsing exception, and the format is not documented anywhere. The REST API can only be used to query the status of jobs, not to create a new one, or at least so it seems. So the only option left for me to test is from code.
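For reference, the command has roughly this shape; the job file appears to be expected as an XML job configuration rather than a jar, which would at least explain the SAX parsing exception:

```shell
# Sketch: 'mapred job -submit' seems to expect a job *configuration*
# file (XML), not a jar. 'my-job.xml' is a placeholder name.
mapred job -submit my-job.xml
# Passing a jar here fails with a SAX parsing exception,
# presumably because the file is parsed as XML.
```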
I had once launched a local job with local data, so I had to change it a bit. Adding some configuration allowed me to connect the job to the HDFS cluster, but running it still failed because the user ‘chris’ was not allowed to write to the output folder. There are two possible solutions: change the user to ‘root’, or change the ownership of the folders, since every user writes to his own directory anyway.
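Both options can be sketched like this; the paths are assumptions based on HDFS's usual per-user home directory layout:

```shell
# Option 1: give 'chris' his own home directory on HDFS.
hdfs dfs -mkdir -p /user/chris
hdfs dfs -chown chris /user/chris

# Option 2: make the client act as 'root' (works with the default
# simple authentication, where the user is not verified).
export HADOOP_USER_NAME=root
```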
With this done, I was able to run a local job using data from the HDFS cluster. The configuration I had to add in my code for this is as follows:
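A sketch of what that configuration looks like; the NameNode hostname and port are assumptions from my setup:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: point the job at the HDFS cluster instead of the local
// file system. "namenode" and port 9000 are assumptions.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:9000");
// The job itself still runs locally:
conf.set("mapreduce.framework.name", "local");
```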
But how do you run the job itself on the cluster as well? From a quick look, it seems you have to create a YARN client application and do a lot more setup than just launching a quick job. There is not much documentation about this, which makes it much harder than it should be.
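The full YARN client application seems to be needed for arbitrary applications; for a plain MapReduce job, it may be enough to switch the framework in the job configuration. An untested sketch, with assumed hostnames:

```java
// Untested sketch: submit a plain MapReduce job to the cluster by
// switching the framework from "local" to "yarn".
// "resourcemanager" and the jar name are placeholders.
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.hostname", "resourcemanager");
// The job jar must also be available to the cluster:
job.setJar("my-job.jar");
```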
Due to the amount of work and the complexity involved, I have decided to leave my Hadoop cluster setup like this for now. I was able to do a very basic setup, and trying to do more would involve a lot more effort and investigation to set everything up.
The open points at the moment are:
- Running the NodeManager and DataNode in the same container.
- Launching Jobs on the cluster.
- How does a NodeManager know which data is stored locally?
I do not intend to investigate these topics anytime soon, as I will focus my attention on other topics first. So this concludes my first experience with Hadoop.