Building A Hadoop Cluster With Docker – Part 3

This part continues where we left of with part 2, we will examine the other ways on how to upload files to a Hadoop cluster:
  1. Using the command line.
  2. From Java using the DFSClient
  3. From Java using the FileSystem

If you did not read the previous part, I highly recommend you to do so before continuing. Before diving into the new work, lets summarise what we learned from the previous part.

  • We had to fix the hostname of the container.
  • The mapping of ports caused a problem.
  • The filesystem permissions are very restrictive to non-root users.

The first two points means that the cluster can not be used as it was designed in part 1, and is restricted to a single datanode.

Continue reading


Building A Hadoop Cluster With Docker – Part 2

After setting up a HDFS cluster in part 1, it is now time to put some data in the cluster. There exists multiple ways to upload files to a Hadoop cluster:

  1. Using the Web UI of the namenode
  2. Using the REST API.
  3. Using the command line.
  4. From Java using the DFSClient
  5. From Java using the FileSystem

Treating them all in a single post would make the post too long (in my opinion). Therefor I have chosen to split it up into two separate posts, meaning that in this blog post I will only discuss the Web UI and REST API (which should be nearly identical). The other approaches will be handled in the next.

Even though you may not be interested in using the Web UI or the REST API to upload files to you Hadoop cluster, I recommend reading through this post as problems encountered here, will come back in the next one. Even more important though are the fixes and changes that are applied to get it working as they will have an impact on the other ways of uploading files as well.

Continue reading

Building A Hadoop Cluster With Docker – Part 1

One of the things I started in 2017 and wanted to get more knowledge about in 2018 was Hadoop. While it is possible and very easy to run Hadoop on a single machine, this is not what it is designed for. Using it in such a way would even make your program run slower because of the overhead that is involved. So since Hadoop is designed for big data, and distributes that data over a cluster of machines it makes sense to also run it like this. While it is possible to have many nodes running on the same machine, you would need to setup quite a lot of configuration to prevent port collisions and such.  On the other hand I don’t have a cluster available to do some experimenting with. Although I know it is possible to get some free tier machines at cloud providers such as Amazon, this is not something I want to do for a minor experiment like this. Instead I have chosen to create a virtual cluster by using Docker.


Not only does docker allow me to create a virtual cluster of nodes that have identical configuration, it also creates a more real-life setup. All the different clusters will have their own file system completely separate from the rest, you can selectively inspect and bring down a node to see how the cluster reacts. By choosing Docker I also hit two birds with one stone as it allows me to:

  • Freshen up my docker knowledge
  • Get a taste of Hadoop

I am in many aspects a minimalist. I always wonder how heavy my application is both with regards to memory and CPU usage. I do the same with my docker container. I always want to keep the size of the container as small as possible, but with Java this is not very easy.

So I started my journey with Alpine, on which I installed bash and Java, which is a hard requirement. Java could however not run due to some weird issues. I guess (and have read about it) that they are caused by some compiler/library which is different. But even using the version that was suggested did not fix the problem.

It didn’t take long before I gave up on this minimalistic endeavour as it is not the core of this experiment. I may look into it again at a later point, when I have my docker cluster running. But for now I decided to use the official Java 8 container.

My first experiment consisted of having Hadoop installed in a container and let it run a job. As a first test I used the included example (WordCount), which did not run immediately due to the class not being found. In contrary to what the example tells, I had to specify the fully qualified name of the class. This is a quick fix but an important one and something we should keep in the back of our mind.

Before doing the actual setup of a cluster I wanted to work out some small use case of my own. I decided on a very simple one, which consists of going through a log file and count how often we received a message from each client. Although in general it was easy to get this working, I did encounter my first problem where my output directory remained empty. It was not that easy to figure out because the log file didn’t immediately show the problem, but I was able to discover that it was caused by a NullPointerException as there was a line that did not match the expected format. In the end it was pretty easy to get my use case working.

For the actual cluster, I based myself on work from others, which made it very easy. To do this, I had to start using docker compose as I have to start a bunch of containers that are linked together instead of single containers. I also glanced into docker swarm, but this would only be required if I really had multiple devices. I created separate docker images for the datanode and namenode to allow for different configuration, they are of course based on the same general hadoop image. It was pretty easy to get my namnode up and running and have a bunch of datanodes connect to it.

But as I was going through the web interfaces I noticed that the log files were not available. Going back to the Hadoop setup page I noticed that they used a different script to setup the node as a daemon. This is not useable with docker as the script will spawn a child process and the original process itself will terminate, meaning the docker container will exit. Instead I had to keep using the direct ‘hdfs namenode’ approach. But I did notice that running the daemon script did have the logs file, where mine did not. After some examination of what this script does, and how configuration is used I discovered all I had to do was set an environment variable to write to file instead of the console.

So now I have a Hadoop cluster running consisting of a namenode and some datanodes. I still haven’t done anything with them, and that will be the topic of my next blog post. If you are interested in my docker cluster, you can find all the of the files on my GitHub account.

Using Docker In Development

I was introduced to Docker by a colleague of mine who attended a presentation about it at Devoxx 2014. I am always curious to try out something new, which is why I started experimenting with it.

Docker is becoming a buzz word, a hype. There are however a lot of articles that show why docker can not be used for production environments. I will not be talking about that, as a developer I use it only in my development environment.

What I will talk about are my experiences I have gather over the course of using it. Note that I am far from an experienced Docker user, nor am I a Docker fanatic. I am a simple developer who has experienced a bit with Docker.

Continue reading