Building A Hadoop Cluster With Docker – Part 4

With the HDFS cluster set up, we still need a way to actually use the data it holds. Although using the cluster purely to duplicate and back up data is a viable option, that is not really what it is meant for. So we should start setting up everything that is required to run MapReduce jobs on the cluster.
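
To give a first idea of what such a job looks like, here is a sketch along the lines of the canonical word-count example from the Hadoop MapReduce tutorial. It is illustrative only, not the specific job this series will end up running:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in an input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums all the counts emitted for a word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }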

Continue reading

Building A Hadoop Cluster With Docker – Part 3

This part continues where we left off in part 2. We will examine the remaining ways to upload files to a Hadoop cluster:

  1. Using the command line.
  2. From Java using the DFSClient.
  3. From Java using the FileSystem (see the sketch after this list).
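
As a taste of option 3, here is a minimal sketch of an upload through the FileSystem API. The namenode host name, port and paths are placeholders for whatever your cluster uses:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "namenode" and 8020 are placeholders for your namenode host and RPC port.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local file into the root of HDFS.
                fs.copyFromLocalFile(new Path("data.txt"), new Path("/data.txt"));
            }
        }
    }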

If you did not read the previous part, I highly recommend doing so before continuing. Before diving into the new work, let's summarise what we learned there:

  • We had to fix the hostname of the container.
  • The mapping of ports caused a problem.
  • The filesystem permissions are very restrictive to non-root users.

The first two points mean that the cluster cannot be used as it was designed in part 1, and is restricted to a single datanode.
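
As a hedged illustration of that last point: when Hadoop runs without Kerberos security, the Java client simply trusts the user name it finds in the HADOOP_USER_NAME environment variable or system property. One possible workaround, an assumption rather than necessarily the fix this series settles on, is therefore:

    // With security off, the Hadoop client reports whatever user name
    // HADOOP_USER_NAME holds, so HDFS will treat us as "root" (assuming
    // the files on this cluster are owned by root). Set it before the
    // first FileSystem call.
    System.setProperty("HADOOP_USER_NAME", "root");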

Continue reading

Building A Hadoop Cluster With Docker – Part 2

After setting up an HDFS cluster in part 1, it is now time to put some data into it. There are multiple ways to upload files to a Hadoop cluster:

  1. Using the Web UI of the namenode.
  2. Using the REST API.
  3. Using the command line.
  4. From Java using the DFSClient.
  5. From Java using the FileSystem.

Treating them all in a single post would make it too long (in my opinion). Therefore I have chosen to split it into two separate posts: in this one I will only discuss the Web UI and the REST API (which should be nearly identical), and the other approaches will be handled in the next.
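
For a flavour of the REST side, here is a sketch of the two-step upload that the WebHDFS API uses. The host name, the pre-Hadoop-3 default port 50070 and the paths are placeholders:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WebHdfsUpload {
        public static void main(String[] args) throws Exception {
            // Step 1: ask the namenode where to write. It answers with a 307
            // redirect whose Location header points at a datanode. Depending
            // on permissions you may need to append "&user.name=root".
            URL url = new URL("http://namenode:50070/webhdfs/v1/data.txt?op=CREATE&overwrite=true");
            HttpURLConnection nn = (HttpURLConnection) url.openConnection();
            nn.setRequestMethod("PUT");
            nn.setInstanceFollowRedirects(false);
            String location = nn.getHeaderField("Location");
            nn.disconnect();

            // Step 2: send the file contents to the datanode we were redirected to.
            HttpURLConnection dn = (HttpURLConnection) new URL(location).openConnection();
            dn.setRequestMethod("PUT");
            dn.setDoOutput(true);
            try (OutputStream out = dn.getOutputStream()) {
                out.write(Files.readAllBytes(Paths.get("data.txt")));
            }
            System.out.println("HTTP " + dn.getResponseCode()); // 201 Created on success
        }
    }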

Even if you are not interested in using the Web UI or the REST API to upload files to your Hadoop cluster, I recommend reading through this post, as the problems encountered here will come back in the next one. Even more important are the fixes and changes applied to get it working, as they will have an impact on the other ways of uploading files as well.

Continue reading

Building A Hadoop Cluster With Docker – Part 1

One of the things I started in 2017 and wanted to get more knowledge about in 2018 was Hadoop. While it is possible, and very easy, to run Hadoop on a single machine, this is not what it is designed for; using it that way would even make your program run slower because of the overhead involved. Since Hadoop is designed for big data and distributes that data over a cluster of machines, it makes sense to also run it on such a cluster. While it is possible to have many nodes running on the same machine, you would need quite a lot of configuration to prevent port collisions and the like. On the other hand, I don’t have a cluster available to experiment with. Although I know it is possible to get some free-tier machines at cloud providers such as Amazon, this is not something I want to do for a minor experiment like this. Instead, I have chosen to create a virtual cluster using Docker.

Continue reading