Building A Hadoop Cluster With Docker – Part 3

This part continues where we left off in part 2; we will examine the remaining ways to upload files to a Hadoop cluster:

  1. Using the command line.
  2. From Java using the DFSClient.
  3. From Java using the FileSystem (sketched below).
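
To give a taste of where we are heading, below is a minimal sketch of the FileSystem approach. The namenode address (hdfs://localhost:9000) and the file paths are illustrative assumptions, not values taken from the cluster we built; adjust them to your own port mappings.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed namenode address; adjust to your own setup.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");

            // FileSystem is the high-level client API; the DFSClient sits one level below it.
            try (FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf)) {
                // Copy a local file into HDFS (both paths are hypothetical).
                fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/root/input.txt"));
            }
        }
    }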

If you did not read the previous part, I highly recommend doing so before continuing. Before diving into the new work, let's summarise what we learned from the previous part.

  • We had to fix the hostname of the container.
  • The mapping of ports caused a problem.
  • The filesystem permissions are very restrictive for non-root users.

The first two points mean that the cluster cannot be used as it was designed in part 1, and is restricted to a single datanode.
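
For the Java-based approaches, these points translate into client-side settings. The sketch below shows the two knobs involved: dfs.client.use.datanode.hostname, which makes the client contact datanodes by hostname rather than by their container-internal IP, and the user argument of FileSystem.get, which lets you act as a privileged HDFS user. The address and the user name root are assumptions for illustration.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DockerFriendlyClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed mapped namenode port

            // Resolve datanodes by hostname instead of their container-internal IP,
            // which is unreachable from outside the Docker network.
            conf.set("dfs.client.use.datanode.hostname", "true");

            // Connect as "root" (assumed superuser) to get around the restrictive
            // filesystem permissions mentioned above.
            try (FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf, "root")) {
                System.out.println(fs.getFileStatus(new Path("/")));
            }
        }
    }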

Continue reading

Building A Hadoop Cluster With Docker – Part 2

After setting up an HDFS cluster in part 1, it is now time to put some data in the cluster. There are multiple ways to upload files to a Hadoop cluster:

  1. Using the Web UI of the namenode.
  2. Using the REST API.
  3. Using the command line.
  4. From Java using the DFSClient.
  5. From Java using the FileSystem.

Treating them all in a single post would make it too long (in my opinion). Therefore, I have chosen to split the topic into two separate posts, meaning that in this blog post I will only discuss the Web UI and the REST API (which should be nearly identical). The other approaches will be handled in the next one.

Even though you may not be interested in using the Web UI or the REST API to upload files to your Hadoop cluster, I recommend reading through this post, as the problems encountered here will come back in the next one. Even more important, though, are the fixes and changes applied to get things working, as they will have an impact on the other ways of uploading files as well.
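
To make the REST approach concrete, here is a minimal Java sketch of a WebHDFS upload. The namenode address (localhost:50070, the Hadoop 2 default HTTP port), the user name and the target path are illustrative assumptions. A WebHDFS upload happens in two steps: the namenode replies with a redirect whose Location header points at a datanode, and that hostname is precisely where the problems described in this post surface.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class WebHdfsUpload {
        public static void main(String[] args) throws Exception {
            // Assumed namenode HTTP address, user name and target path.
            String createUrl = "http://localhost:50070/webhdfs/v1/user/root/hello.txt"
                    + "?op=CREATE&user.name=root&overwrite=true";

            // Step 1: ask the namenode where to write; it answers with a 307 redirect.
            HttpURLConnection nn = (HttpURLConnection) new URL(createUrl).openConnection();
            nn.setRequestMethod("PUT");
            nn.setInstanceFollowRedirects(false); // inspect the Location header ourselves
            String dataNodeUrl = nn.getHeaderField("Location"); // contains the datanode's hostname
            nn.disconnect();

            // Step 2: PUT the actual file content to the datanode.
            HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
            dn.setRequestMethod("PUT");
            dn.setDoOutput(true);
            try (OutputStream out = dn.getOutputStream()) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Response code: " + dn.getResponseCode()); // 201 on success
            dn.disconnect();
        }
    }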

Continue reading

Building A Hadoop Cluster With Docker – Part 1

One of the things I started in 2017 and wanted to get more knowledge about in 2018 was Hadoop. While it is possible and very easy to run Hadoop on a single machine, this is not what it is designed for. Using it in such a way would even make your program run slower because of the overhead involved. Since Hadoop is designed for big data and distributes that data over a cluster of machines, it makes sense to also run it like this. While it is possible to have many nodes running on the same machine, you would need to set up quite a lot of configuration to prevent port collisions and such. On the other hand, I don’t have a cluster available to do some experimenting with. Although I know it is possible to get some free tier machines at cloud providers such as Amazon, this is not something I want to do for a minor experiment like this. Instead I have chosen to create a virtual cluster using Docker.

Continue reading

Using Docker In Development

I was introduced to Docker by a colleague of mine who attended a presentation about it at Devoxx 2014. I am always curious to try out something new, which is why I started experimenting with it.

Docker is becoming a buzzword, a hype. There are, however, a lot of articles that show why Docker cannot be used in production environments. I will not be talking about that; as a developer, I use it only in my development environment.

What I will talk about are the experiences I have gathered over the course of using it. Note that I am far from an experienced Docker user, nor am I a Docker fanatic. I am a simple developer who has experimented a bit with Docker.

Continue reading