After setting up a HDFS cluster in part 1, it is now time to put some data in the cluster. There exists multiple ways to upload files to a Hadoop cluster:
- Using the Web UI of the namenode
- Using the REST API.
- Using the command line.
- From Java using the DFSClient
- From Java using the FileSystem
Treating them all in a single post would make the post too long (in my opinion). Therefor I have chosen to split it up into two separate posts, meaning that in this blog post I will only discuss the Web UI and REST API (which should be nearly identical). The other approaches will be handled in the next.
Even though you may not be interested in using the Web UI or the REST API to upload files to you Hadoop cluster, I recommend reading through this post as problems encountered here, will come back in the next one. Even more important though are the fixes and changes that are applied to get it working as they will have an impact on the other ways of uploading files as well.
When trying to upload a file using the Web UI of the namenode, which I expect is the easiest way (meaning least chance of problems), already fails. The error (“Couldn’t upload the file”) doesn’t reveal any indication as to what the problem is. Trying to create a directory also fails, but shows a meaningful error:
Permission denied: user=dr.who, access=WRITE, inode=”/”:root:supergroup:drwxr-xr-x
So the message clearly shows that the user ‘dr.who’ does not have permission to write on the node. Two questions we should ask ourself:
- Who is ‘dr.who’ and why is that user used?
- On which node does this user need permissions?
Since after shutting down the datanode, I still get the exact same message, this must mean it needs permissions on the namenode. But who is ‘dr.who’? Well, looking at the default configuration of hadoop, this is the default http user. Since I never invested any time in setting up decent permissions on my nodes, I want to run all of my operations as the root user. Luckily, this is easy to do by changing the configuration value.
After changing this, I can create directories through the namenode, even if there is no datanode up and running. So this already makes it clear that the namenode is the one who is responsible for the while filesystem structure and the datanodes are completely unaware and only have blocks. Moreover, the actual directory is never physically created, the actual directory is of course not required because it will never be used to store any data in it.
Uploading a file however, still fails with the exact same message. When trying to debug such an issue, the log files are always the place to go. But the only log file that is created doesn’t show anything related to this problem. Even after changing the log configuration, such that every possible log file is used still doesn’t show any information about this
Since I assume that the Web UI uses the REST API, I decide to try this REST API myself. The Hadoop documentation is very clear in which calls you have to make. First you have to make a call to the namenode:
curl -i -X PUT http://localhost:50070/webhdfs/v1/test.md?op=CREATE
The namenode replies with a redirect request to a datanode. This request contains the location to redirect to, which in my case is:
Looking at the redirect location, it is clear that this redirect will never be possible as the hostname is the container id. This can not be resolved and will result in an ‘unknown host’ exception. But there is also another problem, the port. While the container internally uses port 50075, this is mapped to a random port to the outside. To fix this problem, and use my Hadoop cluster as is, where all datanodes have the same configuration, we would need some reverse lookup that would translate the request and forward it to the right container. This translation however would depend on both the hostname and the port, which could be impossible, but I’m not a network expert.
Since this is not my current goal, I decided to postpone this, and for now work with a single datanode. This allows me to set the hostname of the datanode to ‘localhost’ and map the port to 50075 to the outside. With these changes I try again to upload a file using the Web UI, but I still get the same error. Looking into the log file again shows me that it is again related to user permissions. The file we are trying to upload is being uploaded as our good old ‘dr.who’. So for some reason does the namenode not pass on its own user, leaving it up to the datanode to fall back to the default user. Setting the same configuration as we did with the namenode finally allows me to upload a file.
Using the REST API I can now continue with the second call, since the location returned by the namenode is now resolvable. After making some stupid mistake where I didn’t escape the ‘&’ in the URL, I got a response from the datanode.
curl -i -X PUT -T README.md http://localhost:50075/webhdfs/v1/test.md?op=CREATE\&namenoderpcaddress=hdfs-namenode:9000\&createflag=\&createparent=true&overwrite=false
Of course, it was again using the default user. Specifying a user in the REST call is as simple as adding the ‘user.name’ parameter to the call.
curl -i -X PUT http://hdfs-namenode:50070/webhdfs/v1/test.md?user.name=root\&op=CREATE
When making a call to the redirect location I finally succeeded in uploading a file using the REST API myself.
It has been a long struggle to get everything working, and I learned a lot about the behaviour of Hadoop and how combined with Docker this is hitting the limits of what is possible. I still believe a reverse lookup could be possible, and it is something I will try later, after I have managed to completely setup the Hadoop cluster.