After setting up an HDFS cluster in part 1, it is now time to put some data in it. There are multiple ways to upload files to a Hadoop cluster:
- Using the Web UI of the namenode
- Using the REST API
- Using the command line
- From Java, using the DFSClient
- From Java, using the FileSystem
Covering them all in a single post would make it too long (in my opinion). Therefore I have chosen to split it into two separate posts: in this post I will only discuss the Web UI and the REST API (which should be nearly identical), and the other approaches will be handled in the next.
Even if you are not interested in using the Web UI or the REST API to upload files to your Hadoop cluster, I recommend reading through this post, as the problems encountered here will come back in the next one. Even more important are the fixes and changes applied to get things working, as they will have an impact on the other ways of uploading files as well.
Uploading a file using the Web UI of the namenode, which I expected to be the easiest way (meaning the least chance of problems), already fails. The error ("Couldn't upload the file") doesn't give any indication of what the problem is. Trying to create a directory also fails, but at least shows a meaningful error:
Permission denied: user=dr.who, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
- Who is ‘dr.who’ and why is that user used?
- On which node does this user need permissions?
Even after shutting down the datanode I still get the exact same message, so the permissions must be needed on the namenode. But who is 'dr.who'? Looking at the default Hadoop configuration, it is the default static user for the HTTP interfaces, set by the `hadoop.http.staticuser.user` property. Since I never invested any time in setting up proper permissions on my nodes, I want to run all of my operations as the root user. Luckily, this is easy to do by changing that configuration value.
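Concretely, the static user can be overridden in `core-site.xml`. The property name and its `dr.who` default come from the Hadoop defaults; the value `root` matches the way this cluster is being run:

```xml
<!-- core-site.xml: make the web UI and WebHDFS act as "root"
     instead of the default static user "dr.who" -->
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>root</value>
</property>
```

After changing this, the namenode needs to be restarted for the new value to take effect.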
curl -i -X PUT http://localhost:50070/webhdfs/v1/test.md?op=CREATE
curl -i -X PUT -T README.md http://localhost:50075/webhdfs/v1/test.md?op=CREATE\&namenoderpcaddress=hdfs-namenode:9000\&createflag=\&createparent=true\&overwrite=false
curl -i -X PUT http://hdfs-namenode:50070/webhdfs/v1/test.md?user.name=root\&op=CREATE
When making a call to the redirect location, I finally succeeded in uploading a file using the REST API.
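The two-step flow above (first ask the namenode where to write, then PUT the file body to the datanode it redirects to) can be scripted. The sketch below only assumes the hostnames and the `user.name=root` parameter used earlier in this post; the curl calls are shown as comments since they need a live cluster, while the small helper that extracts the redirect target is demonstrated on a canned response:

```shell
# Extract the Location header from a raw HTTP response (curl -i output).
extract_location() {
  grep -i '^Location:' | awk '{print $2}' | tr -d '\r'
}

# Against a live cluster, the two steps would be:
#   LOCATION=$(curl -s -i -X PUT \
#     "http://hdfs-namenode:50070/webhdfs/v1/test.md?user.name=root&op=CREATE" \
#     | extract_location)
#   curl -i -X PUT -T README.md "$LOCATION"

# Demo of the helper on a canned 307 response:
printf 'HTTP/1.1 307 TEMPORARY_REDIRECT\r\nLocation: http://hdfs-datanode:50075/webhdfs/v1/test.md?op=CREATE\r\n' | extract_location
```

Quoting the whole URL (instead of backslash-escaping each `&`) avoids the shell swallowing query parameters, which is exactly what went wrong in my earlier attempts.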
It has been a long struggle to get everything working, and I learned a lot about the behaviour of Hadoop and how, combined with Docker, it pushes the limits of what is possible. I still believe a reverse lookup could be possible, and it is something I will try later, after I have managed to completely set up the Hadoop cluster.