This part continues where we left off in part 2. We will examine the other ways to upload files to a Hadoop cluster:
- Using the command line
- From Java using the DFSClient
- From Java using the FileSystem
If you did not read the previous part, I highly recommend doing so before continuing. Before diving into the new work, let's summarise what we learned from the previous part.
- We had to fix the hostname of the container.
- The mapping of ports caused a problem.
- The filesystem permissions are very restrictive to non-root users.
The first two points mean that the cluster cannot be used as it was designed in part 1, and is restricted to a single datanode.
Hadoop comes with command-line scripts that you can run to copy files from your local system to HDFS. The target HDFS does not need to be on your local machine, which means we can use these scripts to upload files to our cluster. The command to do this is:
./bin/hadoop fs -put README.md hdfs://localhost:9000/readme.md
Just like with the REST API, we run into user permissions. However, the user that is reported is no longer ‘dr.who’ but the user set up on my machine. If no user is specified, Hadoop runs a command to find out the current user. On Unix-based systems, this is ‘whoami’. To prevent this behaviour, we have to set the username explicitly through the environment variable ‘HADOOP_USER_NAME’.
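The lookup order described above can be modelled in a few lines of plain Java. This is only a simplified sketch of the behaviour, not Hadoop's actual implementation: the class and method names are made up for illustration, and only the environment variable and the operating-system user are considered.

```java
import java.util.Map;

// Hypothetical helper, not part of Hadoop: models which user name a client reports.
class HadoopUserSketch {

    // Returns the HADOOP_USER_NAME override when it is present in the environment,
    // otherwise falls back to the operating-system user (what `whoami` prints).
    public static String effectiveUser(Map<String, String> env, String osUser) {
        String override = env.get("HADOOP_USER_NAME");
        if (override != null) {
            return override;
        }
        return osUser;
    }

    public static void main(String[] args) {
        String osUser = System.getProperty("user.name");
        System.out.println("Effective user: " + effectiveUser(System.getenv(), osUser));
    }
}
```

On the command line the same override can be given as a one-off prefix, for example HADOOP_USER_NAME=root in front of the hadoop command. Here root is an assumption; use whichever user owns the target directory in your cluster.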
Even with the correct user set, I get an error when trying to write the file. In the namenode log file I can see the following lines:
2017-12-19 13:35:55,276 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=172.19.0.3:50010 for /readme.md._COPYING_
File /readme.md could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
It is not clear why the datanode is excluded from the operation. If you search for this error online, the most common causes are a port that is not open or a datanode that has run out of disk space. From the namenode I am able to connect to the datanode on IP 172.19.0.3, and port 50010 is open, so there should be no reason why the namenode cannot connect to the datanode. Moreover, if the namenode were unable to connect to the datanode, wouldn’t the Web UI give some indication of this? From the Web UI everything looks fine, and the datanode is connected.
But what we learned from the REST API is that the namenode itself will not write the file; it only acts as a lookup that tells you which datanode to write the file to. It is my host machine that has to write the file. This is a very reasonable decision, as it prevents the namenode from becoming the bottleneck. Just imagine that every request to write data had to pass through the namenode. This would cause a lot of traffic and requests to the namenode, which would completely destroy the distributed capabilities of the system. Yet it is exactly because my local machine has to write the data to the datanode that it now fails.
While the IP address of the datanode can be reached by the namenode, this is not the case for my host machine. The IP address of the datanode is an internal IP address of the Docker network, and is therefore not known to or reachable from my system. Another minor problem was that I did not expose the port in the Docker Compose configuration file, but only in the Dockerfile of the image itself. Opening the port is easy, but even then there is the problem of the unreachable IP address.
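A quick way to confirm this kind of reachability problem from the host is a plain TCP connect with a timeout. The sketch below uses only the JDK. The class and method names are hypothetical, the address and port are the ones from the namenode log above, and the two-second timeout is an arbitrary choice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of Hadoop.
class ReachabilitySketch {

    // Returns true when a TCP connection to host:port succeeds within timeoutMillis.
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Address and port taken from the namenode log above; adjust to your setup.
        System.out.println("datanode reachable by IP: " + canConnect("172.19.0.3", 50010, 2000));
    }
}
```

Running the check once from the host and once from inside the namenode container should show the difference described above, assuming a JDK is available in both places.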
It is interesting to see that the REST API not only uses a different port (which makes sense, as it is a completely different protocol), but also uses the hostname instead of the IP address. I am not sure why this difference exists, but a quick search on the internet suggested that the developers wanted to avoid problems with DNS lookups and badly resolving hostnames. Luckily, they did think about cases such as mine and added a configuration option that allows the use of the hostname instead of the IP address. This config option (dfs.client.use.datanode.hostname) has to be set on the client, not on the namenode or the datanode.
Doing this on the command line means that the command gets an additional -D flag to set the configuration option.
./bin/hadoop fs -D dfs.client.use.datanode.hostname=true -put README.md hdfs://localhost:9000/readme.md
Uploading content to the Hadoop cluster from code can again be done in multiple ways. Of course you can use the REST API, just like we did in part 1 from the command line using cURL. Since there should be no difference between these two approaches, I did not do this. Instead I used the two approaches that the Hadoop jar file offers: the DFSClient and the FileSystem.
For both approaches you have to set up a Configuration object, and both approaches complain about the Hadoop home directory not being set.
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
I don’t see why the Hadoop home directory is required, and neither do some other people, as there is an issue ticket related to this (https://issues.apache.org/jira/browse/HADOOP-9422). My code works perfectly despite the exception, but having an exception that you should ignore in your console or log file is bad practice. I tried to fix it by setting ‘hadoop.home.dir’ in my configuration object, but without success. The exception was still present, which may indicate that the part that throws the exception does not use this configuration object.
The first real problem, however, is a known one: the application again determines my user with the ‘whoami’ command, which means that I don’t have permission to write the data. I tried different configuration settings to override this, but none worked, and I could not find anything related to this in the JavaDoc. So it looks to me as if the only way to actually override this setting is again to set the environment variable (HADOOP_USER_NAME). This is very similar to how the problem with the Hadoop home directory can be fixed.
Another known problem is that of the unreachable IP address, but since we encountered this problem before, we know how to deal with it. It suffices to set the configuration option that connects by hostname instead of IP address to true, and we are able to write data. It is, however, interesting to see that if you fail to write the file, in this case because we cannot reach any datanode, the file still exists on the namenode. It has a size of 0 bytes and no further details. The fact that it exists is important, as your application may only want to upload files that do not exist yet. This means that if an upload fails, you also have to delete the leftover file yourself.
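The "create only if absent, clean up after a failure" pattern can be sketched as follows. To keep the example runnable without a cluster it uses java.nio on the local file system as a stand-in for HDFS, and the class and method names are hypothetical. With Hadoop's FileSystem the corresponding calls would presumably be create with the overwrite flag set to false, and delete.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; the local file system stands in for HDFS here.
class UploadCleanupSketch {

    // Writes source to target only if target does not exist yet. If copying fails,
    // the partially written target is removed so that a retry is not blocked.
    public static void uploadIfAbsent(InputStream source, Path target) throws IOException {
        // CREATE_NEW throws FileAlreadyExistsException when target exists. The stream
        // is opened outside the try block so an existing file is never deleted.
        OutputStream created = Files.newOutputStream(target, StandardOpenOption.CREATE_NEW);
        try (OutputStream out = created) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = source.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            // The stream is already closed at this point; remove the leftover file.
            Files.deleteIfExists(target);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempDirectory("upload-sketch").resolve("readme.md");
        InputStream failing = new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("simulated unreachable datanode");
            }
        };
        try {
            uploadIfAbsent(failing, target);
        } catch (IOException e) {
            System.out.println("upload failed, leftover file present: " + Files.exists(target));
        }
    }
}
```

The important detail is that the cleanup only runs for a file this call created itself, so a file that already existed before the upload is left untouched.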
So how does the DFSClient differ from the FileSystem? A first difference is that the DFSClient takes the URI of the namenode as a parameter, whereas the FileSystem reads it from the configuration object. Another difference is that the API of the DFSClient is a bit more consistent in its use of flags. The API of the DFSClient forces you to specify whether you want to overwrite the file if it exists. The FileSystem allows you to do the same thing, but its shortest method does not have a flag for this and will overwrite files.
Copying files from a local system to the Hadoop cluster is most easily done with the FileSystem, as it has a method to directly copy files based on the path. The DFSClient does not have such a method, and only works with input and output streams, which of course is also possible with the FileSystem.
Personally I prefer the API of the DFSClient, as it shows more clearly what you are doing, while the FileSystem tries to hide the HDFS details, but I guess this is a matter of taste. I did not analyse throughput or other important aspects, so my preference here is based entirely on a first impression.