In the previous blog post I examined the effect of specifying the initial capacity of an ArrayList; I will now do the same for a HashMap. Compared to an ArrayList, a HashMap works differently and takes an additional parameter, the load factor. This parameter is needed because a HashMap's performance depends on how full it is. Because adding elements to a HashMap and retrieving them again are quite different operations, I decided to split this topic into two separate blog posts. In this part I will examine adding elements to a HashMap.
A HashMap, as the name suggests, uses the hash code of the keys to determine where they belong. The hash code of the key is, however, not used as-is: the lower 16 bits are mixed with the higher 16 bits by XOR-ing the original hash code with itself shifted right by 16 bits. As an example, consider the hash code 1,939,665,527, which results in an actual hash of 1,939,636,715.
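This spreading step corresponds to the `hash` method in OpenJDK's `java.util.HashMap`; a small self-contained sketch of it looks like this:

```java
public class HashSpread {
    // The spreading step as implemented in OpenJDK's java.util.HashMap:
    // XOR the hash code with itself shifted right by 16 bits, so the
    // high bits also influence the low bits used for bucket selection.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        System.out.println(hash(1_939_665_527)); // prints 1939636715
    }
}
```

Running it on the example value above indeed turns 1,939,665,527 into 1,939,636,715.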
Since these hash values play an important part, I created two tests: the first uses a simple sequence of integer numbers, which results in an equally nice sequence of hashes; the second uses random numbers (with a fixed seed, of course).
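The two setups can be sketched as follows; the method names and the seed value are my own assumptions, not the original benchmark code:

```java
import java.util.HashMap;
import java.util.Random;

public class FillTest {
    // Sequential keys 0, 1, 2, ... give an equally nice sequence of hashes,
    // since Integer.hashCode() is simply the value itself.
    static HashMap<Integer, Integer> fillSequential(int n, int initialCapacity) {
        HashMap<Integer, Integer> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < n; i++) {
            map.put(i, i);
        }
        return map;
    }

    // Random keys from a fixed seed, so every run inserts the same values.
    static HashMap<Integer, Integer> fillRandom(int n, int initialCapacity) {
        Random rnd = new Random(42); // the seed value is an assumption
        HashMap<Integer, Integer> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < n; i++) {
            int key = rnd.nextInt();
            map.put(key, key);
        }
        return map;
    }
}
```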
The results of the first tests are shown in the images below:
As expected, the default performs worse, but strangely a HashMap that is slightly smaller performs the same as one that is just right, while one that is a bit larger performs better. Why is this? The slightly smaller HashMap performs the same because both lead to an internal table of the same size. A HashMap always creates an internal table whose size is a power of 2, so initial capacities of 2^10 and 2^10-1 both create an internal table of size 2^10. But 2^10+1 results in a table of size 2^11: not just one element bigger, but twice as big.
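This rounding is done by the private `HashMap.tableSizeFor` method; a sketch following the JDK 11 version of that logic shows why 2^10-1 and 2^10 end up with the same table while 2^10+1 doubles it:

```java
public class TableSize {
    // Rounds a requested capacity up to the next power of two, mirroring
    // the logic of java.util.HashMap.tableSizeFor (JDK 11 variant).
    static int tableSizeFor(int cap) {
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return (n < 0) ? 1 : (n >= (1 << 30)) ? (1 << 30) : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(1023)); // 1024
        System.out.println(tableSizeFor(1024)); // 1024
        System.out.println(tableSizeFor(1025)); // 2048
    }
}
```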
The default size creates a table that is too small, but thanks to our nice sequence of hashes there won't be any hash collisions. The map simply fills the table and then resizes it by doubling the capacity. Increasing the size of the table improves performance, but why? Since our hash codes are all nice consecutive numbers, there shouldn't be any clashes that could be avoided. The only explanation is that by filling the table we trigger a resize, which is avoided by making the map bigger up front. This is where the load factor comes into play: the map resizes as soon as the number of elements exceeds the capacity multiplied by the load factor.
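To get a feel for how often that happens, here is a small sketch of my own that counts the doublings, assuming a resize triggers once the element count exceeds capacity times load factor:

```java
public class ResizeCount {
    // Counts how many times a HashMap-style table would double, assuming
    // the resize triggers once size exceeds capacity * loadFactor.
    static int resizeCount(int elements, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        int resizes = 0;
        while (elements > (int) (capacity * loadFactor)) {
            capacity <<= 1;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        System.out.println(resizeCount(1_000_000, 16, 0.75f));      // 17 with the default capacity
        System.out.println(resizeCount(1_000_000, 1 << 21, 0.75f)); // 0 when sized generously up front
    }
}
```

Under these assumptions, inserting a million elements into a default-sized map costs seventeen resizes, each of which rehashes everything inserted so far.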
Repeating the same tests for a sequence of random numbers gives very similar results, except that every case is much slower than the sequential one. This is to be expected, as now different keys can suddenly end up in the same bucket. When such a collision happens, the elements are stored at the same location as a linked list; if too many elements end up in one location, that list is converted into a tree structure (a red-black tree, since Java 8).
The only remaining question is why a HashMap that is just big enough still performs worse than one that is a bit too large. Given that we are using random values, there is a good chance that a key is re-used, which replaces the previous entry; on top of that, we have keys that land in the same bucket and thus occupy only a single slot. But the trick here is that the HashMap doesn't check how many slots of the internal table are occupied; it only checks how many elements the map contains. This means that 100 elements that all land in the same bucket are treated the same as 100 elements that occupy one slot each, and the internal table is resized regardless.
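A quick illustration of both effects (again my own code, not the benchmark): colliding keys each still count towards size(), while a re-used key does not add anything:

```java
import java.util.HashMap;

public class SizeVsBuckets {
    public static void main(String[] args) {
        // Keys 0, 16, 32, ... all select the same bucket of a 16-slot table
        // (Integer.hashCode() is the value itself, and the spreading step
        // leaves small values unchanged), yet size() still reports 100,
        // so the size-based resize threshold is crossed all the same.
        HashMap<Integer, Integer> map = new HashMap<>(16);
        for (int i = 0; i < 100; i++) {
            map.put(i * 16, i);
        }
        System.out.println(map.size()); // 100 distinct keys, collisions or not

        // Re-using a key replaces the entry instead of adding one:
        map.put(0, -1);
        System.out.println(map.size()); // still 100
    }
}
```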
Some quick testing of how fast it is to get elements back out of the HashMap shows that it still performs at O(1) speed. This is to be expected, because in all cases the actual load (or fill ratio) of the HashMap is low enough.
It is hard to discuss what the initial capacity should be without doing some more tests with the load factor. This is something we will do in the following blog post, and only once all aspects have been covered will we be able to make a decent decision on what these values should be set to.