We all know that with a hashmap, the performance depends on the fill ratio of the actual map. A hashmap that is too full would result in bad performance due to hash clashes and the whole lookup would result in a lineair lookup.
However, it is important to note that there are different ways to implement a hashmap. One consists of having a single map and if the location where to store the element (depending on the hash value) is already occupied, you select the next available one. Another keeps a list of all elements that are present on a single location. This means that even if elements have the same hash value, there is no expensive procedure of finding the next free location and the same holds for the lookup.
But how exactly does the load factor influence the performance of the hashmap with the Java implementation? In the previous blog post I already did examine how the load capacity comes into play, this time we will extend it to also use the load capacity.
The load factor is important as it determines when the map is grown and all entries have to be re-organised. An interesting detail is that since Java chose for the option where each element is backed by a list or tree, this means that a map can be grown even though it is no where near full. Just imagine a bad set of elements where they all map to the same small amount of hash codes. As more and more elements are added, only those ‘bins’ (the list/tree behind the actual hash code element) are getting bigger. The actual map itself remains very empty. Yet since there are too many elements in the map, it is grown. If the elements are re-hashed they are hopefully distributed more nicely but there is no guarantee. Just to indicate that a map that is ‘almost full’ doesn’t actually have to be anywhere near full.
To test the performance I have created different scenarios where the initial capacity and the load factor are varied to see the effect. I did an evaluation of both the time it takes to insert the elements and to retrieve them again.
The first test assesses how the insertion duration performs under the different circumstances. If the hashmap is initially created to be large enough the load factor doesn’t matter. A hashmap that is too small however, the result depends on the exact ratio between capacity and load factor. Note that as long as the capacity of the hashmap is smaller than or equal to the amount of elements to add, the load factor of 1 performs pretty well. If however the map is too big, this load factor becomes less important and all different scenarios start to look alike.
When organising the data a bit different, and grouping all results on their load factor instead of the capacity we see some other results. In all cases we see that the the hashmaps that have a capacity > 100% perform the best. This is due to the fact that their is no need to resize the map and perform any re-hashing of the data. The load of 100% shows different result from the others, most likely because it will always fill up completely before doubling the capacity. With a load of 0.5 or 0.75 this happens earlier and the exact amount of work will therefore depend on when exactly the resize happens and how often.
But more interestingly is of course the duration of getting elements out of the hashmap. For this I repeated a similar test and extracted the same data.
The second test compares the duration to retrieve the elements again. The data is not that exciting, and the common understanding is confirmed. When the load is 1, then it falls behind the rest, although not by much, this is because if the size is too small, it will resize and the actual load will become less than the load factor. The maps with an actual load of 1 are those with capacity 100% and 200%, which both show bad results for the load 1. Interestingly the capacity 75% shows worse results.
Grouping them per load reveals that in all cases the results are pretty close, having a higher capacity will result in better results, but this is simply because this means that the actual load is less then the specified factor.
The general conclusion is that getting elements from a hashmap is always performant enough, this means we must conclude that the implementation of the hashmap in Java does a good job.
In the end it turns out that putting elements in a hashmap is much heavier then getting them, due to the cost of resizing. But then again, the times getting elements from a hashmap is in normal conditions more often then putting elements in the map.