The product we develop uses a lot of XML documents, some of which are stored in eXistdb. For those who have never heard of eXistdb: don’t panic, neither had I before I started working here. Even then it took me some time to figure out what it was, and why they chose to use it.
The idea behind eXistdb is to have one data model, XML, representing the data across all layers of your application. With a traditional database you would have to convert the XML data when storing and retrieving it, introducing overhead.
In the past week I have been investigating some performance issues we had when searching for documents stored in eXistdb. In this blog post I will tell you about my findings, and how I solved them.
We currently use eXistdb as a kind of indexer/cache. Instead of making requests to a separate server, we duplicate the data in a local eXistdb instance and get our information from there, which is closer and should be faster. By storing the entire document, we can both perform searches and retrieve the file without ever having to go to the separate server. This, however, did not go as fast as we had hoped. Instead we discovered some problems with eXistdb.
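For context, the cache lookup is a plain XQuery over the local collection. A minimal sketch, where the collection path and the `record` and `title` names are made up for illustration (our real schema differs):

```xquery
(: Hypothetical sketch: search the local eXistdb collection and return
   the full cached documents, so no round trip to the origin server
   is needed. '/db/cache', 'record' and 'title' are illustrative. :)
for $doc in collection('/db/cache')/record[contains(title, 'foo')]
return $doc
```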
The first ‘run’ is a lot slower than subsequent runs. This is probably because eXistdb has nothing in memory on the first run and has to load everything from disk. Though this is not a big problem in itself, it is an indication of trouble. What if our data set becomes too big to fit in memory? Will performance drop dramatically? So far we have not experienced this, but it makes me worry about the future.
It turns out that the size of your documents has an impact on search performance. That could make sense in general, as there is more data to look through, but it should not in our case: we perform a search that orders the documents on an element at a fixed position near the beginning of each file, and files only grow after that element. Yet we discovered that the bigger the files, the slower the query.
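The query in question looks roughly like this; the element we order on sits near the top of every document. The names are illustrative, not our real schema:

```xquery
(: 'header/created' stands in for the fixed-position element we
   actually order on; everything after it in the file should be
   irrelevant to the sort. :)
for $doc in collection('/db/cache')/record
order by $doc/header/created descending
return $doc
```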
While trying to solve this I introduced some new indices, hoping things would get faster. There already were indices for the document, but we hoped that adding more might reveal one we had missed. For one particular index, however, the search got slower instead of faster. The element the query orders on uses an attribute in a predicate, and adding an index on that attribute doubled the search time.
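Indices in eXistdb are configured per collection in a collection.xconf file. A sketch of what the extra index looked like, with hypothetical qnames standing in for our real ones:

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <range>
            <!-- existing index on the element the query orders on -->
            <create qname="created" type="xs:string"/>
            <!-- the extra index on the attribute used in the predicate;
                 adding this one doubled our search times -->
            <create qname="@type" type="xs:string"/>
        </range>
    </index>
</collection>
```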
I believe the reason is that all documents have the same value for that attribute. Instead of using both indices and combining them to find the right element, eXistdb uses only the index on the attribute, presumably because it considers that one more useful. That leaves it with far less information than the element index provides, forcing it to fall back to iterating over all documents for each result.
Another problematic ‘feature’ of eXistdb I discovered while playing around with indices is that you have to restart eXistdb each time. Sure, there is an option to re-index a collection (table), but doing so seems to leave the environment in a dirty state.
When I removed the bad index and re-indexed everything, the run times for the query did not go back to their original values; they stayed at the level of the bad index. The only way to fix this was to restart eXistdb.
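Re-indexing itself is a one-liner from XQuery (the collection path here is illustrative), but in our case it was not enough on its own:

```xquery
(: Rebuild the indices for a collection; after removing the bad index
   this alone did not restore the original timings for us. :)
xmldb:reindex('/db/cache')
```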
The solution we came up with for our performance issues is to use eXistdb only as a search index, no longer as a cache. This allows us to shrink the documents to the bare minimum before storing them. The query now runs a lot faster, but the data itself has to be fetched from the separate server. Since we only need to fetch a small portion of the documents, this scales much better than before.
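Shrinking a document before storing it can be done with a small XQuery transform. A hedged sketch, where every path and element name is made up for illustration:

```xquery
(: Keep only the fields we search and sort on; the full document
   stays on the origin server. All names here are hypothetical. :)
let $full := doc('/db/incoming/record-123.xml')/record
let $minimal :=
    <record id="{ $full/@id }">
        { $full/header/created, $full/title }
    </record>
return xmldb:store('/db/index', 'record-123.xml', $minimal)
```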
We may get rid of eXistdb completely in the future, but for now it is still good enough for this new way we are using it.