Relational databases have been around for a very long time, and they have evolved to the point where they can process millions of rows within seconds. But for someone like me, a query that performs well isn’t enough: I am constantly wondering whether there is a way to speed things up, whether there is a more efficient way to do it.
Recently I had to refactor part of an application that contained some queries. At first glance I could already tell the queries weren’t as efficient as they could be, but before rewriting them I decided to run a small experiment to determine the best way to write them. My main concern with these queries was how they gathered data from different tables, and that will be the focus of this blog post.
The approach taken in the old code was to split the query up into smaller queries: first get the element you want from one table, then use the result in the query for the next. As programmers we are familiar with this pattern, since it is very similar to what we do in code when we split logic into small methods and link them together.
Splitting queries into smaller parts and combining the results in your code is a bad idea because it takes control away from the database. The database cannot perform all the optimisations it could have applied had it known the full query. Moreover, there is the constant overhead of transferring the intermediate results back and forth.
The database schema used in these tests consists of 4 tables (3 small tables with fewer than 500 rows and 1 big table with more than 900,000 rows) that are linked together, with indexes on the most important columns to make lookups faster. The queries used in the test are:
- Multiple queries, each designed to get the result from 1 table.
SELECT * FROM device WHERE key = …
SELECT * FROM variable WHERE key = ...
SELECT * FROM device_variable WHERE device_id = ... AND variable_id = ...
SELECT ts, value FROM data WHERE device_variable_id = ... AND ts >= ... AND ts <= ...
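The query-per-table approach above can be sketched in application code like this. It is a minimal sketch using an in-memory SQLite database; the table names come from the post, but the exact columns, keys, and sample values are assumptions for illustration.

```python
import sqlite3

# Minimal in-memory schema mirroring the post's four tables
# (columns beyond those shown in the queries are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE variable (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE device_variable (
    id INTEGER PRIMARY KEY, device_id INTEGER, variable_id INTEGER);
CREATE TABLE data (device_variable_id INTEGER, ts INTEGER, value REAL);
CREATE INDEX idx_data ON data (device_variable_id, ts);

INSERT INTO device VALUES (1, 'dev-a');
INSERT INTO variable VALUES (1, 'temperature');
INSERT INTO device_variable VALUES (1, 1, 1);
INSERT INTO data VALUES (1, 100, 21.5), (1, 200, 22.0), (1, 300, 23.5);
""")

# Query-per-table approach: each round trip feeds the next query.
device_id = conn.execute(
    "SELECT id FROM device WHERE key = ?", ("dev-a",)).fetchone()[0]
variable_id = conn.execute(
    "SELECT id FROM variable WHERE key = ?", ("temperature",)).fetchone()[0]
dv_id = conn.execute(
    "SELECT id FROM device_variable WHERE device_id = ? AND variable_id = ?",
    (device_id, variable_id)).fetchone()[0]
rows = conn.execute(
    "SELECT ts, value FROM data WHERE device_variable_id = ? "
    "AND ts >= ? AND ts <= ?", (dv_id, 100, 200)).fetchall()
print(rows)  # [(100, 21.5), (200, 22.0)]
```

Notice how the application, not the database, decides the order in which the tables are visited: four round trips where one would do.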
- A combined query, linking the tables in the where clause.
SELECT ts, value FROM data WHERE device_variable_id IN (SELECT id FROM device_variable WHERE device_id IN (SELECT id FROM device WHERE key = ...) AND variable_id IN (SELECT id FROM variable WHERE key = ...)) AND ts >= ... AND ts <= ...
- A combined query linking the tables with inner joins.
SELECT ts, value FROM data INNER JOIN device_variable ON data.device_variable_id = device_variable.id INNER JOIN device ON device_variable.device_id = device.id INNER JOIN variable ON device_variable.variable_id = variable.id WHERE device.key = ... AND variable.key = ... AND ts >= ... AND ts <= ...
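For comparison, the joined variant fetches the same rows in a single round trip, leaving the join order and index choices to the database. Again a sketch against an assumed in-memory schema, with hypothetical sample values:

```python
import sqlite3

# Same assumed schema as before, built in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE variable (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE device_variable (
    id INTEGER PRIMARY KEY, device_id INTEGER, variable_id INTEGER);
CREATE TABLE data (device_variable_id INTEGER, ts INTEGER, value REAL);
CREATE INDEX idx_data ON data (device_variable_id, ts);

INSERT INTO device VALUES (1, 'dev-a');
INSERT INTO variable VALUES (1, 'temperature');
INSERT INTO device_variable VALUES (1, 1, 1);
INSERT INTO data VALUES (1, 100, 21.5), (1, 200, 22.0), (1, 300, 23.5);
""")

# One round trip: the database links the tables itself via the joins.
rows = conn.execute("""
    SELECT data.ts, data.value
    FROM data
    INNER JOIN device_variable ON data.device_variable_id = device_variable.id
    INNER JOIN device ON device_variable.device_id = device.id
    INNER JOIN variable ON device_variable.variable_id = variable.id
    WHERE device.key = ? AND variable.key = ?
      AND data.ts >= ? AND data.ts <= ?
""", ("dev-a", "temperature", 100, 200)).fetchall()
print(rows)  # [(100, 21.5), (200, 22.0)]
```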
The queries were tested twice: once retrieving only a small subset of the data, and once retrieving a much larger one.
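The post does not show its measurement setup, but such a comparison can be sketched with a simple micro-benchmark harness: run each query a few times and keep the best wall-clock time. Everything below (the harness, the single-table schema, the row counts) is a hypothetical illustration, not the original test rig.

```python
import sqlite3
import time

def time_query(conn, sql, params, repeats=5):
    """Return the best wall-clock time over a few repeated runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()  # force full result retrieval
        best = min(best, time.perf_counter() - start)
    return best

# Tiny stand-in for the big 'data' table, with the same index idea.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (device_variable_id INTEGER, ts INTEGER, value REAL);
CREATE INDEX idx_data ON data (device_variable_id, ts);
""")
conn.executemany("INSERT INTO data VALUES (1, ?, ?)",
                 [(i, float(i)) for i in range(10000)])

elapsed = time_query(
    conn,
    "SELECT ts, value FROM data "
    "WHERE device_variable_id = ? AND ts >= ? AND ts <= ?",
    (1, 1000, 2000))
print(f"best of 5 runs: {elapsed:.6f}s")
```

Taking the best of several runs rather than a single run reduces noise from caches warming up and other processes on the machine.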
The small subset shows the expected behaviour: doing the single queries yourself is a bad idea and costs a lot compared to letting the database do the work for you. With the large subset, however, doing the single queries suddenly becomes the fastest approach. One clear conclusion is that combining the tables with subqueries in the WHERE clause is always a bad idea, because it comes out on top in neither case.
But the question of whether you should do the multiple queries and combine them yourself or leave it up to the database is a lot harder to answer. Personally I would still go for the inner-joined query, simply because it is also much easier to work with: it is very clear what gets executed, whereas with the split-up version you have to repeat the same steps whenever you want to run it manually. For performance, though, it may simply depend on your use case.