30th October 2022 7th July 2024 By Bzzz In editorial note databasefilesortmariadboptimizationprimary keyqueryselect fromsql

MariaDB/SQL query gotcha and optimized solution (#E19)

Haven’t had a general advice post for a while, have we.

My small fleet of ESP32 boards across the apartment is still gathering information for the AVM Fritz DECT 301 radiator control units that just started the heating season. While their queries are really simple, quick and fully automated, I also have an HighCharts overview plot of all data with user-defined query depth, usually 300 minutes / 5 hours, but I think there’s no arbitrary limit to it. One could plot all 2 million rows of data…but I’m not gonna try.

Anyway, I recently noticed query times went up significantly, and the whole thing never was blazing fast in the first place. Well, it’s run on a Raspberry Pi 3, and that isn’t all THAT fast, it’s running a PiHole, TeslaMate and some other stuff in parallel, plus a Pi 4 is currently unobtainium at well over 100€ from some scalpers. 10 seconds of query execution time are fine, 20 seconds are a tad much, and then I “optimized” it and it took…45 seconds. Time to investigate.

You see, I had good results in the past with database partitioning on very similar use cases…back on a machine with spinning rust. A large part of that effect probably was re-writing the database in a non-fragmented way, but when logging data sequentially and basically only every pulling the most recent entries, separating the old stuff (that nobody ever looks at again, but needs to be kept) should work well. Of course the datetime column is the primary key, so that one of the first stages of query optimization is ruling out n-1 files of the partitioned database, since the partition is based on the timestamp.

In my current optimization case, we have the following database layout:

CREATE TABLE `weatherdata` (
	`time` TIMESTAMP NOT NULL DEFAULT '2000-01-01 00:00:00',
	`humi_bad` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_bad` DECIMAL(4,2) NULL DEFAULT NULL,
	`humi_wz` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_wz` DECIMAL(4,2) NULL DEFAULT NULL,
	`humi_kue` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_kue` DECIMAL(4,2) NULL DEFAULT NULL,
	`pres_kue` DECIMAL(5,1) NULL DEFAULT NULL,
	`humi_sz` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_sz` DECIMAL(4,2) NULL DEFAULT NULL,
	`humi_lau` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_lau` DECIMAL(4,2) NULL DEFAULT NULL,
	`humi_out` DECIMAL(4,2) NULL DEFAULT NULL,
	`temp_out` DECIMAL(4,2) NULL DEFAULT NULL,
	PRIMARY KEY (`time`) USING BTREE,
	UNIQUE INDEX `time` (`time`) USING BTREE
)
COLLATE='utf8mb4_general_ci'
;

The current one-per-minute timestamp is created and updated by several devices; sometimes not all data arrives in time, so we do have gaps every now and then.
The partition was performed this way:

ALTER TABLE weatherdata
PARTITION BY RANGE ( UNIX_TIMESTAMP(time) ) (
   PARTITION p2018 VALUES LESS THAN ( UNIX_TIMESTAMP('2019-01-01 00:00:00')),
   PARTITION p2019 VALUES LESS THAN ( UNIX_TIMESTAMP('2020-01-01 00:00:00')),
   PARTITION p2020 VALUES LESS THAN ( UNIX_TIMESTAMP('2021-01-01 00:00:00')),
   PARTITION p2021 VALUES LESS THAN ( UNIX_TIMESTAMP('2022-01-01 00:00:00')),
   PARTITION p2022 VALUES LESS THAN ( UNIX_TIMESTAMP('2023-01-01 00:00:00')),
   PARTITION p_future VALUES LESS THAN (MAXVALUE)
);

So one partition per year, roughly 20 MB of data per file. There’s two updates at the moment due to the last partition being unused, but that’ll go away on January 1st. 20 MB on a reasonably fast MicroSD card should be handled well by the Pi – 100MB are probably a bit unwieldy already, given the thing only has 1GB of RAM and will swap afterwards – on the very same SD card of course.

What does the query do? It just picks up individual location data points from the x most recent rows of data, using the time column that is our primary key:

SELECT UNIX_TIMESTAMP(time) as time, 
humi_kue as humi, temp_kue as temp, pres_kue as pres 
FROM weather.weatherdata  ORDER BY time DESC LIMIT 1440;

The UNIX_TIMESTAMP conversion is needed for plotting in HighCharts as far as I remember, but maybe they do work with some DD.MM.YYYY HH:MM format nowadays. Other than that, this should be extremely fast, right?

Well:

/* Affected rows: 0  Found rows: 1,440  Warnings: 0  Duration for 1 query: 33.821 sec. (+ 0.011 sec. network) */

It’s not. And running the EXPLAIN command in front of the query actually explains why:

id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;weatherdata;ALL;\N;\N;\N;\N;2015244;Using filesort

“Using filesort”, no keys. The damn thing sorts a 2-million-row table and ignores the primary key that would be exactly what we need here. Why? Because UNIX_TIMESTAMP(time) is a derived value and not part of the primary key. It generates a lot of data, sorts it, and then throws away literally 99.9%…

Running the very same query with a fully unused additional column, say

SELECT time as yarp, UNIX_TIMESTAMP(time) as time, 
humi_kue as humi, temp_kue as temp, pres_kue as pres 
FROM weather.weatherdata  ORDER BY yarp DESC LIMIT 1440;

so that the unaltered timestamp key column is actually in use, we get this:

/* Affected rows: 0  Found rows: 1,440  Warnings: 0  Duration for 1 query: 0.011 sec. (+ 0.055 sec. network) */

33 seconds to 1/100th of a second. A speed-up of a measly three orders of magnitude This is also true for the HighCharts website that now starts plotting instantly, and of course backed up by the EXPLAIN output:

id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;weatherdata;index;\N;PRIMARY;4;\N;1440;

No filesort, instead using the primary key. That’s what I want, and that’s what I should have used right from the beginning. But sometimes slow and inefficient queries are just fine and run for years before becoming painful enough for someone to look into…