Index Strategy for MySQL Query that matches on one column and between two others

Database Administrators Asked on December 8, 2021

I have a table that contains a set of measurements for a continuous stream of processes. Although each process is individual, they are categorized into groups. Each process has a start and end timestamp and a process group identifier.

The table structure is as follows (InnoDB, MariaDB 10):

Table Name: measurements

 CREATE TABLE `measurements` (
 `row_id` int(11) NOT NULL AUTO_INCREMENT,
 `process_name` varchar(100) COLLATE utf8_bin NOT NULL,
 `process_id` int(11) NOT NULL,
 `process_group_id` tinyint(4) NOT NULL,
 `measurement_1` float NOT NULL,
 `measurement_2` float NOT NULL,
 `measurement_3` float NOT NULL,
 `measurement_4` float NOT NULL,
 `start_timestamp` int(11) NOT NULL,
 `end_timestamp` int(11) NOT NULL,
 PRIMARY KEY (`row_id`),
 KEY `process_group_id` (`process_group_id`,
     `start_timestamp`,`end_timestamp`),
 KEY `process_id` (`process_id`)
) ENGINE=InnoDB 
  AUTO_INCREMENT=7294932 
  DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I’m designing a query to obtain the sum of measurements 1,2,3 & 4 for all processes running within a group at a particular point in time so that the app can express each measurement for a specific process as a percentage of the total measurements in the group at that time. The start and end times of processes within a group are not synchronized and they are of variable length.

So, for a process running in Group 5 at timestamp 1431388800:

SELECT  SUM(measurement_1), 
        SUM(measurement_2), 
        SUM(measurement_3),
        SUM(measurement_4)
    FROM  measurements
    WHERE  process_group_id = 5
      AND  1431388800 
           BETWEEN start_timestamp 
               AND end_timestamp 

This query runs, but takes around 0.5s. The table has 8m records and grows by about 30,000 a day.

I have an index on process_group_id, start_timestamp, end_timestamp. However, the query does not appear to use anything but the process_group_id part of the index. I created an additional index on process_group_id alone to check this, and once it was created, EXPLAIN showed the query using that index.

After some searching, I saw a suggestion to modify the query by adding an ORDER BY clause. Having done this, the query speeds up to around 0.06s and appears to use the full index. However, I'm unsure why:

SELECT  process_group_id, 
        SUM(measurement_1), 
        SUM(measurement_2),
        SUM(measurement_3), 
        SUM(measurement_4)
    FROM  measurements
    WHERE  process_group_id = 5
      AND  1431388800 
           BETWEEN start_timestamp 
               AND end_timestamp
    ORDER BY  process_group_id ASC 

With 30,000 new records a day that require their shares to be calculated, 0.06s is still not particularly fast. Is there a better way of structuring the table or designing the query that would make this a few orders of magnitude quicker, or is a query that matches on one column and then applies a range condition on two others always going to be fairly slow?

One Answer

(Not an answer, but too clumsy for a comment.)

I have an index on process_group_id, start_timestamp, end_timestamp. However, the query does not appear to use anything but the process_group_id part of the index.

It is actually using start_timestamp although it does not say so. What was the key_len? That may give a clue. Also try EXPLAIN FORMAT=JSON SELECT ....
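For example, something along these lines should show the key_len and which index parts are actually used (this simply wraps the original query in EXPLAIN):

EXPLAIN FORMAT=JSON
SELECT  SUM(measurement_1), SUM(measurement_2),
        SUM(measurement_3), SUM(measurement_4)
    FROM  measurements
    WHERE  process_group_id = 5
      AND  1431388800 BETWEEN start_timestamp AND end_timestamp;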

AND  1431388800 BETWEEN start_timestamp AND end_timestamp

Turn that into the following to see if it helps:

AND  start_timestamp <= 1431388800
AND  end_timestamp   >= 1431388800

I suspect it is identical, but I am not sure.

Caution. The difference between 0.5s and 0.06s could be a warmed-up cache. Run timings twice. Also, use SQL_NO_CACHE to avoid the Query cache.
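For example, a timing run might look like this (using the rewritten range conditions; SQL_NO_CACHE affects only this one statement):

SELECT  SQL_NO_CACHE
        SUM(measurement_1), SUM(measurement_2),
        SUM(measurement_3), SUM(measurement_4)
    FROM  measurements
    WHERE  process_group_id = 5
      AND  start_timestamp <= 1431388800
      AND  end_timestamp   >= 1431388800;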

How wide are the timestamp ranges typically? How precise is 1431388800? The values sound like they have a resolution of 1 second. What if we switched to 1 minute or 1 hour?

After you provide some answers, I will possibly suggest turning this into a Data Warehouse application and discuss Summary tables.

Edit

Consider this approach to storing the data. (I still don't have enough details to determine what variant of the following would be optimal.)

Since you have a "processing" phase that leads to the table in question, I suggest rewriting it to store into a different table (either in place of the existing one, or in addition to it):

CREATE TABLE ByMinute (
    process_group_id ...
    ts TIMESTAMP NOT NULL, -- rounded to the minute
    sum_1 FLOAT  -- see below
    sum_2 ...,
    PRIMARY KEY(process_group_id, ts)
);
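A filled-in version of that sketch might look like the following; the column types are assumptions carried over from the measurements table, not a requirement:

CREATE TABLE ByMinute (
    process_group_id tinyint(4) NOT NULL,
    ts TIMESTAMP NOT NULL,            -- rounded down to the minute
    sum_1 FLOAT NOT NULL DEFAULT 0,
    sum_2 FLOAT NOT NULL DEFAULT 0,
    sum_3 FLOAT NOT NULL DEFAULT 0,
    sum_4 FLOAT NOT NULL DEFAULT 0,
    PRIMARY KEY (process_group_id, ts)
) ENGINE=InnoDB;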

The table contains one row per minute. That's 0.5M rows per year, not terribly big. If converting from the existing structure, do sum_1 += measurement_1 (and likewise for the other measurements) for every row whose start_timestamp/end_timestamp range covers the minute in question.
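One way that conversion could be expressed for a single minute bucket (1431388800 here, which already falls on a minute boundary) is sketched below; the ON DUPLICATE KEY UPDATE assignment rebuilds that minute idempotently, and swapping it for sum_1 = sum_1 + VALUES(sum_1) would instead accumulate incrementally as new rows arrive:

INSERT INTO ByMinute (process_group_id, ts, sum_1, sum_2, sum_3, sum_4)
SELECT  process_group_id,
        FROM_UNIXTIME(1431388800),    -- the minute being filled
        SUM(measurement_1), SUM(measurement_2),
        SUM(measurement_3), SUM(measurement_4)
    FROM  measurements
    WHERE  1431388800 BETWEEN start_timestamp AND end_timestamp
    GROUP BY  process_group_id
ON DUPLICATE KEY UPDATE
        sum_1 = VALUES(sum_1), sum_2 = VALUES(sum_2),
        sum_3 = VALUES(sum_3), sum_4 = VALUES(sum_4);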

That is possibly more processing than you are currently doing, but it should not be excessive. And it makes the SELECT extremely efficient:

SELECT  sum_1, sum_2, sum_3, sum_4
    FROM  ByMinute
    WHERE  process_group_id = 5
      AND  ts = somehow round 1431388800 to a minute
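For the Unix-second timestamps in the original table, that rounding could be done along these lines (just one illustrative option):

SELECT  sum_1, sum_2, sum_3, sum_4
    FROM  ByMinute
    WHERE  process_group_id = 5
      AND  ts = FROM_UNIXTIME(1431388800 - (1431388800 MOD 60));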

You currently have a daily dump. The processing should be obvious. If you switch to "streaming" and use the ping-pong method I mentioned, then very similar code can be used for each transient table. And you would probably have nearly up-to-the-second data all the time.

Answered by Rick James on December 8, 2021
