TransWikia.com

15 seconds to insert a single row in MYSQL 5.7 (happens occasionally)

Database Administrators Asked by patrickdavey on October 27, 2021

I’m seeing very sporadic slow queries in MYSQL, at least, that’s where Datadog is suggesting the issues are. I have, for example, 5 queries out of 726K traces (both selects & updates) in the last 48 hours which are slow. It seems to correlate with a spike in disk latency and or stolen CPU (however, there is ample spare capacity on the box). I’m stumped and I’m unsure where to start trying to look for a fix (or even if it’s worth attempting)

Here’s the example with the 15 second COMMIT. It’s doing a single insert into a visitors table. That table has ~5Million rows and the following structure:

CREATE TABLE `visitors` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(255) DEFAULT NULL,
  `market_code` varchar(255) DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_visitors_on_session_id` (`session_id`)
) ENGINE=InnoDB AUTO_INCREMENT=41764933 DEFAULT CHARSET=utf8;

query trace

In this particular example, is it likely the index is being updated and that is causing this extreme spike? What’s curious is that often these slow updates "cluster" (there’ll be 4 or 5 transactions to different tables which are slow at the same time) In fact, with the example above of the 5 slow queries in the last 48 hours, they all happened within a second of eachother.

Any suggestions as to what might be causing this would be welcome.

slow traces (outline) below

The "COMMIT" and "BEGIN"s which are slow are for two different tables, both wrapping a transaction that updates a single row indexed by the primary key.

enter image description here

Update 1 – answering queries

  1. There was no spike in iops or bulk transfer than I can see in the logs. There was a spike in disk latency and cpu.iowait (see attached)

enter image description here
2. I do have the slow query log enabled. It’s gigantic because it’s logging too much and hasn’t been cleared in a long while. I probably need to reset it to work with it nicely – I’ll look into that.
3. I don’t have access to the host, but on the VM Datadog does report stolen cpu, and that’s certainly ~ nil (it reports < .01% stolen cpu for the last month).

The provider did suggest changing the values for sync_binlog and
innodb_flush_log_at_trx_commit. I’ve actually already set innodb_flush_log_at_trx_commit=2 (we’re fine with losing seconds or even minutes of data in the event of a crash). I haven’t yet attempted to change sync_binlog=0, I need to understand a bit more about that. I wonder if that’s likely to help?

Update 2 – 9 month disk latency graph.

I thought this graph of disk latency for the last 9 months was interesting. It looks like something changed somewhere in our hosting provider (linode) environment around March 13th. Maybe a noisy neighbour, maybe something changed in the way their storage works. I’m following up with them about this in the hope that it may resolve the issues we’ve been seeing. While the graph just shows spikes of ~3ms, the underlying spikes are a lot higher.

9 month disk latency graph

Update 3 – values for Wilson Hauck

  • SELECT @@innodb_file_per_table; = 1
  • SELECT @@innodb_io_capacity; = 200
  • SHOW GLOBAL STATUS LIKE ‘%flush%’; flush values
  • uptime – 138437

Over the weekend I trialed changing the sync_binlog to zero. I still seem to get some odd slow queries in there. For example, in the slow query log I see this one:

SET timestamp=1595871703;
SHOW /*!50000 ENGINE*/ INNODB STATUS;
# Time: 2020-07-27T17:41:43.145667Z
# 
# Query_time: 2.138283  Lock_time: 0.000169 Rows_sent: 1  Rows_examined: 1
SET timestamp=1595871703;
SELECT  `public_codes`.* FROM `public_codes` WHERE `public_codes`.`code` = 'fwreh7x1' LIMIT 1;

again corresponding with a spike in latency. Why would a query time like that be 2 seconds?

Update 5

So, I reset the slow-query log. One interesting thing since then.. datadog for example reported 4.43 seconds to execute UPDATE documents SET last_browse_path = ? WHERE documents . id = ? (updating a single non-indexed column referenced by the primary key). However, I do not see that listed in the slow-query log, which, I imagine I should be seeing. Possibly Datadog is giving some spurious results?

Update 6 – show create table result

Note, there are only 2.4 million rows in here, despite what the auto-increment says. We had an increment value of 10 for quite a long time.

CREATE TABLE `public_codes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `document_id` int(11) NOT NULL,
  `code` varchar(255) DEFAULT NULL,
  `enabled` tinyint(1) DEFAULT '0',
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `collaborative_editing` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `index_public_codes_on_document_id` (`document_id`),
  UNIQUE KEY `index_public_codes_on_public_code` (`code`)
) ENGINE=InnoDB AUTO_INCREMENT=17148900 DEFAULT CHARSET=utf8

Update 7.

The provider was about to close the ticket and asked "was there anything else to add?", so I said how the issue was still happening, and how it was weird that something had changed on the 12th March but that I still had no idea what. At which point they said that we had been live migrated to a new (shared) host on the 12th March. They had mailed us, unfortunately we hadn’t been monitoring the inbox.

I’m waiting to migrate the db box to a dedicated host as I suspect this is a noisy neighbour issue . Once I have done the migration (in a week or two hopefully) then I’ll update the ticket here after that, assuming it fixes the issues.

Pretty annoyed they didn’t tell me the first time I asked about the 12th March.. still, just hoping this resolves the issues.

3 Answers

This turned out to be a noisy neighbour issue. While we do still have some issues, migrating on linode to a dedicated CPU instance made most of these issues go away.

Answered by patrickdavey on October 27, 2021

Suggestion to consider for your my.ini [mysqld] section

innodb-io-capacity=1900  # from 200 to allow higher IOPS for your SSD device

Consider running ANALYZE on your public_codes table to ensure indexes are current.

I suspect the slow query was bumping against some form of flushing activity that caused it to take 2 seconds.

Answered by Wilson Hauck on October 27, 2021

It seems to correlate with a spike in disk latency

there'll be 4 or 5 transactions to different tables which are slow at the same time

Is there also a spike in disk throughput (either bulk transfer or IOPs?) also? If so then it may simply be a sudden glut of IO contention. Check to see if there are some nasty queries running around then that are scanning large tables/indexes or temporary increases in concurrency (i.e. you see an uptick in user activity at the time?, and/or are there scheduled maintenance jobs around then?)

both wrapping a transaction

There could be locking issues here, with your update waiting for other transactions that hold read locks on the affected rows/pages/tables. Those waits might happen all the time, but are only noticeable due to one of the possibilities in my previous paragraph, or could be the issue on their own if caused by a single (or small number of) long running statement that holds relevant locks.

If you haven't already, enable the slow query log (the documentation for mariadb, IIRC it and mysql do not differ in this respect) as that may provide further clues if something significant comes up there at the same time as your visible issue.

stolen CPU

That says you are running in a VM so there could be a lot of extra external causes for a temporary drop in performance if you can not find good reason internally. Do you have access to the host to check its logs/metrics, or only this guest?

Answered by David Spillett on October 27, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP