
How to make delete duplicates faster?

Database Administrators, Asked on January 2, 2022

On a MySQL table with about 1.7M rows, I tried to delete duplicate posts:

delete a FROM comment a
  INNER JOIN comment a2
     WHERE a.id < a2.id
     AND   a.body = a2.body;

The result was:

  Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)

The table schema:

CREATE TABLE `comment` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `body` longtext COLLATE utf8_unicode_ci NOT NULL,
  `post_id` int(11) NOT NULL,
  `published` tinyint(1) NOT NULL,
  `cid` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_postid_idx` (`id`,`post_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1774682 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 

This happened on my almost idle workstation with an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.
Are there any tricks to make this delete operation faster?

One Answer

Your algorithm is "Order(N^2)". If you have a million rows, it will take about a trillion comparisons! (Actually closer to N*N/2; with your 1.7M rows, that is roughly 1.4 trillion.)

I assume you want to run this again later? Or you want to avoid inserting duplicates? In either case, the following will speed things up greatly.

ALTER TABLE `comment` ADD COLUMN hash BINARY(20);
UPDATE `comment` SET hash = UNHEX(SHA1(body));
ALTER TABLE `comment` ADD UNIQUE(hash);  -- (see discussion below)

When INSERTing, include hash with a value of UNHEX(SHA1(body)). If the hash is a dup, then the body is extremely likely to be a dup, and the INSERT will fail as intended.
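
For example (the column values here are illustrative, not from the question):

INSERT INTO `comment`
        (author_id, created, title, body, post_id, published, cid, hash)
    VALUES
        (123, NOW(), 'Some title', 'Some body text', 456, 1, 0,
         UNHEX(SHA1('Some body text')));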

INDEX versus UNIQUE:

  • INDEX gives you a much faster way to do the dup-elimination: check hash instead of body.
  • UNIQUE prevents future insertion of dups.

So, let's do it in three phases:

  1. Add INDEX(hash) (not UNIQUE(hash)).
  2. Do the de-dupping -- with INDEX(hash), the DELETE is Order(N): a million tests, not a trillion. (See the sketch after this list.)
  3. Change to UNIQUE: DROP INDEX + ADD UNIQUE. -- Now future duplicate INSERTs will fail; add suitable code to deal with that. Consider INSERT IGNORE.
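
A sketch of those three phases (untested; the index names hash_idx and uq_hash are my own):

ALTER TABLE `comment` ADD INDEX hash_idx (hash);   -- phase 1

DELETE a                                           -- phase 2: keep the highest id per hash
  FROM `comment` AS a
  JOIN `comment` AS a2  ON a2.hash = a.hash
                       AND a.id < a2.id;

ALTER TABLE `comment`                              -- phase 3
    DROP INDEX hash_idx,
    ADD UNIQUE KEY uq_hash (hash);

After phase 3, INSERT IGNORE turns a duplicate-hash INSERT into a silent no-op instead of an error.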

All of that will take minutes, not hours. And you won't have to do any of it again.

(With a newer version of MySQL, you could use GENERATED ALWAYS to avoid some of the code.)
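
A sketch of that variant (my own, untested; generated columns need MySQL 5.7.6 or later):

ALTER TABLE `comment`
    ADD COLUMN hash BINARY(20)
        GENERATED ALWAYS AS (UNHEX(SHA1(body))) STORED,  -- maintains itself on INSERT/UPDATE
    ADD UNIQUE KEY uq_hash (hash);                       -- de-dup first; this fails if dups remain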

One shot

To do the delete once and do it without modifying the table:

CREATE TABLE t (
    id INT NOT NULL,
    hash BINARY(20),
    PRIMARY KEY(id),
    INDEX(hash)
) ENGINE=InnoDB;

INSERT INTO t (id, hash)
    SELECT id, UNHEX(SHA1(body))
        FROM `comment`;

DELETE `comment`             -- MySQL multi-table DELETE: target table goes before FROM
  FROM `comment`
  JOIN  t AS t1  ON t1.id = `comment`.id
  JOIN  t AS t2  ON t2.hash = t1.hash
                AND t2.id < t1.id;   -- delete each row for which an earlier row has the same hash

DROP TABLE t;

Caveat: I have not tested this code. Recommend you test it separately, perhaps with a subset of the data.
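
A quick sanity check before running the DELETE (my own addition): count the rows it would remove and compare against expectations.

SELECT COUNT(*)
    FROM t AS t1
    JOIN t AS t2  ON t2.hash = t1.hash
                 AND t2.id < t1.id;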

Answered by Rick James on January 2, 2022
