
How to make delete duplicates faster?

Database Administrators, Asked on January 2, 2022

On a MySQL table with about 1.7M rows, I tried to delete duplicate posts:

delete a FROM comment a
  INNER JOIN comment a2
     WHERE a.id < a2.id
     AND   a.body = a2.body;

The result was:

  Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)

The table schema:

CREATE TABLE `comment` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `body` longtext COLLATE utf8_unicode_ci NOT NULL,
  `post_id` int(11) NOT NULL,
  `published` tinyint(1) NOT NULL,
  `cid` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_postid_idx` (`id`,`post_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1774682 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 

This happened on my almost idle workstation with an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.
Are there any tricks to make this delete operation faster?

One Answer

Your algorithm is "Order(N^2)". If you have a million rows, it will take about a trillion comparisons! (Actually closer to N*N/2; with your 1.7M rows, that is roughly 1.4 trillion.)

I assume you want to run this again later? Or you want to avoid inserting duplicates? In either case, the following will speed things up greatly.

ALTER TABLE `comment` ADD COLUMN hash BINARY(20);
UPDATE `comment` SET hash = UNHEX(SHA1(body));
ALTER TABLE `comment` ADD UNIQUE(hash);  -- (see discussion below)

When INSERTing, include hash with a value of UNHEX(SHA1(body)). If the hash is a dup, then the body is extremely likely to be a dup, and the INSERT will fail as intended.
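
For example (the column values here are illustrative, not from the question):

INSERT INTO `comment`
        (author_id, created, title, body, post_id, published, cid, hash)
    VALUES
        (123, NOW(), 'Some title', 'Some body text', 456, 1, 0,
         UNHEX(SHA1('Some body text')));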

INDEX versus UNIQUE:

  • INDEX gives you a much faster way to do the dup-elimination: check hash instead of body.
  • UNIQUE prevents future insertion of dups.

So, let's do it in three phases:

  1. Add INDEX(hash) (not UNIQUE(hash)).
  2. Do the de-dupping -- with INDEX(hash), the DELETE is Order(N): a million tests, not a trillion. (See the sketch after this list.)
  3. Change to UNIQUE: DROP INDEX + ADD UNIQUE. -- Now future duplicate INSERTs will fail; add suitable code to deal with that. Consider INSERT IGNORE.
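
A sketch of those three phases (untested; the index names hash_idx and uq_hash are my own):

ALTER TABLE `comment` ADD INDEX hash_idx (hash);   -- phase 1

DELETE a                                           -- phase 2: keep the highest id per hash
  FROM `comment` AS a
  JOIN `comment` AS a2  ON a2.hash = a.hash
                       AND a.id < a2.id;

ALTER TABLE `comment`                              -- phase 3
    DROP INDEX hash_idx,
    ADD UNIQUE KEY uq_hash (hash);

After phase 3, INSERT IGNORE turns a duplicate-hash INSERT into a silent no-op instead of an error.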

All of that will take minutes, not hours. And you won't have to do any of it again.

(With a newer version of MySQL, you could use GENERATED ALWAYS to avoid some of the code.)
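
A sketch of that variant (my own, untested; generated columns need MySQL 5.7.6 or later):

ALTER TABLE `comment`
    ADD COLUMN hash BINARY(20)
        GENERATED ALWAYS AS (UNHEX(SHA1(body))) STORED,  -- maintains itself on INSERT/UPDATE
    ADD UNIQUE KEY uq_hash (hash);                       -- de-dup first; this fails if dups remain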

One shot

To do the delete once and do it without modifying the table:

CREATE TABLE t (
    id INT NOT NULL,
    hash BINARY(20),
    PRIMARY KEY(id),
    INDEX(hash)
) ENGINE=InnoDB;

INSERT INTO t (id, hash)
    SELECT id, UNHEX(SHA1(body))
        FROM `comment`;

DELETE `comment`             -- MySQL multi-table DELETE: target table goes before FROM
  FROM `comment`
  JOIN  t AS t1  ON t1.id = `comment`.id
  JOIN  t AS t2  ON t2.hash = t1.hash
                AND t2.id < t1.id;   -- delete each row for which an earlier row has the same hash

DROP TABLE t;

Caveat: I have not tested this code. Recommend you test it separately, perhaps with a subset of the data.
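
A quick sanity check before running the DELETE (my own addition): count the rows it would remove and compare against expectations.

SELECT COUNT(*)
    FROM t AS t1
    JOIN t AS t2  ON t2.hash = t1.hash
                 AND t2.id < t1.id;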

Answered by Rick James on January 2, 2022
