How to make delete duplicates faster?

Database Administrators Asked on January 2, 2022

On a mysql table with about 1.7M rows, I tried to delete duplicates posts:

delete a FROM comment a
  INNER JOIN comment a2
     WHERE <
     AND   a.body = a2.body;

The result was:

  Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)

The table schema:

CREATE TABLE `comment` (
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `body` longtext COLLATE utf8_unicode_ci NOT NULL,
  `post_id` int(11) NOT NULL,
  `published` tinyint(1) NOT NULL,
  `cid` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_postid_idx` (`id`,`post_id`)

This happened on my almost idle workstation with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.
I’m wondering if there are some tricks to make this delete operation faster?

One Answer

Your algorithm is "Order(N^2)". If you have a million rows, it will take about a trillion comparisions! (Actually closer to N*N/2.)

I assume you want to run this again later? Or you want to avoid inserting duplicates? In either case, the following will speed things up greatly.

UPDATE `comment` SET hash = UNHEX(SHA1(body));
ALTER TABLE `comment` ADD UNIQUE(hash);  -- (see discussion below)

When INSERTing, include hash with a value of UNHEX(SHA1(body)). If the hash is a dup, then body is extremely likely to be a dup, and the INSERT will suitably fail.


  • INDEX gives you a much faster way to do the dup-elimination: check hash instead of body.
  • UNIQUE prevents future insertion of dups.

So, let's do it in two phases:

  1. Add INDEX(hash) (not UNIQUE(hash))
  2. Do the dedupping -- With an INDEX(hash), the DELETE is Order(N) -- a million tests, not a trillion.
  3. Change to UNIQUE: DROP INDEX + ADD UNIQUE. -- Now future INSERTs will balk; so add suitable code to deal with such. Consider INSERT IGNORE.

All of that will take minutes, not hours. And you won't have to do any of it again.

(With a newer version of MySQL, you could use GENERATED ALWAYS to avoid some of the code.)

One shot

To do the delete once and do it without modifying the table:

        id INT NOT NULL,
        hash BINARY(20),
    PRIMARY KEY(id),
    INDEX(hash) ) ENGINE=InnoDB;

INSERT INTO t (id, hash)
    SELECT id, UNHEX(SHA1(body))
        FROM `comment`;

DELETE FROM `comment`
       JOIN  t AS t1  ON =
       JOIN  t AS t2  ON t2.hash = t1.hash
                     AND <;


Caveat: I have not tested this code. Recommend you test it separately, perhaps with a subset of the data.

Answered by Rick James on January 2, 2022

Add your own answers!

Related Questions

SqlPackage does not pick up variables from profile

3  Asked on December 24, 2021 by thx-1138


This simple mysql query is taking 2-3 seconds

2  Asked on December 24, 2021 by mikkergp


Powershell Invoke-Sqlcmd Replacing £ with?

1  Asked on December 24, 2021 by dennis-finke


Upgrade Postgres 9.6 to 12

0  Asked on December 24, 2021 by doug-p


Stored procedure for adding cascade

1  Asked on December 22, 2021 by user73639


MySQL Optimization Guidance Needed

3  Asked on December 22, 2021 by enyceexdanny


Sybase IQ cache database result

1  Asked on December 21, 2021 by daria


Help Outputing MySQL General Log to Named Pipe

2  Asked on December 21, 2021 by widdlewabbit


Ask a Question

Get help from others!

© 2023 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP