Database Administrators Asked on January 2, 2022
On a mysql table with about 1.7M rows, I tried to delete duplicates posts:
delete a FROM comment a INNER JOIN comment a2 WHERE a.id < a2.id AND a.body = a2.body;
The result was:
Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)
The table schema:
CREATE TABLE `comment` ( `id` int(11) NOT NULL AUTO_INCREMENT, `author_id` int(11) NOT NULL, `created` datetime NOT NULL, `title` varchar(100) COLLATE utf8_unicode_ci NOT NULL, `body` longtext COLLATE utf8_unicode_ci NOT NULL, `post_id` int(11) NOT NULL, `published` tinyint(1) NOT NULL, `cid` int(11) NOT NULL, PRIMARY KEY (`id`), KEY `id_postid_idx` (`id`,`post_id`) ) ENGINE=InnoDB AUTO_INCREMENT=1774682 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
This happened on my almost idle workstation with
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.
I’m wondering if there are some tricks to make this delete operation faster?
Your algorithm is "Order(N^2)". If you have a million rows, it will take about a trillion comparisions! (Actually closer to N*N/2.)
I assume you want to run this again later? Or you want to avoid inserting duplicates? In either case, the following will speed things up greatly.
ALTER TABLE `comment` ADD COLUMN hash BINARY(20); UPDATE `comment` SET hash = UNHEX(SHA1(body)); ALTER TABLE `comment` ADD UNIQUE(hash); -- (see discussion below)
hash with a value of
UNHEX(SHA1(body)). If the hash is a dup, then
body is extremely likely to be a dup, and the
INSERT will suitably fail.
INDEXgives you a much faster way to do the dup-elimination: check
UNIQUEprevents future insertion of dups.
So, let's do it in two phases:
DELETEis Order(N) -- a million tests, not a trillion.
ADD UNIQUE. -- Now future
INSERTswill balk; so add suitable code to deal with such. Consider
All of that will take minutes, not hours. And you won't have to do any of it again.
(With a newer version of MySQL, you could use
GENERATED ALWAYS to avoid some of the code.)
To do the delete once and do it without modifying the table:
CREATE TABLE t ( id INT NOT NULL, hash BINARY(20), PRIMARY KEY(id), INDEX(hash) ) ENGINE=InnoDB; INSERT INTO t (id, hash) SELECT id, UNHEX(SHA1(body)) FROM `comment`; DELETE FROM `comment` JOIN t AS t1 ON t1.id = comment.id JOIN t AS t2 ON t2.hash = t1.hash AND t2.id < t1.id; DROP TABLE t;
Caveat: I have not tested this code. Recommend you test it separately, perhaps with a subset of the data.
Answered by Rick James on January 2, 2022
1 Asked on December 26, 2021 by jrdba
1 Asked on December 26, 2021 by rik-bradley
3 Asked on December 24, 2021 by thx-1138
2 Asked on December 24, 2021 by mikkergp
1 Asked on December 24, 2021 by dennis-finke
2 Asked on December 24, 2021
1 Asked on December 24, 2021 by ingus
0 Asked on December 24, 2021
1 Asked on December 24, 2021
1 Asked on December 22, 2021 by user73639
1 Asked on December 22, 2021
1 Asked on December 22, 2021 by hder
1 Asked on December 21, 2021 by scott-magnan
1 Asked on December 21, 2021 by lee-m
1 Asked on December 19, 2021 by adam-heeg
Get help from others!