TransWikia.com

MySQL MATCH() AGAINST() FULLTEXT Index - Achieve Partial String Match Combined with Phrase Match

Stack Overflow Asked by user8588978 on December 11, 2021

I have a table with column content that has a FULLTEXT index.

I want to take advantage of the speed of MATCH() on large text.

I want the search to be as accurate as possible.

When I search for the phrase string "large truck" this way:

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck"' IN BOOLEAN MODE);

Some instances are missed.

MyTable:

|   content    |
----------------
|Large n truck| FOUND ✓
----------------
|large truck   | FOUND ✓
----------------
|large trucks  | *PLURAL MISSED!
----------------
|large truckl  | *TYPE-O MISSED!

If I use the standard LIKE / wildcard method:

SELECT * FROM  `MyTable` WHERE  `content` LIKE  '%large truck%'

MyTable:

|   content    |
----------------
|Large n truck| *MISSED!
----------------
|large truck   | FOUND ✓
----------------
|large trucks  | FOUND ✓
----------------
|large truckl  | FOUND ✓

It seems I can’t use a PHRASE search with wildcard together either:

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck*"' IN BOOLEAN MODE); **DOES NOT WORK**

OR

SELECT * FROM MyTable WHERE MATCH(content) AGAINST('"large truck"*' IN BOOLEAN MODE); **DOES NOT WORK**

So…

How do I successfully use MATCH() AGAINST() searching for a phrase, and get all instances returned – even case insensitive partial string matches?

2 Answers

The trick that I often use with FT is to do it in two steps:

  1. Do a MATCH, hoping to get all the desired text, but possibly some extra results.
  2. AND with another condition -- LIKE (faster) or REGEXP (more powerful).

The MATCH will be fast because of FT; the other part will be performed second, so it will be fast because there aren't many rows to check.

This matches your criteria:

SELECT * FROM MyTable
    WHERE MATCH(content) AGAINST('+large +truck*' IN BOOLEAN MODE)
      AND content REGEXP "large[[:space:]]+truck";

Phrased differently, the query will run something like this:

  1. Let's say the table has 10K rows.
  2. The FT expression will be evaluated. The MATCH will be very fast (because of the way it is designed). It will find all rows with both "large" and "truck*" anywhere in content. Now, let's say, there are 30 rows that satisfy that.
  3. The rest of the WHERE is evaluated. But it is done against only those 30 rows. So, even though the REGEXP is costly, it is not being done often.
  4. Then perhaps 14 rows are returned.

The net effect is that the whole query runs "fast", which was one of your requirements.

Note: I needed the second part to prevent these

large green truck
the truck is large

Depending on the version, you may need this REGEXP instead: "large\s+truck".

Answered by Rick James on December 11, 2021

Here is a quick hack with REGEXP, but it's not solving the issue as it does not use the fulltext index:

SELECT * 
FROM MyTable 
WHERE content REGEXP("large[[:space:]]+truck*");

Of course you can also use the FT index by searching without exact phrase:

SELECT * 
FROM MyTable 
WHERE MATCH(content) AGAINST('+large +truck*' IN BOOLEAN MODE);

But this will eventually include records that you do not want as it is not the same as exact phrase search.

Unfortunately phrase search (double quotes - "") and truncation operator (wildcard - *) will not work regardless if you use InnoDB or MyISAM. It does not work with the distance operator with InnoDB too (which is probably coming from the same origin). I guess it is related in the nature how data of the full text index is stored.

Answered by Philip Petrov on December 11, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP