TransWikia.com

Full text search on larger text rows

Database Administrators Asked by Spedo De La Rossa on January 4, 2021

I am wondering what is the best practice to use the right query in my case.

The problem is I tried the LIKE and ILIKE and it was faster then pg_trgm and the postgres full text search, because they often run in a timeout or throw an error:

NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.

So I am not sure, if I did it right and my table looks like this and has over 10.000 entries:

  Column   |     Type      | Modifiers
-----------+---------------+-----------
 id        | integer       | not null
 doc       | bytea         | not null
 textdoc   | text          | not null

So I have the document (PDF or DOCX) and I give it to the converter which makes a huge string out of the documents and put it into the textdoc. A document can have 150 pages or 40.000 words(strings). So in each textdoc can be a huge amount of words. And when I have over 10.000 entries in the database and each of this entry has a huge amount of words in textdoc, how can I speed this up.

So my actual query looks (simplified) like this:

SELECT * FROM mytable WHERE textdoc ILIKE '%word from input mask%' 

When I am searching the web, the standard message is don’t use ILIKE. I used pg_trgmand it wasn’t really faster. It was all about three minutes for a query.

Required information:

Database: "PostgreSQL 10.4 on x86_64-pc-linux-gnu, (Red Hat), 64-bit"

Analyze (hope I did it right).. I run:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM mytable WHERE textdoc ILIKE '%word from input mask%'

Here is the output: https://explain.depesz.com/s/wqLh

Another question is, the notice of the 2047 characters: Does this mean, that a the text in the column textdoc is over 2047 or a string in the textdoc is over 2047 characters?

One Answer

The NOTICE means that textdoc contains at least one word within it that is more than 2047 characters. The definition of a word depends on what configuration you use for your ts_config. Words are generally strings of characters separated by spaces or certain punctuation. Hyphenated words will generally be emitted as each separate word, plus the entire hyphenated phrase as an extra "word", at least with the ts_configs that I am familiar with. You can use the ts_debug function to find out how it actually breaks up the string emitted by your specific converter using your chosen ts_config.

select alias, description, token from mytable cross join ts_debug('english',textdoc) 
    where length(token)>2047;

(On second thought, ts_lexize may be better than ts_debug, but I am not as familiar with it.)


Regarding the ILIKE query, your query is returning 2/3 of the rows in the table. Any index, whether pg_trgm-based or tsvector-based, is unlikely to help you in that situation.

Answered by jjanes on January 4, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP