TransWikia.com

Trigram similarity (pg_trgm) with German umlauts

Database Administrators Asked on November 21, 2021

I try to figure out how to improve Postgres 10.6 pg_trgm queries with German umlauts (äöü). In german ‘ö’ can be written as ‘oe’. But beware: not every ‘oe’ can be written as ‘ö’.

CREATE TABLE public.names
  (name text COLLATE pg_catalog."default");

CREATE INDEX names_idx
    ON public.names USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);
SHOW LC_COLLATE; -- de_DE.UTF-8

When I use the similarity()
function to query the similarity for ‘Schoenstraße’.

SELECT
    name,
    similarity (name, 'Schoenstraße') AS similarity,
    show_trgm (name)
FROM
    names
WHERE
    name % 'Schoenstraße'
ORDER BY
    similarity DESC;

I get the following result:

Name            similarity  show_trgm

Schyrenstraße   0.588235    {0x9a07c3,0xde3801,""  s"","" sc"",chy,ens,hyr,nst,ren,sch,str,tra,0x76a40e,yre}
Schönstraße     0.5625      {0x9a07c3,0xde3801,0xf00320,0x095f29,""  s"","" sc"",0x6deea5,nst,sch,str,tra,0x76a40e}

db<>fiddle here

Is there anything I can do to improve that or do I need to replace all umlauts in the DB?

One Answer

For starters: A German poet is still a Poet. 'Poet' <> 'Pöt'; 'oe' <> 'ö'. The convention to replace 'ö' with 'oe' is largely out of use nowadays. See:

I have had related problems many times. The additional module unaccent is instrumental to cover spelling variants and typos on both sides. Install it once per database:

CREATE EXTENSION unaccent;

See:

The tricky part is to get index support with it. Details may be hard for beginners.

Trigram similarity

Then create the wrapper function as instructed here (same link again):

Create a new index with it:

CREATE INDEX names_trgm_idx ON names USING gin (f_unaccent(name) gin_trgm_ops);

Query demo:

SELECT name
     , similarity(f_unaccent(name), f_unaccent('Schoenstraße')) AS sim_unaccent
     , similarity(name, 'Schoenstraße') AS sim
FROM   names
WHERE  f_unaccent(name) % f_unaccent('Schoenstraße')
ORDER  BY f_unaccent(name) <-> f_unaccent('Schoenstraße')
        , name <-> 'Schoenstraße';  -- best match first

Sorting by name <-> 'Schoenstraße' additionally serves as tiebreaker to get best matches first (after filtering cheaply with the index).

Consider a GiST index instead of GIN for "nearest neighbor"-type queries with LIMIT. See:

Full text search (FTS or just TS)

Consider full text search to just match complete words (after stemming), or with prefix matching / phrase search ... See:

The unaccent module provides a dictionary for integration in a TEXT SEARCH CONFIGURATION:

CREATE TEXT SEARCH CONFIGURATION de (COPY = german);
ALTER TEXT SEARCH CONFIGURATION de
ALTER MAPPING FOR hword, hword_part, word WITH unaccent, german_stem;

Index based on it:

CREATE INDEX names_fts_idx ON names USING GIN (to_tsvector('de', name));

Alternatively, since you are dealing with names, you might base it on the simple text search configuration instead of german. No stop words, no stemming. Names don't necessarily follow language-specific rules.

Query demo:

SELECT name
     , ts_lexize('unaccent', name)
     , to_tsvector('de', name)
     , to_tsquery('de', 'Schoenstraße')
     , to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße') AS match
     , similarity(name, 'Schoenstraße')
FROM   names
WHERE  to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße')
ORDER  BY name <-> 'Schoenstraße';  -- exact matches first

Again, sorting by name <-> 'Schoenstraße' serves as tiebreaker, maybe additionally to other sort criteria.

Note, however, that looking for 'Schoenstraße' with FTS does not find 'Schönstraße', as unaccent() does not change 'oe', and maps 'ö' to 'o'. ('oe' <> 'ö').

Related:

db<>fiddle here -- looking for 'Schoenstraße'

db<>fiddle here -- looking for 'Schönstraße'

Answered by Erwin Brandstetter on November 21, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP