
Indexes not used for join

Database Administrators. Asked by Dmitriy Grankin on September 11, 2020

I join the listings_tail table and the materialized view cals_reviewed_wide_scaled on the listing_id column, which is of type integer.

The table has around 1.5 billion rows and 2 million unique listing ids. The materialized view has been tested at sizes from 500 to 500,000 rows. Both join columns are indexed with btree, hash, and BRIN indexes.
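
(For reference, indexes of those three types would have been created with statements along these lines; the names match the \d output below:)

    CREATE INDEX listing_id_tail      ON listings_tail USING btree (listing_id);
    CREATE INDEX tail_listing_id_hash ON listings_tail USING hash  (listing_id);
    CREATE INDEX tail_listing_id_brin ON listings_tail USING brin  (listing_id);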

EXPLAIN always returns a Seq Scan, despite the available indexes.

How can I optimize the indexing so that the indexes are used?

query

SELECT * from listings_tail L
INNER JOIN cals_reviewed_wide_scaled R 
ON L.listing_id = R.listing_id;

tables

\d listings_tail

 id                   | bigint                      | not null default nextval('listings_tail_id_seq'::regclass)
 listing_id           | integer                     | not null
 parsing_timestamp    | timestamp without time zone | not null
 value                | character varying           | not null
 homes_details_inside | integer                     | not null
 tags                 | character varying[]         | not null
Indexes:
    "listings_tail_pkey" PRIMARY KEY, btree (id)
    "listings_tail_dd_unique" UNIQUE CONSTRAINT, btree (listing_id, value, homes_details_inside, tags)
    "listing_id_tail" btree (listing_id)
    "tail_listing_id_brin" brin (listing_id)
    "tail_listing_id_hash" hash (listing_id)

\d cals_reviewed_wide
 id                | integer                     | not null default nextval('cals_reviewed_wide_id_seq'::regclass)
 listing_id        | integer                     | not null
 date              | timestamp without time zone | not null
 price             | double precision            |
 parsing_timestamp | timestamp without time zone | not null
 min_nights        | double precision            |
 max_nights        | double precision            |
 available         | integer                     | not null
 created_at        | timestamp without time zone | not null
 revs_id           | integer                     | not null
Indexes:
    "cals_reviewed_wide_pkey" PRIMARY KEY, btree (id)
    "cals_reviewed_wide_unique" UNIQUE CONSTRAINT, btree (listing_id, date)
    "cals_reviewed_wide_revs_id_ixs" btree (revs_id)
    "ix_cals_reviewed_wide_listing_id" btree (listing_id)
Foreign-key constraints:
    "cals_reviewed_wide_revs_id_fkey" FOREIGN KEY (revs_id) REFERENCES reviews(id)

\d random_cals_reviewed_wide_scaled

View "public.random_cals_reviewed_wide_scaled"
 Column  |  Type  | Modifiers
---------+--------+-----------
 revs_id | bigint |

\d cals_reviewed_wide_scaled

Materialized view "public.cals_reviewed_wide_scaled"

      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 listing_id        | integer                     |
 date              | timestamp without time zone |
 price             | double precision            |
 parsing_timestamp | timestamp without time zone |
 min_nights        | double precision            |
 max_nights        | double precision            |
 created_at        | timestamp without time zone |
 revs_id           | integer                     |
Indexes:
    "cals_reviewed_wide_scaled_listing_id_brin" brin (listing_id)
    "cals_reviewed_wide_scaled_listing_id_hash" hash (listing_id)
    "cals_reviewed_wide_scaled_listing_id_ixs" btree (listing_id)

materialized view

CREATE MATERIALIZED VIEW cals_reviewed_wide_scaled
      as 
      SELECT 
       C.listing_id, 
       C.date, 
       C.price,
       C.parsing_timestamp,
       C.min_nights,
       C.max_nights,
       C.created_at,
       C.revs_id
       FROM cals_reviewed_wide as C
       JOIN random_cals_reviewed_wide_scaled as V
       ON C.revs_id = V.revs_id;

explain

EXPLAIN (ANALYZE, BUFFERS):

[The EXPLAIN (ANALYZE, BUFFERS) output was attached as an image; it shows sequential scans on both relations.]

One Answer

Neither table is small, and there is no WHERE condition that restricts the number of rows from either table.

work_mem is big enough to contain a hash table of cals_reviewed_wide_scaled, so the most efficient plan is a hash join.
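
You can check the current limit with a plain SHOW:

    SHOW work_mem;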

With a hash join, no index can be used, and a sequential scan is the most efficient access method.
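
For contrast, an index on listing_id does get used when a query fetches only a small fraction of the table, for example a single-listing lookup (the value 42 is purely illustrative):

    EXPLAIN SELECT * FROM listings_tail WHERE listing_id = 42;
    -- expect an Index Scan or Bitmap Heap Scan on listing_id_tail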


You can test whether a nested loop join might be faster.

For that, make sure that there is an index on listings_tail.listing_id (the existing listing_id_tail btree index covers this), then try the following to nudge PostgreSQL in the right direction:

  • Try increasing effective_cache_size to the amount of RAM you have got.

  • Try lowering random_page_cost to 1.0.

  • If all else fails, experiment with setting the following in your database session:

    SET enable_hashjoin = off;
    SET enable_mergejoin = off;
    

    Then try again, and you should see a nested loop join. That way you can test whether the nested loop is actually faster and whether PostgreSQL's choice was wrong; a complete test session is sketched below.
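
Putting the suggestions together, a test session could look like the following sketch (the effective_cache_size value is only an example and should match your machine's RAM; all of these SETs affect only the current session):

    SET effective_cache_size = '64GB';  -- example value: set to your RAM
    SET random_page_cost = 1.0;         -- treat random I/O as cheap
    SET enable_hashjoin = off;          -- discourage the hash join
    SET enable_mergejoin = off;         -- discourage a merge join too

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * from listings_tail L
    INNER JOIN cals_reviewed_wide_scaled R
    ON L.listing_id = R.listing_id;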

Answered by Laurenz Albe on September 11, 2020
