How to find the most common strings between rows

Database Administrators Asked by ATMdev on January 2, 2022

In SQL 2017, I have a ‘food’ table as follows:

ID    item    Ingredients
----  -----   -----------
1     item1   flour,sugar,cocoa,butter
2     item2   flour,sugar,butter,water
3     item3   sugar,cocoa,water
4     item4   sugar,cocoa,butter
5     item5   flour,water

Although the "Ingredients" are a CSV in a single field, I have no problem moving them to a related table:

ID    Ingredient
---   ----------
 1    flour
 1    sugar
 1    cocoa
 1    butter
 2    flour

or just select them with a CROSS APPLY string_split

select ID, item, ingredient
from food
CROSS APPLY string_split(food.ingredients,',') as ing

Now here is what I am trying to attempt:

For any given row ex:(where id = 1) I want to return the top n rows (other than id = 1) that have the most common ingredients with id = 1.

So the top 2 with the most common ingredients with id = 1 should be:

ID    item    Ingredients
----  -----   -----------
2     item2   flour,sugar,butter,water    ( 3 in common)
3     item3   sugar,cocoa,water           ( 2 in common)
4     item4   sugar,cocoa,butter          ( 3 in common)
5     item5   flour,water                 ( 1 in common)

The result should return rows with id = 2 and id = 4 as both have 3 ingredients in common with row id = 1

This is my first question on StackExchange, and I hope it is clear.
I can write this as a C# or Python batch program to create a table of ‘closest ingredients’, but I’m hoping I can do this in SQL.

Thanks !

One Answer

Please store the ingredients in a table not a list. So let's call that ItemIngredient with two columns ItemId and Ingredient and leave it at that. The primary key will be (ItemId, Ingredient). Because searches by Ingredient will probably happen a lot, let's put a non-unique index on Ingredient (which is necessary for below to work the most efficiently).

     ,Match.ItemId AS MatchItemId
     ,COUNT(*) AS IngredientMatchCount
     ,RANK() OVER (PARTITION BY Base.ItemId ORDER BY COUNT(*) DESC, Match.ItemId) AS MatchRank
      ItemIngredient Base
      ItemIngredient Match
        ON Match.Ingredient = Base.Ingredient 
             AND Match.ItemId <> Base.ItemId
    -- Put your WHERE clause here if you want to only fine matches for certain Ids
  ) matched
  MatchRank <= 1 --Or whatever

You can experiment with RANK vs DENSE_RANK at some point to see what provides the best results for you. If you need details about the Item you can join back in the outer query.

Answered by bbaird on January 2, 2022

Add your own answers!

Related Questions

Install mysql through deb file on ubuntu 20.04

1  Asked on September 13, 2020 by ajeetkumar


Indexes not used for join

1  Asked on September 11, 2020 by dmitriy-grankin


tables inheritance stored procedure

1  Asked on September 10, 2020 by juclart-ngouedi


Schema for scarce goods

1  Asked on September 5, 2020 by empedokles


Restoring multiple files to multiple database with pg_restore

1  Asked on September 2, 2020 by irakli-ugulava


Getting midnight of today

4  Asked on August 25, 2020 by logu


Compare 2 columns in 2 different tables with non uniques values

3  Asked on August 18, 2020 by deepak-garg


How to easily convert utf8 tables to utf8mb4 in MySQL 5.5

10  Asked on August 14, 2020 by geoaxis


SSIS Package Not Being Deployed

1  Asked on August 14, 2020 by steveg


Ask a Question

Get help from others!

© 2023 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP