I need to build a system where users create dynamic filters based on our customers' attributes. There are roughly 30 possible filters and 30 million customers, but the number of customers grows every day and attribute values can change daily too, so this data set receives inserts and updates every day. Filters can also be added or removed over time.
We could use a relational database like Oracle and create an index on every column, but with inserts and updates every day, could that cause performance problems? Should I use a search engine like Elasticsearch for this case? Or is there a recommended database or architecture for this use case?
I need to return a count of the customers that match these filters in at most 5 seconds.
The user can filter with comparison operators (>, <, =, >=, <=) or with set membership, e.g. city IN ('New York', 'Seattle').
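For instance, a filter count query might look like the following sketch — the column names (age, city, last_purchase_date) are purely illustrative, since the question does not describe the actual schema:

```sql
-- Hypothetical example: column names are illustrative, not taken
-- from the real schema. The question only fixes the shape: ~30
-- filterable attributes, COUNT(*) returned within 5 seconds.
SELECT COUNT(*)
FROM customer
WHERE age >= 30
  AND city IN ('New York', 'Seattle')
  AND last_purchase_date < DATE '2021-01-01';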
(If you have deep enough pockets for Oracle and can handle the audit/licensing requirements)
Create the big ugly wide table, but then use Oracle's In-Memory Column Store to speed up your analytic queries. At a high level, each column is stored separately in a dedicated in-memory area (at least partially), and compression and a few other storage tricks are used to speed up lookups.
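Enabling it is mostly declarative; a minimal sketch, assuming the In-Memory option is licensed, an inmemory_size has been configured, and illustrative object names:

```sql
-- Populate the wide table into the in-memory column store.
ALTER TABLE customer INMEMORY MEMCOMPRESS FOR QUERY HIGH;

-- Optionally keep rarely-filtered columns out of the column store
-- (notes_blob is a hypothetical column name):
ALTER TABLE customer INMEMORY NO INMEMORY (notes_blob);
```

Queries need no changes; the optimizer chooses the in-memory scan on its own when it is cheaper.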
This works great for the type of ad-hoc workflow you describe and doesn't require much thought about table design other than choosing the correct data types (all else being equal, smaller = faster).
As always, test thoroughly, as vendors' claims are often inflated. But the engineering is sound, and others have found column stores to be very beneficial.
SQL Server has a similar solution, the clustered columnstore index, but it impacts update performance in ways that may make it unsuitable for your workflow. If you have a wide window during which you can perform your updates, it may work well. Worth a try.
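If you do try it, creation is a one-liner; a sketch with an illustrative table name:

```sql
-- SQL Server: convert the wide table into a clustered columnstore.
-- Batch your daily updates where possible: trickle updates land in
-- deltastore rowgroups, fragment the columnstore, and hurt scans.
CREATE CLUSTERED COLUMNSTORE INDEX cci_customer ON dbo.customer;
```

Rebuilding or reorganizing the index during your update window compacts those deltastore rowgroups back into compressed segments.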
(If you don't have deep pockets or clustered columnstore doesn't work)
Recreate, in part, the same idea as a columnstore, but in ordinary database tables. Some refer to this as sixth normal form (a timestamp is not required for the definition), others as vertical partitioning. I just tend to think of it as "don't read more data than you have to."
You will need an RDBMS that has both clustered indexes (index-organized tables) and join elimination. Right now that's SQL Server, Oracle, Sybase, and DB2.
The idea is you have your Customer table with a very basic amount of data stored in it. For every attribute, you create a table Customer_<attribute> that contains the key from Customer and the column(s)* associated with that attribute. If that attribute is NULL for a customer, you do not insert a row into that table.**
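A minimal sketch of the pattern, with hypothetical attribute names (city, age) standing in for the real ones:

```sql
-- Hub table: only the key and truly core columns.
CREATE TABLE customer (
    customer_id INT NOT NULL PRIMARY KEY
);

-- One narrow table per attribute. The primary key doubles as the
-- clustered index, so a filter on this attribute reads only this
-- attribute's data.
CREATE TABLE customer_city (
    customer_id INT NOT NULL PRIMARY KEY
        REFERENCES customer (customer_id),
    city VARCHAR(100) NOT NULL   -- absence of a row means NULL
);

CREATE TABLE customer_age (
    customer_id INT NOT NULL PRIMARY KEY
        REFERENCES customer (customer_id),
    age INT NOT NULL
);
```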
This seems like a lot of work, but if you have a list of columns and their data types, you can write a script to generate the table creation as well as the required insert/update procedures.
You can also write a script to generate a view with all of your columns (for simplicity's sake you can just LEFT JOIN all the attribute tables back to the Customer table). The view will look like the big ugly wide table, but with join elimination it will behave much differently.
So if you search for customers on three attributes, the query engine will look at the request, realize it only needs to hit three tables in addition to Customer, and off you go.
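The view and a query against it might look like this (attribute names still hypothetical). Because each branch is a LEFT JOIN on a unique key, the optimizer can eliminate every joined table the query does not actually reference:

```sql
CREATE VIEW customer_wide AS
SELECT c.customer_id,
       ci.city,
       a.age
FROM customer AS c
LEFT JOIN customer_city AS ci ON ci.customer_id = c.customer_id
LEFT JOIN customer_age  AS a  ON a.customer_id  = c.customer_id;

-- customer_city is eliminated from the plan: no column from it is
-- referenced, and the PK guarantees the join cannot multiply rows.
SELECT COUNT(*)
FROM customer_wide
WHERE age >= 30;
```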
Some tables will benefit from secondary indexes and you can add them as necessary and if they improve query performance (they don't always).
This solution also has the added benefit of making updates and inserts fairly quick, without requiring all the data to undergo a lock.
The biggest downside to this solution is you'll get some "expert" who will walk in, proclaim "joins are bad!" and you'll have to patiently explain to them why they are not.
*If one of the attributes consists of a foreign key reference to a composite key, you would not split those columns apart. Likewise, if certain columns will be used together to search for a customer, you would keep them in the same attribute table.
**This pattern requires that NOT NULL be enforced through your update/delete processes: when an attribute becomes NULL, its row must be deleted, since absence of a row is what represents NULL.
(Really, please do not do these)
Entity-attribute-value (EAV). It looks simple, but in practice it's a nightmare: join logic is convoluted, data/relational integrity is impossible to maintain, and table locks are disastrous.
Big, ugly, wide table with an index on every column. Wastes a lot of space, and performance isn't great.
Big Data/Whatever is hot this week. No/limited indexes. No clear access path to the data. Data/relational integrity may vary from weak to non-existent.
Answered by bbaird on January 6, 2022