TransWikia.com

Hive / Impala best practice code structuring

Data Science Asked by Gerardsson on August 2, 2020

Coming from a DWH-background I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forego subqueries in some cases.

It made me wonder whether there are other dialect-specific differences between SQL as written for Hadoop and SQL as written for a DWH. I would therefore like to broaden this question so that people can share the differences they have noticed between Hadoop and a DWH when structuring their queries. I found very little material on this topic for Hadoop.

There are some normal principles that apply to Hadoop like mentioned here: https://streever.atlassian.net/wiki/spaces/HADOOP/pages/3211279/Hive+SQL+Best+Practices

It would be nice to hear a few of your best practices for working with Hadoop. E.g. write your queries as neutrally as possible so that they work in both Hive and Impala, avoiding engine-specific functions such as left() (Impala only).

The example I came across was a GROUP BY query that works in two different ways. In a classic RDBMS, I know that only one of the two would work.

Here is an example (columns are displayed in a different order than in the GROUP BY):

Classic SQL (Oracle, SQL Server, etc)

select t2.b, t2.a
from (select t1.a, t1.b
      from table1 t1
      group by t1.a, t1.b) t2

Hive SQL

select t1.b, t1.a
from table1 t1
group by t1.a, t1.b

Notice that in Hive everything is combined in a single query. The classic SQL snippet first does the GROUP BY in a subquery before the columns are displayed in the desired order.

I would be curious to know if you came across other subtleties like this one.

Based on my experience, it is good practice to respect the following guidelines while coding:

  1. Keep your code clean (avoid sub-queries when you can do without them).

  2. Consider the impact of changes to settings. E.g. yarn.nodemanager.resource.memory-mb=24576 may be great to use, but what happens if you are not allowed to change the node memory size? In an automated job it is not necessarily good practice. Also look at the general Hadoop settings, e.g. file sizes.

  3. Avoid using functions that are specific to one engine. E.g. select left("Hello world", 3) works in Impala, but in Hive it has to be rewritten as select substr("Hello world", 1, 3). Engine-specific functions cause problems when the query later has to run on a different engine.
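Point 3 can be sketched as follows (a minimal illustration; the literal string is arbitrary):

```sql
-- Impala only: left() takes the leftmost n characters
select left('Hello world', 3);         -- 'Hel'

-- Portable: substr() is understood by both Hive and Impala
select substr('Hello world', 1, 3);    -- 'Hel'
```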

One Answer

The biggest help to me was learning how MapReduce works in a distributed environment. I don't have exact rules of thumb for you, but I might be able to help.

  1. In Hadoop/Hive, the hive.execution.engine setting definitely matters. Make sure you understand the differences between the engines and try changing it for certain problems.
  2. Subqueries can actually HELP performance, especially with really wide tables, MapReduce as your execution engine, and Hive. For example, if you only need two or three columns from a second table, it's better to write:
SELECT T1.*, T2.col1, T2.col2
FROM T1
LEFT OUTER JOIN (SELECT key, col1, col2 FROM T2) T2
ON T1.key = T2.key;

This is all due to how the execution engine works under the covers, so learning about that and about the differences between the execution engines will really help.

  3. Temp tables are your friend. Prior to Hadoop, I avoided staging data in temporary tables and wrote more complex SQL instead. But with Hadoop and big data, I've found it much faster in some cases to create intermediate tables and then join to them.

  4. Learn how the table's underlying data is stored. STORED AS TEXTFILE behaves very differently from data stored as Parquet.

  5. Impala vs Hive also has performance implications. I've found Hive to be more stable, but slower.

  6. How the data is stored (skewed or not, bucketed or not, etc.) makes a huge difference. Keep an eye out for queries where you have 100 reducers, 99 of them finish super fast, and 1 of them takes forever.
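To illustrate point 3, here is a minimal sketch of staging with an intermediate table (table and column names are hypothetical):

```sql
-- Stage the heavy aggregation once instead of repeating it as a subquery
create table tmp_user_totals stored as parquet as
select user_id, sum(amount) as total
from transactions
group by user_id;

-- Later queries join against the much smaller staged table
select t.user_id, t.total, u.country
from tmp_user_totals t
join users u on t.user_id = u.user_id;

drop table tmp_user_totals;  -- clean up when done
```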
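For point 4, a sketch of how the storage format is declared at table creation (hypothetical table names):

```sql
-- Row-oriented text files: queries scan and parse whole lines
create table events_text (event_id bigint, event_type string, payload string)
stored as textfile;

-- Columnar Parquet: queries that read only a few columns touch far less data
create table events_parquet (event_id bigint, event_type string, payload string)
stored as parquet;
```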
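When you hit the one-slow-reducer pattern from point 6 in a join, Hive's skew-join optimization is worth trying (a sketch; the threshold is an assumption you should tune for your data):

```sql
-- Handle heavily repeated join keys in a separate pass
set hive.optimize.skewjoin=true;
-- Keys occurring more often than this count as skewed
set hive.skewjoin.key=100000;
```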

Like I mentioned, spend time learning how the execution engines do the work in a distributed environment. Once you understand how the execution happens behind the scenes, you can start to pick up on other subtleties like why this:

select count(distinct user)
from table;

can be a lot slower than this:

select count(*) from
(select user
 from table
 group by user) q;

Answered by Josh on August 2, 2020
