
Sqoop split-by option gives an error when a derived column is used as the split-by column

Stack Overflow Asked by anu john on November 10, 2021

I have an Oracle query that fetches 25 million records. There is no PK and no column that is distributed evenly enough to serve as the split-by column, so I decided to generate a sequence number using ROW_NUMBER() OVER () AS RANGEGROUP. But when I use this derived column in split-by, I get the following error:

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "P"."RANGEGROUP": invalid identifier at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91).

I am giving the alias correctly. I also tried without an alias on the derived column, and it still gives the same error.
Can derived columns be used in Sqoop split-by, or must the column be physically present in the table?

One Answer

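Wrap the row_number calculation in a subquery, then use the derived column in split-by. Sqoop replaces $CONDITIONS in the WHERE clause with a range filter on the split column, and a column alias defined in the SELECT list of the same query level is not visible in that WHERE clause, hence ORA-00904. Once the alias comes from a subquery, the outer query can filter on it.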

   --query "select col1, ... colN, RANGEGROUP
            from (select t.*, row_number() OVER (order by t.item_id) AS RANGEGROUP
                  from table t) s
            where 1=1 and \$CONDITIONS"

row_number must be deterministic: when the query is executed multiple times, it must assign exactly the same number to every row. If the ORDER BY inside OVER uses a column or combination of columns that is not unique, row_number can return different numbers for the same rows from one execution to the next.

This matters for split-by because Sqoop runs the same query in different containers (mappers) in parallel, each with a different condition selecting its own split range. If the numbering is not stable, you will get duplicates: a row that falls in range 1 (say 1-100) in the first mapper can also appear in range 2 (say 101-200) when mapper 2 executes the same query with its own filter. By the same logic, some other row can end up in no range at all.
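The duplication scenario described above can be simulated without a database. This is only a toy model, not Sqoop or Oracle code: the function names and sample rows are made up, and the "scan order" each mapper sees stands in for whatever arbitrary order the database happens to return tied rows in.

```python
def row_numbers(rows_as_scanned, order_by):
    """Toy ROW_NUMBER() OVER (ORDER BY ...): number the rows after sorting.

    sorted() is stable, so rows that tie on the sort key keep the order in
    which they were scanned. That models a database breaking ties arbitrarily.
    """
    ordered = sorted(rows_as_scanned, key=order_by)
    return list(enumerate(ordered, start=1))


def mapper(rows_as_scanned, order_by, lo, hi):
    """One mapper: re-run the numbering and keep rows with lo <= number <= hi."""
    return [row for n, row in row_numbers(rows_as_scanned, order_by) if lo <= n <= hi]


# Hypothetical table: (item_id, name). item_id alone is NOT unique.
ROWS = [(1, "a"), (1, "b"), (1, "c"), (1, "d"), (2, "e"), (2, "f")]


def simulate_import(order_by):
    """Two mappers run the same query, but each sees the rows in a different scan order."""
    first = mapper(ROWS, order_by, 1, 3)         # split range 1-3
    second = mapper(ROWS[::-1], order_by, 4, 6)  # split range 4-6, different scan order
    return first + second


# ORDER BY item_id: ties are broken differently in each mapper.
non_unique = simulate_import(lambda r: r[0])
# ORDER BY item_id, name: the combination is unique, so numbering is stable.
unique = simulate_import(lambda r: (r[0], r[1]))

print(non_unique)  # (1, 'a') is imported twice and (1, 'd') is never imported
print(unique)      # every row is imported exactly once
```

In SQL terms, the fix corresponds to adding tie-breaking columns to the ORDER BY inside OVER until the combination is unique.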

If the ID is an int (and much better if it is evenly distributed), use that ID directly. Where you may need row_number is when the candidate column is a STRING. Read this: https://stackoverflow.com/a/37389134/2700344. The split column is not necessarily a PK.
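To see why an evenly distributed integer column matters, here is a simplified sketch of range-based splitting. It assumes the splits are contiguous, equal-width ranges between the minimum and maximum of the split column, which approximates but does not reproduce Sqoop's actual splitter. `split_ranges`, `rows_per_mapper`, and the sample IDs are hypothetical.

```python
def split_ranges(lo, hi, num_mappers):
    """Cut the inclusive range [lo, hi] into num_mappers contiguous, equal-width ranges."""
    span = hi - lo + 1
    bounds = [lo + (i * span) // num_mappers for i in range(num_mappers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_mapper(ids, ranges):
    """Count how many ids fall into each mapper's range."""
    return [sum(lo <= i <= hi for i in ids) for lo, hi in ranges]


# Hypothetical skewed ID column: 90 small ids plus 10 ids close to 1000.
skewed_ids = list(range(1, 91)) + list(range(991, 1001))
skewed = rows_per_mapper(skewed_ids, split_ranges(min(skewed_ids), max(skewed_ids), 4))

# A dense sequence such as a row_number over the same 100 rows.
dense_ids = list(range(1, 101))
even = rows_per_mapper(dense_ids, split_ranges(min(dense_ids), max(dense_ids), 4))

print(skewed)  # [90, 0, 0, 10]: one mapper does almost all the work
print(even)    # [25, 25, 25, 25]
```

With the skewed column, two of the four mappers have nothing to do, which is the situation a dense, deterministic row_number avoids.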

Answered by leftjoin on November 10, 2021
