Loading a Patitioned Hive Table with Cluster clause, sort(hands on explanation)
--------------------------------------------
Step 1 : create a stage table to load data into table directly from storage area.
hive> create table
product_stage (prod_name string, description varchar(20),category string,qty_onhand int, prod_num bigint, packaged_with Array<string> )
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
step 2:create a partitioned table
>create table
product_part (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
partitioned by (category string)
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
step 3:
set hive.exec.dynamic.partition.mode=nonstrict (to dynamically load partition data without specifying it statically like
:partition category='ram')
when loading partition dynamically the column should be last specified in the select clause.
step4:to load data
hive>from
product_stage
insert overwrite table product_part partition (category)
select prod_name, description,qty_onhand,prod_num, packaged_with,category
Hint:one part file for each partition
Additional :
hive>create table
product_pcsb (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
partitioned by (category string)
CLUSTERED BY(prod_num) SORTED BY(qty_onhand ASC) into 2 buckets
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
Hint: two part files will be produced for each partition with sorted on qty_onhand in each part file.
hive>create table
product_pc3b (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
partitioned by (category string)
CLUSTERED BY(prod_num) into 3 buckets
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
Hint:Three part file will be produced for each partition
Data file
PROD_NAME DESCRIPTION CATEGORY QTY_ON_HAND PROD_NUM PACKAGED_WITH
2 GB Memory E 2 GB Memory ECC Ram 3000 87655 manual:heatsink
4 GB Memory E 4 GB Memory ECC Ram 1000 87659 manual:heatsink
16 GB Memory E 16 GB Memory ECC Ram 238 87634 manual
500 GB HD J 500 GB HD Panther Brand HD 200 45628 atacable:manual
500 GB HD T 500 GB HD Tiger Brand HD 498 45641 satacable:manual
1 TB HD J 1 TB HD Jargon Brand HD 231 45691
4 Core CPU J3 4 Core CPU Jargon Brand 3 GHZ CPU 50 98820 thermalpaste:heatsink:manual
2 Core CPU J2 2 Core CPU Jargon Brand 2 GHZ CPU 118 98838 thermalpaste:heatsink
1 Core CPU J2 1 Core CPU Jargon Brand 2 GHZ CPU 203 98792 thermalpaste:heatsink
94F991 MB Motherboard F991 CPU MB 19 282299
94G822 MB Motherboard G822 CPU MB 30 282109
93H772 MB Motherboard H772 CPU MB 15 282009 cables:screws
93G Video Video Card Jargon Brand 93G Video 80 99202 dvd:manual:game
84G1 Video Video Card Jargon Brand 84F1 Video 14 99207 dvd:manual:hdmicable
09K Video Video Card Tiger Brand 84F1 Video 5 98243 manual:game
J Case 1500 Computer Case Jargon Brand Style 1500 Case 20 77623 fans:manual:screws
J Case 1501 Computer Case Jargon Brand Style 1501 Case 18 77624 fans:manual:screws
T Case 4332 Computer Case Tiger Brand Style 4332 Case 7 88211 fans:manual:screws:watercooler
J Power 300W Power Supply Jargon Brand 300 Watts Power 28 92387 cables:screws
J Power 500W Power Supply Jargon Brand 500 Watts Power 17 92373 cables:screws
T Power 300W Power Supply Tiger Brand 300 Watts Power 8 93347 cables:screws
DVD J INT DVD Jargon Brand Internal Optical 23 88734 manual
DVD J EXT DVD Jargon Brand External Optical 45 88821
DVD T INT DVD Tiger Brand Internal Optical 19 82331 satacable:manual
DVD T EXT DVD Tiger Brand External Optical 17 82337 satacable:manual
the below statement Still works but result is similar to a regular load with one part file (clustered by , sorted by clauses does not do any)
hive>create table
product_cls (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
CLUSTERED BY(prod_num) SORTED BY(qty_onhand ASC) into 5 buckets
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
hive>from
product
insert overwrite table
product_cls
select prod_name, description,qty_onhand,prod_num, packaged_with
Errors:Without INTO BUCKETS clause followed to CLUSTERED BY < column_name> a error will be generated.