Problem explanation: Lateral View,Explode(),Partition By,Rank(),Dense_rank() ,Max(),Min(),Count(),Sum(),Avg() ,Cume_Dist(),Row_Number() In Hive
-----------------------hive> describe sale1;
OK
cust_id int
prod_no int
qty int
datestring string
sales_id int
Time taken: 0.106 seconds, Fetched: 5 row(s)
hive> describe cust_temp;
OK
first_name string
last_name string
category string
phone_number string
customer_id int
address array<string>
Time taken: 0.103 seconds, Fetched: 6 row(s)
Explode in hive: (one execution using lateral view /other with sub-query)
hive>select customer_id as cid, address_det from cust_temp
lateral view explode(address) mytable1 as address_det ;
hive>select t.cid from
(select concat(customer_id," ",address[0]) as cid from cust_temp
union all
select concat(customer_id," ",address[1]) as cid from cust_temp) t
sample result for above two
3783 Mediumcity
3783 67890
3784 Bigcity
3784 12345
3785 township
3785 67890
3786 Beck Drive
3786 12345
Time taken: 14.012 seconds, Fetched: 130 row(s)
hive>select customer_id,address[0],address[1] from cust_temp;
-------------------
Max,Min,Count,Sum,Avg:
hive> select avg(qty) from sale1;
hive>select count(qty) from sale1;
hive>select max(qty) from sale1;
hive>select min(qty) from sale1;
hive>select max(qty) ,min(qty) , avg(qty),count(qty) from sale1;
hive>select cust_id, max(qty) ,min(qty) , avg(qty),count(qty) from sale1
group by cust_id;
hive>select cust_id, min(qty),max(qty) , avg(qty),count(qty),cast(datestring as date) from sale1
group by cust_id,cast(datestring as date);
hint: individual results based on cust_id, datestring
---------------
Using Partition by to calculate Max,Min,Count,Sum,Avg:
Note: The table sale1 is not partitioned in any way , you can check the metadata using describe sale1.So calculating any aggregates depending on huge varying data of specific fields in the table is possible by mentioning partition by clause as shown below
hive>select cust_id, datestring,
sum(qty) over (partition by cast(datestring as date) order by cust_id) ,
count(sales_id) over (partition by cast(datestring as date) order by cust_id) from sale1;
Hint:The above statement produces a result based on
a)partition datestring ordered by cust_id in the datestring data .(on a particular date 'n' of customers can buy 'n' of products)
b)in every datestring with same data(when dealing with next customer) , it produces a sum,count,avg for every customer purchase taking the previous customer aggregates and calculating the present it is dealing with.
sample result:
cust_id, date ,sum of quantity, count no transaction, avg transactions per customer
830 2012-01-04 1 1 1.0
64 2012-01-07 1 1 1.0
331 2012-01-08 2 1 2.0
322 2012-01-09 6 1 6.0
587 2012-01-09 7 2 3.5
922 2012-01-09 8 3 2.6666666666666665
7 2012-01-24 1 1 1.0
482 2012-01-24 2 2 1.0
773 2012-01-24 6 3 2.0
64 2012-02-12 4 1 4.0
452 2012-02-12 27 2 13.5
97 2012-02-15 1 1 1.0
128 2012-02-15 5 2 2.5
19 2012-02-17 7 1 7.0
647 2012-02-17 9 2 4.5
482 2012-03-08 1 1 1.0
922 2012-03-09 24 1 24.0
47 2012-03-10 1 1 1.0
431 2012-03-10 2 2 1.0
433 2012-04-07 2 1 2.0
922 2012-04-09 3 1 3.0
895 2012-04-15 7 1 7.0
1993 2012-04-15 10 2 5.0
720 2012-05-09 2 1 2.0
3221 2012-05-09 17 2 8.5
24 2012-05-17 5 1 5.0
773 2012-05-17 15 2 7.5
102 2012-06-20 1 1 1.0
720 2012-06-20 2 2 1.0
11 2012-07-24 2 1 2.0
3221 2012-07-24 3 2 1.5
176 2012-08-09 1 1 1.0
227 2012-09-24 1 1 1.0
323 2012-09-24 3 2 1.5
64 2012-10-09 17 1 17.0
452 2012-10-09 19 2 9.5
895 2012-12-09 31 1 31.0
1993 2012-12-09 33 2 16.5
Time taken: 21.323 seconds, Fetched: 38 row(s)
hive> select datestring,
sum(qty) over (partition by cast(datestring as date)) ,
count(sales_id) over (partition by cast(datestring as date) ) ,
avg(qty) over (partition by cast(datestring as date)) from sale1;
Hint:If the partition by cast(datestring as date) is only specified , then only the final result will vary because it will partition by only first field date and aggregations are done on the whole partition . Every repetition of same data of datestring it will print one line with same aggregates
Sample result
OK
2012-01-04 1 1 1.0
2012-01-07 1 1 1.0
2012-01-08 2 1 2.0
2012-01-09 8 3 2.6666666666666665
2012-01-09 8 3 2.6666666666666665
2012-01-09 8 3 2.6666666666666665
2012-01-24 6 3 2.0
---------------------------------------
Rank & Dense Rank
hive>select a.*,
rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id;
(OR/below both gives same result)
hive>select a.*,
rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;
hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id;
(OR/below both gives same result)
hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;
Hint:Dense Rank will rank the cust_id in ascending order in sequence without any missing integer
hive>select a.cust_id, count(*),
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sales a
group BY a.cust_id;
hive>select a.cust_id, a.datestring,
percent_rank() over (partition by cast(a.datestring as date) order by a.cust_id) as rankcust1,
dense_rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust2,
rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust0
from sale1 a ;
The below result is ranked, dense_rank ,percent_rank by
step 1: partition by date
step2: in each partition by date /order it by cust_id
Step3:In each partition after step2 rank them accordingly
sample Output:
cust_id ,datestring, percent_rank, dense_rank, rank
331 2012-01-08 0.0 1 1
322 2012-01-09 0.0 1 1
322 2012-01-09 0.0 1 1
587 2012-01-09 0.5 2 3
587 2012-01-09 0.5 2 3
922 2012-01-09 1.0 3 5
7 2012-01-24 0.0 1 1
7 2012-01-24 0.0 1 1
11 2012-01-24 0.3333333333333333 2 3
482 2012-01-24 0.5 3 4
482 2012-01-24 0.5 3 4
773 2012-01-24 0.8333333333333334 4 6
3221 2012-01-24 1.0 5 7
64 2012-02-12 0.0 1 1
452 2012-02-12 1.0 2 2
hive>select a.cust_id, a.datestring,
dense_rank() over ( order by a.cust_id ) as rankcust2,
rank() over (order by a.cust_id ) as rankcust0
from sale1 a ;
The below result is
step1:order by cust_id
Step2:then rank by accordingly
sample output:
7 2012-01-24 1 1
7 2012-01-24 1 1
11 2012-01-24 2 3
11 2012-07-24 2 3
19 2012-06-20 3 5
19 2012-02-17 3 5
24 2012-05-17 4 7
24 2012-12-09 4 7
47 2012-03-10 5 9
__________________
Cume_Dist()
hive>select cust_id,
cume_dist() over (order by qty ) from sale1;
hive>select cust_id,
cume_dist() over (partition by cast(datestring as date) order by cust_id) from sale1 ;
____________________
Row_Number()
hive>select cust_id, datestring,
row_number() over (partition by cast(datestring as date) ) from sale1;
Result: explanation: Step1: partition by datestring
step2: in each datestring partition row numbers are assigned.
step3:for every partition the row number starts from one the beginning
OK
830 2012-01-04 1
64 2012-01-07 1
331 2012-01-08 1
587 2012-01-09 1
922 2012-01-09 2
322 2012-01-09 3
322 2012-01-09 4
587 2012-01-09 5
773 2012-01-24 1
482 2012-01-24 2
7 2012-01-24 3
7 2012-01-24 4
3221 2012-01-24 5
11 2012-01-24 6
482 2012-01-24 7
produces 51 rows.
hive>select cust_id, datestring,
row_number() over (order by cust_id ) from sale1;
Explanation:
step1:Row_number() is given by order of cust_id
step2:Along with it cust_id,datestring is displayed
sample output:
7 2012-01-24 1
7 2012-01-24 2
11 2012-01-24 3
11 2012-07-24 4
19 2012-06-20 5
19 2012-02-17 6
24 2012-05-17 7
24 2012-12-09 8
47 2012-03-10 9
64 2012-10-09 10
64 2012-02-12 11
64 2012-01-07 12
97 2012-02-15 13
hive>select cust_id, datestring,
row_number() over (partition by cast(datestring as date) ) from sale1
group by cust_id, datestring;
Hint:produces 44 rows
sample result:Step1: partition by datestring
Step2: because it grouped by cust_id and datestring ( the duplicates in them are removed if both fields are same in the rows)
Step3:row_number is given to each row in each partition (that is the reason we are seeing the number starting from one again)
OK
830 2012-01-04 1
64 2012-01-07 1
331 2012-01-08 1
322 2012-01-09 1
922 2012-01-09 2
587 2012-01-09 3
773 2012-01-24 1
11 2012-01-24 2
3221 2012-01-24 3
7 2012-01-24 4
482 2012-01-24 5
452 2012-02-12 1
64 2012-02-12 2
128 2012-02-15 1
97 2012-02-15 2
647 2012-02-17 1
19 2012-02-17 2
482 2012-03-08 1
922 2012-03-09 1
47 2012-03-10
431 2012-03-10 2
433 2012-04-07 1
922 2012-04-09 1
895 2012-04-15 1
1993 2012-04-15 2
720 2012-05-09 1
3221 2012-05-09 2
24 2012-05-17 1
773 2012-05-17 2
720 2012-06-20 1
452 2012-06-20 2
102 2012-06-20 3
19 2012-06-20 4
3221 2012-07-24 1
11 2012-07-24 2
176 2012-08-09 1
323 2012-09-24 1
227 2012-09-24 2
64 2012-10-09 1
452 2012-10-09 2
1993 2012-12-09 1
24 2012-12-09 2
895 2012-12-09 3
482 2012-12-09 4
Time taken: 47.869 seconds, Fetched: 44 row(s)
hive>select cust_id, datestring,
row_number() over (order by cast(datestring as date)) from sale1
group by cust_id, datestring;
Result:Explanation:
Step1: order by datestring
Step2: because it grouped by cust_id and datestring ( the duplicates in them are removed if both fields are same in the rows)
Step3:row_number is given to each row.
OK
830 2012-01-04 1
64 2012-01-07 2
331 2012-01-08 3
322 2012-01-09 4
922 2012-01-09 5
587 2012-01-09 6
773 2012-01-24 7
11 2012-01-24 8
3221 2012-01-24 9
7 2012-01-24 10
482 2012-01-24 11
452 2012-02-12 12
64 2012-02-12 13
128 2012-02-15 14
97 2012-02-15 15
647 2012-02-17 16
19 2012-02-17 17
482 2012-03-08 18
922 2012-03-09 19
47 2012-03-10 20
431 2012-03-10 21
433 2012-04-07 22
922 2012-04-09 23
895 2012-04-15 24
1993 2012-04-15 25
720 2012-05-09 26
3221 2012-05-09 27
24 2012-05-17 28
773 2012-05-17 29
720 2012-06-20 30
452 2012-06-20 31
102 2012-06-20 32
19 2012-06-20 33
3221 2012-07-24 34
11 2012-07-24 35
176 2012-08-09 36
323 2012-09-24 37
227 2012-09-24 38
64 2012-10-09 39
452 2012-10-09 40
1993 2012-12-09 41
24 2012-12-09 42
895 2012-12-09 43
482 2012-12-09 44
Time taken: 47.808 seconds, Fetched: 44 row(s)