HadoopExamples: Lateral View,Explode(),Partition By,Rank(),Dense_rank() ,Max(),Min(),Count(),Sum(),Avg() ,Cume_Dist(),Row

Problem explanation: Lateral View,Explode(),Partition By,Rank(),Dense_rank() ,Max(),Min(),Count(),Sum(),Avg() ,Cume_Dist(),Row_Number() In Hive

-----------------------
hive> describe sale1;
OK
cust_id                 int
prod_no                 int
qty                     int
datestring              string
sales_id                int
Time taken: 0.106 seconds, Fetched: 5 row(s)
hive> describe cust_temp;
OK
first_name              string
last_name               string
category                string
phone_number            string
customer_id             int
address                 array<string>
Time taken: 0.103 seconds, Fetched: 6 row(s)

Explode in hive: (one execution using lateral view /other with sub-query)

hive>select customer_id as cid, address_det from cust_temp
lateral view explode(address) mytable1 as address_det ;

hive>select t.cid from
(select concat(customer_id," ",address[0]) as cid from cust_temp
union all
select concat(customer_id," ",address[1]) as cid from cust_temp) t

sample result for above two
3783   Mediumcity
3783   67890
3784   Bigcity
3784   12345
3785   township
3785   67890
3786   Beck Drive
3786   12345
Time taken: 14.012 seconds, Fetched: 130 row(s)

hive>select customer_id,address[0],address[1] from cust_temp;
-------------------
Max,Min,Count,Sum,Avg:
hive> select avg(qty) from sale1;
hive>select count(qty) from sale1;
hive>select max(qty) from sale1;
hive>select min(qty) from sale1;
hive>select max(qty) ,min(qty) , avg(qty),count(qty) from sale1;

hive>select cust_id, max(qty) ,min(qty) , avg(qty),count(qty) from sale1
group by cust_id;

hive>select cust_id, min(qty),max(qty) , avg(qty),count(qty),cast(datestring as date) from sale1
group by   cust_id,cast(datestring as date);

hint: individual results based on cust_id, datestring
---------------
Using Partition by to calculate Max,Min,Count,Sum,Avg:
Note: The table sale1 is not partitioned in any way , you can check the metadata using describe sale1.So calculating any aggregates depending on huge varying data of specific fields in the table is possible by mentioning partition by clause as shown below

hive>select cust_id, datestring,
sum(qty) over (partition by cast(datestring as date) order by cust_id) ,
count(sales_id) over (partition by cast(datestring as date) order by cust_id) from sale1;

Hint:The above statement produces a result based on
a)partition datestring ordered by cust_id in the datestring data .(on a particular date 'n' of customers can buy 'n' of products)
b)in every datestring with same data(when dealing with next customer) , it produces a sum,count,avg for every customer purchase taking the previous customer aggregates and calculating the present it is dealing with.

sample result:
cust_id, date ,sum of quantity, count no transaction, avg transactions per customer
830    2012-01-04    1    1    1.0
64    2012-01-07    1    1    1.0
331    2012-01-08    2    1    2.0
322    2012-01-09    6    1    6.0
587    2012-01-09    7    2    3.5
922    2012-01-09    8    3    2.6666666666666665
7    2012-01-24    1    1    1.0
482    2012-01-24    2    2    1.0
773    2012-01-24    6    3    2.0
64    2012-02-12    4    1    4.0
452    2012-02-12    27    2    13.5
97    2012-02-15    1    1    1.0
128    2012-02-15    5    2    2.5
19    2012-02-17    7    1    7.0
647    2012-02-17    9    2    4.5
482    2012-03-08    1    1    1.0
922    2012-03-09    24    1    24.0
47    2012-03-10    1    1    1.0
431    2012-03-10    2    2    1.0
433    2012-04-07    2    1    2.0
922    2012-04-09    3    1    3.0
895    2012-04-15    7    1    7.0
1993    2012-04-15    10    2    5.0
720    2012-05-09    2    1    2.0
3221    2012-05-09    17    2    8.5
24    2012-05-17    5    1    5.0
773    2012-05-17    15    2    7.5
102    2012-06-20    1    1    1.0
720    2012-06-20    2    2    1.0
11    2012-07-24    2    1    2.0
3221    2012-07-24    3    2    1.5
176    2012-08-09    1    1    1.0
227    2012-09-24    1    1    1.0
323    2012-09-24    3    2    1.5
64    2012-10-09    17    1    17.0
452    2012-10-09    19    2    9.5
895    2012-12-09    31    1    31.0
1993    2012-12-09    33    2    16.5
Time taken: 21.323 seconds, Fetched: 38 row(s)

hive> select datestring,
sum(qty) over (partition by cast(datestring as date)) ,
count(sales_id) over (partition by cast(datestring as date) ) ,
avg(qty) over (partition by cast(datestring as date)) from sale1;

Hint:If the partition by cast(datestring as date) is only specified , then only the final result will vary because it will partition by only first field date and aggregations are done on the whole partition . Every repetition of same data of datestring it will print one line with same aggregates
Sample result
OK
2012-01-04    1    1    1.0
2012-01-07    1    1    1.0
2012-01-08    2    1    2.0
2012-01-09    8    3    2.6666666666666665
2012-01-09    8    3    2.6666666666666665
2012-01-09    8    3    2.6666666666666665
2012-01-24    6    3    2.0

---------------------------------------
Rank & Dense Rank

hive>select a.*,
rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id;

(OR/below both gives same result)

hive>select a.*,
rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;

hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id;

(OR/below both gives same result)

hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;

Hint:Dense Rank will rank the cust_id in ascending order in sequence without any missing integer

hive>select a.cust_id, count(*),
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sales a
group BY a.cust_id;

hive>select a.cust_id, a.datestring,
percent_rank() over (partition by cast(a.datestring as date) order by a.cust_id) as rankcust1,
dense_rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust2,
rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust0
from sale1 a ;

The below result is ranked, dense_rank ,percent_rank by
step 1: partition by date
step2: in each partition by date /order it by cust_id
Step3:In each partition after step2 rank them accordingly

sample Output:
cust_id ,datestring, percent_rank, dense_rank, rank
331   2012-01-08   0.0   1   1
322   2012-01-09   0.0   1   1
322   2012-01-09   0.0   1   1
587   2012-01-09   0.5   2   3
587   2012-01-09   0.5   2   3
922   2012-01-09   1.0   3   5
7   2012-01-24   0.0   1   1
7   2012-01-24   0.0   1   1
11   2012-01-24   0.3333333333333333   2   3
482   2012-01-24   0.5   3   4
482   2012-01-24   0.5   3   4
773   2012-01-24   0.8333333333333334   4   6
3221   2012-01-24   1.0   5   7
64   2012-02-12   0.0   1   1
452   2012-02-12   1.0   2   2

hive>select a.cust_id, a.datestring,

dense_rank() over ( order by a.cust_id ) as rankcust2,
rank() over (order by a.cust_id ) as rankcust0
from sale1 a ;

The below result is
step1:order by cust_id
Step2:then rank by accordingly
sample output:
7   2012-01-24   1   1
7   2012-01-24   1   1
11   2012-01-24   2   3
11   2012-07-24   2   3
19   2012-06-20   3   5
19   2012-02-17   3   5
24   2012-05-17   4   7
24   2012-12-09   4   7
47   2012-03-10   5   9

__________________

Cume_Dist()
hive>select cust_id,
cume_dist() over (order by qty ) from sale1;

hive>select cust_id,
cume_dist() over (partition by cast(datestring as date) order by cust_id) from sale1 ;
____________________
Row_Number()

hive>select cust_id, datestring,
row_number() over (partition by cast(datestring as date) ) from sale1;
Result: explanation: Step1: partition by datestring
step2: in each datestring partition row numbers are assigned.
step3:for every partition the row number starts from one the beginning
OK
830    2012-01-04    1
64    2012-01-07    1
331    2012-01-08    1
587    2012-01-09    1
922    2012-01-09    2
322    2012-01-09    3
322    2012-01-09    4
587    2012-01-09    5
773    2012-01-24    1
482    2012-01-24    2
7    2012-01-24    3
7    2012-01-24    4
3221    2012-01-24    5
11    2012-01-24    6
482    2012-01-24    7
produces 51 rows.

hive>select cust_id, datestring,
row_number() over (order by cust_id ) from sale1;

Explanation:
step1:Row_number() is given by order of cust_id
step2:Along with it cust_id,datestring is displayed
sample output:
7    2012-01-24    1
7    2012-01-24    2
11    2012-01-24    3
11    2012-07-24    4
19    2012-06-20    5
19    2012-02-17    6
24    2012-05-17    7
24    2012-12-09    8
47    2012-03-10    9
64    2012-10-09    10
64    2012-02-12    11
64    2012-01-07    12
97    2012-02-15    13

hive>select cust_id, datestring,
row_number() over (partition by cast(datestring as date) ) from sale1
group by cust_id, datestring;

Hint:produces 44 rows
sample result:Step1: partition by datestring
Step2: because it grouped by cust_id and datestring ( the duplicates in them are removed if both fields are same in the rows)
Step3:row_number is given to each row in each partition (that is the reason we are seeing the number starting from one again)
OK
830   2012-01-04   1
64   2012-01-07   1
331   2012-01-08   1
322   2012-01-09   1
922   2012-01-09   2
587   2012-01-09   3
773   2012-01-24   1
11   2012-01-24   2
3221   2012-01-24   3
7   2012-01-24   4
482   2012-01-24   5
452   2012-02-12   1
64   2012-02-12   2
128   2012-02-15   1
97   2012-02-15   2
647   2012-02-17   1
19   2012-02-17   2
482   2012-03-08   1
922   2012-03-09   1
47   2012-03-10
431   2012-03-10   2
433   2012-04-07   1
922   2012-04-09   1
895   2012-04-15   1
1993   2012-04-15   2
720   2012-05-09   1
3221   2012-05-09   2
24   2012-05-17   1
773   2012-05-17   2
720   2012-06-20   1
452   2012-06-20   2
102   2012-06-20   3
19   2012-06-20   4
3221   2012-07-24   1
11   2012-07-24   2
176   2012-08-09   1
323   2012-09-24   1
227   2012-09-24   2
64   2012-10-09   1
452   2012-10-09   2
1993   2012-12-09   1
24   2012-12-09   2
895   2012-12-09   3
482   2012-12-09   4
Time taken: 47.869 seconds, Fetched: 44 row(s)

hive>select cust_id, datestring,
row_number() over (order by cast(datestring as date)) from sale1
group by cust_id, datestring;

Result:Explanation:
Step1: order by datestring
Step2: because it grouped by cust_id and datestring ( the duplicates in them are removed if both fields are same in the rows)
Step3:row_number is given to each row.
OK
830   2012-01-04   1
64   2012-01-07   2
331   2012-01-08   3
322   2012-01-09   4
922   2012-01-09   5
587   2012-01-09   6
773   2012-01-24   7
11   2012-01-24   8
3221   2012-01-24   9
7   2012-01-24   10
482   2012-01-24   11
452   2012-02-12   12
64   2012-02-12   13
128   2012-02-15   14
97   2012-02-15   15
647   2012-02-17   16
19   2012-02-17   17
482   2012-03-08   18
922   2012-03-09   19
47   2012-03-10   20
431   2012-03-10   21
433   2012-04-07   22
922   2012-04-09   23
895   2012-04-15   24
1993   2012-04-15   25
720   2012-05-09   26
3221   2012-05-09   27
24   2012-05-17   28
773   2012-05-17   29
720   2012-06-20   30
452   2012-06-20   31
102   2012-06-20   32
19   2012-06-20   33
3221   2012-07-24   34
11   2012-07-24   35
176   2012-08-09   36
323   2012-09-24   37
227   2012-09-24   38
64   2012-10-09   39
452   2012-10-09   40
1993   2012-12-09   41
24   2012-12-09   42
895   2012-12-09   43
482   2012-12-09   44
Time taken: 47.808 seconds, Fetched: 44 row(s)

HadoopExamples

Lateral View,Explode(),Partition By,Rank(),Dense_rank() ,Max(),Min(),Count(),Sum(),Avg() ,Cume_Dist(),Row_Number() In Hive

Problem explanation: Lateral View,Explode(),Partition By,Rank(),Dense_rank() ,Max(),Min(),Count(),Sum(),Avg() ,Cume_Dist(),Row_Number() In Hive

No comments:

Post a Comment