Teradata BTEQ utility export in report format (field mode)

Problem: Teradata BTEQ utility export in report format (field mode)

Write a script file:
hyperjcop1:~/samplepro # vi  expscr1.btq
**(write the file and save it)

.logon dbc/dbc,dbc;

.export  report  file = /root/samplepro/dbctablesexp.txt

.set retlimit *
 select * from dbc.tables;

.export reset;
--------------------------------------------------
Run the script file expscr1.btq:

 BTEQ -- Enter your SQL request or BTEQ command:
.run file = /root/samplepro/expscr1.btq
---------------------------------------
Once the export completes, view the output:
hyperjcop1:~/samplepro # cat dbctablesexp.txt

--------------------------------------------------
cli>vi  expscr2.btq

.logon dbc/dbc,dbc; 
database db_dec;

.export  report  file = /root/samplepro/sampleexport.txt
.set retlimit 8

select * from eztest
union all
select * from sampletable2;

.export reset;
--------------------------------------------------
Hint:

a) .set retlimit sets a limit on the number of rows returned by the select statements executed in the script (.set retlimit * removes the limit). In the BTEQ utility, execute .show controls to see the session settings in effect for the utility. Unless you change them to suit your requirement, these settings apply to every statement executed.

b) Once any change has been made to the controls in a script, it also applies to the results of the successive statements executed.

Note: The report format (field mode) prints the results in rows with column headings and the fields aligned in columns.

see sample result

DatabaseName                   TableName                           Version TableKind ProtectionType JournalFlag CreatorName
------------------------------ ------------------------------ ------------ --------- -------------- ----------- --------------------
SYSUDTLIB                      INTPRDTIMETYPE_CONSTR1                    1 H         F              N           DBC
DBC                            ProfileInfoX                              1 V         F              NN          DBC
SYSUDTLIB                      ST_Geometry_Constructor6                  1 H         F              N           DBC
DBC                            SessionInfoVX                             1 V         F              NN          DBC
tdwm                           TDWMResPsWDJoinByNode                     1 M         F              NN          DBC
DBC                            TriggersTbl_TD12                          1 T         F              NN          DBC
TD_SYSFNLIB                    CEILINGBIGINT                             1 F         N              N           DBC
DBC                            FunctionsX                                1 V         F              NN          DBC

 

IMPORT MODES: INDICDATA, DATA, VARTEXT
OUTPUT MODES: DATA, INDICDATA, REPORT, DIF
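
For contrast, here is a minimal sketch of the same dbc.tables export in DATA (record) mode, which writes unformatted records rather than a printable report (logon and path reused from the example above; the .dat file name is illustrative):

.logon dbc/dbc,dbc;

.export data file = /root/samplepro/dbctablesexp.dat

.set retlimit *
select * from dbc.tables;

.export reset;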

Using the BTEQ utility in Teradata to load a .txt file into a table (import)

Problem Statement: Using the BTEQ utility in Teradata to load a .txt file into a table (import)

 ---
 A sample file:
hyperjcop1:~/samplepro # cat test.txt
first,second,lname
20,21,sam
21,22,ram
23,24,siri
25,26,sat
27,28,joe
29,30,jim
31,32,rita
33,34,doe
35,36,cathy
36,37,missy

-----------------------------------------------------
cli>bteq
It will prompt:
 Teradata BTEQ 13.10.00.00 for LINUX. Enter your logon or BTEQ command:
.logon dbc/vmtest
 .logon dbc/vmtest
Password:(enter your password)
 *** Logon successfully completed.
 *** Teradata Database Release is 13.10.00.10
 *** Teradata Database Version is 13.10.00.14
 *** Transaction Semantics are BTET.
 *** Session Character Set Name is 'ASCII'.

 *** Total elapsed time was 1 second.
 BTEQ -- Enter your SQL request or BTEQ command:

.run file = /root/samplepro/btscript1.btq   (the script file you want to execute for the import)
------------------------------------

hyperjcop1:~/samplepro # vi btscript1.btq

*(write the file and save it)

.logon dbc/vmtest,vmtest;

.remark "<script file to load data from test.txt to eztest table in vmtest >"

sel date, time ;
.set width 132;

.import VARTEXT ','  file = /root/samplepro/test.txt

.quiet on ;

.REPEAT *

using d1(varchar(10)), d2(varchar(10)), d3(varchar(20))
insert into vmtest.eztest
values(:d1,:d2,:d3);
------------------------------

Note:
a) The sample file above has integer-format values in its first and second fields.
b) The USING clause declares them as varchar(10) because the import is VARTEXT with a comma delimiter (.import VARTEXT ','), and VARTEXT fields must be received as VARCHAR.
c) But the table eztest has its columns first_field1 and second_field2 defined as integers.
d) The above script runs fine without errors, loading the first and second fields of the sample text (test.txt) into the table as integer values; the implicit conversion takes place:
d1(varchar(10)) ------> first_field1(integer) in eztest table
d2(varchar(10)) ------> second_field2(integer) in eztest table
e) Hint: declaring the fields as integer in the USING clause instead will still run without errors, but loads wrong values (you will not see 20, 21, etc. in the table; the character data is reinterpreted as integers and stored as roughly 10-digit numeric equivalents):
d1(integer) ------> first_field1(integer) in eztest table
d2(integer) ------> second_field2(integer) in eztest table
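
If you want the conversion from VARCHAR to INTEGER to be explicit rather than implicit, a variant of the same insert can cast the fields itself (a sketch, assuming the eztest columns named above):

using d1(varchar(10)), d2(varchar(10)), d3(varchar(20))
insert into vmtest.eztest
values(cast(:d1 as integer), cast(:d2 as integer), :d3);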

Lateral View, Explode(), Partition By, Rank(), Dense_Rank(), Max(), Min(), Count(), Sum(), Avg(), Cume_Dist(), Row_Number() In Hive

Problem explanation: Lateral View, Explode(), Partition By, Rank(), Dense_Rank(), Max(), Min(), Count(), Sum(), Avg(), Cume_Dist(), Row_Number() In Hive

-----------------------
hive> describe sale1;
OK
cust_id                 int                                        
prod_no                 int                                        
qty                     int                                        
datestring              string                                     
sales_id                int                                        
Time taken: 0.106 seconds, Fetched: 5 row(s)
hive> describe cust_temp;
OK
first_name              string                                     
last_name               string                                     
category                string                                     
phone_number            string                                     
customer_id             int                                        
address                 array<string>                              
Time taken: 0.103 seconds, Fetched: 6 row(s)

Explode in Hive (one execution using lateral view, the other with a subquery):

hive>select customer_id as cid, address_det from cust_temp
                                                  lateral view explode(address) mytable1 as address_det ;

hive>select t.cid from
(select concat(customer_id," ",address[0]) as cid from cust_temp  
union all
select concat(customer_id," ",address[1]) as cid from cust_temp) t;

Sample result for the above two:
3783    Mediumcity
3783    67890
3784    Bigcity
3784    12345
3785    township
3785    67890
3786    Beck Drive
3786    12345
Time taken: 14.012 seconds, Fetched: 130 row(s)

 hive>select customer_id,address[0],address[1] from cust_temp;
 -------------------
Max,Min,Count,Sum,Avg:
hive> select avg(qty) from sale1;
hive>select count(qty) from sale1;
hive>select max(qty) from sale1;
hive>select min(qty) from sale1;
hive>select max(qty) ,min(qty) , avg(qty),count(qty) from sale1;

hive>select cust_id, max(qty) ,min(qty) , avg(qty),count(qty) from sale1 
group by cust_id;

hive>select cust_id, min(qty),max(qty) , avg(qty),count(qty),cast(datestring as date) from sale1
group  by   cust_id,cast(datestring as date);

Hint: produces individual results per cust_id and datestring group.
---------------
Using partition by to calculate Max, Min, Count, Sum, Avg:
Note: The table sale1 is not partitioned in any way (you can check the metadata using describe sale1). Still, aggregates over specific fields of the data can be calculated by using the partition by clause of a window function, as shown below.

hive>select cust_id,  datestring,
sum(qty) over (partition by cast(datestring as date) order by cust_id) ,
count(sales_id) over (partition by cast(datestring as date) order by cust_id) ,
avg(qty) over (partition by cast(datestring as date) order by cust_id) from sale1;

Hint: The above statement produces a result based on:
a) partitioning by datestring (cast to date), ordered by cust_id within each date (on a particular date any number of customers can buy any number of products).
b) within each date partition, the aggregates are running values: for each successive customer, the sum, count, and avg take the previous customers' rows into account along with the current one.

sample result:
cust_id    date    running sum(qty)    running count(sales_id)    running avg(qty)
830    2012-01-04    1    1    1.0
64    2012-01-07    1    1    1.0
331    2012-01-08    2    1    2.0
322    2012-01-09    6    1    6.0
587    2012-01-09    7    2    3.5
922    2012-01-09    8    3    2.6666666666666665
7    2012-01-24    1    1    1.0
482    2012-01-24    2    2    1.0
773    2012-01-24    6    3    2.0
64    2012-02-12    4    1    4.0
452    2012-02-12    27    2    13.5
97    2012-02-15    1    1    1.0
128    2012-02-15    5    2    2.5
19    2012-02-17    7    1    7.0
647    2012-02-17    9    2    4.5
482    2012-03-08    1    1    1.0
922    2012-03-09    24    1    24.0
47    2012-03-10    1    1    1.0
431    2012-03-10    2    2    1.0
433    2012-04-07    2    1    2.0
922    2012-04-09    3    1    3.0
895    2012-04-15    7    1    7.0
1993    2012-04-15    10    2    5.0
720    2012-05-09    2    1    2.0
3221    2012-05-09    17    2    8.5
24    2012-05-17    5    1    5.0
773    2012-05-17    15    2    7.5
102    2012-06-20    1    1    1.0
720    2012-06-20    2    2    1.0
11    2012-07-24    2    1    2.0
3221    2012-07-24    3    2    1.5
176    2012-08-09    1    1    1.0
227    2012-09-24    1    1    1.0
323    2012-09-24    3    2    1.5
64    2012-10-09    17    1    17.0
452    2012-10-09    19    2    9.5
895    2012-12-09    31    1    31.0
1993    2012-12-09    33    2    16.5
Time taken: 21.323 seconds, Fetched: 38 row(s)


hive> select datestring, 
sum(qty) over (partition by cast(datestring as date)) , 
count(sales_id) over (partition by cast(datestring as date) ) , 
avg(qty) over  (partition by cast(datestring as date)) from sale1;

Hint: If only partition by cast(datestring as date) is specified (no order by), the final result differs: the aggregations are computed over the whole partition, so every row with the same datestring prints the same aggregate values.
Sample result
OK
2012-01-04    1    1    1.0
2012-01-07    1    1    1.0
2012-01-08    2    1    2.0
2012-01-09    8    3    2.6666666666666665
2012-01-09    8    3    2.6666666666666665
2012-01-09    8    3    2.6666666666666665
2012-01-24    6    3    2.0


---------------------------------------

Rank & Dense Rank 

hive>select a.*,
 rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id;

 (or the below; both give the same result)



hive>select a.*,
 rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;

hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a
ORDER BY a.cust_id; 
         
 (or the below; both give the same result)

hive>select a.*,
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sale1 a;



Hint: dense_rank() ranks cust_id in ascending order in sequence, without any gaps in the rank values.

hive>select a.cust_id,  count(*),
dense_rank() over (ORDER BY a.cust_id ASC) as rankcust from sales a
group BY a.cust_id;

hive>select a.cust_id,  a.datestring,
percent_rank() over (partition by cast(a.datestring as date) order by a.cust_id) as rankcust1,
dense_rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust2,
rank() over (partition by cast(a.datestring as date) order by a.cust_id ) as rankcust0
  from sale1 a ;

The below result is ranked by rank, dense_rank, and percent_rank as follows:
Step 1: partition by date
Step 2: within each date partition, order by cust_id
Step 3: within each ordered partition, assign the ranks accordingly

Sample output:
cust_id    datestring    percent_rank    dense_rank    rank
331    2012-01-08    0.0    1    1
322    2012-01-09    0.0    1    1
322    2012-01-09    0.0    1    1
587    2012-01-09    0.5    2    3
587    2012-01-09    0.5    2    3
922    2012-01-09    1.0    3    5
7    2012-01-24    0.0    1    1
7    2012-01-24    0.0    1    1
11    2012-01-24    0.3333333333333333    2    3
482    2012-01-24    0.5    3    4
482    2012-01-24    0.5    3    4
773    2012-01-24    0.8333333333333334    4    6
3221    2012-01-24    1.0    5    7
64    2012-02-12    0.0    1    1
452    2012-02-12    1.0    2    2


hive>select a.cust_id,  a.datestring,

dense_rank() over ( order by a.cust_id ) as rankcust2,
rank() over (order by a.cust_id ) as rankcust0
  from sale1 a ;

The below result is:
Step 1: order by cust_id
Step 2: assign rank and dense_rank accordingly
Sample output:
7    2012-01-24    1    1
7    2012-01-24    1    1
11    2012-01-24    2    3
11    2012-07-24    2    3
19    2012-06-20    3    5
19    2012-02-17    3    5
24    2012-05-17    4    7
24    2012-12-09    4    7
47    2012-03-10    5    9


__________________

Cume_Dist()
hive>select cust_id,
cume_dist() over (order by qty ) from sale1;

hive>select cust_id,
cume_dist() over (partition by cast(datestring as date) order by cust_id) from sale1 ;
____________________
Row_Number() 

hive>select cust_id,  datestring,
row_number() over (partition by cast(datestring as date) ) from sale1;
Result explanation:
Step 1: partition by datestring.
Step 2: within each datestring partition, row numbers are assigned.
Step 3: for every partition, the row numbering starts again from one.
OK
830    2012-01-04    1
64    2012-01-07    1
331    2012-01-08    1
587    2012-01-09    1
922    2012-01-09    2
322    2012-01-09    3
322    2012-01-09    4
587    2012-01-09    5
773    2012-01-24    1
482    2012-01-24    2
7    2012-01-24    3
7    2012-01-24    4
3221    2012-01-24    5
11    2012-01-24    6
482    2012-01-24    7
produces 51 rows.


hive>select cust_id,  datestring,
row_number() over (order by cust_id ) from sale1;

Explanation:
Step 1: row_number() is assigned in order of cust_id.
Step 2: along with it, cust_id and datestring are displayed.
sample output:
7    2012-01-24    1
7    2012-01-24    2
11    2012-01-24    3
11    2012-07-24    4
19    2012-06-20    5
19    2012-02-17    6
24    2012-05-17    7
24    2012-12-09    8
47    2012-03-10    9
64    2012-10-09    10
64    2012-02-12    11
64    2012-01-07    12
97    2012-02-15    13


hive>select cust_id,  datestring,
row_number() over (partition by cast(datestring as date) ) from sale1 
group by cust_id,  datestring;

Hint: produces 44 rows.
Sample result explanation:
Step 1: partition by datestring.
Step 2: because the query groups by cust_id and datestring, duplicate rows (rows where both fields are the same) are removed.
Step 3: row_number is assigned to each row in each partition (that is why the numbering starts from one again).
OK
830    2012-01-04    1
64    2012-01-07    1
331    2012-01-08    1
322    2012-01-09    1
922    2012-01-09    2
587    2012-01-09    3
773    2012-01-24    1
11    2012-01-24    2
3221    2012-01-24    3
7    2012-01-24    4
482    2012-01-24    5
452    2012-02-12    1
64    2012-02-12    2
128    2012-02-15    1
97    2012-02-15    2
647    2012-02-17    1
19    2012-02-17    2
482    2012-03-08    1
922    2012-03-09    1
47    2012-03-10    1
431    2012-03-10    2
433    2012-04-07    1
922    2012-04-09    1
895    2012-04-15    1
1993    2012-04-15    2
720    2012-05-09    1
3221    2012-05-09    2
24    2012-05-17    1
773    2012-05-17    2
720    2012-06-20    1
452    2012-06-20    2
102    2012-06-20    3
19    2012-06-20    4
3221    2012-07-24    1
11    2012-07-24    2
176    2012-08-09    1
323    2012-09-24    1
227    2012-09-24    2
64    2012-10-09    1
452    2012-10-09    2
1993    2012-12-09    1
24    2012-12-09    2
895    2012-12-09    3
482    2012-12-09    4
Time taken: 47.869 seconds, Fetched: 44 row(s)


hive>select cust_id,  datestring,
row_number() over (order by cast(datestring as date)) from sale1
 group by cust_id,  datestring;

Result explanation:
Step 1: order by datestring.
Step 2: because the query groups by cust_id and datestring, duplicate rows (rows where both fields are the same) are removed.
Step 3: row_number is assigned to each row across the whole result.
OK
830    2012-01-04    1
64    2012-01-07    2
331    2012-01-08    3
322    2012-01-09    4
922    2012-01-09    5
587    2012-01-09    6
773    2012-01-24    7
11    2012-01-24    8
3221    2012-01-24    9
7    2012-01-24    10
482    2012-01-24    11
452    2012-02-12    12
64    2012-02-12    13
128    2012-02-15    14
97    2012-02-15    15
647    2012-02-17    16
19    2012-02-17    17
482    2012-03-08    18
922    2012-03-09    19
47    2012-03-10    20
431    2012-03-10    21
433    2012-04-07    22
922    2012-04-09    23
895    2012-04-15    24
1993    2012-04-15    25
720    2012-05-09    26
3221    2012-05-09    27
24    2012-05-17    28
773    2012-05-17    29
720    2012-06-20    30
452    2012-06-20    31
102    2012-06-20    32
19    2012-06-20    33
3221    2012-07-24    34
11    2012-07-24    35
176    2012-08-09    36
323    2012-09-24    37
227    2012-09-24    38
64    2012-10-09    39
452    2012-10-09    40
1993    2012-12-09    41
24    2012-12-09    42
895    2012-12-09    43
482    2012-12-09    44

Time taken: 47.808 seconds, Fetched: 44 row(s)

Hive Script Execution/hive -e Option/Passing Variables To a Hive Script

Problem Explanation: Hive Script Execution/hive -e Option/Passing Variables To a Hive Script

 -----------------------
 Any number of Hive commands can be stored in batch in a Hive script file, with a file name carrying a .hive, .sql, or .q extension.

Create a file named  myscript.sql
----start of myscript.sql
select cast(t3.stdt as date) from 
(select concat(st_dt[2],'-',st_dt[0],'-',st_dt[1]) stdt from customersales.emp_det)t3;
---end of myscript.sql

Execute myscript.sql

$HIVE_HOME/bin>hive -S -f myscript.sql   (with the silent option)
-------
Contents of the file myscript1.q:

select cast(t3.stdt as date) from (select concat(st_dt[2],'-',st_dt[0],'-',st_dt[1]) stdt
from
${hiveconf:my_var1})t3;

select concat(st_dt[2],'-',st_dt[0],'-',st_dt[1]) from ${hiveconf:my_var1} limit 1;

----end of file myscript1.q

Execute myscript1.q

$HIVE_HOME/bin> hive -S -hiveconf my_var1=customersales.emp_det -f /home/hadoop/Desktop/myscript1.q

---------
Execute with -e option

$HIVE_HOME/bin>hive  -S -e  "select cast(t3.stdt as date) from (select concat(st_dt[2],'-',st_dt[0],'-',st_dt[1]) stdt from customersales.emp_det)t3 ;"
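
Hive also supports the --hivevar option (Hive 0.8 and later) for passing variables; inside the script the reference then becomes ${hivevar:my_var1}. A sketch equivalent to the -hiveconf run above:

$HIVE_HOME/bin>hive -S --hivevar my_var1=customersales.emp_det -f /home/hadoop/Desktop/myscript1.q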


Loading a Flat File & Converting a Date Variable In a Flat File Into a Hive Date Format Field In a Table.

 Loading a Flat File & Converting a Date Variable In a Flat File Into a Hive Date Format Field In a Table. (Hands on explanation)

--------------------------
Step 1:Create a Stage table

hive>create table emp_stage
(fn string , ln string,cur_wk string, prv_wk string, stdt Array<string>,enddt Array<string>,eid bigint)
 row format delimited
 fields terminated by ','
collection items terminated by '/'
 stored as textfile;

Note: Sample of flat file data
First_Name, Last_Name, Currently_working, Previously_working, Start_date, End_date, Employee_id
John,Wade,Washigton , ,1/8/2014,,2938

Franklin,Josephs,Detriot,washington ,11/12/2013,2/12/2015,2913

Elizabeth,Reynolds,Cleveland, ,4/30/2011,,2891

Arthur,Gibbons,fortwayne, ,6/28/2015, ,400

Franklin,Josephs,washington, ,2/12/2015, ,2913

But the Hive date format is YYYY-MM-DD.

Step 2:
hive>Load data inpath
 '/user/hadoop/Emp_details.csv'
 into table emp_stage;

Step 3: Insert overwrite directory
'/user/hadoop/date_temp'
select fn , ln ,cur_wk ,  prv_wk,
 concat(stdt[2],'-',stdt[0],'-',stdt[1]),
 concat(enddt[2],'-',enddt[0],'-',enddt[1]),
 eid from emp_stage;

Step 4: create table emp_det
(fn string , ln string,cur_wk string, prv_wk string, stdt date,enddt date,eid bigint)
row format delimited
 fields terminated by '\001'
collection items terminated by '-'
stored as textfile;

Step 5: load data inpath
 '/user/hadoop/date_temp'
 into table emp_det;

Step 6: To display in Hive date format in the Hive CLI:


select cast(t3.st_dt as date), cast(t3.end_dt as date) from
(select concat(stdt[2],'-',stdt[0],'-',stdt[1]) st_dt,
concat(enddt[2],'-',enddt[0],'-',enddt[1]) end_dt from emp_stage)t3;
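
An alternative to splitting the date into an array is to parse the M/D/YYYY string directly with Hive's built-in date functions. A sketch, assuming the start date were loaded as a plain string column st_dt_str (a hypothetical column name, not part of the emp_stage definition above):

hive>select cast(from_unixtime(unix_timestamp(st_dt_str, 'M/d/yyyy'), 'yyyy-MM-dd') as date)
from emp_stage;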

Hive Insert For New Records From a Staging Table Into an Existing Hive Table (For SCD Type-2)

 Hive Insert For New Records From a Staging Table Into an Existing Hive Table (For SCD Type-2) (hands on explanation)

-----------------------------------------------
Step 1:
In an SCD Type-2 implementation using a date-range table, new records from an end-of-day flat file must be inserted into an existing Hive table without overwriting it. Load the entire end-of-day flat file into a stage table, then insert into the existing Hive table by comparing the two.

hive>
create table employee_stage
(fn string , ln string, current_workingcity string, previously_workedcity string, start_date Array<string>,end_date  Array<string>, employee_id bigint)
 row format delimited
 fields terminated by ','
collection items terminated by '/'
 stored as textfile;

hive>load data inpath
'/user/hadoop/employee_day12march2014'
 into table employee_stage;

Hive supports only equi-joins. If the stage table has 100 new records out of 1000 that need to be inserted into the existing Hive table "employee_history" (which is SCD Type-2 with a date range) without any errors, execute:

hive> insert overwrite  directory
'/user/hadoop/newinsert=day1march2014'
select c.employee_id from employee_stage c where c.employee_id
 NOT IN
(select e.employee_id from employee_history e);
Note: this gives you a flat file containing the new employee_ids that will be inserted into employee_history.

hive>
insert into table employee_history
 select c.* from employee_stage c where c.employee_id 
NOT IN
 (select e.employee_id from employee_history e); 
Hint: make sure the employee_id column in the employee_stage table is not null.
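
On Hive versions where NOT IN subqueries are not supported in the WHERE clause, the same insert can be sketched with a left outer join against the existing table (same tables as above):

hive>
insert into table employee_history
select c.* from employee_stage c
left outer join employee_history e
on (c.employee_id = e.employee_id)
where e.employee_id is null;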

CTAS With Three-Table Joins In Two Different Databases In a Subquery

 CTAS With Three-Table Joins In Two Different Databases In a Subquery (hands on explanation)

--------
hive> describe default.pd_det;
OK
dept                    string                                    
cost                    int                                       
id                      int                                       
asmb_city               string                                    
asmb_ct                 string                                    
retail                  int                                       
country                 string                                    
         
# Partition Information         
# col_name                data_type               comment           
         
country                 string 

------
hive> describe default.item_avgprofit_byctry ;
OK
avgprf                  int                                        
makein                  string                                     
itemshipped             int                                        
Time taken: 2.596 seconds, Fetched: 3 row(s)
----
hive> describe customersales.product_part ;
OK
prod_name               string                                     
description             varchar(20)                                
qty_onhand              int                                        
prod_num                bigint                                     
packaged_with           array<string>                              
category                string                                     
        
# Partition Information        
# col_name                data_type               comment            
        
category                string                                     
Time taken: 0.185 seconds, Fetched: 11 row(s)
-----
Step 2:  A join between three tables:
table 1:
default.pd_det a (join column: a.country)
table 2:
default.item_avgprofit_byctry b (join column: b.makein)
table 3:
  customersales.product_part c (join column: c.prod_num with a.id)
______________________
Step 3: CTAS

create table threejoin_ctastwodifferentdb stored as textfile as

select t3.cost,t3.pdid,t3.madein,t3.pdname from
 ( 
select a.cost as cost, a.id as pdid, b.makein as madein, c.prod_name as pdname from default.pd_det a
join default.item_avgprofit_byctry b on (a.country = b.makein)
 join  customersales.product_part c on (a.id = c.prod_num)
 )t3;

------
hive> describe threejoin_ctastwodifferentdb;
OK
cost                    int                                        
pdid                    int                                        
madein                  string                                     
pdname                  string                                     
Time taken: 1.17 seconds, Fetched: 4 row(s)

CTAS has these restrictions (so the table we created in the above example, "threejoin_ctastwodifferentdb", cannot be any of the following):

    The target table cannot be a partitioned table.
    The target table cannot be an external table.
    The target table cannot be a list bucketing table.
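
If a partitioned copy of the result is needed, one workaround is to create the target table first and then load it with a dynamic-partition insert instead of CTAS. A sketch (table name, columns, and settings are illustrative):

hive>set hive.exec.dynamic.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive>create table threejoin_part (cost int, pdid int, pdname string)
partitioned by (madein string)
stored as textfile;
hive>insert overwrite table threejoin_part partition (madein)
select cost, pdid, pdname, madein from threejoin_ctastwodifferentdb;

Note: the dynamic partition column (madein) must come last in the select list.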

Create an External Hive Partitioned Table.

 Create an External Hive Partitioned Table. (hands on explanation)

------------------------
Step 1: Create an external table. The partitions are the subdirectories of the HDFS location /user/hadoop/country/. Under the directory country, each subdirectory is a partition from the point of view of the table definition, as shown below.

hive>create EXTERNAL table  pd_det
(dept string,  cost int, id int,asmb_city string,asmb_ct string, retail int)
PARTITIONED BY (country string)
 row format delimited
 fields terminated by '\001'
 stored as textfile
location '/user/hadoop/country/';

Step 2: Whenever a new directory is added in HDFS under /user/hadoop/country/, an alter table ... add partition command must be executed.
a) For example, if a new directory named in the format 'country=usa' is added under hdfs://localhost:8020/user/hadoop/country,
then it means you are adding a new partition to table pd_det with the partition value country='usa'.
The data files present in hdfs://localhost:8020/user/hadoop/country/country=usa will be read into the table.
b) For the table pd_det to recognize the partition, you must execute an alter table command as shown below.

hive>Alter table pd_det add partition(country ='usa');

Step 3: To create a data file inside a directory from a table present in Hive, execute:

hive>insert overwrite  directory
 '/user/hadoop/country/country=india'
select
 dept,cost, id ,asmb_city,asmb_ct,retail
 from pd_temp
where make = 'india';

Step 4: If the folder name differs from the country=value format above, specify the location explicitly in the alter statement:

 hive>alter table pd_det
add if not exists partition(country='mexico')
location '/user/hadoop/country/mexico' ;
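
If the directories under /user/hadoop/country/ already follow the country=value naming convention, recent Hive versions can also discover all such partitions in one step instead of one alter table per directory (a sketch):

hive>msck repair table pd_det;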

Pig running in local mode, accessing files from HDFS and storing into HDFS

 Pig running in local mode, accessing files from HDFS and storing into HDFS (problem explanation)

----------------------
Step 1: Even when running Pig in local mode (pig -x local), Hadoop can still read distributed file system files, and the results can be stored in HDFS, by defining absolute paths with a scheme and an authority.

salrec = load 'hdfs://localhost:8020/user/hadoop/Sales.csv' using PigStorage(',')
 AS
(custid:long, prod_id:long, qty_pur:int, pur_date:datetime, sale_id:long);

custrec = load 'hdfs://localhost:8020/user/hadoop/Customer.csv' using PigStorage(',')
AS
(fn:chararray, ln:chararray, status:chararray, ph:chararray, custid1:long, add:chararray);

pd_rec = load  'hdfs://localhost:8020/user/hadoop/Product.csv' using PigStorage(',')
AS (pdname:chararray, pddesc:chararray, pdcat:chararray, pdgty:long, pdid:long, pkwith:chararray);

crs_rec0 = join salrec by custid, custrec by custid1;

crs_rec = join crs_rec0 by prod_id, pd_rec by pdid;

crs_rec1 = group crs_rec by (custid, prod_id);

x = foreach crs_rec1 generate  group, flatten(crs_rec.(custid, sale_id, pdname, qty_pur, fn));

store x into 'hdfs://localhost:8020/user/hadoop/piglocaltohdfs2' USING PigStorage(',');

Sample output in HDFS:
(7,98243),7,34842,09K Video,1,Allen
(11,77623),11,34843,J Case 1500,2,John
(19,88734),19,34857,DVD J INT,7,Hubert
(24,45641),24,34856,500 GB HD T,5,Roger
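
To run the above statements as a script file instead of typing them at the Grunt shell, save them (the file name here is hypothetical) and invoke Pig in local mode:

pig -x local piglocaltohdfs.pig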

Subqueries in Hive

 Subqueries in Hive (hands on explanation)

-----------------------------
USING IN CLAUSE
hive>select first_name , customer_id  from customers_temp
 where customer_id
in (select cust_id from sales);

USING EXISTS

hive>select first_name,customer_id from customers_temp
where exists
(select cust_id from sales where customers_temp.customer_id = sales.cust_id and sales.qty >1 );

IN FROM CLAUSE

hive>from
(select city, avg(temp) avgtemp from citymonthlytemp group by city) t
select max(t.avgtemp) maxtemperature;


Hint: displays the maximum of the per-city average temperatures.

hive (customersales_stg)> describe product;    
OK
column_name    data_type    comment
product_name               string                                     
description             varchar(20)                                
category                string                                     
quantity_onhand         int                                        
product_id                bigint                                     
packaged_with           array<string>                              
product_price            decimal(4,2)                               

describe  sales;
column_name        data_type    comment
customer_id             int                                        
productid_insales       int                                        
quantity_brought        int                                        
date_sold            string                                     
sales_id                int                                        

select t.custid, sum(t.price) from
                     (select s.productid_insales pdno, s.customer_id custid,
                     p.product_price price from sales s join product p
                     on s.productid_insales = p.product_id) t group by t.custid;


or the below (both produce the same result):

from (select s.productid_insales pdno, s.customer_id custid, p.product_price price from sales s join product p
on s.productid_insales = p.product_id) t select t.custid as customerid, sum(t.price) total_purchase_amount group by t.custid;
 

Hive Left Semi Join / IN clause with subquery

 Hive Left Semi Join / IN clause with subquery (hands on explanation)

______________
Step1:
hive> describe sales;
OK
cust_id                 int                                        
prod_no                 int                                        
qty                     int                                        
datestring              string                                     
sales_id                int                                        
Time taken: 0.104 seconds, Fetched: 5 row(s)
hive> describe customer_temp;
FAILED: SemanticException [Error 10001]: Table not found customer_temp
hive> describe customers_temp;
OK
first_name              string                                     
last_name               string                                     
category                string                                     
phone_number            string                                     
customer_id             int                                        
address                 array<string>                              
Time taken: 0.097 seconds, Fetched: 6 row(s)

Step 2: Using a subquery with the IN clause
select c.first_name , c.phone_number from customers_temp c where c.customer_id in (select cust_id from sales);

Step 3: Using left semi join
select c.first_name , c.phone_number from customers_temp c left semi join sales s on c.customer_id  =s.cust_id ;

Hint: Both queries return the same result.
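
One caveat: in a left semi join the right-hand table may only be referenced in the ON clause, so a select list like the sketch below fails, while the equivalent regular join would accept it:

-- invalid: columns of sales cannot appear in the select list of a left semi join
hive>select c.first_name, s.qty from customers_temp c left semi join sales s on c.customer_id = s.cust_id;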

A Join in Hive between three tables

 A Join in Hive between three tables(hands on explanation)

----------------------------------
 Step 1:
hive> describe sales;
OK
cust_id                 int                                        
prod_no                 int                                        
qty                     int                                        
datestring              string                                     
sales_id                int                                        
Time taken: 0.104 seconds, Fetched: 5 row(s)
hive> describe customer_temp;
FAILED: SemanticException [Error 10001]: Table not found customer_temp
hive> describe customers_temp;
OK
first_name              string                                     
last_name               string                                     
category                string                                     
phone_number            string                                     
customer_id             int                                        
address                 array<string>                              
Time taken: 0.097 seconds, Fetched: 6 row(s)
hive> describe salescust_proddet;
OK
cid                     int                                        
sid                     int                                        
pno                     bigint                                     
Time taken: 0.138 seconds, Fetched: 3 row(s)

Step2:
hive>select /*+streamtable(spd)*/ s.*, c.*, spd.* from sales s full outer join customers_temp c on (s.cust_id = c.customer_id) left outer join salescust_proddet spd on (spd.cid =c.customer_id);

Hint: 88 rows with NULLs. Sample output:
64    92387    4    2012-01-24    34847    Mello    Reynolds    A    905-245-4431    64    ["Mediumcity","67890"]    64    34826    98243
64    92387    4    2012-01-24    34847    Mello    Reynolds    A    905-245-4431    64    ["Mediumcity","67890"]    64    34833    77624
NULL    NULL    NULL    NULL    NULL    Elizabeth    Metzer    A    416-322-9001    85    ["Bigcity","12345"]    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    Jovan    Melcic    A    905-324-4456    92    ["Mediumcity","67890"]    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    Erick    Jansen    A    905-498-8211    93    ["Mediumcity","67890"]    NULL    NULL    NULL


Step3:
hive>select /*+streamtable(spd)*/ s.*, c.*, spd.* from sales s full outer join customers_temp c on (s.cust_id = c.customer_id) right outer join salescust_proddet spd on (spd.cid =c.customer_id);

Hint: 64 rows, without NULLs.

Update And Delete For Version 0.14 of Hive

 Update And Delete For  Version 0.14 of Hive (hands on explanation).

 -------------------------
Step 1: Run a command to see which version you are running.

Run this at the CLI:
cli>hive --version   (will print the version)
or
cli>ls /usr/lib/hive*/lib | grep metastore


Step 2: Set hive.optimize.sort.dynamic.partition=false;

Step 3: Run the command
hive>update customers_temp SET category="A" where first_name="Lisa" AND last_name="Scott";

Hint: Subqueries are not supported; partition columns cannot be updated; bucketing columns cannot be updated.

Arithmetic operators, UDFs, casts, literals, etc. are supported. If a WHERE clause is defined, then only the rows that match the WHERE clause will be updated.

------------------------
hive> delete from customers_temp where customer_id = 256;

------------------------
Major and minor compactions must be configured. HDFS data adheres to write-once, read-many, so any changes to a table are stored as delta files in HDFS; compaction later merges the delta files with the base files.

Some of the properties to work with:

Configure the following Hive properties to enable transactions:

    hive.txn.manager

    hive.compactor.initiator.on

    hive.compactor.worker.threads
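
A minimal sketch of session-level values for these (illustrative; normally they are set in hive-site.xml, and hive.support.concurrency must also be enabled):

hive>set hive.support.concurrency=true;
hive>set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive>set hive.compactor.initiator.on=true;
hive>set hive.compactor.worker.threads=1;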
-------------------------------------------
If you experience problems while enabling Hive transactions, check the Hive log file at /tmp/hive/hive.log.

Performing Manual Compactions

Hive administrators use the ALTER TABLE DDL command to queue requests that compact base and delta files for a table or partition:

ALTER TABLE tablename [PARTITION (partition_key='partition_value' [,...])] COMPACT 'compaction_type'
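
For example, against the customers_temp table used above (compaction_type is 'major' or 'minor'):

hive>alter table customers_temp compact 'minor';
hive>alter table customers_temp compact 'major';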

Use the SHOW COMPACTIONS command to monitor the progress of the compactions:
hive> SHOW COMPACTIONS;

Note:

ALTER TABLE will compact tables even if the NO_AUTO_COMPACTION table property is set.

hive>show transactions;
 

hive>show locks;

Lock state:

    Acquired - transaction initiator hold the lock

    Waiting - transaction initiator is waiting for the lock

    Aborted - the lock has timed out but has not yet been cleaned

Lock type:

    Exclusive - the lock may not be shared

    Shared_read - the lock may be shared with any number of other shared_read locks

    Shared_write - the lock may be shared by any number of other shared_read locks but not with other shared_write locks
------------------------------------

hive> SHOW COMPACTIONS;

Compaction state:

    Initiated - waiting in queue

    Working - currently compacting

    Ready for cleaning - compaction completed and old files scheduled for removal

hive> SHOW TRANSACTIONS;

    Transaction ID

    Transaction state

    Hive user who initiated the transaction

    Host machine where transaction was initiated
-------------------------------------------
Transaction Limitations

HDP currently has the following limitations for ACID-based transactions in Hive:

    The BEGIN, COMMIT, and ROLLBACK SQL statements are not yet supported. All operations are automatically committed as transactions.

    The user initiating the Hive session must have write permission for the destination partition or table.

    Zookeeper and in-memory locks are not compatible with transactions.

    Only ORC files are supported.

    Destination tables must be bucketed and not sorted.

    Snapshot-level isolation, similar to READ COMMITTED. A query is provided with a consistent snapshot of the data during execution.
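
Putting the ORC and bucketing requirements together, here is a sketch of a table definition that qualifies for ACID operations (names and bucket count are illustrative):

hive>create table customers_acid
(first_name string, last_name string, category string, customer_id int)
clustered by (customer_id) into 4 buckets
stored as orc
tblproperties('transactional'='true');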
-----------------------------------------------------
To insert values into a Hive table:
hive>
insert into table product values ('Samsung Galaxy Mini','samsung galaxy mini version no 345','phones',3);

------------------
Metastore MySQL installation for Hive:
Directory where MySQL is installed:
/usr/share/doc/mysql-server-5.1.73

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system

PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
/usr/bin/mysqladmin -u root -h localhost.localdomain password 'new-password'
----------------------------

A few more configurations to look at:
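
On the Hive side, the usual metastore connection properties go in hive-site.xml; a sketch (URL, user name, and password are illustrative):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
</property>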