Subqueries in Hive

 Subqueries in Hive (hands-on explanation)

-----------------------------
USING IN CLAUSE
hive>select first_name , customer_id  from customers_temp
 where customer_id
in (select cust_id from sales);

USING EXISTS

hive>select first_name,customer_id from customers_temp
where exists
(select cust_id from sales where customers_temp.customer_id = sales.cust_id and sales.qty >1 );

USING A SUBQUERY IN THE FROM CLAUSE

hive> from
(select city, avg(temp) avgtemp from citymonthlytemp group by city) t
select max(t.avgtemp) maxtemperature;


Hint: displaying the maximum temperature .

hive (customersales_stg)> describe product;    
OK
column_name    data_type    comment
product_name               string                                     
description             varchar(20)                                
category                string                                     
quantity_onhand         int                                        
product_id                bigint                                     
packaged_with           array<string>                              
product_price            decimal(4,2)                               

describe  sales;
column_name        data_type    comment
customer_id             int                                        
productid_insales       int                                        
quantity_brought        int                                        
date_sold            string                                     
sales_id                int                                        

select t.custid, sum(t.price) from
                     (select  s.productid_insales  pdno, s.customer_id  custid,
                     p.product_price  price from sales s join product p
                     on s.productid_insales = p.product_id ) t   group by t.custid;


or below (both produce the same results)

from (select s.productid_insales  pdno, s.customer_id  custid,  p.product_price  price from sales s join product p
on s.productid_insales = p.product_id ) t  select t.custid as customerid , sum(t.price) total_purchase_amount group by t.custid;
 

Hive Left Semi Join / IN clause with subquery

 Hive Left Semi Join / IN clause with sub query(hands on explanation)

______________
Step1:
hive> describe sales;
OK
cust_id                 int                                        
prod_no                 int                                        
qty                     int                                        
datestring              string                                     
sales_id                int                                        
Time taken: 0.104 seconds, Fetched: 5 row(s)
hive> describe customer_temp;
FAILED: SemanticException [Error 10001]: Table not found customer_temp
hive> describe customers_temp;
OK
first_name              string                                     
last_name               string                                     
category                string                                     
phone_number            string                                     
customer_id             int                                        
address                 array<string>                              
Time taken: 0.097 seconds, Fetched: 6 row(s)

Step 2: Using a subquery with the IN clause
select c.first_name , c.phone_number from customers_temp c where c.customer_id in (select cust_id from sales);

Step 3: Using left semi join
select c.first_name , c.phone_number from customers_temp c left semi join sales s on c.customer_id  =s.cust_id ;

Hint: Both results will be same.

A Join in Hive between three tables

 A Join in Hive between three tables(hands on explanation)

----------------------------------
 Step 1:
hive> describe sales;
OK
cust_id                 int                                        
prod_no                 int                                        
qty                     int                                        
datestring              string                                     
sales_id                int                                        
Time taken: 0.104 seconds, Fetched: 5 row(s)
hive> describe customer_temp;
FAILED: SemanticException [Error 10001]: Table not found customer_temp
hive> describe customers_temp;
OK
first_name              string                                     
last_name               string                                     
category                string                                     
phone_number            string                                     
customer_id             int                                        
address                 array<string>                              
Time taken: 0.097 seconds, Fetched: 6 row(s)
hive> describe salescust_proddet;
OK
cid                     int                                        
sid                     int                                        
pno                     bigint                                     
Time taken: 0.138 seconds, Fetched: 3 row(s)

Step2:
hive>select /*+streamtable(spd)*/ s.*, c.*, spd.* from sales s full outer join customers_temp c on (s.cust_id = c.customer_id) left outer join salescust_proddet spd on (spd.cid =c.customer_id);

Hint:88 rows with Nulls,Sample output(64    92387    4    2012-01-24    34847    Mello    Reynolds    A    905-245-4431    64    ["Mediumcity","67890"]    64    34826    98243
64    92387    4    2012-01-24    34847    Mello    Reynolds    A    905-245-4431    64    ["Mediumcity","67890"]    64    34833    77624
NULL    NULL    NULL    NULL    NULL    Elizabeth    Metzer    A    416-322-9001    85    ["Bigcity","12345"]    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    Jovan    Melcic    A    905-324-4456    92    ["Mediumcity","67890"]    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    Erick    Jansen    A    905-498-8211    93    ["Mediumcity","67890"]    NULL    NULL    NULL)


Step3:
hive>select /*+streamtable(spd)*/ s.*, c.*, spd.* from sales s full outer join customers_temp c on (s.cust_id = c.customer_id) right outer join salescust_proddet spd on (spd.cid =c.customer_id);

Hint:64 rows without nulls

Update And Delete For Version 0.14 of Hive

 Update And Delete For  Version 0.14 of Hive (hands on explanation).

 -------------------------
step 1: run a command to see which version you are running

Run this at the CLI
cli>hive --version (will give a result)
or
cli>ls /usr/lib/hive*/lib | grep metastore


Step 2:Set hive.optimize.sort.dynamic.partition=false;

Step 3: Run the command
hive>update customers_temp SET category="A" where first_name="Lisa" AND last_name="Scott";

Hint: Sub queries are not supported, Partition columns cannot be updated, Bucketing Columns cannot be updated

Arithmetic operators ,UDF's ,casts,literals , etc are supported. If a WHERE clause is defined then only rows that match the WHERE clause will be updated

___-
hive> delete from customers_temp where customer_id = 256 ;

------------------------
Major and Minor compaction  must be specified. Any changes to the table is stored as delta files in HDFS. Hdfs data adheres to  is write once and read many. So any changes to data  are stored as delta files. 

some of the properties to work with. 

Configure the following Hive properties to enable transactions:

    hive.txn.manager

    hive.compactor.initiator.on

    hive.compactor.worker.threads
-------------------------------------------
If you experience problems while enabling Hive transactions, check the Hive log file at /tmp/hive/hive.log.

Performing Manual Compactions

Hive administrators use the ALTER TABLE DDL command to queue requests that compact base and delta files for a table or partition:

ALTER TABLE tablename [PARTITION (partition_key='partition_value' [,...])] COMPACT 'compaction_type'

Use the SHOW COMPACTIONS command to monitor the progress of the compactions:
hive >SHOW COMPACTIONS

Note:

ALTER TABLE will compact tables even if the NO_AUTO_COMPACTION table property is set.

hive>show transactions;
 

hive>show locks;

Lock state:

    Acquired - transaction initiator hold the lock

    Waiting - transaction initiator is waiting for the lock

    Aborted - the lock has timed out but has not yet been cleaned

Lock type:

    Exclusive - the lock may not be shared

    Shared_read - the lock may be shared with any number of other shared_read locks

    Shared_write - the lock may be shared by any number of other shared_read locks but not with other shared_write locks
------------------------------------

hive > SHOW COMPACTIONS

Compaction state:

    Initiated - waiting in queue

    Working - currently compacting

    Ready for cleaning - compaction completed and old files scheduled for removal

hive >SHOW TRANSACTIONS

    Transaction ID

    Transaction state

    Hive user who initiated the transaction

    Host machine where transaction was initiated
-------------------------------------------
Transaction Limitations

HDP currently has the following limitations for ACID-based transactions in Hive:

    The BEGIN, COMMIT, and ROLLBACK SQL statements are not yet supported. All operations are automatically committed as transactions.

    The user initiating the Hive session must have write permission for the destination partition or table.

    Zookeeper and in-memory locks are not compatible with transactions.

    Only ORC files are supported.

    Destination tables must be bucketed and not sorted.

    Snapshot-level isolation, similar to READ COMMITTED. A query is provided with a consistent snapshot of the data during execution.
-----------------------------------------------------
to insert into hive:
hive>
insert into table product values ("Samsung Galaxy Mini","samsung galaxy mini version no 345","phones",3);
(note: string literals must be quoted, and the VALUES list must supply a value for every column of the table, in order)

------------------
Metastore Mysql installation in hive:
directory where mysql is installed. 
/usr/share/doc/mysql-server-5.1.73

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system

PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
/usr/bin/mysqladmin -u root -h localhost.localdomain password 'new-password'
----------------------------

Few more configurations to peek



Loading a tar file GZIP or BZIP2 into Hive table/Using CTAS/Like


 Loading a tar file GZIP or BZIP2 into Hive table/Using CTAS/Like(Hands on explanation)

------------------------------
Step 1: Set a few Hive properties

Set hive.exec.compress.output = true
Set io.seqfile.compression.type = block

Step 2:- To zip a file to GZIP

tar -cvzf      /home/hadoop/Desktop/customer.tar.gz        /home/hadoop/Desktop/Customer.csv

Step3:
create table customer_gz
 (fn string, ln string, cat string, ph string, cid int,add array<string>)
 row format delimited
 fields terminated by ','
collection items terminated by ','
 stored as textfile;

step4: load data local inpath
'/home/hadoop/Desktop/customer.tar.gz'
into table customer_gz;

step5: create table customer_gz_seq 
stored as sequencefile as
 select * from customer_gz ;

step6: select * from customer_gz_seq ;

Hint: textfiles compressed with GZIP or BZIP2 are not splittable in a Hadoop environment, so they do not utilize the parallel processing power of the Hadoop cluster. So it's better to load the data into a sequence file table

-----
Just to copy a table definition without any data, create a table as shown below.

hive>create table customer_gz_seq_bckup LIKE customer_gz_seq;

hint: you cannot specify any additional clauses between LIKE and the new table name.




Storing a local CSV file into table of ORC Format;

 Storing a local CSV file into table of ORC Format(hands on explanation)

--------------------------------------------------------
Step 1:
Create a stage table of  storage format textfile

hive>create table product_stage
(prod_name string, description varchar(20),category string,qty_onhand int, prod_num bigint, packaged_with Array<string> )
row format delimited
 fields terminated by ','
 collection items terminated by ':'
stored as textfile;

Step 2: load data into table stage

hive>Load data local inpath '/home/hadoop/Desktop/Product.csv'
 overwrite into table product_stage ;

Step3 :
Write a CTAS command

hive>Create table product_orc
stored as orc as
select * from product_stage;

Hint: When loading data from a CSV file into an ORC-format table using a Load statement, it doesn't produce any error, but executing a separate statement (select * from table_name) will produce an error

" Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://localhost:8020/user/hive/warehouse/customersales.db/product_stage_orc/Product.csv. Invalid postscript".

 

Loading a Partitioned Hive Table with Cluster clause, sort

 Loading a Partitioned Hive Table with Cluster clause, sort (hands-on explanation)

--------------------------------------------
 Step 1 : create a stage table to load data into table  directly from storage area.
hive> create table product_stage (prod_name string, description varchar(20),category string,qty_onhand int, prod_num bigint, packaged_with Array<string> )
 row format delimited
 fields terminated by ','
 collection items terminated by ':' 
stored as textfile;

step 2:create a partitioned table
>create table product_part (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
partitioned by (category string)
 row format delimited
 fields terminated by ','
collection items terminated by ':'
 stored as textfile;

step 3: set hive.exec.dynamic.partition.mode=nonstrict (to dynamically load partition data without specifying it statically like :partition category='ram')
when loading partition dynamically the column should be last specified in the select clause.

step4: to load data
hive>from product_stage 
insert overwrite table product_part partition (category) 
select  prod_name, description,qty_onhand,prod_num, packaged_with,category;
Hint:one part file for each partition

Additional :
hive>create table product_pcsb (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
partitioned by (category string)
 CLUSTERED BY(prod_num) SORTED BY(qty_onhand ASC) into 2 buckets
 row format delimited
 fields terminated by ','
collection items terminated by ':'
 stored as textfile;
Hint: two part files will be produced for each partition with sorted on qty_onhand in each part file.

hive>create table product_pc3b (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
 partitioned by (category string)
 CLUSTERED BY(prod_num) into 3 buckets
 row format delimited
 fields terminated by ','
collection items terminated by ':'
stored as textfile;
Hint:Three part file will be produced for each partition

Data file
PROD_NAME    DESCRIPTION    CATEGORY    QTY_ON_HAND    PROD_NUM    PACKAGED_WITH
2 GB Memory E    2 GB Memory ECC    Ram    3000    87655    manual:heatsink
4 GB Memory E    4 GB Memory ECC    Ram    1000    87659    manual:heatsink
16 GB Memory E    16 GB Memory ECC    Ram    238    87634    manual
500 GB HD J    500 GB HD Panther Brand    HD    200    45628    atacable:manual
500 GB HD T    500 GB HD Tiger Brand    HD    498    45641    satacable:manual
1 TB HD J    1 TB HD Jargon Brand    HD    231    45691   
4 Core CPU J3    4 Core CPU Jargon Brand 3 GHZ     CPU    50    98820    thermalpaste:heatsink:manual
2 Core CPU J2    2 Core CPU Jargon Brand 2  GHZ     CPU    118    98838    thermalpaste:heatsink
1 Core CPU J2    1 Core CPU Jargon Brand 2  GHZ     CPU    203    98792    thermalpaste:heatsink
94F991 MB    Motherboard F991 CPU    MB    19    282299   
94G822 MB    Motherboard G822 CPU    MB    30    282109   
93H772 MB    Motherboard H772 CPU    MB    15    282009    cables:screws
93G Video    Video Card Jargon Brand 93G    Video    80    99202    dvd:manual:game
84G1 Video    Video Card Jargon Brand 84F1    Video    14    99207    dvd:manual:hdmicable
09K Video    Video Card Tiger Brand 84F1    Video    5    98243    manual:game
J Case 1500    Computer Case Jargon Brand Style 1500    Case    20    77623    fans:manual:screws
J Case 1501    Computer Case Jargon Brand Style 1501    Case    18    77624    fans:manual:screws
T Case 4332    Computer Case Tiger Brand Style 4332    Case    7    88211    fans:manual:screws:watercooler
J Power 300W    Power Supply Jargon Brand 300 Watts    Power    28    92387    cables:screws
J Power 500W    Power Supply Jargon Brand 500 Watts    Power    17    92373    cables:screws
T Power 300W    Power Supply Tiger Brand 300 Watts    Power    8    93347    cables:screws
DVD J INT    DVD Jargon Brand Internal    Optical    23    88734    manual
DVD J EXT    DVD Jargon Brand External    Optical    45    88821   
DVD T INT    DVD Tiger Brand Internal    Optical    19    82331    satacable:manual
DVD T EXT    DVD Tiger Brand External    Optical    17    82337    satacable:manual

the below statement still works, but the result is similar to a regular load with one part file (the clustered by and sorted by clauses do not have any effect)
hive>create table product_cls (prod_name string, description varchar(20),qty_onhand int, prod_num bigint, packaged_with Array<string>)
 CLUSTERED BY(prod_num) SORTED BY(qty_onhand ASC) into 5 buckets
row format delimited
 fields terminated by ','
collection items terminated by ':'
stored as textfile;

hive>from product
insert overwrite table product_cls 
select  prod_name, description,qty_onhand,prod_num, packaged_with;

Errors: Without an INTO n BUCKETS clause following CLUSTERED BY <column_name>, an error will be generated.