Saturday, January 18, 2020

Data Science Study Note-14: Hive Hue tips and error collections

What's HUE?

Hue is an open-source SQL cloud editor for browsing, querying and visualizing data. Its goal is to make self-service data querying more widespread in organizations.

What is Hive?


Apache Hive is a data warehouse system built on top of Hadoop. It is used for querying and managing large datasets residing in distributed storage.
Hive originated at Facebook before becoming an open-source Apache project.
It provides a mechanism to project structure onto data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
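For illustration, here is a minimal HiveQL sketch; the table and column names below are invented, not from any real dataset:

-- hypothetical table over comma-delimited text files
CREATE TABLE IF NOT EXISTS sales_demo (
    order_id string,
    amount   double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- a simple HQL aggregation over it
SELECT order_id, SUM(amount) AS total_amount
FROM sales_demo
GROUP BY order_id;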

Common Functions in Hive Hue SQL

The Hadoop ecosystem includes tools such as HDFS, YARN, MapReduce, Hive, Pig, HBase, Spark, Oozie, Flume and Sqoop.

Hive defines a simple SQL-like query language called HiveQL (HQL) for querying and managing large datasets. Using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.

Hive is not designed for online transaction processing (OLTP); it is used for online analytical processing (OLAP). Hive supports overwriting or appending data, but not updates and deletes.

ORC file background: Back in January 2013, we created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop.

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Facebook uses ORC to save tens of petabytes in their data warehouse and has demonstrated that ORC is significantly faster than RCFile or Parquet.
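As a hedged sketch (the table names here are placeholders), storing a Hive table as ORC only takes a STORED AS clause, e.g.:

-- create an ORC-backed copy of an existing table, with snappy compression
CREATE TABLE sales_demo_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM sales_demo;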

#===============================================================
## find particular string in a column
select column1 from test
where instr(column1, 'google') != 0;

or use like statement: 
where column1 like "%CURTIS LN%";

or use filter for a pyspark dataframe: 
df.filter(df["column1"].contains("google"))

select distinct cntycd,statecd,INITCAP(cntyname) as cntyname,
lower(cntyname) as cntyname_l,
upper(cntyname) as cntyname_u
from db.v_datasupplier 

#be careful when using the statement: not in (...); it also filters out NULL values, so keep them explicitly if needed:
multisplitcd not in ('M','D') or multisplitcd is null
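A quick hedged illustration of the difference (the table and column names are reused from above purely as placeholders):

-- drops the NULL rows along with 'M' and 'D'
select count(*) from test where multisplitcd not in ('M','D');
-- keeps the NULL rows
select count(*) from test where multisplitcd not in ('M','D') or multisplitcd is null;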

##find the maximum/min value from two or more columns
greatest(var1,var2,...) as var_max,
least(var1,var2,...) as var_min
##note: max() or min() works within each column, not across columns. 
Generate the percentile rank for a column, say price_ref:
percent_rank() over (PARTITION BY  cntycd order by price_ref) as pcnt_rank

count(irislandusecd) over (partition by irislandusecd 
    order by cnt desc rows unbounded preceding) as rank --rank of 1,2,3,...

percent_rank() over (PARTITION BY  
    irislandusecd order by cnt) as pcnt_rank -- rank of 100%, 96%, ...

year(recordingdtderived1) as tran_year
cast(trans.recordingdtderived as timestamp) as recordingdtderived1,
to_date(cast(date1 as timestamp)) as date2
datediff(date1, date2)>0  #that means: date1 is later than date2

#using different functions in BigQuery to convert epoch values to timestamps: 
SELECT TIMESTAMP_MILLIS(effectivedt) as date_from_milliseconds,
TIMESTAMP_SECONDS(effectivedt) as date_from_seconds
FROM `c***-poc-****.df.lender` 
where effectivedt is not null LIMIT 10;

#rename/change table/dataset name
ALTER TABLE default.old_data_name RENAME TO default.new_data_name;

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...]) --/#*****/add Column
ALTER TABLE name DROP [COLUMN] column_name    --/#*****/drop column
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

ALTER TABLE name { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (partition_spec)

#to pad a zip code or id with leading or trailing zeros: 
lpad(ProductId, 13, "0")
rpad(ProductId, 13, "0")
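For instance, a small worked example with literal values (purely illustrative):
select lpad('501', 5, '0') as zip_left;    -- returns '00501'
select rpad('96', 5, '0') as zip_right;    -- returns '96000'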
rand() as random_num
#===============================================================

Most popular Hive commands?
#===============================================================
#example of Hql in terminal:
#to list all databases:
show databases;

#to use a particular database:
use database1;

#To list all tables: 
show tables;
show tables like "project_*"; --list all tables starting with project_ (only one pattern at a time). 

SHOW TABLES FROM ``
WHERE 
    `Tables_in_` LIKE '%cms%'
    OR `Tables_in_` LIKE '%role%';
--use where statement for multiple patterns.

#to see the detailed description of a hive table
#(location, creation time, columns, etc.):
desc formatted tablename1;

#describe a table: 
describe dataname1;

#create a database:
create database dataname1;

#use default database:
Use default;

#point a hive table on a different vm at the same GCS files: 
#1. get the table DDL from the original vm1:
show create table tablename1;
#2. copy the generated DDL for tablename1, then run it on vm2.
#Both steps are run in hive.
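As a hedged sketch of what gets copied in step 2 (the columns, storage format and bucket path below are placeholders, not the real table):

-- roughly what SHOW CREATE TABLE prints on vm1; re-run it as-is on vm2
CREATE EXTERNAL TABLE tablename1 (
    id  string,
    val double
)
STORED AS ORC
LOCATION 'gs://some-bucket/hive_data/tablename1';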

#Kill a yarn application (run from the shell, not inside hive): 
yarn application -kill application_#######;

#check the list of files in some AWS S3 folder: 
aws s3 ls s3://folder/emr_*********/

#hdfs: list all files, and copy a local file into HDFS:
hdfs dfs -ls
hdfs dfs -copyFromLocal cvaluator_table.csv temp_table

#each LOAD DATA loads all the files in a folder into one table in Hadoop, so keep different tables' files in separate folders. 
#to delete some files: 
hdfs dfs -rm temp_table/csv_name1.csv

#load the file(s) from the HDFS staging folder into the table: 
LOAD DATA INPATH '/user/user1/temp_table' OVERWRITE INTO TABLE default.table_name1;

#INSERT OVERWRITE is used to replace the existing data in the table or partition.
#INSERT INTO is used to append data to the existing data in a table. 
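A small hedged example of the difference (target_table and staging_table are illustrative names, not from this post):
-- replaces whatever currently sits in target_table (or the partition)
INSERT OVERWRITE TABLE target_table SELECT * FROM staging_table;
-- appends new rows to the existing data in target_table
INSERT INTO TABLE target_table SELECT * FROM staging_table;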

#When inserting data, Hive may throw an error that dynamic partition mode is strict and dynamic partitioning is not enabled. Set the following parameters in the Hive shell:
set hive.exec.dynamic.partition=true;   --enable dynamic partitions (false by default)
set hive.exec.dynamic.partition.mode=nonstrict;
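For example, a hedged sketch of a dynamic-partition insert (the table names, columns, and partition column below are invented for illustration):
-- hypothetical target table, partitioned by tran_year
CREATE TABLE IF NOT EXISTS sales_by_year (order_id string, amount double)
PARTITIONED BY (tran_year int);
-- dynamic-partition insert: the partition value is taken from the last column of the SELECT
INSERT OVERWRITE TABLE sales_by_year PARTITION (tran_year)
SELECT order_id, amount, year(order_date) AS tran_year
FROM sales_raw;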

#use quit to exit the hive command line: 
quit;
#===============================================================

Common setup configuration to run in Hive:
#===============================================================
set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product = false;
SET hive.exec.compress.intermediate=true; 
SET tez.am.resource.memory.mb=96000; 
set tez.grouping.min-size=35000000;
set tez.grouping.max-size=35000000;
SET hive.tez.container.size=8000;
set mapreduce.fileoutputcommitter.algorithm.version=2;
set hive.blobstore.use.blobstore.as.scratchdir=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size=256000000;
set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;

--check the disk usage of the data directories (one level deep):
sudo du -h --max-depth=1 -x

--list all the hive files: 
hdfs dfs -ls /user/hive/warehouse

--check the orc data size in hive/hue:
hdfs dfs -du -s -h /user/hive/warehouse/data1

--check the orc/parquet data size of gcs bucket file:
gsutil du -s -ch gs://e*g-d**-data/hive_data/check0

--export a hive table to csv: 
hive -e "set hive.cli.print.header=true;set hive.resultset.use.unique.column.names=true; 
select * from default.dataset2" | sed 's/[\t]/,/g; s/NULL//g'   > dataset2.csv

--then copy the csv from the hive server to a gs bucket:
gsutil cp dataset2.csv gs://buckets/folder/dataset2.csv

gsutil -m cp -r  gs://buckets//prod/*  prod/
#make sure the local destination folder prod/ exists first, otherwise you might get this error message:
#CommandException: Wrong number of arguments for "cp" command

--export to a zipped csv file, using | as delimiter:
hive -e "set hive.cli.print.header=true;set hive.resultset.use.unique.column.names=true; 
select * from default.dataset2" | sed 's/[\t]/|/g; s/NULL//g' | gzip > dataset2.csv.gz
#===============================================================

Why do we need to use the command sudo su?
Once you run the sudo su command, it asks you for your own user password instead of the root password.
sudo su logs you into the root account; otherwise you might get permission errors. After a successful login it does not ask for authentication again; if you want to be asked every time, you have to leave the root account by running the exit command. When you have finished what you are doing, type exit: you will lose the root privileges and return to your normal user.

When you get a permission error message, you can also run "sudo -s" to continue ("s" stands for shell), for instance:
sudo -s cd prodout

To see the difference between su and sudo -s, do cd ~ and then pwd after each of them:
In the first case for su, you'll be in root's home directory, because you're root.

In the second case for sudo -s, you'll be in your own home directory, because you're yourself with root privileges.

The major difference between sudo -i and sudo -s is:
sudo -i gives you the root environment, i.e. your ~/.bashrc is ignored.
sudo -s gives you the user's environment, so your ~/.bashrc is respected.
For example, I have an application lsl in my ~/.bin/ directory which is accessible via sudo -s but not accessible with sudo -i. Note also that the Bash prompt changes with sudo -i but not with sudo -s.

Here is an example of using the root account to delete some files under /cores via "sudo su -":
#===============================================================
#on the new terminal e.g. Iterm2 on Mac:
sudo su -
pw:
cd /cores
ls -lh
(find the last one for the #id)
lldb -c core.#id
target list
exit
 
#after the exit, go to delete those targeted files: 
ls -l
rm -rf *
#===============================================================


If you get the error message "... is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container", then you need to increase the values of these two settings: SET tez.am.resource.memory.mb=96000; SET hive.tez.container.size=8000;
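For example, one could put the two settings at the top of the failing session; the numbers below are only illustrative and should be tuned to the cluster:
SET tez.am.resource.memory.mb=96000;  --memory for the Tez application master, in MB
SET hive.tez.container.size=8000;     --memory per Tez container, in MB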



How do we upload a CSV file to AWS/Hadoop?

#===============================================================
#From the terminal

#Vim is a free and powerful text editor that comes with your Mac. 
vim life.md  ####to edit the file life.md
##In Normal mode, you can type ZZ to save everything and exit.

##from the user's home directory: cat shows the contents of a file: 
cat .bash_profile
#.bash_profile is a hidden file in the home folder; use ls -A to see all the files.
#note: ls -a also lists the . and .. entries, which ls -A leaves out. 
#ls -r: reverse order in the listing. 
#ls -X: sort the files by extension.
#ls -t: list the newest first. 
#ls -R: list all subdirectories recursively, which can be very long. 
#ls -l --author: list the author of the files
#ls -s --block-size=k file1: list file1 with its size in KB
#ls -s -h: list all files with their sizes. 

#Here are the steps to create bash_profile file: 
#1. Start up Terminal.
#2. Type "cd ~/" to go to your home folder.
#3. Type "touch .bash_profile" to create your new file.
#4. Edit .bash_profile with your favorite editor (or you can just type "open -e .bash_profile" to open it in TextEdit).

##In .bash_profile, add the ssh tunnel alias below; watch out for plain single quotes vs curly/smart quotes:
alias puid='ssh -o ProxyCommand='\''ssh -W %h:%p *****@52.32.###.###'\'' ****@puid-emr -L 1233:puid-emr:22'
##Add the above line to .bash_profile via: vim .bash_profile, and use ZZ to save and exit. 

#After we have added the previous line to .bash_profile, we can reload it and call the alias directly: 
source .bash_profile
puid
#This will bring us to the login prompt; type the same password twice (once for the bastion/bridge server, once for the Hive server). 
#The tunnel goes from the bastion/bridge server directly to the Hive server, so we do not have to run ssh twice ourselves. 
#now we can see the home directory contents via: 
ls
##make a directory/folder under the HDFS home via: 
hdfs dfs -mkdir temp_table
##To see the new output folder via: 
hdfs dfs -ls
##then we should see the new folder; later we may move a lot of partition files into it. 
##now move the file via FileZilla from the local machine to the AWS home directory. 
##now copy the file from the AWS home directory into hdfs: 
hdfs dfs -copyFromLocal cvaluator_table.csv temp_table

#each LOAD DATA loads all the files in a folder into one table in Hadoop, so keep different tables' files in separate folders. 
#to delete some files: 
hdfs dfs -rm temp_table/csv_name1.csv

###########################################
#Now go to Hive to create the table definition: 
drop table default.table_name1;
CREATE TABLE default.table_name1
(
    letter string,
    factor double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties('skip.header.line.count'='1'); --skip the header line (first row) of the csv when reading

LOAD DATA INPATH '/user/user1/temp_table' OVERWRITE INTO TABLE default.table_name1;
#===============================================================

Error issue: when running even very simple SQL on Hive/Hue, you might get an error message like this: "Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask at org.apache.hive.service.cli.operation.".


Solution: most likely you have set something like this at the beginning of your SQL session:
set hive.execution.engine=tez;

Try switching the execution engine from Tez to MapReduce (MR):
set hive.execution.engine=mr;

You can also launch hive in debug mode to get more info:
hive --hiveconf hive.root.logger=INFO,console

Error message: FAILED: SemanticException TOK_ALLCOLREF is not supported in current context (this shows up while using DISTINCT in Hive).

Solution: Hive doesn't support the DISTINCT * syntax. You can manually specify every field of the table to get the same result.
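A hedged illustration (the table and column names are placeholders):
-- fails with TOK_ALLCOLREF:
-- select distinct * from some_table;
-- list the columns explicitly instead:
select distinct col1, col2, col3 from some_table;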

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask

Solution: add the following setting at the beginning of the run:
SET hive.auto.convert.join=false;

How do we run a loop in Hive HQL?
First, the loop logic should go in a shell script, not inside the HQL code.
Step 1: Create the script file: loop_data.sh
#note: no need to put "," to separate the looped values,
#and no need to put single/double quotes around them either.
#Create the file in the terminal directly, not in Windows txt format (CRLF line endings will break the script).
#!/bin/bash
for flag in 1 2
do
  hive -hiveconf flag=$flag -f new_data.hql
done
another example like:
#!/bin/bash
for state in 01 02 04   05      06      08      09      10      11      12      13      15      16      17      18      19      20      21      22      23      2$
do
   echo "running for state fips: " $state
   hive -hiveconf state=$state -f prod_step1c_sql.hql
done
Then Step 2: execute the script like:
sh loop_data.sh

Here is an example of what goes inside that prod_step1c_sql.hql:
set hive.cli.print.header=true;
set hive.execution.engine=mr;
SET hive.execution.engine=tez;
SET tez.am.resource.memory.mb=24000;
SET hive.tez.container.size=24000;
SET hive.tez.java.opts="-Xmx12000m";
SET hive.compute.query.using.stats=false;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
SET hive.auto.convert.join=false;

set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product = false;
SET hive.exec.compress.intermediate=true;
SET tez.am.resource.memory.mb=24000;
set tez.grouping.min-size=35000000;
set tez.grouping.max-size=35000000;
SET hive.tez.container.size=24000;
set mapreduce.fileoutputcommitter.algorithm.version=2;
set hive.blobstore.use.blobstore.as.scratchdir=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size=25600000;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

DROP TABLE IF EXISTS default.data_${hiveconf:state};
CREATE TABLE default.data_${hiveconf:state}
stored as parquet
tblproperties("compress"="SNAPPY")
as select a.*
from data0 a
where state=${hiveconf:state}
;

Here are the contents of the file new_data.hql:
create table new_data as
select 
  a.*,
  b.y
from 
  A as a left join 
  B as b on 
  a.x = b.x
where
  flag = ${hiveconf:flag};

If you are creating the table once and then inserting into it repeatedly, you should split out the DDL & DML into 2 separate scripts.
DDL: create_new_data.hql
create table new_data as
select 
  a.*,
  b.y
from 
  A as a left join 
  B as b on 
  a.x = b.x
where 
  1 = 0;
DML: insert_new_data.hql
insert into new_data 
select 
  a.*,
  b.y
from 
  A as a left join 
  B as b on 
  a.x = b.x
where
  flag = ${hiveconf:flag};
Then update your shell script like:

File Name: loop_new_data.sh
#!/bin/bash
# Create the table once
hive -f create_new_data.hql

# Insert data
for flag in 1 2
do
  hive -hiveconf flag=$flag -f insert_new_data.hql
done
And execute it like:
sh loop_new_data.sh
