Saturday, January 18, 2020

Data Science Note-14: Hive Hue tips and error collections

What's HUE?

Hue is an open source SQL Cloud Editor for browsing, querying and visualizing data. Its goal is to make self service data querying more widespread in organizations.

What is Hive?

Apache Hive is a Data warehouse system which is built to work on Hadoop. It is used to querying and managing large datasets residing in distributed storage.
Before becoming an open source project of Apache Hadoop, Hive was originated in Facebook.
It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Common Functions in Hive Hue SQL

Hadoop Ecosystem tools such as HDFS, YARN, MapReduce, Hive, Pig, HBase, Spark, Oozie, Flume and Sqoop.

Hive defines a simple SQL-like query language to querying and managing large datasets called Hive-QL ( HQL ). By using Hive, we can access files stored in Hadoop Distributed File System (HDFS is used to querying and managing large datasets residing in) or in other data storage systems such as Apache HBase.

Hive is not designed for Online transaction processing (OLTP ), it is only used for the Online Analytical Processing. Hive supports overwriting or apprehending data, but not updates and deletes.

ORC file background: Back in January 2013, we created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop.

ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Facebook uses ORC to save tens of petabytes in their data warehouse and demonstrated that ORC is significantly faster than RC File or Parquet.

## find particular string in a column
select val from test
where instr(val, 'google') != 0;

#be careful when use statement: not in (....), it will actually filter out null case.  
multisplitcd not in ('M','D') or multisplitcd is null

##find the maximum/min value from two or more columns
greatest(var1,var2,...) as var_max,
least(var1,var2,...) as var_min
##note: max() or min() works for each column, ntot across columns. 
Generate the percentile for some column: say price_ref
percent_rank() over (PARTITION BY  cntycd order by price_ref) as Abnormal_index

year(recordingdtderived1) as tran_year
cast(trans.recordingdtderived as timestamp) as recordingdtderived1,
to_date(cast(date1 as timestamp)) as date2
datediff(date1, date2)>0  #that means: date1 is later than date2

#rename/change table/dataset name
ALTER TABLE default.old_data_name RENAME TO default.new_data_name;

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...]) --/#*****/add Column
ALTER TABLE name DROP [COLUMN] column_name    --/#*****/drop column
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])


#for the zipcode with leading 0 or trailing zeros: 
lpad(ProductId, 13, "0")
rpad(ProductId, 13, "0")
rand() as random_num

Most popular HIVE command?

#example of Hql in terminal:
show databases;

#create a database:
create database dataname1;

#use default database:
Use default;

#Kill some process(not in hive mode): 
yarn application -kill application_#######;

#desribe a table: 
describe dataname1;

#check the list of tables in some AWS folder: 
aws s3 ls s3://folder/emr_*********/

#hdfs list all 
hdfs dfs -ls
hdfs dfs -copyFromLocal cvaluator_table.csv temp_table

#each time load all the files in one folder to one table in Hadoop, so make sure to have diff separate folders for diff files. 
#to delete some files: 
HDFS dfs -rm  temp_table/csv_name1.csv

#load local file to table: 
LOAD DATA INPATH '/user/user1/temp_table' OVERWRITE INTO TABLE default.table_name1;

#INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
#INSERT INTO is used to append the data into existing data in a table. 

#When we insert the data Hive throwing errors, the dynamic partition mode is strict and dynamic partition not enabled. So we need to set the following parameters in Hive shell.
set hive.exec.dynamic.partition=true;
#To enable dynamic partitions, by default, it’s false
set hive.exec.dynamic.partition.mode=nonstrict;

#user quit to exit from hive command line: 

Common setup configuration to run in Hive:
set hive.mapred.mode=nonstrict;
set hive.strict.checks.cartesian.product = false;
SET hive.exec.compress.intermediate=true; 
set tez.grouping.min-size=35000000;
set tez.grouping.max-size=35000000;
SET hive.tez.container.size=8000;
set mapreduce.fileoutputcommitter.algorithm.version=2;
set = true;
set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;

--see more details for the data files
sudo du -h --max-depth=1 -x

--list all the hive files: 
hdfs dfs -ls /user/hive/warehouse

--check the orc data size in hive/hue:
hdfs dfs -du -s -h /user/hive/warehouse/data1

--export a hive table to csv: 
hive -e "set hive.cli.print.header=true;set hive.resultset.use.unique.column.names=true; 
select * from default.dataset2" | sed 's/[\t]/,/g; s/NULL//g'   > dataset2.csv

--then make copy from the hive to gs bucket:
gsutil cp dataset2.csv gs://buckets/folder/dataset2.csv

--export to a zipped csv file, using | as delimiter:
hive -e "set hive.cli.print.header=true;set hive.resultset.use.unique.column.names=true; 
select * from default.dataset2" | sed 's/[\t]/|/g; s/NULL//g' | gzip > dataset2.csv.gz

In case if you got the error message: "... is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container", then you need to change to bigger size for those two setup: SET; SET hive.tez.container.size=8000;

Edureka’s PySpark Certification Training is designed to provide you the knowledge and skills that are required to become a successful Spark Developer using Python and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).

Throughout the PySpark Training, you will get an in-depth knowledge of Apache Spark and the Spark Ecosystem,which includes Spark RDD, Spark SQL, Spark MLlib and Spark Streaming.
You will also get comprehensive knowledge of Python Programming language, HDFS, Sqoop, Flume, Spark GraphX and Messaging System such as Kafka.

How do we upload CSV file to AWS/Hadoop?

#From the terminal

#Vim is a free and powerful text editor that comes with your Mac. 
vim  ####to edit the file
##In Normal mode, you can type ZZ to save everything and exit.

##from the root directly jkuang: Cat is to show files: 
cat .bash_profile
#bash_profile is some hidden file in the folder, use ls -a to see that. 

#Here are the steps to create bash_profile file: 
#1. Start up Terminal.
#2. Type "cd ~/" to go to your home folder.
#3. Type "touch .bash_profile" to create your new file.
#4. Edit .bash_profile with your favorite editor (or you can just type "open -e .bash_profile" to open it in TextEdit.

##In the bash.profile: add the turnel via, watch out the single quote or wired quote
alias puid='ssh -o ProxyCommand='\''ssh -W %h:%p jkuang@52.32.###.###'\'' jkuang@puid-emr -L 1233:puid-emr:22’
##Add the above line of code onto bash_profile via: vim .bash_profile, use ZZ to exit the editor mode. 

#After we have added the previous line of code into the bash_profile, then we can directly call: 
source .bash_profile
#This will bring us to the log in window, type the same pw twice, 
#directly from bastion/bridge server to HIVE server, otherwise we have to login twice. 
#now we can see the directly via: 
##need to make a directory/folder from the HDSF root via: 
hdfs dfs -mkdir temp_table
##To see the new output folder via: 
hdfs dfs -ls
##then we should see the new folder, in case we will move a lot of partition file onto that folder. 
##now move the file via Filezilla from local to the AWS root directory. 
##now use the command to copy file from the previous AWS file to hdfs: 
hdfs dfs -copy
hdfs dfs -copyFromLocal cvaluator_table.csv temp_table

#each time load all the files in one folder to one table in Hadoop, so make sure to have diff separate folders for diff files. 
#to delete some files: 
HDFS dfs -rm  temp_table/csv_name1.csv

#Now go to HIVE to create the template file: 
drop table default.table_name1;
CREATE TABLE default.table_name1
    letter string,
    factor double
tblproperties('skip.header.line.count'='1'); #include the header as the first row

LOAD DATA INPATH '/user/user1/temp_table' OVERWRITE INTO TABLE default.table_name1;

Error Issue: when running some very simple SQL on Hive/Hue, you might get some error message like this: "Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask at org.apache.hive.service.cli.operation.".

Solution: most like you have set up something like this in the beginning of your SQL session:
set hive.execution.engine=tez;

Try to configure the execution engine to MapReduce instead of Tez, that is, changing the execution engine to MR, instead of using tez:
set hive.execution.engine=mr;

You can also aunch hive in a debug session mode to have more info:
hive --hiveconf hive.root.logger=INFO,console

EAILED: SemanticException TOK_ALLCOLREF is not supported in current context or Error Message: TOK_ALLCOLREF is not supported in current context - while Using DISTINCT in HIVE

Solution: Hive doesn't support DISTINCT * syntax. You can manually specify every field of the table to get the same result。

FAILED: Execution Error, return code 1 from

Solution: add the following code in the beginnning of run:

No comments:

Post a Comment

GCP/Python study notes: upload download file from gcp in python

#=========================================================== import pickle import os import pandas as pd import numpy as np import scipy as ...