EZ Study


How to distribute data in a Netezza server?
Choosing the distribution key of a table in Netezza
Remember to generate statistics on big Netezza tables

Error: The count of bad input rows might reach the maxerrors limit

Answer: When you load data into a Netezza server, you may need to specify the distribution key with one of the following options:

dbcreate_table_opts='distribute on (var1)'     or
dbcreate_table_opts='distribute on random'

In Netezza, each table's data is distributed across many servers (SPUs). This distribution of data enables parallel processing, so we should be careful to choose the right distribution key when creating a table.

There are two types of distribution methods:
Hash & Random algorithm
The random algorithm applies when you specify 'distribute on random' in the create table statement.

The hash algorithm applies when you specify column names in the 'distribute on' clause. If you don't specify a 'distribute on' clause when creating a table, by default NPS distributes the data on the first column in the CREATE TABLE statement.
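The two methods can be sketched in Netezza SQL as follows (the table and column names here are hypothetical, for illustration only):

```
-- Hash distribution: rows are assigned to data slices by hashing cust_id,
-- so all rows with the same cust_id land on the same SPU
CREATE TABLE sales_hash (
    cust_id   INTEGER,
    amount    NUMERIC(10,2)
)
DISTRIBUTE ON (cust_id);

-- Random (round-robin) distribution: rows are spread evenly
-- across data slices regardless of their values
CREATE TABLE sales_rand (
    cust_id   INTEGER,
    amount    NUMERIC(10,2)
)
DISTRIBUTE ON RANDOM;
```

Random distribution guarantees an even spread but gives up the co-location benefits of hash distribution described below.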

Tips for choosing a good distribution key: a good rule of thumb is to select columns with unique values and high cardinality. If you are not sure which variable to choose as the distribution key, you can simply use random distribution.

data netlib.foo(distribute_on='random');
  mycol1=1;
  mycol2='test';
run;

To create a distribution key on more than one column, separate the columns with commas.
data netlib.mytab(DISTRIBUTE_ON='col1,col2');
  col1=1;
  col2=12345;
  col4='mytest';
  col5=98.45;
run;

Choose columns that distribute rows evenly across all the SPUs.
proc sql;
  create table netezza_lib.dataout (distribute_on='random') /* or a column list such as 'keyvar' */
  as select * from work.datain;
quit;

proc sql;
  connect to netezza (server='server_name');
  execute (create table outputfile as
           select var1, var2, var3
           from another_netezza_data
           order by var1, var3
           distribute on random) by netezza;
  execute (generate statistics on outputfile) by netezza;
quit;
/******************************************************/

After you load data from an external table into a user table, you should run GENERATE STATISTICS to update the statistics for the user table. This will help to improve the performance of queries that run against that table.

Always specify the distribution clause, even if you are distributing on the first column of the table. Use the same distribution key and data types for commonly joined tables. Do not distribute on floating-point or Boolean data types.

Important: if you know ahead of time that you are going to join two tables by matching a column in each, distribute each table on its join column; this will make the join much faster.

Once both tables are distributed on the same key, matching rows hash to the same data slice on each S-Blade, so the join can run locally. Netezza does not need to "re-broadcast", i.e., find and move all the matching pieces together before doing the join.
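This co-located join pattern can be sketched in Netezza SQL (table and column names are hypothetical):

```
-- Distribute both tables on the join key so matching rows
-- land on the same data slice
CREATE TABLE orders_fact (
    cust_id  INTEGER,
    order_dt DATE
)
DISTRIBUTE ON (cust_id);

CREATE TABLE cust_dim (
    cust_id   INTEGER,
    cust_name VARCHAR(100)
)
DISTRIBUTE ON (cust_id);

-- Because both tables are distributed on cust_id, this join can
-- execute slice-locally, with no redistribution or broadcast
SELECT f.order_dt, d.cust_name
FROM orders_fact f
JOIN cust_dim d ON f.cust_id = d.cust_id;
```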

In general, don't use columns with heavily repeated values as the distribution key, since they skew the data across the data slices, which can cause performance trouble in Netezza.
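One way to check for skew, assuming your Netezza version exposes the hidden datasliceid column, is to count rows per data slice; large differences between slices indicate a skewed distribution key:

```
-- Rows per data slice for a (hypothetical) table mytable;
-- a roughly equal count per slice means the key distributes well
SELECT datasliceid, COUNT(*) AS row_cnt
FROM mytable
GROUP BY datasliceid
ORDER BY row_cnt DESC;
```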

Also, don't use variables like dates as your distribution key. Queries usually filter on specific dates, and in that case the data you want to select might all sit on one S-Blade, which defeats parallelism and is inefficient for your join.

Furthermore, joining on numeric variables is faster than joining on character variables (determined by Netezza's algorithm).

Question: What if you get the following error when you upload data into Netezza: "The count of bad input rows might reach the maxerrors limit"?

Answer: If you run the following example, you will get the above error message. The issue is caused by the numeric format.

libname mydb odbc dsn=nzsql;

data test;
  format tenants 20.0; /* notice: no decimal places */
  tenants=50.1;
  output;
run;

proc delete data=mydb.test; run;

data mydb.test; set test; run;

To get around the error, simply change the format from "format tenants 20.0" to "format tenants 20.1" so that the format can represent the decimal value. Source: SAS Support.