EZ Study

Actuarial Biology Chemistry Economics Calculators Confucius Engineer

SAS programming Skills
Mystery of Proc Sort by Nodup and Nodupkey

Question: What's the difference in using Nodup(or NODUPRECS) and Nodupkey?
When should we apply Nodup? When should we apply Nodupkey?

Noduprecs or Nodup: delete duplicate observations, if we find a record having all the same values for all columns as another record, then the observation is not written to the output data set.
Nodupkey: only delete observations with duplicate BY variable values, the other variables that we didn't specify in BY statement might be the same for two records.

In other words, we have more records left by Nodupkey than by Noduprecs.

Note: With the NODUPKEY option, PROC SORT is comparing all BY variable values while the NODUP option compares all the variables in the data set that is being sorted. An easy way to remember the difference between these options is to keep in mind the word "key" in NODUPKEY. It evaluates the “key” or BY variable values that you specify.

One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.

Example: In the following example, we can see proc sort with nodup, in the output data test1, there are 4 records(2 records for id=3, because not all variables have the same value for those records); but for proc sort with nodupkey, only 3 records in the outptut data test2(because we only allow one case for id=3, no more repeat).


Related links:

Continue to next: How can I easily create lag and lead variables?   SAS tutorial home
Back to: How to distribute data in Netezza Server   Statistics tutorial home