Damerau-Levenshtein distance in Netezza SQL
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric, that is, a "distance" between two strings (finite sequences of symbols). It is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
For example, the Levenshtein distance between "kitten" and "sitting" is 3: the following three edits change one into the other, and there is no way to do it with fewer than three edits. No transposition helps here, so the Damerau-Levenshtein distance is also 3.
1. kitten --> sitten (substitution of "s" for "k")
2. sitten --> sittin (substitution of "i" for "e")
3. sittin --> sitting (insertion of "g" at the end).
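The count above can be checked with a small reference implementation. The following is a minimal Python sketch of the standard dynamic-programming computation of Levenshtein distance; the function name `levenshtein` is chosen here for illustration and is not part of Netezza.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic-programming table."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    # Initially a's prefix is empty, so the distance is j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.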
Related string metrics differ in the operations they allow. Damerau-Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters. Levenshtein distance allows insertion, deletion, and substitution, but not transposition. The longest common subsequence metric allows only insertion and deletion, not substitution. Hamming distance allows only substitution, so it applies only to strings of the same length.
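To make the difference between these metrics concrete, here is a hedged Python sketch that computes three of them on the same pair of strings. The function names are illustrative only. Note that `osa_distance` implements the restricted variant of Damerau-Levenshtein distance (optimal string alignment), in which no substring is edited more than once; whether a given database function uses this variant or the unrestricted one is not something this sketch assumes.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions to reach the empty string
    for j in range(cols):
        d[0][j] = j  # j insertions from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def lcs_distance(a: str, b: str) -> int:
    """Edit distance with insertion and deletion only, derived from the LCS length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(osa_distance("abcd", "abdc"))  # 1: one adjacent transposition
print(lcs_distance("abcd", "abdc"))  # 2: one deletion plus one insertion
print(hamming("abcd", "abdc"))       # 2: two positions differ
```

The swapped pair "cd" / "dc" costs one edit when transposition is allowed and two edits under the other two metrics.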
Netezza functions to calculate Levenshtein and Damerau-Levenshtein distance:
: le_dst ( string_expression1 , string_expression2 )
Returns a value indicating how different the two input strings are,
calculated according to the Levenshtein edit distance algorithm.
: dle_dst ( string_expression1 , string_expression2 )
Returns a value indicating how different the two input strings are, calculated according to the Damerau-Levenshtein distance algorithm.