Monday, June 1, 2020

Python study notes 7: String matching algorithms


What is Levenshtein distance?
The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

Levenshtein distance is mainly used to address typos, and I find it pretty much useless if you want to compare two documents, for example. That's where cosine similarity comes in. It's the exact opposite: useless for typo detection, but great for calculating similarity between whole sentences or documents.

#================================================
# install first (shell commands):
#   pip3 install fuzzywuzzy[speedup]
#   pip install python-Levenshtein

import Levenshtein
Levenshtein.distance('Levenshtein', 'Levensthein')
#output: 2, the transposed 'th' costs two single-character edits

Levenshtein.distance('This is a foo bar sentence', 'This sentence is similar to a foo bar sentence')
#output: 20, way too many for two sentences that mean roughly the same thing
#================================================

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.

You’ll want to construct a vector space from all the ‘sentences’ you want to calculate similarity for. That vector space will have as many dimensions as there are unique words in all sentences combined.

You'll need the string module to remove punctuation from the strings: 'sentence' and 'sentence.' count as different words by default, and you want to avoid that. CountVectorizer will take care of converting strings to numerical vectors, which is also neat. Finally, as this article is written in English, you'll want to remove the most frequent words, which carry no meaning. They are called stopwords: words like 'I', 'me', 'myself', etc.

If you don't remove the stopwords, the vector space gains extra dimensions for words that carry no meaning. Two sentences that convey pretty much the same information but use different filler words then end up farther apart (a greater angle, implying less similarity), while unrelated sentences that happen to share filler words look more similar than they should.

#================================================
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# run nltk.download('stopwords') once if the corpus isn't installed yet
stopwords = stopwords.words("english")

def clean_string(text):
    # strip punctuation character by character
    text = ''.join([char for char in text if char not in string.punctuation])
    text = text.lower()
    # drop the stopwords
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text

sentences = ['This is a foo bar sentence',
             'This sentence is similar to a foo bar sentence',
             'This is another string, but it is not quite similar to the previous one',
             'This is just another string']

cleaned = list(map(clean_string, sentences))
cleaned

vectorizer = CountVectorizer().fit_transform(cleaned)
vectors = vectorizer.toarray()
vectors

csim = cosine_similarity(vectors)
csim
#================================================
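
For reference, csim is a symmetric matrix with one row and one column per sentence, so reading off the similarity of any pair is just an index lookup:

#================================================
# csim[i][j] is the cosine similarity between sentences[i] and
# sentences[j]; the diagonal is 1.0 (each sentence matches itself)
csim[0][1]  # similarity of the first two sentences
#================================================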

Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences in a simple-to-use package. There are 4 common metrics:
ratio, partial_ratio, token_sort_ratio, and token_set_ratio.
a. ratio compares the two strings in their entirety and returns a similarity score between 0 and 100, based on Levenshtein distance.
b. partial_ratio compares the shorter string against every same-length portion of the longer string and returns the best score found.
c. token_sort_ratio sorts the words in both strings before comparing, so it ignores word order.
d. token_set_ratio additionally deduplicates tokens, so it also ignores repeated words.

#================================================
from fuzzywuzzy import fuzz

# process is used to compare a string to MULTIPLE other strings
from fuzzywuzzy import process
import pandas as pd
df = pd.read_csv('room_type.csv')
df.head(10)

# ratio compares the similarity of the entire strings
fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')
#output: 62, means 62% similar

# partial_ratio compares partial string similarity
fuzz.ratio("this is a test", "this is a test with diff!")
#output: 72, the extra characters all count against the score
fuzz.partial_ratio("this is a test", "this is a test with diff!")
#output: 100, exact match with a portion of the longer string

# token_sort_ratio ignores the order of words
fuzz.ratio("this is a test", "this is a test with diff!")
#output: 72
fuzz.token_sort_ratio("this is a diff test", "this is a test with diff!")
#output: 88, higher than plain ratio

fuzz.token_sort_ratio("this is a test with diff!", "this is a diff test with diff")
#output: 91
# token_set_ratio additionally ignores duplicated words
fuzz.token_set_ratio("this is a test with diff!", "this is a diff test with diff")
#output: 100, the duplicated 'diff' no longer hurts the score

choices = ['fuzzy fuzzy was a bear', 'is this a test', 'THIS IS A TEST!!']
process.extract("this is a test", choices, scorer=fuzz.token_sort_ratio)
#output:
# [('is this a test', 100),
#  ('THIS IS A TEST!!', 100),
#  ('fuzzy fuzzy was a bear', 28)]
#================================================
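
If you only need the single best match instead of a ranked list, fuzzywuzzy also provides process.extractOne, which returns just the top (match, score) pair; a quick sketch using the same choices list as above:

#================================================
# return only the best match and its score
process.extractOne("this is a test", choices, scorer=fuzz.token_sort_ratio)
# e.g. ('is this a test', 100)
#================================================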

Jaccard Index/Similarity: the size of the intersection of two sets divided by the size of their union; for strings, the sets are typically the words of each string.
Jaccard Distance: 1 - Jaccard Index.
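
A minimal sketch computing the Jaccard index over word sets (the helper name jaccard_similarity is just for illustration):

#================================================
def jaccard_similarity(a, b):
    # treat each string as a set of words
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

jaccard_similarity('this is a foo bar sentence',
                   'this sentence is similar to a foo bar sentence')
# output: 0.75 (6 shared words, 8 unique words in the union)
#================================================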

Hamming Distance: the number of positions at which two equal-length sequences differ. Note that scipy's distance.hamming returns the fraction of differing positions rather than the raw count:
#================================================
>>> from scipy.spatial import distance
>>> distance.hamming([1, 0, 0], [0, 1, 0])
0.66666666666666663
>>> distance.hamming([1, 0, 0], [1, 1, 0])
0.33333333333333331
>>> distance.hamming([1, 0, 0], [2, 0, 0])
0.33333333333333331
>>> distance.hamming([1, 0, 0], [3, 0, 0])
0.33333333333333331
#================================================
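
Since scipy normalizes by the array length, multiplying the result back by the length recovers the raw count of differing positions:

#================================================
u, v = [1, 0, 0], [0, 1, 0]
# fraction of differing positions times length = raw count
int(round(distance.hamming(u, v) * len(u)))
# output: 2, positions 0 and 1 differ
#================================================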



Solution #1: Python builtin
Use SequenceMatcher from difflib.

pros: native Python library, no extra package needed.
cons: too limited; there are so many other good algorithms for string similarity out there.

example:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
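
difflib also ships get_close_matches, which compares one string against a list of candidates using the same ratio, similar to fuzzywuzzy's process.extract (this example is from the difflib docs):

>>> from difflib import get_close_matches
>>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']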


Solution #2: jellyfish library
It's a very good library with good coverage and few issues. It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

pros: easy to use, a gamut of supported algorithms, tested.
cons: not a native library.

example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
