About a year ago, I saw a colleague of mine working on a large data set, aiming to measure the similarity level between each pair of strings. My colleague started to develop a method that calculated the “distance” between the strings, with a set of rules he came up with.
Since I was familiar with the FuzzyWuzzy python library, I knew there was a faster and more efficient way to determine if a pair of strings was similar or different. No tedious calculation needed, no reinventing the wheel, just familiarity with FuzzyWuzzy, and with a few other tips that I will share with you today.
What Is String Comparison, And How Can FuzzyWuzzy Help?
FuzzyWuzzy is a Python library that calculates a similarity score for two given strings.
The similarity score is given on a scale of 0 (completely unrelated) to 100 (a close match).
After all, if all you need is exact string comparison, Python has got you covered:
>>> "check out this example" == "check out this example"
True>>> "check out this example" == "something completely different"
But what if your strings are not necessarily exactly the same, yet you still need to know how similar they are?
>>> "check out this example" == "check out this exampel"
This isn’t very helpful.
But check this out:
>>> fuzz.ratio("check out this example", "check out this exampel")
If the relevant libraries are not installed on your virtual environment and imported before usage, you will need to run the followings:
In a command line:
pip install pandas
pip install fuzzywuzzy
# or — preferably
pip install fuzzywuzzy[speedup]
# [speedup] installs python-Levenshtein library
for better performance
Inside your IDE:
>>> import pandas as pd
>>> from fuzzywuzzy import fuzz
Levenshtein Distance — Behind the Scenes of FuzzyWuzzy
The different FuzzyWuzzy functions use Levenshtein distance — a popular algorithm that calculates the distance between two strings. The “further” those strings are from one another, the greater is the distance score (as opposed to FuzzyWuzzy output).
However, Levenshtein distance has a major disadvantage:
It has one logic, while in real life there are a few different ways to define similarity of strings.
Levenshtein distance cannot cover different cases with one piece of logic. Some cases would be defined as similar strings by a human being, yet missed by Levenstein.
>>> Levenshtein.distance(“measuring with Levenshtein”,
11>>> Levenshtein.distance(“Levenshtein distance is here”,
“here is distance Levenshtein”)
If someone had to decide which pair of strings is more similar, they would probably pick the second option. However, we can see that it actually got a higher distance score.
We need something better than that.
Getting to Know the Functions
FuzzyWuzzy has a few different functions. Each of these functions takes a pair of strings and returns a 0–100 match score. But each of these functions has a slightly different logic, so we can select the most appropriate function to meet our needs.
Let’s get to know some prominent FuzzyWuzzy functions:
The simplest function. Calculates the score based on the following logic:
Given two strings, where T is the total number of elements in both sequences, and M is the number of matches, the similarity between those strings is:
2*(M / T)*100
>>> fuzz.ratio("great", "green")
In this example -
T = len(“great”)+len(“green”) = 10
M = 3
So the formula is 2*(3/10)*100
For two strings, A and B, where len(A) < len(B) and len(A) = n, this function will run ratio function between A and all n-length substrings of B and will retrieve the highest of all scores calculated in this process.
Let’s take a look at the following three examples, and then discuss them:
>>> fuzz.partial_ratio(“let’s compare strings”, “strings”)
>>> fuzz.partial_ratio(“let’s compare strings”, “stings”)
>>> fuzz.ratio(“let’s compare strings”, “strings”)
The examples above demonstrate the followings:
1. String in B is fully included in string A, therefore the matching score is 100.
2. String in B is almost fully included in string A , except for a “typo”, therefore the matching score is 83.
3. Same as the first example, but using the ratio function between A and B, instead of partial_ratio. Since ratio function is not aimed to handle the substring case — the score is lower.
Sorts the tokens inside both strings (usually split into individual words) and then compares them. This will retrieve a 100 matching score for strings A, B, where A and B contain the same tokens but in different orders.
See the examples below, once with token_sort_ratio function, and once with ratio function, to see the different results.
>>> fuzz.token_sort_ratio("let's compare strings",
"strings compare let's")
>>> fuzz.ratio("let's compare strings", "strings compare let's")
What about the same token but different counts?
>>> fuzz.token_sort_ratio("let's compare", "let's compare compare")
Well, this isn’t the functions’ specialty. For this case, we have token_set_ratio coming to the rescue.
The main purpose of token_set_ratio is to ignore duplicates and
order-differences between the given tokens.
We can see that the two examples below got the same matching score, although in the second example some tokens have been duplicated and their order was changed.
>>> fuzz.token_set_ratio("we compare strings together",
"together we talk")
>>> fuzz.token_set_ratio("strings we we we compare together",
"together together talk we")
The logic behind this function, after simplifying it a bit, is as follows:
Given strings A, B (let’s have A = ‘hi hey ho’ and B = ‘yo yay ho’):
C = intersection of A and B (‘ho’)
D = C + remainder of A (‘ho hi hey’)
E = C + remainder of B (‘ho yo yay’)
So token_set_ratio runs ratio(C, D), ratio(C, E), ratio(D, E) and returns the highest score of the three.
The full logic contains additional normalization, such as applying set() on A and B, applying sort() on C, and more.
The code implemented by each of the functions described above, as well as other useful FuzzyWuzzy functions, can be found here.
Getting To The Nitty Gritty
In this article, we covered the basics of FuzzyWuzzy library and its functions.
However, getting the best results for our project does not begin and end with knowing which functions are available out there and how to use them.
Doing it like a pro means knowing how to clean data before working with it, how to compare not only pairs of strings but big tables as well, which threshold score (or scores) to use and so much more.
To learn all of these, I invite you to read the successive article called FuzzyWuzzy — the Before and After.
I would love to hear about your experience with FuzzyWuzzy: when did you use it? For what purposes do you plan using it in the future? What makes working with FuzzyWuzzy fun and comfortable for you?
Comment below and share your thoughts!