Comparing Strings Is Easy With FuzzyWuzzy

About a year ago, I saw a colleague of mine working on a large data set, aiming to measure the similarity level between each pair of strings. My colleague started to develop a method that calculated the “distance” between the strings, with a set of rules he came up with.

Since I was familiar with the FuzzyWuzzy python library, I knew there was a faster and more efficient way to determine if a pair of strings was similar or different. No tedious calculation needed, no reinventing the wheel, just familiarity with FuzzyWuzzy, and with a few other tips that I will share with you today.

What Is String Comparison, And How Can FuzzyWuzzy Help?

FuzzyWuzzy is a Python library that calculates a similarity score for two given strings.

The similarity score is given on a scale of 0 (completely unrelated) to 100 (a close match).

After all, if all you need is exact string comparison, Python has got you covered:

But what if your strings are not necessarily exactly the same, yet you still need to know how similar they are?

This isn’t very helpful.

But check this out:

Much better.

Getting Started

If the relevant libraries are not installed on your virtual environment and imported before usage, you will need to run the followings:

In a command line:

Inside your IDE:

Levenshtein Distance — Behind the Scenes of FuzzyWuzzy

The different FuzzyWuzzy functions use Levenshtein distance — a popular algorithm that calculates the distance between two strings. The “further” those strings are from one another, the greater is the distance score (as opposed to FuzzyWuzzy output).

However, Levenshtein distance has a major disadvantage:
It has one logic, while in real life there are a few different ways to define similarity of strings.
Levenshtein distance cannot cover different cases with one piece of logic. Some cases would be defined as similar strings by a human being, yet missed by Levenstein.

For example:

If someone had to decide which pair of strings is more similar, they would probably pick the second option. However, we can see that it actually got a higher distance score.

We need something better than that.

Getting to Know the Functions

FuzzyWuzzy has a few different functions. Each of these functions takes a pair of strings and returns a 0–100 match score. But each of these functions has a slightly different logic, so we can select the most appropriate function to meet our needs.

Let’s get to know some prominent FuzzyWuzzy functions:

  • fuzz.ratio:
    The simplest function. Calculates the score based on the following logic:
    Given two strings, where T is the total number of elements in both sequences, and M is the number of matches, the similarity between those strings is:
    2*(M / T)*100

In this example -
T = len(“great”)+len(“green”) = 10
M = 3
So the formula is 2*(3/10)*100

  • fuzz.partial_ratio:
    For two strings, A and B, where len(A) < len(B) and len(A) = n, this function will run ratio function between A and all n-length substrings of B and will retrieve the highest of all scores calculated in this process.
    Let’s take a look at the following three examples, and then discuss them:

The examples above demonstrate the followings:
1. String in B is fully included in string A, therefore the matching score is 100.
2. String in B is almost fully included in string A , except for a “typo”, therefore the matching score is 83.
3. Same as the first example, but using the ratio function between A and B, instead of partial_ratio. Since ratio function is not aimed to handle the substring case — the score is lower.

  • fuzz.token_sort_ratio
    Sorts the tokens inside both strings (usually split into individual words) and then compares them. This will retrieve a 100 matching score for strings A, B, where A and B contain the same tokens but in different orders.
    See the examples below, once with token_sort_ratio function, and once with ratio function, to see the different results.

What about the same token but different counts?

Well, this isn’t the functions’ specialty. For this case, we have token_set_ratio coming to the rescue.

  • fuzz.token_set_ratio
    The main purpose of token_set_ratio is to ignore duplicates and
    order-differences between the given tokens.

    We can see that the two examples below got the same matching score, although in the second example some tokens have been duplicated and their order was changed.

The logic behind this function, after simplifying it a bit, is as follows:
Given strings A, B (let’s have A = ‘hi hey ho’ and B = ‘yo yay ho’):
C = intersection of A and B (‘ho’)
D = C + remainder of A (‘ho hi hey’)
E = C + remainder of B (‘ho yo yay’)
So token_set_ratio runs ratio(C, D), ratio(C, E), ratio(D, E) and returns the highest score of the three.

The full logic contains additional normalization, such as applying set() on A and B, applying sort() on C, and more.

The code implemented by each of the functions described above, as well as other useful FuzzyWuzzy functions, can be found here.

Getting To The Nitty Gritty

In this article, we covered the basics of FuzzyWuzzy library and its functions.

However, getting the best results for our project does not begin and end with knowing which functions are available out there and how to use them.

Doing it like a pro means knowing how to clean data before working with it, how to compare not only pairs of strings but big tables as well, which threshold score (or scores) to use and so much more.

To learn all of these, I invite you to read the successive article called FuzzyWuzzy — the Before and After.

I would love to hear about your experience with FuzzyWuzzy: when did you use it? For what purposes do you plan using it in the future? What makes working with FuzzyWuzzy fun and comfortable for you?

Comment below and share your thoughts!

I’m a software developer with previous experience in risk & data analysis, working in a FinTech company.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store