Fuzzy Matching Functions

Overview of Fuzzy Matching

Fuzzy matching is the idea that you may want to treat two records as a "match" even if they do not match perfectly. Misspellings in user-entered data are a common example. For instance, two records containing the names "George Washington" and "George Washnigton" likely represent a match, but an approach that looks only for exact matches would miss it, and regular expressions, while quite powerful, cannot easily anticipate every way two such records might differ.

Fuzzy matching computes a score from 0 to 100 that represents how similar two values are. A score of 0 means "not similar at all" and a score of 100 represents a perfect match. There are a variety of ways to compute such a score using distance metrics ("how far away is one string from another string"). A very common method is Levenshtein distance, which counts the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other (see https://en.wikipedia.org/wiki/Levenshtein_distance for more technical information). LityxIQ uses an approach based on this method for computing fuzzy matching scores.
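
To make the idea of a distance-based score concrete, here is a minimal Python sketch of Levenshtein distance together with one possible way of scaling it to a 0 to 100 score. The scaling formula and the function names levenshtein and similarity_score are assumptions made for illustration only; the exact computation LityxIQ uses is not described here and may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> int:
    """ASSUMED scaling, not LityxIQ's documented formula:
    100 * (1 - distance / length of the longer string), rounded."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("George Washington", "George Washnigton"))
print(similarity_score("George Washington", "George Washnigton"))
```

Under this assumed scaling, the swapped "i" and "n" count as two substitutions out of 17 characters, so the two names score high without being a perfect match.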

There are two functions in LityxIQ that support fuzzy matching computations: fuzzy and fuzzy_tokens. They are described in detail below.

The fuzzy function

The fuzzy function is used to directly compare two strings to create a fuzzy match score.

Usage: fuzzy(str1, str2, partial, case_insensitive) returns a score from 0 to 100 representing how close str1 and str2 are.

The first two parameters str1 and str2 are the two strings to be compared. Note that one or both can be variables in a dataset or fixed comparison strings.
The third parameter "partial" is optional, and is 0 by default, meaning that the full strings are compared, not partial strings. Setting it to 1 does a partial-string comparison. In this case, the score that comes back is based on the similarity of a partial match, not necessarily requiring a full match. For example, when comparing against the string "New York Yankees", the string "New York" is a strong partial match, but not as strong as a full match.
The fourth parameter "case_insensitive" is optional, and is 1 by default, meaning that all calculations are case-insensitive. To make the comparison case-sensitive ("a" is different from "A"), set it to 0.
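
The effect of the two optional parameters can be illustrated with a small Python sketch. This is not LityxIQ's implementation: the name fuzzy_sketch is hypothetical, Python's difflib.SequenceMatcher stands in for the similarity measure, and the sliding-window approach to partial matching is an assumption about one common way such a comparison can be done.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in similarity on a 0-100 scale (difflib, not Levenshtein)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_sketch(str1, str2, partial=0, case_insensitive=1):
    """Hypothetical analogue of fuzzy(str1, str2, partial, case_insensitive)."""
    if str1 is None or str2 is None:
        return 0  # a null input yields a score of 0
    if case_insensitive:
        str1, str2 = str1.lower(), str2.lower()
    if not partial:
        return ratio(str1, str2)  # compare the full strings
    # ASSUMED partial strategy: slide the shorter string across the longer
    # one and keep the best score over all same-length windows.
    shorter, longer = sorted((str1, str2), key=len)
    window = len(shorter)
    return max(
        ratio(shorter, longer[i:i + window])
        for i in range(len(longer) - window + 1)
    )


print(fuzzy_sketch("New York", "New York Yankees"))             # full comparison
print(fuzzy_sketch("New York", "New York Yankees", partial=1))  # partial comparison
print(fuzzy_sketch("ABC", "abc", case_insensitive=0))           # case-sensitive
```

In this sketch the partial comparison of "New York" against "New York Yankees" scores 100 because the shorter string appears in full inside the longer one, while the full comparison scores lower.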

The fuzzy_tokens function

The fuzzy_tokens function differs from the fuzzy function in that the strings being compared are first re-organized before the fuzzy score is computed. The re-organization is based on the words, or tokens, that make up each string. For example, the strings "New York Yankees" and "Yankees New York" clearly represent a good match, but simply aligning the characters from beginning to end would not recognize that. fuzzy_tokens evaluates the match token by token and recognizes that this is actually a perfect match (score=100).

Usage: fuzzy_tokens(str1, str2, token_type, partial) returns a score from 0 to 100 representing how close str1 and str2 are, but first re-orders the input strings based on "tokens", which are essentially the words in each string.

This function always does case-insensitive matching.
As with the fuzzy function, the first two parameters str1 and str2 are the two strings to be compared. Note that one or both can be variables in a dataset or fixed comparison strings.
The third parameter token_type is optional, and can be either 'set' or 'sort'. The default is 'set'. 'set' means that the order of the tokens/words is not important. 'sort' will sort the tokens/words in each string in alphabetical order before doing a comparison.
The fourth parameter "partial" is optional, and works as described above for the fuzzy function. By default it is 0, but it can be set to 1 to have the score computed based on partial matching.
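
The token re-ordering can be sketched in Python as follows. The name fuzzy_tokens_sketch is hypothetical and difflib.SequenceMatcher again stands in for the similarity measure. The 'set' branch here simply drops duplicate words before comparing, which is only one possible reading of "order is not important"; LityxIQ may treat token sets differently. The partial parameter is left out to keep the sketch short.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Stand-in similarity on a 0-100 scale (difflib, not Levenshtein)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_tokens_sketch(str1, str2, token_type="set"):
    """Hypothetical analogue of fuzzy_tokens(str1, str2, token_type)."""
    if str1 is None or str2 is None:
        return 0  # a null input yields a score of 0
    # Matching is always case-insensitive; tokens are whitespace-separated words.
    tokens1 = str1.lower().split()
    tokens2 = str2.lower().split()
    if token_type == "set":
        # ASSUMPTION: treat each string as a set of unique words.
        tokens1, tokens2 = set(tokens1), set(tokens2)
    elif token_type != "sort":
        raise ValueError("token_type must be 'set' or 'sort'")
    # Put the tokens in alphabetical order so word order no longer matters.
    return ratio(" ".join(sorted(tokens1)), " ".join(sorted(tokens2)))


print(fuzzy_tokens_sketch("New York Yankees", "Yankees New York"))
print(fuzzy_tokens_sketch("New York New York", "York New", "set"))
print(fuzzy_tokens_sketch("New York New York", "York New", "sort"))
```

In this sketch both token types give 100 for "New York Yankees" versus "Yankees New York". The two differ only when words repeat: 'set' ignores the repetition, while 'sort' keeps every word and so scores the repeated string lower.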

Note that for both functions, if either str1 or str2 is null, the result is 0.
