Overview of Fuzzy Matching
The concept of Fuzzy Matching when working with data is the idea that you may want to consider two records a "match" even if they do not match perfectly. Mis-spellings in user-entered data are a common example. For example, two records containing the name "George Washington" and "George Washnigton" likely represent a match, but an approach that only looks for exact matches would miss it, and Regular Expressions, while quite powerful, would not easily identify all possible ways these records could be mis-matched.
Fuzzy matching computes a score from 0 to 100 that represents how similar two values are. A score of 0 means "not similar at all" and a score of 100 represents a perfect match. There are a variety of ways to compute such a score using concepts related to distance metrics ("how far away is one string from another string"). A very common such method is referred to as Levenshtein Distance (see https://en.wikipedia.org/wiki/Levenshtein_distance for more technical information). LityxIQ uses an approach based on this method for computing fuzzy matching scores.
There are two functions in LityxIQ that support fuzzy matching computations: fuzzy, and fuzzy_tokens. They are described in detail below:
The fuzzy function
The fuzzy function is used to directly compare two strings to create a fuzzy match score.
Usage: fuzzy(str1, str2, partial, case_insensitive) gives a score 0 to 100 representing how close str1 and str2 are.
- The first two parameters str1 and str2 are the two strings to be compared. Note that one or both can be variables in a dataset or fixed comparison strings.
- The third parameter "partial" is optional, and is 0 by default, meaning that it will compare the full strings, not partial strings. Setting it to 1 does a partial-string comparison. In this case, the score that comes back is based on the similarity of a partial match, not necessarily requiring a full match. For example, if comparing to the string "New York Yankees", the string "New York" is a strong partial match, but not as strong a full match.
- The fourth parameter is optional, and is 1 by default, meaning that all calculations are made case-insensitive. To be case sensitive (“a” is different from “A”), set it to 0.
The fuzzy_tokens function
The fuzzy_tokens function differs from the fuzzy function in that the strings being compared are first re-organized before making the fuzzy scoring computations. The "re-organization" of the strings is based on the words, or tokens, that make it up. For example, the strings "New York Yankees" and "Yankees New York" clearly represent a good match, but just aligning the characters from beginning to end would not recognize that. fuzzy_tokens will evaluate the matching piece by piece and recognize that this is actually a perfect match (score=100).
Usage: fuzzy_tokens(str1, str2, token_type, partial) gives a score 0 to 100 representing how close str1 and str2 are, but re-orders the input strings based on “tokens” which are essentially words in the string
- This function always does case insensitive matching.
- As with the fuzzy function, the first two parameters str1 and str2 are the two strings to be compared. Note that one or both can be variables in a dataset or fixed comparison strings.
- The third parameter token_type is optional, and can be either 'set' or 'sort'. The default is 'set'. 'set' means that the order of the tokens/words is not important. 'sort' will sort the tokens/words in each string in alphabetical order before doing a comparison.
- The fourth parameter “partial” is optional, and is similar to the description above for the fuzzy function. By default it is 0, but can be set to 1 to have the score computed based on partial matching.
Note that for both functions, if one or both of str1 or str2 is null, the result will come back 0.