Fuzzy Matching Functions

Overview of Fuzzy Matching

The concept of Fuzzy Matching when working with data is the idea that you may want to consider two records a "match" even if they do not match perfectly.  Mis-spellings in user-entered data are a common example.  For example, two records containing the name "George Washington" and "George Washnigton" likely represent a match, but an approach that only looks for exact matches would miss it, and Regular Expressions, while quite powerful, would not easily identify all possible ways these records could be mis-matched.

Fuzzy matching computes a score from 0 to 100 that represents how similar two values are.  A score of 0 means "not similar at all" and a score of 100 represents a perfect match.  There are a variety of ways to compute such a score using concepts related to distance metrics ("how far away is one string from another string").  A very common such method is referred to as Levenshtein Distance (see https://en.wikipedia.org/wiki/Levenshtein_distance for more technical information).  LityxIQ uses an approach based on this method for computing fuzzy matching scores.

There are two functions in LityxIQ that support fuzzy matching computations: fuzzy, and fuzzy_tokens.  They are described in detail below:

The fuzzy function

The fuzzy function is used to directly compare two strings to create a fuzzy match score.

Usage: fuzzy(str1, str2, partial, case_insensitive) gives a score 0 to 100 representing how close str1 and str2 are.

  • The first two parameters str1 and str2 are the two strings to be compared.  Note that one or both can be variables in a dataset or fixed comparison strings.
  • The third parameter "partial" is optional, and is 0 by default, meaning that it will compare the full strings, not partial strings.  Setting it to 1 does a partial-string comparison.  In this case, the score that comes back is based on the similarity of a partial match, not necessarily requiring a full match.  For example, if comparing to the string "New York Yankees", the string "New York" is a strong partial match, but not as strong a full match.
  • The fourth parameter is optional, and is 1 by default, meaning that all calculations are made case-insensitive.  To be case sensitive (“a” is different from “A”), set it to 0.

The fuzzy_tokens function

The fuzzy_tokens function differs from the fuzzy function in that the strings being compared are first re-organized before making the fuzzy scoring computations.  The "re-organization" of the strings is based on the words, or tokens, that make it up.  For example, the strings "New York Yankees" and "Yankees New York" clearly represent a good match, but just aligning the characters from beginning to end would not recognize that.  fuzzy_tokens will evaluate the matching piece by piece and recognize that this is actually a perfect match (score=100).

Usage: fuzzy_tokens(str1, str2, token_type, partial) gives a score 0 to 100 representing how close str1 and str2 are, but re-orders the input strings based on “tokens” which are essentially words in the string

  • This function always does case insensitive matching.
  • As with the fuzzy function, the first two parameters str1 and str2 are the two strings to be compared.  Note that one or both can be variables in a dataset or fixed comparison strings.
  • The third parameter token_type is optional, and can be either 'set' or 'sort'.  The default is 'set'.  'set' means that the order of the tokens/words is not important.  'sort' will sort the tokens/words in each string in alphabetical order before doing a comparison.
  • The fourth parameter “partial” is optional, and is similar to the description above for the fuzzy function.  By default it is 0, but can be set to 1 to have the score computed based on partial matching.

Note that for both functions, if one or both of str1 or str2 is null, the result will come back 0.