Fuzzy join in R: What a fuzzy join is and how to perform it


In this article, we show how to perform a fuzzy join in R. A fuzzy join lets you join two tables whose key values do not match exactly. Read on to learn what a fuzzy join is and how to perform one, with the explanation and examples below.

What does fuzzy join do in R?

Fuzzy joins in R are provided by the ‘fuzzyjoin’ package, which contains several functions for this purpose. A fuzzy join lets you join two tables whose key values do not match exactly. The function most commonly used for this is stringdist_join(), which matches rows by the string distance between their key values. Its syntax is shown below.

Syntax:

fuzzyjoin::stringdist_join(x, y, by = NULL, max_dist = 2, method, mode = "inner", ignore_case = FALSE, distance_col = NULL, ...)

Parameters:

  • x: The first table.
  • y: The second table.
  • by: The column(s) used to join the two tables. The default is NULL, in which case the columns common to both tables are used.
  • max_dist: The maximum string distance at which two values still count as a match. The default is 2.
  • method: The method for computing the string distance, for example "osa", "lv" or "jw". The default is "osa".
  • mode: The type of join to perform. One of "inner", "left", "right", "full", "semi" or "anti". The default is "inner".
  • ignore_case: Whether to ignore case when matching. The default is FALSE.
  • distance_col: The name of a column to add that holds the distance between the two matched values. The default is NULL, which adds no column.
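
To make the parameters above more concrete, here is a minimal sketch of how max_dist, ignore_case and distance_col interact. The two tables (cities and lookup) and their values are hypothetical and exist only for illustration. The sketch uses the Levenshtein method ("lv") because it simply counts single-character edits.

```r
library(fuzzyjoin)

# Hypothetical tables: the same cities with case and spelling differences.
cities <- data.frame(city = c("London", "Paris", "Berlin"), id = 1:3)
lookup <- data.frame(city = c("london", "Pariss", "Madrid"), value = c(1, 2, 3))

# Case-sensitive join allowing at most one edit.
# "London" vs "london" and "Paris" vs "Pariss" are each one edit apart.
one_edit <- stringdist_join(
  cities, lookup,
  by = "city", method = "lv",
  max_dist = 1, distance_col = "dist"
)

# Case-insensitive join allowing no edits at all.
# Only "London" vs "london" matches, with a distance of 0.
same_ignoring_case <- stringdist_join(
  cities, lookup,
  by = "city", method = "lv",
  max_dist = 0, ignore_case = TRUE, distance_col = "dist"
)

print(one_edit)
print(same_ignoring_case)
```

Because both tables have a column called city, the result contains city.x and city.y, followed by the dist column requested through distance_col.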

Use the fuzzy join with the stringdist_join() function

You can use the stringdist_join() function to join two data frames whose key values do not match exactly.

But first, you have to install the ‘fuzzyjoin’ package to use this function.

Install the ‘fuzzyjoin’ package to work with fuzzy join

You can install the ‘fuzzyjoin’ package with the following command.

install.packages("fuzzyjoin")

Join two tables with the stringdist_join() function

You can use the stringdist_join() function to fuzzy join two tables whose key values do not match exactly.

Look at the example below.

# Create the first table.
df1 <- data.frame(
  name = c('Toi Pham', 'Hoe Doan', 'Ronaldo', 'Maguire', 'Fred'),
  age = c(20, 30, 37, 30, 27)
)

# Create the second table.
df2 <- data.frame(
  name = c('Toi Pham', 'Hoe Doan', 'Ronaldo'),
  salary = c(10, 20, 2000)
)

library(fuzzyjoin)

# Use the stringdist_join function to fuzzy join df1 and df2.
# Jaro-Winkler ('jw') distances range from 0 to 1, so max_dist = 99
# accepts every pair of rows and the result lists all 15 combinations.
fuzzyjoin::stringdist_join(df1, df2, by = 'name', mode = 'inner', method = 'jw',
                           max_dist = 99, distance_col = 'dist')

Output

     name.x age   name.y salary      dist
1  Toi Pham  20 Toi Pham     10 0.0000000
2  Toi Pham  20 Hoe Doan     20 0.4166667
3  Toi Pham  20  Ronaldo   2000 0.4880952
4  Hoe Doan  30 Toi Pham     10 0.4166667
5  Hoe Doan  30 Hoe Doan     20 0.0000000
6  Hoe Doan  30  Ronaldo   2000 0.5099206
7   Ronaldo  37 Toi Pham     10 0.4880952
8   Ronaldo  37 Hoe Doan     20 0.5099206
9   Ronaldo  37  Ronaldo   2000 0.0000000
10  Maguire  30 Toi Pham     10 0.5773810
11  Maguire  30 Hoe Doan     20 1.0000000
12  Maguire  30  Ronaldo   2000 0.5714286
13     Fred  27 Toi Pham     10 1.0000000
14     Fred  27 Hoe Doan     20 0.5416667
15     Fred  27  Ronaldo   2000 0.5357143
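
The example above sets max_dist = 99, which is larger than any distance the 'jw' method can produce, so every pair of rows is returned. In practice you would usually choose a smaller threshold so that only near matches are joined. The sketch below is one way to do that, not the only one. It reuses the first table from the example, while the second table (df2_typos) is a hypothetical variant with deliberate typos, and the choice of the "lv" method with max_dist = 2 is an assumption made for illustration.

```r
library(fuzzyjoin)

# Same first table as in the example above.
df1 <- data.frame(
  name = c("Toi Pham", "Hoe Doan", "Ronaldo", "Maguire", "Fred"),
  age = c(20, 30, 37, 30, 27)
)

# Hypothetical second table with deliberate typos in two of the names.
df2_typos <- data.frame(
  name = c("Toi Phan", "Hoe Doann", "Ronaldo"),
  salary = c(10, 20, 2000)
)

# Levenshtein distance ("lv") counts single-character edits.
# max_dist = 2 accepts names that differ by at most two edits.
# mode = "left" keeps every row of df1 and fills unmatched rows with NA.
result <- stringdist_join(
  df1, df2_typos,
  by = "name",
  method = "lv",
  max_dist = 2,
  mode = "left",
  distance_col = "dist"
)

print(result)
```

Here the three names that have a close counterpart are matched despite the typos, while Maguire and Fred have no counterpart within two edits, so their name.y, salary and dist values are NA.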

Summary

You have learned what a fuzzy join in R is and how to perform one with a function from the ‘fuzzyjoin’ package. A fuzzy join joins two tables whose key values do not match exactly. You can fuzzy join two tables with the stringdist_join() function. We hope this tutorial is helpful to you. Thanks!
