In this article, we show how to perform a fuzzy join in R. A fuzzy join lets you join two tables whose key values do not match exactly. The sections below explain what a fuzzy join is and how to perform one, with a worked example.
What does a fuzzy join do in R?
Fuzzy joins in R are provided by the ‘fuzzyjoin’ package, which contains several functions for joining tables on inexact matches. A fuzzy join pairs rows from two tables whose key values are similar rather than identical, for example names that differ by a typo. The function most often used for this is stringdist_join(), which matches rows by string distance. Its syntax is shown below.
fuzzyjoin::stringdist_join(x, y, by = NULL, max_dist = 2, method, mode = "inner", ignore_case = FALSE, distance_col = NULL, ...)
- x: The first table.
- y: The second table.
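- by: The column or columns used to join the two tables. The default is NULL, in which case the columns the two tables have in common are used.
- max_dist: The maximum string distance at which two values still count as a match. The default is 2. The scale depends on the method: edit distances count character changes, while "jw" ranges from 0 to 1.
- method: The method for computing the string distance, for example "osa" (the default), "lv", or "jw".
- mode: The type of join. One of "inner", "left", "right", "full", "semi", or "anti". The default is "inner".
- ignore_case: Whether to ignore case when comparing strings. The default is FALSE.
- distance_col: The name of a column to add to the result that holds the computed distance for each matched pair. The default is NULL, which adds no column.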
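To make the max_dist parameter more concrete, here is a minimal sketch of the idea behind a string-distance join. It uses base R's adist(), which computes Levenshtein edit distances, as a stand-in that is comparable to method = "lv". It is not how fuzzyjoin is implemented internally (that package relies on the stringdist package), and the example strings are hypothetical.

```r
# Hypothetical example strings, not taken from the article.
left  <- c("apple", "banana", "cherry")
right <- c("aple", "bananna", "grape")

# adist() returns a matrix of edit distances:
# rows correspond to `left`, columns to `right`.
d <- adist(left, right)
dimnames(d) <- list(left, right)
print(d)

# Keep only the pairs whose distance is within the threshold,
# which is the role max_dist plays in stringdist_join().
max_dist <- 2
hits <- which(d <= max_dist, arr.ind = TRUE)
matches <- data.frame(
  x    = left[hits[, "row"]],
  y    = right[hits[, "col"]],
  dist = d[hits]
)
print(matches)
```

With this threshold, only "apple"/"aple" and "banana"/"bananna" are kept, each at distance 1. Raising max_dist admits looser matches, and lowering it to 0 reduces the result to exact matches only.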
Use the fuzzy join with the stringdist_join() function
You can use the stringdist_join() function to join two data frames whose key values do not match exactly.
Before you can use this function, you have to install the ‘fuzzyjoin’ package.
Install the ‘fuzzyjoin’ package to work with fuzzy join
You can install the ‘fuzzyjoin’ package from CRAN with the following command.
install.packages("fuzzyjoin")
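Example: fuzzy join two data frames with stringdist_join()
The stringdist_join() function fuzzy joins two tables whose key values do not match exactly.
Look at the example below. It uses the Jaro-Winkler method (method = 'jw') and a very large max_dist of 99, so every pair of rows is kept and the dist column shows how far apart each pair of names is.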
# Create the first table.
df1 <- data.frame(
  name = c('Toi Pham', 'Hoe Doan', 'Ronaldo', 'Maguire', 'Fred'),
  age = c(20, 30, 37, 30, 27)
)

# Create the second table.
df2 <- data.frame(
  name = c('Toi Pham', 'Hoe Doan', 'Ronaldo'),
  salary = c(10, 20, 2000)
)

library(fuzzyjoin)

# Use the stringdist_join function to fuzzy join df1 and df2.
fuzzyjoin::stringdist_join(df1, df2, by = 'name', mode = 'inner', method = 'jw', max_dist = 99, distance_col = 'dist')
     name.x age   name.y salary      dist
1  Toi Pham  20 Toi Pham     10 0.0000000
2  Toi Pham  20 Hoe Doan     20 0.4166667
3  Toi Pham  20  Ronaldo   2000 0.4880952
4  Hoe Doan  30 Toi Pham     10 0.4166667
5  Hoe Doan  30 Hoe Doan     20 0.0000000
6  Hoe Doan  30  Ronaldo   2000 0.5099206
7   Ronaldo  37 Toi Pham     10 0.4880952
8   Ronaldo  37 Hoe Doan     20 0.5099206
9   Ronaldo  37  Ronaldo   2000 0.0000000
10  Maguire  30 Toi Pham     10 0.5773810
11  Maguire  30 Hoe Doan     20 1.0000000
12  Maguire  30  Ronaldo   2000 0.5714286
13     Fred  27 Toi Pham     10 1.0000000
14     Fred  27 Hoe Doan     20 0.5416667
15     Fred  27  Ronaldo   2000 0.5357143
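Because max_dist is 99 in the example above, every pair of rows is returned. In practice you would usually choose a tighter threshold so that only near-matches are joined. The sketch below is one possible variation, not part of the original example: the misspelled names in df2 and the threshold of 0.2 are assumptions picked for illustration, and a suitable threshold depends on your data.

```r
library(fuzzyjoin)

# First table, as in the example above.
df1 <- data.frame(
  name = c("Toi Pham", "Hoe Doan", "Ronaldo", "Maguire", "Fred"),
  age = c(20, 30, 37, 30, 27)
)

# Hypothetical second table: two names contain typos.
df2 <- data.frame(
  name = c("Toi Phan", "Hoe Doan", "Ronaldoo"),
  salary = c(10, 20, 2000)
)

# Left join: keep every row of df1 and attach the rows of df2
# whose name is within a distance of 0.2 (assumed threshold).
res <- stringdist_join(
  df1, df2,
  by = "name",
  mode = "left",
  method = "jw",
  max_dist = 0.2,
  distance_col = "dist"
)
print(res)
```

Here the misspelled names are still matched to their counterparts in df1, while 'Maguire' and 'Fred', which have no close match in df2, are kept with NA in the name.y, salary, and dist columns because the join mode is "left".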
You have learned what a fuzzy join in R is and how to perform one with the stringdist_join() function from the ‘fuzzyjoin’ package. A fuzzy join combines two tables whose key values do not match exactly, and max_dist controls how close two values must be to count as a match. We hope this tutorial is helpful to you. Thanks!