How To Perform Left Anti Join In PySpark

Left anti join is one of the most useful join types in PySpark. It lets you find the rows in one DataFrame that have no match in another, making it a handy tool for extracting insights from your data. (Spark has no dedicated right anti join, but you can achieve the same effect by swapping the two DataFrames, as shown later in this tutorial.)

This tutorial will explain how this join type works and how you can perform it with the join() method.

Left Anti Join In PySpark

As in SQL, one of the join types available in Spark is the left anti join. It returns only the rows from the left DataFrame that don’t have any matching rows in the right DataFrame.

To carry out this join type, you can use the join() method on the first DataFrame. Its syntax is as follows:

join(other, on, how)

In which:

  • ‘other’ is the DataFrame you want to join to the right side of the current one.
  • The optional ‘on’ argument determines the join column(s), which must exist in both DataFrames. It can be a string, a list of column names, or a Column expression.
  • ‘how’ is an optional string argument that controls the join type. By default, join() performs an inner join. You can switch to a left anti join by setting ‘how’ to ‘leftanti’ (‘left_anti’ and ‘anti’ are accepted aliases).

To demonstrate this join type in PySpark, let’s create two DataFrames containing information about some employees, including their names, positions, and ages.

import findspark
findspark.init()

# Import pyspark in Python script
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("learnshareit.com").getOrCreate()

# Create two DataFrames containing information about some employees
df1 = spark.createDataFrame(
    [(1, "John", "Manager"), (2, "Charlotte", "Developer"), (3, "Oliver", "Tester")],
    ["id", "name", "title"]
)

df2 = spark.createDataFrame([(1, "45"), (2, "25"), (4, "34")], ["id", "age"])

As you can see, these two DataFrames share the ‘id’ column and have the same number of rows (3). However, only the first two IDs of the first DataFrame (1 and 2) also appear in the second one.

>>> df1.show()
+---+---------+---------+
| id|     name|    title|
+---+---------+---------+
|  1|     John|  Manager|
|  2|Charlotte|Developer|
|  3|   Oliver|   Tester|
+---+---------+---------+

>>> df2.show()
+---+---+
| id|age|
+---+---+
|  1| 45|
|  2| 25|
|  4| 34|
+---+---+

When doing a left anti join on the column ‘id’, PySpark will keep only the 3rd row of the first DataFrame. Why? It is easy to see if you recall the definition of the left anti join.

The IDs of the first two rows of the first DataFrame also exist in the second DataFrame, so the join() method removes those rows from the result. On the other hand, there is no ID 3 in the second DataFrame, which is why join() keeps that row.
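If the filtering logic is not fully clear, here is a plain-Python sketch of what a left anti join keeps (an illustration of the semantics only, not how Spark actually executes the join):

```python
# Rows mirroring df1 and df2 above, as (id, ...) tuples
left = [(1, "John", "Manager"), (2, "Charlotte", "Developer"), (3, "Oliver", "Tester")]
right = [(1, "45"), (2, "25"), (4, "34")]

# A left anti join keeps only the left rows whose key has no match on the right
right_ids = {row[0] for row in right}
anti = [row for row in left if row[0] not in right_ids]

print(anti)  # [(3, 'Oliver', 'Tester')]
```

Note that columns from the right side never appear in the result: a left anti join only filters the left DataFrame.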

This is how you can perform a left anti join on the column ‘id’ with join():

>>> df3 = df1.join(df2, on='id', how='leftanti')

>>> df3.show()
+---+------+------+
| id|  name| title|
+---+------+------+
|  3|Oliver|Tester|
+---+------+------+

You can use a left anti join when you want to find the rows in one DataFrame that do not have a match in another DataFrame based on a common key. This is useful for identifying missing or incorrect data, or for comparing the contents of two DataFrames.

For example, you might use the left anti join to compare the contents of two tables in a database or to identify customers in one table that are not present in another table.

Remember that the order of DataFrames is important when you perform any joining actions.

If you invoke the join() method on the second DataFrame instead, the result will be different:

>>> df3 = df2.join(df1, on='id', how='leftanti')

>>> df3.show()
+---+---+
| id|age|
+---+---+
|  4| 34|
+---+---+

The left anti join now looks for rows in df2 that don’t have a match in df1 instead.
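In plain Python, the same asymmetry is easy to see (again, a sketch of the semantics only; the ages stay as strings to mirror df2 above):

```python
# df2's rows are now on the left; the right side contributes df1's ids
left = [(1, "45"), (2, "25"), (4, "34")]
right_ids = {1, 2, 3}  # the ids present in df1

# Only ID 4 has no match in df1, so only that row survives
anti = [row for row in left if row[0] not in right_ids]

print(anti)  # [(4, '34')]
```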

Summary

The left anti join in PySpark is useful when you want to compare data between DataFrames and find missing entries. PySpark provides this join type in the join() method, but you must explicitly specify the ‘how’ argument in order to use it.
