How To Find The Difference Between Two Data Frames In Pandas

You can use the compare() method to find the differences between two DataFrames in Pandas. Keep reading to learn how to use it in your applications.

Find the Difference between two Data Frames

pandas.DataFrame.compare()

The compare() method compares the DataFrame from which it is called with another DataFrame, showing the differences between them. You will need Pandas version 1.1.0 or above to use this method.

Syntax:

DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))

The only required parameter here is other, which is the DataFrame you want to compare to. These two DataFrames need to be identically labeled, meaning they should have the same shape and identical column and row labels. The compare() method will raise a ValueError exception when this isn’t the case.
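As a minimal sketch of that failure case, the snippet below compares two small DataFrames whose row labels do not match. The names shorter and longer and the "visits" column are made up for illustration. The wording of the error message differs between Pandas versions, so the code only prints whatever message it receives.

```python
import pandas as pd

# Two hypothetical DataFrames with different row labels (2 rows vs. 3 rows)
shorter = pd.DataFrame({"visits": [10, 20]})
longer = pd.DataFrame({"visits": [10, 20, 30]})

try:
    shorter.compare(longer)
    raised = False
except ValueError as exc:
    # compare() refuses DataFrames that are not identically labeled
    raised = True
    print("compare() raised ValueError:", exc)
```

If the two DataFrames hold the same rows under different labels, you would need to align them first (for example with reset_index() or reindex()) before calling compare().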

The returned object is a DataFrame showing the differences between the two DataFrames side by side. If they are identical, compare() returns an empty DataFrame.

You can also change the behavior of this method by using the parameters below:

  • align_axis: this parameter determines the axis along which compare() aligns the differences between the two DataFrames. The default value is 1 (or “columns”), which places the values from both DataFrames side by side in adjacent columns. Set it to 0 (or “index”) to stack them in alternating rows instead.
  • keep_shape: this boolean parameter determines whether compare() should keep all columns and rows. The default value is False, meaning it will keep only the ones containing different values.
  • keep_equal: this boolean parameter determines whether compare() should keep equal values. The default value is False, meaning it will show equal values as NaNs.
  • result_names: you can use this parameter to set the names for the DataFrames used in the results. It should be a tuple of two names. The default value is (‘self’, ‘other’). This parameter requires Pandas version 1.5.0 or above.
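The parameters above can be sketched on a small pair of DataFrames. The names left and right, their columns, and the labels "before" and "after" are assumptions chosen for illustration, and the result_names call assumes Pandas 1.5.0 or above.

```python
import pandas as pd

# Hypothetical data: the two DataFrames differ at (row 1, "a") and (row 2, "b")
left = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
right = pd.DataFrame({"a": [1, 5, 3], "b": [10, 20, 35]})

# align_axis=0: "self" and "other" become alternating rows
by_rows = left.compare(right, align_axis=0)
print(by_rows)

# keep_shape=True: every row and column is kept, even where nothing changed
full = left.compare(right, keep_shape=True)
print(full)

# result_names: replace the default "self"/"other" labels (Pandas 1.5.0+)
named = left.compare(right, result_names=("before", "after"))
print(named)
```

With align_axis=0 the row index becomes a MultiIndex instead of the columns, which can be easier to read when many columns differ.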

Examples

We will use a DataFrame containing the numbers of visitors to some sites as an example:

import pandas as pd

# Monthly visitor counts for three sites
df1 = pd.DataFrame(
    {
        "Month": ["May", "June", "July", "August"],
        "LearnShareIT": [23424, 34585, 36446, 47575],
        "Quora": [8843555, 8598598, 9814040, 8320944],
        "Stack Overflow": [6038055, 7840955, 7949055, 8390949],
    },
    columns=["Month", "LearnShareIT", "Quora", "Stack Overflow"],
)
print(df1)

Output:

    Month  LearnShareIT    Quora  Stack Overflow
0     May         23424  8843555         6038055
1    June         34585  8598598         7840955
2    July         36446  9814040         7949055
3  August         47575  8320944         8390949

To illustrate how the compare() method works, we need another DataFrame with the same shape and column/row labels. The quickest way to do this is to copy it:

df2 = df1.copy()

The DataFrames df1 and df2 should be identical since the copy() method makes a copy of both indices and data of the original DataFrame. We can verify this with compare():

print(df1.compare(df2))

Output:

Empty DataFrame
Columns: []
Index: []

After comparing df1 with df2, compare() determines that they are identical and returns an empty DataFrame.
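In a script you usually want to test for this case rather than read the printed output. One way to do that, sketched here on a trimmed-down version of df1, is to check the empty attribute of the result. The variable names diff and identical are only illustrative.

```python
import pandas as pd

# Trimmed-down version of the example data
df1 = pd.DataFrame({
    "Month": ["May", "June"],
    "LearnShareIT": [23424, 34585],
})
df2 = df1.copy()

diff = df1.compare(df2)

# An empty result means compare() found no differing cells
identical = diff.empty
print("Identical:", identical)
```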

To see how it behaves when there are differences between the DataFrames, we can modify df2 a little bit before comparing it to df1 again:

df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203
print(df1.compare(df2))

Output:

  LearnShareIT               Quora           
          self    other       self      other
2      36446.0  38649.0        NaN        NaN
3          NaN      NaN  8320944.0  8847203.0

The result is no longer an empty DataFrame. Instead, its columns form a MultiIndex: each original column that contains a difference is split into two sub-columns labeled “self” and “other”. This is how compare() displays the differences it detects during the comparison.

In this example, we changed one value in the column “LearnShareIT” and one in “Quora”. That is why only these two column labels appear in the top level of the MultiIndex. Nested under each of them are the “self” and “other” columns, placed side by side to show the differing values, where “self” is the first DataFrame (df1) and “other” is the second (df2).
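Because the result has two column levels, you can pull out just the “self” or just the “other” values with xs(). The sketch below rebuilds a two-column version of the example data; the names old_values and new_values are assumptions made for illustration.

```python
import pandas as pd

# Two-column version of the example data
df1 = pd.DataFrame({
    "LearnShareIT": [23424, 34585, 36446, 47575],
    "Quora": [8843555, 8598598, 9814040, 8320944],
})
df2 = df1.copy()
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203

diff = df1.compare(df2)

# Select one inner level of the column MultiIndex at a time
old_values = diff.xs("self", axis=1, level=1)    # values from df1
new_values = diff.xs("other", axis=1, level=1)   # values from df2
print(old_values)
print(new_values)
```

Each result is an ordinary DataFrame with single-level columns, which is convenient if you only need the new values.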

By default, compare() only shows different values and displays NaN on equal entries. This helps us quickly identify where the differences are located in the DataFrames.
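The same information can be read programmatically. The following sketch lists the affected rows, the affected columns, and the individual (row, column) positions. It assumes that none of the changed cells in the second DataFrame is itself NaN, since such a cell would be indistinguishable from an equal entry here. The variable names are illustrative only.

```python
import pandas as pd

# Two-column version of the example data
df1 = pd.DataFrame({
    "LearnShareIT": [23424, 34585, 36446, 47575],
    "Quora": [8843555, 8598598, 9814040, 8320944],
})
df2 = df1.copy()
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203

diff = df1.compare(df2)

# Rows and columns that contain at least one difference
changed_rows = diff.index.tolist()
changed_columns = diff.columns.get_level_values(0).unique().tolist()

# Individual cells: non-NaN entries in the "other" sub-columns
locations = [
    (row, col)
    for col in changed_columns
    for row in diff[(col, "other")].dropna().index
]
print(changed_rows, changed_columns, locations)
```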

If you want to show the identical values in the result as well, set keep_equal to True:

print(df1.compare(df2, keep_equal=True))

Output:

  LearnShareIT           Quora         
          self  other     self    other
2        36446  38649  9814040  9814040
3        47575  47575  8320944  8847203
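keep_equal can also be combined with keep_shape to place both DataFrames side by side in full, including the rows and columns that did not change. This is a sketch on a three-column version of the example data; the name side_by_side is an assumption for illustration.

```python
import pandas as pd

# Three-column version of the example data
df1 = pd.DataFrame({
    "Month": ["May", "June", "July", "August"],
    "LearnShareIT": [23424, 34585, 36446, 47575],
    "Quora": [8843555, 8598598, 9814040, 8320944],
})
df2 = df1.copy()
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203

# Keep every row and column, and show real values instead of NaN
side_by_side = df1.compare(df2, keep_shape=True, keep_equal=True)
print(side_by_side)
```

The result has one “self” and one “other” sub-column for every original column, so it is twice as wide as the input.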

Summary

The compare() method finds the differences between two DataFrames in Pandas. It returns them in a new DataFrame, placing the differing values side by side for easier inspection.
