You can use the compare() method to find the differences between two DataFrames in pandas. Keep reading to learn how to use it in your applications.
Find the Difference Between Two DataFrames
pandas.DataFrame.compare()
The compare() method compares the DataFrame from which it is called with another DataFrame, showing the differences between them. You will need Pandas version 1.1.0 or above to use this method.
Syntax:
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))
The only required parameter here is other, which is the DataFrame you want to compare to. These two DataFrames need to be identically labeled, meaning they should have the same shape and identical column and row labels. The compare() method will raise a ValueError exception when this isn’t the case.
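As a minimal sketch of that failure case, the snippet below compares two small DataFrames whose row labels do not match. The names left and right are made up for this illustration, and the exact wording of the error message may vary between pandas versions.

```python
import pandas as pd

# Two hypothetical frames with the same shape but different row labels
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 5])

try:
    left.compare(right)
    raised = False
except ValueError as exc:
    # compare() refuses to compare frames that are not identically labeled
    raised = True
    print(f"compare() failed: {exc}")
```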
The returned object is a DataFrame showing the differences between the two DataFrames side by side. If they are identical, compare() returns an empty DataFrame.
You can also change the behavior of this method by using the parameters below:
- align_axis: this parameter determines the axis along which compare() aligns the differences between the two DataFrames. The default value is 1, which equals “columns” and places the values side by side. Set it to 0 or “index” to stack the differences vertically instead, with rows from the two DataFrames alternating.
- keep_shape: this boolean parameter determines whether compare() should keep all columns and rows. The default value is False, meaning it will keep only the ones containing different values.
- keep_equal: this boolean parameter determines whether compare() should keep equal values. The default value is False, meaning it will show equal values as NaNs.
- result_names: you can use this parameter to set the names for the DataFrames used in the results. It should be a tuple. The default value is (‘self’, ‘other’). This parameter was added in pandas 1.5.0, so it is not available in earlier versions.
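The parameters above can be tried out on a small pair of DataFrames. This is only a sketch: the frames a and b and the labels "before" and "after" are made up for illustration, and the result_names call assumes pandas 1.5.0 or later.

```python
import pandas as pd

# Hypothetical frames that differ in a single cell (row 1, column "x")
a = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
b = pd.DataFrame({"x": [1, 5, 3], "y": [10, 20, 30]})

# Default behavior: only the differing row and column, side by side
default_diff = a.compare(b)
print(default_diff)

# align_axis=0: the values from the two frames are stacked on alternating rows
stacked = a.compare(b, align_axis=0)
print(stacked)

# keep_shape=True: every row and column is kept, equal cells show NaN
full = a.compare(b, keep_shape=True)
print(full)

# result_names: custom labels instead of "self" and "other" (pandas >= 1.5.0)
named = a.compare(b, result_names=("before", "after"))
print(named)
```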
Examples
We will use a DataFrame containing the numbers of visitors to some sites as an example:
import pandas as pd
df1 = pd.DataFrame(
    {
        "Month": ["May", "June", "July", "August"],
        "LearnShareIT": [23424, 34585, 36446, 47575],
        "Quora": [8843555, 8598598, 9814040, 8320944],
        "Stack Overflow": [6038055, 7840955, 7949055, 8390949],
    },
    columns=["Month", "LearnShareIT", "Quora", "Stack Overflow"],
)
print(df1)
Output:
    Month  LearnShareIT    Quora  Stack Overflow
0     May         23424  8843555         6038055
1    June         34585  8598598         7840955
2    July         36446  9814040         7949055
3  August         47575  8320944         8390949
To illustrate how the compare() method works, we need another DataFrame with the same shape and the same column and row labels. The quickest way to get one is to copy df1:
df2 = df1.copy()
The DataFrames df1 and df2 should be identical since the copy() method makes a copy of both indices and data of the original DataFrame. We can verify this with compare():
print(df1.compare(df2))
Output:
Empty DataFrame
Columns: []
Index: []
After comparing df1 with df2, compare() determines that they are identical and returns an empty DataFrame.
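Since copy() makes a deep copy by default, changing the copy should leave the original untouched. A minimal sketch of that behavior, using made-up names (original, duplicate, and a "visits" column) rather than the DataFrames from this article:

```python
import pandas as pd

original = pd.DataFrame({"visits": [1, 2, 3]})
duplicate = original.copy()  # deep copy by default: data and index are copied

# Right after copying, the two frames are identical
identical_before = original.compare(duplicate).empty

# Changing the copy does not affect the original
duplicate.loc[0, "visits"] = 100
print(original.loc[0, "visits"])
print(original.compare(duplicate))
```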
To see how it behaves when there are differences between the DataFrames, we can modify df2 a little bit before comparing it to df1 again:
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203
print(df1.compare(df2))
Output:
  LearnShareIT               Quora
          self    other       self      other
2      36446.0  38649.0        NaN        NaN
3          NaN      NaN  8320944.0  8847203.0
The result is no longer an empty DataFrame. Instead, its columns form a MultiIndex, with columns labeled “self” and “other” nested under each original column name. This is how compare() displays the differences it detects during the comparison.
In this example, we changed one value in the column “LearnShareIT” and one in “Quora”. That is why these two labels appear at the top level of the MultiIndex. Nested under each of them are the “self” and “other” columns, placed side by side to show the differences, where “self” refers to the first DataFrame (df1) and “other” to the second (df2). The values appear as floats because the NaN placeholders force the columns to a floating-point type.
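The nested column labels can be inspected and used for selection. The sketch below rebuilds the same df1 and df2 so that it runs on its own; the variable name diff is just a convenience chosen here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Month": ["May", "June", "July", "August"],
    "LearnShareIT": [23424, 34585, 36446, 47575],
    "Quora": [8843555, 8598598, 9814040, 8320944],
    "Stack Overflow": [6038055, 7840955, 7949055, 8390949],
})
df2 = df1.copy()
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203

diff = df1.compare(df2)

# Each column label is a pair: (original column, "self" or "other")
print(diff.columns.tolist())
# [('LearnShareIT', 'self'), ('LearnShareIT', 'other'), ('Quora', 'self'), ('Quora', 'other')]

# Select the self/other pair for one original column
print(diff["LearnShareIT"])

# Select only the values coming from df2, across all differing columns
print(diff.xs("other", axis=1, level=1))
```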
By default, compare() only shows different values and displays NaN on equal entries. This helps us quickly identify where the differences are located in the DataFrames.
If you want to show the identical values in the result as well, set keep_equal to True:
print(df1.compare(df2, keep_equal=True))
Output:
  LearnShareIT           Quora
          self  other     self    other
2        36446  38649  9814040  9814040
3        47575  47575  8320944  8847203
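keep_equal can also be combined with keep_shape to get a full side-by-side view of both DataFrames, including the rows and columns that did not change. A minimal sketch that rebuilds the same data so it runs on its own (the name full_view is chosen here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Month": ["May", "June", "July", "August"],
    "LearnShareIT": [23424, 34585, 36446, 47575],
    "Quora": [8843555, 8598598, 9814040, 8320944],
    "Stack Overflow": [6038055, 7840955, 7949055, 8390949],
})
df2 = df1.copy()
df2.loc[2, "LearnShareIT"] = 38649
df2.loc[3, "Quora"] = 8847203

# Keep every row and column, and show the actual values instead of NaN
full_view = df1.compare(df2, keep_shape=True, keep_equal=True)
print(full_view)
```

The result has a “self” and an “other” column for each of the four original columns, so it is wider than the default output and better suited to small DataFrames.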
Summary
The compare() method finds the differences between two DataFrames in pandas. It returns them in a new DataFrame, placing the values from both DataFrames side by side for easier inspection.
My name is Robert. I have a degree in information technology and two years of experience in software development. I’m here to share what I know about programming languages. I hope you find my articles interesting.
Job: Developer
Name of the university: HUST
Major: IT
Programming Languages: Java, C#, C, Javascript, R, Typescript, ReactJs, Laravel, SQL, Python