Working with plain Latin characters in Python rarely causes problems. Once special characters are involved, however, you may run into the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte”.
Why does this error appear, and how do you solve it?
Encoding and decoding with two different character sets
The error appears when bytes were encoded with one character set and are then decoded with a different one. The example below encodes a string as Latin-1 and then tries to decode it as UTF-8.
encoding = 'LearnShäreIT'.encode('latin-1')
decoding = encoding.decode('utf-8')  # Raises UnicodeDecodeError
print(decoding)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 7: invalid continuation byte
To solve this error, decode with the same character set that was used for encoding, as in the code sample below.
encoding = 'LearnShäreIT'.encode('utf-8')
# Using the same character set
decoding = encoding.decode('utf-8')
print(decoding)
Output:
LearnShäreIT
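When the charset of incoming bytes is not known in advance, one possible approach is to try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name decode_with_fallback and the candidate list are assumptions, not part of the original example. Note that latin-1 can decode any byte sequence, so it only makes sense as the last candidate.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each candidate encoding in order and return the first success.

    Hypothetical helper for illustration. latin-1 maps every byte to a
    character, so it never fails and should be listed last.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark undecodable
    # bytes with U+FFFD (the Unicode replacement character).
    return raw.decode(encodings[0], errors="replace")


raw = "LearnShäreIT".encode("latin-1")

# UTF-8 fails on byte 0xe4, so the helper falls back to latin-1.
print(decode_with_fallback(raw))

# errors="replace" avoids the exception but loses the original character.
print(raw.decode("utf-8", errors="replace"))
```

A fallback like this only guarantees that decoding does not raise. If the guess is wrong, the text can still come out garbled, so knowing the real charset remains the safer fix.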
Inconsistent charset between saving and reading a file
Suppose we create a CSV file and save it with the UTF-16 BE charset.
When we then read the file with pandas.read_csv() without specifying an encoding, the function uses its default character set, which is utf-8. See the code below for a better understanding.
import pandas as pd

# read_csv() defaults to encoding='utf-8', but data.csv is saved as UTF-16
data = pd.read_csv('data.csv')
print(data)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
We have to set encoding='utf-16' in read_csv() so that decoding matches the charset the file was saved with, like this:
import pandas as pd

# Using encoding='utf-16'
data = pd.read_csv('data.csv', encoding='utf-16')
print(data)
Output:
           Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
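The byte 0xfe at position 0 in the error above is the first byte of the UTF-16 BE byte order mark (BOM), which is never valid in UTF-8. A rough way to spot this situation before reading a file is to look at its first few bytes. The sketch below uses only the standard codecs module. The helper name sniff_bom and its default argument are assumptions made for illustration, and a file without a BOM simply falls through to the default.

```python
import codecs

# Byte order marks that identify a Unicode encoding at the start of a file.
# The UTF-32 marks are checked first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or `default` if none.

    Hypothetical helper for illustration; it cannot identify files that
    were saved without a BOM.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# Simulate the first bytes of a CSV file saved as UTF-16 BE with a BOM.
sample = codecs.BOM_UTF16_BE + "Name,Website\n".encode("utf-16-be")
print(sample[:2])         # the two BOM bytes, starting with 0xfe
print(sniff_bom(sample))  # codec name to pass as encoding=...
```

For a real file, the first four bytes are enough: read them in binary mode with open('data.csv', 'rb') and pass the returned name to read_csv() as the encoding argument.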
Using the detect() function in the chardet package
You can use chardet to detect the character encoding of a file. The library is handy when you work with a large pile of text, and also when you work with downloaded data whose charset you don’t know.
Syntax:
chardet.detect(data)
Parameter:
- data: the raw bytes (for example, the contents of a file opened in binary mode) whose charset you want to detect.
The detect() function guesses which charset a byte string is using. It returns a dictionary containing the detected charset and a confidence level.
Before using the detect() function, we need to install chardet with the following command:
pip install chardet
Then we import chardet at the top of the Python file. Next, we pass the file’s raw bytes to the detect() function to detect its charset. After getting the charset, we pass it to read_csv(), like this:
import chardet
import pandas as pd

# Detect the character encoding of data.csv from its raw bytes
with open('data.csv', 'rb') as f:
    enc = chardet.detect(f.read())
print(enc['encoding'])  # UTF-16

# Use pandas to read data.csv with the detected encoding
data = pd.read_csv('data.csv', encoding=enc['encoding'])
print(data)
Output:
UTF-16
           Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
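Because detect() returns a guess together with a confidence level, it can be worth checking that confidence before trusting the result. The 'encoding' value can also be None when no charset is recognised. The sketch below works on a dictionary shaped like the one detect() returns. The helper name pick_encoding, the 0.8 threshold and the utf-8 fallback are assumptions chosen for illustration, and the sample dictionaries are made-up values, not real detection results.

```python
def pick_encoding(result: dict, min_confidence: float = 0.8,
                  fallback: str = "utf-8") -> str:
    """Return the detected charset only if the detector was confident enough.

    `result` is expected to look like the dictionary returned by
    chardet.detect(), with 'encoding' and 'confidence' keys.
    Hypothetical helper; the threshold and fallback are arbitrary choices.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Illustrative dictionaries (made-up values, not real detection output).
print(pick_encoding({"encoding": "UTF-16", "confidence": 1.0}))
print(pick_encoding({"encoding": "Windows-1252", "confidence": 0.4}))
print(pick_encoding({"encoding": None, "confidence": 0.0}))
```

With the code from this section, the call would be pick_encoding(enc), and its return value would be passed to read_csv() in place of enc['encoding'].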
Change character encoding manually
This approach is very simple. Open the file you need to read in Notepad++, select Encoding -> Convert to UTF-8 on the menu bar, and save the file.
Code:
import pandas as pd

# Using pandas to read data.csv with charset = UTF-8
data = pd.read_csv('data.csv')
print(data)
Output:
           Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
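If Notepad++ is not available, or there are many files to convert, the same conversion can be done in Python. The following is a minimal sketch that assumes the source charset is already known. The helper name convert_to_utf8 and the file names are hypothetical, and the demo works in a temporary directory so that it does not touch any real data.csv.

```python
import os
import tempfile


def convert_to_utf8(src_path: str, dst_path: str, src_encoding: str) -> None:
    """Re-encode a text file as UTF-8 (hypothetical helper for illustration).

    newline="" keeps the original line endings unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo: create a small UTF-16 file, convert it, and read back the raw bytes.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data_utf16.csv")
    dst = os.path.join(tmp, "data_utf8.csv")
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("Name,Website\nLearnShareIT,learnshareit.com\n")

    convert_to_utf8(src, dst, "utf-16")

    with open(dst, "rb") as f:
        converted = f.read()

# The converted bytes now decode cleanly as UTF-8.
print(converted.decode("utf-8"))
```

This version reads the whole file into memory, which is fine for small CSV files. Very large files would need to be copied in chunks instead.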
Summary
Basically, the error “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” comes from an inconsistency between the encoding and decoding steps. As long as you use the same character set for both encoding and decoding (such as UTF-8), you won’t get this error again.
Have a lucky day!
Hi, I’m Cora Lopez. I have a passion for teaching programming languages such as Python, Java, Php, Javascript … I’m creating the free python course online. I hope this helps you in your learning journey.