UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte – How to fix this error?

Working with plain Latin (ASCII) characters usually causes no problems. With special characters such as accented letters, however, Python can raise the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte”.

Why does the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” appear, and how do you solve it?

Encoding and decoding with two different character sets

The error appears when we encode a string with one character set and then try to decode the resulting bytes with a different one. In the example below, the string is encoded as Latin-1 but decoded as UTF-8.

encoding = 'LearnShäreIT'.encode('latin-1')

# Decoding Latin-1 bytes as UTF-8 raises UnicodeDecodeError
decoding = encoding.decode('utf-8')

print(decoding)  # Never reached

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 7: invalid continuation byte

To solve this error, decode the bytes with the same character set that was used to encode them, as in the code sample below.

encoding = 'LearnShäreIT'.encode('utf-8')

# Using the same character set
decoding = encoding.decode('utf-8')

print(decoding)

Output:

LearnShäreIT
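
If the bytes already exist and you cannot re-encode them, the same rule applies in the other direction: decode with the charset that produced them. The following is only a minimal sketch. The name decode_with_fallback and its candidate list are hypothetical (not part of any library), and it assumes the data is either UTF-8 or Latin-1.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate charset in order; return (text, charset_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This charset does not match the bytes, try the next one
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes that were produced with Latin-1, as in the failing example
latin_bytes = 'LearnShäreIT'.encode('latin-1')

text, used = decode_with_fallback(latin_bytes)
print(text, used)
```

Note that Latin-1 accepts every possible byte, so it never fails. That is why it is placed last, and a successful decode alone does not prove the guess was right.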

Inconsistent charset when saving and reading files

Suppose we create a CSV file and save it with the UTF-16 BE charset.

When we then read the file with pandas.read_csv() without specifying an encoding, read_csv() falls back to its default character set, which is utf-8. See the code below for a better understanding.

import pandas as pd

# read_csv() defaults to encoding='utf-8', but data.csv was saved as UTF-16
data = pd.read_csv('data.csv')

print(data)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0

We have to pass encoding='utf-16' so that the file is decoded with the same charset it was saved with. Like this:

import pandas as pd

# Using encoding='utf-16'
data = pd.read_csv('data.csv', encoding='utf-16')

print(data)

Output:

          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
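
The same mismatch can be reproduced without pandas. The sketch below uses only the standard library, under the assumption that a temporary file stands in for data.csv: it saves a small CSV as UTF-16, shows that reading it as UTF-8 fails, and then reads it back with the matching charset.

```python
import csv
import os
import tempfile

rows = [["Name", "Website"], ["LearnShareIT", "learnshareit.com"]]

# Save the file as UTF-16 (Python writes a byte order mark first)
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading with the wrong charset fails
try:
    with open(path, encoding="utf-8", newline="") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print(err)

# Reading with the charset used for saving works
with open(path, encoding="utf-16", newline="") as f:
    loaded = list(csv.reader(f))

print(loaded)
```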

Using the detect() function in the chardet package

You can use chardet to detect the character encoding of a file. This library is handy when working with a large amount of text, and also when working with downloaded data whose charset you don’t know.

Syntax:

chardet.detect(data)

Parameter:

  • data: the raw bytes whose charset you want to detect.

The detect() function guesses which charset a sequence of bytes is encoded in. It returns a dictionary containing the detected charset (under the 'encoding' key) and a confidence level (under the 'confidence' key).

Before using the detect() function, we need to install chardet with the following command:

pip install chardet

Then we import chardet at the top of the Python file. Next, we read the file in binary mode and pass its bytes to the detect() function to detect the charset. After getting the charset, we pass it to read_csv(). Like this:

import chardet
import pandas as pd

# Read the raw bytes of data.csv and detect their character encoding
with open('data.csv', 'rb') as f:
    enc = chardet.detect(f.read())

print(enc['encoding'])  # UTF-16

# Use pandas to read data.csv
data = pd.read_csv('data.csv', encoding=enc['encoding'])

print(data)

Output:

UTF-16
          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
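
The result of detect() is a guess. For files that begin with a byte order mark (BOM), as UTF-16 files usually do, the charset can also be read directly from the first bytes. The sketch below is not chardet and not a replacement for it. It is a standard-library illustration, and sniff_bom is a hypothetical helper name.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


# Simulated content of a file saved as UTF-16 BE with a BOM
sample = codecs.BOM_UTF16_BE + "Name,Website".encode("utf-16-be")

enc = sniff_bom(sample)
print(enc)
print(sample.decode(enc))
```

When no BOM is present, sniff_bom() returns None, and a statistical detector such as chardet is the remaining option.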

Change character encoding manually

This approach is very simple. Open the file you need to read with Notepad++ and, on the menu bar, select Encoding -> Convert to UTF-8. Save the file, then read it with the default charset:

Code:

import pandas as pd

# Using pandas to read data.csv with charset = UTF-8
data = pd.read_csv('data.csv')

print(data)

Output:

          Name           Website
0  LearnShareIT  learnshareit.com
1      Facebook      facebook.com
2        Google        google.com
3         Udemy         udemy.com
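
The same conversion can be done in Python instead of a text editor. The following is a minimal sketch, assuming the source charset is already known. convert_to_utf8 and the file names are hypothetical, and a temporary directory stands in for the real files.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Read src_path with src_encoding and write the same text to dst_path as UTF-8."""
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "data_utf16.csv")
dst_path = os.path.join(workdir, "data_utf8.csv")

# Create a small UTF-16 file to convert
with open(src_path, "w", encoding="utf-16", newline="") as f:
    f.write("Name,Website\nLearnShäreIT,learnshareit.com\n")

convert_to_utf8(src_path, dst_path, "utf-16")

# The converted file can now be read with the default utf-8 charset
with open(dst_path, encoding="utf-8") as f:
    converted = f.read()

print(converted)
```

Writing to a separate destination file keeps the original intact in case the source charset was guessed incorrectly.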

Summary

In short, the error “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte in position: invalid continuation byte” comes from an inconsistency between the encoding and decoding steps. As long as you use the same character set for both encoding and decoding (such as UTF-8), you won’t get this error again.

Have a lucky day!
