UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte – How To Fix It?


If you are running into the error “UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte”, don’t worry. This article explains why it happens and how to fix it.

Why does the error “UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte” occur?

This error is common when reading a CSV file with pandas. It happens because the read_csv() function decodes the file as UTF-8 by default, but the file contains bytes that are not valid UTF-8, usually because it was saved in a different encoding.
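As a minimal sketch of the cause, the snippet below decodes a made-up byte string (not taken from the CSV file used in this article). The byte 0xa5 can only appear as a continuation byte in UTF-8, never as the first byte of a character, so decoding fails at position 0, while Latin-1 maps the same byte to a character.

```python
# Hypothetical sample data: 0xa5 is the yen sign in Latin-1,
# but it is not a valid first byte of any UTF-8 sequence.
raw = b"\xa5100"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Latin-1 assigns a character to every byte value, so this never fails.
latin1_text = raw.decode("latin1")

print(utf8_error)
print(latin1_text)
```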

Now, we will read a CSV file from the biomedical domain with pandas and see how the error happens.

You can download the CSV file here.

Code:

import pandas as pd
data = pd.read_csv("alldata_1_for_kaggle.csv")
data.head()

Result:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 3: invalid start byte

Note: You may get the same error with different values, in the format: UnicodeDecodeError: 'utf-8' codec can't decode byte <<byte value>> in position <<position>>: invalid start byte. The byte value and position depend on the contents of your file.
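The values in the message can also be read from the exception object itself, which may help when locating the problem in a large file. The sketch below uses a hypothetical byte string chosen to reproduce the byte value and position from the result above; start, object and reason are standard attributes of UnicodeDecodeError.

```python
# Hypothetical bytes: an invalid byte 0x99 at index 3.
raw = b"abc\x99def"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]  # integer value of the offending byte
    position = exc.start              # index of that byte in the input
    reason = exc.reason               # short description of the failure

print(hex(bad_byte), position, reason)
```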

Solutions to this problem

Solution for reading a CSV file:

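The latin1 encoding (also known as iso-8859-1) maps each of the 256 possible byte values to a character, so decoding with it never raises a UnicodeDecodeError.

You can pass the encoding parameter to read_csv() with a string value that names the encoding used to decode the file.

In our example, we use latin1 to decode the data. Note that latin1 only avoids the error: if the file was actually saved in another encoding, some characters may be displayed incorrectly.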

Code:

import pandas as pd
data = pd.read_csv("alldata_1_for_kaggle.csv", encoding = 'latin1') # Pass encoding parameter
data.head()

Result:

   Unnamed: 0               0                                                  a
0           0  Thyroid_Cancer  Thyroid surgery in  children in a single insti...
1           1  Thyroid_Cancer  " The adopted strategy was the same as that us...
2           2  Thyroid_Cancer  coronary arterybypass grafting thrombosis ï¬b...
3           3  Thyroid_Cancer   Solitary plasmacytoma SP of the skull is an u...
4           4  Thyroid_Cancer   This study aimed to investigate serum matrix ...
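If you do not know which encoding a file uses, one possible approach is to try a few candidates in order and keep the first one that decodes without error. The following is only a sketch: detect_encoding is a hypothetical helper (not part of pandas), the candidate list is an assumption, and the sample bytes are made up. A successful decode does not prove the encoding is the right one, so it is still worth checking the result by eye.

```python
def detect_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical CSV content containing the Latin-1/cp1252 byte for "é".
sample = b"id,text\n1,caf\xe9\n"
encoding = detect_encoding(sample)
print(encoding)
```

For a real file, the bytes could be read with open(path, "rb").read(), and the returned name passed to read_csv() through the encoding parameter.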

Solution for reading text and JSON files:

The initial contents of the JSON file (a.json), followed on the last line by the text file (a.txt):

{"student":[
    { "firstName":"™œœ''™™œ""××""™"ˆ'γ°°'ˆ'"œ™"ε""íö", "lastName":"Doe" },
    { "firstName":"Anna", "lastName":"Smith" },
    { "firstName":"Peter", "lastName":"Jones" }
]}
œMedical Informatics and œHealth Care Sciences

Open the file and read it in binary mode

Syntax:

file_reader = open("path/to/file", "rb")
  • rb: the binary reading mode

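Read the JSON file:

import json
 
# Binary mode: Python does not decode the bytes with the platform default encoding
with open('a.json', 'rb') as file:
    content = json.load(file)  # accepts a binary file; expects UTF-8, UTF-16 or UTF-32
print(content)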

Result:

{'student': [{'firstName': "™œ\x9dœ\x9d''™™œ\x9d”“××””™“ˆ’γ°°’ˆ’“œ\x9d™“ε““Ã\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}

Read the txt file:

# Binary mode returns raw bytes, so no decoding (and no decoding error) takes place
with open('a.txt', 'rb') as file:
    print(file.read())

Result:

b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'
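Reading in binary mode returns bytes rather than a string, so to work with the text you still need to decode it yourself. As a small sketch, the snippet below decodes the bytes from the result above, assuming the file is UTF-8 encoded (an assumption that happens to hold for these particular bytes).

```python
# Bytes copied from the binary read result above.
raw = b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'

# Assumes UTF-8; a different encoding would raise UnicodeDecodeError here.
text = raw.decode("utf-8")
print(repr(text))
```

Decoding explicitly like this makes the encoding choice visible in the code instead of depending on the platform default.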

Ignoring errors when reading a file

Syntax:

file = open("path/to/file", "r", errors = "ignore") 

Ignoring decoding errors can lead to data loss, because any bytes that cannot be decoded are silently dropped.
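To make the data loss concrete, here is a short sketch on a made-up byte string (not one of the files used in this article). It compares errors="ignore" with errors="replace", another standard error handler that inserts the replacement character U+FFFD instead of dropping the byte.

```python
# Hypothetical Latin-1 bytes: 0xe9 ("é" in Latin-1) is not valid UTF-8 here.
raw = b"caf\xe9 au lait"

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte disappears
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

With "replace", the output at least shows where something went wrong, which can make damaged text easier to spot later.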

Read the JSON file:

import json
 
# errors='ignore' silently drops any bytes that cannot be decoded
with open('a.json', 'r', errors='ignore') as file:
    content = json.load(file)
print(content)

Result:

{'student': [{'firstName': "™œÂÅ“Â''™™œÂâ€â€œÃƒâ€”×â€â€â„¢â€œË†â€™ÃŽÂ³Â°Â°â€™Ë†â€™â€œÅ“™“ε““ÃÂ\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}

Read the txt file:

with open('a.txt', 'r', errors='ignore') as file:
    print(file.read())

Result:

œMedical Informatics and œHealth Care Sciences

Summary

“UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte” is a common error when reading files. We hope this article has helped you understand the root cause of the problem and how to solve it.
