If you are running into the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte", this article explains why it happens and how to fix it. Read on.
Why does the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" occur?
This is a common error when reading a CSV file with pandas. By default, the read_csv() function decodes the file as UTF-8, but the file contains bytes that are not valid UTF-8, typically because it was saved in a different encoding.
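To see the failure in isolation, here is a minimal sketch that decodes raw bytes directly instead of going through pandas. The sample byte string is made up for illustration: 0xa5 is the yen sign in Latin-1, but in UTF-8 it is a continuation byte and cannot start a character.

```python
# Hypothetical sample: the text "¥100" as saved by a Latin-1 editor.
raw = b"\xa5100"

try:
    raw.decode("utf-8")
    message = "decoded without error"
except UnicodeDecodeError as exc:
    # The message has the same shape as the one pandas reports.
    message = str(exc)

print(message)
# Decoding with the encoding the bytes were written in works.
print(raw.decode("latin1"))
```

The same bytes decode cleanly once the matching encoding is named, which is what the solutions in this article rely on.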
Now we will read a CSV file from the biomedical domain with pandas to see how the error happens.
You can download the CSV file here.
Code:
import pandas as pd

data = pd.read_csv("alldata_1_for_kaggle.csv")
data.head()
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 3: invalid start byte
Note: You may see the same error with a different byte and position, in the form: UnicodeDecodeError: 'utf-8' codec can't decode byte <<byte value>> in position <<position>>: invalid start byte
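The byte and position in the message are also available as attributes of the exception, which can help when locating the offending data. The sketch below uses a made-up byte string with 0x99 at index 3; the attribute names (object, start, reason) are part of Python's UnicodeDecodeError.

```python
# Made-up bytes for illustration; 0x99 sits at index 3.
raw = b"abc\x99def"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]  # integer value of the offending byte
    position = exc.start              # index of that byte in the input
    reason = exc.reason               # short description of the failure
    print(f"byte 0x{bad_byte:02x} at position {position}: {reason}")
```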
Solutions to this problem
Solution for reading a CSV file:
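CPython recognizes several common encoding names, such as latin1, iso-8859-1, ascii, and us-ascii, and handles them through a fast path that bypasses the codecs lookup machinery. Of these, latin1 (an alias of iso-8859-1) is useful here because it maps every possible byte value to a character, so decoding never fails. The text may still come out wrong if the file was actually written in another encoding.
You can pass a parameter named encoding with a string value that names the encoding used to decode the data. In our example, we use latin1 to decode the data.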
Code:
import pandas as pd

# Pass the encoding parameter
data = pd.read_csv("alldata_1_for_kaggle.csv", encoding='latin1')
data.head()
Result:
   Unnamed: 0               0                                                  a
0           0  Thyroid_Cancer  Thyroid surgery in children in a single insti...
1           1  Thyroid_Cancer  " The adopted strategy was the same as that us...
2           2  Thyroid_Cancer  coronary arterybypass grafting thrombosis ï¬b...
3           3  Thyroid_Cancer  Solitary plasmacytoma SP of the skull is an u...
4           4  Thyroid_Cancer  This study aimed to investigate serum matrix ...
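Because latin1 accepts any byte, it never raises, but it can produce wrong characters (the "ï¬b" in row 2 above looks like such a case). One possible approach, sketched here without pandas, is to try a few candidate encodings in order and keep the first that decodes cleanly. The helper name decode_with_fallback, the candidate list, and the sample bytes are all assumptions for illustration, not part of pandas or of the original file.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding_name) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample: 0x99 is the trademark sign in cp1252.
sample = b"Thyroid\x99 study"
text, used = decode_with_fallback(sample)
print(used)
print(text)
```

The encoding name found this way could then be passed as the encoding argument of read_csv(). A clean decode does not prove the guess is right, so it is still worth checking the resulting text by eye.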
Solution for reading text and JSON files:
The initial content of the JSON and txt files:
{"student":[ { "firstName":"™œœ''™™œ""××""™"ˆ'γ°°'ˆ'"œ™"ε""Ãö", "lastName":"Doe" }, { "firstName":"Anna", "lastName":"Smith" }, { "firstName":"Peter", "lastName":"Jones" } ]}
œMedical Informatics and œHealth Care Sciences
Open the file and read it in binary mode
Syntax:
file_reader = open("path/to/file", "rb")
- rb: the binary reading mode
Read the JSON file:
import json

# json.load() accepts a file object opened in binary mode
with open('a.json', 'rb') as file:
    content = json.load(file)
print(content)
Result:
{'student': [{'firstName': "™œ\x9dœ\x9d''™™œ\x9d”“××””™“ˆ’γ°°’ˆ’“œ\x9d™“ε““Ã\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}
Read the txt file:
# Binary mode returns bytes, so no decoding takes place
with open('a.txt', 'rb') as file:
    print(file.read())
Result:
b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'
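Binary mode avoids the error only because nothing is decoded: the result is a bytes object, not a str. To get text, the bytes still have to be decoded explicitly. The sketch below writes the bytes from the result above to a temporary file (a stand-in for a.txt), reads them back in binary mode, and decodes them with errors="replace" so that any invalid byte would become U+FFFD instead of raising.

```python
import os
import tempfile

# Bytes taken from the result above; the temporary file stands in for a.txt.
raw_in = b"\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw_in)
    path = tmp.name

try:
    with open(path, "rb") as fh:
        raw = fh.read()  # bytes, no decoding has happened yet
finally:
    os.remove(path)

# Decode explicitly; invalid bytes would be replaced rather than raise.
text = raw.decode("utf-8", errors="replace")
print(type(raw).__name__, type(text).__name__)
print(text)
```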
Ignore errors when reading the file
Syntax:
file = open("path/to/file", "r", errors = "ignore")
Note that ignoring decoding errors can lead to data loss, because every byte that cannot be decoded is silently dropped.
Read the JSON file:
import json

with open('a.json', 'r', errors='ignore') as file:
    content = json.load(file)
print(content)
Result:
{'student': [{'firstName': "™œÂÅ“Â''™™œÂâ€â€œÃƒâ€”×â€â€â„¢â€œË†â€™ÃŽÂ³Â°Â°â€™Ë†â€™â€œÅ“™“ε““ÃÂ\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}
Read the txt file:
with open('a.txt', 'r', errors='ignore') as file:
    print(file.read())
Result:
œMedical Informatics and œHealth Care Sciences
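To make the data loss concrete, the sketch below decodes the same made-up bytes with three of Python's standard error handlers. With "ignore" the bad byte disappears without a trace, while "replace" and "backslashreplace" leave a visible marker, which may be the safer choice when the original encoding is unknown.

```python
# Made-up bytes for illustration; 0x99 is not valid UTF-8 here.
raw = b"Health\x99 Care"

ignored = raw.decode("utf-8", errors="ignore")            # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")          # bad byte becomes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte becomes \x99 text

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

The same errors values can be passed to open() in text mode.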
Summary
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" is a common error when reading files. I hope this article has helped you understand the root of the problem and how to solve it.

My name is Robert Collier. I graduated in IT at HUST university. My interest is learning programming languages; my strengths are Python, C, C++, and Machine Learning/Deep Learning/NLP. I will share all the knowledge I have through my articles. Hope you like them.