In this tutorial, we will show you how to remove non UTF-8 characters from a String in Python. Such characters often show up when text is read from a binary file or decoded with the wrong encoding, and they usually add nothing of value to the text. Follow the explanation and examples below to learn how to remove them.
What Is A Non UTF-8 Character?
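Strictly speaking, UTF-8 can encode every Unicode character, including mathematical symbols and the letters of any language. What people usually call non UTF-8 characters are bytes that do not form valid UTF-8, or the garbled characters that appear when text is decoded with the wrong encoding. They appear a lot in binary files because those files store raw bytes rather than text.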
Here are some examples of the non UTF-8 characters:
İcrvt4722ç LearnShareITı
Ä°crvt4722Ã§ LearnShareITÄ±
Äæ��Learn�Share� ��IT��Ã§
All three Strings contain non UTF-8 characters.
Now that you know what non UTF-8 characters are, the next section shows how to remove them.
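As a rough illustration of where Strings like these come from, the sketch below reproduces the start of the second example from the first by decoding its UTF-8 bytes with the wrong encoding, and shows how an invalid byte turns into the � replacement character. The choice of Latin-1 as the wrong encoding and the sample bytes are assumptions made for this illustration.

```python
# The start of the first example, written with escapes: \u0130 is İ and \u00e7 is ç.
original = '\u0130crvt4722\u00e7'

# Encode correctly as UTF-8, then decode with the wrong encoding (assumed here: Latin-1).
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)

# A byte that is not valid UTF-8 (0xff) becomes U+FFFD when decoded with errors='replace'.
raw = b'Learn\xffShare'
replaced = raw.decode('utf-8', errors='replace')
print(replaced)
```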
How To Remove The Non UTF-8 Characters From A String In Python
To remove the non UTF-8 characters from a String in Python, you can either check each character with the ord() function, or encode the String with the encode() function and process the encoded result.
Remove the non UTF-8 characters from a String with the ord function
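You can use the ord() function to check whether a character belongs to ASCII. ord() returns the code point of the character, and ASCII covers the code points 0 to 127. Every ASCII character is also a valid UTF-8 character, so the characters that pass this check are kept.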
Look at the example below.
myStr = 'Äæ�Learn�Share��IT�Ã§'
print('The old String: ' + myStr)

newStr = ''

# Loop over the old String.
for character in myStr:
    # Keep the character only if it is ASCII (code points 0 to 127).
    if ord(character) <= 127:
        newStr += character

print('The new String: ' + newStr)
The old String: Äæ�Learn�Share��IT�Ã§
The new String: LearnShareIT
Note that this solution removes every character outside ASCII, including valid UTF-8 characters from other languages such as é or ñ. Use it only in the usual case where keeping plain ASCII text is enough.
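To make this limitation concrete, here is a small sketch of the same ASCII check wrapped in a helper. The function name keep_ascii and the sample Strings are made up for this illustration. It shows that accented letters, which are valid UTF-8, are dropped along with the garbled characters.

```python
def keep_ascii(text):
    """Return text with every character outside ASCII (code points 0 to 127) removed."""
    return ''.join(ch for ch in text if ord(ch) < 128)


# Garbled input: the replacement characters (U+FFFD) are removed as intended.
print(keep_ascii('Learn\ufffdShare\ufffdIT'))

# Legitimate text: the accented letters in "café naïve" are removed too.
print(keep_ascii('caf\u00e9 na\u00efve'))
```

If accented letters must survive, this check is too strict and a narrower filter would be needed.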
Remove the non UTF-8 characters from a String with the encode function
In this example, we will show you how to remove the non UTF-8 characters by encoding the String. Pass the value ‘ignore’ to the ‘errors’ parameter of encode() so that a character that cannot be encoded is skipped instead of raising a UnicodeEncodeError. Follow these steps to learn more about this solution.
- Step 1: Encode the String with the encode() function. In the printed result, every byte outside the ASCII range appears as an escape sequence: a ‘\’ followed by 3 characters, such as \xc3.
- Step 2: Convert the result to a String.
- Step 3: Process this String to remove the escape sequences and get the result.
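The first two steps can be previewed on a short String. This is a minimal sketch with a made-up sample. It shows the escape sequences that appear in the text form of the encoded bytes and, assuming Python 3, what errors='ignore' does when a String contains something UTF-8 cannot encode (a lone surrogate).

```python
# Step 1: encode. 'é' (U+00E9) becomes the two bytes 0xc3 0xa9 in UTF-8.
encoded = 'caf\u00e9'.encode('utf-8', errors='ignore')

# Step 2: convert the bytes to a String.
# Each non-ASCII byte is shown as a backslash, 'x', and two hex digits.
as_text = str(encoded)
print(as_text)

# errors='ignore' only matters for characters UTF-8 cannot encode,
# such as the lone surrogate U+D800, which is skipped here.
skipped = 'a\ud800b'.encode('utf-8', errors='ignore')
print(skipped)
```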
Look at the example below to learn more about this solution.
def removeNonUTF_8(currentStr=''):
    # Encode the current String
    encoded = currentStr.encode('utf-8', errors='ignore')

    # After encoding, the result will be like this
    print('After encoding: ' + str(encoded))

    # Convert this result to the String
    newStr = str(encoded)

    # Remove the leading b' and the trailing '
    newStr = newStr[2:-1]

    # Split this String at every '\'
    arr = newStr.split('\\')

    # The first piece comes before any '\', so keep it whole
    result = arr[0]

    for i in range(1, len(arr)):
        # Remove the 3 characters after '\' (for example 'xc3')
        result += arr[i][3:]

    return result

# Try this function with the String from the first example
myStr = 'Äæ�Learn�Share��IT�Ã§'
print('The old String: ' + myStr)
print('The new String: ' + removeNonUTF_8(myStr))
The old String: Äæ�Learn�Share��IT�Ã§
After encoding: b'\xc3\x84\xc3\xa6\xef\xbf\xbdLearn\xef\xbf\xbdShare\xef\xbf\xbd\xef\xbf\xbdIT\xef\xbf\xbd\xc3\x83\xc2\xa7'
The new String: LearnShareIT
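The function above works on the printed form of the bytes, so it can misbehave if the input itself contains a backslash or a newline, since those are also shown as escapes. A shorter alternative, which is not the method used above but gives the same result for this example, is to encode to ASCII with errors='ignore' and decode back. The function name strip_non_ascii is hypothetical.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently skips characters outside the ASCII range.
    return text.encode('ascii', errors='ignore').decode('ascii')


# The example String from this section, written with escapes.
sample = '\u00c4\u00e6\ufffdLearn\ufffdShare\ufffd\ufffdIT\ufffd\u00c3\u00a7'
print(strip_non_ascii(sample))

# Backslashes and newlines are plain ASCII, so they pass through unchanged.
print(repr(strip_non_ascii('path\\to\nfile')))
```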
You have learned what non UTF-8 characters are and two ways to remove them from a String in Python. Both solutions keep only the ASCII characters. The first solution is the simpler one and fits the usual case. The second solution takes more steps, but it shows how the String looks after encoding. We hope this tutorial is helpful to you. Thanks!