How To Remove The Non UTF-8 Characters From A String In Python

In this tutorial, we will show you how to remove non UTF-8 characters from a string in Python. Such characters often show up when binary data or text saved in another encoding is read as UTF-8, and they rarely carry useful information. Follow the explanations and examples below to learn how to remove them.

What Is A Non UTF-8 Character?

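Strictly speaking, UTF-8 can encode every Unicode character, including mathematical symbols and the letters of every supported language. What people call non UTF-8 characters are usually bytes that do not form valid UTF-8, which appear when a binary file or text saved in another encoding is read as UTF-8. Such bytes typically show up as the replacement character � (U+FFFD) or as unexpected accented letters. In this tutorial, the term also covers any character outside the ASCII range, because both solutions below keep only ASCII characters.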
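
To see where the � mark comes from, here is a minimal sketch that decodes bytes that are not valid UTF-8. The byte string `raw` is made up for illustration and is not taken from the examples in this tutorial.

```python
# 0xff can never appear in valid UTF-8, so this byte string cannot be decoded cleanly.
raw = b'Learn\xffShare'

# errors='replace' substitutes U+FFFD (displayed as �) for each undecodable byte.
replaced = raw.decode('utf-8', errors='replace')
print(replaced)  # Learn�Share

# errors='ignore' silently drops the undecodable byte instead.
ignored = raw.decode('utf-8', errors='ignore')
print(ignored)  # LearnShare
```

Without an `errors` argument, the same `decode()` call raises a `UnicodeDecodeError`.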

Here are some examples of the non UTF-8 characters:

İcrvt4722ç LearnShareITı
İcrvt4722ç LearnShareITı
Äæ��Learn�Share� ��IT��ç

All three strings contain characters outside the ASCII range, and the � (U+FFFD) marks in the third one stand for bytes that could not be decoded.

Now that you know what non UTF-8 characters are, the next section shows how to remove them.

How To Remove The Non UTF-8 Characters From A String In Python

To remove non UTF-8 characters from a string in Python, you can either filter the characters with the ord() function or encode the string with the encode() function and process the encoded result.

Remove the non UTF-8 characters from a String with the ord function

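The ord() function returns the Unicode code point of a character. ASCII characters have code points from 0 to 127, so you can use ord() to check whether a character belongs to ASCII. If it does, the character is kept; otherwise, it is removed.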
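
As a quick illustration of the check, the sketch below prints the code points of three characters. The `sample` string and the variable names are chosen only for this illustration.

```python
# One ASCII letter, one accented letter, and the replacement character U+FFFD.
sample = 'L\u00c4\ufffd'

# ord() returns the Unicode code point of each character.
code_points = [ord(character) for character in sample]
print(code_points)  # [76, 196, 65533]

# Only code points below 128 belong to ASCII.
is_ascii = [code_point < 128 for code_point in code_points]
print(is_ascii)  # [True, False, False]
```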

Look at the example below.

myStr = 'Äæ�Learn�Share��IT�ç'
print('The old String: ' + myStr)

newStr = ''

# Loop over each character of the old string.
for character in myStr:
    # Keep only ASCII characters (code points 0 to 127).
    if ord(character) < 128:
        newStr += character

print('The new String: ' + newStr)

Output

The old String: Äæ�Learn�Share��IT�ç
The new String: LearnShareIT

Note that this solution keeps only ASCII characters, so it also removes valid UTF-8 characters from other languages, such as Ä and ç. Use it only when ASCII-only output is acceptable.
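
If the accented letters should stay, one possible alternative is to remove only the replacement character U+FFFD. This is a sketch of a different technique from the ord() filter above, and the names `REPLACEMENT_CHARACTER`, `my_str` and `cleaned` are illustrative.

```python
# U+FFFD is what decoders insert in place of bytes they could not decode.
REPLACEMENT_CHARACTER = '\ufffd'

# The same sample string as above, written with escapes for the � marks.
my_str = 'Äæ\ufffdLearn\ufffdShare\ufffd\ufffdIT\ufffdç'

# Remove only the replacement characters; valid letters such as Ä and ç are kept.
cleaned = my_str.replace(REPLACEMENT_CHARACTER, '')
print(cleaned)  # ÄæLearnShareITç
```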

Remove the non UTF-8 characters from a String with the encode function

In this example, we will show you how to remove non UTF-8 characters by using the encode() method. Passing 'ignore' as the errors parameter makes encode() skip any character it cannot encode instead of raising a UnicodeEncodeError. Follow these steps to learn more about this solution.

  • Step 1: Encode the string with the encode() function. In the printed bytes, each non-ASCII byte appears as a '\' followed by 3 characters (for example, \xc3).
  • Step 2: Convert the result to a string.
  • Step 3: Remove those escape sequences from the string to get the result.
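
For context on the errors parameter: when the target is UTF-8, it mainly matters for lone surrogates, which are the only characters a Python string can hold that UTF-8 cannot encode. The sketch below builds one with the 'surrogateescape' error handler; the sample bytes and variable names are assumptions made for illustration.

```python
# Decoding an invalid byte with 'surrogateescape' yields a lone surrogate (here U+DCFF).
text = b'IT\xff'.decode('utf-8', errors='surrogateescape')

# Strict encoding refuses the lone surrogate.
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True

# errors='ignore' drops the surrogate and encodes the rest.
encoded = text.encode('utf-8', errors='ignore')
print(encoded)  # b'IT'
```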

Look at the example below to learn more about this solution.

def removeNonUTF_8(currentStr=''):
    # Encode the string as UTF-8 bytes; errors='ignore' skips anything that cannot be encoded.
    encoded = currentStr.encode('utf-8', errors='ignore')

    # Show the bytes: every non-ASCII byte is printed as \xNN.
    print('After encoding: ' + str(encoded))

    # Convert the bytes to their printed form, which looks like b'...'.
    newStr = str(encoded)

    # Strip the leading b' and the trailing '.
    newStr = newStr[2:-1]

    # Split on the backslash that starts each \xNN escape.
    # Note: this assumes the input has no backslashes or control characters such as newlines.
    arr = newStr.split('\\')

    result = arr[0]
    for i in range(1, len(arr)):
        # Drop the 3 characters (xNN) that follow each backslash.
        result += arr[i][3:]

    return result

# Try this function with the string from the first solution
myStr = 'Äæ�Learn�Share��IT�ç'
print('The old String: ' + myStr)
print('The new String: ' + removeNonUTF_8(myStr))

Output

The old String: Äæ�Learn�Share��IT�ç
After encoding: b'\xc3\x84\xc3\xa6\xef\xbf\xbdLearn\xef\xbf\xbdShare\xef\xbf\xbd\xef\xbf\xbdIT\xef\xbf\xbd\xc3\xa7'
The new String: LearnShareIT
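
The function above works on the printed form of the bytes, so it can misbehave when the input itself contains backslashes or control characters such as newlines. A shorter sketch of the same idea, assuming ASCII-only output is the goal, encodes straight to ASCII and lets errors='ignore' drop everything else. The name `remove_non_ascii` is hypothetical and not part of the original example.

```python
def remove_non_ascii(text):
    # Encode to ASCII and skip every character that ASCII cannot represent,
    # then decode the remaining bytes back into a string.
    return text.encode('ascii', errors='ignore').decode('ascii')


# The same sample string as above, written with escapes for the � marks.
my_str = 'Äæ\ufffdLearn\ufffdShare\ufffd\ufffdIT\ufffdç'
print(remove_non_ascii(my_str))  # LearnShareIT

# Backslashes in the input are left untouched.
print(remove_non_ascii('C:\\temp\\IT'))  # C:\temp\IT
```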

Summary

You have learned what non UTF-8 characters are and two ways to remove them from a string in Python. The ord() solution is the simpler of the two and fits most cases. The encode() solution shows what the string looks like as UTF-8 bytes, but it depends on the printed form of those bytes, so use it only on input without backslashes or control characters. We hope this tutorial is helpful to you. Thanks!
