How To Strip The HTML Tags From A String In Python

To strip the HTML tags from a string in Python, you can use the re.sub() function or the BeautifulSoup library. This article walks through both approaches.

Strip the HTML tags from a string in Python

What is an HTML tag? HTML tags mark the beginning and end of an element on a web page. An element usually has the following structure: <tagname>content</tagname>. Some common HTML tags are <html>, <head>, <body>, and <title>. (<!DOCTYPE>, which often appears at the top of a page, is a declaration rather than a tag.)

Example:

<head>
<title>Title of the web page</title>
</head>

To strip the HTML tags from a string in Python, you can use one of the following approaches:

Use the re.sub() function

This section shows how to strip the HTML tags from a string in Python using re.sub().

A regular expression (RegEx) is a sequence of characters that defines a search pattern. The pattern can match a single string or a whole set of strings.

In Python, regular expressions are provided by the re module, so you have to import re before you can use them. The re module has many functions for working with RegEx, and one of the most useful is re.sub().

The re.sub() function replaces every match of the pattern in the string with the replacement you pass in and returns the modified string. The original string is not changed.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

  • pattern: the regular expression to search for.
  • repl: the replacement for each part of the string that matches the pattern.
  • string: the string to search.
  • count: the maximum number of replacements. If omitted, it defaults to 0, which means every match is replaced.
  • flags: optional flags that modify how the pattern is matched, such as re.IGNORECASE.
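
To make the parameters concrete, here is a minimal sketch of re.sub() with and without count. The sample string and the names text, all_replaced, and first_only are made up for this illustration.

```python
import re

# Sample input invented for this illustration.
text = "a-b-c-d"

# count left at its default of 0: every match is replaced.
all_replaced = re.sub("-", "+", text)

# count=1: only the first match is replaced.
first_only = re.sub("-", "+", text, count=1)

print(all_replaced)  # a+b+c+d
print(first_only)    # a+b-c-d
```

Passing count as a keyword argument keeps the call readable and avoids confusing it with flags.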

Example:

  • Import the re module and create an HTML string.
  • Use re.sub() to strip the HTML tags from the HTML string.

import re

html_string = '''
<body>
  <h1>visit </h1>
  <h2>learnshareit</h2>
  <h3>website</h3>
</body>
'''
print("Original String: ", html_string)

# Use re.sub() to replace every HTML tag with an empty string
result_str = re.sub(r'<[^<]+?>', '', html_string)
print("Result String: ", result_str)

Output:

Original String:  
<body>
  <h1>visit </h1>
  <h2>learnshareit</h2>
  <h3>website</h3>
</body>

Result String:  

  visit 
  learnshareit
  website

Note:

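The pattern <[^<]+?> matches a single tag: an opening <, followed by one or more characters that are not < (matched lazily, so the match is as short as possible), followed by the first > that comes next. Every tag that matches, such as <h1> or </h1>, is replaced with an empty string, which leaves only the text between the tags.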
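
To see what the pattern actually matches, the sketch below lists the matches with re.findall() and then tries one input where this simple pattern goes wrong. The names TAG_PATTERN, simple, tricky, tags, and leaked are illustrative only.

```python
import re

# The same pattern as in the example above, written as a raw string.
TAG_PATTERN = r'<[^<]+?>'

simple = '<h1>visit </h1><h2>learnshareit</h2>'

# Each match is one opening or closing tag.
tags = re.findall(TAG_PATTERN, simple)
print(tags)  # ['<h1>', '</h1>', '<h2>', '</h2>']

# A '>' inside an attribute value ends the match too early,
# so part of the attribute leaks into the result.
tricky = '<p title="a>b">text</p>'
leaked = re.sub(TAG_PATTERN, '', tricky)
print(leaked)  # b">text
```

The regular expression works well for simple, predictable markup. For input like the second string, an HTML parser is the safer choice.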

Use BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files.

Example:

  • Import BeautifulSoup.
  • Use get_text() to get all the text contained in the HTML page.

Note: you need to install the beautifulsoup4 package before running this example (for instance, with pip install beautifulsoup4). Otherwise, the import fails with an error like this:

Traceback (most recent call last):
  File "./prog.py", line 5, in <module>
ModuleNotFoundError: No module named 'bs4'

from bs4 import BeautifulSoup

html_string = '''
<body>
  <h1>visit </h1>
  <h2>learnshareit</h2>
  <h3>website</h3>
</body>
'''
soup = BeautifulSoup(html_string, features="html.parser")

# Use get_text() to get all the text contained in the HTML page
result_str = soup.get_text()
print(result_str)

Output:

visit 
learnshareit
website
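
If installing beautifulsoup4 is not an option, a similar result can be sketched with the standard library's html.parser module, the same parser named in the features argument above. This is an alternative technique, not part of BeautifulSoup, and the names TextExtractor and strip_tags are hypothetical helpers made up for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (illustrative sketch)."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


print(strip_tags('<h1>visit </h1><h2>learnshareit</h2>'))  # visit learnshareit
print(strip_tags('<p title="a>b">text &amp; more</p>'))    # text & more
```

Because the input goes through a parser, a > inside a quoted attribute value is handled correctly, and character references such as &amp; are decoded along the way.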

Summary

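This article covered two ways to strip the HTML tags from a string in Python: re.sub() with a tag-matching pattern for simple markup, and BeautifulSoup's get_text() when you want an HTML parser to do the work. If you have any questions, leave a comment below. I will answer your questions. Thank you for reading!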
