To strip the HTML tags from a string in Python, you can use the re.sub() function or the BeautifulSoup library. Follow this article to understand how.
Strip the HTML tags from a string in Python
What is an HTML tag? HTML tags mark the beginning and end of an element on a web page. An element usually has the following structure: <tagname>content</tagname>. Some common HTML tags are <!DOCTYPE>, <html>, <head>, <body>, …
Example:
<head> <title>Title of the web page</title> </head>
There are two ways to strip the HTML tags from a string in Python:
Use the re.sub() function
In this approach, I will show you how to strip the HTML tags from a string in Python using the re.sub() function.
A regular expression (RegEx) is a sequence of characters that defines a search pattern, which can match a single string or a whole set of strings.
In Python, regular expressions are provided by the 're' module, so you have to import it before using them. The 're' module has many functions for working with RegEx, and one of the most essential is re.sub().
The re.sub() function replaces every match of the pattern in the string with the given replacement and returns the modified string.
Syntax:
re.sub(pattern, repl, string, count=0)
Parameters:
- pattern: the regular expression to search for.
- repl: the replacement for each substring that matches the pattern.
- string: the string to search in.
- count: the maximum number of replacements. If omitted, it defaults to 0, which means every match is replaced.
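To make the effect of the count parameter concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and the pattern is the same one used later in this article.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# count omitted (defaults to 0): every match is replaced
all_removed = re.sub(r"<[^<]+?>", "", text)

# count=1: only the first match is replaced
first_removed = re.sub(r"<[^<]+?>", "", text, count=1)

print(all_removed)    # bold and italic
print(first_removed)  # bold</b> and <i>italic</i>
```

For stripping tags you normally want the default, so count can simply be left out.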
Example:
- Import the 're' module and create an HTML string.
- Use the re.sub() function to strip the HTML tags from the HTML string.
import re

html_string = '''
<body>
<h1>visit </h1>
<h2>learnshareit</h2>
<h3>website</h3>
</body>
'''

print("Original String: ", html_string)

# Use re.sub() to strip the HTML tags from the HTML string
result_str = re.sub(r'<[^<]+?>', '', html_string)
print("Result String: ", result_str)
Output:
Original String:
<body>
<h1>visit </h1>
<h2>learnshareit</h2>
<h3>website</h3>
</body>
Result String:
visit
learnshareit
website
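Note: <[^<]+?> is the pattern. It matches an opening '<', then one or more characters that are not '<' (as few as possible, because '+?' is lazy), up to the next '>'. Each match is one HTML tag, and re.sub() replaces it with an empty string.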
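To see exactly what this pattern treats as a tag, the following sketch lists the matches before removing them. The sample string and the name TAG_PATTERN are assumptions made for illustration only.

```python
import re
from html import unescape

# Hypothetical name for the compiled pattern from this article
TAG_PATTERN = re.compile(r"<[^<]+?>")

sample = '<p class="intro">Tom &amp; Jerry</p>'

# findall shows which substrings the pattern matches
tags = TAG_PATTERN.findall(sample)
print(tags)  # ['<p class="intro">', '</p>']

# Removing the tags leaves entities such as &amp; untouched
stripped = TAG_PATTERN.sub("", sample)
print(stripped)  # Tom &amp; Jerry

# html.unescape from the standard library decodes them
decoded = unescape(stripped)
print(decoded)  # Tom & Jerry
```

Keep in mind that the pattern works on plain text, not on parsed HTML. For instance, a '>' inside an attribute value ends the match early, so this approach is best suited to simple, well-formed markup.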
Use BeautifulSoup
BeautifulSoup is a Python library for extracting data from HTML and XML files.
Example:
- Import BeautifulSoup.
- Use get_text() to get all the text contained in the HTML page.
Note: you need to install the 'beautifulsoup4' package before using it (for example, with pip install beautifulsoup4), otherwise the program will raise an error:
Traceback (most recent call last):
  File "./prog.py", line 5, in <module>
ModuleNotFoundError: No module named 'bs4'
from bs4 import BeautifulSoup

html_string = '''
<body>
<h1>visit </h1>
<h2>learnshareit</h2>
<h3>website</h3>
</body>
'''

soup = BeautifulSoup(html_string, features="html.parser")

# Use get_text() to get all the text contained in the HTML page
result_str = soup.get_text()
print(result_str)
Output:
visit
learnshareit
website
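If installing a third-party package is not an option, Python's built-in html.parser module can do a similar job. This is a different technique from BeautifulSoup, shown here only as a rough sketch. The class name TagStripper and the function strip_tags are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper that collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # The parser calls this for each piece of text between tags
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html_string):
    stripper = TagStripper()
    stripper.feed(html_string)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


print(strip_tags("<h1>visit </h1><h2>learnshareit</h2>"))  # visit learnshareit
```

Because the text is collected by a real HTML parser, character references such as &amp; are decoded along the way, which the regular expression approach does not do on its own.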
Summary
If you have any questions about how to strip the HTML tags from a string in Python, leave a comment below. I will answer your questions. Thank you for reading!

My name is Jason Wilson, you can call me Jason. My major is information technology, and I am proficient in C++, Python, and Java. I hope my writings are useful to you while you study programming languages.
Name of the university: HHAU
Major: IT
Programming Languages: C++, Python, Java