Automatically clean messy HTML content

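A practical way to clean messy HTML automatically is to parse it, remove what you don't need, and serialize the result again. The sample Python script below uses BeautifulSoup (from the bs4 library) to remove unwanted tags, unwrap presentational tags, strip attributes, and output clean HTML. The parser also repairs some malformed markup, such as unclosed tags, where it can.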

```python
from bs4 import BeautifulSoup


def clean_html(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # List of tags to remove completely (tag and contents)
    tags_to_remove = ['script', 'style', 'meta', 'iframe', 'object', 'embed', 'link', 'noscript']
    for tag in tags_to_remove:
        for match in soup.find_all(tag):
            match.decompose()

    # Optionally unwrap tags like <span>, <font>, etc., but keep content
    tags_to_unwrap = ['span', 'font', 'b', 'i']
    for tag in tags_to_unwrap:
        for match in soup.find_all(tag):
            match.unwrap()

    # Remove all attributes like style, class, id
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Strip extra whitespace: trim each line and drop empty ones
    cleaned_html = str(soup)
    cleaned_html = '\n'.join(line.strip() for line in cleaned_html.splitlines() if line.strip())
    return cleaned_html


# Example usage
messy_html = """
<html><head><title>Test</title><style>body{color:red;}</style></head>
<body><h1 class="header" style="font-size:20px;">Welcome</h1>
<p id="para1">This is a <span style="color:blue;">test</span> paragraph.</p>
<script>alert("Hi")</script></body></html>
"""

cleaned = clean_html(messy_html)
print(cleaned)
```

What This Script Does:

  • Removes: unwanted tags (<script>, <style>, etc.) together with their contents.

  • Unwraps: tags like <span> while preserving their content.

  • Strips attributes: removes every attribute from the remaining tags. That covers style, class, and id, but also href, src, and alt.

  • Tidies whitespace: drops blank lines and trims leading and trailing whitespace on each line.
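
The difference between removing and unwrapping a tag is easier to see on a one-line input. The sketch below is only an illustration: the `snippet` string and the `result` variable are made up for this example, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Made-up input for illustration only.
snippet = '<p>Keep <span class="x">this text</span><script>alert("Hi")</script></p>'
soup = BeautifulSoup(snippet, "html.parser")

# decompose() deletes the tag together with everything inside it.
for tag in soup.find_all("script"):
    tag.decompose()

# unwrap() deletes only the tag itself; its children stay in place.
for tag in soup.find_all("span"):
    tag.unwrap()

# Clearing attrs leaves the element and its children untouched.
for tag in soup.find_all(True):
    tag.attrs = {}

result = str(soup)
print(result)  # <p>Keep this text</p>
```

The script and its alert are gone entirely, while the text that was inside the span survives in the paragraph.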

The same approach can be adapted to keep specific tags or attributes (e.g., for SEO or accessibility).
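
One possible way to keep selected attributes is an allowlist keyed by tag name. The following is a minimal sketch under stated assumptions, not a definitive implementation: the `ALLOWED_ATTRS` mapping, the `clean_html_keep_attrs` function name, and the sample markup are all hypothetical and should be adjusted to the content being cleaned.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist: attributes worth keeping, per tag name.
ALLOWED_ATTRS = {
    "a": {"href", "title"},   # links stay navigable
    "img": {"src", "alt"},    # alt text matters for accessibility
}


def clean_html_keep_attrs(html_content, allowed_attrs=ALLOWED_ATTRS):
    soup = BeautifulSoup(html_content, "html.parser")

    # Drop scripts and styles entirely, contents included.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Keep only allowlisted attributes; tags not in the mapping lose all of them.
    for tag in soup.find_all(True):
        keep = allowed_attrs.get(tag.name, set())
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}

    return str(soup)


# Made-up sample markup for illustration.
sample = (
    '<p class="intro"><a href="/docs" style="color:red">Docs</a> '
    '<img src="logo.png" alt="Logo" width="40"></p>'
)
cleaned_sample = clean_html_keep_attrs(sample)
print(cleaned_sample)
```

Here the link keeps its href and the image keeps src and alt, while class, style, and width are dropped. An allowlist is usually a safer default than a blocklist for untrusted input, because any attribute not explicitly listed is removed.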
