Automatically clean messy HTML content

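The Python script below uses BeautifulSoup (from the bs4 library) to clean messy HTML automatically. It removes unwanted tags, unwraps presentational tags, strips attributes, and trims whitespace. Because the markup is parsed into a tree and then re-serialized, some malformed structures, such as unclosed tags, are also repaired in the output: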

python
from bs4 import BeautifulSoup

def clean_html(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # List of tags to remove completely
    tags_to_remove = ['script', 'style', 'meta', 'iframe', 'object', 'embed', 'link', 'noscript']

    for tag in tags_to_remove:
        for match in soup.find_all(tag):
            match.decompose()

    # Optionally unwrap tags like <span>, <font>, etc., but keep content
    tags_to_unwrap = ['span', 'font', 'b', 'i']
    for tag in tags_to_unwrap:
        for match in soup.find_all(tag):
            match.unwrap()

    # Remove every attribute from every tag (style, class, id, and also href, src, alt)
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Strip leading/trailing whitespace from each line and drop blank lines
    cleaned_html = str(soup)
    cleaned_html = '\n'.join(line.strip() for line in cleaned_html.splitlines() if line.strip())

    return cleaned_html

# Example usage
messy_html = """
<html><head><title>Test</title><style>body{color:red;}</style></head>
<body><h1 class="header" style="font-size:20px;">Welcome</h1>
<p id="para1">This is a <span style="color:blue;">test</span> paragraph.</p>
<script>alert("Hi")</script></body></html>
"""

cleaned = clean_html(messy_html)
print(cleaned)

What This Script Does:

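Removes: unwanted tags (<script>, <style>, <iframe>, etc.) together with their contents.
Unwraps: <span>, <font>, <b>, and <i>, keeping their inner content.
Strips attributes: removes every attribute from the remaining tags, including style, class, and id, but also href, src, and alt.
Trims whitespace: strips leading and trailing whitespace from each line and drops blank lines.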

Let me know if you want a version that keeps specific tags or attributes (e.g., for SEO or accessibility).
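
One way such a version might look is sketched below. It swaps the blanket attribute wipe for an allowlist, so links and images keep the attributes they need. The names `clean_html_keeping`, `ALLOWED_ATTRS`, and `GLOBAL_ATTRS`, and the specific attributes chosen, are assumptions for illustration and not part of the original script; adjust them to your own content.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist (not from the original script): attributes kept per tag.
ALLOWED_ATTRS = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
}
# Hypothetical choice: attributes kept on every tag.
GLOBAL_ATTRS = {"lang"}

TAGS_TO_REMOVE = ["script", "style", "iframe", "object", "embed", "noscript"]
TAGS_TO_UNWRAP = ["span", "font"]


def clean_html_keeping(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    # Drop unwanted tags together with their contents
    for name in TAGS_TO_REMOVE:
        for match in soup.find_all(name):
            match.decompose()

    # Unwrap presentational tags but keep their text
    for name in TAGS_TO_UNWRAP:
        for match in soup.find_all(name):
            match.unwrap()

    # Keep only allowlisted attributes, plus any aria-* attribute
    for tag in soup.find_all(True):
        allowed = ALLOWED_ATTRS.get(tag.name, set()) | GLOBAL_ATTRS
        tag.attrs = {
            key: value
            for key, value in tag.attrs.items()
            if key in allowed or key.startswith("aria-")
        }

    return str(soup)


# Example usage
sample = (
    '<p id="intro" class="lead">Visit <a href="/home" onclick="track()" '
    'style="color:red">home</a> <img src="logo.png" alt="Logo" width="10">'
    '<script>alert("Hi")</script></p>'
)
print(clean_html_keeping(sample))
```

An allowlist is used here rather than a blocklist so that unexpected attributes (inline event handlers, for example) are dropped by default. Unlike the first script, this sketch keeps <b> and <i>, on the assumption that emphasis is worth preserving.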
