Build a newsletter cleanup script

Here is a Python script that performs a basic newsletter cleanup. It assumes you have a raw text file containing newsletter content with common issues such as extra whitespace, duplicated lines, unwanted promotional lines, or URLs followed by stray punctuation. The script cleans up the text by:

Removing duplicate lines and blank lines
Stripping leading and trailing whitespace from each line
Removing lines containing specific unwanted keywords (e.g., “unsubscribe,” “advertisement”)
Stripping trailing punctuation (periods, commas, semicolons, colons) from URLs

You can expand or customize this script depending on your exact newsletter content and cleanup needs.

python
import re

def load_newsletter(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.readlines()
    return content

def remove_duplicates(lines):
    seen = set()
    cleaned_lines = []
    for line in lines:
        stripped = line.strip()
        if stripped and stripped not in seen:
            seen.add(stripped)
            cleaned_lines.append(stripped)
    return cleaned_lines

def remove_unwanted_sections(lines, keywords):
    filtered = []
    for line in lines:
        if not any(keyword.lower() in line.lower() for keyword in keywords):
            filtered.append(line)
    return filtered

def fix_urls(lines):
    fixed_lines = []
    url_pattern = re.compile(r'https?://\S+')  # \S+ matches up to the next whitespace
    for line in lines:
        # Remove trailing punctuation from URLs
        def clean_url(match):
            url = match.group(0)
            return url.rstrip('.,;:')
        fixed_line = url_pattern.sub(clean_url, line)
        fixed_lines.append(fixed_line)
    return fixed_lines

def save_cleaned_newsletter(lines, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))
    print(f"Cleaned newsletter saved to {output_path}")

def main():
    input_file = 'newsletter_raw.txt'
    output_file = 'newsletter_cleaned.txt'
    unwanted_keywords = ['unsubscribe', 'advertisement', 'promo', 'sale', 'sponsored']

    lines = load_newsletter(input_file)
    lines = remove_duplicates(lines)
    lines = remove_unwanted_sections(lines, unwanted_keywords)
    lines = fix_urls(lines)

    save_cleaned_newsletter(lines, output_file)

if __name__ == '__main__':
    main()

How it works:

Place your raw newsletter content in newsletter_raw.txt.
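The script strips leading and trailing whitespace from each line, then drops blank lines and duplicate lines, keeping the first occurrence.
Lines containing words like “unsubscribe,” “advertisement,” etc., are removed. Matching is case-insensitive and substring-based, so “sale” also matches “wholesale.”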
URLs ending with punctuation marks like commas or periods are cleaned.
The cleaned content is saved to newsletter_cleaned.txt.
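
The steps above can be traced on a small in-memory sample without touching any files. This is a minimal sketch: the helper name clean_lines and the sample lines are assumptions made for illustration, and the helper condenses the script's three cleanup functions into one pass.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')

def clean_lines(lines, keywords):
    """Hypothetical helper: dedupe, filter by keyword, and tidy URLs in one pass."""
    seen = set()
    result = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and lines that have already appeared
        if not stripped or stripped in seen:
            continue
        seen.add(stripped)
        # Skip lines containing any unwanted keyword (case-insensitive substring match)
        if any(k.lower() in stripped.lower() for k in keywords):
            continue
        # Strip trailing punctuation from every URL in the line
        tidy = URL_PATTERN.sub(lambda m: m.group(0).rstrip('.,;:'), stripped)
        result.append(tidy)
    return result

sample = [
    "  Welcome to the June issue  \n",
    "Welcome to the June issue\n",
    "\n",
    "Read more at https://example.com/june.\n",
    "Click here to unsubscribe\n",
]
cleaned = clean_lines(sample, ['unsubscribe', 'advertisement'])
print(cleaned)
# ['Welcome to the June issue', 'Read more at https://example.com/june']
```

Note that the period after the URL is dropped even though it ended the sentence, because the pattern treats everything up to the next whitespace as part of the URL.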

You can customize the list unwanted_keywords or extend the cleanup logic to fit your exact needs.
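
One possible extension is whole-word keyword matching. Because the script checks whether a keyword appears anywhere in a line, a short keyword such as 'sale' also removes lines that mention 'wholesale'. The sketch below uses a word-boundary regular expression instead; the function name remove_lines_with_whole_words and the sample lines are hypothetical and not part of the original script.

```python
import re

def remove_lines_with_whole_words(lines, keywords):
    """Drop lines that contain any keyword as a whole word (case-insensitive)."""
    if not keywords:
        # With no keywords there is nothing to filter
        return list(lines)
    # One alternation pattern; re.escape guards keywords with special characters
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        re.IGNORECASE,
    )
    return [line for line in lines if not pattern.search(line)]

lines = [
    "Our wholesale partners grew this year",
    "Summer sale starts Friday",
    "Sponsored: try our new app",
]
kept = remove_lines_with_whole_words(lines, ['sale', 'sponsored'])
print(kept)
# ['Our wholesale partners grew this year']
```

The trade-off is that whole-word matching no longer catches variants such as 'sales' or 'promotion' unless they are listed explicitly.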
